aimode.news
Use NVIDIA GPU VRAM as swap space on Linux

Use your NVIDIA GPU's VRAM as swap space on Linux.

Designed for hybrid-graphics laptops with soldered memory and no upgrade path. The display runs on the integrated AMD/ATI GPU. The NVIDIA card sits idle most of the time, with its VRAM completely unused. This project puts that VRAM to work as high-priority swap.

Tested on: AMD/ATI + RTX 3070 laptop (GA104M, 16 GB RAM, 8 GB VRAM), driver 580.159.03, kernel 6.17, Pop!_OS. Allocated 7 GB for swap. Bottom line, including zram and SSD swap: approximately 46 GB of total addressable memory, nearly triple the stock 16 GB. Overflow order: RAM fills up, then VRAM absorbs the spill (over PCIe), then zram compresses the rest (on the CPU), then the SSD is used only if everything else is exhausted.

A small daemon allocates VRAM via the CUDA driver API, then serves it as a block device using the Network Block Device (NBD) protocol over a Unix socket. The kernel's built-in nbd driver connects to it and exposes /dev/nbdX. From there it is a normal swap device.

Data path: kernel swap subsystem -> /dev/nbdX -> nbd kernel driver -> Unix socket -> nbd-vram daemon -> cuMemcpyHtoD/DtoH -> GPU VRAM.
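To make the middle of that path concrete, here is a minimal sketch of how one NBD request could be served once the connection is in its transmission phase. It is not the project's code: the function name `handle_request` is hypothetical, a plain `bytearray` stands in for GPU memory, and the handshake that precedes this phase is omitted. The magic numbers, header layouts and command codes are the ones defined by the NBD protocol.

```python
import struct

NBD_REQUEST_MAGIC = 0x25609513
NBD_REPLY_MAGIC = 0x67446698      # simple (non-structured) reply
NBD_CMD_READ, NBD_CMD_WRITE = 0, 1
NBD_EINVAL = 22

# Request: magic, command flags, command type, cookie, offset, length (28 bytes)
REQUEST = struct.Struct(">IHHQQI")
# Simple reply: magic, error, cookie (16 bytes), followed by data for reads
REPLY = struct.Struct(">IIQ")


def handle_request(header: bytes, payload: bytes, store: bytearray) -> bytes:
    """Serve one request against `store` and return the bytes to send back."""
    magic, _flags, cmd, cookie, offset, length = REQUEST.unpack(header)
    if magic != NBD_REQUEST_MAGIC or offset + length > len(store):
        return REPLY.pack(NBD_REPLY_MAGIC, NBD_EINVAL, cookie)
    if cmd == NBD_CMD_READ:
        # A VRAM-backed daemon would do a device-to-host copy here.
        data = bytes(store[offset:offset + length])
        return REPLY.pack(NBD_REPLY_MAGIC, 0, cookie) + data
    if cmd == NBD_CMD_WRITE and len(payload) == length:
        # A VRAM-backed daemon would do a host-to-device copy here.
        store[offset:offset + length] = payload
        return REPLY.pack(NBD_REPLY_MAGIC, 0, cookie)
    return REPLY.pack(NBD_REPLY_MAGIC, NBD_EINVAL, cookie)
```

Each swapped page therefore costs one header parse, one copy, and one reply over the socket, which is where the per-request overhead of this design comes from.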

No kernel modules to write or maintain. No NVIDIA kernel symbols. Survives kernel and driver updates without rebuilding anything.

The "obvious" approach is nvidia_p2p_get_pages_persistent, which pins VRAM pages in BAR1 so that the CPU can access them directly via ioremap_wc. Every existing project that has attempted this route hits the same wall: the NVIDIA driver returns EINVAL on mainstream GeForce GPUs. This holds for the persistent and non-persistent variants and for both flag values. The call is gated at the RM level to Quadro/datacenter SKUs only, regardless of driver version.

The other approach - calling ioremap_wc directly on the BAR1 physical address without going through the P2P API - doesn't work either. The GPU's internal page tables map only about 16 MiB of BAR1 (just the display framebuffer). Reads from the remainder return zeros. mkswap seems to succeed, then swapon fails because the swap header is not actually there.
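The failure mode is easy to picture with a small sketch. mkswap writes the signature `SWAPSPACE2` into the last 10 bytes of the first page, and swapon looks for it there. The helper name `has_swap_signature` is hypothetical, and a 4096-byte page size is assumed.

```python
SWAP_SIGNATURE = b"SWAPSPACE2"


def has_swap_signature(first_page: bytes, page_size: int = 4096) -> bool:
    """True if the first page ends with the signature that mkswap writes."""
    if len(first_page) < page_size:
        return False
    start = page_size - len(SWAP_SIGNATURE)
    return first_page[start:page_size] == SWAP_SIGNATURE


# What mkswap intended to write versus what an unmapped region reads back.
written = bytes(4096 - len(SWAP_SIGNATURE)) + SWAP_SIGNATURE
read_back = bytes(4096)  # all zeros

print(has_swap_signature(written))    # True
print(has_swap_signature(read_back))  # False
```

When writes are silently dropped and reads return zeros, the second case is what swapon sees.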

The NBD approach avoids all of this. cuMemcpyHtoD and cuMemcpyDtoH work on any CUDA GPU without any special permissions.
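As an illustration only, the round trip can be tried from user space through `libcuda.so.1` with Python's `ctypes`. This is a sketch, not the daemon's code: the function name `vram_roundtrip` is hypothetical, it uses the device's primary context on GPU 0, and it returns `None` whenever the driver library, a GPU, or any call is unavailable.

```python
import ctypes


def vram_roundtrip(payload: bytes):
    """Copy payload host -> device -> host; return the bytes read back or None."""
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # no NVIDIA driver library on this machine
    dev, ctx, dptr = ctypes.c_int(), ctypes.c_void_p(), ctypes.c_uint64()
    size = ctypes.c_size_t(len(payload))
    if cuda.cuInit(0) != 0 or cuda.cuDeviceGet(ctypes.byref(dev), 0) != 0:
        return None
    if cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev) != 0:
        return None
    try:
        if cuda.cuCtxSetCurrent(ctx) != 0:
            return None
        if cuda.cuMemAlloc_v2(ctypes.byref(dptr), size) != 0:
            return None
        try:
            out = ctypes.create_string_buffer(len(payload))
            if cuda.cuMemcpyHtoD_v2(dptr, payload, size) != 0:
                return None
            if cuda.cuMemcpyDtoH_v2(out, dptr, size) != 0:
                return None
            return out.raw
        finally:
            cuda.cuMemFree_v2(dptr)
    finally:
        cuda.cuDevicePrimaryCtxRelease(dev)
```

Calling `vram_roundtrip(b"x" * 4096)` as a regular user should return the same 4096 bytes on a machine with a working driver.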

- NVIDIA GPU with CUDA support (any consumer RTX/GTX card)

- NVIDIA driver with libcuda.so.1 (no CUDA toolkit needed)

- Linux kernel 3.0+ (nbd module, built into most distributions)

- nbd-client package

- gcc, make

git clone https://github.com/c0dejedi/nbd-vram

cd nbd-vram

sudo ./install.sh

sudo systemctl start vram-swap-nbd

Check:

swapon --show

# NAME TYPE SIZE USED PRIO

# /dev/nbd0 partition 7G 0B 1500
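The same information is available in /proc/swaps, and sorting it by priority shows the order in which the kernel will fill the devices. The sketch below parses that format. The function name `swap_fill_order` is hypothetical, and the sample rows (device names other than /dev/nbd0, all sizes, and the lower priorities) are invented for illustration.

```python
def swap_fill_order(proc_swaps: str):
    """Return (name, size_kib, priority) tuples, highest priority first."""
    rows = []
    for line in proc_swaps.strip().splitlines()[1:]:  # skip the header row
        name, _type, size, _used, prio = line.split()
        rows.append((name, int(size), int(prio)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical /proc/swaps contents (columns: Filename Type Size Used Priority)
SAMPLE = """\
Filename        Type        Size        Used    Priority
/dev/dm-1       partition   4194300     0       -2
/dev/nbd0       partition   7340028     0       1500
/dev/zram0      partition   16777212    0       1000
"""

for name, size_kib, prio in swap_fill_order(SAMPLE):
    print(f"{name:12} {size_kib / 1024 / 1024:6.1f} GiB  prio {prio}")
```

On a live system, pass `open("/proc/swaps").read()` instead of `SAMPLE`.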

The service is enabled during installation, so it comes up automatically on every boot.

Edit /etc/systemd/system/vram-swap-nbd.service:

# how much VRAM to use
Environment=VRAM_SETUP_SIZE_MB=7168

# swap priority (higher = used first)
Environment=VRAM_SWAP_PRIORITY=1500

The daemon first tries the requested size and steps down in 512 MiB increments if the GPU runs out of memory, so it grabs as much as it can even if the display compositor is already loaded. VRAM_SETUP_SIZE_MB is the cap, not a strict requirement.
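The step-down behaviour can be sketched as a short loop. This is an illustration of the logic described above, not the daemon's code: `allocate_with_fallback` and the `try_alloc` callback are hypothetical names, and the callback stands in for the real VRAM allocation call.

```python
from typing import Callable, Optional

STEP_MB = 512  # step size stated in the text


def allocate_with_fallback(requested_mb: int,
                           try_alloc: Callable[[int], bool]) -> Optional[int]:
    """Return the largest size in MiB that try_alloc accepts, or None."""
    size = requested_mb
    while size > 0:
        if try_alloc(size):
            return size
        size -= STEP_MB  # allocation failed: retry with 512 MiB less
    return None


# Pretend only 6000 MiB of VRAM is free when the daemon starts.
granted = allocate_with_fallback(7168, lambda mb: mb <= 6000)
print(granted)  # 5632
```

With a 7168 MiB cap and 6000 MiB free, the attempts are 7168, 6656, 6144 and finally 5632, which succeeds.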

After editing, run sudo systemctl daemon-reload && sudo systemctl restart vram-swap-nbd.

The installer asks whether to enable power-aware management during the first installation. If enabled, the service automatically stops when you unplug AC power (or when the battery falls below a threshold) and restarts when power is restored. A manual systemctl stop is always respected and will not be overridden.

To change settings after installation, edit /etc/nbd-vram.conf. Changes take effect on the next poll (within 60 seconds) or immediately on the next AC plug/unplug event.
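A rough sketch of the kind of decision such a power-aware poll has to make is shown below. It is not the project's implementation: `on_ac_power`, `should_run` and their parameters are hypothetical names, and the actual keys in /etc/nbd-vram.conf are not reproduced here. The sysfs layout (a `type` file containing `Mains` and an `online` file containing `1`) is the standard Linux power-supply interface, although some USB-C chargers report a different type.

```python
from pathlib import Path


def on_ac_power(root: str = "/sys/class/power_supply") -> bool:
    """True if any power supply of type 'Mains' reports online == 1."""
    for supply in Path(root).glob("*"):
        try:
            kind = (supply / "type").read_text().strip()
            online = (supply / "online").read_text().strip()
        except OSError:
            continue  # this supply has no 'online' attribute (e.g. a battery)
        if kind == "Mains" and online == "1":
            return True
    return False


def should_run(on_ac: bool, battery_pct: int, threshold_pct: int,
               manually_stopped: bool) -> bool:
    """Decide whether the swap service should be running right now."""
    if manually_stopped:
        return False  # a manual stop always wins
    return on_ac or battery_pct >= threshold_pct
```

A poller would call `should_run(on_ac_power(), ...)` once per interval and start or stop the service when the answer changes.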

sudo bash test-nbd.sh

Allocates VRAM, connects the NBD device, performs a 1 MiB write/read check, enables swap, then prints teardown instructions. install.sh automatically handles teardown if a test instance is running.
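The write/read check is conceptually simple. The sketch below shows one way to do it, assuming a hypothetical helper `write_read_check`; it is not taken from test-nbd.sh. It does not bypass the page cache, so on a real block device the read may be served from RAM rather than from the device, and it overwrites the first 1 MiB of whatever path it is given.

```python
import os


def write_read_check(path: str, size: int = 1 << 20) -> bool:
    """Write `size` random bytes at offset 0 of `path`, read back, compare."""
    pattern = os.urandom(size)
    fd = os.open(path, os.O_RDWR)
    try:
        if os.pwrite(fd, pattern, 0) != size:
            return False  # short write
        os.fsync(fd)
        return os.pread(fd, size, 0) == pattern
    finally:
        os.close(fd)
```

Pointing it at a scratch file is a safe way to try it; pointing it at an active swap device is not.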

To stress the entire partition after passing the smoke test:

sudo bash test-fill.sh

Fills the entire VRAM partition with zeros, verifies a sample by reading it back, then automatically restores swap on exit.

Tested on RTX 3070 laptop (8 GB VRAM), kernel 6.17, Pop!_OS. Compared to NVMe cryptswap (dm-crypt, PCIe 4.0). All tests run with O_DIRECT to bypass the page cache.

Three benchmarks are in benchmarks/. Each runs NVMe first, then starts the VRAM service and runs the same test on the block device. The state is restored upon exit.

sudo bash benchmarks/bench-throughput.sh # sequential read/write (dd, 2 GiB, O_DIRECT)

sudo bash benchmarks/bench-iops.sh # 4K random IOPS (fio, libaio, iodepth=32)

sudo bash benchmarks/bench-latency.sh # latency per operation (ioping, 20 requests)

fio and ioping are installed automatically if they are missing.

| Device | Write | Read |
|---|---|---|
| NVMe | 2.7 GB/s | 2.9 GB/s |
| VRAM (nbd) | 1.1 GB/s | 2.3 GB/s |

VRAM is slower for large sequential transfers. The bottleneck is the NBD + CUDA userspace round trip: each block passes through a Unix socket and a cuMemcpy call, which adds overhead that NVMe's direct kernel block path does not pay. Sequential throughput is not the primary swap workload (the kernel swaps individual 4 KB pages, not 4 MiB streams) - see the IOPS and latency tests below.

| Device | Read IOPS | Write IOPS | Average latency |
|---|---|---|---|
| NVMe | 45.4k | 45.3k | 343 us |
| VRAM (nbd) | 28.7k | 28.7k | 550 us |

NVMe wins for sustained random I/O. At iodepth=32, NVMe can have 32 requests truly in flight simultaneously; the NBD+CUDA path serializes them through the daemon, so much of the queue-depth advantage is lost. The VRAM daemon also adds CPU overhead that the NVMe path does not pay. For sustained, heavy swap pressure, NVMe is faster.
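The IOPS and latency columns can be cross-checked with Little's law (requests in flight = throughput x latency). The sketch below assumes that the read and write streams ran concurrently and shared the queue depth of 32, which the source does not state explicitly; the function name is hypothetical.

```python
def littles_law_latency_us(queue_depth: int, total_iops: float) -> float:
    """Average latency implied by a given queue depth and total IOPS."""
    return queue_depth / total_iops * 1e6


nvme_us = littles_law_latency_us(32, 45_400 + 45_300)
vram_us = littles_law_latency_us(32, 28_700 + 28_700)

print(f"NVMe: {nvme_us:.0f} us implied vs 343 us measured")
print(f"VRAM: {vram_us:.0f} us implied vs 550 us measured")
```

Under that assumption both implied values land within a few percent of the measured averages, so the table is internally consistent.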

The picture changes for sporadic access - see latency test below.

| Device | Min | Avg | Max |
|---|---|---|---|
| NVMe | 120 us | 9.05 ms | 10.1 ms |
| VRAM (nbd) | 134 us | 335 us | 490 us |

VRAM's average latency is 27 times lower. The NVMe drive is physically capable of ~112 us (visible on the warm-up request), but APST (Autonomous Power State Transitions) puts it to sleep between requests. At 1 request per second - the sporadic swap access rate - it wakes up cold almost every time and pays a penalty of around 9 ms. VRAM has no power states and responds consistently between 134 and 490 us.

This is the scenario that matters most in practice. Memory pressure on a laptop rarely results in a sustained flood of GB/s: it is individual 4K page faults arriving every few seconds. Each of these faults stalls while waiting for the swap device to respond. At 9 ms per fault, NVMe swap is felt. At 335 us, VRAM swap is not.
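The arithmetic behind those figures is short enough to spell out. The averages come from the latency table; the fault count of 100 is a made-up number chosen only to show how the per-fault difference adds up.

```python
NVME_AVG_US = 9050.0  # 9.05 ms average from the latency table
VRAM_AVG_US = 335.0   # 335 us average from the latency table

ratio = NVME_AVG_US / VRAM_AVG_US


def total_stall_ms(faults: int, avg_latency_us: float) -> float:
    """Cumulative time spent waiting on swap for `faults` page faults."""
    return faults * avg_latency_us / 1000.0


print(f"latency ratio: {ratio:.1f}x")
print(f"100 faults on NVMe: {total_stall_ms(100, NVME_AVG_US):.1f} ms")
print(f"100 faults on VRAM: {total_stall_ms(100, VRAM_AVG_US):.1f} ms")
```

Over 100 sporadic faults that is roughly 0.9 s of stalls on NVMe against about 34 ms on VRAM.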

sudo bash uninstall.sh

MIT - Sean Lobjoit (c0dejedi)
