Teknoman117
today at 2:28 AM
Gen 4.0 x16 is 32 GB/s in each direction, but the way this is implemented is not the way you'd go about this if you wanted high performance.
Edit: Their benchmarks are also run using ZRAM, which compresses pages before writing to swap. Not sure what the performance overhead of that is, but it's probably quite a bit.
First of all, it's a userspace program hooking the nbd driver, which is known for being slow. It also uses a bounce buffer in userspace before transferring to the GPU. So when the kernel needs to swap a page, it has to first copy it into a userspace facing buffer. The userspace program that has to wake back up and issue the cuda operation to copy the page into device memory.
nbd also doesn't really do a good job of supporting high queue depth or merging adjacent accesses. So if the kernel is issuing a bunch of 4K page swaps without any coalescing, you're going to end up with at least million kernel/userspace context switches per second just to handle 4 GB/s (4 GB / 4K page), let alone 64 GB/s. And that's just the NBD portion, forget the mess that is the NVIDIA driver. PCIe can move a lot of data, but in order to get anything even resembling the full bandwidth, you have to have use DMA engines with long page lists. Having to set up a transfer for every 4K page over PCIe will not reach full saturation of the bus.
Swapping to NVMe is a very optimized path -> the swapper can submit lists of pages directly to the NVMe driver and the controller can DMA them directly out of RAM, no copies or context switches CPU side at all.
This could probably be improved by migrating to the ublk driver as it might let you avoid the userspace bounce buffer. It'd also be able to have multiple write queues to at least set up CUDA copies in parallel.
yup. it's nbd and userspace making it slow. zram on the other hand adds little.
one can get rid of zram and just reimplement some compression in shaders but I think that would be a pointless optimization.