Virtualizing NVIDIA HGX B200 GPUs with Open Source
Key topics
Virtualizing Nvidia's powerful HGX B200 GPUs with open-source tools just got a whole lot more interesting, thanks to a recent blog post that dives into the nitty-gritty of preserving NVLink bandwidth while maintaining isolation. The author, ben_s, reveals that the biggest challenge was virtualizing GPUs with NVLink, and commenters jump in to discuss the nuances of SR-IOV, MIG, and vGPU technologies. While some debate the feasibility of passing MIG devices into KVM VMs, others share insights on Nvidia's hardware capabilities, including SR-IOV support on Ampere and higher architectures. The conversation highlights the complexities of GPU virtualization and the trade-offs between performance, isolation, and granularity.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 31m after posting
- Peak period: 28 comments in 0-12h
- Avg / period: 10.3
Based on 31 loaded comments
Key moments
- Story posted: Dec 18, 2025 at 9:04 AM EST (15 days ago)
- First comment: Dec 18, 2025 at 9:34 AM EST (31m after posting)
- Peak activity: 28 comments in 0-12h (hottest window of the conversation)
- Latest activity: Dec 24, 2025 at 2:36 AM EST (9 days ago)
For me, the hardest part was virtualizing GPUs with NVLink in the mix. It complicates isolation while trying to preserve performance.
AMA if you want to dig into any of the details.
This was a long time ago, so I honestly have no idea if any of this is true.
But I thought MIG did do the job of chopping a GPU that's too big for most individual users into something that behaves very close to a literal array of smaller GPUs stuffed into the same PCIe card form factor? Think of how a Tesla K80 was pretty much just two GK210 "GPUs" on a PLX "PCIe switch" that connects them to each other and to the host. It was obviously trivial to give one to each of two VMs, at least if the PLX didn't interfere with IOMMU separation; for mere performance isolation it certainly sufficed, once you blocked a heavy user from power-budget throttling its sibling.
https://kubevirt.io/user-guide/compute/host-devices/
Still needs some host software to drive it, but it actually does static partitioning.
IIRC it's usable via the MIG-backed vGPU types.
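For reference, NVIDIA vGPU types have historically been exposed to the host through the Linux mediated-device (mdev) interface, and that is one way a MIG-backed profile ends up as something QEMU or KubeVirt can consume. Below is a minimal sketch of that generic sysfs interface only; the PCI address and type id are placeholders, the NVIDIA vGPU host driver must already be loaded, and newer driver stacks may expose the types on an SR-IOV virtual function instead.

```python
#!/usr/bin/env python3
"""List the mediated-device (mdev) types a GPU function exposes and create
one instance. Sketch only: the PCI address is a placeholder and the vendor
host driver must already be loaded."""
import uuid
from pathlib import Path

PCI_ADDR = "0000:41:00.4"  # hypothetical PCI address of the GPU (or one of its VFs)
MDEV_ROOT = Path(f"/sys/bus/pci/devices/{PCI_ADDR}/mdev_supported_types")

def list_types() -> None:
    for type_dir in sorted(MDEV_ROOT.iterdir()):
        name_node = type_dir / "name"
        name = name_node.read_text().strip() if name_node.exists() else type_dir.name
        avail = (type_dir / "available_instances").read_text().strip()
        print(f"{type_dir.name}: {name} (available instances: {avail})")

def create_instance(type_id: str) -> str:
    """Writing a fresh UUID to the type's 'create' node instantiates the mdev,
    which can then be handed to QEMU or KubeVirt as a vfio-mdev device."""
    dev_uuid = str(uuid.uuid4())
    (MDEV_ROOT / type_id / "create").write_text(dev_uuid)
    return dev_uuid

if __name__ == "__main__":
    list_types()
    # e.g. create_instance("nvidia-699")  # type id is a placeholder
```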
Also, how strong are the security boundaries among multiple tenants when configured in this way? I know, for example, that AWS is extremely careful about how hardware resources are shared across tenants of a physical host.
On isolation: in Shared NVSwitch Multitenancy mode, isolation is enforced at multiple layers. Fabric Manager programs the NVSwitch routing tables so GPUs in different partitions cannot exchange NVLink traffic, and each VM receives exclusive ownership of its assigned GPUs via VFIO passthrough. Large providers apply additional hardening and operational controls beyond what we describe here. We're not claiming this is equivalent to AWS's internal threat model, but it does rely on NVIDIA's documented isolation mechanisms.
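The "exclusive ownership via VFIO passthrough" part rests on a standard Linux mechanism: unbind the GPU's PCI function from its current driver and hand it to vfio-pci. A minimal sketch of that rebind, assuming a placeholder PCI address (not one from the article), root privileges, and the vfio-pci module already loaded:

```python
#!/usr/bin/env python3
"""Rebind a GPU's PCI function from its current driver to vfio-pci so it can
be passed through to a VM. Minimal sketch; the PCI address is a placeholder
and every device in the same IOMMU group must be treated the same way."""
from pathlib import Path

PCI_ADDR = "0000:17:00.0"  # hypothetical GPU PCI function
DEV = Path(f"/sys/bus/pci/devices/{PCI_ADDR}")

def rebind_to_vfio(dev: Path) -> None:
    # Tell the PCI core that only vfio-pci should claim this device.
    (dev / "driver_override").write_text("vfio-pci\n")
    # Detach the currently bound driver (e.g. nvidia), if any.
    if (dev / "driver").exists():
        (dev / "driver" / "unbind").write_text(dev.name)
    # Re-probe the device; vfio-pci now wins because of the override.
    Path("/sys/bus/pci/drivers_probe").write_text(dev.name)

def iommu_group(dev: Path) -> str:
    # The group number is the last component of the iommu_group symlink target.
    return (dev / "iommu_group").resolve().name

if __name__ == "__main__":
    rebind_to_vfio(DEV)
    group = iommu_group(DEV)
    print(f"{PCI_ADDR} is in IOMMU group {group}; pass /dev/vfio/{group} to the VM")
```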
After skimming the article, I noticed that a large chunk of it (specifically the bits on detaching/attaching drivers, QEMU, and VFIO) applies more or less to general GPU virtualization under Linux too!
Like it says something about mmapping 256 GB per GPU. But wouldn't that waste 2 TB of RAM? Or do I misunderstand what "mmap" is as well?
EDIT: yes, it seems my understanding of mmap wasn't good; it consumes not RAM but address space.
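A small standalone demo of that distinction (not code from the article): an anonymous mmap of a huge region only reserves virtual address space, and physical pages are allocated when they are touched. Under strict overcommit settings the kernel may refuse a reservation this large, in which case a smaller SIZE illustrates the same point.

```python
#!/usr/bin/env python3
"""Show that mmap reserves virtual address space, not RAM: map an anonymous
256 GiB region and watch resident memory barely move until a page is touched."""
import mmap
import resource

SIZE = 256 * 1024**3  # 256 GiB, mirroring the per-GPU figure in the thread

def rss_kib() -> int:
    # ru_maxrss is reported in KiB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = rss_kib()
region = mmap.mmap(-1, SIZE)        # anonymous, demand-paged mapping
after_map = rss_kib()
region[0:4096] = b"\x01" * 4096     # touching a page is what allocates RAM
after_touch = rss_kib()

print(f"RSS before map:  {before} KiB")
print(f"RSS after map:   {after_map} KiB")
print(f"RSS after touch: {after_touch} KiB")
region.close()
```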
This term can be used at a couple of different points (including mappings from physical addresses to physical hardware in the memory network), but a PCI BAR is a register in the configuration space that tells the card which PCI host addresses map to internal memory regions on the card. One BAR per region.
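You can see those regions and their sizes without touching config space directly: the kernel lists each BAR's start, end, and flags in the device's sysfs resource file. A small sketch (the PCI address is a placeholder, not a value from the article):

```python
#!/usr/bin/env python3
"""Print the size of each standard PCI BAR for one device by parsing its
sysfs 'resource' file, whose first six lines are 'start end flags' for
BAR0-BAR5."""
from pathlib import Path

PCI_ADDR = "0000:17:00.0"  # hypothetical GPU PCI function

def print_bar_sizes(pci_addr: str) -> None:
    lines = Path(f"/sys/bus/pci/devices/{pci_addr}/resource").read_text().splitlines()
    for index, line in enumerate(lines[:6]):   # only the six standard BARs
        start, end, flags = (int(field, 16) for field in line.split())
        if end == 0:                           # unimplemented BAR
            continue
        size_gib = (end - start + 1) / 2**30
        print(f"BAR{index}: {size_gib:.2f} GiB (flags {flags:#x})")

if __name__ == "__main__":
    print_bar_sizes(PCI_ADDR)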
The Debian package rocm-qemu-support ships scripts that facilitate most of this. I've since generalized this by adding NVIDIA support, but I haven't uploaded the new gpuisol-qemu package [2] to the official Archive yet. It still needs some polishing.
Just dumping this here, along with some more references (especially the further reading section; the Gentoo and Arch wikis had a lot of helpful data).
[1]: https://salsa.debian.org/rocm-team/community/team-project/-/...
[2]: https://salsa.debian.org/ckk/gpu-isolation-tools
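Once a GPU is bound to vfio-pci, the remaining step is a QEMU invocation with a vfio-pci device. The sketch below is only a compressed illustration of that kind of command line, not the Debian scripts themselves; the PCI address, disk image, and sizes are placeholders.

```python
#!/usr/bin/env python3
"""Launch a KVM guest with a vfio-pci passthrough GPU."""
import subprocess

GPU_ADDR = "0000:17:00.0"             # hypothetical GPU already bound to vfio-pci
DISK = "/var/lib/images/guest.qcow2"  # hypothetical guest disk image

cmd = [
    "qemu-system-x86_64",
    "-enable-kvm",
    "-machine", "q35",
    "-cpu", "host",
    "-smp", "16",
    "-m", "64G",
    # Pass the GPU through; all devices in its IOMMU group must be assigned
    # to the guest or left unbound on the host.
    "-device", f"vfio-pci,host={GPU_ADDR}",
    "-drive", f"file={DISK},if=virtio,format=qcow2",
    "-nographic",
]

subprocess.run(cmd, check=True)
```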
https://github.com/amd/MxGPU-Virtualization/issues/6
https://github.com/amd/MxGPU-Virtualization/issues/16
Our Navi 21 would almost always go AWOL after a test run had been completed, requiring a full reboot. At some point, I noticed that this only happened when our test runner was driving the test; I never had an issue when testing interactively. I eventually realized that our test driver was simply killing the VM when the test was done, which is fine for a CPU-based test, but this messed with the GPU's state. When working interactively, I was always shutting down the host cleanly, which apparently resolved this. A patch to our test runner to cleanly shut down VMs fixed this.
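One way to implement that "shut down cleanly instead of killing the process" fix is to ask QEMU for an ACPI power-button press over its QMP socket. A minimal sketch, assuming the VM was started with a QMP socket such as '-qmp unix:/run/test-vm.qmp,server,nowait' (path is a placeholder):

```python
#!/usr/bin/env python3
"""Ask a QEMU guest to power down cleanly over its QMP socket instead of
killing the process."""
import json
import socket

QMP_SOCKET = "/run/test-vm.qmp"  # hypothetical socket path

def qmp_command(sock: socket.socket, command: str) -> None:
    sock.sendall(json.dumps({"execute": command}).encode() + b"\n")
    sock.recv(4096)  # read (and here ignore) the reply

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
    sock.connect(QMP_SOCKET)
    sock.recv(4096)                        # QMP greeting banner
    qmp_command(sock, "qmp_capabilities")  # leave capabilities negotiation mode
    qmp_command(sock, "system_powerdown")  # ACPI power button: guest shuts down cleanly
```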
And I've had no luck with iGPUs, as referenced by the second issue.
From what I understand, I don't think that consumer AMD GPUs can/will ever be fully supported, because the GPU reset mechanisms of older cards are so complex. That's why things like vendor-reset [3] exist, which apparently duplicate a lot of the in-kernel driver code but ultimately only twiddle some bits.
[3]: https://github.com/gnif/vendor-reset
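For context on what the stock kernel offers here (this is not what vendor-reset does internally): the PCI core exposes a generic per-function reset hook in sysfs, and it is exactly this generic path that is often insufficient on consumer AMD GPUs. A small sketch, with a placeholder PCI address:

```python
#!/usr/bin/env python3
"""Trigger the kernel's generic PCI function reset for a device via sysfs."""
from pathlib import Path

PCI_ADDR = "0000:0b:00.0"  # hypothetical Navi 21 PCI function
DEV = Path(f"/sys/bus/pci/devices/{PCI_ADDR}")

# On recent kernels 'reset_method' lists the mechanisms the core will try.
methods = DEV / "reset_method"
if methods.exists():
    print("available reset methods:", methods.read_text().strip())

reset_node = DEV / "reset"
if reset_node.exists():
    reset_node.write_text("1\n")  # synchronous; returns once the reset completes
    print(f"reset issued for {PCI_ADDR}")
else:
    print(f"{PCI_ADDR} exposes no generic reset hook")
```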