The Race to Build a Distributed GPU Runtime
Posted 4 months ago · Active 4 months ago
voltrondata.com · Tech · story
calm · mixed
Debate: 60/100
Key topics
GPU Computing
Distributed Systems
Data Processing
The article covers the development of distributed GPU runtimes; the discussion revolves around the challenges, potential solutions, and implications of such technology.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 3d after posting
Peak period: 45 comments in 72-84h
Avg / period: 19
Comment distribution: 57 data points
Based on 57 loaded comments
Key moments
- 01 Story posted: Sep 4, 2025 at 4:18 PM EDT (4 months ago)
- 02 First comment: Sep 7, 2025 at 4:42 PM EDT (3d after posting)
- 03 Peak activity: 45 comments in 72-84h (hottest window of the conversation)
- 04 Latest activity: Sep 9, 2025 at 2:50 AM EDT (4 months ago)
ID: 45131784 · Type: story · Last synced: 11/20/2025, 4:50:34 PM
Currently the fastest way to get data from node A to node B is to RDMA it, which means that any node can inject anything into your memory space.
I'm not really sure how Theseus guards against that.
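For concreteness, the surface a remote peer actually gets is whatever memory you register with remote-access flags and whose rkey you hand over. A rough sketch of that registration step using the pyverbs bindings from rdma-core (the NIC name is a placeholder and the exact signatures are from memory, so treat it as illustrative only):

```python
# Illustrative only: registering a buffer for remote RDMA writes with pyverbs
# (rdma-core's Python bindings). Device name is a placeholder; signatures may
# differ slightly between rdma-core versions.
import pyverbs.enums as e
from pyverbs.device import Context
from pyverbs.pd import PD
from pyverbs.mr import MR

with Context(name="mlx5_0") as ctx:        # open an RDMA-capable NIC
    with PD(ctx) as pd:                    # protection domain scoping the registration
        # Only this 4 KiB region is remotely writable, and only by a peer that
        # has been handed mr.rkey (plus the buffer address) out of band.
        flags = e.IBV_ACCESS_LOCAL_WRITE | e.IBV_ACCESS_REMOTE_WRITE
        with MR(pd, 4096, flags) as mr:
            print("rkey to share with the remote side:", mr.rkey)
```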
Back in grad school I remember we did something fairly simple but clearly illegal and wedged the machine so badly that the out-of-band management also went down!
Now that's living the dream of a shared cluster!
This is hazy now, but I do remember a massive outage of a Lustre cluster, which I think was because a dodgy node was injecting crap into everyone's memory space via the old Lustre fast-filesystem kernel driver. I think they switched to NFS export nodes after that (for the render farm and desktops, at least).
> There’s a strong argument to be made that RAPIDS cuDF/RAPIDS libcudf drives NVIDIA’s CUDA-X Data Processing stack, from ETL (NVTabular) and SQL (BlazingSQL) to MLOps/security (Morpheus) and Spark acceleration (cuDF-Java).
Yeah, this does seem like the core indeed: libcudf.
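For anyone who hasn't used it: cuDF is the Python layer over libcudf and largely mirrors the pandas API. A tiny, hypothetical ETL snippet (file name and column names are invented; assumes a RAPIDS install and a CUDA GPU):

```python
# Hypothetical ETL on the GPU via cuDF; the parquet file and columns are made up.
import cudf

df = cudf.read_parquet("events.parquet")       # decode happens on the GPU
daily = (
    df[df["status"] == "ok"]                   # GPU-side filter
      .groupby("day")                          # hash groupby on the GPU
      .agg({"latency_ms": "mean", "bytes": "sum"})
)
print(daily.head())
```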
The focus here is on TCP & GPUDirect (Nvidia's PCIe peer-to-peer, which lets, for example, an RDMA transfer happen across a full GPU -> NIC -> switch -> NIC -> GPU path without any CPU involvement).
Personally it feels super dangerous to just trust Nvidia on all of this, to just buy the solution that's available. PyTorch sees this somewhat: it adopted & took over Facebook/Meta's gloo project, which wraps a lot of the RDMA efforts. But man, Theseus is so many steps ahead here in figuring out & planning what to do with these capabilities, these ultra-efficient links, and figuring out how not to need them when possible! The coordination problems keep growing in computing. I think of RISC-V with its vector-based alternative to conventional x86 SIMD, going from a specific instruction for each particular operation to instructions parameterized over different data lengths & types. https://github.com/pytorch/gloo
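For reference, here's roughly what the gloo layer looks like from user code via torch.distributed; a minimal sketch that assumes the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables are set by whatever launcher you use:

```python
# Minimal sketch: an all-reduce over the Gloo backend that PyTorch adopted from Meta.
# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set by the launcher.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")        # env:// rendezvous by default
    t = torch.ones(4) * (dist.get_rank() + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)       # Gloo handles the transport underneath
    print(f"rank {dist.get_rank()}: {t.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```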
I'd really like to see a concerted effort around Ultra Ethernet emerge, fast. Hardware isn't really available yet, and it's going to start out being absurdly expensive. But Ultra Ethernet looks like a lovely mix of collision-less, credit-based InfiniBand RDMA and Ethernet, with lots of other niceties (transport security). Deployments are just starting (AMD Pensando Pollara 400 at Oracle). I worry that without broader availability & interest, without mass saturation, AI is going to stay stuck on libcudf forever; getting hardware out there & getting software stacks using it is a chicken & egg problem, and big players need to spend real effort accelerating UET or else. https://www.tomshardware.com/networking/amd-deploys-its-firs...
But this article in particular is emphasizing extreme performance ambitions for columnar data processing with hardware acceleration. That's relevant to many ML training scenarios, but also to other kinds of massive MapReduce-style (or at least MapReduce-scale) workloads. There are lots of applications for a "magic massive petabyte-plus DataFrame" (which I don't think is solved in the general case).
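The closest existing attempt at that kind of scale-out DataFrame is probably dask-cudf, which partitions a dataset across GPUs/nodes and runs cuDF per partition; a sketch with an invented path and columns:

```python
# Sketch only: dask-cudf shards the dataset and runs cuDF on each partition.
# The path and column names are invented for illustration.
import dask_cudf

ddf = dask_cudf.read_parquet("/data/events/*.parquet")
top = (
    ddf.groupby("customer_id")["bytes"].sum()
       .nlargest(10)
       .compute()           # the distributed shuffle actually happens here
)
print(top)
```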
How about ten billion chickens?
Is it that hard to do, or is the software lock-in so great?
I mean efforts like rust-gpu: https://github.com/Rust-GPU/rust-gpu/
Combine such a language with Vulkan (using Rust as well) and why would you need CUDA?
Probably just needs a couple short decades of refinement…
Hence they bought PGI, and improved their compiler.
Intel eventually did the same with oneAPI (which isn't plain OpenCL, but rather an extension with Intel goodies).
I was on a Khronos webinar where the panel showed disbelief that anyone would care about Fortran. Oh well.
Which in the end was worthless because both Intel and AMD botched all OpenCL 2.x efforts.
Hence OpenCL 3.0 is basically OpenCL 1.0 rebranded, and SYCL went its own way.
It took a commercial company, Codeplay, a former compiler vendor for games consoles, to actually come up with good tooling for SYCL.
Which Intel, in the middle of extending SYCL with their Data Parallel C++, eventually acquired.
Those products are the foundation of oneAPI, and they naturally go beyond what barebones OpenCL happens to be.
Khronos's mismanagement of OpenCL is one of the reasons Apple cut ties with Khronos.
Besides, I'd say Rust is a nicer language than CUDA dialects.
Khronos standards, CUDA, ROCm, oneAPI, Metal: none of them has Rust in its sights.
The world did not back OpenCL because it was stuck on primitive C99 text-based tooling, without an ecosystem.
Also Google decided to push their Renderscript C99 dialect instead, while Intel and AMD were busy delivering janky tools and broken drivers.
Also, SPIR worked so great for OpenCL 2.x that Khronos rebooted the whole mess back to OpenCL 1.x with the OpenCL 3.0 rebranding.
Cross compiling Rust into PTX is not enough to make researchers leave CUDA.
Being language-agnostic is also not the task of the language, but of the IR. There are already a bunch of languages, such as Slang. The point is to use Rust itself for this.
Slang belongs to NVIDIA and was nicely given to Khronos because almost everyone had started relying on HLSL, given that Khronos decided not to spend any additional resources on GLSL.
Just like with Mantle and Vulkan, it seems Khronos hasn't been able to produce anything meaningful without external help since the Longs Peak days.
The language is general, but the current focus is really on programming GPUs.
The reason people continue to use CUDA and PyTorch and so on is that they are literally too stupid and too lazy to do it any other way.
When you're done, you can create IDE plugins and a graphical debugger with feature parity to Nsight.
The argument you are making sounds to me like, "well, good luck making a Vulkan application without CMake, Ninja, Meson, git, Visual Studio, CLion," etc., when in reality a five-line bash script calling gcc works just fine.
Nvidia's own people are the ones who have made Vulkan performance so close to CUDA's. AMD is behind, but the data shows they're off in performance roughly in proportion to the cost of the device. If they implemented coop mat 2, they would bridge the gap.
99.9% of people who use PyTorch and so on could achieve good-enough performance using a "simple Vulkan backend" for whatever Python stuff they're used to writing. That would strip out millions of lines of code.
The reason nobody has done this, outside of a few GitHub projects that Nvidia themselves have contributed to, is that there isn't a whole lot of money in iterative performance gains when better algorithmic approaches are being invented nearly every month.
Lacking understanding is doomed to failure.
The non-Nvidia hardware vendors really don't want CUDA to win. AMD went for open source + collaborative in a big way: OpenCL, then HSA. Both were broadly ignored. I'm not sure what Intel are playing at with SPIR-V; that stack doesn't make any sense to me whatsoever.
CUDA is alright though, in a kind of crufty-obfuscation-over-SSA sense. Way less annoying than OpenCL, certainly. You can run it on AMD GPU hardware if you want to: https://docs.scale-lang.com/stable/ and https://github.com/vosen/ZLUDA already exist. I'm hacking on SCALE these days.
But the concrete quality of CUDA, and Nvidia's offerings generally, is a move toward general-purpose parallel computing. Parallel processing is "the future," and the approach of just writing a loop and having each iteration run in parallel is dead simple.
Which is to say, Nvidia has invested a lot in making "easy things easy, along with hard things no harder".
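That "each loop iteration becomes its own thread" model is easy to show. A minimal sketch using Numba's CUDA JIT (assumes numba and a CUDA-capable GPU; sizes and values are arbitrary):

```python
# Minimal "parallel loop" sketch with Numba's CUDA JIT: one thread per element.
import numpy as np
from numba import cuda

@cuda.jit
def scale_add(a, b, out, alpha):
    i = cuda.grid(1)                 # global thread index = "loop iteration"
    if i < out.shape[0]:             # guard the last, partially-filled block
        out[i] = alpha * a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads = 256
blocks = (n + threads - 1) // threads
scale_add[blocks, threads](a, b, out, 2.0)   # Numba copies host arrays to/from the GPU
print(out[:4])
```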
In contrast, other chip makers seem to be acculturated to the natural lock-in of having a dumb, convoluted interface that a given chip's high performance compensates for.
Notably, AMD funded a CUDA clone, ZLUDA, and then quashed it [1]. Comments here at the time involved a lot of "they would always be playing catch-up".
I think the mentality of chip makers generally is that they'd rather control a small slice of a market than fight competitively for a large slice. It makes sense in that they invest years in advance and expect those investments to pay high profits.
[1] https://www.tomshardware.com/pc-components/gpus/amd-asks-dev...
> it's 20 years worth of institutional knowledge with a stable external api
> There are very few companies outside of 00s Microsoft who have managed to support 20 years worth of backward compatibility along with the bleeding edge.
To me, that sounds like massive investment.
They overlook that CUDA is a polyglot ecosystem with C, C++, and Fortran as its main languages, plus a Python JIT DSL since this year; compiler infrastructure for any compiler backend that wishes to target it, of which there are a few, including strange stuff like Haskell; IDE integration with Eclipse and Visual Studio; and graphical debugging just like on the CPU.
It is like when Khronos puts out those spaghetti-riddled standards, expecting each vendor or open-source community to create some kind of SDK, versus the vertical integration of console devkits and proprietary APIs, and then asks why professional studios have no qualms with proprietary tooling.
But you're right there was already something in place.
[0] https://github.com/google-research/kauldron
This _should_ break down as follows: running already-optimised programs on their runtime makes things worse, and running less-carefully-structured ones makes things better, and many programs out there turn out to be either quite naive or obsessively optimised for an architecture that hasn't existed for decades. I'd expect this runtime to be difficult to build but high-value on success. Interesting project, thanks for posting it.
3 more comments available on Hacker News