The Race to Build a Distributed GPU Runtime

Posted4 months agoActive4 months ago

jonbaer

79 points

60 comments

voltrondata.comTechstory

calmmixed

Debate

60/100

GPU ComputingDistributed SystemsData Processing

Key topics

GPU Computing

Distributed Systems

Data Processing

The article discusses the development of distributed GPU runtimes, and the discussion revolves around the challenges, potential solutions, and implications of such technology.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

72-84h

Avg / period

Comment distribution57 data points

Loading chart...

Based on 57 loaded comments

Key moments

01Story posted
Sep 4, 2025 at 4:18 PM EDT
4 months ago
Step 01
02First comment
Sep 7, 2025 at 4:42 PM EDT
3d after posting
Step 02
03Peak activity
45 comments in 72-84h
Hottest window of the conversation
Step 03
04Latest activity
Sep 9, 2025 at 2:50 AM EDT
4 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (60 comments)

Showing 57 comments of 60

KaiserPro

4 months ago

1 reply

One thing thats not addressed here is that the bigger you scale your shared memory cluster the closer to 100% chance that one node fucks up and corrupts your global memory space.

Currently the fastest way to get data from node a to node b is to RDMA it. which means that any node can inject anything into your memory space.

I'm not really sure how Theseus guards against that.

buildbot

4 months ago

1 reply

I’m not sure any system prevents RDMA from ruining your day :(

Back in grad school I remember we did something fairly simple but clearly illegal and wedged the machine so bad the out of band management also went down!

KaiserPro

4 months ago

> wedged the machine so bad the out of band management also went down!

Now thats living the dream of a shared cluster!

This is hazy now, but I do remember a massive outage of a lustre cluster, which I think was because there was a dodgy node injecting crap into everyone's memory space via the old lustre fast filesystem kernel driver. I think they switched to NFS export nodes after that. (for the render farm and desktops at least.)

jauntywundrkind

4 months ago

1 reply

Lot of hype, but man does Voltron Data keep blowing me away with what they bring out. Mad respect.

> There’s a strong argument to be made that RAPIDS cuDF/RAPIDS libcudf drives NVIDIA’s CUDA-X Data Processing stack, from ETL (NVTabular) and SQL (BlazingSQL) to MLOps/security (Morpheus) and Spark acceleration (cuDF-Java).

Yeah this seems like the core indeed, libcudf.

Focus here is on TCP & GPUDirect (Nvidia's pci-p2p, letting for example RDMA without CPU involvement across a full GPU -> NIC -> switch -> nic -> GPU happen).

Personally it feels super dangerous to just trust Nvidia on all of this, to just buy the solution available. Pytorch nicely sees this somewhat, adopted & took over Facebook/Meta's gloo project, which wraps a lot of the rdma efforts. But man there's just so so many steps ahead that Theseus is here with figuring out & planning what to do with these capabilities, these ultra efficient links, figuring out how to not need to use them if possible! The coordination problems keep growing in computing. I think of RISC-V with its arbitrary vector-based alternative to conventional x86 simd, going from a specific instruction for each particular operation to instructions being parameterized over different data lengths & types. https://github.com/pytorch/gloo

I'd really like to see a concerted effort to around Ultra Ethernet emerge, fast. Hardware isnt really available, and it's going to start out being absurdly expensive. But Ultra Ethernet looks like a lovely mix of collision-less credit-based Infiniband RDMA and Ethernet, with lots of other niceties (transport security). Deployments just starting (AMD Pensando Pollara 400 at Oracle). I worry that without broader availability & interest, without mass saturation, AI is going to stay stuck on libcudf forever; getting hardware out there & getting software stacos using it is a chicken & egg problem that big players need to spend real effort accelerating UET or else. https://www.tomshardware.com/networking/amd-deploys-its-firs...

latchkey

4 months ago

Our MI300x boxes have had 8x400G Thor2 RDMA working great for a year now.

anonymousDan

4 months ago

1 reply

Can someone tell me if the challenges the article describes and indeed the frameworks they mention are mostly relevant for training or also for inference?

benreesman

4 months ago

The fast interconnect between nodes has aaplications in inference at scale (big KV caches and other semi-durable state, multi-node tensor parallelism on mega models).

But this article in particular is emphasizing extreme performance ambitions for columnar data processing with hardware acceleration. Relevant to many ML training scenarios, but also other kinds of massive MapReduce-style (or at least scale) workloads. There are lots of applications of "magic massive petabyte plus DataFrame" (which is not I think solved in the general case).

smj-edison

4 months ago

1 reply

This reminds me a lot of Seymour Cray's two maxims of supercomputing: get the data where it needs to be when it needs to be there, and get the heat out. Still seems to apply today!

CamperBob2

4 months ago

1 reply

Calls to mind his other famous quote, "Would you rather plow your field with two strong oxen or 1024 chickens?"

How about ten billion chickens?

smj-edison

4 months ago

Yeah. I feel like he's still partially vindicated with things like the dragonfly topology, as a lot of problems don't nicely map onto a 2D or 3D topology (so longest distance is still the limiting factor). But the chicken approach certainly scales better, and I feel like since Cray's time there are more local-aware algorithms around.

RachelF

4 months ago

8 replies

I find it odd that given the billions of dollars involved, no competitor has managed to replicate the functions of CUDA.

Is it that hard to do, or is the software lock-in so great?

shmerl

4 months ago

4 replies

A better question is why there is no stronger push for a nicer GPU language that's not tied to any particular GPU and serves any purpose of GPU usage (whether it's compute or graphics).

I mean efforts like rust-gpu: https://github.com/Rust-GPU/rust-gpu/

Combine such language with Vulkan (using Rust as well) and why would you need CUDA?

bee_rider

4 months ago

1 reply

I think Intel Fortran has some ability to offload to their GPUs now. And Nvidia has some stuff to run(?) CUDA from Fortran.

Probably just needs a couple short decades of refinement…

pjmlp

4 months ago

2 replies

One of the reasons CUDA won over OpenCL, was that NVidia, contrary to Khronos, saw a value in helping those HPC researchers move their Fortran code into the GPU.

Hence they bought PGI, and improved their compiler.

Intel eventually did the same with Open API (which isn't plain OpenCL, rather an extension with Intel goodies).

I was on a Khronos webminar where the panel showed disbelief why anyone would care about Fortran, oh well.

bee_rider

4 months ago

1 reply

That’s actually pretty surprising to me. Of course, there are always jokes about Fortran being some language that people don’t realize is still kicking. But I’d expect a standards group that is at least parallel computing adjacent to know that it is still around.

pjmlp

4 months ago

Yet not only they joked about Fortran, it took CUDA adoption success, for them to take C++ seriously and come up with SPIR as counterpoint to PTX.

Which in the end was worthless because both Intel and AMD botched all OpenCL 2.x efforts.

Hence why OpenCL 3.0 is basically OpenCL 1.0 rebranded, and SYSCL went its own way.

It took a commercial company, Codeaplay a former compiler vendor for games consoles, to actually come up with a good tooling for SYSCL.

Which Intel in the middle of extending SYSCL with their Data Paralell C++, eventually acquired.

Those products are in the foundation of One API, and naturally go beyond what barebones OpenCL happens to be.

The mismanagement Khronos has done with OpenCL is one of the reasons Apple lost ties with Khronos.

kristianp

4 months ago

It's insane how big the NVidia dev kit is. They've got a library for everything. It seems like they have as broad software support as possible.

JonChesterfield

4 months ago

1 reply

I like Julia for this. Pretty language, layered on LLVM like most things. Modular are doing interesting things with Mojo too. People seem to like cuda though.

shmerl

4 months ago

1 reply

CUDA is just DOA as a nice language being Nvidia only (not counting efforts like ZLUDA).

JonChesterfield

4 months ago

1 reply

That's a compiler problem. Once could start from clang -xcuda and hack onwards. Or work in the intersection of CUDA and HIP which is relatively broad if a bit of a porting nuisance.

shmerl

4 months ago

1 reply

May be, but who is working on that compiler? And the whole ecosystem is controlled by a nasty company. You don't want to deal with that.

Besides, I'd say Rust is a nicer language than CUDA dialects.

JonChesterfield

4 months ago

2 replies

Chris and Nick originally, a few more of us these days. Spectral compute. We might have a nicer world if people had backed opencl instead of cuda but whatever. Likewise rust has a serious edge over c++. But to the compiler hacker, this is all obfuscated SSA form anyway, it's hard to get too emotional about the variations.

pjmlp

4 months ago

1 reply

Until Rust gets into any of industry compute standards, being a nicer language alone doesn't help.

Khronos standards, CUDA, ROCm, One API, Metal, none of them has Rust on their sights.

World did not back OpenCL, because it was stuck on a primitive C99 text based tooling, without an ecosystem.

Also Google decided to push their Renderscript C99 dialect instead, while Intel and AMD were busy delivering janky tools and broken drivers.

shmerl

4 months ago

1 reply

That's simply not true, because standard level should operate on the IR level, not on the language. You have to generate some IR from your language, at that level it makes sense to talk about standards. The only exception is probably WebGPU where Apple pushed using a fixed language instead of IR which is was a limiting idea.

pjmlp

4 months ago

None of those standards are about IR.

Also SPIR worked so great for OpenCL 2.x, that Khronos rebooted the whole mess back to OpenCL 1.x with OpenCL 3.0 rebranding.

shmerl

4 months ago

Language would matter more for those who actually would want to write some programs in it. So I'd say rust-gpu is something that should get more backing.

pjmlp

4 months ago

1 reply

Tooling and ecosystem, that is why.

shmerl

4 months ago

1 reply

Rust has great tooling and ecosystem. The point here is more of interest of those who want better alternatives to CUDA. AMD would be an obvious beneficiary to back the above, so I'm surprised about some lack of interest from their likes.

pjmlp

4 months ago

1 reply

It has zero CUDA tooling, that is what is relevant when positioning itself as alternative to C, C++, Fortran, Python JIT, PTX based compilers, compute libraries, Visual Studio and Eclipse integration, graphical debugger.

Cross compiling Rust into PTX is not enough to make researchers leave CUDA.

shmerl

4 months ago

1 reply

And CUDA has zero non CUDA tooling. That's a pointless circular argument which doesn't mean anything. Rust has Rust tooling and it's very good.

Being language agnostic is also not the task of the language, but task of IR. There is already a bunch of languages, such as Slang. The point is to use Rust itself for it.

pjmlp

4 months ago

Where is the graphical debugging experience for Rust, given that it so great tooling?

Slang belongs to NVidia, and was nicely given to Khronos, because almost everyone started relying on HLSL, given that Khronos decided not to spend any additional resources on GLSL.

Just like with Mantle and Vulkan, Khronos seems that without external help they aren't able to produce anything meaningful since Long Peak days.

fnands

4 months ago

Mojo might be what you are looking for: https://docs.modular.com/mojo/manual/gpu/intro-tutorial/

The language is general, but the current focus is really on programming GPUs.

chickenzzzzu

4 months ago

1 reply

Vulkan is at 95% of CUDA performance already. The remaining 5% is CUDA's small dispatch logic.

The reason why people continue to use CUDA and Pytorch and so on is because they are literally too stupid and too lazy to do it any other way

pjmlp

4 months ago

1 reply

With zero tooling, hence why no one cares about Vulkan, other than Valve and Google.

chickenzzzzu

4 months ago

1 reply

What tooling do you need? I'll make it for you for free

pjmlp

4 months ago

1 reply

Great, lets start with a Fortran compiler like CUDA has.

When you're done, you can create IDE plugins, and a graphical debugger with feature parity to NInsights.

chickenzzzzu

4 months ago

2 replies

Ok, that's a good retort. How many months of work do those things save you, compared to actually solving the problem you want to solve without those tools?

The argument you are making sounds to me like, "well good luck making a Vulkan application without cmake, ninja, meson, git, visual studio, clion" etc, when in reality a 5 line bash script to gcc works just fine

triknomeister

4 months ago

1 reply

Wrong analogy. You have no idea how wrong you are. Just look at the difference in performance analysis tools for AMD and Nvidia for GPUs. Nvidia makes it simple for people to write GPU programs.

chickenzzzzu

4 months ago

I do have an idea of how wrong I am.

Nvidia's own people are the ones who have made Vulkan performance so close to CUDA's. AMD is behind, but the data shows that they're off in performance proportional to the cost of the device. If they implement coop mat 2, then they would bridge the gap.

99.9% of people who use Pytorch and so on could achieve good enough performance using a "simple vulkan backend" for whatever Python stuff they're used to writing. That would strip out millions of lines of code.

The reason nobody has done this outside of a few github projects that Nvidia themselves have contributed to, is because there isn't a whole lot of money in iterative performance gains, when in reality better algorithmic approaches are being invented quite near every month or so.

pjmlp

4 months ago

First step is to understand why proprietary technology gets adoption.

Lacking understanding is doomed to failure.

JonChesterfield

4 months ago

1 reply

I'm pretty sure it's a political limitation, not a technical one. Implementing it is definitely a pain - it's a mix of hardcore backwards compatibility (i.e. cruft) and a rapidly moving target - but it's also obviously just a lot of carefully chosen ascii written down in text files.

The non-nvidia hardware vendors really don't want cuda to win. AMD went for open source + collaborative in a big way, opencl then hsa. Both broadly ignored. I'm not sure what Intel are playing at with spirv - that stack doesn't make any sense to me whatsoever.

Cuda is alright though, in a kind of crufty obfuscation over SSA sense. Way less annoying than opencl certainly. You can run it on amdgpu hardware if you want to - https://docs.scale-lang.com/stable/ and https://github.com/vosen/ZLUDA already exist. I'm hacking on scale these days.

joe_the_user

4 months ago

The thing that's also worth saying is that everyone speaks vaguely about CUDA's "institutional memory" and investment and so forth.

But the concrete qualities of CUDA and Nvidia's offerings generally is a move toward general purpose parallel computing. Parallel processing is "the future" and approach of just do loop and have each iteration be parallel is dead simple.

Which is to say Nvidia has invested a lot in making "easy things easy along with hard things no harder".

In contrast, other chip makers seem to be acculturated to the natural lock-in of having a dumb, convoluted interface to compensate for a given chip being high performance.

joe_the_user

4 months ago

2 replies

CUDA does involve a massive investment for Nvidia. It's not that it's impossible to replicate the functionality. But once a company has replicated that functionality, that company basically is going to be selling at competitive prices, which isn't a formula for high profits.

Notably, AMD funded a CUDA clone, ZLUDA, and then quashed it[1]. Comments at the time here involved a lot of "they would always be playing catch up".

I think the mentality of chip makers generally is that they'd rather control a small slice of a market than fight competitively for a large slice. It makes sense in that they invest years in advance and expect those investments to pay high profits.

[1] https://www.tomshardware.com/pc-components/gpus/amd-asks-dev...

noosphr

4 months ago

1 reply

Cuda isn't a massive investment, it's 20 years worth of institutional knowledge with a stable external api. There are very few companies outside of 00s Microsoft who have managed to support 20 years worth of backward compatibility along with the bleeding edge.

Izikiel43

4 months ago

> Cuda isn't a massive investmen

> it's 20 years worth of institutional knowledge with a stable external api

> There are very few companies outside of 00s Microsoft who have managed to support 20 years worth of backward compatibility along with the bleeding edge.

To me that sounds like massive investment

triknomeister

4 months ago

1 reply

ZLUDA was quashed due to concerns about infringement /violating terms of use.

joe_the_user

4 months ago

That was the story but the legality of cloning an API/ABI/etc is well established by, for example Google vs Oracle (though with gotchas that might Nvidia to put a legal fight).

pjmlp

4 months ago

1 reply

Because most fail to understand what makes CUDA great, and keep only trying to replicate C++ API.

They overlook CUDA is a polyglot ecosystem composed by C, C++ and Fortran as main languages, with Python JIT DSL since this year, compiler infrastructure for any compiler backend that wishes to target it of which there are a few including strange stuff like Haskell, IDE integration with Eclipse and Visual Studio, graphical debugging just like on the CPU.

It is like when Khronos puts out those spaghetti riddled standards, expecting each vendor/open source community, to create some kind of SDK, versus the vertical integration of console devkits and proprietary APIs, and then asking why professional studios have no qualms with proprietary tooling.

oivey

4 months ago

1 reply

Slight correction: CUDA Python JIT has existed for a very long time. Warp is a late comer.

pjmlp

4 months ago

1 reply

Kind of, none of those are at the integration level of CUTLASS 4, and the new cu tile architecture, introduced at GTC 2025.

But you're right there was already something in place.

oivey

4 months ago

I took a closer look at some of that and it’s pretty cool. Definitely neat to have some good higher level abstractions than the old C-style CUDA syntax that Numba was built on.

triknomeister

4 months ago

It is hard to do in the sense that it requires a very good taste about programming languages, which in turn requires really listening to the customers, and that requires huge number of people who are skilled. And no one has really invested that much money into their software ecosystem yet.

melodyogonna

4 months ago

Cuda is so many things I'm not sure it is even possible to replicate it.

ronsor

4 months ago

The problem is that CUDA is tightly integrated with NVIDIA hardware. You don't just have to replicate CUDA (which is a lot of tedious work at best), but you also need the hardware to run your "I can't believe it's not CUDA"

tucnak

4 months ago

I don't understand, doesn't kauldron[0] already exist?

[0] https://github.com/google-research/kauldron

JonChesterfield

4 months ago

The underlying problem here is real and legitimately difficult. Shunting data around a cluster (ideally as parts of it fall over) to minimise overall time, in an application independent fashion, is a definable dataflow problem and also a serious discrete optimisation challenge. The more compute you spend on trying to work out where to move the data around, the less you have left over for the application. Also tricky working out what the data access patterns seem to be. Very like choosing how much of the runtime budget to spend on a JIT compiler.

This _should_ breakdown as running optimised programs on their runtime makes things worse and running less-carefully-structured ones makes things better, where many programs out there turn out to be either quite naive or obsessively optimised for an architecture that hasn't existed for decades. I'd expect this runtime to be difficult to build but with high value on success. Interesting project, thanks for posting it.

up2isomorphism

4 months ago

As of today GPU is just too expensive for data processing. The direction they took makes it a very hard sell.

3 more comments available on Hacker News

View full discussion on Hacker News

ID: 45131784Type: storyLast synced: 11/20/2025, 4:50:34 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN