Why CUDA Translation Won't Unlock AMD
Posted 2 months ago · Active about 2 months ago
eliovp.com · Tech story
Key topics: CUDA, AMD, GPU Computing
The article argues that translating CUDA to AMD GPUs won't unlock their full potential, highlighting the limitations of such an approach.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment 7 days after posting; peak of 33 comments in the 168-180h window; average of 13.7 comments per period. Based on 41 loaded comments (41 data points).
Key moments
- Story posted: Nov 12, 2025 at 1:47 PM EST (2 months ago)
- First comment: Nov 19, 2025 at 7:04 PM EST (7 days after posting)
- Peak activity: 33 comments in the 168-180h window, the hottest stretch of the conversation
- Latest activity: Nov 20, 2025 at 2:59 PM EST (about 2 months ago)
ID: 45904178 · Type: story · Last synced: 11/20/2025, 9:01:17 PM
Google isn't using CUDA internally, so far as we know. Google's hyperscaler products have long offered CUDA options, since the demand isn't limited to AI/tensor applications that cannibalize the TPU's value prop: https://cloud.google.com/nvidia
OpenAI, OTOH, is big enough that vendor lock-in is actually hurting them, and their massive deal with AMD may finally move the needle for AMD and improve the ecosystem enough to make AMD a smooth experience.
Google Cloud does have a lot of Nvidia, but that's for their regular cloud customers, not internal stuff.
For example, DeepSeek R1 was released optimized for running on Nvidia hardware and needed some adaptation to run as well on ROCm, for the exact same reason that hand-tuned ROCm code beats generic code translated into ROCm. Basically, the DeepSeek team, for their own purposes, built R1 to fit Nvidia's way of doing things (because Nvidia is market-dominant). Once they released it, someone like Elio or AMD had to do the work of adapting the code to run its best on ROCm.
More established players who aren't out-of-left-field surprises like DeepSeek, e.g. Meta with its Llama series, mostly coordinate with AMD ahead of release day, but I suspect AMD still has to pay for that engineering work itself while Meta handles the Nvidia side. The simple fact that every researcher makes their code work on CUDA themselves, while AMD or someone like Elio has to do the work of porting it to be equally performant on ROCm, is what keeps people in the CUDA universe.
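As a rough illustration of that compatibility-vs-tuning gap (an editorial sketch, not from the article or the thread): ROCm builds of PyTorch expose the familiar torch.cuda API, so high-level code often runs unchanged, and torch.version.hip is one way to detect which backend you actually got before deciding whether backend-specific tuning is worth the effort.

```python
# Minimal sketch: the same high-level PyTorch code runs on CUDA and ROCm
# builds, because ROCm builds reuse the torch.cuda namespace. Detecting the
# backend is the cheap part; the hard part described above is tuning the
# kernels underneath for each vendor.
import torch

def gpu_backend() -> str:
    if not torch.cuda.is_available():
        return "cpu"
    # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
    return "rocm" if torch.version.hip is not None else "cuda"

backend = gpu_backend()
device = "cuda" if backend != "cpu" else "cpu"  # ROCm devices are still addressed as "cuda"
x = torch.randn(1024, 1024, device=device)
y = x @ x  # identical call on both stacks; the selected kernels differ
print(backend, torch.cuda.get_device_name(0) if backend != "cpu" else "")
```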
It isn't just the model, it's also the engine that runs it. From what I understand, this model works with SGLang but not with vLLM.
Really good model providers send a launch-day patch to llama.cpp and vLLM to make sure people can run their model instantly.
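For readers who haven't used it, this is roughly what that launch-day support buys: a model that ships a vLLM patch can be loaded with a few lines on day one. The model id below is a placeholder, and running this on AMD additionally assumes a ROCm-enabled vLLM build; a hedged sketch, not an endorsement of any specific model.

```python
# Offline-inference sketch with vLLM. The model id is a placeholder; on AMD
# hardware this assumes a ROCm-enabled vLLM installation.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-model")  # hypothetical model id
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain what a KV cache does."], params)
for out in outputs:
    print(out.outputs[0].text)
```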
https://tinygrad.org/ is the only viable alternative to CUDA that I have seen popup in the past few years.
https://x.com/clattner_llvm/status/1982196673771139466?s=61
They are not even remotely equivalent. tinygrad is a toy.
If you are serious, I would be interested to hear how you see tinygrad replacing CUDA. I could see a tinygrad zealot arguing that it is going to replace torch, but CUDA??
Have you looked into AMD support in torch? I would wager that, like for like, a torch/AMD implementation of a model is going to run rings around a tinygrad/AMD implementation.
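For context on the tinygrad-vs-torch comparison, the surface APIs are close enough that the disagreement above is really about kernel quality and ecosystem depth, not syntax. A small side-by-side sketch, assuming a recent tinygrad release where Tensor is importable from the top-level package:

```python
# The same tiny workload in PyTorch and tinygrad. The code looks similar;
# the performance argument in the thread is about what each framework
# generates underneath on AMD hardware.
import numpy as np
import torch
from tinygrad import Tensor  # import path assumed for recent tinygrad releases

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

out_torch = torch.relu(torch.from_numpy(a) @ torch.from_numpy(b))

# tinygrad is lazy: the work is realized when .numpy() is called.
out_tiny = (Tensor(a) @ Tensor(b)).relu().numpy()

print(np.allclose(out_torch.numpy(), out_tiny, atol=1e-4))
```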
The article frames this as "CUDA translation bad, AMD-native good" but misses the strategic value of compatibility layers: they lower switching costs and expand the addressable market. NVIDIA's moat isn't just technical—it's the ecosystem inertia. A translation layer that gets 80% of NVIDIA performance might be enough to get developers to try AMD, at which point AMD-native optimization becomes worth the investment.
The article is essentially a product pitch for Paiton disguised as technical analysis. The real question isn't "should AMD hardware pretend to be CUDA?" but rather "what's the minimum viable compatibility needed to overcome ecosystem lock-in?" PostgreSQL didn't win by being incompatible—it won by being good AND having a clear migration path from proprietary databases.
Their bottom line summed it up perfectly.
"We’re not saying “never use CUDA-on-AMD compilers or CUDA-to-HIP translators”. We’re saying don’t judge AMD based on them."
My take on it is fairly well summed up at the bottom of Elio's post. In essence, Elio is taking the view of "we would never use scale-lang for LLMs because we have a product that is native to AMD," and Michael is taking the view of "there is a ton of CUDA code out there that isn't just AI and we can help move those people over to AMD... oh, and by the way, we actually do know what we are doing, and we think we have a good chance at making this perform."
At the end of the day, both companies (my friends) are trying to make AMD a viable solution in a world dominated by an ever growing monopoly. Stepping back a bit and looking at the larger picture, I feel this is fantastic and want to support both of them in their efforts.
We actually think solutions like theirs are good for the ecosystem: they make it easier for people to at least try AMD without throwing away their CUDA code.
Our point is simply this: if you want top-end performance (big LLMs, specific floating point support, serious throughput/latency), translation alone is not enough. At that point you have to focus on hardware-specific tuning: CDNA kernel shapes, MFMA GEMMs, ROCm-specific attention/TP, KV-cache, etc.
That’s the layer we work on: we don’t replace people’s engines, we just push the AMD hardware as hard as it can go.
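As one concrete (and hedged) example of what framework-level tuning on AMD can look like without replacing anyone's engine: recent ROCm builds of PyTorch expose the GPU architecture string and a TunableOp GEMM autotuner. The attribute and environment variable below exist in recent ROCm-enabled releases but are worth verifying against your version; this is an editorial illustration, not a description of Paiton's approach.

```python
# Illustrative only: inspect the CDNA architecture and opt into PyTorch's
# TunableOp GEMM autotuning on ROCm. The env var must be set before the
# first HIP/CUDA initialization. Not Paiton's method; just one existing knob.
import os

os.environ.setdefault("PYTORCH_TUNABLEOP_ENABLED", "1")

import torch

if torch.cuda.is_available() and torch.version.hip is not None:
    props = torch.cuda.get_device_properties(0)
    # gcnArchName (a gfx* string on CDNA parts) is present on ROCm builds;
    # guarded with getattr in case an older build lacks it.
    arch = getattr(props, "gcnArchName", "unknown")
    print(f"ROCm device: {props.name}, arch: {arch}")

    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    c = a @ b  # with TunableOp enabled, GEMM configs are benchmarked and cached
```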
Ok, point made Nvidia. Kudos.
ATI had their moment in the sun before ASICs ate their cryptocurrency lunch, so both still had/have relevance outside gaming. But I see Intel is starting to take the GPU space seriously, and they shouldn't be ruled out.
And as mentioned elsewhere in the comments, there is Vulkan. There is also the idea of virtualized GPUs, now that the bottleneck is no longer the CPU but the GPU. As I mentioned, between tensor hardware and Moore's Law thresholds coming back around with 1-nanometer manufacturing, there will be a point where we hit a wall with current chips and see another change in technology.
So while Nvidia is living the life, unless they have a crystal ball for where tensor hardware is headed so they can steer CUDA toward it, a "co-processor" future is coming, and with it the next step toward NPUs. This is where Apple is aligning itself because, after all, they had the money and just said, "Nope, we'll license this round out..."
AMD isn't out yet. They, along with Intel and others, just need to figure out where the next bottlenecks are and build those toll bridges.
Training etc. still happens on Nvidia, but inference can be done on vLLM et al. with a true ROCm backend with relatively little effort?
38 more comments available on Hacker News