CUDA Tile Open Sourced
Posted 14 days ago · Active 3d ago
github.com · story · High profile
informative · neutral
Key topics: GPU Allocation, CUDA, Open-Source Software
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 6d · Peak period: 90 comments (Day 7) · Avg / period: 21.6
Comment distribution: 108 data points (chart omitted; based on 108 loaded comments)
Key moments
- Story posted: Dec 19, 2025 at 3:49 PM EST (14 days ago)
- First comment: Dec 25, 2025 at 2:30 PM EST (6d after posting)
- Peak activity: 90 comments in Day 7 (hottest window of the conversation)
- Latest activity: Dec 30, 2025 at 10:53 AM EST (3d ago)
ID: 46330732 · Type: story · Last synced: 12/26/2025, 3:40:37 PM
Google is leading XLA & IREE, with awesome intermediate representations used by lots of hardware platforms, backing really excellent JAX & PyTorch implementations, and offering tools for layout & optimization that folks can share: they've really built an amazing community.
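For flavor, here's a minimal sketch of the XLA path being praised: jax.jit traces an ordinary Python function and hands the trace to XLA, which compiles it for whichever backend is present, with no per-device code from the user.

```python
# Minimal sketch of the XLA compilation path: jax.jit traces this Python
# function once, and XLA compiles the trace for CPU, GPU, or TPU.
import jax
import jax.numpy as jnp

@jax.jit
def tanh_gelu(x):
    # tanh approximation of GELU: a typical fusion-friendly elementwise chain
    return 0.5 * x * (1.0 + jnp.tanh(0.7978845608 * (x + 0.044715 * x**3)))

print(tanh_gelu(jnp.linspace(-2.0, 2.0, 5)))
```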
There's still so much room for planning/scheduling, so much hardware we have yet to target. RISC-V has really interesting vector instructions, for example, and it seems like there's so much exploration and work to do to leverage them better.
This is nicely illustrated by this recent article:
https://news.ycombinator.com/item?id=46366998
Ah, and Nsight also supports debugging Python CUDA Tile kernels.
https://developer.nvidia.com/blog/simplify-gpu-programming-w...
Non-exclusive license, actually.
IREE hasn't been at G for >2 years.
I lost count at five or six. Define your acronyms on first use, people.
Get better at computers, people!
When confusion gets framed as "this is substandard writing", it rewards showing up and performing a lack of context rather than engaging with the substance or asking clarifying questions. Over time that creates pressure to write to the lowest common denominator, instead of the audience the author is clearly aiming at.
HN already operates on an implicit baseline (CUDA, open source, LLVM, etc.) and mostly lets comments fill in gaps. That usually produces better discussions than treating every unfamiliar term as an author failure, especially when someone is just trying to share or explain something they care about.
So yeah, I am genuinely curious why you see personal unfamiliarity as something the entire discussion should reorganize itself around.
(Shrug) The fact is that all major style guides -- APA, MLA, AP, Chicago, probably some others -- call for potentially-unfamiliar acronyms to be defined on first use. For some reason, though, essentially nobody who writes about this particular topic agrees with that.
Which is cool, it's not my field, so I don't really GAF. I'm mostly just remarking on how unusually difficult it was to drill down on this particular term. I'll avoid derailing the topic further than I already have.
HN already assumes a baseline of technical literacy. When something falls outside that baseline, the usual move is to ask for context or links, not to reframe personal unfamiliarity as an author failure.
So please, don’t normalize treating "I don’t know this yet" as a failure of the post.
I won't argue, but there is a middle ground between articles consisting of pure JAFAs and this:
> accommodate readers who won’t even type four letters into a search bar
I think it helps if acronyms are expanded at least once or in a footnote so that the potential new reader can follow along and does not need to guess what ACMV^ means.
^: Awesome Combobulating Method by VTimofeenko, patent pending.
Stop carrying water for poor documentation practice.
Wikipedia gets the job done, but these days, Wikipedia is often a long way down the Google search results list. I think they downranked it when they started force-feeding the AI answers (which also didn't help).
Just say to the AI, "Explain THIS".
Also HN: "Not like that"
How close was I?
Second-rate libraries like OpenCL succeeded where Mojo failed because they were open. They went through standards committees and cooperated with the rest of the industry (even Nvidia) to hear out everyone's needs. Lattner gave up on appealing to that crowd the moment he told Khronos to pound sand. Nobody should be wondering why Apple or Nvidia won't touch Mojo with a thirty-nine and a half foot pole.
CUDA Tile was designed exactly to give Python parity in writing CUDA kernels, acknowledging the relevance of Python while offering a path where researchers don't need to mess with C++.
It was announced at this year's GTC.
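To make the "Python parity" point concrete, here is a GPU kernel written entirely in Python. This is a sketch using Numba's cuda.jit, not the cuTile API itself (cuTile works at the tile level rather than the thread level); it just illustrates the Python-kernel workflow.

```python
# A Python-authored CUDA kernel, sketched with Numba's cuda.jit for
# illustration; cuTile's actual API is tile-based rather than thread-based.
import numpy as np
from numba import cuda

@cuda.jit
def axpy(a, x, y, out):
    i = cuda.grid(1)          # global thread index
    if i < out.size:          # guard against out-of-range threads
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)
threads = 256
blocks = (n + threads - 1) // threads
axpy[blocks, threads](np.float32(2.0), x, y, out)   # launch: kernel[grid, block](...)
```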
NVidia has no reason to use Mojo.
Julia and Python GPU JITs work great on Windows, and many people only get Windows systems as the default at work.
1) Install Linux
2) Summon Chris Lattner to play you a sad song on the world's smallest violin in honor of the Windows devs that refuse to install WSL.
What about that outcome?
When is the Year of NPUs on Linux?
https://www.pcspecialist.de/kundenspezifische-laptops/nvidia...
Which, as usual, kind of work but not really in GNU/Linux.
I'm tired of people shilling things they don't understand.
We don't have to wait for singular companies or foundations to fix ecosystem problems. Only the means of coordination are needed. https://prizeforge.com isn't there yet, but it is already capable of bootstrapping its own development. Matching funds, joining the team, or contributing on MuTate will all make the ball pick up speed faster.
Geohot has been working on this for about a year, and every roadblock he's encountered he has had to damn near pester Lisa Su about getting drivers fixed. If you want the CUDA replacement that would work on AMD, you need to wait on AMD. If there is a bug in the AMD microcode, you are effectively "stopped by AMD".
https://tinygrad.org/#tinybox
You can see in his table he calls out his AMD system as having "Good" GPU support vs. "Great" for Nvidia. So yes, I would argue he is doing the work to platform and organize people, at a professional level, to sell AMD systems in a sustainable manner - everything you claim needs to be done - and he is still bottlenecked by AMD.
A single early-stage company is not ecosystem-scale organization. It is instead the legacy benchmark to beat. This is what we do today because the best tools in our toolbox are a corporation or a foundation.
Whether AMD stands to benefit from doing more or less, we are likely in agreement that Tinygrad is a small fraction of the exposed interest and that if AMD were in conversation with a more organized, larger fraction of that interest, that AMD would do more.
I'm not defending AMD doing less. I am insisting that ecosystems can do more and that the only reason they don't is because we didn't properly analyze the problems or develop the tools.
The Tile dialect is pretty much independent of the Nvidia ecosystem, so all it takes is one good set of MLIR transform passes to take anything on the CUDA stack that compiles to Tile out of the Nvidia ecosystem prison.
So if anything, this is actually a massive opportunity to escape vendor lock-in, if it catches on in the CUDA ecosystem.
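For a rough sense of what "a set of MLIR transform passes" means in practice, here is a generic pipeline driven from MLIR's Python bindings. The passes shown (canonicalize, cse) are stock upstream MLIR; an actual Tile-dialect escape hatch would need purpose-built lowering passes written in this same framework, which don't exist today.

```python
# Generic sketch of running an MLIR pass pipeline from Python. The passes
# here are stock MLIR; a real Tile-to-other-backend lowering would be a
# custom pipeline registered the same way. Exact binding signatures may
# vary by MLIR version.
from mlir.ir import Context, Module
from mlir.passmanager import PassManager

IR = """
module {
  func.func @double(%arg0: i32) -> i32 {
    %c2 = arith.constant 2 : i32
    %0 = arith.muli %arg0, %c2 : i32
    return %0 : i32
  }
}
"""

with Context():
    module = Module.parse(IR)
    pm = PassManager.parse("builtin.module(canonicalize,cse)")
    pm.run(module.operation)
    print(module)
```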
They had about 15 years to move beyond C99: stone-age workflows compiling GLSL and C99 through their drivers, no library ecosystem, and printf debugging.
Eventually some of the issues were fixed, after they started seeing that only hardliners would put up with such a development experience, and by then it was too late.
oneAPI builds on top of SYCL and is basically Intel's CUDA; it is already the second attempt to have C++ in OpenCL, after the OpenCL 2.x effort, which worked so well that OpenCL 3.0 is basically a reboot back to OpenCL 1.0.
Also, even SYCL only got a proper kick-off after Codeplay came up with its implementation; nowadays they sell oneAPI support and tooling, after being acquired by Intel.
It is surely not equivalent as of today.
You'll need a bespoke scheduling IR to drive the CUDA Tile IR for output to ThunderKittens/ThunderMittens/HipKittens, since the main differences between them are size/scheduling related.
Wrap the final step in an optimizer loop to find the best size/schedules; profit.
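That optimizer loop can be as simple as brute-force timing over candidate tile sizes. A minimal sketch, where compile_kernel and run_kernel are hypothetical placeholders for whatever the backend provides:

```python
# Brute-force autotuning sketch. `compile_kernel` and `run_kernel` are
# hypothetical placeholders, not a real API; the point is just the loop.
import itertools
import time

def autotune(tile_candidates, compile_kernel, run_kernel, reps=5):
    best_time, best_cfg = float("inf"), None
    for tile_m, tile_n in tile_candidates:
        kernel = compile_kernel(tile_m=tile_m, tile_n=tile_n)
        run_kernel(kernel)                      # warm-up run
        start = time.perf_counter()
        for _ in range(reps):
            run_kernel(kernel)
        elapsed = (time.perf_counter() - start) / reps
        if elapsed < best_time:
            best_time, best_cfg = elapsed, (tile_m, tile_n)
    return best_cfg, best_time

# e.g.: autotune(itertools.product([64, 128], [64, 128, 256]), compile_fn, run_fn)
```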
We’d all prefer cross platform programming, but if you’re going to do platform specific, I prefer open source to closed source.
Thank you NVIDIA!