Show HN: Luminal – Open-source, search-based GPU compiler
github.com

Great that some numbers are provided, but in isolation I'm not sure what they tell us. It would be helpful to also share what tok/s you'd get with llama.cpp or something else on the same hardware, so we can actually tell whether it's faster or not :) Also including the prompt processing speed would be a bonus!
Nonetheless this project looks very cool, and I hope they can continue improving it to the point where it indeed beats human-led optimizations.
Instead of just applying predetermined optimization rules or patterns, the compiler formulates the problem as a search through many possible versions of the code. Each version can have different arrangements, tiling sizes, thread block configurations, memory access patterns, and instruction sequences, right?
And from my understanding, the “search space” is just a collection of all potential versions of the code (kernels) that the compiler can generate from the original input. So for example, the space might include
- Different ways to partition workloads among GPU threads and blocks
- Varying memory access strategies (using shared memory, global memory)
- Various instruction-level optimizations or reordering
- Alternative loop unroll factors or vectorization strategies
The compiler then programmatically produces a large number of candidate kernels by combining different optimizations and configurations. Among these millions of candidates, the compiler tries to find the one that performs best.
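To make the enumerate-and-measure idea concrete, here is a minimal, hypothetical Rust sketch of such an autotuning loop. None of the names come from Luminal, and `benchmark` is just a CPU stand-in for compiling a candidate kernel and timing it on the target GPU.

```rust
use std::time::Instant;

// Hypothetical candidate kernel configuration; a real search space is far
// richer (instruction ordering, memory layouts, fusion decisions, ...).
#[derive(Debug, Clone, Copy)]
struct KernelConfig {
    tile_size: usize,
    threads_per_block: usize,
    unroll: usize,
}

// Stand-in for "compile this candidate and time it on the target GPU".
// Here we just run a CPU loop whose cost depends on the config, so the
// sketch stays self-contained and runnable.
fn benchmark(cfg: KernelConfig) -> f64 {
    let start = Instant::now();
    let work = 1_000_000 / (cfg.tile_size * cfg.unroll).max(1);
    let mut acc = 0u64;
    for i in 0..work {
        acc = acc.wrapping_add((i * cfg.threads_per_block) as u64);
    }
    std::hint::black_box(acc);
    start.elapsed().as_secs_f64()
}

fn main() {
    // Enumerate the cartesian product of a few tuning knobs and keep the
    // fastest measured candidate. A real search-based compiler prunes this
    // combinatorial space with e-graphs, cost models, or smarter search
    // instead of enumerating it exhaustively.
    let mut best: Option<(KernelConfig, f64)> = None;
    for &tile_size in &[16, 32, 64, 128] {
        for &threads_per_block in &[64, 128, 256] {
            for &unroll in &[1, 2, 4, 8] {
                let cfg = KernelConfig { tile_size, threads_per_block, unroll };
                let t = benchmark(cfg);
                if best.map_or(true, |(_, bt)| t < bt) {
                    best = Some((cfg, t));
                }
            }
        }
    }
    let (cfg, t) = best.unwrap();
    println!("best config {:?} at {:.6}s", cfg, t);
}
```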
In that case, can the compiler print out which gpu configuration works the best for that computer? And will that configuration be applicable to all computers with the same setup?
This is such an interesting technique.
This obviously creates a combinatorial problem that we mitigate with smarter search.
The kernels are run on the computer the compiler is running on. Since runtime is our gold standard it will search for the best configuration for your hardware target. As long as the setup is mostly the same, the optimizations should carry over, yes.
The typical range is 10 mins to 10 hours. It won't be fast but you only have to do it once and then those optimizations are set for every forward pass.
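One way to amortize that one-time cost is to persist the winning configuration keyed by a hardware identifier, so machines with the same setup reuse it instead of re-searching. The sketch below uses assumed names and a plain text file; it is not Luminal's actual caching mechanism.

```rust
use std::fs;

// Hypothetical tuned result to persist; field names are illustrative.
#[derive(Debug)]
struct TunedConfig {
    tile_size: usize,
    threads_per_block: usize,
}

// Cache file keyed by a hardware identifier (e.g. the GPU model string),
// so identical machines can reuse the same tuning result.
fn cache_path(hardware_id: &str) -> String {
    format!("tuned_{}.txt", hardware_id.replace(|c: char| !c.is_alphanumeric(), "_"))
}

fn save(hardware_id: &str, cfg: &TunedConfig) -> std::io::Result<()> {
    fs::write(
        cache_path(hardware_id),
        format!("{} {}", cfg.tile_size, cfg.threads_per_block),
    )
}

fn load(hardware_id: &str) -> Option<TunedConfig> {
    let text = fs::read_to_string(cache_path(hardware_id)).ok()?;
    let mut parts = text.split_whitespace();
    Some(TunedConfig {
        tile_size: parts.next()?.parse().ok()?,
        threads_per_block: parts.next()?.parse().ok()?,
    })
}

fn main() -> std::io::Result<()> {
    let hw = "example-gpu";
    if let Some(cfg) = load(hw) {
        println!("reusing cached config: {:?}", cfg);
    } else {
        // In a real system this branch is the expensive 10-minute-to-10-hour search.
        let cfg = TunedConfig { tile_size: 64, threads_per_block: 128 };
        save(hw, &cfg)?;
        println!("tuned and cached: {:?}", cfg);
    }
    Ok(())
}
```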
aka "a heuristic"
yes, optimized kernels for one system will work on other systems with the same hardware. it's fine to take a long time compiling if you just compile once and run a lot.
Lol np-hard is still np-hard no matter how you slice it (especially given vague objective functions).
I'm curious as to the breadth of possibilities that could be searched. I would imagine something like this could invent flash attention if it cast its net wide enough, but that is a pretty broad net. [Edit: I scrolled back and saw flash attention was explicitly mentioned, cool stuff]
Also, what about CUDA alternatives like ROCm?
- Can you give some more insight on why 12 ops suffice for representing your input program?
- With such a small number of ops, isn't your search space full of repeated patterns? I understand the desire to avoid predefined heuristics, but it seems that learning some heuristics/patterns would massively help reduce the space.
the search does common subexpression elimination by default. if two patterns are unioned in the search space, that union applies to every occurrence of the pattern at the same time, so using e-graphs helps keep the search space smaller.
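For readers unfamiliar with e-graphs, here is a tiny sketch using the Rust `egg` crate, chosen purely for illustration and not necessarily the machinery Luminal uses. Because structurally identical subterms are hash-consed into one e-class, a single rewrite of the pattern `(* ?x 1)` simplifies every occurrence at once.

```rust
// Requires the egg crate (e.g. egg = "0.9" in Cargo.toml); illustration only.
use egg::{rewrite as rw, *};

define_language! {
    enum Expr {
        Num(i32),
        "+" = Add([Id; 2]),
        "*" = Mul([Id; 2]),
        Symbol(Symbol),
    }
}

fn main() {
    // The subterm (* a 1) appears twice, but hash-consing stores it in a
    // single e-class, so one rewrite simplifies both occurrences at once.
    let rules: Vec<Rewrite<Expr, ()>> = vec![
        rw!("mul-one"; "(* ?x 1)" => "?x"),
    ];

    let expr: RecExpr<Expr> = "(+ (* a 1) (* (* a 1) b))".parse().unwrap();
    let runner = Runner::default().with_expr(&expr).run(&rules);

    // Extract the smallest equivalent expression from the saturated e-graph.
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("best (cost {}): {}", cost, best); // e.g. (+ a (* a b))
}
```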
This is insanely cool.
But then there are performance tradeoffs in reusing intermediates vs recomputing that I think you can't represent.
Some of these may affect numerical stability btw. See eg https://herbie.uwplse.org/
There is so much potential in this project.
right now, since we're profiling kernels and we have a reference output from the unoptimised version, we can directly measure the deviation of profiled outputs "for free", since we're already computing them for runtime. tbh this isn't what i want long term; i want to bake numerical stability natively into the search space so we only extract dags that would produce stable outputs. hopefully that'll be solved soon.
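A minimal sketch of that kind of deviation check, with illustrative names and tolerance rather than Luminal's actual code: compare a candidate kernel's output against the unoptimised reference output that profiling already produced.

```rust
// Maximum elementwise relative error between candidate and reference outputs.
fn max_relative_error(reference: &[f32], candidate: &[f32]) -> f32 {
    reference
        .iter()
        .zip(candidate)
        .map(|(r, c)| {
            let denom = r.abs().max(1e-6); // avoid dividing by ~0
            (r - c).abs() / denom
        })
        .fold(0.0f32, f32::max)
}

fn main() {
    // Stand-ins for the reference (naive) output and a candidate kernel's output.
    let reference = vec![1.0f32, 2.0, 3.0, 4.0];
    let candidate = vec![1.0f32, 2.000_001, 2.999_998, 4.0];

    let err = max_relative_error(&reference, &candidate);
    let tolerance = 1e-4f32; // illustrative threshold
    if err <= tolerance {
        println!("candidate accepted (max rel err {:.2e})", err);
    } else {
        println!("candidate rejected (max rel err {:.2e})", err);
    }
}
```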
Also, how do you ensure that newly generated kernels are correct w.r.t. the original naive kernel that you use as specification?
the search space is designed to remain logically equivalent at all times, by virtue of how it's built (applying rewrite rules we know don't change the logical equivalence).
all the optimizations for matmul so far have been straightforward trajectories from naive (tiling, smem caching, tensor core offload, etc.)
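As a rough illustration of the first step in that trajectory, here is a hypothetical CPU-side sketch of the tiling transformation applied to a naive matmul; a GPU version would additionally map tiles onto thread blocks and stage them through shared memory.

```rust
// Hypothetical CPU-side sketch of the "tiling" step in the naive -> optimised
// matmul trajectory: iterate over TILE x TILE blocks so the working set stays
// hot in cache. Sizes are illustrative.
const N: usize = 256;
const TILE: usize = 32;

fn tiled_matmul(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ii in (0..N).step_by(TILE) {
        for kk in (0..N).step_by(TILE) {
            for jj in (0..N).step_by(TILE) {
                // Multiply one TILE x TILE block of A with one block of B,
                // accumulating into the corresponding block of C.
                for i in ii..ii + TILE {
                    for k in kk..kk + TILE {
                        let a_ik = a[i * N + k];
                        for j in jj..jj + TILE {
                            c[i * N + j] += a_ik * b[k * N + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let a = vec![1.0f32; N * N];
    let b = vec![1.0f32; N * N];
    let mut c = vec![0.0f32; N * N];
    tiled_matmul(&a, &b, &mut c);
    // Every entry should equal N (the sum of N ones).
    println!("c[0] = {}", c[0]);
}
```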
https://cacm.acm.org/research/stochastic-program-optimizatio...
Would Luminal be capable of rediscovering this trick?