HipKittens: Fast and furious AMD kernels
Mood: excited
Sentiment: positive
Category: tech
Key topics: AMD, kernel development, high-performance computing
HipKittens is a new project that aims to develop fast and efficient AMD kernels, with a related blog post discussing the project's goals and progress.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 20h
Peak period: 38 (Day 2)
Avg / period: 33
Based on 66 loaded comments
Key moments
- 01 Story posted: 11/14/2025, 2:27:20 AM (5d ago)
- 02 First comment: 11/14/2025, 10:38:42 PM (20h after posting)
- 03 Peak activity: 38 comments in Day 2 (hottest window of the conversation)
- 04 Latest activity: 11/15/2025, 4:47:26 PM (3d ago)
Competitors now only need to optimize for a narrow set of algorithms. If a vendor can run vLLM and Transformers efficiently, a massive market becomes available. Consequently, companies like AMD or Huawei should be able to catch up easily. What, then, is Nvidia’s moat? Is InfiniBand enough?
There's a ton of pressure on the market to decouple Nvidia's proprietary software from literally everything important to AI, and they will either gracefully transition and control it, or it will reach a breaking point and someone else will do it for (and to) them. I'm sure they've got finance nerds and quants informing and minmaxing their strategy, so they probably know to the quarter when they'll pivot and launch their FOSS, industry-leading standards narrative (or whatever the strategy is).
I thought this too, in 2015. OpenCL looked really promising, but Apple bailed and neither AMD nor Intel had the funding to keep up with Nvidia's research. It sorta floundered, even though Nvidia GPUs smugly ran OpenCL code with benchmark-leading performance.
Nvidia won the datacenter because of hardware. You could release a perfect CUDA-to-Vulkan translator tomorrow, and they still wouldn't be dethroned until better hardware replaced it. Intel is swirling the drain, Qualcomm is hedging their bets on mobile, AMD is (still) too underfunded - Apple is the only company with the design chops and TSMC inroads to be a serious threat, and they can't release a datacenter product to save their life. It's understandable why people think Nvidia is a monopoly, Team Green is pulling a full-on "Luigi wins by doing nothing" in 2025: https://knowyourmeme.com/memes/luigi-wins-by-doing-absolutel...
The market has almost no pressure to decouple from Nvidia - nobody else has mature solutions. It requires a preestablished player to make a similarly risky play, which might rule out everyone who's sitting at the table.
Uh, Flash died because Apple refused to support it on mobile Safari. Perhaps Flash would have died anyway, but that is the proximate cause. And Apple's competitors were falling over themselves to market Flash support as a competitive advantage vs. iPhone.
Nvidia's valuation and moat are centered around data center class GPUs used for training. I don't think they effectively have that space to themselves for much longer. Google is already using their own TPUs at scale for both training and inference. They still use some Nvidia stuff but they seem to be able to keep that off the critical path for anything that needs to run at "Google scale". OpenAI just ordered a bunch of AMD hardware. A lot of AI engineers use Apple laptops that rely on the M series hardware.
In short, the CUDA moat is shrinking. It's still relevant of course and there is a lot of tooling and many frameworks that depend on it. That's why everybody still uses it. But not exclusively. And there's a lot of extremely well funded and active development to cut loose from it. AMD of course wants in. So does Intel. And so does everybody else. This HipKittens thing looks like it makes some big steps towards a more neutral software ecosystem.
Plus strategic partnerships with cloud providers.
And InfiniBand, yes
Apple didn’t really “win” out against Android, and it would be a very wrong way of measuring what actually happened. Yet, Apple could have been seen as more premium during various points of that timeline. The truth of the matter was, it was never a swimming race at any point in that smartphone timeline. It was simply a flood that you could convince yourself was an orderly race.
I believe the same is happening now, and it’s in Nvidia’s interest to maintain the narrative that there is a race and they are winning it. Believing something like this during the smartphone era would have been foolish.
AI is millions of times slower than optimal algorithms for most things.
That’s a complete institutional and leadership failure.
Ironically, building chips is the actual _hard_ part. The software and the compilers are not trivial but the iteration speed is almost infinite by comparison.
It goes to show that some companies just don’t “get” software. Not even AMD!
Why oversimplify the premise and frame your take as some 'proof'? Just use the term counter-argument/example.
HipKittens is an improvement but AMD does not have the ability to understand or track kernel performance so it'll be ignored.
This isn't fixable overnight. Company-wide DevOps and infrastructure is outsourced to TCS in India who have no idea what they're doing. Teams with good leadership maintain their own shadow IT teams. ROCm didn't have such a team until hyperscalers lost their shit over our visibly poor development practices.
Even if AMD did extend an offer to hire all the people in the article, it would be below-market since the company benchmarks against Qualcomm, Broadcom, and Walmart, instead of Google, Nvidia, or Meta.
We haven't had a fully funded bonus in the past 4+ years.
It's not only not fixable overnight, but it's not fixable at all if the leadership thinks they can coast on simply being not as bad as Intel, and Intel has a helluva lot of inertia and ability to simply sell OEM units on autopilot.
Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.
But the real issue is we don't want to invest in beating Nvidia on quality. Otherwise we wouldn't be doing stock buybacks and instead use the money on poaching engineers.
The mindset is that we maintain a comfortable second place by creating a shittier but cheaper product. That is how AMD has operated since 1969 as a second source to Fairchild Semiconductor and Intel. It's going to remain the strategy of the company indefinitely with Nvidia. Attempting to become better would cost too much.
> Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.
Knocking out Lisa Su would be stupid, since she has the loyalty of the whole company and is generally competent.
What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers. Or phase in the same over a longer period of time. The company is full of people that do nothing because we've paid under market for so long. That's fine when competing against Intel, it's not acceptable when competing against Microsoft, Amazon, OpenAI, Google, and Nvidia.
Lisa Su is the only CEO in the S&P500 who can get away with mass layoffs and still have the loyalty of the rest of the employees.
I was part of a company with a similar problem. If AMD’s situation is similar to what I dealt with, it’s more complicated. When you start doing deep cut layoffs at the IC level combined with expectations of big salary increases for those who remain, the office politics escalate to a level I didn’t know was possible.
All of those people who do nothing find a way to join forces with those people who are showing those inflated benchmarks to execs and before you know it the layoffs are about as accurate as random chance when it comes to cutting the dead weight from the company.
In my experience, the change needs to start closer to the top: Upper layers of management need to be shaken up. Middle management audited by new upper management hires who have fresh eyes and aren’t afraid to make honest evaluations. High performing teams who are stuck under management hell need to be identified and rotated into other projects that are critical for the company but have become occupied by fiefdom-building managers. Hiring needs to ramp up to bring in new talent that was previously priced out by the low comp.
It’s hard. I wish there was an easy way to cut the low performers, but they have an amazing way of teaming up with the bad managers. Maybe because they have so much free time to do office politics because they’re not doing much work.
I mean, isn’t that always the way? Honestly, I feel like you could do a lot worse than just firing most of the people who demonstrate above average social skills. Sure, some would be fired unnecessarily, but I can’t think of any engineers that have seemed almost pathologically shy that also didn’t want to work hard.
> What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers.
Tell me you're an engineer without telling me you're an engineer. The problem is they don't know which half and they can't know. It's an issue of legibility and transparency - put yourself into the shoes of the C-suite. You're staring down a complete black box of, what, 5,000 people. How can you possibly know who's good and who's not? Think of the information they have at hand - what the chain of command tells them. What if the chain of command itself is the problem? Think about how you yourself could protect a bad employee if you were a manager. You could! How can they possibly find the truth?
People rightly hate stack ranking, but you can see why ideas like that exist - attempts to come up with organizational pruning algorithms that are resistant to the managers themselves being the problem.
And this is also why CEOs incoming with a turnaround mission often do a clean sweep and stack the c-suite with all their friends. Not because they're giving jobs to their mates - although sure, that does happen - but because they're trying to establish at least a single layer of trust, which can then in time hopefully be extended downwards. But it all takes time, and for some organizations, they never do manage it. When unlimited orgs all compete for the same limited number of good managers - well, some of them are going to lose.
Ironically I'm bullish on AI being able to greatly help with all of this. Maybe running on AMD GPUs...
They've paid serious amounts in RSUs over the last six years. Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs. Bonus might be team dependent, I remember being annoyed and nicely surprised by it in different years.
The aql profiler confuses me quite a lot but it's definitely a tool for measuring performance.
It depends on team, we have some testing, and progress is being made. But it's not "working" or comprehensive as we get complaints from our big customers. We should be replicating their setup internally and not have them catch problems.
> Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs.
We need to pay top of market to steal people from our competitors. We can't pay less than Nvidia and outcompete them. Paying less is a signal we're aiming for second and to copy the market leader.
With regards to performance, there are some things tracked carefully and other things that are not tracked at all. I suspect that is why some folks think we're really good at it and others think we're terrible. There's lots of room for improvement, though. Excitement over trivial performance regressions is more a sign of immaturity than of good tracking.
Yes, this is true. Painfully true.
This is WILD to hear considering how well it appears AMD is executing from the outside.
Well said, their Instinct parts are actually, at a hardware level, very very capable pieces of kit that - ignoring software/dev ecosystem - are very competitive with Nvidia.
Problem is, AMD has a terrible history of supporting its hardware (either just outright lack of support, cough Radeon VII; or constantly scrapping things and starting over and thus the ecosystem never matured) and is at a massive deficit behind the CUDA ecosystem, meaning that a lot of that hardware's potential is squandered by the lack of compatibility with CUDA and/or a lack of investment in a comparable alternative. Those factors have given Nvidia the momentum it has, because most orgs/devs will look at the support/ecosystem delta and ask themselves why they'd expend the resources reinventing the CUDA wheel to leverage AMD hardware when they can just spend that money/time investing in CUDA and Nvidia instead.
To their credit, AMD, it seems, has learned its lesson: they're actually trying to invest in ROCm and their Instinct ecosystem and seem to be sticking to their guns on it. We're starting to see people pick it up, but they're still far behind Nvidia and CUDA.
One key area that Nvidia is far ahead of AMD on in the hardware space is networking.
ROCm pre-Rock suffers from the ossification in the engineering organization. The Rock seeks to completely change that, and the team driving it is amazing. Try out the pre-alpha installer. It is already better than the default installer.
There is hope.
Indeed. For clarity, I agree the performance is certainly there. My comment about being behind was in the context of marketshare and ecosystem maturity compared to CUDA. In fact, I'd say there's more than just hope but actual meaningful progress and commitment being made there, and I'm happy to see it.
AMD hires talented people at below-market and doesn't promote them or give raises. This causes employees to aim at resume-driven development by reinventing the wheel so they can get a job somewhere else.
It's a similar problem to Google, except at Google it's because promotions are explicitly for people that ship new products.
I think I may need to reduce the number of architectures it's built for to successfully compile it on the official Debian buildd infrastructure, but my (unverified) understanding is that most of its reverse dependencies only need the header-only parts of the library anyway.
I'm told they're working on improving the build times via a few different methods.
Things are turning around for AMD. If you have an AMD card, go to pytorch.org, click Linux+ROCm and install PyTorch. 3 years ago, this was hopeless. Today, most mainline things work. I ran nanochat on MI300X and it just worked. I think that's true about MI350X now too. The MI350X machine is stable.
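A quick way to confirm that such an install actually sees the card is sketched below. This is a minimal check, not part of the original comment: `describe_backend` is a hypothetical helper name, and the sketch assumes a ROCm build of PyTorch, which exposes AMD GPUs through the `torch.cuda` namespace and reports a HIP version in `torch.version.hip`.

```python
def describe_backend() -> str:
    """Report which GPU backend, if any, the installed PyTorch build can use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip_version = getattr(torch.version, "hip", None)
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no GPU is visible"

    # ROCm builds expose AMD GPUs through the torch.cuda namespace.
    name = torch.cuda.get_device_name(0)
    backend = f"ROCm/HIP {hip_version}" if hip_version else "CUDA"
    return f"{name} via {backend}"


print(describe_backend())
```

If the last line reports a ROCm/HIP version next to the device name, the ROCm wheel was picked up; a "no GPU is visible" message usually points at the driver or the wheel variant rather than at PyTorch itself.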
They are clearly behind NVIDIA, nobody doubts that. And a lot of investment into software will be required to catch up, ecosystem, compiler, and driver. But 2 years ago they seemed hopeless, now they don't. Things take time. HipKittens is a great codebase to study to see where AMD's LLVM backend is still lacking; compare it to the CUDA Kittens.
For training, it's NVIDIA and Google in first. AMD in second. And nobody in third. Intel and Tenstorrent are not remotely close. Huawei examples segfaulted. Groq gave up selling chips. Cerebras isn't available anywhere. Trainium had a 5 day wait time to get one instance and I lost interest.
The out of box experience can be a bit rough around the edges on bleeding edge stuff, but it isn't anything near as bad as it used to be. For example, a month ago nanochat wasn't working well and now it is. The important thing is that people now care enough to make it work.
At the end of the day, AI does need viable options. Having a monopoly on all AI hardware and software might be a good thing for shareholders, but isn't a good thing for what is looking like a fundamental technology, akin to the internet.
I like your bet though. The difference between NVDA and AMD has never really existed on a hardware level for decades. AMD has always been on par, and software is software, it will catch up.
AMD will be a stock many people will miss because the opportunity has presented itself at the height of AI bubble talk, and this will leave many in the dust. Doubling and tripling of their market cap is pretty much a foregone conclusion.
George was very smart, $500k in the $90s. I saw it coming even earlier than him, but that's 'cause I was already aware the hardware was good from my own experiences.
[0] https://www.amd.com/en/products/accelerators/instinct/eval-r...
Right now AI support on AMD is officially only on specific models. But they are working hard to turn this around to have broader support. And making progress.
1. data layouts to avoid local memory bank conflicts
2. read patterns from global memory to optimize L2 cache reuse
3. warp specialisation
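The first two items in the list above can be illustrated with a small, hardware-free model. This is only a sketch under stated assumptions, not how HipKittens or tinygrad implement anything: it assumes 32 local-memory banks of one 4-byte word each, a 32x32 row-major tile, a 16x16 grid of GEMM output tiles, and 16 workgroups running at once. All function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def worst_case_conflict(rows, row_stride_words, col=0):
    """Largest number of threads hitting one bank when thread r reads
    element (r, col) of a row-major tile with the given row stride."""
    banks = Counter((r * row_stride_words + col) % NUM_BANKS for r in range(rows))
    return max(banks.values())


def panels_touched(tiles):
    """Distinct A row-panels plus B column-panels that a set of concurrently
    running output tiles (i, j) must pull through the L2 cache."""
    rows = {i for i, _ in tiles}
    cols = {j for _, j in tiles}
    return len(rows) + len(cols)


def row_major_wave(wave_size, n_tiles):
    """First wave of output tiles when workgroups are assigned row by row."""
    return [(t // n_tiles, t % n_tiles) for t in range(wave_size)]


def grouped_wave(group_h, group_w):
    """First wave of output tiles when workgroups cover a compact block."""
    return [(i, j) for i in range(group_h) for j in range(group_w)]


# Item 1: a column read of a 32-word-wide tile lands every thread in one bank;
# padding each row by one word spreads the same read over all banks.
print(worst_case_conflict(rows=32, row_stride_words=32))  # 32
print(worst_case_conflict(rows=32, row_stride_words=33))  # 1

# Item 2: 16 concurrent tiles in one row need 17 distinct panels,
# while a 4x4 block of tiles needs only 8, so more loads hit in L2.
print(panels_touched(row_major_wave(wave_size=16, n_tiles=16)))  # 17
print(panels_touched(grouped_wave(4, 4)))  # 8
```

The model ignores vector widths, swizzled layouts, and real cache replacement, so it shows only why layout and scheduling order matter, not how large the effect is on any particular GPU.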
How complex is it to add these into tinygrad?
For example, the following laptop, which I'm thinking of picking up, has both a strong AMD CPU/iGPU and an RTX 5080. Could we see the AMD side competing with the RTX?
I know a dedicated gpu will always be faster though.
>HP OMEN MAX 16-ak0003nr 16" Gaming Laptop Computer - Shadow Black Aluminum AMD Ryzen AI 9 HX 375 (2.0GHz) Processor; NVIDIA GeForce RTX 5080 16GB GDDR7; 32GB DDR5-5600 RAM; 1TB Solid State Drive
It's not quite as fast as like Sonnet 4 from an API, but it's really not that bad.
It's really great for quick questions so I don't have to google stuff, and it's probably Sonnet4 level of competency at achieving coding tasks.
No API served model has been fast enough to remove the urge to do something else while waiting for bigger tasks, so the UX is more or less the same in that regard.
Opencode + ollama + Qwen3 Coder has been a very reasonable alternative to ClaudeCode with Sonnet4.
That is amazing for something running locally.
It is possible that if you actually need AI to be doing all your coding, that you're going to feel differently about the setup. But as a small assistant it's great.