Reverse-Engineering the RK3588 NPU: Hacking Limits to Run Vision Transformers
Posted 17 days ago · Active 16 days ago
Source: amohan.dev · Tech Discussion · story
Key topics: NPU Optimization, DETRs, AI-Powered Support
Discussion activity: moderate engagement
- First comment: 2h after posting
- Peak period: 6 comments in 2-4h
- Avg comments per period: 4.5
Key moments
1. Story posted: Dec 16, 2025 at 4:18 PM EST (17 days ago)
2. First comment: Dec 16, 2025 at 6:35 PM EST (2h after posting)
3. Peak activity: 6 comments in 2-4h, the hottest window of the conversation
4. Latest activity: Dec 17, 2025 at 1:10 PM EST (16 days ago)
ID: 46294626 · Type: story · Last synced: 12/17/2025, 1:00:50 AM
For what it's worth, it seems like there's a bunch of open-source NPU work in progress too. There's a Gallium3D layer, "Teflon", shared by most of these drivers, which TensorFlow Lite can use as a delegate. Then there are hardware drivers for Rockchip (via the Rocket driver) and Vivante (via their Etnaviv drivers). It'd be extra interesting to see how (or whether) they've dealt with the same system constraints (small scratchpad size) here.
https://www.phoronix.com/news/Gallium3D-Teflon-Merged
https://www.phoronix.com/news/Rockchip-NPU-Linux-Mesa
https://www.phoronix.com/news/Two-NPU-Accel-Drivers-2026
> *The main reason I stuck with the closed-source `rknn` stack for this specific project was operator support for Transformers. Teflon is getting great at standard CNN ops (convolutions, fused ReLUs, etc.), but the SigLIP vision encoder relies on large transposes and unbounded GELU activations that currently fall off the 'happy path' in the open stack.*
> *To your point on the system constraints (small scratchpad): I suspect the current open-source drivers would hit the exact same 32KB SRAM wall I found. The hardware simply refuses to tile large matrices automatically. My 'Nano-Tiling' fix was a software-level patch; porting that logic into the Mesa driver itself would probably be the 'Holy Grail' fix here.*