GPU Prefix Sums: a Nearly Complete Collection
Key topics
The GPU Prefix Sums collection has sparked a lively debate about the most important applications of prefix sums in today's world. While some commenters argue that extracting non-zero elements from sparse vectors/matrices is crucial, others claim that radix sort is the most significant use case, with one commenter even questioning whether radix sort is more important than matrix multiplication. The discussion highlights the versatility of prefix sums, with applications ranging from game development to GPU sorting, and reveals that the break-even point for GPU sorting versus CPU sorting is still unclear, with some users relying on GPU sorting due to the high cost of data transfer between CPU and GPU.
Snapshot generated from the HN discussion
Discussion Activity
Light discussionFirst comment
46m
Peak period
5
4-6h
Avg / period
2.2
Based on 20 loaded comments
Key moments
- 01Story posted
Aug 28, 2025 at 8:49 AM EDT
4 months ago
Step 01 - 02First comment
Aug 28, 2025 at 9:35 AM EDT
46m after posting
Step 02 - 03Peak activity
5 comments in 4-6h
Hottest window of the conversation
Step 03 - 04Latest activity
Aug 29, 2025 at 6:15 AM EDT
4 months ago
Step 04
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
I would love to be enlightened about some real-world applications of radix sort I may have missed though, since it's a cool algorithm. Hence my question above.
LLMs are made from dense matrices, aren't they?
[1] https://developer.nvidia.com/blog/mastering-llm-techniques-i...
[2] https://arxiv.org/html/2405.15525v1
But then during a break the other day I read up on Radix sort and then right thereafter implemented a prefix sum for spatial partitioning that also incorporates a bit table, CAS operations for doing multithreaded modifications etc. After learning the core Radix concept I sort of came up with the idea of using it that way myself which was quite pleasing.
Props to the author, I'll definitely be spending some time scanning the collection to find some alternate options.
They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?
https://github.com/mooman219/fontdue/blob/master/src/platfor...