GPU Prefix Sums: a Nearly Complete Collection

Posted 4 months ago by coffeeaddict1, 83 points, 20 comments


Key topics

GPU Allocation

Prefix Sums

Parallel Algorithms

https://dl.acm.org/doi/10.1145/3694906.3743326

The GPU Prefix Sums collection prompted a debate about which application of prefix sums matters most today. One commenter argues that extracting non-zero elements from sparse vectors and matrices is the crucial one; another counters that radix sort is, which leads a third to ask what application would make radix sort more important than matrix multiplication. The thread also shows the breadth of uses, from spatial partitioning in game development to GPU sorting and font rasterization. The break-even point between GPU and CPU sorting goes unanswered: one user sorts on the GPU simply because the data already resides there, and copying it to the CPU and back would be too costly.

Snapshot generated from the HN discussion

Discussion activity: light discussion
First comment: 46m
Peak period: 4-6h
Avg / period: 2.2

Key moments

01 Story posted: Aug 28, 2025 at 8:49 AM EDT
02 First comment: Aug 28, 2025 at 9:35 AM EDT (46m after posting)
03 Peak activity: 5 comments in 4-6h
04 Latest activity: Aug 29, 2025 at 6:15 AM EDT


Discussion (20 comments)


genpfault

4 months ago

1 reply

https://en.wikipedia.org/wiki/Prefix_sum#Applications

almostgotcaught

4 months ago

1 reply

this is missing the most important one (in today's world): extracting non-zero elements from a sparse vector/matrix

https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
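For readers unfamiliar with the operation mentioned above, here is a minimal sequential sketch of stream compaction, which is how a prefix sum is typically used to extract non-zero elements. It is written in Python for readability rather than as GPU code, and the function names `exclusive_scan` and `compact_nonzero` are hypothetical, not taken from the linked collection or from GPU Gems.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def compact_nonzero(dense):
    """Return (indices, values) of the non-zero entries of `dense`.

    Hypothetical helper for illustration only.
    """
    if not dense:
        return [], []
    flags = [1 if x != 0 else 0 for x in dense]  # 1 marks an element to keep
    offsets = exclusive_scan(flags)              # output slot of each kept element
    count = offsets[-1] + flags[-1]              # total number of kept elements
    indices = [0] * count
    values = [0] * count
    # On a GPU each iteration would be one thread. The scan guarantees that
    # no two kept elements are assigned the same output slot.
    for i, x in enumerate(dense):
        if flags[i]:
            indices[offsets[i]] = i
            values[offsets[i]] = x
    return indices, values
```

Calling `compact_nonzero([0, 3, 0, 0, 7, 1, 0])` yields the indices `[1, 4, 5]` and the values `[3, 7, 1]`. The point of the scan is that every element learns its output position independently, so the final scatter needs no synchronization.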

merope14

4 months ago

2 replies

Not even close. The most important application (in today's world) is radix sort.

WJW

4 months ago

3 replies

What specific application do you have in mind that radix sort is more important than matrix multiplication?

otherjason

4 months ago

1 reply

I think they were trying to say “radix sort is a more important application of prefix sum than extraction of values from a sparse matrix/vector is.”
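To make the link between radix sort and prefix sums concrete, the following is a small sequential sketch of a least-significant-digit radix sort in Python. It is an illustration under simplifying assumptions (non-negative integer keys, one digit per pass), not the implementation from the linked collection; the name `radix_sort` and its parameters are hypothetical.

```python
from itertools import accumulate


def radix_sort(keys, bits_per_pass=4, key_bits=32):
    """Stable LSD radix sort for non-negative integers below 2**key_bits.

    Hypothetical helper for illustration only.
    """
    radix = 1 << bits_per_pass
    mask = radix - 1
    keys = list(keys)
    for shift in range(0, key_bits, bits_per_pass):
        # 1. Histogram: count how many keys fall into each digit bucket.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix sum turns counts into each bucket's start offset.
        starts = [0] + list(accumulate(counts))[:-1]
        # 3. Scatter: place each key into the next free slot of its bucket.
        #    Visiting keys in order keeps the pass stable.
        out = [0] * len(keys)
        for k in keys:
            digit = (k >> shift) & mask
            out[starts[digit]] = k
            starts[digit] += 1
        keys = out
    return keys
```

Step 2 is where the prefix sum appears: without it, a bucket would not know where its region of the output begins. GPU implementations parallelize all three steps, but the structure of each pass is the same.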

WJW

4 months ago

1 reply

I understand what GP meant, but extraction of values from a sparse matrix is an essential operation of multiplying two sparse matrices. Sparse matmult in turn is an absolutely fundamental operation in everything from weather forecasting to logistics planning to electric grid control to training LLMs. Radix sort on the other hand is very nice but (as far as I know) not nearly as widely used. Matrix multiplication is just super fundamental to the modern world.

I would love to be enlightened about some real-world applications of radix sort I may have missed though, since it's a cool algorithm. Hence my question above.

littlestymaar

4 months ago

2 replies

> to training LLMs

LLMs are made from dense matrices, aren't they?

WJW

4 months ago

Not always, or rather not exclusively. For example, some types of distillation benefit from sparse-ifying the dense-ish matrices the original was made of [1]. There's also a lot of benefit to be had from sparsity in finetuning [2]. LLMs were merely one of the examples though, don't focus too much on them. The point was that sparse matmul makes up the bulk of scientific computations and a huge amount of industrial computations too. It's probably second only to the FFT in importance, so it would be wild if radix sort managed to eclipse it somehow.

[1] https://developer.nvidia.com/blog/mastering-llm-techniques-i...

[2] https://arxiv.org/html/2405.15525v1

almostgotcaught

4 months ago

Almost all performant kernels employ structured sparsity

woadwarrior01

4 months ago

1 reply

Top K sampling comes to mind, although it's nowhere near as important as matmult.

almostgotcaught

4 months ago

ranking models benefit from gpu impls of sort but yup they're not nearly as common/important as spmm/spmv

m-schuetz

4 months ago

Is that relevant for 4x4 multiplications? Because at least for me, radix sort is way more important than multiplying matrices beyond 4x4. E.g. for Gaussian Splatting.

animal531

4 months ago

I'm working on a game that has a lot of units and I used to use the old Sebastian Lague + NVIDIA approach where you use 2D binning -> cells/keys -> sort -> being able to search for neighbours efficiently (along with some modifications such as using Morton encoding and so on that I added over time).

But then during a break the other day I read up on Radix sort and then right thereafter implemented a prefix sum for spatial partitioning that also incorporates a bit table, CAS operations for doing multithreaded modifications etc. After learning the core Radix concept I sort of came up with the idea of using it that way myself which was quite pleasing.

Props to the author, I'll definitely be spending some time scanning the collection to find some alternate options.
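As a rough illustration of the prefix-sum approach to spatial partitioning described in this comment, here is a sequential Python sketch that buckets 2D points into grid cells by counting, scanning, and scattering. It leaves out the bit table, CAS operations, and Morton encoding the commenter mentions, and all names (`build_grid`, `ids_in_cell`, the grid parameters) are hypothetical.

```python
from itertools import accumulate


def build_grid(points, cell_size, grid_w, grid_h):
    """Bucket 2D points by grid cell. Returns (cell_start, sorted_ids).

    Hypothetical helper for illustration only. Points outside the grid
    are clamped to the nearest border cell.
    """
    def cell_of(p):
        cx = min(max(int(p[0] // cell_size), 0), grid_w - 1)
        cy = min(max(int(p[1] // cell_size), 0), grid_h - 1)
        return cy * grid_w + cx

    cells = [cell_of(p) for p in points]
    # Count how many points land in each cell.
    counts = [0] * (grid_w * grid_h)
    for c in cells:
        counts[c] += 1
    # Prefix sum with a leading 0: the ids of cell c occupy
    # sorted_ids[cell_start[c] : cell_start[c + 1]].
    cell_start = [0] + list(accumulate(counts))
    cursor = cell_start[:-1]  # per-cell write position (slicing copies the list)
    sorted_ids = [0] * len(points)
    for i, c in enumerate(cells):
        sorted_ids[cursor[c]] = i
        cursor[c] += 1
    return cell_start, sorted_ids


def ids_in_cell(cell_start, sorted_ids, cell):
    """Indices of the points stored in the given cell."""
    return sorted_ids[cell_start[cell]:cell_start[cell + 1]]
```

A neighbour query then reads the slices of the cells around a unit instead of sorting keys first. In a multithreaded version the `cursor[c] += 1` step is where an atomic increment would be needed.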

coffeeaddict1 (Author)

4 months ago

1 reply

Related paper by the authors: https://dl.acm.org/doi/10.1145/3694906.3743326

dang

4 months ago

We'll put that link in the top text too. Thanks!

m-schuetz

4 months ago

1 reply

That and https://github.com/b0nes164/GPUSorting have been a tremendous help for me, since CUB does not work nicely with the CUDA Driver API. The author is doing amazing work.

mfabbri77

4 months ago

1 reply

At what order of magnitude in the number of elements to be sorted (I'm thinking of the overhead of the GPU setup cost) is the break-even point reached, compared to a pure CPU sort?

m-schuetz

4 months ago

No idea unfortunately. For me it's mandatory to sort on the GPU because the data already resides on GPU, and copying it to CPU (and the results back to GPU) would be too costly.

luizfelberti

4 months ago

This looks amazing, I've been shopping for an implementation of this I could play around with for a while now

They mention promising results on Apple Silicon GPUs and even cite the contributions from Vello, but I don't see a Metal implementation in there and the benchmark only shows results from an RTX 2080. Is it safe to assume that they're referring to the WGPU version when talking about M-series chips?

mooman219

4 months ago

Oh! I have a prefix sum lying around in SIMD in Rust, I use it for bitmap rasterization for fonts. Looking at the comments I guess this isn't a popular use case, but useful nonetheless. Doing it on the GPU looks really fun

https://github.com/mooman219/fontdue/blob/master/src/platfor...
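One common way a prefix sum shows up in rasterization is an accumulation buffer: each edge crossing deposits a signed delta, and a running sum over the row turns the deltas into per-pixel coverage. The Python sketch below shows that idea for a single scanline. It is a simplified assumption about the general technique, not a transcription of the linked Rust code, and `scanline_coverage` is a hypothetical name.

```python
from itertools import accumulate


def scanline_coverage(width, spans):
    """Per-pixel coverage in [0, 1] for filled spans [x0, x1) on one row.

    Hypothetical helper for illustration only.
    Assumes 0 <= x0 <= x1 <= width for every span.
    """
    acc = [0.0] * (width + 2)  # padding so an edge at x == width stays in bounds
    for x0, x1 in spans:
        # The entering edge adds coverage, the leaving edge removes it.
        for x, sign in ((x0, 1.0), (x1, -1.0)):
            i = int(x)
            frac = x - i
            acc[i] += sign * (1.0 - frac)  # part of pixel i to the right of the edge
            acc[i + 1] += sign * frac      # remainder carries into the next pixel
    running = accumulate(acc)              # the prefix sum over the row
    return [min(max(c, 0.0), 1.0) for c in list(running)[:width]]
```

For a row of 6 pixels and one span from 1.5 to 4.25, the deltas are non-zero only at the four pixels next to the two edges, and the running sum fills in everything between them. That is why the expensive part of this approach reduces to a prefix sum, which is the step that benefits from SIMD or a GPU.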


ID: 45051542 | Type: story | Last synced: 11/20/2025, 3:29:00 PM
