High-Performance 2D Graphics Rendering on the CPU Using Sparse Strips [pdf]
Key topics
A master's thesis presents a high-performance 2D graphics rendering method on the CPU using sparse strips, sparking discussion on its potential, comparisons to other rendering methods, and future directions for 2D graphics rendering.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
- First comment: 1h after posting
- Peak period: 12 comments in the 2-4h window
- Average per period: 3.9 comments
- Based on 35 loaded comments
Key moments
- Story posted: Nov 10, 2025 at 5:05 PM EST (about 2 months ago)
- First comment: Nov 10, 2025 at 6:14 PM EST (1h after posting)
- Peak activity: 12 comments in the 2-4h window (hottest window of the conversation)
- Latest activity: Nov 11, 2025 at 1:26 PM EST (about 2 months ago)
1) For the given processor set, where each processor holds an object, 'spawn' a processor in a new set, one processor for each span.
(a) The spawn operation consists of the source processors setting the number of nodes in the new domain, performing an add-scan, then sending the total allocation back to the front end. The front end allocates a new power-of-2 shape that can hold those, and the object-set then uses general communication to send scan information to the first element of each strip in the strip-set (the address is left over from the scan).
(b) In the strip-set, use a mask-copy-scan to get all the parameters to all the elements of the scan set.
(c) Each element of the strip-set determines the pixel location of its leftmost element.
(d) Use a general send to seed the strip with the parameters of the strip.
(e) Scan those using a mask-copy-scan in the pixel-set.
(f) Apply the shader or the interpolation in the pixel-set.
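A sequential sketch of the add-scan spawn and copy-scan distribution described above, with the data-parallel primitives simulated by plain loops (the span counts, parameter values, and variable names are purely illustrative):

```rust
/// Sequential sketch of the data-parallel spawn: an add-scan (exclusive
/// prefix sum) over per-object span counts gives each object the index of
/// its first strip in the new set; parameters are seeded at that index and
/// then propagated along the segment, which is what the mask-copy-scan
/// does in parallel.
fn main() {
    // Each object requests some number of strip processors.
    let span_counts = [3usize, 1, 4];
    let params = ["A", "B", "C"]; // per-object parameters to distribute

    // Add-scan: exclusive prefix sum -> starting offset of each object.
    let mut offsets = vec![0usize; span_counts.len()];
    let mut total = 0;
    for (i, &n) in span_counts.iter().enumerate() {
        offsets[i] = total;
        total += n;
    }
    // The front end would allocate the next power-of-two shape >= total.
    let alloc = total.next_power_of_two();

    // General send: seed the first element of each strip with its params...
    let mut strip_set = vec![None; alloc];
    for (i, &off) in offsets.iter().enumerate() {
        strip_set[off] = Some(params[i]);
    }
    // ...then the copy-scan fills the remaining elements of each segment.
    for i in 1..total {
        if strip_set[i].is_none() {
            strip_set[i] = strip_set[i - 1];
        }
    }
    println!("{:?}", &strip_set[..total]);
}
```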
Note that steps (d) and (e) also depend on encoding the depth information in the high bits and using a max combiner to perform z-buffering.
Edit: there must have been an additional span/scan in pixel space that is then sent to image space with z-buffering; otherwise strip seeds could collide and be sorted by z, which could miss pixels from the losing strip.
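A minimal sketch of the depth-in-the-high-bits trick from the note above, where packing depth above the payload lets a plain max act as the z-buffer combiner (the bit layout and the values are assumptions for illustration):

```rust
/// Pack a depth value into the high bits and a payload (e.g. a strip
/// parameter index) into the low bits, so that an ordinary `max` acts as
/// the z-buffer combiner: the fragment with the greatest depth key wins.
fn pack(depth: u32, payload: u32) -> u64 {
    ((depth as u64) << 32) | payload as u64
}

fn main() {
    let width = 8;
    // One packed word per pixel; 0 means "empty / farthest".
    let mut framebuffer = vec![0u64; width];

    // Two overlapping strips seed the same pixels with different depths.
    let strips = [(0usize, 6usize, 10u32, 0xAAAA_u32),
                  (3, 8, 20, 0xBBBB)];

    for &(x0, x1, depth, payload) in &strips {
        for px in &mut framebuffer[x0..x1] {
            // The max combiner resolves visibility per pixel.
            *px = (*px).max(pack(depth, payload));
        }
    }

    // Unpack the surviving payloads: pixels 3..8 belong to the nearer strip.
    let visible: Vec<u32> = framebuffer.iter().map(|w| *w as u32).collect();
    println!("{:X?}", visible);
}
```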
There are also things like interpreting (conflating) coverage as alpha in analytical antialiasing methods, which lead to visible hairline cracks.
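A concrete illustration of why conflating coverage with alpha produces such cracks: two opaque shapes that abut inside one pixel each contribute 50% coverage, yet compositing them as alpha leaves the pixel only 75% opaque (the numbers are hypothetical):

```rust
fn main() {
    // Two abutting, fully opaque shapes each cover exactly half of the
    // seam pixel. Geometrically the pixel is 100% covered.
    let coverage_a = 0.5_f32;
    let coverage_b = 0.5_f32;

    // Treating coverage as alpha and compositing with "over":
    //   result = a + b * (1 - a)
    let conflated = coverage_a + coverage_b * (1.0 - coverage_a);

    // 0.75, not 1.0: the background bleeds through along the shared edge,
    // which reads as a hairline crack over a contrasting background.
    println!("combined alpha on the seam pixel: {conflated}");
}
```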
For realtime rendering, a common thing to do is to benchmark against a known-good offline renderer (e.g. Arnold, Octane).
[0] https://blog.chromium.org/2025/07/introducing-skia-graphite-...
We have definitely thought about using the CPU renderer while the shaders are being compiled (shader compilation is a problem), but we haven't implemented it.
>geometry on CPU but the pixel painting on GPU
Wow. Is this akin to running just the vertex shader on the CPU?
One commercial product is:
https://eshop.macsales.com/item/NewerTech/ADP4KHEAD/
But I seem to recall there are dirt-cheap hacks to do the same. I may be conflating it with the "resistor jammed into the DVI port" trick, which worked back in the VGA and DVI days. Memory unlocked - did this to an old Mac Mini in a closet for some reason.
Or with old VGA, the display RAM was mapped to known system RAM addresses and the CPU would write directly to it. (You could write to an off-screen buffer and flip for double/triple buffering.)
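A minimal sketch of that off-screen-buffer-and-flip idea in ordinary heap memory, standing in for the memory-mapped VGA framebuffer (the mode-13h-style resolution and the swap-based flip are assumptions for illustration):

```rust
struct DoubleBuffer {
    front: Vec<u8>, // what the display scans out
    back: Vec<u8>,  // what the CPU is drawing into
}

impl DoubleBuffer {
    fn new(width: usize, height: usize) -> Self {
        let size = width * height; // one byte per pixel, as in VGA mode 13h
        Self { front: vec![0; size], back: vec![0; size] }
    }

    /// Draw into the off-screen buffer only.
    fn put_pixel(&mut self, x: usize, y: usize, width: usize, color: u8) {
        self.back[y * width + x] = color;
    }

    /// "Flip": present the back buffer. On real VGA hardware this was a
    /// page-flip register write (or a copy into the mapped framebuffer);
    /// here it is just a swap.
    fn flip(&mut self) {
        std::mem::swap(&mut self.front, &mut self.back);
    }
}

fn main() {
    let (w, h) = (320, 200);
    let mut fb = DoubleBuffer::new(w, h);
    fb.put_pixel(10, 10, w, 15);
    fb.flip();
    println!("front buffer pixel: {}", fb.front[10 * w + 10]);
}
```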
On a PC, the CPU typically has exclusive access to system RAM, while the GPU has its own dedicated VRAM. The graphics driver runs code on both the CPU and the GPU (the GPU has its own embedded processor), so data is constantly being copied back and forth between the two memory pools.
Mobile platforms like the iPhone or macOS laptops are different: they use unified memory, meaning the CPU and GPU share the same physical RAM. That makes it possible to allocate a Metal surface that both can access, so the CPU can modify it and the GPU can display it directly.
However, you won’t get good frame rates on a MacBook if you try to draw a full-screen, pixel-perfect surface entirely on the CPU; it just can’t push pixels that fast. But you can write a software renderer where the CPU updates pixels and the GPU displays them, without copying the surface around.
The actual process of fine rasterization happens in quads, so there's a simple vertex shader that runs on GPU, sampling from the geometry buffers that are produced on CPU and uploaded.
On a unified memory architecture (e.g. Apple Silicon), that's not an expensive operation. No copy required.
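A hedged sketch of the CPU side of that split: each CPU-produced strip is expanded into a screen-space quad (two triangles) that a trivial vertex shader can pass through, with coverage fetched per fragment from the uploaded alpha buffer. The `Strip` and `Vertex` layouts here are illustrative, not the thesis's actual buffers:

```rust
/// Illustrative strip record: a horizontal run of pixels plus an offset
/// into a shared alpha buffer (not the thesis's exact 64-byte layout).
struct Strip {
    x: u16,
    y: u16,
    width: u16,
    alpha_offset: u32,
}

/// One vertex of the quad the GPU will rasterize.
#[derive(Clone, Copy)]
struct Vertex {
    pos: [f32; 2],
    alpha_offset: u32,
}

/// Expand each strip into two triangles (six vertices). The GPU's vertex
/// shader only has to transform these to clip space; coverage is looked up
/// per fragment from the uploaded alpha buffer.
fn strips_to_vertices(strips: &[Strip], strip_height: f32) -> Vec<Vertex> {
    let mut out = Vec::with_capacity(strips.len() * 6);
    for s in strips {
        let (x0, y0) = (s.x as f32, s.y as f32);
        let (x1, y1) = (x0 + s.width as f32, y0 + strip_height);
        let v = |x, y| Vertex { pos: [x, y], alpha_offset: s.alpha_offset };
        out.extend_from_slice(&[
            v(x0, y0), v(x1, y0), v(x0, y1), // first triangle
            v(x1, y0), v(x1, y1), v(x0, y1), // second triangle
        ]);
    }
    out
}

fn main() {
    let strips = [Strip { x: 8, y: 16, width: 32, alpha_offset: 0 }];
    let verts = strips_to_vertices(&strips, 4.0);
    println!("generated {} vertices", verts.len());
}
```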
and then says
> Since a single strip has a memory footprint of 64 bytes and a single alpha value is stored as u8, the necessary storage amounts to around 259 ∗ 64 + 7296 ≈ 24KB
Am I missing something, or is it actually 259*8 + 7296 ≈ 9KB?
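Spelling out the arithmetic behind the two readings (using the thesis's counts of 259 strips and 7296 alpha bytes): 259 × 64 B + 7296 B = 23,872 B ≈ 24 KB, whereas 259 × 8 B + 7296 B = 9,368 B ≈ 9 KB, so the discrepancy comes down entirely to whether a single strip occupies 64 bytes or 8 bytes.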
Whilst it's still very possible this was a simple mistake, an alternate explanation could be that each strip is allocated to a unique cache line. On modern x86_64 systems, a cache line is 64 bytes. If the renderer is attempting to mitigate false sharing, then it may be allocating each strip in its own cache line, instead of contiguously in memory.
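A small sketch of that mitigation, assuming the false-sharing hypothesis above is right (the `PaddedStrip` fields are placeholders): aligning each strip to a 64-byte cache line so that threads writing neighbouring strips never touch the same line.

```rust
/// Hypothetical strip record padded out to one cache line. With
/// `align(64)`, Rust rounds the struct's size up to 64 bytes, so adjacent
/// strips written by different threads never land on the same cache line
/// and cannot falsely share it.
#[repr(align(64))]
struct PaddedStrip {
    x: u16,
    y: u16,
    width: u16,
    alpha_offset: u32,
}

fn main() {
    assert_eq!(std::mem::size_of::<PaddedStrip>(), 64);
    assert_eq!(std::mem::align_of::<PaddedStrip>(), 64);
    println!("each strip occupies exactly one 64-byte cache line");
}
```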
Alternatively, you can also check the results from the official Blend2D benchmarks: https://blend2d.com/performance.html
Or my version where I added some more renderers to the existing ones: https://laurenzv.github.io/vello_chart/
[0] https://www.youtube.com/watch?v=rmyA9AE3hzM