Edge264 – Minimalist, High-Performance Software Decoder for H.264/AVC Video
Edge264 is a high-performance software decoder for H.264/AVC video, and the community discusses its features, potential use cases, and comparisons to existing solutions.
Snapshot generated from the HN discussion
Discussion activity (based on 37 loaded comments)
- Story posted: Oct 1, 2025 at 5:00 PM EDT
- First comment: Oct 1, 2025 at 5:31 PM EDT (31m after posting)
- Peak activity: 16 comments in the 0-6h window (average 5.3 per period)
- Latest activity: Oct 3, 2025 at 9:48 PM EDT
CachyOS is a whole distro compiled with these flags, if possible, which is appealing.
[0] https://github.com/tvlabs/edge264#compiling-and-testing
[0] https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning....
So it does have some limitations like not being inlined, same as any other external function.
You have bigger binaries, but the logistics are simplified compared to shipping multiple binaries and you should get the same speed as multiple binaries with fully inlined code.
Since they don't seem to be doing that, my question is: what's the caveat I'm missing? (Or are the bigger binaries enough of a caveat by themselves?)
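For readers unfamiliar with the technique being debated, here is a minimal sketch of per-function multiversioning with GCC's `target_clones` attribute. The kernel `sad_bytes` and the `MULTIVERSION` macro are hypothetical names for illustration, not taken from edge264, and the guard assumes GCC on x86-64 with glibc (the clones are selected through an ifunc resolver at load time). Elsewhere the function compiles as a single plain version.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: only GCC on x86-64 with glibc is treated as supporting
   target_clones here; every other toolchain gets one plain build. */
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__clang__) && defined(__GLIBC__)
#define MULTIVERSION __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define MULTIVERSION
#endif

/* Hypothetical kernel: sum of absolute differences over n bytes.
   The compiler emits one clone per listed target and may vectorize
   each clone differently; the caller always goes through the dispatcher,
   so this function is not inlined into its callers. */
MULTIVERSION
uint32_t sad_bytes(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint32_t)(d < 0 ? -d : d);
    }
    return sum;
}
```

The indirection through the dispatcher is the "not being inlined" limitation mentioned above, which is why this tends to pay off only on reasonably large functions.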
It can be useful to duplicate the entire code for 8-bit vs 10-bit pixels because that does affect nearly everything.
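One common way to get that kind of duplication in C is a macro that stamps out the same routine once per pixel type. The sketch below is only an illustration of the idea; `DEFINE_ADD_RESIDUAL`, `add_residual_8` and `add_residual_10` are made-up names, and nothing here is claimed to match how edge264 handles bit depth.

```c
#include <stddef.h>
#include <stdint.h>

/* Generates one function per pixel type. BITS is the sample bit depth,
   so the clipping range is [0, 2^BITS - 1]. */
#define DEFINE_ADD_RESIDUAL(NAME, PIXEL, BITS)                         \
    void NAME(PIXEL *dst, const int16_t *residual, size_t n)           \
    {                                                                  \
        const int max = (1 << (BITS)) - 1;                             \
        for (size_t i = 0; i < n; i++) {                               \
            int v = (int)dst[i] + residual[i];                         \
            dst[i] = (PIXEL)(v < 0 ? 0 : v > max ? max : v);           \
        }                                                              \
    }

/* 8-bit samples stored in uint8_t, 10-bit samples stored in uint16_t. */
DEFINE_ADD_RESIDUAL(add_residual_8, uint8_t, 8)
DEFINE_ADD_RESIDUAL(add_residual_10, uint16_t, 10)
```

Each instantiation gets its own element width and clipping constant at compile time, so no per-pixel branch on bit depth is needed.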
A relocatable call within the same DSO can be a PC-relative relocation, which is not a relocation at all when you load the DSO and ends up as a plain PC-relative branch or call.
Ideally you should just multiversion the topmost exported symbol; everything below that should either be directly inlined or, since the architecture variant is known statically by the compiler, have variants generated and called directly. I know at least GCC can do this variant generation for things like constant propagation across static function boundaries, so I /assume/ it can do the same for other optimization variants like this, but admittedly I haven't checked.
error: inlining failed in call to 'always_inline' 'float _mm512_reduce_add_ps(__m512)': target specific option mismatch
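That error typically appears when an intrinsic is called from a function that was not compiled for the matching instruction set. A hedged sketch of one usual workaround follows: mark the caller with `__attribute__((target("avx2")))` and pick it at run time with `__builtin_cpu_supports`. It uses AVX2 rather than AVX-512 to keep the example widely runnable, and the function names (`add_u8`, `add_u8_avx2`, `add_u8_scalar`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: saturating add of two byte arrays. */
static void add_u8_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* The target attribute lets the always_inline AVX2 intrinsics be inlined
   here even if the file is built without -mavx2. Removing it reproduces
   the "target specific option mismatch" error. */
__attribute__((target("avx2")))
static void add_u8_avx2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_adds_epu8(va, vb));
    }
    add_u8_scalar(dst + i, a + i, b + i, n - i); /* leftover tail */
}
#endif

/* Dispatcher: the AVX2 path is only taken when the CPU reports support. */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) {
        add_u8_avx2(dst, a, b, n);
        return;
    }
#endif
    add_u8_scalar(dst, a, b, n);
}
```

The attribute has to sit on every function that uses the intrinsics directly, which is part of why inlining across differently targeted functions gets awkward.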
(See also Cisco's openh264, which supports decoding)
But as a software decoder which is specifically made to not use hardware APIs for decoding, I am not sure why they skipped ARM64 on non-Linux platforms.
https://github.com/eiln/avd
This really heavily depends on the device, though. There are all sorts of "hardware" video decoders ranging from fairly generic vector coprocessors running firmware to "pure" HDL/VLSI level implementations. Usually on more modern or advanced hardware you'll see more and more of the pipeline become general purpose, since a lot of the later stages can be shared across codecs, saving area vs. a pure hardware implementation.
https://github.com/tvlabs/edge264/blob/5a3c19fc0ccacb03f9841...
Anyway, you can just use libavcodec, which is faster (because of frame based multithreading) and doesn't operate on the mistaken belief that it's a good idea to use SIMD intrinsics.
But according to the repo, this project also uses both slice and frame multi-threading (as does ffmpeg, with all the tradeoffs).
And SIMD usage is basically table-stakes, and libavcodec uses SIMD all over the place?
Oh, I missed that since it doesn't have a separate file. In that case they're likely very similar performance-wise. H.264 wasn't well-designed for CPUs because the arithmetic coding could've been done better, but it's not that challenging these days.
> And SIMD usage is basically table-stakes, and libavcodec uses SIMD all over the place?
SIMD _intrinsics_. libavcodec doesn't write DSP functions in assembly for historical reasons; it does so because it's just better! Assembly is faster, just as maintainable, at least as easy to read and write, and not any less portable (since it already isn't portable…). Intrinsics are basically a poor way to generate the code you want, they interfere with other optimizations like autovectorization, and you might as well write your own code generator instead.
The downsides of assembly are that it's harder to debug and analyzers like ASan don't work.
Also, hi FFmpeg twitter.
Access to left/top macroblock values is done with direct offsets in memory instead of copying their values to a buffer beforehand.
I made use of this technique too, so I think it's not particularly novel nor non-obvious. The performance-sensitivity of video decoding necessarily means avoiding any extraneous data movement whenever possible.
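To make the technique concrete, here is a rough sketch of reading neighbour data through fixed offsets into a frame-wide array rather than copying it into a scratch buffer first. The `MbInfo` layout, the `stride` parameter and the averaging rule are assumptions for illustration only, not edge264's actual data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-macroblock record, stored row by row for the whole
   frame, with `stride` records per row. */
typedef struct {
    uint8_t nonzero_count;
    uint8_t intra_mode;
} MbInfo;

/* Reads the left and top neighbours in place: left is one record back,
   top is one full row back. The caller must guarantee both exist, for
   example by padding the array with one extra column and one extra row. */
static inline int predict_nonzero(const MbInfo *cur, ptrdiff_t stride)
{
    const MbInfo *left = cur - 1;
    const MbInfo *top = cur - stride;
    /* Rounded average of the two neighbours, as an example context value. */
    return (left->nonzero_count + top->nonzero_count + 1) >> 1;
}
```

Because the offsets are constants for a given frame width, the neighbour loads compile down to simple base-plus-displacement accesses with no intermediate copy.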
Also worth noting: H.264 patents have already expired in most of the world: https://meta.wikimedia.org/wiki/Have_the_patents_for_H.264_M...