AV2 Video Codec Delivers 30% Lower Bitrate Than AV1, Final Spec Due in Late 2025
Posted 3 months ago · Active 3 months ago
Source: videocardz.com · Tech story · High profile
Key topics
Video Codecs
AV2
Streaming
The AV2 video codec promises a 30% lower bitrate than AV1, with the final spec due in late 2025, sparking discussions on its implications for streaming services, hardware compatibility, and the future of video compression.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 56m after posting
Peak period: 84 comments in 0-6h
Avg / period: 16
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
1. Story posted: Oct 11, 2025 at 4:19 AM EDT (3 months ago)
2. First comment: Oct 11, 2025 at 5:15 AM EDT (56m after posting)
3. Peak activity: 84 comments in the 0-6h window, the hottest period of the conversation
4. Latest activity: Oct 14, 2025 at 5:36 PM EDT (3 months ago)
ID: 45547537 · Type: story · Last synced: 11/20/2025, 8:32:40 PM
Is this just people being clever or is it also more processing power being thrown at the problem when decoding / encoding?
It's true you could still accidentally violate a patent but that minefield is clearing out as those patents simply have to become more esoteric in nature.
But that's not my main point. My main point is that we are going down a fitting path with codecs which makes it hard to come up with general patents that someone might stumble over. That makes patents developed by the MPEG group far less likely to apply to AOM. A lot of those more generally applicable patents, like the DCT for example, have expired.
1) it harms interoperability
2) I thought math wasn’t patentable?
Better codecs are an overall win for everyone involved.
They also get increased power usage, shorter battery life, higher energy bills, and potentially earlier device failures.
> Better codecs are an overall win for everyone involved.
Right.
But, I mean, your expectation is not that unreasonable, computers were quite good by 2013. It is just an eye-opening framing.
I like how you padded this list by repeating the same thing thrice. Like, increased power usage is obviously going to lead to higher energy bills.
And it’s especially weird because it’s not true? The current SOTA codec AV1 is at a sweet spot for both compression and energy demand (https://arxiv.org/html/2402.09001v1). Consumers are not worse off!
Mobile/power-constrained devices don't use software decoding; that's just a path to a miserable experience. Hardware decoding is basically required.
Meanwhile my desktop can SW decode 4k youtube with 3% reported cpu usage.
I don’t remember ever watching a movie and wishing for a better codec, in the last 10 years
I do wish ATSC1 would adopt a newer codec (and maybe they will); most of the broadcasters cram too many subchannels into their 20mbps, and a better codec would help for a while. ATSC3 has a better video codec and more efficient physical encoding, but it also has DRM and a new proprietary audio codec, so it's not helpful for me.
And there’s no transfer of effort to the user. The compute complexity of video codecs is asymmetric: the decode is several orders of magnitude cheaper to compute than the encode. And in every case, the principal barrier to codec adoption has been hardware acceleration. Pretty much every device on earth has a hardware-accelerated h264 decoder.
For example, changes from one frame to the next are encoded in rectangular areas called "superblocks" (similar to a https://en.wikipedia.org/wiki/Macroblock). You can "move" the blocks (warp them), define their change in terms of other parts of the same frame (intra-frame prediction) or by referencing previous frames (inter-frame prediction), and so on... but you have to do it within a block, as that's the basic element of the encoding.
The more tightly you can define blocks around the areas that are actually changing from frame to frame, the better. Also, it takes data to describe where these blocks are, so there are special limitations on how blocks are defined, to minimise how many bits are needed to describe them.
AV2 now lets you define blocks differently, which makes it easier to fit them around the areas of the frame that are changing. It has also doubled the size of the largest block, so if you have some really big movement on screen, it takes fewer blocks to encode that.
That's just one change; the headline improvement comes from all the different changes combined, but this is an important one.
There is new cleverness in the encoders, but they need to be given the tools to express that cleverness -- new agreement about what types of transforms, predictions, etc. are allowed and can be encoded in the bitstream.
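As a concrete (and heavily simplified) illustration of the block/motion-vector machinery described above, here is a toy inter-prediction round trip in Python. This is not AV1/AV2 syntax: real codecs also transform, quantize, and entropy-code the residual and support many more prediction modes, but the split between "motion vector" and "residual" is the same idea.

```python
import numpy as np

BLOCK = 16  # toy block size; AV1 superblocks go up to 128x128 and AV2 doubles the maximum

def encode_block(cur, prev, by, bx, search=8):
    """Find the best match for the block at (by, bx) in the previous frame.
    Returns the motion vector and the residual the decoder will need."""
    target = cur[by:by+BLOCK, bx:bx+BLOCK].astype(np.int16)
    best_mv, best_cost, best_pred = (0, 0), np.inf, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + BLOCK > prev.shape[0] or x + BLOCK > prev.shape[1]:
                continue
            pred = prev[y:y+BLOCK, x:x+BLOCK].astype(np.int16)
            cost = np.abs(target - pred).sum()   # SAD: sum of absolute differences
            if cost < best_cost:
                best_mv, best_cost, best_pred = (dy, dx), cost, pred
    return best_mv, target - best_pred           # motion vector + residual

def decode_block(prev, by, bx, mv, residual):
    """Reconstruct the block: copy the referenced area, then add the residual."""
    dy, dx = mv
    pred = prev[by+dy:by+dy+BLOCK, bx+dx:bx+dx+BLOCK].astype(np.int16)
    return np.clip(pred + residual, 0, 255).astype(np.uint8)
```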
https://youtu.be/Se8E_SUlU3w?t=242
Is there a reason codecs don't use the previous frame(s) as stored textures, and remap them on the screen? I can move a camera through a room and a lot of the texture is just projectively transformed.
I mean, that's more or less how it works already. But you still need a unit of granularity for the remapping. So the frame will store eg this block moves by this shift, this block by that shift etc.
This is exactly what I question. Why should there be block-shaped units of granularity? Defining a UV-textured 3D mesh that moves and carries previous decoded pixel values should produce far fewer seams; with a textured mesh instead of blocks, the only de novo pixel values would be at the seams between reusable parts of the mesh, for example when an object rotates and reveals a newly visible part of its surface.
Having worked in the field of photogrammetry, I can tell you that it is a really complex task.
That's what AV1 calls global motion and warped motion. Motion deltas (translation/rotation/scaling) can be applied to the whole frame, and blocks can be sheared vertically/horizontally as well as moved.
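A rough sketch of what a single global (affine) motion model buys you, under the simplifying assumptions of nearest-neighbour sampling and an illustrative parameterisation (not AV1's actual global-motion syntax):

```python
import numpy as np

def affine_predict(prev, a, b, tx, ty):
    """Predict a whole frame from `prev` with one rotation/scale/translation model:
    each output pixel (x, y) samples prev at (a*x - b*y + tx, b*x + a*y + ty)."""
    h, w = prev.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(a * xs - b * ys + tx), 0, w - 1).astype(int)
    src_y = np.clip(np.round(b * xs + a * ys + ty), 0, h - 1).astype(int)
    return prev[src_y, src_x]

# Example: a slow zoom plus a small pan, e.g. for a camera drifting across a scene.
# pred = affine_predict(prev_frame, a=1.01, b=0.0, tx=2.0, ty=0.0)
```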
Consider a scene with a couple of cars moving on a background: one can imagine a number of vertices around the contour of each car and reuse the previous frame's car; it makes no sense to force the shape of blocks. The smaller the seams between shapes (reusing previous frames as textures), the fewer pixels need to be reconstituted de novo. The more accurate the remapping xy_old(x_prev,y_prev)-><x,y>, the lower the error signal that needs to be reconstructed.
Also, the majority of new contour vertex locations can be reused as the old contour locations in the next frame's decoding. Then only changes in contour vertices over time need to be encoded, like when a new shape enters the scene or a previously static object starts moving. So there is a lot of room for compression.
At the absolute compression limit, it's no longer video, but a machine description of the scene conceptually equivalent to a textual script.
It feels like we’re losing something, a shared experience, in favor of an increasingly narcissistic attitude that everything needs to be shapeable to individual preferences instead of accepting things as they are.
I’d be somewhat interested in something like a git that generates movies, that my friends can push to.
Extremely widespread mass-media fiction broadcasts are sort of an aberration of the last 75 years or so. I mean, you’d have works in ancient times (the Odyssey) that were shared across a culture. But that was still a story customized by each teller, and those sorts of stories were rare. Canon was mainly a concern of religions.
It’s just for fun, we give it far too much weight nowadays.
I find the idea fun, kinda like using snapchat filters on characters, but in practice I'm sure it'll be used to cut corners and prevent the actual creative vision from being shown which saddens me.
All of this requires a significant amount of extra logic gates/silicon area for hardware decoders, but the bit rate reduction is worth it.
For CPU decoders, the additional computational load is not so bad.
The real additional cost is for encoding, because there are more prediction tools to choose from for optimal compression. That’s why Google only does AV1 encoding for videos that are very popular: it doesn’t make sense to do it on videos that are seen by few.
Clever matters a lot more for encoding. If you can determine good ways to figure out the motion information without trying them all, that gets you faster encoding speed. Decoding doesn't tend to have as much room for cleverness; the stream says to calculate the output from specific data, so you need to do that.
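A small example of that encoder-side cleverness: instead of trying every candidate motion vector, encoders use pruned searches. A classic one is the three-step search, sketched below with a caller-supplied cost function (`sad(mv)` returning the block-matching cost for motion vector `mv` is an assumption of this sketch, not a real API):

```python
def three_step_search(sad, start=(0, 0), step=8):
    """Coarse-to-fine motion search: probe a 3x3 neighbourhood around the current
    best vector, halve the step, repeat. Evaluates a few dozen candidates instead
    of the (2*range+1)^2 an exhaustive search would, at the risk of a worse match."""
    best = start
    while step >= 1:
        candidates = [(best[0] + dy * step, best[1] + dx * step)
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        best = min(candidates, key=sad)
        step //= 2
    return best

# e.g. with a toy cost whose minimum is at a shift of (3, -5):
# best_mv = three_step_search(lambda mv: abs(mv[0] - 3) + abs(mv[1] + 5))
```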
I don’t understand why 60fps never became ubiquitous; a pan scene in 30fps is horrible, it’s almost stroboscopic to me.
It doesn't look like AV2 does any of that yet though fortunately (except film grain synthesis but I think that's fine).
[1]: https://bellard.org/nncp/
I imagine e.g. a picture of an 8x8 circle actually takes more bits to encode than a mathematical description of the same circle
I wonder if there are codecs with provisions for storing common shapes. Text comes to mind - I imagine having a bank of 10 most popular fonts an encoding just the difference between source and text + distortion could save quite a lot of data on text heavy material. Add circles, lines, basic face shapes.
There also seems to be a fair bit of attention on that problem space from the real-time comms vendors, with Cisco [1], Microsoft [2] and Google [3] already leaning on model-based audio codecs. With the advantages that provides, both around packet-loss mitigation and around shifting costs to end-user (aka free) compute and away from central infra, I can't see that not extending to the video channel too.
[0]: https://mtisoftware.com/understanding-ai-upscaling-how-dlss-...
[1]: https://www.webex.com/gp/webex-ai-codec.html
[2]: https://techcommunity.microsoft.com/blog/microsoftteamsblog/...
[3]: https://research.google/blog/lyra-a-new-very-low-bitrate-cod...
Not quite yet, as shown in H.267. But at some point the computational requirements vs. bandwidth-saving benefits would no longer make sense.
[1]: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
So it seems like they checked that all their ideas could be implemented efficiently in hardware as they went along, with advice from real hardware producers.
Hopefully AV2-capable hardware will appear much quicker than AV1-capable hardware did.
Providing a production grade verified RTL implementation would obviously be useful but also entire companies exist to do that and they charge a lot of money for it.
And what hobbyist is sending off decoding chips to be fabbed? If this exists, it sounds interesting if incredibly impractical.
An h265 or AV1 decoder requires millions of logic gates (and DRAM memory bandwidth). Only high-end FPGAs provide that.
The complexity of video decoders has been going up exponentially and AV2 is no exception. Throwing more tools (and thus resources) at it is the only way to increase compression ratio.
Take AV1. It has CTBs that are 128x128 pixels. For intra prediction, you need to keep track of 256 neighboring pixels above the current CTB and 128 to the left. And you need to do this for YUV. For 420, that means you need to keep track of (256+128 + 2x(128+64)) = 768 pixels. At 8 bits per component, that's 8x768=6144 flip-flops. That's just for neighboring pixel tracking, which is only a tiny fraction of what you need to do, a few % of the total resources.
These neighbor tracking flip-flops are followed by a gigantic multiplexer, which is incredibly inefficient on FPGAs and it devours LUTs and routing resources.
A Lattice ECP5-85 has 85K LUTs. The FFs alone consume 8% of the FPGA. The multiplexer probably takes another 20%, conservatively. You haven't even started to calculate anything and your FPGA is already almost 30% full.
FWIW, for h264, the equivalent of that 128x128 pixel CTB is 16x16 pixel MB. Instead of 768 neighboring pixels, you only need 16+32+2*(8+16)=96 pixels. See the difference? AV2 retains the 128x128 CTB size of AV1 and if it adds something like MRL of h.266, the number of neighbors will more than double.
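For anyone who wants to check the arithmetic, the numbers in the comment above reproduce like this (these are the commenter's figures for 4:2:0, 8-bit content, not values taken from either spec):

```python
def neighbor_pixels(block_size):
    """Neighbouring pixels tracked for intra prediction of one block:
    2*block above + block left for luma, plus the half-sized equivalents
    for the two 4:2:0 chroma planes."""
    luma = 2 * block_size + block_size
    chroma = 2 * (block_size + block_size // 2)
    return luma + chroma

av1_pixels  = neighbor_pixels(128)   # (256 + 128) + 2*(128 + 64) = 768
h264_pixels = neighbor_pixels(16)    # (32 + 16)  + 2*(16 + 8)   = 96
flip_flops  = 8 * av1_pixels         # 8 bits per component -> 6144 FFs

print(av1_pixels, h264_pixels, flip_flops)
print(f"{flip_flops / 85_000:.1%}")  # ~7% of an ECP5-85's 85K LUTs, close to the 8% quoted above
```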
H264 is child's play compared to later codecs. It only has a handful of angular prediction modes, it has barely any pre-angular filtering, it has no chroma-from-luma prediction, it only has a weak deblocking filter and no loop filtering. It only has one DCT mode. The coding tree is trivial too. Its entropy decoder and syntax processing are low in complexity compared to later codecs. It doesn't have intra-block copy. Etc. etc.
Working on a hardware video decoder is my day job. I know exactly what I'm talking about, and, with all due respect, you clearly do not.
Your argument about your large number of flops is odd. You would only store data that way if you needed everything on the same cycle. You say there's a multiplexer after that. Data storage + multiplexer is just memory. You could use a BRAM or LUTRAM, which would cut down on that dramatically, if there's really a need based on later processing, which you haven't defined. And even then, that's for AV1, which isn't AV2 and may change.
Wait, I just discovered GPUs, nevermind. [giggles]
Still, the ability to do specialized work should probably be offloaded to specialized but pluggable hardware. I wonder what the economics of this would be...
They're called GPUs... They're ASICs rather than FPGAs, but it's easy to update the driver software to handle new video codecs. The difficulty is motivating GPU manufacturers to do so... They'd rather sell you a new one with newer codec support as a feature.
But often a new codec requires decoders to know how to work with new things that the fixed function hardware likely can't do.
Encoding might actually be different. If your encoder hardware can only do fixed block sizes, and can only detect some types of motion, a driver change might be able to package it up as the new codec. Probably not a lot of benefit, other than ticking a box, but it might be useful sometimes. Especially if you, say, offload motion detection but the new codec needs different arithmetic encoding: you'd need to use the CPU (or general-purpose GPU) to do the arithmetic encoding and presumably get a size saving over the old codec.
The main point of having ASICs for video codecs these days is efficiency, not being able to real-time decode a stream at all (as even many embedded CPUs can do that at this point).
While it worked, I don't think it ever left my machine. Never moved past software decoding -- I was a broke teen with no access to non-standard hardware. But the idea has stuck with me and feels more relevant than ever, with the proliferation of codecs we're seeing now.
It has the Sufficiently Smart Compiler problem baked in, but I tried to define things to be SIMD-native from the start (which could be split however it needed to be for the hardware) and I suspect it could work. Somehow.
If true, that would be amazing.
Maybe that’s what we did in the past and it was a bad idea. It’d be useful to know if you can read the file by looking only at its extension
That's pretty much always been the case. File extensions are just not expressive enough to capture all the nuances of audio and video codecs. MIME types are a bit better.
Audio is a bit of an exception with the popularity of MP3 (which is both a codec and a relatively minimal container format for it).
> It’d be useful to know if you can read the file by looking only at its extension
That would be madness, and there's already a workaround - the filename itself.
For most people, all that matters is an MKV file is a video file, and your configured player for this format is VLC. Only in a small number of cases does it matter about an "inner" format, or choice of parameter - e.g. for videos, what video codec or audio codec is in use, what the bitrate is, what the frame dimensions are.
For where it _matters_, people write "inner" file formats in the filename, e.g. "Gone With The Wind (1939) 1080p BluRay x265 HEVC FLAC GOONiES.mkv", to let prospective downloaders choose what to download from many competing encodings of exactly the same media, on websites where a filename is the _only_ place to write that metadata (if it were a website not standardised around making files available and searching only by filenames, it could just write it in the link description and filename wouldn't matter at all)
Most people don't care, for example, that their Word document is A4 landscape, so much that they need to know _in the filename_.
AVIF is also a container format, and I believe should be adaptable to AV2, even if the name stands for "AV1 image format". It could simply just be renamed to AOMedia Video Image Format for correctness.
AI video could mean that essential elements are preserved (actors?) but other elements are generated locally. Hell, digital doubles for actors could also mean only their movements are transmitted. Essentially just sending the mo-cap data. The future is gonna be weird
> It would be interesting to see how far you could get using deepfakes as a method for video call compression.
> Train a model locally ahead of time and upload it to a server, then whenever you have a call scheduled the model is downloaded in advance by the other participants.
> Now, instead of having to send video data, you only have to send a representation of the facial movements so that the recipients can render it on their end. When the tech is a little further along, it should be possible to get good quality video using only a fraction of the bandwidth.
— https://news.ycombinator.com/item?id=22907718
Specifically for voice, this was mentioned:
> A Real-Time Wideband Neural Vocoder at 1.6 Kb/S Using LPCNet
— https://news.ycombinator.com/item?id=19520194
You could probably also transmit a low res grayscale version of the video to “map” any local reproduction to. Kinda like how a low resolution image could be reasonably reproduced if an artist knew who the subject was.
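A very rough sketch of what that could look like on the wire, with everything model-related stubbed out (the extractor, renderer, and payload layout here are hypothetical placeholders for illustration, not any real product's API):

```python
import numpy as np

# Placeholders: a real system would use a face/landmark tracker and a personalized
# renderer trained before the call. These stubs just keep the sketch self-contained.
def extract_landmarks(frame):
    return np.zeros((70, 2), dtype=np.float32)          # e.g. ~70 (x, y) face points

def render_with_model(model, landmarks, guide):
    return np.zeros((*guide.shape, 3), dtype=np.uint8)  # re-rendered frame

def sender_payload(frame):
    """What actually crosses the network: landmark coordinates plus a low-res
    grayscale guide image to constrain the receiver's reconstruction."""
    landmarks = extract_landmarks(frame).astype(np.float16)
    guide = frame[::16, ::16].mean(axis=2).astype(np.uint8)
    return {"landmarks": landmarks, "guide": guide}

def receiver_frame(model, payload):
    return render_with_model(model, payload["landmarks"], payload["guide"])

# Rough payload: a few hundred bytes of landmarks plus roughly an 8 KB guide per
# 1080p frame, before any further compression of either part.
```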
It works amazingly well with text compression, for example: https://bellard.org/nncp/
I have a top-of-the-line 4K TV and gigabit internet, yet the compression artifacts make everything look like putty.
Honestly, the best picture quality I’ve ever seen was over 20 years ago using simple digital rabbit ears.
You especially notice the compression on gradients and in dark movie scenes.
And yes — my TV is fully calibrated, and I’m paying for the highest-bandwidth streaming tier.
Not my tv, but a visual example: https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....
Right now, Netflix can say stuff like "we think the 4K video we're serving is just as good." If they offer a real-4K tier, it's hard to make that argument.
The “best” quality of streaming you have is Sony Core https://en.wikipedia.org/wiki/Sony_Pictures_Core but it has a rather limited library.
Pricing, if I am reading the site correctly: $7k-ish for a server (+$ for local disks, one assumes), $2-5k per client. So you download the movie locally to your server and play it on clients scattered throughout your mansion/property.
Not out of the world for people who drop 10s of thousands on home theater.
I wonder if that's what the Elysium types use in their NZ bunkers.
No true self-respecting, self-described techie (Scotsman) would use it instead of building their own of course.
Also the whole "you can hear more with lossless audio" is just straight up a lie.
That's not a correctly calibrated TV. The contrast is tuned WAY up. People do that to see what's going on in the dark, but you aren't meant to really be able to see those colors. That's why it's a big dark blob. It's supposed to be barely visible on a well calibrated display.
A lot of video codecs will erase details in dark scenes because those details aren't supposed to be visible. Now, I will say that streaming services are tuning that too aggressively. But I'll also say that a lot of people have miscalibrated displays. People simply like to be able to make out every detail in the dark. Those two things come in conflict with one another causing the effect you see above.
Someone needs to tell filmmakers. They shoot dark scenes because they can - https://www.youtube.com/watch?v=Qehsk_-Bjq4 - and it ends up looking like shit after compression that assumes normal lighting levels.
i disagree completely. i watch a movie for the filmmakers story, i don’t watch movies to marvel at compression algorithms.
it would be ridiculous to watch movies shot with only bright scenes because streaming service accountants won’t stop abusing compression to save some pennies.
> …ends up looking like shit after compression that assumes normal lighting levels.
it’s entirely normal to have dark scenes in movies. streaming services are failing if they’re using compression algorithms untuned to do dark scenes when soooo many movies and series are absolutely full of night shots.
It should be noted, as well, that this generally isn't a "not enough bits" problem. There are literally codec settings to tune which decide when to start smearing the darkness. On a few codecs (such as VP1) those values are pretty badly set by default. I suspect streaming services aren't far off from those defaults. The codec settings are instead prioritizing putting bits into the lit parts of a scene rather than sparing a few for the darkness like you might like.
The issue is just that we don't code video with nearly enough bits. It's actually less than 8-bit since it only uses 16-235.
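For reference, the "16-235" point is the standard limited-range convention for 8-bit luma (BT.601/709 "video range"); here is a quick sketch of what that costs and the usual expansion formula:

```python
import math

def limited_to_full(y):
    """Expand an 8-bit limited-range luma code value (16-235) to full range (0-255),
    using the common (y - 16) * 255 / 219 scaling."""
    return max(0, min(255, round((y - 16) * 255 / 219)))

usable_levels = 235 - 16 + 1                      # 220 of the 256 code values carry picture
print(usable_levels, f"~{math.log2(usable_levels):.2f} effective bits")   # ~7.78 bits
print(limited_to_full(16), limited_to_full(235))  # 0 255
```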
That’s why, presumably, Netflix came up with the algorithm for removing camera grain and adding synthetically generated noise on the client[0], and why YouTube shorts were recently in the news for using extreme denoising[1]. Noise is random and therefore difficult to compress while preserving its pleasing appearance, so they really like the idea of serving everything denoised as much as possible. (The catch, of course, is that removing noise from live camera footage generally implies compromising the very fine details captured by the camera as a side effect.)
[0] https://news.ycombinator.com/item?id=44456779
[1] https://news.ycombinator.com/item?id=45022184
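A toy version of that "strip the grain, re-add it on the client" step, assuming only that the synthesized grain is seeded so every playback looks the same (AV1's actual film-grain synthesis uses an autoregressive grain pattern with per-intensity scaling, not plain Gaussian noise):

```python
import numpy as np

def add_synthetic_grain(decoded, strength=3.0, seed=1234):
    """Add reproducible pseudo-grain to a decoded 8-bit frame after decoding,
    so the heavily denoised stream doesn't look like plastic."""
    rng = np.random.default_rng(seed)   # same seed -> identical grain on every playback
    grain = rng.normal(0.0, strength, size=decoded.shape)
    return np.clip(decoded.astype(np.float32) + grain, 0, 255).astype(np.uint8)
```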
h.264 only had FGS as an afterthought, introduced years after the spec was ratified. No wonder it wasn’t widely adopted.
VP9, h.265 and h.266 don’t have FGS.
1. Camera manufacturers and film crews both do their best to produce a noise-free image.
2. In post-production, they add fake noise to the image so it looks more "cinematic".
3. To compress better, streaming services try to remove the noise.
4. To hide the insane compression and make it look even slightly natural, the decoder/player adds the noise back.
Anyone else finding this a bit...insane?
This is not correct; camera manufacturers and filmmakers engineer _aesthetically pleasing_ noise (randomized grain appears smoother to the human eye than clean uniform pixels). The rest is still as silly as it sounds.
Does this explain why I dislike 4K content on a 4K TV? Some series and movies look too realistic, which in turn gives me an amateur-film feeling (like somebody made a movie with a smartphone).
https://en.wikipedia.org/wiki/Soap_opera_effect
Which is generally associated with excess denoising rather than with excess grain.
This comment that I replied to is almost a textbook description of the soap opera effect.
The interpolation adds more FPS, which is traditionally a marker of film vs TV production.
Maybe more data and numbers: encoding-complexity increase, decoding complexity, hardware-decoder roadmap, compliance and test kits, future profiles. Involvement in and improvements to both AVIF the format and the AV2 image codec. Better than JPEG-XL? Is the ~30% BD-rate gain measured against the current best AV1 encoder or against AV1 1.0 as the anchor point? Live-encoding improvements?
[1] https://aomedia.org/events/live-session-the-future-of-innova...
60 more comments available on Hacker News