Expected Attention: KV Cache Compression by Estimating Attention
Posted 3 months ago · Active 3 months ago
arxiv.org · Tech · story
calm · positive
Debate: 20/100
Key topics
Artificial Intelligence
Large Language Models
Attention Mechanism
The paper proposes a new method for compressing the KV cache in large language models by estimating the attention that cached entries are expected to receive. HN commenters discuss possible extensions, such as merging similar cache entries instead of dropping them, and ask for evaluations on longer context windows.
Snapshot generated from the HN discussion
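For context on the technique under discussion, here is a minimal sketch of score-based KV cache eviction, assuming a single attention head and a placeholder scoring function: each cached key/value pair is scored by how much attention it is expected to receive from future queries, and the lowest-scoring entries are dropped. The function and variable names are illustrative, not taken from the paper.

```python
import torch

def compress_kv_cache(keys, values, scores, keep_ratio=0.5):
    """Drop the cached entries with the lowest estimated future attention.

    keys, values: (seq_len, head_dim) tensors for one attention head.
    scores:       (seq_len,) estimate of the attention each cached token
                  will receive from future queries (a stand-in for the
                  paper's expected-attention estimate).
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    # Keep the top-scoring positions, preserving their original order.
    keep_idx = torch.topk(scores, n_keep).indices.sort().values
    return keys[keep_idx], values[keep_idx]

# Toy usage with random tensors and a random proxy score.
keys = torch.randn(16, 64)
values = torch.randn(16, 64)
proxy_scores = torch.rand(16)  # would be the expected-attention estimate
small_k, small_v = compress_kv_cache(keys, values, proxy_scores, keep_ratio=0.25)
print(small_k.shape)  # torch.Size([4, 64])
```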
Discussion Activity
Light discussion
First comment: 1h after posting
Peak period: 2 comments in 1-2h
Avg per period: 1.5
Key moments
- 01 Story posted: Oct 6, 2025 at 11:22 AM EDT (3 months ago)
- 02 First comment: Oct 6, 2025 at 12:46 PM EDT (1h after posting)
- 03 Peak activity: 2 comments in 1-2h, the hottest window of the conversation
- 04 Latest activity: Oct 6, 2025 at 1:36 PM EDT (3 months ago)
Discussion (3 comments)
tripplyons
3 months ago
1 reply
Great work! I wonder if there is a way to combine similar cache items instead of dropping unlikely ones. Could the proposed attention estimation be used for that?
yorwba
3 months ago
Yes, for example https://arxiv.org/pdf/2506.05410 merges two neighboring tokens with the lowest sum of past attention scores, and this method would enable using future expected attention instead.
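As a rough sketch of the merging idea referenced above (assuming averaged keys/values and a generic score vector; the linked paper's actual procedure differs in detail), the snippet below collapses the adjacent pair of cached tokens with the lowest combined score, where the scores could come from past attention or an expected-attention estimate.

```python
import torch

def merge_lowest_pair(keys, values, scores):
    """Merge the adjacent pair of cached tokens with the lowest combined score.

    Sketch of neighbor merging: the two entries are replaced by the average of
    their keys and values. `scores` could be past attention or an
    expected-attention estimate; the merged entry keeps the pair's summed score.
    """
    assert keys.shape[0] >= 2, "need at least two cached tokens to merge"
    pair_scores = scores[:-1] + scores[1:]        # combined score of each adjacent pair
    i = int(torch.argmin(pair_scores))            # pair (i, i+1) to merge
    merged_k = (keys[i] + keys[i + 1]) / 2
    merged_v = (values[i] + values[i + 1]) / 2
    new_keys = torch.cat([keys[:i], merged_k[None], keys[i + 2:]])
    new_values = torch.cat([values[:i], merged_v[None], values[i + 2:]])
    new_scores = torch.cat([scores[:i], pair_scores[i:i + 1], scores[i + 2:]])
    return new_keys, new_values, new_scores
```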
yalok
3 months ago
The paper only mentions evals for RULER 4K and 16K; I wish they'd gone further and measured longer context windows. I was wondering whether this method might even gain over the baseline (no compression); their results for Qwen on RULER 16K seem to allude to that: at small compression ratios the evals look better than baseline, which would mean they are not just improving inference speed/memory but also addressing the attention dilution problem…