Kimi Linear: an Expressive, Efficient Attention Architecture

Last activity 24 days ago · Posted Oct 30, 2025 at 8:07 PM EDT

blackcat201
217 points
47 comments

Mood: excited
Sentiment: positive
Category: other
Key topics: AI Architecture, Attention Mechanism, Large Language Models
Debate intensity: 60/100

The Kimi Linear attention architecture is a new AI model that has garnered significant interest on HN, with discussions revolving around its potential, comparisons to existing models, and concerns about AI's environmental impact.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion
First comment: 3h after posting
Peak period: 42 comments (Day 1)
Avg per period: 15.7 comments
Comment distribution: 47 data points (based on 47 loaded comments)

Key moments

  1. Story posted: Oct 30, 2025 at 8:07 PM EDT (27 days ago)
  2. First comment: Oct 30, 2025 at 11:19 PM EDT (3h after posting)
  3. Peak activity: 42 comments in Day 1 (hottest window of the conversation)
  4. Latest activity: Nov 2, 2025 at 4:51 PM EST (24 days ago)


Discussion (47 comments)
textembedding
27 days ago
3 replies
125 upvotes with 2 comments is kinda sus
muragekibicho
27 days ago
3 replies
Lots of model releases are like this. We can only upvote. We can't run the model on our personal computers, nor can we test their 'Efficient Attention' concept ourselves.

Honestly, it would take 24 hours just to download the 98 GB model if I wanted to try it out (assuming I had a card with 98 GB of VRAM).

Der_Einzige
27 days ago
1 reply
People here absolutely can afford the ~2 dollars an hour of cloud rental costs for an H100 or even 8 (OCI has cheap H100 nodes). Most people are too lazy to even try and thank goodness for it because I prefer my very high salaries as someone who isn’t too lazy to spin up a cloud instance.
embedding-shape
27 days ago
Not to mention some of us have enough disposable income to buy an RTX Pro 6000 so we can run our stuff locally and finally scale up our model training a little bit.
samus
27 days ago
1 reply
We very much can, especially a Mixture of Experts model like this with only 3B activated parameters.

With an RTX 3070 (8 GB VRAM), 32 GB RAM and an SSD I can run such models at speeds tolerable for casual use.

embedding-shape
27 days ago
1 reply
How many tok/s are you getting (with any runtime) with either the Kimi-Linear-Instruct or Kimi-Linear-Base on your RTX 3070?
samus
26 days ago
With Qwen3-30B-A3B (Q8) I'm getting 10-20 t/s on KoboldAI (i.e., llama.cpp-based). Faster than I can read, so good enough for hobby use. I expect this model to be significantly faster, but llama.cpp-based software probably doesn't support it yet.
danielbln
27 days ago
You have a 9 Mbit downlink? I'm not sure you're trying much of anything with that Internet connection, no offense.
actionfromafar
27 days ago
1 reply
I'm hoping someone will explain what this release even means.
lhl
27 days ago
We do live in an age of frontier LLMs... For fun, I'll just use Kimi K2 (on Kagi Assistant).

> Can you explain what this means and its significance? Assume that I'm a layperson with no familiarity with LLM jargon so explain all of the technical terms, references, names. https://github.com/MoonshotAI/Kimi-Linear

Imagine your brain could only “look at” a few words at a time when you read a long letter. Today’s big language models (the AI that powers chatbots) have the same problem: the longer the letter gets, the more scratch paper they need to keep track of it all. That scratch paper is called the “KV cache,” and for a 1 000 000-word letter it can fill a small library.

Kimi Linear is a new way for the AI to read and write that throws away most of that scratch paper yet still understands the letter. It does this by replacing the usual “look at every word every time” trick (full attention) with a clever shortcut called linear attention. The shortcut is packaged into something they call Kimi Delta Attention (KDA).

What the numbers mean in plain English

    51.0 on MMLU-Pro: on a 4 000-word school-test set, the shortcut scores about as well as the old, slow method.
    84.3 on RULER at 128 000 words: on a much longer test it keeps the quality high while running almost four times faster.
    6 × faster TPOT: when the AI is writing its reply, each new word appears up to six times sooner than with the previous best shortcut (MLA).
    75 % smaller KV cache: the scratch paper is only one-quarter the usual size, so you can fit longer conversations in the same memory.
Key pieces explained

    Full attention: the old, accurate but slow “look back at every word” method.
    KV cache: the scratch paper that stores which words were already seen.
    Linear attention: a faster but traditionally weaker way of summarising what was read.
    Gated DeltaNet: an improved linear attention trick that keeps the most useful bits of the summary.
    Kimi Delta Attention (KDA): Moonshot’s even better version of Gated DeltaNet.
    Hybrid 3:1 mix: three layers use the fast KDA shortcut, one layer still uses the old reliable full attention, giving speed without losing smarts.
    48 B total, 3 B active: the model has 48 billion total parameters but only 3 billion “turn on” for any given word, saving compute.
    Context length 1 M: it can keep track of about 1 000 000 words in one go—longer than most novels.
Bottom line: Kimi Linear lets an AI read very long documents or hold very long conversations with far less memory and much less waiting time, while still giving answers as good as, or better than, the big, slow models we use today.
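
To make the KV-cache comparison above concrete, here is a back-of-the-envelope sketch in Python. The layer counts, head count, and head dimension are assumptions chosen for illustration, not Kimi Linear's published configuration; the point is only that a full-attention cache grows with context length while a linear-attention state stays constant.

```python
# Back-of-the-envelope memory sketch (assumed dimensions, not Kimi Linear's
# real config): full attention keeps a key/value pair per token, while a
# linear-attention layer keeps one fixed-size state matrix per head.

bytes_per_value = 2          # fp16 / bf16
context_len     = 1_000_000  # tokens kept in the conversation
n_full_layers   = 12         # layers still using full attention (assumed)
n_kv_heads      = 8          # assumed
head_dim        = 128        # assumed

# KV cache for the full-attention layers: one key and one value vector
# per token, per layer, per head.
kv_cache_bytes = (context_len * n_full_layers * n_kv_heads * head_dim
                  * 2              # key + value
                  * bytes_per_value)
print(f"KV cache (full-attention layers): {kv_cache_bytes / 1e9:.1f} GB")

# Recurrent state for the linear-attention layers: a head_dim x head_dim
# matrix per head, independent of how many tokens have been read.
n_linear_layers = 36         # assumed 3:1 ratio against the 12 full layers
state_bytes = (n_linear_layers * n_kv_heads * head_dim * head_dim
               * bytes_per_value)
print(f"Recurrent state (linear layers):  {state_bytes / 1e6:.1f} MB")
```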
WhereIsTheTruth
27 days ago
The Chinese century ain't gonna build itself /s
eXpl0it3r
27 days ago
2 replies
For the uninitiated, what's a "hybrid linear attention architecture"?
quotemstr
27 days ago
1 reply
1/4 of their layers are conventional quadratic attention
meowface
27 days ago
3 replies
Could someone explain every term in this subthread in a very simple way to someone who basically only knows "transformers are a neural network architecture that use something called 'attention' to consider the entire input the whole time or something like that", and who does not understand what "quadratic" even means in a time complexity or mathematical sense beyond that "quad" has something to do with the number four.

I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.

moffkalast
27 days ago
Afaik there are two types of attention, cross attention and self attention. It's quadratic because you have to process one set of tokens against another, like calculating a matrix product. Originally designed for translation, you'd take tokens in one language on one side and the other language on the other, then compute the relevance of each word to every other, which the model then uses to more accurately generate the translation.

With self attention you compute every token in a sequence against every other token in that same sequence, figuring out which word references which other word (e.g. "George is sitting in the park. He's reading a book.", where "He" would correlate with "George", letting the model know what it refers to). Of course these are also trained layers, so what the model thinks correlates with what, and how that info is used in the DNN perceptron part, depends wholly on the training process.

There is no free lunch with this, and with only 1/4 of layers having it, the model will perform significantly worse at identifying relevant info and likely decohere a lot compared to having it on every layer. But since you get rid of the quadratic complexity, it'll be much faster. Think the "I'm doing 1000 calculations per second and they're all wrong" meme. So far there have been lots of attempts at doing linear-ish attention (e.g. Google doing the sliding window hackery that only computes a part of the vectors and hopes for good locality, Mamba combinations with RNNs, Meta removing positional encodings in attention in the trainwreck that was Llama 4, etc.) and they've mostly failed, so the holy grail is finding a way to make it work, since you get the best of both worlds. The top performing models today all use fully quadratic attention or combine it with sliding windows in some layers to claw back some speed in long context scenarios at the cost of some accuracy.
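
For anyone who wants to see where the quadratic cost actually shows up, here is a minimal single-head self-attention sketch in numpy. It omits the causal mask, multiple heads, and everything else a real transformer layer has; it only exists to show that the score matrix has N x N entries, so doubling the sequence length quadruples the work for this step.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (N, d) token representations; Wq/Wk/Wv: (d, d) learned projections."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv              # each (N, d)
    scores = Q @ K.T / np.sqrt(x.shape[1])        # (N, N): every token vs. every other
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                            # (N, d) mixed representations

N, d = 6, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)        # (6, 16)
```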

Zacharias030
27 days ago
Transformers try to give you capabilities by doing two things interleaved (in layers) multiple times:

- apply learned knowledge from its parameters to every part of the input representation („tokenized“, i.e., chunkified text).

- apply mixing of the input representation with other parts of itself. This is called „attention“ for historical reasons. The original attention computes mixing of (roughly) every token (say N) with every other (N). Thus we pay a compute cost relative to N squared.

The attention cost therefore grows quickly in terms of compute and memory requirements when the input / conversation becomes long (or may even contain documents).

It is a very active field of research to reduce the quadratic part to something cheaper, but so far this has been rather difficult, because as you readily see this means that you have to give up mixing every part of the input with every other.

Most of the time mixing token representations close to each other is more important than those that are far apart, but not always. That’s why there are many attempts now to do away with most of the quadratic attention layers but keeping some.

What to do during mixing when you give up all-to-all attention is the big research question because many approaches seem to behave well only under some conditions and we haven’t established something as good and versatile as all-to-all attention.

If you forgo all-to-all you also open up so many options (eg. all-to-something followed by something-to-all as a pattern, where something serves as a sort of memory or state that summarizes all inputs at once. You can imagine that summarizing all inputs well is a lossy abstraction though, etc.)
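
As a small illustration of that "keep some quadratic layers" idea (and the 3:1 mix mentioned in the summary above), the sketch below builds a layer schedule where every fourth layer is full attention. The names and the helper function are purely illustrative; nothing here is taken from the Kimi Linear codebase.

```python
# Schematic layer schedule for a hybrid stack: cheap mixing in most layers,
# full all-to-all attention every fourth layer. Illustrative only.
LINEAR_TO_FULL_RATIO = 3  # "3:1 mix"

def build_layer_schedule(n_layers: int) -> list[str]:
    schedule = []
    for i in range(n_layers):
        if (i + 1) % (LINEAR_TO_FULL_RATIO + 1) == 0:
            schedule.append("full_attention")    # exact but quadratic in length
        else:
            schedule.append("linear_attention")  # cheap, fixed-size state
    return schedule

print(build_layer_schedule(8))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```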

hexaga
26 days ago
There are different varieties of attention, which just amounts to some kind of learned mixing function between tokens in a sequence.

For an input of length N (tokens), the standard kind of attention requires N squared operations (hence, quadratic - it scales with the square of input length). You have to check how every token attends to every other token.

There are a bunch of alternative mixing functions which are instead linear with respect to N. Every additional token costs the same amount of work. The typical method is to have a constant size state manipulated recurrently, which necessarily implies some level of lossy compression in the state (quadratic attention doesn't really have state in this sense - it computes and checks every possible relation always).

Linear attentions kind of suck in comparison to quadratic attention but the efficiency is very attractive, especially at inference time where you don't need more VRAM to store more context.

TLDR; conventional attentions scale N^2 time, N space (kv cache), and are exact. linear attentions scale N time, constant space (recurrent state), and are lossy.
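
To make the "constant-size recurrent state" idea concrete, here is a toy linear-attention sketch in numpy. The feature map is an arbitrary choice and the usual normalization term is omitted for brevity; this is the plain "accumulate outer products" form, without the gating and delta-rule corrections that Gated DeltaNet and KDA add on top.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Q, K, V: (N, d). Uses a d x d recurrent state instead of an N x N matrix."""
    N, d = Q.shape
    phi = lambda v: np.maximum(v, 0.0) + 1e-6   # simple positive feature map (assumed)
    S = np.zeros((d, d))                        # constant-size state, reused every step
    out = np.zeros((N, d))
    for t in range(N):                          # one cheap update per token: O(N) overall
        S += np.outer(phi(K[t]), V[t])          # fold this token into the compressed summary
        out[t] = phi(Q[t]) @ S                  # read out against the summary (lossy)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))
print(linear_attention(Q, K, V).shape)          # (8, 16)
```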

bn-l
27 days ago
Hey thanks for asking this question. It led to good replies.
amoskvin
27 days ago
1 reply
Any hardware recommendations? How much memory do we need for this?
uniqueuid
27 days ago
1 reply
You will effectively want a 48GB card or more for quantized versions, otherwise you won't have meaningful space left for the KV cache. Blackwell and above is generally a good idea to get faster hardware support for 4-bit formats (some recent models took some time to ship for older architectures, gpt-oss IIRC).
samus
27 days ago
This is a Mixture of Experts model with only 3B activated parameters. But I agree that for the intended usage scenario VRAM for the KV cache is the real limitation.
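
A rough VRAM budgeting sketch for this sizing subthread. The 48B total parameter count comes from the summary earlier on this page; the 4-bit quantization and the 48 GB card are assumptions for illustration, and real deployments add runtime overhead on top of both numbers.

```python
# Rough VRAM budget (illustrative assumptions, not measured numbers).
total_params    = 48e9   # "48B total" from the model summary
bits_per_weight = 4      # assume a 4-bit quantized checkpoint

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"Quantized weights: ~{weights_gb:.0f} GB")          # ~24 GB before overhead

# Whatever remains on the card has to hold the KV cache for the full-attention
# layers plus activations, which is why context length, not compute, tends to
# be the practical limit here.
card_gb = 48
print(f"Left on a {card_gb} GB card: ~{card_gb - weights_gb:.0f} GB for KV cache and activations")
```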
logicartisan
27 days ago
1 reply
Amazing how fast AI keeps improving; every new model feels like a big step forward.
hirako2000
27 days ago
4 replies
It is solely improving on efficiency. While that is extremely valuable given the disproportionate (relative to value) costs of these things, your statement almost sounds like it has improved on an even more challenging aspect: pushing performance.
embedding-shape
27 days ago
1 reply
It's a generic comment that I don't think is even specifically about Kimi Linear or this submission, you could leave the same comment on almost any AI/ML submission and it'd say the same amount and be as relevant/irrelevant.
hirako2000
27 days ago
Agreed it would apply to any method or system that improved on efficiency. It doesn't diminish the feat. Not trying to minimize the impact of Kimi linear's gain. It is a novel and outstanding benefit applicable to LLMs.
giancarlostoro
27 days ago
1 reply
I've uh said this a few times. But AI is a bunch of people overpaying CS students to implement old algorithms and then realizing that they need Software Engineers to optimize the existing known systems. Most of AI (if not ALL of it) as we know it today has been coded for decades, we just never had the hardware for it.

A lot of the optimizations are not some ground breaking new way to program, they're known techniques to any Software Engineer or Systems Engineer.

embedding-shape
27 days ago
> A lot of the optimizations are not some ground breaking new way to program

Hindsight is a bitch huh? Everything looks simple now once people proved it kind of works, but I think you over-simplify "how easy it is".

Lots of stuff in ML, particularly in the last ~5 years or so, hasn't been "implementing old algorithms", although of course everything builds on the research that happened in the past; we're standing on the shoulders of giants and all that.

acuozzo
26 days ago
> It solely is improving on efficiency.

Consider the implications of increases in efficiency *when you hold compute constant*.

The win is far more obvious when it's "we can do more with what we have" instead of "we can do the same with less".

naasking
26 days ago
There has been work that has pushed performance too, like tiny recursive models. Applying LLMs in recursive loops also improves output, so efficiency improvements make this viable, which can count as improvements in performance.
oxqbldpxo
27 days ago
4 replies
I switched from ChatGPT to Perplexity, and now to Kimi K2, after reading an article here explaining that all the fear around some of the Chinese models spying and so on is simply not true. I have to say that in my experience Kimi K2 is way better than Perplexity. I hope we can get our act together. It seems that building these AIs requires a level of collaboration that is in opposition to greed.
lostmsu
27 days ago
1 reply
Why do you think either Perplexity or Kimi is better than GPT-5?
embedding-shape
27 days ago
More importantly: for what?

The model, the tooling you use, and even which prompts you use with which model all have a big impact on the quality of the responses you get.

wongarsu
27 days ago
My default assumption would be that every model is spying (or rather: is being spied on). The data is just way too juicy; every major intelligence agency has to be salivating at the thought of getting this degree of insight into people.

Of course with Kimi there is fear because the Chinese government can easily pressure Moonshot AI into sharing the data, and other countries have to work to stealthily siphon data off without being caught by Chinese counterintelligence. As opposed to GPT-5, where the American government can easily pressure OpenAI and every other country has to stealthily siphon data off without being caught by American counterintelligence. The only way to be reasonably certain that you aren't spied on is to run your own models or rent GPU time to run models.

The bigger worry imho is whether the models are booby-trapped to give poisoned answers when they detect certain queries or when they detect that you work for a competitor or enemy of China. But that has to be reasonably stealthy to work.

nh43215rgb
26 days ago
> Chinese models spying and so on.. is simply not true.

They all must be doing humanity a great favor out of good will, then.

Sorry, but seriously: the Chinese government, controlled by the Chinese Communist Party (CCP), can effectively seize or shut down internet services and infrastructure at will within its borders under its national security laws.

No need to read the TOS; it's in the law.

embedding-shape
27 days ago
> after reading an article here explaining that all the fear around some of the Chinese models spying and so on.. is simply not true

Not doubting they aren't spying on people, but regardless, how would you really know? Are you basing this on the fact that no Chinese police have visited you, or how else would you really know whether it's "simply true" or not?

With that said, I use plenty of models coming out of China too with no fear, but I'm also using them locally, not cloud platforms.

lostmsu
27 days ago
1 reply
Any comparison with existing models on common benchmarks? Text? Coding? MMLU?
ted_dunning
24 days ago
1 reply
Did you even look at the article?

Evaluation Benchmarks: Our evaluation encompasses three primary categories of benchmarks, each designed to assess distinct capabilities of the model:

• Language Understanding and Reasoning: HellaSwag [121], ARC-Challenge [14], Winogrande [83], MMLU [36], TriviaQA [47], MMLU-Redux [26], MMLU-Pro [103], GPQA-Diamond [82], BBH [94], and [105].

• Code Generation: LiveCodeBench v6 [44], EvalPlus [60].

• Math & Reasoning: AIME 2025, MATH 500, HMMT 2025, PolyMath-en.

• Long-context: MRCR, RULER [38], Frames [52], HELMET-ICL [118], RepoQA [61], Long Code Arena [13] and LongBench v2 [6].

• Chinese Language Understanding and Reasoning: C-Eval [43], and CMMLU [55].

lostmsu
24 days ago
What article? The README and the linked tech report don't list MMLU results.
softwaredoug
27 days ago
4 replies
Everyone is worried about AI data centers destroying the planet with their extreme energy needs. Though it seems we still have a big learning curve ahead in making AI inference and training more efficient.

How likely are we to NOT see the AI data center apocalypse through better algorithms?

wongarsu
27 days ago
1 reply
We have already seen huge efficiency increases over the last two years. Small models have become increasingly capable, the minimum viable model size for simple tasks keeps shrinking, and proprietary model providers have long stopped talking about new milestones in model sizes and instead achieved massive price cuts through methods they largely keep quiet about (but that almost certainly include smaller models and intelligent routing to different model sizes)

But so far this has just led to more induced demand. There are a lot of things we would use LLMs for if they were just cheap enough, and every increase in efficiency makes more of those use cases viable.

naasking
26 days ago
At some threshold, efficiency gains let models move out of the data center though.
m00x
26 days ago
1 reply
I don't think this worry is widespread, or even warranted. China has been able to more than double the US in energy production without massive effects on the environment by using nuclear, solar, and hydro.

If anything, the US is massively underproducing.

ted_dunning
24 days ago
The folks choking on the air pollution in Beijing might think differently.
naasking
26 days ago
> How likely are we to NOT see the AI data center apocalypse through better algorithms?

Near certain IMO. Algorithmic improvements have outpaced hardware improvements for decades. We're already seeing the rise of small models and how simple tweaks can make small models very capable problem solvers, better even than state of the art large models. Data center scaling is nearing its peak IMO as we're hitting data limits which cap model size anyway.

simgt
27 days ago
Without policies, gains in efficiency are always compensated by increased demand. Global energy consumption by source is a good example, we've never consumed as much coal as now even though we have alternatives.

https://ourworldindata.org/global-energy-200-years

andai
26 days ago
How does Gemini have a million token context window?
adt
27 days ago
https://lifearchitect.ai/models-table/
View full discussion on Hacker News
ID: 45766937 · Type: story · Last synced: 11/20/2025, 5:02:38 PM
