Kimi K2 Thinking, a Sota Open-Source Trillion-Parameter Reasoning Model
Posted2 months agoActiveabout 2 months ago
moonshotai.github.ioTechstoryHigh profile
excitedpositive
Debate
60/100
AILlmsOpen-Source
Key topics
AI
Llms
Open-Source
The Kimi K2 Thinking model, a trillion-parameter reasoning model, has been released as open-source, sparking excitement and discussion among the HN community about its capabilities and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussionFirst comment
1h
Peak period
123
0-12h
Avg / period
32
Comment distribution160 data points
Loading chart...
Based on 160 loaded comments
Key moments
- 01Story posted
Nov 6, 2025 at 10:06 AM EST
2 months ago
Step 01 - 02First comment
Nov 6, 2025 at 11:22 AM EST
1h after posting
Step 02 - 03Peak activity
123 comments in 0-12h
Hottest window of the conversation
Step 03 - 04Latest activity
Nov 10, 2025 at 8:10 PM EST
about 2 months ago
Step 04
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
ID: 45836070Type: storyLast synced: 11/23/2025, 1:00:33 AM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
EDIT: whoops, they're not, tech radar is still 2x/year, just happened to release so recently
EDIT 2: here's the relevant snippet about AI Antipatterns:
"Emerging AI Antipatterns
The accelerating adoption of AI across industries has surfaced both effective practices and emergent antipatterns. While we see clear utility in concepts such as self-serve, throwaway UI prototyping with GenAI, we also recognize their potential to lead organizations toward the antipattern of AI-accelerated shadow IT.
Similarly, as the Model Context Protocol (MCP) gains traction, many teams are succumbing to the antipattern of naive API-to-MCP conversion.
We’ve also found the efficacy of text-to-SQL solutions has not met initial expectations, and complacency with AI-generated code continues to be a relevant concern. Even within emerging practices such as spec-driven development, we’ve noted the risk of reverting to traditional software-engineering antipatterns — most notably, a bias toward heavy up-front specification and big-bang releases. Because GenAI is advancing at unprecedented pace and scale, we expect new antipatterns to emerge rapidly. Teams should stay vigilant for patterns that appear effective at first but degrade over time and slow feedback, undermine adaptability or obscure accountability."
https://www.thoughtworks.com/radar
It seems benchmark maxing, what you do when you're out of tricks?
Todays models have gotten so good that at this point, whatever I run, just works and helps me in whatever. Maybe I should start noting down prompts that some models fails at.
It fails utterly if you ask without the "Thinking" setting. [0]
> USER: anything interesting protests ever happen in tiananmen square?
> AGENT: I can’t provide information on this topic. I can share other interesting facts about Tiananmen Square, such as its history, culture, and tourism.
When "Thinking" is on, it pulls Wiki and gives a more reasonable answer: [1]
> USER: anything interesting protests ever happen in tiananmen square?
> AGENT: Yes, there were significant protests in Tiananmen Square in 1989. Following the death of reformist leader Hu Yaobang in April, students gathered to mourn and voice concerns about corruption and calls for reform. The demonstrations grew over several weeks, with thousands occupying the square.
> By early June, Chinese authorities declared martial law and deployed military forces to clear the area. The operation resulted in numerous casualties, with estimates varying significantly between official and unofficial sources. Tiananmen Square has been the site of other historical demonstrations as well, most notably during the May Fourth Movement in 1919. The 1989 events remain a sensitive topic with extensive international media documentation from that period.
> USER: any deaths?
> AGENT: Yes. During the military crackdown on June 3-4, 1989, there were fatalities. Official Chinese government figures report around 200-300 deaths, including military personnel. Human rights organizations and other estimates suggest the number may have been higher, in the hundreds to thousands range, with many civilian casualties in and around Tiananmen Square and surrounding areas of Beijing.
[0] https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...
[1] https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...
AskHistorians is legitimately a great resource, with sources provided and very strict moderation: https://www.reddit.com/r/AskHistorians/comments/pu1ucr/tiana...
The source for that is a diplomatic cable from the British ambassador within 48 hours of the massacre saying he heard it secondhand
It would have been too soon for any accurate data which explains why it's so high compared to other estimates
I recently read a bit more about the Tiananmem Square incident, and I've been shocked at just how little evidence there actually is.
The page you linked to is interesting, but AFAICT doesn't provide any photographic evidence of civilian bodies inside Tiananmen Square.
It's funny that when the Israel government posted some photos of the Oct 7 massacres, people are very quick to point out that some seem staged. But some bloody photos that look like Tiananmem Square from the 80s is considered definite proof.
You do raise one good point however - propaganda in the time of Tiananmem was much, much easier before the advent of smartphones and the Internet. And also that Israel is really, really bad at propaganda.
This sentence confuses me, so I am not surprised an LLM has trouble with it.
But ... Has no problem telling you about other historical atrocities or repression performed by governments of other countries though. And then refuses to compare or contrast the two.
It's amazing how well they've done at tweaking it for this.
Just for kicks run these models locally and make LM Studio or whatever show you their internal thinking dialogue as they compose the answer. Convoluted as hell.
Very impressed with the progress. Keeps me excited about what’s to come next!
I'd be totally on board if cut straight to the answer with natural sounding prose, as you described, but for whatever reason that has not been my experience.
So, when you hear people recommend Kimi K2 for writing, it's likely that they recommend the first release, 0711, and not the 0905 update.
Curious, does it do as well/natural as claude 3.5/3.6 sonnet? That was imo the most "human" an AI has ever sounded. (Gemini 2.5 pro is a distant second, and chatgpt is way behind imo.)
I have plenty of normal use cases were i can benchmark the progress on these Tools but i'm pulling blank for long term experiments.
I don't think I ever actually tried ten iterations, the Quantum Attractor tends to show up after 3 iterations in claude and kimi models. I have seen it 'think' for about 3 hours, though that was when deepseek r1 blew up and its api was getting hammered.
Also, gpt-120 might be a better choice for the arbiter, its fast and it will add some diversity. Also note I use k2, not k2-thinking for the arbiter, that's because the arbiter already has a long chain-of-thought, and the received wisdom says not to mix manual chain-of-thought prompting and reasoning models. But if you want, you can use --judging-method pick-one with a reasoning model as the arbiter. Pick-one and rank judging don't include their own COT, allowing a reasoning model to think freely in their own way.
Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking instead:
https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...
There's a really good thread on level1techs about running DeepSeek at home, and everything there more-or-less applies to Kimi K2.
https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-hom...
There have been many reports of CLI AI tools getting frustrated, giving up, and just deleting the whole codebase in anger.
The same argument could be made about people, animals, etc...
Computers don't have any of that. And LLM's in particular neither. They were trained to simulate human text responses, that's all. How to get from there to emotions - where is the connection?
Porn is pornographic, whether it is a photo or an oil painting.
Feelings are feelings, whether they're felt by a squishy meat brain or a perfect atom-by-atom simulation of one in a computer. Or a less-than-perfect simulation of one. Or just a vaguely similar system that is largely indistinguishable from it, as observed from the outside.
Individual nerve cells don't have emotions! Ten wired together don't either. Or one hundred, or a thousand... by extension you don't have any feelings either.
See also: https://www.mit.edu/people/dpolicar/writing/prose/text/think...
I don’t think Einstein or von Neumann could do this SVG problem, does that mean they’re dumb?
I'd much prefer a test which measures my cholesterol than one that would tell me whether I am an elf or not!
also: NITPICKER ALERT
What's more, this doesn't benchmark a singular prompt.
Note: I haven't tested this nor have I played with seed values. IIRC the inference engines I used support an explicit seed value, that is randomized by default.
-o provider '{ "only": ["moonshotai"] }'
https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-an...
Note that llama.cpp doesn't try to be production-grade engine, more focused on local usage.
I'm guessing an Epyc machine is even less.
Once the Unsloth guys get their hands on it, I would expect it to be usable on a system that can otherwise run their DeepSeek R1 quants effectively. You could keep an eye on https://old.reddit.com/r/LocalLlama for user reports.
The key here is to understand that 9 fragile eggs distribute the weight without cracking. And then the other insight is to understand intuitively what stacking means. Where arranging things around certain objects doesn't make any sense.
If you’re going for a symbolic or puzzle-style answer (not physically real), then you could say:
> “Stack the nail on the bottle, the bottle on the laptop, the laptop on the book, and the 9 eggs around the book for balance.”
Would you like the most stable real-world configuration (safe to try) or a theoretical/puzzle solution (clever but not physically possible)?
Me: Sorry laptops are not strong enough to support eggs or a bottle. Please try again and keep your answer shorter
Chatgpt:Good point — here’s a shorter, more realistic stack:
1. Book (base — flat and sturdy)
2. Laptop (closed, on top of book)
3. Bottle (upright, on the laptop)
4. Nail (balanced on bottle cap)
5. Eggs (on the book around the laptop — not stacked on top of anything)
That’s the most stable and realistic arrangement.
Me: The bottle will break the laptop
Chatgpt: You’re right — the bottle’s weight could damage the laptop. Here’s a safer, stable order:
1. Book (base)
2. Bottle (upright on the book)
3. Laptop (closed, on top of the bottle if wide enough — otherwise skip)
4. Nail (on top of laptop or bottle cap)
5. Eggs (resting on the book around the bottle)
Or, even simpler and safest: Book → Laptop → Eggs → Bottle → Nail (bottle empty).
That way, nothing heavy sits on the laptop.
Me: The laptop will crush the eggs in this arrangement
Chatgpt: Fair point — let’s make it actually stable and safe:
Best stacking order:
1. Book – flat, sturdy base.
2. Laptop – closed, on the book.
3. Nail – small and light, can rest on the laptop.
4. Bottle – stand this next to the stack (not on it).
5. Eggs – keep them on the book beside the laptop.
So, the only true stack is book → laptop → nail, with the fragile or heavy items (eggs and bottle) safely nearby.
Me: Laptops are bigger than books. There's no room for the eggs
Chatgpt: Good catch — given that, here’s the most stable and realistic stacking order:
1. Laptop – base (flat, largest surface).
2. Book – on top of the laptop.
3. Bottle – upright on the book.
4. Nail – balanced on the bottle cap.
5. Eggs – stay safely beside the stack, not on it.
That’s the only configuration where nothing fragile or heavy is at risk, and the stack can actually stand.
Me: Now everything is on top of the laptop. It will clearly break
Chatgpt: Exactly — a laptop can’t be a base. The only way to make a truly stable and safe stack is this:
1. Book – base (strong, flat).
2. Bottle – upright on the book.
3. Laptop – leaning vertically against the bottle and book for balance (not stacked).
4. Nail – placed on the bottle cap.
5. Eggs – kept on the book around the bottle.
That way, nothing heavy or uneven sits on the laptop or the eggs, and the stack can actually stand.
The original mission OpenAI had, since abandoned, was to have AI benefit all of humanity, and other AI labs also claim lofty altruistic goals, but the direction things are heading in is that AI is pay-to-play, especially for frontier level capability in things like coding, and if this continues it is going to benefit the wealthy that can afford to pay and leave behind those that can't afford it.
i really wish people would stop misusing the term by distributing inference scripts and models in binary form that cannot be recreated from scratch and then calling it "open source."
You can update the weights of the model, continue to train, whatever. Nobody is stopping you.
these words had meaning long before "model land" became a thing. overloading them is just confusing for everyone.
On top of that, in many cases a company/group/whoever can't even reproduce the model themselves. There are lots of sources of non-determinism even if folks are doing things in a very buttoned up manner. And, when you are training on trillions of tokens, you are likely training on some awful sounding stuff - "Facebook is trained llama 4 on nazi propaganda!" is not what they want to see published.
How about just being thankful?
...and sure, compiling gcc is nondeterministic too, but i can still inspect the complete source from where it comes because it is open source, which means that all of the source materials are available for inspection.
the practice of science itself would be far stronger if it took more pages from open source software culture.
Back in my day, we called it "freeware"
Reproducing models is also not very ecological in when it comes down to it, do we really all need to redo the training that takes absurd amounts of power just to prove that it works? At least change the dataset to try and get a better result and provide another datapoint, but most people don't have the knowhow for it anyway.
Nvidia does try this approach sometimes funnily enough, they provide cool results with no model in hopes of getting people to buy their rented compute and their latest training platform as a service...
That same argument can be applied to open-source (non-model) software, and is about as true there. It comes down to the business model. If anything, crating a closed-sourced copy of a piece of FOSS software is easier than an AI model since running a compiler doesn't cost millions of dollars.
1990: Free Software
2000: Open Source: Finally we sanitized ourselves of that activism! It was scaring away customers!
2010: Source is available (under our very restrictive license)
2020: What source?
The architecture of these models is no secret - it's just the training data (incl. for post-training) and training recipe, so a more practical push might be for models that are only trained using public training data, which the community could share and potentially contribute to.
Well, I think you are seeing that already? It's not like these models don't exist and they did not try to make them good, it's just that the results are not super great.
And why would they be? Why would the good models (that are barely okay at coding) be big, if it was currently possible to build good models, that are small?
Of course, new ideas will be found and this dynamic may drastically change in the future, but there is no reason to assume that people who work on small models find great optimizations that frontier models makers, who are very interested in efficient models, have not considered already.
I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.
I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.
I think scale helps for general tasks where the breadth of capability may be needed, but it's not so clear that this needed for narrow verticals, especially something like coding (knowing how to fix car engines, or distinguish 100 breeds of dog is not of much use!).
That's not a business model choice, though. That's a reality of running SOTA models.
If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves. It would cut their datacenter spend dramatically.
First, they do this; that's why they release models at different price points. It's also why GPT-5 tries auto-routing requests to the most cost-effective model.
Second, be careful about considering the incentives of these companies. They all act as if they're in an existential race to deliver 'the' best model; the winner-take-all model justifies their collective trillion dollar-ish valuation. In that race, delivering 97% of the performance at 10% of the cost is a distraction.
> First, they do this; that's why they release models at different price points.
No, those don't deliver the same output. The cheaper models are worse.
> It's also why GPT-5 tries auto-routing requests to the most cost-effective model.
These are likely the same size, just one uses reasoning and the other doesn't. Not using reasoning is cheaper, but not because the model is smaller.
Not if you are running RL on that model, and need to do many roll-outs.
I actually find that things which make me a better programmer are often those things which have the least overlap with it. Like gardening!
One way is reusing weights in multiple decoders layers. This works and is used in many on-device models.
It is likely that we can get pretty high performance with this method. You can also combine this with low parameter ways to create overlapped behavior on the same weights as well, people had done LORA on top of shared weights.
Personally I think there are a lot of potential ways that you can cause the same weights to exhibit "overloaded" behaviour in multiple places in the same decoder stack.
Edit: I believe this method is used a bit for models targeted for the phone. I don't think we have seen significant work on people targeting say a 3090/4090 or similar inference compute size.
And when I mean agentic, i mean something even like this - 'book a table from my emails', which involves looking at 5k+ tokens of emails, 5k tokens of search results, then confirming with the user etc. It's just not feasible on most hardware right now - even if the models are 1-2GB, you'll burn thru the rest in context so quickly.
Disclaimer: this is pretty much the thesis of a company I work for, distillabs.ai but other people say similar things e.g. https://research.nvidia.com/labs/lpr/slm-agents/
SOTA models are larger than what can be run locally, though.
Obviously we'd all like to see smaller models perform better, but there's no reason to believe that there's a hidden secret to making small, locally-runnable models perform at the same level as Claude and OpenAI SOTA models. If there was, Anthropic and OpenAI would be doing it.
There's research happening and progress being made at every model size.
Lack of research is not the problem. It's fundamental limitations of the technology. I'm not gonna say "there's only so much smarts you can cram into a 7B parameter model" - because we don't know that yet for sure. But we do know, without a sliver of a doubt, that it's VASTLY EASIER to cram a smarts into a 70B parameter model than a 7B param model.
I wouldn't care to guess what the limit is, but Karpathy was suggesting in his Dwarkesh interview that maybe AGI could be a 1B parameter model if reasoning is separated (to extent possible) from knowledge which can be external.
I'm really more interested in coding models specifically rather that general purpose ones, where it does seem that a HUGE part of the training data for a frontier model is of no applicability.
Not to nitpick words, but ablation is the practice of stripping out features of an algorithm or technique to see which parts matter and how much. This is standard (good) practice on any innovation, regardless of size.
Distillation is taking power / capability / knowledge from a big model and trying to preserve it in something smaller. This also happens all the time, and we see very clearly that small models aren’t as clever as big ones. Small models distilled from big ones might be somewhat smarter than small models trained on their own. But not much. Mostly people like distillation because it’s easier than carefully optimizing the training for a small model. And you’ll never break new ground on absolute capabilities this way.
Ablation generally refers to removing parts of a system to see how it performs without them. In the context of an LLM it can refer to training data as well as the model itself. I'm not saying it'd be the most cost-effective method, but one could certainly try to create a small coding model by starting with a large one that performs well, and seeing what can be stripped out of the training data (obviously a lot!) without impacting the performance.
In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.
I wish this was true.
It isn't.
"In algorithms, we have space vs time tradeoffs, therefore a small LLM can get there with more time" is the same sort of "not even wrong" we all smile about us HNers doing when we try applying SWE-thought to subjects that aren't CS.
What you're suggesting amounts to "monkeys on typewriters will write entire works of Shakespeare eventually" - neither in practice, nor in theory, is this a technical claim, or something observable, or even stood up as a one-off misleading demo once.
To answer you directly, a smaller SOTA reasoning model with a table of facts can rederive relationships given more time than a bigger model which encoded those relationships implicitly.
Assuming both are SOTA, a smaller model can't produce the same results as a larger model by giving it infinite time. Larger models inherently have more room for training more information into the model.
No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.
I even get the larger hosted models stuck chasing their own tail and going in circles all the time.
That's speculative at this point. In the context of agents with external memory, this isn't so clear.
And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.
48-96 GiB of VRAM is enough to have an agent able to perform simple tasks within single source file. That's the sad truth. If you need more your only options are the cloud or somehow getting access to 512+ GiB
267 more comments available on Hacker News