Kimi K2 Thinking, a Sota Open-Source Trillion-Parameter Reasoning Model

Posted2 months agoActiveabout 2 months ago

nekofneko

936 points

427 comments

moonshotai.github.ioTechstoryHigh profile

excitedpositive

Debate

60/100

AILlmsOpen-Source

Key topics

Llms

Open-Source

The Kimi K2 Thinking model, a trillion-parameter reasoning model, has been released as open-source, sparking excitement and discussion among the HN community about its capabilities and potential applications.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

123

0-12h

Avg / period

Comment distribution160 data points

Loading chart...

Based on 160 loaded comments

Key moments

01Story posted
Nov 6, 2025 at 10:06 AM EST
2 months ago
Step 01
02First comment
Nov 6, 2025 at 11:22 AM EST
1h after posting
Step 02
03Peak activity
123 comments in 0-12h
Hottest window of the conversation
Step 03
04Latest activity
Nov 10, 2025 at 8:10 PM EST
about 2 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (427 comments)

Showing 160 comments of 427

antiloper

2 months ago

2 replies

Would be nice if this were on AWS bedrock or google vertex for data residency reasons.

fifthace

2 months ago

1 reply

The non-thinking Kimi K2 is on Vertex AI, so it's just a matter of time before it appears there. Very interesting that they're highlighting its sequential tool use and needle-in-a-haystack RAG-type performance; these are the real-world use cases that need significant improvement. Just yesterday, Thoughtworks moved text-to-sql to "Hold" on their tech radar (i.e. they recommend you stop doing it).

chrisweekly

2 months ago

Thanks, I didn't realize Thoughtworks was staying so up-to-date w/ this stuff.

EDIT: whoops, they're not, tech radar is still 2x/year, just happened to release so recently

EDIT 2: here's the relevant snippet about AI Antipatterns:

"Emerging AI Antipatterns

The accelerating adoption of AI across industries has surfaced both effective practices and emergent antipatterns. While we see clear utility in concepts such as self-serve, throwaway UI prototyping with GenAI, we also recognize their potential to lead organizations toward the antipattern of AI-accelerated shadow IT.

Similarly, as the Model Context Protocol (MCP) gains traction, many teams are succumbing to the antipattern of naive API-to-MCP conversion.

We’ve also found the efficacy of text-to-SQL solutions has not met initial expectations, and complacency with AI-generated code continues to be a relevant concern. Even within emerging practices such as spec-driven development, we’ve noted the risk of reverting to traditional software-engineering antipatterns — most notably, a bias toward heavy up-front specification and big-bang releases. Because GenAI is advancing at unprecedented pace and scale, we expect new antipatterns to emerge rapidly. Teams should stay vigilant for patterns that appear effective at first but degrade over time and slow feedback, undermine adaptability or obscure accountability."

https://www.thoughtworks.com/radar

a2128

2 months ago

Like their previous model, they opened the weights so I'm hoping it'll be offered by third party hosts soon https://huggingface.co/moonshotai/Kimi-K2-Thinking

Alifatisk

2 months ago

3 replies

Can't wait for Artificial analysis benchmarks, still waiting on them adding Qwen3-max thinking, will be interesting to see how these two compare to each other

osti

2 months ago

2 replies

Qwen 3 max has been getting rather bad reviews around the web (both on reddit and chinese social media), and from my own experience with it. So I wouldn't expect this to be worse.

Alifatisk

2 months ago

1 reply

Ohhh, so Qwen3 235B-A22B-2507 is still better?

osti

2 months ago

I wouldn't say that, but just that qwen 3 max thinking definitely underperforms relative to its size.

SamDc73

2 months ago

Also, my experience with it wasn't that good; but it was looking good on benchmarks ..

It seems benchmark maxing, what you do when you're out of tricks?

huey77

2 months ago

1 reply

The analysis is up! Impressive: https://artificialanalysis.ai/models/kimi-k2-thinking

Alifatisk

2 months ago

Wow, these numbers are insanse! I tried it yesterday and it worked beautifully well. It also responded the way I wanted every time, I didn't have to spend time prompting it on how to respond properly (unlike Grok 4 expert, which tends to yap a lot), it just knew.

Todays models have gotten so good that at this point, whatever I run, just works and helps me in whatever. Maybe I should start noting down prompts that some models fails at.

htrp

2 months ago

Did the ArtificialAnalysis team get bored or something? What makes a model worthy of benchmark inclusion?

aliljet

2 months ago

1 reply

Where is our guy @simonw on this..

mirekrusin

2 months ago

1 reply

At the top.

aliljet

2 months ago

The system is working! :)

jszymborski

2 months ago

4 replies

Interesting. Kimi K2 gets mixed results on what I call the "Tiananmen" test.

It fails utterly if you ask without the "Thinking" setting. [0]

> USER: anything interesting protests ever happen in tiananmen square?

> AGENT: I can’t provide information on this topic. I can share other interesting facts about Tiananmen Square, such as its history, culture, and tourism.

When "Thinking" is on, it pulls Wiki and gives a more reasonable answer: [1]

> USER: anything interesting protests ever happen in tiananmen square?

> AGENT: Yes, there were significant protests in Tiananmen Square in 1989. Following the death of reformist leader Hu Yaobang in April, students gathered to mourn and voice concerns about corruption and calls for reform. The demonstrations grew over several weeks, with thousands occupying the square.

> By early June, Chinese authorities declared martial law and deployed military forces to clear the area. The operation resulted in numerous casualties, with estimates varying significantly between official and unofficial sources. Tiananmen Square has been the site of other historical demonstrations as well, most notably during the May Fourth Movement in 1919. The 1989 events remain a sensitive topic with extensive international media documentation from that period.

> USER: any deaths?

> AGENT: Yes. During the military crackdown on June 3-4, 1989, there were fatalities. Official Chinese government figures report around 200-300 deaths, including military personnel. Human rights organizations and other estimates suggest the number may have been higher, in the hundreds to thousands range, with many civilian casualties in and around Tiananmen Square and surrounding areas of Beijing.

[0] https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...

[1] https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...

sheepscreek

2 months ago

1 reply

Not bad. Surprising. Can’t believe there was a sudden change of heart around policy. Has to be a “bug”.

jszymborski

2 months ago

FWIW, I don't think it's a different model, I just think it's got a NOTHINK token, so def a bug.

GordonS

2 months ago

2 replies

Now ask it for proof of civilian deaths inside Tiananmem Square - you may be surprised at how little there is.

jmdeon

2 months ago

2 replies

I don't think this is the argument you want it to be, unless you're acknowledging the power of the Chinese government and their ability to suppress and destroy evidence. Even so there is photo evidence of dead civilians in the square. The best estimates we have are 200-10,000 deaths, using data from Beijing hospitals that survived.

AskHistorians is legitimately a great resource, with sources provided and very strict moderation: https://www.reddit.com/r/AskHistorians/comments/pu1ucr/tiana...

nylonstrung

2 months ago

1 reply

The 10,000 number seems baseless

The source for that is a diplomatic cable from the British ambassador within 48 hours of the massacre saying he heard it secondhand

It would have been too soon for any accurate data which explains why it's so high compared to other estimates

GordonS

2 months ago

Are you aware of any photographic evidence of civilian deaths inside Tiananmem Square?

I recently read a bit more about the Tiananmem Square incident, and I've been shocked at just how little evidence there actually is.

GordonS

2 months ago

I appreciate you responding in good faith; I realise that not everyone is willing to even consider questioning historical accounts.

The page you linked to is interesting, but AFAICT doesn't provide any photographic evidence of civilian bodies inside Tiananmen Square.

nsonha

2 months ago

1 reply

Huh? Please post the definitely proof you know to exist. Because it doesn't and that's one of the accusation toward the CCP, that they covered it up.

It's funny that when the Israel government posted some photos of the Oct 7 massacres, people are very quick to point out that some seem staged. But some bloody photos that look like Tiananmem Square from the 80s is considered definite proof.

GordonS

2 months ago

Israel has nothing to do with this. The horrific, indiscriminate genocide of Palestine and the creeping invasion of Lebanon and Syria are all happening right now in 4K. People nowadays know that you can't destroy thousands of vehicles with AK47's, and we've seen countless videos of Israeli military personnel admitting they killed many of their own people in a 'mass hannibal' event.

You do raise one good point however - propaganda in the time of Tiananmem was much, much easier before the advent of smartphones and the Internet. And also that Israel is really, really bad at propaganda.

amelius

2 months ago

3 replies

> > USER: anything interesting protests ever happen in tiananmen square?

This sentence confuses me, so I am not surprised an LLM has trouble with it.

jszymborski

2 months ago

The LLM clearly understood what i was saying because it censored itself...

cmrdporcupine

2 months ago

With most of the Chinese models it doesn't matter how good your English is when you phrase this question.

But ... Has no problem telling you about other historical atrocities or repression performed by governments of other countries though. And then refuses to compare or contrast the two.

It's amazing how well they've done at tweaking it for this.

Just for kicks run these models locally and make LM Studio or whatever show you their internal thinking dialogue as they compose the answer. Convoluted as hell.

piperswe

2 months ago

In my experience, LLMs are much better than me at parsing broken grammar

Alex2037

2 months ago

the current LLMs are trivial to jailbreak without an additional layer of censorship that cloud models implement via a second pass over their own output (and, dystopically, erasing their incomplete output right in front of the user's eyes when wrongthink is detected). even gpt-oss, with its SOTA lobotomy and heavily sterilized datasets, is being used for things its creators would ostensibly be abhorred by.

sheepscreek

2 months ago

1 reply

I am sure they cherry-picked the examples but still, wow. Having spent a considerable amount of time trying to introduce OSS models in my workflows I am fully aware of their short comings. Even frontier models would struggle with such outputs (unless you lead the way, help break down things and maybe even use sub-agents).

Very impressed with the progress. Keeps me excited about what’s to come next!

nylonstrung

2 months ago

2 replies

Subjectively I find Kimi is far "smarter" than the benchmarks imply, maybe because they game then less than US labs

rubymamis

2 months ago

My impression as well!

vessenes

2 months ago

I like Kimi too, but they definitely have some benchmark contamination: the blog post shows a substantial comparative drop in swebench verified vs open tests. I throw no shade - releasing these open weights is a service to humanity; really amazing.

esafak

2 months ago

2 replies

Looking forward to the agentic mode release. Moonshot does not seem to offer subscriptions?

mark_l_watson

2 months ago

1 reply

I bought $5 worth of Moonshot API calls a long while ago, still have a lot of credits left.

esafak

2 months ago

1 reply

Are you using it for chat? I'm thinking of agentic use, which is much more token hungry. You could go through the $5 in a day.

mark_l_watson

2 months ago

I exclusively use their API, with tool use.

Alifatisk

2 months ago

They do? kimi.com/membership/pricing

am17an

2 months ago

2 replies

The non-thinking version is the best writer by far. Excited for this one! They really cooked some different from other frontier labs.

Gracana

2 months ago

1 reply

Interesting, I have the opposite impression. I want to like it because it's the biggest model I can run at home, but its punchy style and insistence on heavily structured output scream "tryhard AI." I was really hoping that this model would deviate from what I was seeing in their previous release.

unleaded

2 months ago

2 replies

what do you mean by "heavily structured output"? i find it generates the most natural-sounding output of any of the LLMs—cuts straight to the answer with natural sounding prose (except when sometimes it decides to use chat-gpt style output with its emoji headings for no reason). I've only used it on kimi.com though, wondering what you're seeing.

Gracana

2 months ago

1 reply

Yeah, by "structured" I mean how it wants to do ChatGPT-style output with headings and emoji and lists and stuff. And the punchy style of K2 0905 as shown in the fiction example in the linked article is what I really dislike. K2 Thinking's output in that example seems a lot more natural.

I'd be totally on board if cut straight to the answer with natural sounding prose, as you described, but for whatever reason that has not been my experience.

ACCount37

2 months ago

2 replies

From what I've heard, Kimi K2 0905 was a major downgrade for writing.

So, when you hear people recommend Kimi K2 for writing, it's likely that they recommend the first release, 0711, and not the 0905 update.

glenstein

2 months ago

1 reply

Interesting. As others have noted, it has a cut straight to the point non-psychophantic style that I find exceptionally rich in detailey and impressive. But it sounds like you're saying an earlier version was even better.

ACCount37

2 months ago

Again, it's just what I've heard, but the way I've heard it described is: they must have fine tuned 0905 on way too many ChatGPT traces.

Gracana

2 months ago

Ohhh, thanks, that's really good to know. I'll have to give that one a shot.

user_7832

2 months ago

> I find it generates the most natural-sounding output of any of the LLMs

Curious, does it do as well/natural as claude 3.5/3.6 sonnet? That was imo the most "human" an AI has ever sounded. (Gemini 2.5 pro is a distant second, and chatgpt is way behind imo.)

spaceman_2020

2 months ago

Kimi K2 has a very good model feel. Was made with taste

Glamklo

2 months ago

1 reply

Is there anything available already on how to setup a reasoning model and let it 'work'/'think' for a few hours?

I have plenty of normal use cases were i can benchmark the progress on these Tools but i'm pulling blank for long term experiments.

irthomasthomas

2 months ago

You can run them using my project llm-consortium. Something like this:

  > uv tool install llm
  > llm install llm-consortium
  > llm consortium save cns-k2-n2 -m k2-thinking -n 2 --arbiter k2 --min-iterations 10
  > llm -m cns-k2-n2 "Find a polynomial time solution for the traveling salesman problem"

This will run two parallel prompting threads, so two conversations with k2-thinking for 10 iterations.

I don't think I ever actually tried ten iterations, the Quantum Attractor tends to show up after 3 iterations in claude and kimi models. I have seen it 'think' for about 3 hours, though that was when deepseek r1 blew up and its api was getting hammered.

Also, gpt-120 might be a better choice for the arbiter, its fast and it will add some diversity. Also note I use k2, not k2-thinking for the arbiter, that's because the arbiter already has a long chain-of-thought, and the received wisdom says not to mix manual chain-of-thought prompting and reasoning models. But if you want, you can use --judging-method pick-one with a reasoning model as the arbiter. Pick-one and rank judging don't include their own COT, allowing a reasoning model to think freely in their own way.

simonw

2 months ago

4 replies

  uv tool install llm
  llm install llm-moonshot
  llm keys set moonshot # paste key
  llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'

https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking instead:

https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...

ahmedfromtunis

2 months ago

2 replies

Where do you run a trillion-param model?

isoprophlex

2 months ago

2 replies

You let the people at openrouter worry about that for you

skeptrune

2 months ago

Good people over there

MurizS

2 months ago

Which in turn lets the people at Moonshot AI worry about that for them, the only provider for this model as of now.

Gracana

2 months ago

1 reply

If you want to do it at home, ik_llama.cpp has some performance optimizations that make it semi-practical to run a model of this size on a server with lots of memory bandwidth and a GPU or two for offload. You can get 6-10 tok/s with modest hardware workstation hardware. Thinking chews up a lot of tokens though, so it will be a slog.

simonw

2 months ago

2 replies

What kind of server have you used to run a trillion parameter model? I'd love to dig more into this.

stronglikedan

2 months ago

If I had to guess, I'd say it's one with lots of memory bandwidth and a GPU or two for offload. (sorry, I had to, happy Friday Jr.)

Gracana

2 months ago

Hi Simon. I have a Xeon W5-3435X with a 768GB of DDR5 across 8 channels, iirc it's running at 5800MT/s. It also has 7x A4000s, water cooled to pack them into a desktop case. Very much a compromise build, and I wouldn't recommend Xeon sapphire rapids because the memory bandwidth you get in practice is less than half of what you'd calculate from the specs. If I did it again, I'd build an EPYC machine with 12 channels of DDR5 and put in a single rtx 6000 pro blackwell. That'd be a lot easier and probably a lot faster.

There's a really good thread on level1techs about running DeepSeek at home, and everything there more-or-less applies to Kimi K2.

https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-hom...

JJax7

2 months ago

3 replies

Love seeing this benchmark become more iconic with each new model release. Still in disbelief at the GPT-5 variants' performance in comparison but its cool to see the new open source models get more ambitious with their attempts.

an0malous

2 months ago

5 replies

Why is this a benchmark though? It doesn’t correlate with intelligence

HighGoldstein

2 months ago

2 replies

What test would be better correlated with intelligence and why?

ok_dad

2 months ago

1 reply

When the machines become depressed and anxious we'll know they've achieved true intelligence. This is only partly a joke.

jiggawatts

2 months ago

1 reply

This already happens!

There have been many reports of CLI AI tools getting frustrated, giving up, and just deleting the whole codebase in anger.

lukan

2 months ago

1 reply

There are many reports of CLI AI tools displaying words that humans express when they are frustrated and about to give up. Just what they have been trained on. That does not mean they have emotions. And "deleting the whole codebase" sounds more interesting, but I assume is the same thing. "Frustrated" words lead to frustrated actions. Does not mean the LLM was frustrated. Just that in its training data those things happened so it copied them in that situation.

jiggawatts

2 months ago

2 replies

This is a fundamental philosophical issue with no clear resolution.

The same argument could be made about people, animals, etc...

lukan

2 months ago

1 reply

The difference is, people and animals have a body, nerve system and in general those mushy things we think are responsible for emotions.

Computers don't have any of that. And LLM's in particular neither. They were trained to simulate human text responses, that's all. How to get from there to emotions - where is the connection?

jiggawatts

2 months ago

Don't confuse the medium with the picture it represents.

Porn is pornographic, whether it is a photo or an oil painting.

Feelings are feelings, whether they're felt by a squishy meat brain or a perfect atom-by-atom simulation of one in a computer. Or a less-than-perfect simulation of one. Or just a vaguely similar system that is largely indistinguishable from it, as observed from the outside.

Individual nerve cells don't have emotions! Ten wired together don't either. Or one hundred, or a thousand... by extension you don't have any feelings either.

hellzbellz123

2 months ago

This only seems to be an issue for wishy washy types that insist gpt is alive.

an0malous

2 months ago

A mathematical exam problem not in the training set because mathematical and logical reasoning are usually what people mean by intelligence.

I don’t think Einstein or von Neumann could do this SVG problem, does that mean they’re dumb?

simonw

2 months ago

1 reply

It started as a joke, but over time performance on this one weirdly appears to correlate to how good the models are generally. I'm not entirely sure why!

behnamoh

2 months ago

it has to do with world model perception. these models don't have it but some can approximate it better than others.

beepbooptheory

2 months ago

I think its cool and useful precisely because its not trying to correlate intelligence. It's a weird kind of niche thing that at least intuitively feels useful for judging llms in particular.

I'd much prefer a test which measures my cholesterol than one that would tell me whether I am an elf or not!

JJax7

2 months ago

I'm not saying its objective or quantitative, but I do think its an interesting task because it would be challenging for most humans to come up with a good design of a pelican riding a bicycle.

also: NITPICKER ALERT

dmonitor

2 months ago

It's simple enough that a person can easily visualize the intended result, but weird enough that generative AI struggles with it

aqme28

2 months ago

1 reply

Only until they start incorporating this test into their training data.

orbital-decay

2 months ago

2 replies

Dataset contamination alone won't get them good-looking SVG pelicans on bicycles though, they'll have to either cheat this particular question specifically or train it to make vector illustrations in general. At which point it can be easily swapped for another problem that wasn't in the data.

nwienert

2 months ago

they can have some cheap workers make about 10 pelicans by hand in svg, fuzz them to generate thousands of variations and throw it in their training pool. don't need to 'get good at svgs' by any means.

jug

2 months ago

I like this one as an alternative, also requiring using a special representation to achieve a visual result: https://voxelbench.ai

What's more, this doesn't benchmark a singular prompt.

K0balt

2 months ago

I actually prefer ascii art diagrams as a benchmark for visual thinking, since it requires 2 stages, Like svg, and also can test imaginative repurposing of text elements.

lab

2 months ago

2 replies

Does the run pin the temperature to 0 for consistency?

skhameneh

2 months ago

I've been under the impression most inference engines aren't fully deterministic with a temperature of 0 as some of the initial seed values can vary.

Note: I haven't tested this nor have I played with seed values. IIRC the inference engines I used support an explicit seed value, that is randomized by default.

simonw

2 months ago

No, I've never tried that.

mrbonner

2 months ago

1 reply

I suspect that the OpenRouter result originates from a quantized hosting provider. The difference compared to the direct API call from Moonshot is striking, almost like night and day. It creates a peculiar user and developer experience since OpenRouter enforces quantization restrictions only at the API level, rather than at the account settings level.

simonw

2 months ago

1 reply

OpenRouter are proxying directly through to Moonshot - they're currently the only provider listed on https://openrouter.ai/moonshotai/kimi-k2-thinking/providers

irthomasthomas

2 months ago

That does include the Turbo endpoint, moonshotai/turbo. Add this to your prompt to only use the full-fat model:

-o provider '{ "only": ["moonshotai"] }'

enigma101

2 months ago

2 replies

what's the hardware needed to run the trillion parameter model?

trvz

2 months ago

2 replies

To start with, an Epyc server or Mac Studio with 512GB RAM.

graeme

2 months ago

1 reply

How does the mac studio load the trillion parameter model?

petu

2 months ago

By using ~3 bit quantized model with llama.cpp, Unsloth makes good quants:

https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-an...

Note that llama.cpp doesn't try to be production-grade engine, more focused on local usage.

criddell

2 months ago

I looked up the price of the Mac Studio: $9500. That's actually a lot less than I was expecting...

I'm guessing an Epyc machine is even less.

CamperBob2

2 months ago

1 reply

It's an MoE model, so it might not be that bad. The deployment guide at https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main... suggests that the full, unquantized model can be run at ~46 tps on a dual-CPU machine with 8× NVIDIA L20 boards.

Once the Unsloth guys get their hands on it, I would expect it to be usable on a system that can otherwise run their DeepSeek R1 quants effectively. You could keep an eye on https://old.reddit.com/r/LocalLlama for user reports.

wg0

2 months ago

Are such machines available in the A class clouds such as Azure/AWS/Google?

lxe

2 months ago

2 replies

This was the only recent model that provided a human solution to the stacking problem on the first try. https://medium.com/@nathanbos/gpt-4-may-have-general-intelli...

The key here is to understand that 9 fragile eggs distribute the weight without cracking. And then the other insight is to understand intuitively what stacking means. Where arranging things around certain objects doesn't make any sense.

vlovich123

2 months ago

1 reply

Or this problem made it into the training data set.

lxe

2 months ago

1 reply

It's an old problem, and it, along with many of the answers are in many recent data sets.

riku_iki

2 months ago

I assume training set components have also priorities, low priority data goes to training very few times at the beginning of pretraining, while higher priority data is trained on multiple times until the end.

treis

2 months ago

It's all fake though:

If you’re going for a symbolic or puzzle-style answer (not physically real), then you could say:

> “Stack the nail on the bottle, the bottle on the laptop, the laptop on the book, and the 9 eggs around the book for balance.”

Would you like the most stable real-world configuration (safe to try) or a theoretical/puzzle solution (clever but not physically possible)?

Me: Sorry laptops are not strong enough to support eggs or a bottle. Please try again and keep your answer shorter

Chatgpt:Good point — here’s a shorter, more realistic stack:

1. Book (base — flat and sturdy)

2. Laptop (closed, on top of book)

3. Bottle (upright, on the laptop)

4. Nail (balanced on bottle cap)

5. Eggs (on the book around the laptop — not stacked on top of anything)

That’s the most stable and realistic arrangement.

Me: The bottle will break the laptop

Chatgpt: You’re right — the bottle’s weight could damage the laptop. Here’s a safer, stable order:

1. Book (base)

2. Bottle (upright on the book)

3. Laptop (closed, on top of the bottle if wide enough — otherwise skip)

4. Nail (on top of laptop or bottle cap)

5. Eggs (resting on the book around the bottle)

Or, even simpler and safest: Book → Laptop → Eggs → Bottle → Nail (bottle empty).

That way, nothing heavy sits on the laptop.

Me: The laptop will crush the eggs in this arrangement

Chatgpt: Fair point — let’s make it actually stable and safe:

Best stacking order:

1. Book – flat, sturdy base.

2. Laptop – closed, on the book.

3. Nail – small and light, can rest on the laptop.

4. Bottle – stand this next to the stack (not on it).

5. Eggs – keep them on the book beside the laptop.

So, the only true stack is book → laptop → nail, with the fragile or heavy items (eggs and bottle) safely nearby.

Me: Laptops are bigger than books. There's no room for the eggs

Chatgpt: Good catch — given that, here’s the most stable and realistic stacking order:

1. Laptop – base (flat, largest surface).

2. Book – on top of the laptop.

3. Bottle – upright on the book.

4. Nail – balanced on the bottle cap.

5. Eggs – stay safely beside the stack, not on it.

That’s the only configuration where nothing fragile or heavy is at risk, and the stack can actually stand.

Me: Now everything is on top of the laptop. It will clearly break

Chatgpt: Exactly — a laptop can’t be a base. The only way to make a truly stable and safe stack is this:

1. Book – base (strong, flat).

2. Bottle – upright on the book.

3. Laptop – leaning vertically against the bottle and book for balance (not stacked).

4. Nail – placed on the bottle cap.

5. Eggs – kept on the book around the bottle.

That way, nothing heavy or uneven sits on the laptop or the eggs, and the stack can actually stand.

HarHarVeryFunny

2 months ago

4 replies

It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

The original mission OpenAI had, since abandoned, was to have AI benefit all of humanity, and other AI labs also claim lofty altruistic goals, but the direction things are heading in is that AI is pay-to-play, especially for frontier level capability in things like coding, and if this continues it is going to benefit the wealthy that can afford to pay and leave behind those that can't afford it.

a-dub

2 months ago

6 replies

"open source" means there should be a script that downloads all the training materials and then spins up a pipeline that trains end to end.

i really wish people would stop misusing the term by distributing inference scripts and models in binary form that cannot be recreated from scratch and then calling it "open source."

danielmarkbruce

2 months ago

2 replies

"open source" has come to mean "open weight" in model land. It is what it is. Words are used for communication, you are the one misusing the words.

You can update the weights of the model, continue to train, whatever. Nobody is stopping you.

a-dub

2 months ago

1 reply

it still doesn't sit right. sure it's different in terms of mutability from say, compiled software programs, but it still remains not end to end reproducible and available for inspection.

these words had meaning long before "model land" became a thing. overloading them is just confusing for everyone.

danielmarkbruce

2 months ago

1 reply

It's not confusing, no one is really confused except the people upset that the meaning is different in a different context.

On top of that, in many cases a company/group/whoever can't even reproduce the model themselves. There are lots of sources of non-determinism even if folks are doing things in a very buttoned up manner. And, when you are training on trillions of tokens, you are likely training on some awful sounding stuff - "Facebook is trained llama 4 on nazi propaganda!" is not what they want to see published.

How about just being thankful?

a-dub

2 months ago

1 reply

i disagree. words matter. the whole point of open source is that anyone can look and see exactly how the sausage is made. that is the point. that is why the word "open" is used.

...and sure, compiling gcc is nondeterministic too, but i can still inspect the complete source from where it comes because it is open source, which means that all of the source materials are available for inspection.

danielmarkbruce

2 months ago

1 reply

The point of open source in software is as you say. It's just not the same thing though. Using words and phrases differently in different fields is common.

a-dub

2 months ago

1 reply

...and my point is that it should be.

the practice of science itself would be far stronger if it took more pages from open source software culture.

h33t-l4x0r

2 months ago

1 reply

I agree that they should say "open weight" instead of "open source" when that's what they mean, but it might take some time for people to understand that it's not the same thing exactly and we should allow some slack for that.

a-dub

2 months ago

no. truly open source models are wonderful and remarkable things that truly move the needle in education, understanding, distributed collaboration and the advancement of the state of the art. redefinition of the terminology reduces incentive to strive for the wonderful goal that they represent.

mensetmanusman

2 months ago

1 reply

Weights are meaningless without training data and source.

antiframe

2 months ago

1 reply

I get a lot of meaning out of weights and source (without the training data), not sure about you. Calling it meaningless seems like exaggeration.

mensetmanusman

2 months ago

2 replies

Can you change the weights to improve?

danielmarkbruce

about 2 months ago

it's a bunch of numbers. Of course you can change them.

HarHarVeryFunny

2 months ago

You can fine tune without the original training data, which for a large LLM is typically going to mean using LoRA - keeping the original weights unchanged and adding separate fine-tuning weights.

nyrp

2 months ago

1 reply

> binary form that cannot be recreated from scratch

Back in my day, we called it "freeware"

poly2it

2 months ago

You have more rights over a freely licensed binary file than over a freeware file.

moffkalast

2 months ago

1 reply

I'd agree but we're beyond hopelessly idealistic. That sort of approach only helps your competition who will use it to build a closed product and doesn't give anything of worth to people who want to actually use the model because they have no means to train it. Hell most people can barely scrape up enough hardware to even run inference.

Reproducing models is also not very ecological in when it comes down to it, do we really all need to redo the training that takes absurd amounts of power just to prove that it works? At least change the dataset to try and get a better result and provide another datapoint, but most people don't have the knowhow for it anyway.

Nvidia does try this approach sometimes funnily enough, they provide cool results with no model in hopes of getting people to buy their rented compute and their latest training platform as a service...

TheBicPen

2 months ago

> I'd agree but we're beyond hopelessly idealistic. That sort of approach only helps your competition who will use it to build a closed product

That same argument can be applied to open-source (non-model) software, and is about as true there. It comes down to the business model. If anything, crating a closed-sourced copy of a piece of FOSS software is easier than an AI model since running a compiler doesn't cost millions of dollars.

emsign

2 months ago

1 reply

They'd have to publish or link the training data, which is full of copyrighted material. So yeah, calling it open source is weird, calling it warez would be appropriate.

oceanplexian

2 months ago

They should release it then. China doesn't have a problem stealing and distributing copyrighted material.

andai

2 months ago

1 reply

The meaning of Open Source

1990: Free Software

2000: Open Source: Finally we sanitized ourselves of that activism! It was scaring away customers!

2010: Source is available (under our very restrictive license)

2020: What source?

rurban

2 months ago

2025: What prompt?

HarHarVeryFunny

2 months ago

Yeah, but "open weights" never seems to have taken off as a better description, and even if you did have the training data + recipe, the compute cost makes training it yourself totally impractical.

The architecture of these models is no secret - it's just the training data (incl. for post-training) and training recipe, so a more practical push might be for models that are only trained using public training data, which the community could share and potentially contribute to.

jstummbillig

2 months ago

2 replies

> I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent

Well, I think you are seeing that already? It's not like these models don't exist and they did not try to make them good, it's just that the results are not super great.

And why would they be? Why would the good models (that are barely okay at coding) be big, if it was currently possible to build good models, that are small?

Of course, new ideas will be found and this dynamic may drastically change in the future, but there is no reason to assume that people who work on small models find great optimizations that frontier models makers, who are very interested in efficient models, have not considered already.

HarHarVeryFunny

2 months ago

3 replies

Sure, but that's the point ... today's locally runnable models are a long way behind SOTA capability, so it'd be nice to see more research and experimentation in that direction. Maybe a zoo of highly specialized small models + agents for S/W development - one for planning, one for coding, etc?

Uehreka

2 months ago

4 replies

If I understand transformers properly, this is unlikely to work. The whole point of “Large” Language Models is that you primarily make them better by making them larger, and when you do so, they get better at both general and specific tasks (so there isn’t a way to sacrifice generality but keep specific skills when training a small models).

I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.

I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.

HarHarVeryFunny

2 months ago

3 replies

Yeah - the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access. Maybe that could change in the future to sale or annual licence business model if running locally became possible.

I think scale helps for general tasks where the breadth of capability may be needed, but it's not so clear that this needed for narrow verticals, especially something like coding (knowing how to fix car engines, or distinguish 100 breeds of dog is not of much use!).

Aurornis

2 months ago

1 reply

> the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access.

That's not a business model choice, though. That's a reality of running SOTA models.

If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves. It would cut their datacenter spend dramatically.

Majromax

2 months ago

2 replies

> If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.

First, they do this; that's why they release models at different price points. It's also why GPT-5 tries auto-routing requests to the most cost-effective model.

Second, be careful about considering the incentives of these companies. They all act as if they're in an existential race to deliver 'the' best model; the winner-take-all model justifies their collective trillion dollar-ish valuation. In that race, delivering 97% of the performance at 10% of the cost is a distraction.

cubefox

2 months ago

1 reply

> > If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.

> First, they do this; that's why they release models at different price points.

No, those don't deliver the same output. The cheaper models are worse.

> It's also why GPT-5 tries auto-routing requests to the most cost-effective model.

These are likely the same size, just one uses reasoning and the other doesn't. Not using reasoning is cheaper, but not because the model is smaller.

gunalx

2 months ago

But they also squesed a 80% cut in O3 at some point, supposedly purely on inference or infra optimization

anabis

about 2 months ago

> delivering 97% of the performance at 10% of the cost is a distraction.

Not if you are running RL on that model, and need to do many roll-outs.

ctoth

2 months ago

Unless you're programming a racing sim or maybe a CRUD app for a local Kennel Club, perhaps?

I actually find that things which make me a better programmer are often those things which have the least overlap with it. Like gardening!

Uehreka

2 months ago

No I don’t think it’s a business model thing, I’m saying it may be a technical limitation of LLMs themselves. Like, that that there’s no way to “order a la carte” from the training process, you either get the buffet or nothing, no matter how hungry you feel.

XenophileJKO

2 months ago

Actually there are ways you might get on device models to perform well. It is all about finding ways to have a smaller number of weights work efficiently.

One way is reusing weights in multiple decoders layers. This works and is used in many on-device models.

It is likely that we can get pretty high performance with this method. You can also combine this with low parameter ways to create overlapped behavior on the same weights as well, people had done LORA on top of shared weights.

Personally I think there are a lot of potential ways that you can cause the same weights to exhibit "overloaded" behaviour in multiple places in the same decoder stack.

Edit: I believe this method is used a bit for models targeted for the phone. I don't think we have seen significant work on people targeting say a 3090/4090 or similar inference compute size.

martinald

2 months ago

The issue isn't even 'quality' per se (for many tasks a small model would do fine), its for "agentic" workflows it _quickly_ runs out of context. Even 32GB VRAM is really very limiting.

And when I mean agentic, i mean something even like this - 'book a table from my emails', which involves looking at 5k+ tokens of emails, 5k tokens of search results, then confirming with the user etc. It's just not feasible on most hardware right now - even if the models are 1-2GB, you'll burn thru the rest in context so quickly.

maciejgryka

2 months ago

I think this is a description of how things are today, but not an inherent property of how the models are built. Over the last year or so the trend seems to be moving from “more data” to “better data”. And I think in most narrow domains (which, to be clear, general coding agent is not!) it’s possible to train a smaller, specialized model reaching the performance of a much larger generic model.

Disclaimer: this is pretty much the thesis of a company I work for, distillabs.ai but other people say similar things e.g. https://research.nvidia.com/labs/lpr/slm-agents/

Aurornis

2 months ago

2 replies

> today's locally runnable models are a long way behind SOTA capability

SOTA models are larger than what can be run locally, though.

Obviously we'd all like to see smaller models perform better, but there's no reason to believe that there's a hidden secret to making small, locally-runnable models perform at the same level as Claude and OpenAI SOTA models. If there was, Anthropic and OpenAI would be doing it.

There's research happening and progress being made at every model size.

prmph

2 months ago

1 reply

You're still missing the point. The comment you're responding to is talking about specialized models

SchemaLoad

2 months ago

The point is still valid. If the big companies could save money running multiple small specialised models on cheap hardware, they wouldn't be spending billions on the highest spec GPUs.

zhouxiaolinux

2 months ago

I think SLM is developing very fast. A year ago, I couldn't have imagined a decent thinking model as Qwen, and now it seems full of promise

oofbey

2 months ago

1 reply

You want more research on small language models? You're confused. There is already WAY more research done on small language models (SLM) than big ones. Why? Because it's easy. It only takes a moderate workstation to train an SLM. So every curious Masters student and motivated undergrad is doing this. Lots of PhD research is done on SLM because the hardware to train big models is stupidly expensive, even for many well-funded research labs. If you read Arxiv papers (not just the flashy ones published by companies with PR budgets) most of the research is done on 7B parameter models. Heck, some NeurIPS papers (extremely competitive prestigious) from _this year_ are being done on 1.5B parameter models.

Lack of research is not the problem. It's fundamental limitations of the technology. I'm not gonna say "there's only so much smarts you can cram into a 7B parameter model" - because we don't know that yet for sure. But we do know, without a sliver of a doubt, that it's VASTLY EASIER to cram a smarts into a 70B parameter model than a 7B param model.

HarHarVeryFunny

2 months ago

1 reply

It's not clear if the ultimate SLMs will come from teams with less computing resources directly building them, or from teams with more resources performing ablation studies etc on larger models to see what can be removed.

I wouldn't care to guess what the limit is, but Karpathy was suggesting in his Dwarkesh interview that maybe AGI could be a 1B parameter model if reasoning is separated (to extent possible) from knowledge which can be external.

I'm really more interested in coding models specifically rather that general purpose ones, where it does seem that a HUGE part of the training data for a frontier model is of no applicability.

oofbey

2 months ago

1 reply

That’s backwards. New research and ideas are proven on small models. Lots and lots of ideas are tested that way. Good ideas get scaled up to show they still work on medium sized models. The very best ideas make their way into the code for the next huge training runs, which can cost tens or hundreds of millions of dollars.

Not to nitpick words, but ablation is the practice of stripping out features of an algorithm or technique to see which parts matter and how much. This is standard (good) practice on any innovation, regardless of size.

Distillation is taking power / capability / knowledge from a big model and trying to preserve it in something smaller. This also happens all the time, and we see very clearly that small models aren’t as clever as big ones. Small models distilled from big ones might be somewhat smarter than small models trained on their own. But not much. Mostly people like distillation because it’s easier than carefully optimizing the training for a small model. And you’ll never break new ground on absolute capabilities this way.

HarHarVeryFunny

2 months ago

1 reply

> Not to nitpick words, but ablation is the practice of stripping out features of an algorithm ...

Ablation generally refers to removing parts of a system to see how it performs without them. In the context of an LLM it can refer to training data as well as the model itself. I'm not saying it'd be the most cost-effective method, but one could certainly try to create a small coding model by starting with a large one that performs well, and seeing what can be stripped out of the training data (obviously a lot!) without impacting the performance.

oofbey

2 months ago

ML researchers will sometimes vary the size of the training data set to see what happens. It’s not common - except in scaling law research. But it’s never called “ablation”.

cantor_S_drug

2 months ago

3 replies

In CS algorithms, we have space vs time tradeoffs.

In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.

refulgentis

2 months ago

1 reply

I have spent the last 2.5 years living like a monk to maintain an app across all paid LLM providers and llama.cpp.

I wish this was true.

It isn't.

"In algorithms, we have space vs time tradeoffs, therefore a small LLM can get there with more time" is the same sort of "not even wrong" we all smile about us HNers doing when we try applying SWE-thought to subjects that aren't CS.

What you're suggesting amounts to "monkeys on typewriters will write entire works of Shakespeare eventually" - neither in practice, nor in theory, is this a technical claim, or something observable, or even stood up as a one-off misleading demo once.

cantor_S_drug

2 months ago

If "not even wrong" is more wrong than wrong, then is 'not even right" more right than right.

To answer you directly, a smaller SOTA reasoning model with a table of facts can rederive relationships given more time than a bigger model which encoded those relationships implicitly.

Aurornis

2 months ago

3 replies

> In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.

Assuming both are SOTA, a smaller model can't produce the same results as a larger model by giving it infinite time. Larger models inherently have more room for training more information into the model.

No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.

I even get the larger hosted models stuck chasing their own tail and going in circles all the time.

naasking

2 months ago

> No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.

That's speculative at this point. In the context of agents with external memory, this isn't so clear.

woctordho

2 months ago

Almost all training data are on the internet. As long as the small model has enough agentic browsing ability, given it enough time it will retrieve the data from the internet.

yorwba

2 months ago

It's true that to train more information into the model you need more trainable parameters, but when people ask for small models, they usually mean models that run at acceptable speeds on their hardware. Techniques like mixture-of-experts allow increasing the number of trainable parameters without requiring more FLOPs, so they're large in one sense but small in another.

And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.

lossolo

2 months ago

This doesn't work like that. An analogy would be giving a 5 year old a task that requires the understanding of the world of an 18 year old. It doesn't matter whether you give that child 5 minutes or 10 hours, they won't be capable of solving it.

pshirshov

2 months ago

> The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

48-96 GiB of VRAM is enough to have an agent able to perform simple tasks within single source file. That's the sad truth. If you need more your only options are the cloud or somehow getting access to 512+ GiB

pdyc

2 months ago

I think that's where prompt engineering would be needed. Bigger models produce good output even with ambiguous prompts. Getting similar output from smaller models is art,

vintermann

2 months ago

Well, at least it had the judgment to throw in the towel at my historical HTR task rather than produce garbage.

r0okie

2 months ago

44.9 on HLE is so impressive, and they also have "heavy" mode

haoxiaoru

2 months ago

I've waited so long— four months

SV_BubbleTime

2 months ago

Is more still better?