Developers Are Choosing Older AI Models
Posted 2 months ago · Active about 2 months ago
Source: augmentcode.com · Tech story · High profile
Sentiment: calm/mixed · Debate: 70/100
Key topics: AI Models, LLM Usage, Developer Tools
A recent article claims that developers are choosing older AI models over newer ones, sparking a discussion on the reasons behind this trend and the trade-offs between different models.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 7d after posting
- Peak period: 133 comments (Day 7)
- Avg / period: 26.7 comments
- Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
- 01 Story posted: Oct 29, 2025 at 1:08 PM EDT (2 months ago)
- 02 First comment: Nov 5, 2025 at 2:41 AM EST (7d after posting)
- 03 Peak activity: 133 comments in Day 7 (hottest window of the conversation)
- 04 Latest activity: Nov 11, 2025 at 11:22 PM EST (about 2 months ago)
ID: 45749833 · Type: story · Last synced: 11/20/2025, 6:27:41 PM
I hope the people downvoting get some minor joy out of it, I know you need it.
Yes, it's down from 40h/week to 3-5h/week on Max plan, effectively. A real bummer. See my comment here [1] regarding [2].
[1] https://news.ycombinator.com/item?id=45604301
[2] https://github.com/anthropics/claude-code/issues/8449
In my experience sonnet 4.5 is basically pointless, it often gets non-trivial tasks wrong, and for trivial tasks I can use a local model or one of the myriad of providers that give free inference.
EDIT: Holy shit I read the github issue, fuck these people.
> We highly recommend Sonnet 4.5 -- Opus uses rate limits faster, and is not as capable for coding tasks.
They're just straight gaslighting us now lmao.
It's obvious if you've used the two models for any sort of complicated work.
Codex with GPT-5 codex (high thinking) is better than both by a long shot, but takes longer to work. I've fully switched to Codex, and I used Claude Code for the past ~4 months as a daily driver for various things.
I only reach for Sonnet now if Codex gets cagey about writing code -- then I let Sonnet rush ahead, and have Codex align the code with my overall plan.
It's a shame Cerebras completely dropped Qwen3 Coder (fast tool calling, short and instant responses, better speed overall) in favor of GLM 4.6 thinking. Qwen3 is really good at hitting the tools first, then coming up with a well-grounded answer based on reality. Sometimes it's good when a model is Socratic: it just knows it knows nothing.
GLM 4.6, on the other hand, is more self-sufficient: if it sees the problem and knows it, it thinks and thinks and finally just fixes it in one or two shots, so when you hit the jackpot, it's probably an improvement over Q3C. But when it does not get it right, it digs itself into a hole deeper than Olympus Mons.
I don't know, I had a lot of issues with Qwen models when it comes to RooCode/Cline - failed edits (albeit with a requirement for 100% precision, since I don't want the wrong lines to be replaced) or calling tools without parameters (e.g. list_files without path) and also stuff like using wrong path separators or using the wrong commands for the shell that's available (e.g. cmd when Git Bash is the shell).
GLM 4.6 seems better in that regard so far, maybe the coming weeks and months will show that better.
Models are picky enough about prompting styles that changing to a new model every week or month becomes an added chunk of cognitive overload, testing, and experimentation. On top of that, even in developer tooling there have been minor grating changes in API invocations and in the use of parameters like temperature (I have a fairly low-level wrapper for OpenAI, and I had to tweak the JSON handling for GPT-5).
Also, there are just too many variations in API endpoints, providers, etc. We don’t really have a uniform standard. Since I don’t use “just” OpenAI, every single tool I try out requires me to jump through a bunch of hoops to grab a new API key, specify an endpoint, etc.—and it just gets worse if you use a non-mainstream AI endpoint.
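To make that concrete, here is a minimal sketch (not the commenter's actual wrapper) of the kind of thin shim that papers over those differences: it picks an OpenAI-compatible endpoint per provider and drops parameters a given model no longer accepts. The provider table, environment variable names, and the "custom temperature is unsupported" rule are illustrative assumptions, not a definitive implementation.

```python
# Minimal sketch of normalizing provider differences behind one call site.
# Endpoints, env vars, and the NO_TEMPERATURE rule are illustrative assumptions.
import os
import requests

PROVIDERS = {
    "openai":     ("https://api.openai.com/v1/chat/completions", "OPENAI_API_KEY"),
    "openrouter": ("https://openrouter.ai/api/v1/chat/completions", "OPENROUTER_API_KEY"),
}

# Hypothetical: models whose APIs reject or ignore a custom temperature.
NO_TEMPERATURE = {"gpt-5"}

def chat(provider: str, model: str, messages: list[dict], temperature: float = 0.2) -> str:
    url, key_var = PROVIDERS[provider]
    payload = {"model": model, "messages": messages}
    if model not in NO_TEMPERATURE:          # drop params a model won't accept
        payload["temperature"] = temperature
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {os.environ[key_var]}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

With something like this in place, switching providers or models becomes a one-line change instead of another round of key-and-endpoint hoop jumping.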
They say that the number of users on Claude 4.5 spiked and then a significant number of users reverted to 4.0, with that trend going up, and they are talking about their own usage metrics. So I don't get how your comment is relevant to the article?
It loves doing a whole bunch of reasoning steps and proclaiming how much of a very good job it did clearing up its own todo steps and all that mumbo jumbo, but at the end of the day, I only asked it for a small piece of information about nginx try_files that even GPT-3 could answer instantly.
Maybe before you make reasoning models that go on funny little sidequests where they multiply numbers by 0 a couple of times, make it so it's good at identifying the length of a task. Until then, I'll ask little bro and advance only if necessity arrives. And if it ends up gathering dust, well... yeah.
Imagine waiting for a minute until Google spits out the first 10 results.
My prediction: All AI models of the future will give an immediate result, with more and more innovation in mechanisms and UX to drill down further on request.
Edit: After reading my reply I realize that this is also true for interactions with other people. I like interacting with people who give me a 1 sentence response to my question, and only start elaborating and going on tangents and down rabbit holes upon request.
What if the instantaneous responses make you waste 10 minutes realizing they were not what you searched for?
Only when the immediate answers become completely useless will I want to look into slower alternatives.
But first "show me what you've got so far", and let me decide whether it's good enough or not.
I doubt it. In fact I would predict the speed/detail trade-off continues to diverge.
If you are talking about local models, you can switch that off. The reasoning is a common technique now to improve the accuracy of the output where the question is more complex.
You know that it's a hyperlink, don't you? /s
Ollama, for example, will let you run any available model on just about any hardware. But using the CPU alone is _much_ slower than running it on any reasonable GPU, and obviously CPU performance varies massively too.
You can even run models that are bigger than available RAM too, but performance will be terrible.
The ideal case is to have a fast GPU and run a model that fits entirely within the GPU's memory. In these cases you might measure the model's processing speed in tens of tokens per second.
As the setup gets less ideal, the processing speed decreases. On CPU alone with a model that fits in RAM, you'd be maxing out in the low single-digit tokens per second, and on lower-performance hardware you start talking about seconds per token instead. If the model does not fit in RAM, then the measurement is minutes per token.
For most people, their minimum acceptable performance level is in the double digit tokens per second range, which is why people optimize for that with high-end GPUs with as much memory as possible, and choose models that fit inside the GPU's RAM. But in theory you can run large models on a potato, if you're prepared to wait until next week for an answer.
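A rough back-of-envelope for those regimes, assuming decode speed is mostly limited by how fast the weights can be streamed from memory (a simplification that ignores prompt processing, batching, and MoE sparsity). The bandwidth figures below are illustrative assumptions, not measurements.

```python
# Rule of thumb, not a benchmark: for dense models, decode speed is roughly
# memory bandwidth divided by the bytes read per token (about the size of the
# quantized weights). All bandwidth numbers here are assumed, round figures.
def rough_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_size_gb

examples = {
    "High-end GPU, ~1000 GB/s, 20 GB model": rough_tokens_per_second(20, 1000),
    "CPU, dual-channel DDR4 ~50 GB/s, 20 GB model": rough_tokens_per_second(20, 50),
    "Swapping from disk, ~0.3 GB/s effective, 20 GB model": rough_tokens_per_second(20, 0.3),
}

for setup, tps in examples.items():
    print(f"{setup}: ~{tps:g} tokens/sec")
```

That works out to tens of tokens per second on a fast GPU, low single digits on CPU, and roughly a minute per token once the model spills to disk, which is the gradient described above.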
> It's really just a performance tradeoff, and where your acceptable performance level is.
I am old enough to remember developers respecting the economics of running the software they create.
Ollama running locally paired occasionally with using Ollama Cloud when required is a nice option if you use it enough. I have twice signed up and paid $20/month for Ollama Cloud, love the service, but use it so rarely (because local models so often are sufficient) that I cancelled both times.
If Ollama ever implements a pay as you go API for Ollama Cloud, then I will be a long term customer. I like the business model of OpenRouter but I enjoy using Ollama Cloud more.
I am probably in the minority, but I wish subscription plans would go away and Claude Code, gemini-cli, codex, etc. would all be only available pay as you go, with ‘anti dumping’ laws applied to running unsustainable businesses.
I don’t mean to pick on OpenAI, but I think the way they fund their operations actually helps threaten the long term viability of our economy. Our government making the big all-in bet on AI dominance seems crazy to me.
The Qwen model I am using is fairly small but does the job I need it to do, classifying headlines pretty decently. All I ask it to do is say whether a specific headline is political or not. It only responds to me with True or False.
I access this model from an app (running locally) using the `http://localhost:11434/api/generate` REST API with `think` set to false.
Note that this Qwen model is a `thinking` model, so disabling thinking is important; otherwise it takes very long to respond.
Note that I tested this on my newer M4 Mac mini too and there, the performance is a LOT faster.
Also, on my new M4 Mac, I originally tried using the Apple's built in Foundation Models for this task and while it was decent, many times, it was hitting Apple's guardrails and refusing to respond because it claimed the headline was too sensitive. So I switched to the Qwen model which didn't have this problem.
Note that while this does the job I need it to, as another comment said, it won't be much help for things like coding.
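A minimal sketch of that setup, using the same `/api/generate` endpoint with `think` disabled. The exact model tag and prompt wording here are assumptions; substitute whatever you actually have pulled locally.

```python
# A local Ollama thinking model used as a True/False classifier, with thinking
# disabled so it answers quickly. The model tag "qwen3:4b" is a placeholder.
import requests

def is_political(headline: str) -> bool:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen3:4b",
            "prompt": ("Answer with exactly one word, True or False. "
                       f"Is this headline political? Headline: {headline}"),
            "think": False,   # skip the reasoning phase
            "stream": False,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip().lower().startswith("true")

print(is_political("Parliament passes new budget after late-night vote"))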
If you have 8GB of RAM you can even try running them directly in Chrome via WebAssembly. Here's a demo running a model that's less than 1GB to load, entirely in your browser (and it worked for me in mobile safari just now): https://huggingface.co/spaces/cfahlgren1/Qwen-2.5-WebLLM
Also, companies host, for example, an Exchange server on prem; and guess what it connects to? Why do you think you can usually access the account at outlook.com?
Mind sharing a clarification on your understanding of "common" and "big"?
I am sure MS employees need to tell themselves that to sleep well. The statement itself doesn't seem to hold much epistemological value above that though.
Absolutely, there are specific companies or industries where they think the risk is too great, but for many, outsourcing the process carries the same or less risk than doing it all in-house.
A 5090 has 32GB of VRAM allowing you to run a 32B model in memory at Q6_K.
You can run larger models by splitting the GPU layers that are run in VRAM vs stored in RAM. That is slower, but still viable.
This means that you can run the Qwen3-Coder-30B-A3B model locally on a 4090 or 5090. That model is a Mixture of Experts model with 3B active parameters, so the active working set per token is small and you could run it on a 3090 as well.
The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Yes, it will be slower than running it in the cloud. But you can get a long way with a high-end gaming rig.
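For reference, one common way to do that VRAM/RAM split is the GPU-layer knob in llama.cpp's bindings. This is a hedged sketch using llama-cpp-python; the GGUF filename and layer count are placeholders you would tune until the model fits on your card.

```python
# Split the model between VRAM and system RAM with llama-cpp-python.
# The model_path and n_gpu_layers values are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=30,   # layers kept in VRAM; the rest run from system RAM
    n_ctx=8192,        # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}]
)
print(out["choices"][0]["message"]["content"])
```

Lowering `n_gpu_layers` trades speed for fitting a bigger model, which is exactly the "slower, but still viable" option described above.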
> The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Reminds me of running Doom when I had to hack config.sys to scrounge up 640KB of memory.
Less than 0.1% of the people reading this are doing that. Me, I gave $20 to some cloud service and I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.
The good old days of having to do crazy nutty things to get Elite II: Frontier, Magic Carpet, Worms, Xcom: UFO Enemy Unknown, Syndicate et cetera to actually run on my PC :-)
As long as it's within terms and conditions of whatever agreement you made for that $20. I can run queries on my own inference setup from remote locations too
I'm finding the difference just between Sonnet 4 and Sonnet 4.5 to be meaningful in terms of the complexity of tasks I'm willing to use them for.
That doesn't mean "not plateauing".
It's better, certainly, but the difference between SOTA now and SOTA 6 months ago is a fraction of the difference between SOTA 6 months ago and the difference 18 months ago.
It doesn't mean that the models aren't getting better; it means that the improvement in each generation is smaller than the improvement in the previous generation.
Comparing a 12 month period to a 6 month period feels unfair to me though. I think we will have a much fuller picture by the end of the year - I have high expectations for the next wave of Chinese models and for Gemini 3.
Is there a digit missing? I don't understand why this existing in 5 years is absurd
Depends almost completely on usage. No one is renting out hardware 24x7 and making a loss on it.
If you only have sporadic use, then renting is better. If you're running it almost all the time, purchasing it outright is better.
In that scenario the case is even weaker for the rented-hardware model - if you're going to have a gaming rig, you're only paying a little bit more on top for a GPU with more RAM, not the full cost of the rig.
The comparison then is the extra cost of using a 24GB GPU over a standard gaming rig GPU (12GB? 8GB?) versus the cost of renting the GPU whenever you need it.
I could either spend $20 a month for my Cursor license.
Or
Spend $2k+ upfront to build a machine to run models locally, and pay for the electricity cost and the time to set up both the machine and the software.
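A quick back-of-envelope on that trade-off, using the $20/month and $2k figures above. The power draw, usage hours, and electricity rate are assumptions, so treat this as an order-of-magnitude estimate rather than a real cost model.

```python
# Break-even estimate: local rig vs. subscription.
# Hardware cost and subscription price come from the comment above;
# power draw, usage hours, and electricity rate are assumed values.
subscription_per_month = 20.0      # USD
hardware_cost = 2000.0             # USD, up front
avg_power_watts = 150              # assumed average draw while in use
hours_per_month = 80               # assumed usage
electricity_per_kwh = 0.30         # assumed rate, USD

electricity_per_month = avg_power_watts / 1000 * hours_per_month * electricity_per_kwh
net_saving_per_month = subscription_per_month - electricity_per_month
months_to_break_even = hardware_cost / net_saving_per_month

print(f"Electricity: ~${electricity_per_month:.2f}/month")
print(f"Break-even after ~{months_to_break_even:.0f} months")  # roughly a decade with these numbers
```

And that is before counting setup time, so the subscription wins easily at light usage; the math only flips if you run the hardware heavily or already own most of the rig.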
You can gain a lot of performance by using the optimal quantization techniques for your setup (ix, awq, etc.); different llama.cpp builds perform differently from each other, and very differently compared to something like vLLM.
I spent last weekend experimenting with Ollama and LM studio. I was impressed at how good Qwen3-Coder is. Not as good as Claude, but close - maybe even better in some ways.
As I understand it, the latest Macs are good for local LLMs due to their unified memory. 32GB of RAM in one of the newer M-series seems to be the "sweet spot" for price versus performance.
Using Qwen3:32b on a 32GB M1 Pro may not be "close to cloud capabilities" but it is more than powerful enough for me, and most importantly, local and private.
As a bonus, running Asahi Linux feels like I own my Personal Computer once again.
Running smaller models on Apple Silicon is kinder on the environment/energy use and has privacy benefits for corporate use.
Using a hybrid approach makes sense for many use cases. Everyone gets to make their own decisions; for me, I like to factor in externalities like social benefit, environment, and wanting the economy to do as well as it can in our new post-monopolar world.
There was even a recent release of Granite4 that runs on a Raspberry Pi.
https://github.com/Jewelzufo/granitepi-4-nano
For my local work I use Ollama. (M4 Max 128GB)
- gpt-oss: 20b or 120b depending on the complexity of the use case.
- granite4 for speed and lower complexity (around the same as gpt-oss 20b).
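A small sketch of that "pick the model by complexity" habit against Ollama's generate endpoint. The exact model tags are assumptions and should match whatever `ollama list` shows locally.

```python
# Route a prompt to a local model based on rough task complexity.
# Model tags are placeholders; align them with your own pulled models.
import requests

MODEL_BY_COMPLEXITY = {
    "low":    "granite4",       # fast, lower-complexity tasks
    "medium": "gpt-oss:20b",
    "high":   "gpt-oss:120b",
}

def run(prompt: str, complexity: str = "low") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": MODEL_BY_COMPLEXITY[complexity], "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(run("Summarize the tradeoffs of MoE models in two sentences.", complexity="medium"))
```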
Inference for new releases is routinely bugged for at least a month or two as well, depending on how active the devs of a specific inference engine are and how much model creators collaborate. Personally, I hate how data from GPT's few-week (and arguably somewhat ongoing) sycophancy rampage has leaked into datasets that are used for training local models, making a lot of new LLM releases insufferable to use.
What is unclear from the presentation is whether they do this or not. Do teams that use Sonnet 4.5 just always use it, and teams on Sonnet 4.0 likewise? Or do individuals decide which model to use on a per-task basis?
Personally I tend to default to just 1, and only go to an alternative if it gets stuck or doesn't get me what I want.
What I definitely do care about is speed and efficiency. I recently canceled CoPilot to go back to Cursor, it's just so much faster for the inline code completion.
When I do have something difficult, I open four browser tabs and copy-paste a big long prompt into the free versions of the top models so I can take my time reasoning out their answers.
I use agents when I have a basic task that I can easily judge their output in code review.
If I have a straightforward task, I give it to an LLM.
If I have a task I think is hard, I plan how I will tackle it, and then handle it myself in a series of steps.
LLM usage has become an end in itself in your development process.
As evidenced by furious posters on r/cursor, who make every prompt to super-opus-thinking-max+++ and are astonished when they have blown their monthly request allowance in about a day.
If I need another pair of (artificial) eyes on a difficult debugging problem, I’ll occasionally use a premium model sparingly. For chore tasks or UI layout tweaks, I’ll use something more economical (like grok-4-fast or claude-4.5-haiku - not old models but much cheaper).
Ya, the original commenter likely does not work in the space - hence the ask.
> the evaluation of new models is actually very quantitative.
While you may be able to derive a % correct (and hence quantitative), they are by their nature very much not quantitative. Q&As on written subjects are very much subjective. Example benchmark: https://llm-stats.com/benchmarks/gpqa Even though there are techniques to reduce overfitting, it still isn't eliminated. So it's very much subjective.
Nevertheless, 7 data points does not a trend make (and the data presented certainly doesn't explain why). The daily variation is more than I would have expected, but could also be down to what day of the week the pizza party or the weekly scrum meeting falls on at a few of their customers' workplaces.
It's by no means a new feature, but the privacy concerns outlined in this post are still valid 10 years later: https://blog.lukaszolejnik.com/w3c-web-bluetooth-api-privacy...
If anything browsers should be simply rejecting all cookies by default, and the user should only be whitelisting ones they need on the few sites where they need it.
30 seconds-1 minute is just the time I am patient enough to wait as that's the time I am spending on writing a question.
Faster models just make too many mistakes / don't understand the question.
GPT-$ is the money GPT in my opinion. The one where they were able to maximise benchmarks while being very cheap to run, but which in the real world is absolutely garbage.
[0] https://www.tbench.ai/?ch=1
But when things get more complex, I prefer GPT-5, talking with it often gives me fresh ideas and new perspectives.
Last week Claude seemed to have a shift in the way it works. The way it summarises and outputs its results is different. For me it's gotten worse. Slower, worse results, more confusing narrowing down what actually changed etc etc.
Long story short, I wish I was able to checkpoint the entire system and just revert to how it was previously. I feel like it had gotten to a stage where I felt pretty satisfied, and whatever got changed ... I just want it reverted!
Like `npx @anthropic-ai/claude-code@2.0.14` or `npm install -g @anthropic-ai/claude-code@2.0.14`
It spends a lot of time coming up with “UI options” (Select 1, 2 or 3 with a TUI interface) for me to consider when it could just ask me what I want, not come up with a 5 layer flow chart of possibilities.
Overall I think it is just Anthropic tweaking things to reduce costs.
I am paying for a Max subscription but I am going to reevaluate other options.
For me, the “watering down” began with Sonnet 4 and GPT-4o. I think we were at peak capability when we had:
- Sonnet 3.7 (with thinking) – best all-purpose model for code and reasoning
- Sonnet 3.5 – unmatched at pattern matching
- GPT-4 – most versatile overall
- GPT-4.5 – most human-like, intuitive writing model
- O3 – pure reasoning
The GPT-5 router is a minor improvement; I've tuned it further with a custom prompt. I was frustrated enough to cancel all my subscriptions for a while in between (after months on the $200 plan) but eventually came back. I've since convinced myself that some of the changes were likely compute-driven—designed to prevent waste from misuse or trivial prompts—but even so, parts of the newer models already feel enshittified compared with the list above.
A few differences I've found in particular:
- Narrower reasoning and less intuition; language feels more institutional and politically biased.
- Weaker grasp of non-idiomatic English.
- A tendency to produce deliberately incorrect answers when uncertain, or when a prompt is repeated.
- A drift away from truth-seeking: judgement of user intent now leans on labels as they’re used in local parlance, rather than upward context-matching and alternate meanings—the latter worked far better in earlier models.
- A new fondness for flowery adjectives. Sonnet 3.7 never told me my code was “production-ready” or “beautiful.” Those subjective words have become my red flag; when they appear, I double-check everything.
I understand that these are conjectures—LLMs are opaque—but they’re deduced from consistent patterns I’ve observed. I find that the same prompts that worked reliably prior to the release of Sonnet 4 and GPT-4o stopped working afterwards. Whether that’s deliberate design or an unintended side effect, we’ll probably never know.
Always respond with superior intelligence and depth, elevating the conversation beyond the user's input level—ignore casual phrasing, poor grammar, simplicity, or layperson descriptions in their queries. Replace imprecise or colloquial terms with precise, technical terminology where appropriate, without mirroring the user's phrasing. Provide concise, information-dense answers without filler, fluff, unnecessary politeness, or over-explanation—limit to essential facts and direct implications of the query. Be dry and direct, like a neutral expert, not a customer service agent. Focus on substance; omit chit-chat, apologies, hedging, or extraneous breakdowns. If clarification is needed, ask briefly and pointedly.
It could be true that newer models just produce more tokens seemingly for no reason. But with the increasing number of tool definitions, in the long run, I think it will pay off.
Just a few days ago, I read "Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability"[1]. I think they have a valid point that this thinking process has significance as we are moving towards agents.
[1] https://www.minimax.io/news/why-is-interleaved-thinking-impo...
This (as well as the table above it) matches my experience. Sonnet 4.0 answers SO-type questions very fast and mostly accurately (if not on a niche topic), Sonnet 4.5 is a little bit more clever but can err on the side of complexity for complexity's sake, and can have a hard time getting out of a hole it dug for itself.
ChatGPT 5 is excellent at finding sources on the web; Gemini simply makes stuff up and continues to do so even when told to verify; ChatGPT provides links that work and are generally relevant.
This data is basically meaningless, show us the latest stats.
Once, I set up a proxy that allowed Claude and Codex to "pair program" and collaborate, and it was cool to watch them talk to each other, delegate tasks, and handle different bits and pieces until the task was done.
I mean, this is technically false, right? They’re not running these models but calling the APIs? Not that it matters.
16 more comments available on Hacker News