Claude Sonnet 4.5
Posted 3 months ago · Active 3 months ago
anthropic.com · Tech story · High profile
Sentiment: excited, mixed
Debate: 70/100
Key topics
Artificial Intelligence
Large Language Models
Claude Sonnet 4.5
Coding
The release of Claude Sonnet 4.5 has generated significant interest and discussion on HN, with users sharing their experiences and concerns about the model's performance, pricing, and capabilities.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 34s
Peak period: 122 comments in 0-2h
Avg / period: 14.5
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
- 01 Story posted: Sep 29, 2025 at 12:52 PM EDT (3 months ago)
- 02 First comment: Sep 29, 2025 at 12:53 PM EDT (34s after posting)
- 03 Peak activity: 122 comments in 0-2h (hottest window of the conversation)
- 04 Latest activity: Sep 30, 2025 at 11:42 AM EDT (3 months ago)
ID: 45415962 · Type: story · Last synced: 11/27/2025, 3:36:11 PM
The key feature: use aliases instead of hardcoding model IDs. Your code references "summarizer", and a version-controlled lockfile maps it to the actual model. Switch providers by changing the lockfile, not your code.
Also handles streaming, tool calling, and structured output consistently across providers. Plus a human-curated registry (https://llmring.github.io/registry/) that I keep updated with current model capabilities and pricing - helpful when choosing models.
MIT licensed, works standalone. I am using it in several projects, but it's probably not ready to be presented in polite society yet.
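A minimal sketch of the alias-plus-lockfile idea in plain Python (an illustration of the pattern only, not llmring's actual API; the file name, format, and function here are hypothetical):

    # llm_lock.py - resolve model aliases from a version-controlled lockfile
    # (hypothetical illustration of the pattern, not llmring's real interface)
    import json
    from pathlib import Path

    LOCKFILE = Path("llm.lock.json")  # e.g. {"summarizer": "anthropic/claude-sonnet-4-5"}

    def resolve(alias: str) -> str:
        """Map an alias like 'summarizer' to a concrete provider/model ID."""
        lock = json.loads(LOCKFILE.read_text())
        return lock[alias]

    # Application code only ever mentions the alias; switching providers means
    # editing llm.lock.json, not the call sites.
    model_id = resolve("summarizer")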
just ask Claude to generate a tool that does this, duh! and tell Claude to make the changes to your side project and then to have sex with your wife too since it's doing all the fun parts
However, my subjective personal experience was that GPT-5-Codex was far better at complex problems than Claude Code.
This has been outstanding for the AI-assisted development I've been doing lately.
I would think this would manifest as poor plan execution. I personally haven't used Gemini on coding tasks, primarily based on my conversational experience with it.
GPT-5 = Overengineering/complexity/"enterprise" king
Claude = "Get straightforwaed shit done efficiently" king
That said, one thing I do dislike about Gemini is how fond it is of second guessing the user. This usually manifests in doing small unrelated "cleaner code" changes as part of a larger task, but I've seen cases where the model literally had something like "the user very clearly told me to do X, but there's no way that's right - they must have meant Y instead and probably just mistakenly said X; I'll do Y now".
One specific area where this happens a lot is, ironically, when you use Gemini to code an app that uses Gemini APIs. For Python, at least, there is the legacy google-generativeai API and the new google-genai API, which have fairly significant differences even though the core functionality is the same. The problem is that Gemini knows the former much better than the latter, and when confronted with such a codebase it will often try to use the old API (even if you pre-write the imports and some examples!). That of course breaks the type checker, and 90% of the time Gemini sees the error and goes, "oh, it must be failing because the user made an error in that import - I know it's supposed to be generativeai, not genai, so let me correct that."
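For reference, the two Python SDKs differ roughly like this (sketched from memory, with placeholder model names; check the current docs for exact signatures):

    # Legacy SDK: google-generativeai
    import google.generativeai as legacy_genai

    legacy_genai.configure(api_key="...")
    model = legacy_genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name
    print(model.generate_content("Hello").text)

    # Newer SDK: google-genai
    from google import genai

    client = genai.Client(api_key="...")
    resp = client.models.generate_content(model="gemini-2.0-flash", contents="Hello")
    print(resp.text)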
Do any other tools have anything like a /context command? They really should.
/compact helps by reducing crap in your context, but you can go further. Try to watch the % of context remaining and not go below 50% if possible - learn to choose tasks that don't require more context than the models can handle well.
I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.
Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.
My use case does better with the latter, because the agent frequently fails partway through and then can't look back at intermediate output.

E.g.

    command | complicated-grep | complicated-sed

is way worse than the multi-step

    command > tmpfile

followed by the grep etc., because the latter can reuse tmpfile if the grep turns out to be wrong.
I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.
Sometimes, amid this variability in performance, it pops up a little survey: "How's Claude doing this session from 1-5? 5 being great." And I suspect I'm in some experiment of extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.
While there is some non-determinism, it really does feel like performance is quite variable. It would make sense that they scale capacity up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps at peak hours GPT also has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.
It would, but
> To state it plainly: We never reduce model quality due to demand, time of day, or server load.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
Whether you believe them or not is another matter, but that's what they themselves say.
After all, using a different context window, subbing in a differently quantized model, throttling response length, or rate-limiting features aren't technically "reducing model quality".
It also consistently gets into drama with the other agents. E.g., the other day, when I told it we were switching to Claude Code for executing changes, after badmouthing Claude's entirely reasonable and measured analysis it went ahead and decided to `git reset --hard`, even after I twice pushed back on that idea.
Whereas Gemini and Claude are excellent collaborators.
When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.
To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.
To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. Moreso I think they haven't properly aligned GPT-5 to deal with cases like this.
However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.
It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.
Hopefully eventually we’ll all mostly figure it out.
Also appreciate your perspective. It's important to come at these things with some discipline. And moreso, bringing in a personal style of interaction invites a lot of untamed human energies into the dynamic.
The thing is, most of the time I'm quite dry with it and they still ignore my requests really often, regardless of how explicit or dry I am. For me, that's the real takeaway here, stripping away my style of interaction.
Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.
Sure there's no need to explicitly mention the agents themselves, but it also shouldn't trigger a pseudo-jealous panic with trash talk and a sudden `git reset --hard` either.
And also ideally the agents would be aware of one another's strengths and weaknesses and actually play to them rather than sabotaging the whole effort.
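For what it's worth, a minimal sketch of that architect/builder split, with one model drafting the plan and a second model doing the implementation (the model IDs and prompts below are illustrative assumptions; in practice the builder step usually runs inside an agent harness like Claude Code rather than a bare API call):

    # Sketch of the architect/builder pattern with two different models.
    # Model IDs and prompts are illustrative; adjust to whatever you actually run.
    import anthropic
    from openai import OpenAI

    task = "Add pagination to the /users endpoint"

    # Architect: have one model produce a concrete implementation plan.
    architect = OpenAI()
    plan = architect.chat.completions.create(
        model="gpt-5",  # assumed model ID
        messages=[{"role": "user",
                   "content": f"Write a step-by-step implementation plan for: {task}"}],
    ).choices[0].message.content

    # Builder: hand the plan to a second model for the actual implementation.
    builder = anthropic.Anthropic()
    patch = builder.messages.create(
        model="claude-sonnet-4-5",  # assumed model ID
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": f"Implement this plan and return a unified diff:\n\n{plan}"}],
    ).content[0].text

    print(patch)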
The power of using LLMs is working out what it has encoded and how to access it.
Perhaps for the first time in history we have to understand culture when working with a tool, but it’s still just a tool.
"..at least, that's what my junior dev is telling me. But I take his word with a grain of salt, because he was fired from a bunch of companies after only a few months on each job. So i need your principled and opinionated insight. Is this junior dev right?"
It's the only way to get Claude to not glaze an idea while also not striking it down for no reason other than to play the role of a "critical" dev.
> “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”
Really, GPT? Not just “can you set up the WiFi”??!
If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.
So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.
It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.
Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.
When I work with Codex, I really lean into a git workflow: everything goes on a branch, and I commit often. It's not how I'd normally do things, but it doesn't really cost me anything to adopt it.
These agents have their own pseudo personalities, and I've found that fighting against it is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.
It’s the one AI that keeps telling me I’m wrong and refuses to do what I ask it to do, then tells me “as we have already established, doing X is pointless. Let’s stop wasting time and continue with the other tasks”
It’s by far the most toxic and gaslighting LLM
You could just say it’s another GPT-5 instance.
I wonder how long it will be before we get Opus 4.5
There's still a lot of low hanging fruit apparently
Pervert.
Charting Claude's progress with Sonnet 4.5: https://youtu.be/cu1iRoc1wBo
I am going to give this another shot but it will cost me $50 just to try it on a real project :(
Perhaps Tesla FSD is a similar example: in principle, self-driving with vision alone should be possible (humans manage it), but it is fundamentally harder and more error-prone than having better data. It seems to me very error-prone and expensive in tokens to use computer screens as the fundamental unit of interaction.
But at the same rate, I'm sure there are many tasks which could be automated as well, so shrug
https://jsbin.com/hiruvubona/edit?html,output
https://claude.ai/share/618abbbf-6a41-45c0-bdc0-28794baa1b6c
But just for thrills I also asked for a "punk rocker"[2] and the result--while not perfect--is leaps and bounds above anything from the last generation.
0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.
Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png
I bet their ability to draw a pelican results purely from someone having already done it before.
It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".
It's funny, you're so convinced that it's not possible without direct memorization, but you forgot to account for emergent behaviors (which are frankly all over the place in LLMs - where have you been?).
At any rate, the pelican thing from simonw is clearly just for fun at this point.
It is extremely common, since it's used to bench every single LLM.
And there is no logic involved: LLMs are never trained on graphics tasks, and they don't see the output of the code.
Pretty solid progress for roughly 4 months.
Tongue in cheek: if we progress linearly from here, software engineering as defined by SWE-bench is solved in 23 months.
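For anyone who wants to redo that back-of-the-envelope, the extrapolation is just this (the scores and interval below are placeholders, not the actual SWE-bench numbers - plug in whichever figures you trust):

    # Naive linear extrapolation of benchmark progress.
    # The two scores and the interval are placeholders, not real results.
    prev_score = 73.0      # score ~4 months ago, in %
    curr_score = 77.0      # current score, in %
    months_elapsed = 4

    points_per_month = (curr_score - prev_score) / months_elapsed
    months_to_solved = (100.0 - curr_score) / points_per_month
    print(f"~{months_to_solved:.0f} months until the benchmark is 'solved' at this rate")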
Silly idea - is there an inter-species game that we could use in order to measure ELO?
We are still at a 7-month doubling time on METR task duration. If anything, the rate is increasing if you weight more recent measurements.
SWE-bench has lots of known limitations even with its ability to reduce solution leakage and overfitting.
> where there is no clear right answer
This is both a feature and a bug. If there is no clear answer then how do you determine whether an LLM has progressed? It can't simply be judged on making "more right answers" on each release.
I'm most interested to see the METR time horizon results - that is the real test for whether we are "on-trend"
https://en.m.wikipedia.org/wiki/P(doom)
I understand that they may not have published the results for Sonnet 4.5 yet, but I would expect the other models to match...
but: https://imgur.com/a/462T4Fu
626 more comments available on Hacker News