Claude Sonnet 4.5
Posted 3 months ago · Active 3 months ago
anthropic.com · Tech story · High profile
Sentiment: excited, mixed
Debate: 70/100
Key topics
Artificial Intelligence
Large Language Models
Claude Sonnet 4.5
Coding
The release of Claude Sonnet 4.5 has generated significant interest and discussion on HN, with users sharing their experiences and concerns about the model's performance, pricing, and capabilities.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 34s
Peak period: 122 comments in 0-2h
Avg / period: 14.5
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
- 01 Story posted: Sep 29, 2025 at 12:52 PM EDT (3 months ago)
- 02 First comment: Sep 29, 2025 at 12:53 PM EDT (34s after posting)
- 03 Peak activity: 122 comments in 0-2h (hottest window of the conversation)
- 04 Latest activity: Sep 30, 2025 at 11:42 AM EDT (3 months ago)
ID: 45415962 · Type: story · Last synced: 11/27/2025, 3:36:11 PM
The key feature: use aliases instead of hardcoding model IDs. Your code references "summarizer", and a version-controlled lockfile maps it to the actual model. Switch providers by changing the lockfile, not your code.
Also handles streaming, tool calling, and structured output consistently across providers. Plus a human-curated registry (https://llmring.github.io/registry/) that I keep updated with current model capabilities and pricing - helpful when choosing models.
MIT licensed, works standalone. I am using it in several projects, but it's probably not ready to be presented in polite society yet.
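A minimal sketch of the alias-plus-lockfile idea in plain Python (an illustration of the pattern only, not llmring's actual API; the file name, format, and function here are hypothetical):

    # llm_lock.py - resolve model aliases from a version-controlled lockfile
    # (hypothetical illustration of the pattern, not llmring's real interface)
    import json
    from pathlib import Path

    LOCKFILE = Path("llm.lock.json")  # e.g. {"summarizer": "anthropic/claude-sonnet-4-5"}

    def resolve(alias: str) -> str:
        """Map an alias like 'summarizer' to a concrete provider/model ID."""
        lock = json.loads(LOCKFILE.read_text())
        return lock[alias]

    # Application code only ever mentions the alias; switching providers means
    # editing llm.lock.json, not the call sites.
    model_id = resolve("summarizer")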
just ask Claude to generate a tool that does this, duh! and tell Claude to make the changes to your side project and then to have sex with your wife too since it's doing all the fun parts
However, my subjective personal experience was that GPT-5-Codex was far better at complex problems than Claude Code.
This has been outstanding for the AI-assisted development I've been doing lately.
I would think this would manifest as poor plan execution. I personally haven't used Gemini on coding tasks, primarily based on my conversational experience with it.
GPT-5 = Overengineering/complexity/"enterprise" king
Claude = "Get straightforwaed shit done efficiently" king
That said, one thing I do dislike about Gemini is how fond it is of second guessing the user. This usually manifests in doing small unrelated "cleaner code" changes as part of a larger task, but I've seen cases where the model literally had something like "the user very clearly told me to do X, but there's no way that's right - they must have meant Y instead and probably just mistakenly said X; I'll do Y now".
One specific area where this happens a lot is, ironically, when you use Gemini to code an app that uses Gemini APIs. For Python, at least, there is the legacy google-generativeai API and the new google-genai API, which have fairly significant differences even though the core functionality is the same. The problem is that Gemini knows the former much better than the latter, and when confronted with such a codebase it will often try to use the old API (even if you pre-write the imports and some examples!). That of course breaks the type checker, and 90% of the time Gemini sees the error and goes, "oh, it must be failing because the user made an error in that import - I know it's supposed to be generativeai, not genai, so let me correct that."
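For reference, the two Python SDKs differ roughly like this (sketched from memory, with placeholder model names; check the current docs for exact signatures):

    # Legacy SDK: google-generativeai
    import google.generativeai as legacy_genai

    legacy_genai.configure(api_key="...")
    model = legacy_genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name
    print(model.generate_content("Hello").text)

    # Newer SDK: google-genai
    from google import genai

    client = genai.Client(api_key="...")
    resp = client.models.generate_content(model="gemini-2.0-flash", contents="Hello")
    print(resp.text)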
Do any other tools have anything like a /context command? They really should.
/compact helps by reducing crap in your context, but you can go further. Try to watch the % of context remaining and not go below 50% if possible - learn to choose tasks that don't require more context than the models can handle well.
I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.
Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.
My use case does better with the latter, because the agent frequently fails partway through and then can't look back at intermediate output.

E.g.

    command | complicated-grep | complicated-sed

is way worse than the multi-step

    command > tmpfile

followed by the grep etc., because the latter can reuse tmpfile if the grep turns out to be wrong.
I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.
Sometimes, amid this variability in performance, it pops up a little survey: "How's Claude doing this session from 1-5? 5 being great." And I suspect I'm in some experiment of extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.
While there is some non-determinism, it really does feel like performance is quite variable. It would make sense that they scale capacity up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps at peak hours GPT also has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.
It would, but
> To state it plainly: We never reduce model quality due to demand, time of day, or server load.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
Whether you believe them or not is another matter, but that's what they themselves say.
After all, using a different context window, subbing in a differently quantized model, throttling response length, or rate-limiting features aren't technically "reducing model quality".
It also consistently gets into drama with the other agents. E.g., the other day, when I told it we were switching to Claude Code for executing changes, after badmouthing Claude's entirely reasonable and measured analysis it went ahead and decided to `git reset --hard`, even after I twice pushed back on that idea.
Whereas Gemini and Claude are excellent collaborators.
When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.
To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.
To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. Moreso I think they haven't properly aligned GPT-5 to deal with cases like this.
However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.
It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.
Hopefully eventually we’ll all mostly figure it out.
Also appreciate your perspective. It's important to come at these things with some discipline. And moreso, bringing in a personal style of interaction invites a lot of untamed human energies into the dynamic.
The thing is, most of the time I'm quite dry with it and they still ignore my requests really often, regardless of how explicit or dry I am. For me, that's the real takeaway here, stripping away my style of interaction.
Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.
Sure there's no need to explicitly mention the agents themselves, but it also shouldn't trigger a pseudo-jealous panic with trash talk and a sudden `git reset --hard` either.
And also ideally the agents would be aware of one another's strengths and weaknesses and actually play to them rather than sabotaging the whole effort.
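For what it's worth, a minimal sketch of that architect/builder split, with one model drafting the plan and a second model doing the implementation (the model IDs and prompts below are illustrative assumptions; in practice the builder step usually runs inside an agent harness like Claude Code rather than a bare API call):

    # Sketch of the architect/builder pattern with two different models.
    # Model IDs and prompts are illustrative; adjust to whatever you actually run.
    import anthropic
    from openai import OpenAI

    task = "Add pagination to the /users endpoint"

    # Architect: have one model produce a concrete implementation plan.
    architect = OpenAI()
    plan = architect.chat.completions.create(
        model="gpt-5",  # assumed model ID
        messages=[{"role": "user",
                   "content": f"Write a step-by-step implementation plan for: {task}"}],
    ).choices[0].message.content

    # Builder: hand the plan to a second model for the actual implementation.
    builder = anthropic.Anthropic()
    patch = builder.messages.create(
        model="claude-sonnet-4-5",  # assumed model ID
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": f"Implement this plan and return a unified diff:\n\n{plan}"}],
    ).content[0].text

    print(patch)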
The power of using LLMs is working out what it has encoded and how to access it.
Perhaps for the first time in history we have to understand culture when working with a tool, but it’s still just a tool.
"..at least, that's what my junior dev is telling me. But I take his word with a grain of salt, because he was fired from a bunch of companies after only a few months on each job. So i need your principled and opinionated insight. Is this junior dev right?"
It's the only way to get Claude to not glaze an idea while also not striking it down for no reason other than to play the role of a "critical" dev.
> “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”
Really, GPT? Not just “can you set up the WiFi”??!
If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.
So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.
It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.
Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.
When I work with Codex, I really lean into a git workflow: everything goes on a branch, and I commit often. It's not how I'd normally do things, but it doesn't really cost me anything to adopt it.
These agents have their own pseudo personalities, and I've found that fighting against it is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.
It’s the one AI that keeps telling me I’m wrong and refuses to do what I ask it to do, then tells me “as we have already established, doing X is pointless. Let’s stop wasting time and continue with the other tasks”
It’s by far the most toxic and gaslighting LLM
You could just say it’s another GPT-5 instance.
I wonder how long it will be before we get Opus 4.5
There's still a lot of low hanging fruit apparently
Pervert.
Charting Claude's progress with Sonnet 4.5: https://youtu.be/cu1iRoc1wBo
I am going to give this another shot but it will cost me $50 just to try it on a real project :(
Perhaps Tesla FSD is a similar example: in principle, self-driving with vision alone should be possible (humans manage it), but it is fundamentally harder and more error-prone than having better data. It seems to me very error-prone and expensive in tokens to use computer screens as the fundamental unit of interaction.
But at the same rate, I'm sure there are many tasks which could be automated as well, so shrug
https://jsbin.com/hiruvubona/edit?html,output
https://claude.ai/share/618abbbf-6a41-45c0-bdc0-28794baa1b6c
But just for thrills I also asked for a "punk rocker"[2] and the result--while not perfect--is leaps and bounds above anything from the last generation.
0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.
Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png
I bet their ability to draw a pelican results purely from someone having already done it before.
It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".
It's funny, you're so convinced that it's not possible without direct memorization, but you forgot to account for emergent behaviors (which are frankly all over the place in LLMs - where have you been?).
At any rate, the pelican thing from simonw is clearly just for fun at this point.
It is extremely common, since it's used to bench every single LLM.
And there is no logic involved: LLMs are never trained on graphics tasks, and they don't see the output of the code.
Pretty solid progress for roughly 4 months.
Tongue in cheek: if we progress linearly from here, software engineering as defined by SWE-bench is solved in 23 months.
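For anyone who wants to redo that back-of-the-envelope, the extrapolation is just this (the scores and interval below are placeholders, not the actual SWE-bench numbers - plug in whichever figures you trust):

    # Naive linear extrapolation of benchmark progress.
    # The two scores and the interval are placeholders, not real results.
    prev_score = 73.0      # score ~4 months ago, in %
    curr_score = 77.0      # current score, in %
    months_elapsed = 4

    points_per_month = (curr_score - prev_score) / months_elapsed
    months_to_solved = (100.0 - curr_score) / points_per_month
    print(f"~{months_to_solved:.0f} months until the benchmark is 'solved' at this rate")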
Silly idea - is there an inter-species game that we could use in order to measure ELO?
We are still at a 7-month doubling time on METR task duration. If anything, the rate is increasing if you weight more recent measurements.
SWE-bench has lots of known limitations even with its ability to reduce solution leakage and overfitting.
> where there is no clear right answer
This is both a feature and a bug. If there is no clear answer then how do you determine whether an LLM has progressed? It can't simply be judged on making "more right answers" on each release.
I'm most interested to see the METR time horizon results - that is the real test for whether we are "on-trend"
https://en.m.wikipedia.org/wiki/P(doom)
I understand that they may not have published the results for Sonnet 4.5 yet, but I would expect the other models to match...
but: https://imgur.com/a/462T4Fu
626 more comments available on Hacker News