Claude Haiku 4.5

Posted3 months agoActive3 months ago

adocomplete

730 points

287 comments

anthropic.comTechstoryHigh profile

excitedpositive

Debate

60/100

Artificial IntelligenceLarge Language ModelsClaudeCoding

Key topics

Artificial Intelligence

Large Language Models

Claude

Coding

System card: https://assets.anthropic.com/m/99128ddd009bdcb/original/Clau...

Anthropic released Claude Haiku 4.5, a faster and cheaper AI model for coding tasks, sparking discussion about its potential use cases and comparisons to other models.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

10m

Peak period

0-6h

Avg / period

22.9

Comment distribution160 data points

Loading chart...

Based on 160 loaded comments

Key moments

01Story posted
Oct 15, 2025 at 12:55 PM EDT
3 months ago
Step 01
02First comment
Oct 15, 2025 at 1:04 PM EDT
10m after posting
Step 02
03Peak activity
96 comments in 0-6h
Hottest window of the conversation
Step 03
04Latest activity
Oct 18, 2025 at 7:43 AM EDT
3 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (287 comments)

Showing 160 comments of 287

minimaxir

3 months ago

4 replies

$1/M input tokens and $5/M output tokens is good compared to Claude Sonnet 4.5 but nowadays thanks to the pace of the industry developing smaller/faster LLMs for agentic coding, you can get comparable models priced for much lower which matters at the scale needed for agentic coding.

Given that Sonnet is still a popular model for coding despite the much higher cost, I expect Haiku will get traction if the quality is as good as this post claims.

Bolwin

3 months ago

2 replies

With caching that's 10 cents per million in. Most of the cheap open source models (which this claims to beat, except glm 4.6) have limited and not as effective caching.

This could be massive.

logicchains

3 months ago

1 reply

$1/M is hardly a big improvement over GPT5's $1.250/M (or Gemini Pro's $1.5/M), and given how much worse Haiku is than those at any kind of difficult problem (or problems with a large context size), I can't imagine it being a particularly competitive alternative for coding. Especially for anything math/logic related, I find GPT5 and Gemini Pro to be significantly better even than Opus (which reflects in their models having won Olympiad prizes while Anthropic's have not).

HarHarVeryFunny

3 months ago

2 replies

GPT-5 is $10/M for output tokens, twice the cost of Haiku 4.5 at $5/M, despite Haiku apparently being better at some tasks (SWE Bench).

I suppose it depends on how you are using it, but for coding isn't output cost more relevant than input - requirements in, code out ?

logicchains

3 months ago

Unless you're working on a small greenfield project, you'll usually have 10s-100s of thousands of relevant words (~tokens) of relevant code in context for every query, vs a few hundred words of changes being output per query. Because most changes to an existing project are relatively small in scope.

criemen

3 months ago

> I suppose it depends on how you are using it, but for coding isn't output cost more relevant than input - requirements in, code out ?

Depends on what you're doing, but for modifying an existing project (rather than greenfield), input tokens >> output tokens in my experience.

Tiberium

3 months ago

2 replies

The funny thing is that even in this area Anthropic is behind other 3 labs (Google, OpenAI, xAI). It's the only one out of those 4 that requires you to manually set cache breakpoints, and the initial cache costs 25% more than usual context. The other 3 have fully free implicit caching. Although Google also offers paid, explicit caching.

https://docs.claude.com/en/docs/build-with-claude/prompt-cac...

https://ai.google.dev/gemini-api/docs/caching

https://platform.openai.com/docs/guides/prompt-caching

https://docs.x.ai/docs/models#cached-prompt-tokens

tempusalaria

3 months ago

1 reply

I vastly prefer the manual caching. There are several aspects of automatic caching that are suboptimal, with only moderately less developer burden. I don’t use Anthropic much but I wish the others had manual cache options

simonw

3 months ago

3 replies

What's sub-optimal about the OpenAI approach, where you get 90% discount on tokens that you've previously sent within X minutes?

stavros

3 months ago

1 reply

Is it wherever the tokens are, or is it the N first tokens they've seen before? Ie if my prompt is 99% the same, except for the first token, will it be cached?

simonw

3 months ago

2 replies

The prefix has to be stable. If you are 99% the same but the first token is different it won't cache at all. You end up having to design your prompts to accommodate this.

thefroh

3 months ago

which is important to bear in mind if people are introducing a "drop earliest messages" sliding window for context management in a "chat-like" experience. once you're at that context limit and start dropping the earliest messages, you're guaranteeing every message afterwards will be a cache miss.

a simple alternative approach is to introduce hysteresis by having both a high and low context limit. if you hit the higher limit, trim to the lower. this batches together the cache misses.

if users are able to edit, remove or re-generate earlier messages, you can further improve on that by keeping track of cache prefixes and their TTLs, so rather than blindly trimming to the lower limit, you instead trim to the longest active cache prefix. only if there are none, do you trim to the lower limit.

stavros

3 months ago

That's what I thought, thanks Simon.

thefroh

3 months ago

1 reply

because you can have multiple breakpoints with Anthropic's approach, whereas with OpenAI, you only have breakpoints for what was sent.

for example if a user sends a large number of tokens, like a file, and a question, and then they change the question.

simonw

3 months ago

1 reply

I thought OpenAI would still handle case? Their cache would work up to the end of the file and you would then pay for uncached tokens for the user's question. Have I misunderstood how their caching works?

thefroh

3 months ago

1 reply

not if call #1 is the file + the question, call #2 is the file + a different question, no.

if call #1 is the file, call #2 is the file + the question, call #3 is the file + a different question, then yes.

and consider that "the file" can equally be a lengthy chat history, especially after the cache TTL has elapsed.

simonw

3 months ago

I vibe-coded up a quick UI for exploring this: https://tools.simonwillison.net/prompt-caching

As far as I can tell it will indeed reuse the cache up to the point, so this works:

Prompt A + B + C - uncached

Prompt A + B + D - uses cache for A + B

Prompt A + E - uses cache for A

tempusalaria

3 months ago

Lots of situations, here are 2 I’ve faced recently (cannot give too much detail for privacy reasons, but should be clear enough)

1) low latency desired, long user prompt 2) function runs many parallel requests, but is not fired with common prefix very often. OpenAI was very inconsistent about properly caching the prefix for use across all requests, but with Anthropic it’s very easy to pre-fire

criemen

3 months ago

1 reply

I don't understand why we're paying for caching at all (except: model providers can charge for it). It's almost extortion - the provider stores some data for 5min on some disk, and gets to sell their highly limited GPU resources to someone else instead (because you are using the kv cache instead of GPU capacity for a good chunk of your input tokens). They charge you 10% of their GPU-level prices for effectively _not_ using their GPU at all for the tokens that hit the cache.

If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!

simonw

3 months ago

2 replies

It's not about storing data on disk, it's about keeping data resident in memory.

criemen

3 months ago

1 reply

Fascinating, so I have to think more "pay for RAM/redis" than "pay for SSD"?

nthypes

3 months ago

1 reply

"pay for data on VRAM" RAM of GPU

criemen

3 months ago

3 replies

But that doesn't make sense? Why would they keep the cache persistent in the VRAM of the GPU nodes, which are needed for model weights? Shouldn't they be able to swap in/out the kvcache of your prompt when you actually use it?

tazjin

3 months ago

1 reply

Your intuition is correct and the sibling comments are wrong. Modern LLM inference servers support hierarchical caches (where data moves to slower storage tiers), often with pluggable backends. A popular open-source backend for the "slow" tier is Mooncake: https://github.com/kvcache-ai/Mooncake

simonw

3 months ago

OK that's pretty fascinating, turns out Mooncake includes a trick that can populate GPU VRAM directly from NVMe SSD without it having to go through the host's regular CPU and RAM first!

https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/tran...

> Transfer Engine also leverages the NVMeof protocol to support direct data transfer from files on NVMe to DRAM/VRAM via PCIe, without going through the CPU and achieving zero-copy.

dotancohen

3 months ago

They are not caching to save network bandwidth. They are caching to increase interference speed and reduce (their own) costs.

minimaxir

3 months ago

That is slow.

jbellis

3 months ago

1 reply

Deepseek pioneered automatic prefix caching and caches on SSD. SSD reads are so fast compared to LLM inference that I can't think of a reason to waste ram on it.

jychang

3 months ago

It’s not instantly fast though. Context is probably ~20gb of VRAM at max context size. That’s gonna take some time to get from SSD no matter what.

TtFT will get slower if you export kv cache to SSD.

simonw

3 months ago

2 replies

Yeah, I'm a bit disappointed by the price. Claude 3.5 Haiku was $0.8/$4, 4.5 Haiku is $1/$5.

I was hoping Anthropic would introduce something price-competitive with the cheaper models from OpenAI and Gemini, which get as low as $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite).

odie5533

3 months ago

1 reply

There's probably less margin on the low end, so they don't want to focus on capturing it.

dr_dshiv

3 months ago

1 reply

Margin? Hahahahaha

odie5533

3 months ago

1 reply

Inference is profitable.

reppap

3 months ago

2 replies

If you completely ignore inference revenue needing to offset training costs. Is inference still profitable if you account for the amortized training cost?

simonw

3 months ago

Not for the big labs, who are engaged in an astonishingly competitive buildout right now.

There are a bunch of companies who offer inference against open weight models trained by other people. They get to skip the training costs.

yunwal

3 months ago

> If you completely ignore inference revenue needing to offset training costs.

This is what people mean when they say margin. When you buy a pair of shoes, the margin is price/(materials+labor), and doesn’t include the price of the factory or the store they were bought in

diwank

3 months ago

I am a bit mind boggled by the pricing lately, especially since the cost increased even further. Is this driven by choices in model deployment (unquantized etc) or simply by perceived quality (as in 'hey our model is crazy good and we are going to charge for it)?

justinbaker84

3 months ago

1 reply

I am a professional developer so I don't care about the costs. I would be willing to pay more for 4.5 Haiku vs 4.5 Sonnet because the speed is so valuable.

I spend way to much time waiting for the cutting edge models to return a response. 73% on SWE Bench is plenty good enough for me.

jhancock

3 months ago

2 replies

How do you review code when the LLM can produce so much so fast?

justinbaker84

3 months ago

Just read it when it is done writing it.

evan_

3 months ago

with an LLM

rudedogg

3 months ago

This also means API usage through Claude Code got more expensive (but better if benchmarks are to be believed)

aliljet

3 months ago

9 replies

What is the use case for these tiny models? Is it speed? Is it to move on device somewhere? Or is it to provide some relief in pricing somewhere in the API? It seems like most use is through the Claude subscription and therefore the use case here is basically non-existent.

pacoWebConsult

3 months ago

1 reply

One big use-case is that claude code with sonnet 4.5 will delegate into the cheaper model (configurable) more specific, contextful tasks, and spin up 1-3 sub-agents to do so. This process saves a ton of available context window for your primary session while also increasing token throughput by fanning-out.

matltc

3 months ago

1 reply

How does one configure Claude code to delegate to cheaper models?

I have a number of agents in ~/.claude/agents/. Currently have most set to `model: sonnet` but some are on haiku.

The agents are given very specific instructions and names that define what they do, like `feature-implementation-planner` and `feature-implementer`. My (naive) approach is to use higher-cost models to plan and ideally hand off to a sub-agent that uses a lower-cost model to implement, then use a higher-cost to code review.

I am either not noticing the handoffs, or they are not happening unless specifically instructed. I even have a `claude-help` agent, and I asked it how to pipe/delegate tasks to subagents as you're describing, and it answered that it ought to detect it automatically. I tested it and asked it to report if any such handoffs were detected and made, and it failed on both counts, even having that initial question in its context!

brianwawok

3 months ago

I only get Claude to launch agents when I specifically tell it to for a given task. And it only really works if you can actually parallelize the task,

dlisboa

3 months ago

1 reply

For me speed is interesting. I sometimes use Claude from the CLI with `claude -p` for quick stuff I forget like how to run some docker image. Latency and low response speed is what almost makes me go to Google and search for it instead.

matltc

3 months ago

I use gh copilot suggest in lieu of claude -p. Two seconds latency and highly accurate. You probably need a gh copilot auth token to do this though, and truthfully, that is pointless when you have access to Claude code.

muzani

3 months ago

I'm working on a RPG. There's a fixed set of rules. I give the player freedom to do things but it has to be within the laws of physics, e.g. they can't just pull a key or a shotgun out of nowhere. So a LLM arbitrates the behavior and tries to match it to the nearest rule.

The rules themselves are a bit more complex and require a smarter model, but the arbitration should be fairly fast. GPT-5 is cheap and high quality but even gpt-5-mini takes about 20-40 seconds to handle a scene. Sonnet can hit 8 seconds with RAG but it's too expensive for freemium.

Grok Turbo and Haiku 3 were fast but often misses the mark. I'm hoping Haiku 4.5 can go below 4 seconds and have decent accuracy. 20 seconds is too long, and hurts debugging as well.

kasey_junk

3 months ago

They are great for building more specialized tool calls that the bigger models can call out to in agentic loops.

anuramat

3 months ago

for me its the speed; eg cerebras qwen coder gets you a completely different workflow as its practically instant (3k tps) -- feels less like an agent and more like a natural language shell, very helpful for iterating on a plan that you them forward to a bigger model

pietz

3 months ago

I think with gpt-5-mini and now Haiku 4.5, I’d phrase the question the other way around: what do you need the big models for anymore?

We use the smaller models for everything that’s not internal high complexity tasks like coding. Although they would do a good enough of a job there as well, we happily pay the uncharge to get something a little better here.

Anything user facing or when building workflow functionalities like extracting, converting, translating, merging, evaluating, all of these are mini and nano cases at our company.

JLO64

3 months ago

In my product I use gpt-5-nano for image ALT text in addition to generating transcriptions of PDFs. It’s been surprisingly great for these tasks, but for PDFs I have yet to test it on a scanned document.

minimaxir

3 months ago

If you look at the OpenRouter rankings for LLMs (generally, the models coders use for vibe/agentic coding), you can see that most of them are in the "small" model class as opposed to something like full GPT-5 or Claude Opus, albeit Gemini 2.5 Pro is higher than expected: https://openrouter.ai/rankings

Rudybega

3 months ago

Higher token throughput is great for use cases where the smaller, faster model still generates acceptable results. Final response time improvements feel so good in any sort of user interface.

RickHull

3 months ago

4 replies

If I'm close to weekly limits on Claude Code with Anthropic Pro, does that go away or stretch out if I switch to Haiku?

visarga

3 months ago

1 reply

Sonnet 4.5 was two weeks ago. In the past I never had such issues, but every week my quota ended in 2-3 days. I suspect the Sonnet 4.5 model consumes more usage points than old Sonnet 4.1

I am afraid Claude Pro subscription got 3x less usage

Aeolun

3 months ago

Yeah. I definitely don’t get as much usage out of Sonnet 4.5 as 5x Opus 4.1 should imply.

What bothers me is that nobody told me they changed anything. It’s extremely frustrating to feel like I’m being bamboozled, but unable to confirm anything.

I switched to Codex out of spite, but I still like the Claude models more…

thomassmith65

3 months ago

2 replies

How close are you?

Oh right, Anthropic doesn't tell you.

I got that 'close to weekly limits' message for an entire week without ever reaching it, came to the conclusion that it is just a printer industry 'low ink!' tactic, and cancelled my subscription.

You don't take money from a customer for a service, and then bar the customer form using that service for multiple days.

Either charge more, stop subsidizing free accounts, or decrease the daily limit.

__atx__

3 months ago

2 replies

These days, running `/usage` in Claude Code shows you how close you are to the session and weekly limits. Also available in the web interface settings under "Usage".

thomassmith65

3 months ago

My mistake. It's good that it's available in settings, even if it's a few screens away from the 'close to weekly limits' banner nagging me to subscribe to a more expensive plan.

RickHull

3 months ago

Super helpful, thanks!

fluidcruft

3 months ago

They have pretty nice bar charts nowadays.

parkersweb

3 months ago

I’m also really interested in this - in fact it’s the first thing I went looking for in the announcement…

parkersweb

3 months ago

Anecdata point - I’ve been running for around 3-4 hours this morning constantly using Haiku and it hasn’t hit the limit - currently at 74% and it resets in 1.5 hours. I think it’s safe to say you get a fair bit more usage over Sonnet.

Still trying to judge the performance though - first impression is that it seems to make sudden approach changes for no real reason. For example - after compacting, the next task I gave it, it suddenly started trying to git commit after each task completion, did that for a while, then stopped again.

steveklabnik

3 months ago

4 replies

I am really interested in the future of Opus; is it going to be an absolute monster, and continue to be wildly expensive? Or is the leap from 4 -> 4.5 for it going to be more modest.

criemen

3 months ago

2 replies

Technically, they released Opus 4.1 a few weeks ago, so that alone hints at a smaller leap from 4.1 -> 4.5, compared to the leap from Sonnet 4 -> 4.5. That is, of course, if those version numbers represent anything but marketing, which I don't know.

mcintyre1994

3 months ago

Bizarrely they already call Opus 4.1 “legacy brainstorming model”.

steveklabnik

3 months ago

I had forgotten that, given that Sonnet pretty much blows Opus out of the water these days.

Yeah, given how multi-dimensional this stuff is, I assume it's supposed to indicate broad things, closer to marketing than anything objective. Still quite useful.

dheera

3 months ago

13 replies

I wonder what the next smaller model after Haiku will be called. "Claude Phrase"?

steveklabnik

3 months ago

1 reply

It's interesting to think about various aspects of marketing the models, with ChatGPT going the "internal router" direction due to address the complexity of choosing. I'd never considered something smaller than Haiku to be needed, but I also rarely used Haiku in the first place...

ACCount37

3 months ago

If you're going smaller than Haiku, you might be at the point of using various cheap open models already. The small model would need some good killer features to justify the margins.

clbrmbr

3 months ago

1 reply

Claude Koan

devnullbrain

3 months ago

KWATZ!

fnordsensei

3 months ago

Claude Garden Path Sentence

u8080

3 months ago

Claude Banger

stavros

3 months ago

Claude Groan.

chrisweekly

3 months ago

Claude Char

dotancohen

3 months ago

If they do come up with a tiny model tuned for generating conversion and code, I think that Claude Acronym would be a perfect name.

Razengan

3 months ago

Claude .

entanglr

3 months ago

Claude Punchline

WalterSear

3 months ago

Claude from Nantucket

grandpa

3 months ago

Claude Clause.

Brendinooo

3 months ago

Claude Couplet

senko

3 months ago

Claude Glyph.

Smallest, fastest model yet, ideally suited for Bash oneliners and online comments.

sharkjacobs

3 months ago

2 replies

My impression is that Sonnet and Haiku 4.5 are the same "base models" as Sonnet and Haiku 4, the improvements are from fine tuning on data generated by Opus.

I'm a user who follows the space but doesn't actually develop or work on these models, so I don't actually know anything, but this seems like standard practice (using the biggest model to finetune smaller models)

Certainly, GPT-4 Turbo was a smaller model than GPT-4, there's not really any other good explanation for why it's so much faster and cheaper.

The explicit reason that OpenAI obfuscates reasoning tokens is to prevent competitors from training their own models on them.

sharkjacobs

3 months ago

1 reply

Which is all to say that I think the reason they went from Opus 3 to Opus 4 is because there was no bigger model to fine tune Opus 3.5 with.

And I would expect Opus 4 to be much the same.

jascha_eng

3 months ago

1 reply

But sonnet 4.5 outperforms opus 4 on most benchmarks and tasks that can't be all that's to it

sharkjacobs

3 months ago

that's not all there is to it, but I think that "the rest of it" is just additional fine tuning.

Benchmarks are good fixed targets for fine tuning, and I think that Sonnet gets significantly more fine tuning than Opus. Sonnet has more users, which is a strategic reason to focus on it, and it's less expensive to fine tune, if API costs of the two models are an indicator.

qudat

3 months ago

These frontier model companies are bootstrapping their work by using models to improve models. It’s a mechanism to generate fake training data. The rationale is the teacher model is already vetted and aligned so it can reliably “mock” data. A little human data gets amplified.

gwd

3 months ago

Opus disappeared for quite a while and then came back. Presumably they're always working on all three general sizes of models, and there's some combination of market need and model capabilities which determine if and when they release any given instance to the public.

ericbrow

3 months ago

3 replies

Was anyone else slightly disappointed that this new product doesn't respond in Haiku, as the name would imply?

simonw

3 months ago

If you want to see it generate a Haiku from your webcam I just upgraded my silly little bring-your-own-key Haiku app to use the new model: https://tools.simonwillison.net/haiku

dpoloncsak

3 months ago

Wasn't there a 3.5 haiku too?

https://aws.amazon.com/about-aws/whats-new/2024/11/anthropic...

esafak

3 months ago

It's not a new product; just a new version.

simonw

3 months ago

8 replies

Pretty cute pelican on a slightly dodgy bicycle: https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...

bradgessler

3 months ago

4 replies

I’m surprised none of the frontier model companies have thrown this test in as an Easter egg.

ahofmann

3 months ago

1 reply

simonw has other prompts, that are undisclosed. So cheating on this prompt will be catched.

beefnugs

3 months ago

1 reply

What? you and I cant see his "undisclosed" tests... but you better be sure that whatever model he is testing is specifically looking for these tests coming in over the api, or you know, absolutely everything for the cops

Legend2440

3 months ago

You are welcome to test it yourself with whatever svg you want.

I am quite confident that they are not cheating for his benchmark, it produces about the same quality for other objects. Your cynicism is unwarranted.

HDThoreaun

3 months ago

1 reply

All of hacker news(and simons blog) is undoubtedly in the training data for LLMs. If they specifically tried to cheat at this benchmark it would be obvious and they would be called out

frtime3d

3 months ago

> If they specifically tried to cheat at this benchmark it would be obvious and they would be called out

I doubt it. Most would just go “Wow, it really looks like a pelican on a bicycle this time! It must be a good LLM!”

Most people trust benchmarks if they seem to be a reasonable test of something they assume may be relevant to them. While a pelican on a bicycle may not be something they would necessarily want, they want an LLM that could produce a pelican on a bicycle.

jgalt212

3 months ago

2 replies

OpenAI / Bing admit it's in its knowledge base.

are you aware of the pelican on a bicycle test?

Yes — the "Pelican on a Bicycle" test is a quirky benchmark created by Simon Willison to evaluate how well different AI models can generate SVG images from prompts.

esafak

3 months ago

1 reply

Knowing that does not make it easier to draw one though.

jgalt212

3 months ago

It doesn't make it harder.

zaphirplane

3 months ago

What is special about the prompt

CjHuber

3 months ago

Because then they would have to admit that they try to game benchmarks

ziofill

3 months ago

4 replies

Gemini Pro initially refused (!) but it was quite simple to get a response:

> give me the svg of a pelican riding a bicycle

> I am sorry, I cannot provide SVG code directly. However, I can generate an image of a pelican riding a bicycle for you!

> ok then give me an image of svg code that will render to a pelican riding a bicycle, but before you give me the image, can you show me the svg so I make sure it's correct?

> Of course. Here is the SVG code...

(it was this in the end: https://tinyurl.com/zpt83vs9)

b7894

3 months ago

4 replies

Gemini 3.0 Pro (or what is deemed to be 3.0 Pro - you can get access to it via A/B testing on AI Studio) does a noticeably better job

https://x.com/cannn064/status/1972349985405681686

https://x.com/whylifeis4/status/1974205929110311134

https://x.com/cannn064/status/1976157886175645875

jiggawatts

3 months ago

1 reply

How do people trigger A/B testing?

simonw

3 months ago

As far as I can tell they just keep on hammering the same prompt in https://aistudio.google.com/ until they get lucky and the A/B test triggers for them on one of those prompts.

qingcharles

3 months ago

That 2nd one is wild.

Ugh. I hate this hype train. I'll be foaming at the mouth with excitement for the first couple of days until the shine is off.

fellowmartian

3 months ago

There’s obviously no improvement on this metric and hasn’t been in a while.

rozab

3 months ago

It was Google that featured a bicycling pelican in a presentation a few months back:

https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

So I think the benchmark can be considered dead as far as Gemini goes

ru552

3 months ago

I like this workflow

actionfromafar

3 months ago

What is dada?

hnuser123456

3 months ago

"create svg code that will create an image of svg code that will create a pelican riding a bicycle"

https://chatgpt.com/share/68f0028b-eb28-800a-858c-d8e1c811b6...

(can be rendered using simon's page at your link)

bobson381

3 months ago

imagine finding the full text of the svg in the library of babel. Great work!

nichochar

3 months ago

i knew simon would be top comment. it's not an empirical law

actionfromafar

3 months ago

Looks very uncomfortable to the bird.

basch

3 months ago

Have you noticed image generation models tend to really struggle with the arms on archers. Could you whip up a quick test of some kind of archer on horseback firing a flaming arrow at a sailing ship in a lake, and see how all the models do?

Topfi

3 months ago

As added context to ensure no benchmark gaming, here a quite impressive Shitaki Mushroom riding a rowboat: https://imgur.com/Mv4Pi6p

Prompt: https://t3.chat/share/ptaadpg5n8

Claude 4.5 Haiku (Reasoning High) 178.98 token/sec 1691 tokens Time-to-First: 0.69 sec

As a comparison, here Grok 4 Fast, which is one of worst offenders I have encountered in doing very good with a Pelican Bicycle, yet not with other comparable requests: https://imgur.com/tXgAAkb

Prompt: https://t3.chat/share/dcm787gcd3

Grok 4 Fast (Reasoning High) 171.49 token/sec 1291 tokens Time-to-First: 4.5 sec

And GPT-5 for good measure: https://imgur.com/fhn76Pb

Prompt: https://t3.chat/share/ijf1ujpmur

GPT-5 (Reasoning High) 115.11 tok/sec 4598 tokens Time-to-First: 4.5 sec

These are very subjective, naturally, but I personally find Haiku with those spots on the mushroom rather impressive overall. In any case, the delta between publicly known benchmark and modified scenarios evaluating the same basic concepts continues to be smallest with Anthropic models. Heck, sometimes I've seen their models outperform what public benchmarks indicated. Also, seems Time-to-first on Haiku is another notable advantage.

btown

3 months ago

Context on this cutting-edge benchmark for those unaware:

https://simonwillison.net/2025/Jun/6/six-months-in-llms/

https://simonwillison.net/tags/pelican-riding-a-bicycle/

Full verbose documentation on the methodology: https://news.ycombinator.com/item?id=44217852

leetharris

3 months ago

1 reply

The main thing holding these Anthropic models back is context size. Yes, quality deteriorates over a large context window, but for some applications, that is fine. My company is using grok4-fast, the Gemini family, and GPT4.1 exclusively at this point for a lot of operations just due to the huge 1m+ context.

Tiberium

3 months ago

2 replies

Is your company Tier 4? Anthropic has had 1M context size in beta for some time now.

https://docs.claude.com/en/docs/build-with-claude/context-wi...

leetharris

3 months ago

Only for Sonnet. No 1m for Haiku (this new model) and Opus.

This means 2.5 Flash or Grok 4 fast takes all the low end business for large context needs.

_ink_

3 months ago

Is it possible to get that in Claude Code with Pro? Or is it already a 1M context window?

Topfi

3 months ago

4 replies

Very preliminary testing is very promising, seems far more precise in code changes over GPT-5 models in not ingesting irrelevant to the task at hand code sections for changes which tends to make GPT-5 as a coding assistant take longer than sometimes expected. With that being the case, it is possible that in actual day-to-day use, Haiku 4.5 may be less expensive than the raw cost breakdown may appear initially, though the increase is significant.

Branding is the true issue that Anthropic has though. Haiku 4.5 may (not saying it is, far to early to tell) be roughly equivalent in code output quality compared to Sonnet 4, which would serve a lot users amazingly well, but by virtue of the connotations smaller models have, alongside recent performance degradations making users more suspicious than beforehand, getting these do adopt Haiku 4.5 over Sonnet 4.5 even will be challenging. I'd love to know whether Haiku 3, 3.5 and 4.5 are roughly in the same ballpark in terms of parameters and course, nerdy old me would like that to be public information for all models, but in fairness to companies, many would just go for the largest model thinking it serves all use cases best. GPT-5 to me is still most impressive because of its pricing relative to performance and Haiku may end up similar, though with far less adoption. Everyone believes their task requires no less than Opus it seems after all.

For reference:

Haiku 3: I $0.25/M, O $1.25/M

Haiku 4.5: I $1.00/M, O $5.00/M

GPT-5: I $1.25/M, O $10.00/M

GPT-5-mini: I $0.25/M, O $2.00/M

GPT-5-nano: I $0.05/M, O $0.40/M

GLM-4.6: I $0.60/M, O $2.20/M

deadbabe

3 months ago

2 replies

Those numbers don’t mean anything without average token usage stats.

distalx

3 months ago

1 reply

Exactly, token per dollar rates are useful, but without knowing the typical input output token distribution for each model on this specific task, the numbers alone don’t give a full picture of cost.

deadbabe

3 months ago

1 reply

That’s how they lie to us. Companies can advertise cheap prices to lure you in but they know very well how many tokens you’re going to use on average so they will still make more profit than ever, especially if you’re using any kind of reasoning model which is just like a blank check for them to print money.

solumunus

3 months ago

I don’t think any of them are profitable are they? We’re in the losing money to gain market share phase of this industry.

Topfi

3 months ago

1 reply

Fair point of course and it is still far to early to make a definitive statement, but in my still limited experience throughout the night, I have seen Haiku 4.5 be far better in using what I'd consider a justifiable amount of input tokens over e.g. GPT-5 models. Sonnets recent versions also had been better on this front over OpenAIs current best, but I try (not always succeed) to take prior experience and expectation out of the equation when evaluating models.

Additionally, the AA cost to run benchmark suite numbers are very encouraging [0] and Haiku 4.5 without reasoning is always an option too. Tested that even less, but there is some indication that reasoning may not be necessary for reasonable output performance [1][2][3].

In retrospect, I perhaps would have been served better starting with "reasoning" disabled, will have to do some self-blinded comparisons between model outputs over the coming weeks to rectify that. Am trying my best not to make a judgement yet, but compared to other recent releases, Haiku 4.5 has a very interesting, even distribution.

GPT-5 models were and continue to be encouraging for price/performance with a reliable 400k window and good adherence to prompts with multi minute (beyond 10) adherence, but from the start weren't the fastest and ingests every token there is in a code base with reckless abandon.

No Grok model ever performed for me like they seem to during the initial hype

GLM-4.6 is great value but still not solid enough for tool calls, not that fast, etc. so if you can afford something more reliable I'd go for that, but encouraging.

Recent Anthropic releases were good at code output quality, but not as reliable beyond 200k vs GPT-5, not exactly fast either when looking at token/sec, though task completion generally takes less time due to more efficient ingestion vs GPT-5 and of course rather expensive.

Haiku 4.5, if they can continue to offer it at such speeds with such low latency and at this price, cupeled with encouraging initial output quality and efficient ingestion of repos seems to be designed in a far more balanced manner, which I welcome. Course with 200k being a hard limit, that is a clear downside compared to GPT-5 (and Gemini 2.5 Pro though that has its own reliability issues in tool calling) and I have yet to test whether it can go beyond 8 min on chains of tool calls with intermittent code changes without suffering similar degradation to other recent Anthropic models, but I am seeing the potential for solid value here.

[0] https://artificialanalysis.ai/?models=gpt-5-codex%2Cgpt-5-mi...

[1] Claude 4.5 Haiku 198.72 tok/sec 2382 tokens Time-to-First: 1.0 sec https://t3.chat/share/35iusmgsw9

[2] Claude 4.5 Haiku 197.51 tok/sec 3128 tokens Time-to-First: 0.91 sec https://t3.chat/share/17mxerzlj1

[3] Claude 4.5 Haiku 154.75 tok/sec 2341 tokens Time-to-First: 0.50 sec https://t3.chat/share/96wfkxzsdk

camel_Snake

3 months ago

1 reply

> GLM-4.6 is great value but still not solid enough for tool calls, not that fast, etc. so if you can afford something more reliable I'd go for that, but encouraging.

Funny you should say that, because while it is a large model the GLM 4.5 is at the top of Berkley's Function Calling Leaderboard [0] and has one of the lowest costs. Can't comment on speed compared to those smaller models, but the Air version of 4.5 is similarly highly-ranked.

[0]https://gorilla.cs.berkeley.edu/leaderboard.html

Topfi

3 months ago

Gorilla is a great resource and it isn't unreasonable to suspect Z.AI has it in their data sets. I'd suspect most other frontier labs as well (pure speculation, but why not use it as a resource).

Problem is, while Gorilla was an amazing resource back in 2023 and continues to be a great dataset to lean on, but most ways we use LLMs in multi step tasks have since evolved greatly, not just with structured JSON (which GorillaOpenFunctionsV2, v4 eval does multi too), but more with the scaffolding around models (Claude Code vs Codex vs OpenCode, etc.). Likely why good performance with Gorilla doesn't necessarily map onto multiple step workloads with day-to-day tooling, which I tend to go for and reason why, despite there being FOSS options already, most labs either built their own coding assistant tooling (and most open source that too) or feel the need to fork others (Qwen with Geminis repo).

Purely speculative, but GLM-4.6 I evaluated using the same tasks as other models via Claude Code with their endpoint as that is what they advertise as the official way to use the model, same reason I use e.g. Codex for GPT-5. More focused on results in the best case, over e.g. using opencode for all models to give a more level playing field.

Topfi

3 months ago

3 replies

Update, Haiku 4.5 is not just very targeted in terms of changes but also really fast. Averaging at 220token/sec is almost double most other models I'd consider comparable (though again, far to early to make a proper judgement) and if this can be kept up, that is a massive value add over other models. That is nearly Gemini 2.5 Flash Lite speed for context.

Yes, we got Groq and Cerebras getting up to 1000token/sec, but not with models that seem comparable (again, early, not a proper judgement). Anthropic has been historically the most consistent in outperforming personal benchmarks vs public benchmarks, for what that is worth so I am optimistic.

If speed, performance and pricing are something Anthropic can keep consistent long term (i.e. no regressions), Haiku 4.5 really is a great option for most coding tasks, with Sonnet something I'd tag in only for very specific scenarios. Past Claude models have had a deficiency in longer term chains of tasks. Beyond 7 minutes roughly, performance does appear to worsen with Sonnet 4.5, as an example. That could be an Achilles heel for Haiku 4.5 as well, if not this really is a solid step in terms of efficiency, but I have not done any longer task testing yet.

That being said, Anthropic once again has a rather severe issue it seems, casting a shadow upon this release. From what I am seeing and others are reporting, Claude Code currently does count Haiku 4.5 usage the same as Sonnet 4.5 usage, despite the latter being significantly more expensive. They also did not yet update the Claude Code support pages to reflect the new models usage limits [0]. I really think such information should be public by launch day and hope they can improve their tooling and overall testing, it really continues to throw a shadow over their impressive models.

[0] https://support.claude.com/en/articles/11145838-using-claude...

rbitar

3 months ago

1 reply

Where do you get the 220 token/second? Genuinely curious as that would be very impressive for a model comparable to sonnet 4. OpenRouter currently publishing around 116/tps[1]

[1] https://openrouter.ai/anthropic/claude-haiku-4.5

Topfi

3 months ago

2 replies

Was just about to post that Haiku 4.5 does something I have never encountered before [0], there is a massive delta between token/sec depending on the query. Some variance including task specific is of course nothing new, but never as pronounced and reproducible as here.

A few examples, prompted at UTC 21:30-23:00 via T3 Chat [0]:

Prompt 1 — 120.65 token/sec — https://t3.chat/share/tgqp1dr0la

Prompt 2 — 118.58 token/sec — https://t3.chat/share/86d93w093a

Prompt 3 — 203.20 token/sec — https://t3.chat/share/h39nct9fp5

Prompt 4 — 91.43 token/sec — https://t3.chat/share/mqu1edzffq

Prompt 5 — 167.66 token/sec — https://t3.chat/share/gingktrf2m

Prompt 6 — 161.51 token/sec — https://t3.chat/share/qg6uxkdgy0

Prompt 7 — 168.11 token/sec — https://t3.chat/share/qiutu67ebc

Prompt 8 — 203.68 token/sec — https://t3.chat/share/zziplhpw0d

Prompt 9 — 102.86 token/sec — https://t3.chat/share/s3hldh5nxs

Prompt 10 — 174.66 token/sec — https://t3.chat/share/dyyfyc458m

Prompt 11 — 199.07 token/sec — https://t3.chat/share/7t29sx87cd

Prompt 12 — 82.13 token/sec — https://t3.chat/share/5ati3nvvdx

Prompt 13 — 94.96 token/sec — https://t3.chat/share/q3ig7k117z

Prompt 14 — 190.02 token/sec — https://t3.chat/share/hp5kjeujy7

Prompt 15 — 190.16 token/sec — https://t3.chat/share/77vs6yxcfa

Prompt 16 — 92.45 token/sec — https://t3.chat/share/i0qrsvp29i

Prompt 17 — 190.26 token/sec — https://t3.chat/share/berx0aq3qo

Prompt 18 — 187.31 token/sec — https://t3.chat/share/0wyuk0zzfc

Prompt 19 — 204.31 token/sec — https://t3.chat/share/6vuawveaqu

Prompt 20 — 135.55 token/sec — https://t3.chat/share/b0a11i4gfq

Prompt 21 — 208.97 token/sec — https://t3.chat/share/al54aha9zk

Prompt 22 — 188.07 token/sec — https://t3.chat/share/wu3k8q67qc

Prompt 23 — 198.17 token/sec — https://t3.chat/share/0bt1qrynve

Prompt 24 — 196.25 token/sec — https://t3.chat/share/nhnmp0hlc5

Prompt 25 — 185.09 token/sec — https://t3.chat/share/ifh6j4d8t5

I ran each prompt three times and got (within expected variance meaning less than 5% plus or minus) the same token/sec results for the respective prompt. Each used Claude Haiku 4.5 with "High reasoning". Will continue testing, but this is beyond odd. I will add that my very early evals leaned heavily into pure code output, where 200 token/sec is consistently possible at the moment, but it is certainly not the average as claimed before, there I was mistaken. That being said, even across a wider range of challenges, we are above 160 token/sec and if you solely focus on coding, whether Rust or React, Haiku 4.5 is very swift.

[0] Normally not using T3 Chat for evals, just easier to share prompts this way, though was disappointed to find that the model information (token/sec, TTF, etc.) can't be enabled without an account. Also, these aren't the prompts I usually use for evals. Those I try to keep somewhat out of training by only using paid for API for benchmarks. As anything on Hacker News is most assuredly part of model training, I decided to write some quick and dirty prompts to highlight what I have been seeing.

cromulen

3 months ago

1 reply

That's what you get when you use speculative decoding and focus / overfit the draft model on coding. Then when the answer is out of distribution for the draft model, you get increased token rejections by the main model and throughput suffers. This probably still makes sense for them if they expect a lot of their load will come from claude code and they need to make it economical.

abhgh

3 months ago

I'm curious to know if Anthropic mentions anywhere that they use speculative decoding. For OpenAI they do seem to use it based on this tweet [1].

[1] https://x.com/stevendcoffey/status/1853582548225683814

rbitar

3 months ago

Interesting and if they are using speculative decoding that variance would make sense. Also your numbers line up with what openrouter is now publishing at 169.1tps [1]

Anthropic mentioned this model is more then twice as fast as claude sonnet 4 [2], which OpenRouter averaged at 61.72 tps for sonnet 4 [3]. If these numbers hold we're really looking at an almost 3x improvement in throughput and less then half the initial latency.

[1] https://openrouter.ai/anthropic/claude-haiku-4.5 [2] https://www.anthropic.com/news/claude-haiku-4-5 [3] https://openrouter.ai/anthropic/claude-sonnet-4

katchu11

3 months ago

2 replies

Hey! I work on the Claude Code team. Both PAYG and Subscription usage look to be configured correctly in accordance with the price for Haiku 4.5 ($1/$5 per M I/O tok).

Feel free to DM me your account info on twitter (https://x.com/katchu11) and I can dig deeper!

peddling-brink

3 months ago

3 replies

lol, I don’t know if you work there or not, but directing folks to send their account info to a random Twitter address is, not considered best practice.

squigz

3 months ago

1 reply

What's wrong with sending a username to someone?

lukeck

3 months ago

Generally, nothing inherently wrong with sending a username but directing people to a 3rd party social media platform rather than an official Anthropic email or support system does nothing to build trust that they actually work there.

rat9988

3 months ago

What best practice. He can choose whether he sends or not. The guy is just offering some extra help here.

ethbr1

3 months ago

Being charitable, let's assume parent wasn't talking about secrets.

Topfi

3 months ago

Thanks, sorry, only saw the offer now. Have just checked and cannot reproduce the usage any more, might have been mistaken on that front.

qingcharles

3 months ago

It's insanely fast. I didn't know it had even been released, but I went to select the copilot SWE test model in VSCode and it was missing and Haiku 4.5 was there instead. I asked for a huge change to a web app and the output from Haiku scrolled the text faster than Windows could keep up. From a cold start. Wrote a huge chunk of code in about 40 seconds. Unreal.

p.s. it also got the code 100% correct on the one-shot p.p.s. Microsoft are pricing it out at 30% the cost of frontier models (e.g. Sonnet 4.5, GPT5)

tosh

3 months ago

2 replies

One of the main issues I had with Claude Code (maybe it‘s the harness?) was that the agent tends to NOT read enough relevant code before it makes a change.

This leads to unnecessary helper functions instead of using existing helper functions and so on.

Not sure if it is an issue with the models or with the system prompts and so on or both.

stingraycharles

3 months ago

1 reply

You need to plan down tasks really to the function level, and review things.

miroljub

3 months ago

1 reply

Just writing code is often faster.

eagerpace

3 months ago

1 reply

Not when you stink at writing code but you’re really good at writing specs

miroljub

3 months ago

In that case you would be much more valued as a business analyst than as a developer.

People that can and want to write specs are very rare.

thomasfromcdnjs

3 months ago

1 reply

You might get better results with https://github.com/oraios/serena

I sometimes use it, but I've found just adding to my claude.md something like "if you ever refactor code, try search around the codebase to see if their is an existing function you can use or extend"

lelanthran

3 months ago

> I sometimes use it, but I've found just adding to my claude.md something like "if you ever refactor code, try search around the codebase to see if their is an existing function you can use or extend"

Wouldn't that consume a ton of tokens, though? After all, if you don't want it to recreate function `foo(int bar)`, it will need to find it, which means either running grep (takes time on large codebases) or actually loading all your code into context.

Maybe it would be better to create an index of your code and let it run some shell command that greps your ctags file, so it can quickly jump to the possible functions that it is considering recreating.

larodi

3 months ago

Been waiiting for the Haiku update as I still do a lot of dumb work with the old one, and it is darrn cheap for what you get out of it with smart prompting. Very neat they finally release this, updating all my bots... sorry agents :)

seunosewa

3 months ago

I'd like to see this price structure for Claude:

$5/mt for Haiku 4.5

$10/mt for Sonnet 4.5

$15/mt for Opus 4.5 when it's released.

baalimago

3 months ago

Ehh, expensive

85392_school

3 months ago

System card: https://assets.anthropic.com/m/99128ddd009bdcb/original/Clau... (edit: discussed here https://news.ycombinator.com/item?id=45596168)