Incident Report for Anthropic
Posted 4 months ago · Active 4 months ago
Source: status.anthropic.com · Tech story
Tone: heated, negative · Debate: 80/100
Key topics
Artificial Intelligence
Claude
Anthropic
Model Degradation
Anthropic released an incident report for a bug that caused degraded model quality in Claude, sparking controversy and frustration among users about the lack of transparency and reliability.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 3h after posting
Peak period: 29 comments (4-6h)
Avg per period: 6.4
Comment distribution: 70 data points
Based on 70 loaded comments
Key moments
- 01 Story posted: Sep 8, 2025 at 9:51 PM EDT (4 months ago)
- 02 First comment: Sep 9, 2025 at 12:48 AM EDT (3h after posting)
- 03 Peak activity: 29 comments in 4-6h, the hottest window of the conversation
- 04 Latest activity: Sep 10, 2025 at 3:39 AM EDT (4 months ago)
ID: 45176491 · Type: story · Last synced: 11/20/2025, 2:40:40 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
https://x.com/claudeai/status/1965208247302029728
I’d almost say it’s hard to understand how people don’t realize that Grok has all of the same power and incentive structures behind it as Anthropic’s cloud models.
Unfortunate timing, as I am rooting for Anthropic as the underdog, but I feel compelled to use whatever works best. Since mid-August I've demoted Claude to only putting out fires on UIs and am getting amazing results with GPT-5 for everything else. Given the nonstop capacity warnings in the Codex CLI, I might not be the only one.
Give me a break... Anthropic has never been the underdog. Their CEO is one of the most hypocritical people in the field. In the name of "safety" and "ethics", they have gotten away with not releasing even a single open-weight (or open-source) model, calling out OpenAI as the "bad guys", and constantly trying to sabotage pro-competition and pro-consumer AI laws in the US.
Define "bad". Sama is a businessman and at least doesn't pretend to be a saint like Amodei does.
Oh, let me just fix that! *comments out the test*
I now have a system of automated tripwires on all experimental scripts that notifies me and terminates the experiment when any sort of statistical irregularity is detected.
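A minimal sketch of what such a tripwire could look like, assuming a scripted experiment that emits one scalar metric per step; the z-score check, the threshold, and the `notify` hook are illustrative placeholders, not details from the comment above.

```python
import statistics
import sys


def notify(message: str) -> None:
    # Placeholder: wire this to email, Slack, PagerDuty, etc.
    print(f"[tripwire] {message}", file=sys.stderr)


class Tripwire:
    """Abort an experiment when a metric drifts too far from its baseline."""

    def __init__(self, baseline: list[float], z_threshold: float = 4.0):
        self.mean = statistics.mean(baseline)
        self.stdev = statistics.stdev(baseline) or 1e-9  # avoid divide-by-zero
        self.z_threshold = z_threshold

    def check(self, value: float) -> None:
        z = abs(value - self.mean) / self.stdev
        if z > self.z_threshold:
            notify(f"statistical irregularity: value={value:.4f}, z={z:.1f}")
            sys.exit(1)  # terminate the experiment


# Usage inside an experimental script:
#   wire = Tripwire(baseline=previous_run_scores)
#   for step_score in run_experiment():
#       wire.check(step_score)
```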
Removing the shown token consumption rates (which allowed understanding when tokens were actually being sent / received!) … sometimes hiding the compaction percentage … the incredible lag on ESC interruption in long-running sessions, the now-broken clearing of the context window content on TASK tool usage
Who the fuck is working on this software and do they actually use it themselves?
Maybe the quality of Claude Code on any given day is indicative of whether their models are degraded …
Needless to say, I built my own agent (it just needs a good web UI, last step!). The only thing keeping me with Anthropic right now is the economics of the plan; my inference bill would be a second mortgage without it.
I contribute to Happy Coder, an open source agent client (mobile app, desktop app, and web app). The project is just a UI layer for existing agents. Adding Codex-specific UI and plumbing last week was a 2,600-line diff that took a contributor 3 evenings. And it should be even less plumbing for the next agent.
I'm looking for developers to try integrating their agents and tell me what's broken or awkward. I'd appreciate feedback on where the abstractions leak or what's missing. Even with the friction of using someone else's codebase, it could be less work than starting from zero.
I keep seeing these proprietary clients charge $50/month for basically the same plumbing everyone needs. Selfishly I would like open source and free to win this category here.
GitHub: https://github.com/slopus/happy (MIT License)
Speaking of which, if you're interested in a CLI that creates a comprehensive refactoring plan for agents, I'm going to drop https://github.com/sibyllinesoft/valknut tomorrow once I polish the GitHub repo and create a product page for it on my site. It's helped me keep my agents on rails while refactoring, so that I can scale my AI workflows to larger codebases. It's available on crates.io as valknut-rs and on Homebrew as valknut.
- shitty voice to text (why not just use Whisper at this point?)
- clunky website
- no image/video generation models
- DeepResearch sucks big time
- "Extended Thinking" doesn't seem to do much "thinking". I get the same results without it.
- API too expensive for what it is.
- No open-weight model to boost their reputation. Literally every other player has released an open model at this point.
In Cursor I am seeing varying degrees of delay for On-Demand Usage after exhausting my points. Some days it works well; other days it just inserts a 30s wait on each message. What am I paying for? You never know when you buy.
And I think it was 100% on purpose that they degraded the model performance as Claude Code got so popular and they either ran out of capacity or were losing money too fast.
But now that people are fleeing to Codex, which improved so much in the meantime, they had to act.
If they're not losing money on inference, then why do they need to keep raising absurd amounts of money? If inference is profitable and they're still losing lots and lots of money, then training must be absurdly expensive, which means they're essentially investing in rapidly depreciating capital assets (the models), so not a good business.
I think Anthropic is an interesting case study here, as most of their volume is API and they don't have a very generous free tier (unlike OpenAI).
I recently heard someone say that ~"state of the art LLMs are the most rapidly depreciating asset in history."
This seemed accurate to me. Anyone else have thoughts on this?
But alas it's not. It looks like some intern whipped it together.
I thought Anthropic said they never mess with their models like this? Now they do it often?
Never seen or heard of (from people running services at scale, not just rumours) this kind of API behaviour change for the same model from OpenAI and Google. Gemini 2.5 Pro did materially change at the time of prod release despite them claiming they had simply "promoted the final preview endpoint to GA", but in that case you can give them the benefit of it being technically a new endpoint. Still lying, but less severe.
The only explanations were either internal system prompt changes, or updating the actual model. Since the only sharply different evals were those expecting 2.5k+ token outputs with all short ones remaining the same, and the consistency of the change was effectively 100%, it's unlikely to have been a stealth model update, though not impossible.
[1] https://news.ycombinator.com/item?id=44844311
I think that is compatible with making "changes intended to improve the efficiency and throughput of our models" - i.e. optimizing their inference stack, but only if they do so in a way that doesn't affect model output quality.
Clearly they've not managed to do that recently, but they are at least treating these problems as bugs and rolling out fixes for them.
That statement aged poorly. The recent incident report admits they "often" ship optimizations that affect "efficiency and throughput." Whether those tweaks touch weights, tensor-parallel layout, sampling or just CUDA kernels is academic to paying users: when downstream quality drops, we eat the support tickets and brand damage.
We don't need philosophical nuance about what counts as a "model change." We need a timestamped, versioned, machine-readable change log covering any modification that can shift outputs: weights, inference config, system prompt, temperature, top-P, KV-cache size, rollout percentage, the lot. If your internal evals caught nothing but users did, give us the diff and let us run our own regression tests.
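A hypothetical shape for that kind of machine-readable change log, sketched in Python; every field name and value here is an assumption for illustration, not anything Anthropic actually publishes.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ModelChangeEvent:
    """One entry in a hypothetical provider change log."""
    timestamp: str            # ISO 8601, e.g. "2025-09-05T14:00:00Z"
    model: str                # served model identifier
    change_type: str          # "weights" | "inference_config" | "system_prompt" | ...
    weights_hash: str         # content hash of the deployed checkpoint
    system_prompt_hash: str   # hash of any server-side prompt
    temperature: float
    top_p: float
    kv_cache_size_tokens: int
    rollout_percent: float    # percent of traffic on the new configuration
    notes: str


event = ModelChangeEvent(
    timestamp="2025-09-05T14:00:00Z",
    model="claude-sonnet-4",
    change_type="inference_config",
    weights_hash="sha256:<unchanged>",
    system_prompt_hash="sha256:<unchanged>",
    temperature=1.0,
    top_p=0.95,
    kv_cache_size_tokens=200_000,
    rollout_percent=25.0,
    notes="Throughput optimization on a subset of serving nodes.",
)

# Consumers could diff these JSON records and trigger their own regression runs.
print(json.dumps(asdict(event), indent=2))
```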
History proves inference changes can drastically alter outputs. When gpt-oss launched, providers serving identical weights delivered wildly different output quality because of differences in their inference configurations.
We need transparency about all changes, whether to model weights or infrastructure. Anthropic's eval suite clearly missed this real-world regression. Proactive change notifications would let us run our own evals to prevent failures. Without that, we're forced to troubleshoot reactively, an unacceptable risk for production systems.
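A rough sketch of the "run our own evals" idea, assuming a `complete()` wrapper around whichever provider API you use and a `score()` grader; both are placeholders, as are the file names and the tolerance.

```python
import json

TOLERANCE = 0.05  # how much mean-score drop we tolerate before alarming


def complete(prompt: str) -> str:
    raise NotImplementedError("call your provider's API here")


def score(output: str, reference: str) -> float:
    raise NotImplementedError("exact match, rubric grading, LLM-as-judge, etc.")


def run_regression(prompts_path: str, baseline_path: str) -> None:
    # prompts: [{"prompt": ..., "reference": ...}, ...]; baseline: {"mean_score": ...}
    with open(prompts_path) as f:
        prompts = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    scores = [score(complete(p["prompt"]), p["reference"]) for p in prompts]
    mean = sum(scores) / len(scores)
    if mean < baseline["mean_score"] - TOLERANCE:
        raise RuntimeError(
            f"possible model regression: {mean:.3f} vs baseline {baseline['mean_score']:.3f}"
        )


# Run this whenever the provider announces a change (or on a schedule today,
# since such announcements don't exist yet).
if __name__ == "__main__":
    run_regression("regression_prompts.json", "baseline_scores.json")
```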
As someone who has implemented this myself, I know that it's pretty easy to make innocent mistakes there, and the only visible result is a tiny distortion of the output distribution that only really shows up after analysing thousands of tokens. And I would assume that all providers are using speculative decoding by now, because it's the only way to have good inference speed at scale.
As a quick recap, you train a small model to quickly predict the easy tokens, like filler words, so that you can jump over them in the recurrent decoding loop. That way, a serial model can predict multiple tokens per invocation, thereby easily doubling throughput.
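A toy greedy-decoding sketch of that loop; `draft_next` and `target_next` are stand-ins for the small and large models, and in a real system the verification step is one batched forward pass of the target model rather than a call per position (which is where the speedup actually comes from).

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # returns the greedy next token for a prefix


def speculative_decode(
    prefix: List[int],
    draft_next: NextToken,   # small, cheap model
    target_next: NextToken,  # large model whose output we must match
    k: int = 4,
    max_new_tokens: int = 64,
) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        # 1. The draft model guesses k tokens cheaply (the "easy" ones).
        guesses, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            guesses.append(t)
            ctx.append(t)

        # 2. The target model verifies; keep the longest agreeing prefix.
        accepted = 0
        for i, g in enumerate(guesses):
            if target_next(out + guesses[:i]) == g:
                accepted += 1
            else:
                break
        out.extend(guesses[:accepted])

        # 3. On a mismatch, take one token from the target model so the loop
        #    always makes progress; with greedy decoding the final output is
        #    identical to running the target model alone, just faster.
        if accepted < k:
            out.append(target_next(out))
    return out[: len(prefix) + max_new_tokens]
```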
And the fact that they need lots of user tokens to verify that it works correctly would nicely explain why it took them a while to find and fix the issue.
In other words: The speculative decoding causes "holes" in your beam search data. You can fill them by sampling more, increasing hosting costs. Or you fill them with approximations, but that'll skew the results to be more "safe" => more generic, less reasoning.
And this bad memory might stick for a while.
Sure. I give it a few hours until the prolific promoters start to parrot this apologia.
Don't forget: the black box nature of these hosted services means there's no way to audit for changes to quantization and model re-routing, nor any way to tell what you're actually getting during these "demand" periods.
I may be in a minority but I am still quite bullish on them as a company. Even with GPT-5 out they still seem to have a monopoly on taste - Claude is easily the most "human" of the frontier models. Despite lagging in features compared to ChatGPT Web, I mostly ask Claude day-to-day kinds of questions. It's good at inferring my intent and feels more like a real conversation partner. Very interested to see their next release.
Curious why they can’t run some benchmarks with the model (if they suspect the issue is with the model itself) or some agentic coding benchmarks on Claude Code (if the issue might be with the scaffolding, prompts, etc.).
Before I finally gave up on Claude Code, I noticed that I got more aggressive towards it the more stupid it got, as I could not believe how dumb it had become.
And I am sure I was not the only one.
I want to know how I could have been impacted.