The LLM Lobotomy?
Posted 4 months ago · Active 3 months ago
Source: learn.microsoft.com · Tech story · High profile
Sentiment: skeptical/negative · Debate: 80/100
Key topics
- LLMs
- AI
- Model Degradation
- Quantization
The post discusses a user's complaint that the performance of Azure-hosted LLMs has degraded over time, sparking a debate about the potential causes and implications of such degradation.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 22m after posting
- Peak period: 34 comments in 0-2h
- Average per period: 6.1 comments
Comment distribution: 61 data points (based on 61 loaded comments)
Key moments
- 01 Story posted: Sep 20, 2025 at 2:07 PM EDT (4 months ago)
- 02 First comment: Sep 20, 2025 at 2:29 PM EDT (22m after posting)
- 03 Peak activity: 34 comments in 0-2h, the hottest window of the conversation
- 04 Latest activity: Sep 21, 2025 at 3:16 PM EDT (3 months ago)
ID: 45315746 · Type: story · Last synced: 11/20/2025, 6:48:47 PM
Providers should keep timestamped models fixed, and assign modified versions a new timestamp, and price, if they want. The model with the "latest" tag could change over time, like a Docker image. Then we can make an informed decision over which version to use. Companies want to cost optimize their cake and eat it too.
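Something like this already exists in a partial form: several APIs expose dated snapshot names alongside a floating alias. Below is a minimal sketch of what pinning looks like with the OpenAI Python SDK, where the dated name plays the role of the fixed tag and the bare name behaves like "latest"; the specific model identifiers are illustrative, and the commenter's point is that the dated name should be genuinely immutable.

```python
# Sketch: pin a dated model snapshot instead of a floating alias.
# The snapshot names below are illustrative; check your provider's model
# list for the identifiers that actually exist on your account.
from openai import OpenAI

client = OpenAI()

PINNED_MODEL = "gpt-4o-2024-08-06"   # behaves like a fixed, timestamped tag
FLOATING_MODEL = "gpt-4o"            # behaves like a "latest" tag and may change under you

def ask(prompt: str, model: str = PINNED_MODEL) -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("Summarize why pinned model versions make evals reproducible."))
```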
edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.
> Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API.
- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)
The problem here is that most of these issues stem from broader infrastructure problems like numerical instability at inference time. Since this affects their whole serving pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each of their point releases, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so logistically hard as to be effectively impossible.
https://www.anthropic.com/engineering/a-postmortem-of-three-... https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
Heard of similar experiences from real-life acquaintances, where a prompt worked reliably for hundreds of requests per day for several months - and then the model suddenly started to make mistakes, ignore parts of the prompt, etc., when a newer model was released.
I agree, it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one - it might just be that fewer resources are allocated to the older model once the newer one is available, and so the inference parameters change - but some effect around the release of a newer model does seem to be there.
As for the original forum post:
- Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)
- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data
If it indeed did show a slow decline over time and OpenAI did not change the weights, then something does not add up.
If they do, I think that it will add a lot to this conversation. Hope it happens!
I asked them to share data/dates as much as that’s possible - fingers crossed
That said, I would also love to see some examples or data, instead of just "it's getting worse".
With that said, Microsoft has a different level of responsibility, both to its customers and to its stakeholders, to provide safety than OpenAI or any other frontier provider does. That's not a criticism of OpenAI or Anthropic or anyone else, who I believe are all trying their best to provide safe usage. (Well, other than xAI and Grok, for which the lack of safety is a feature, not a bug.)
The risk to Microsoft of getting this wrong is simply higher than it is for other companies, and that's why they have a strong focus on Responsible AI (RAI) [1]. I don't know the details, but I have to assume there's a layer of RAI processing on models served through Azure OpenAI that isn't there when using OpenAI models directly through the OpenAI API. That layer is valuable for the companies who choose to run their inference through Azure and who also want to maximize safety.
I wonder if that's where some of the observed changes are coming from. I hope the commenter posts their proof for further inspection. It would help everyone.
[1]: https://www.microsoft.com/en-us/ai/responsible-ai
What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.
I can tell you that what the post describes is exactly what I've seen as well: degraded performance and excruciating slowness.
https://learn.microsoft.com/en-us/azure/ai-foundry/openai/re...
That being said, we also keep a test suite to check that model updates don't result in worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because it stopped being able to complete tasks (on the same data) that it could previously. I don't blame Anthropic; I would be crazy to assume that new models are a strict improvement across all tasks and domains.
I just wish they would stop deprecating old models; once you have something working to your satisfaction it would be nice to freeze it. Ah well, only for local models.
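A rough sketch of what a regression gate like that might look like, assuming the Anthropic Python SDK since the commenter mentions Sonnet; the model id, tasks, and pass criteria are made up for illustration, not the commenter's actual suite.

```python
# Sketch of a regression gate run before adopting a new model version.
# The candidate model id, the tasks, and the checks are illustrative.
import anthropic
import pytest

client = anthropic.Anthropic()
CANDIDATE_MODEL = "claude-3-5-sonnet-20241022"  # candidate snapshot under test (illustrative)

def call_model(prompt: str) -> str:
    msg = client.messages.create(
        model=CANDIDATE_MODEL,
        max_tokens=256,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

CASES = [
    # fixed prompts over fixed data, each with a machine-checkable expectation
    ("Extract the invoice total from: 'Total due: $41.20'", lambda out: "41.20" in out),
    ("Give the ISO 8601 date for 'March 3rd, 2024'", lambda out: "2024-03-03" in out),
]

@pytest.mark.parametrize("prompt,check", CASES)
def test_candidate_still_completes_old_tasks(prompt, check):
    out = call_model(prompt)
    assert check(out), f"regression on fixed task {prompt!r}: got {out!r}"
```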
I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.
I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.
I think they have receipts, but did not post them there
Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.
But I use local models, sometimes the same ones for years already, and the consistency and predictability there are noteworthy, while I also have doubts about the quality consistency I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.
So perhaps it's just a matter of transparency, but I think there is ongoing fine-tuning occurring, alongside filters added and removed in an opaque way in front of the model.
Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.
What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.
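Since the forum post doesn't spell out its exact expected format, here is a minimal sketch of the kind of machine-checkable conformance test being described, assuming a hypothetical JSON contract; tracking the pass rate of a check like this over time is what would turn "it's getting worse" into a graph.

```python
# Sketch of a machine-checkable format test against a hypothetical JSON
# contract (the original post's actual format is not specified).
import json

REQUIRED_KEYS = {"label", "confidence"}  # illustrative contract

def conforms(output: str) -> bool:
    """Return True if the model output is JSON with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and REQUIRED_KEYS <= data.keys()
        and isinstance(data["label"], str)
        and isinstance(data["confidence"], (int, float))
    )

print(conforms('{"label": "spam", "confidence": 0.93}'))  # True
print(conforms("Sure! Here's the answer: spam"))          # False
```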
I think it would have sounded more reasonable in French, which is my actual native tongue. (i.e. I subconsciously translate from French when I'm writing in English)
((this comment was also written without AI!!)) :-)
I wonder how the causal graph looks here: do people (esp those working with LLMs a lot) lean towards LLM-speak over time, or both LLMs and native speakers picked up this very particular sentence structure from a common source? (eg a large corpus of French-English translations in the same style?)
I've been removing hyphens and bullet points from my own writing just to appear even less LLM-like! :)
Great stylistic chicken and egg question! French definitely tends to use certain (I’m struggling to not say “fancier”) words even in informal contexts.
I personally value using over-the-top ornate expressions in French: they both sound distinguished and a bit ridiculous, so I get to both ironically enjoy them and feel detached from them… but none of that really translates to casual English. :)
Cheers
The first issue I ran into was with them not supporting LLaMA for tool calls. Microsoft stated in February that they were working on it [0], and that they were closing the ticket only because they were tracking it internally. I'm not sure why, in over six months, they've been unable to do what took me two hours, but I am sure they wouldn't be upset by me using the much more expensive OpenAI models.
There are also consistent performance issues, even on small models, as mentioned elsewhere. This is with a request rate on the order of one per minute. You can solve that with provisioned throughput units, but the cheapest option is one of the GPT models, at a minimum of $10k/month (a bit under half the cost of just renting an A100 server). DeepSeek was a minimum of around $72k/month. I don't remember there being any other non-OpenAI models with a provisioned option.
Given that current usage without provisioning is approximately in the single dollars per month, I have some doubts as to whether we'd be getting our money's worth having to provision capacity.
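A quick back-of-the-envelope on that gap, treating "single dollars per month" as roughly $5 (an assumption, not a figure from the comment):

```python
# Back-of-the-envelope comparison of pay-as-you-go spend vs. the cheapest
# provisioned-throughput minimum cited above; $5/month is an assumed stand-in
# for "single dollars per month".
payg_monthly = 5.0       # assumed current pay-as-you-go spend ($/month)
ptu_minimum = 10_000.0   # cheapest provisioned option cited ($/month)

multiple = ptu_minimum / payg_monthly
print(f"Provisioned minimum is ~{multiple:.0f}x current spend")  # ~2000x
```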
Now, I can try the same things, and Claude gets it terribly wrong and works itself into problems it can't find its way out of.
The cynical side of me thinks this is being done on purpose, not to save Anthropic money, but to make more money by burning tokens.