What Makes 5% of AI Agents Work in Production?
Key topics
The article recaps a panel discussion on the challenges of deploying AI agents in production, with experts suggesting that 95% of deployments fail for lack of scaffolding around the models. The HN discussion centers on the validity of that claim and on the complexities of building reliable AI systems.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 4d after posting
- Peak period: 59 comments (108-120h window)
- Avg / period: 24.2
Based on 121 loaded comments
Key moments
- Story posted: Oct 2, 2025 at 6:30 PM EDT (3 months ago)
- First comment: Oct 6, 2025 at 10:01 PM EDT (4d after posting)
- Peak activity: 59 comments in the 108-120h window, the hottest stretch of the conversation
- Latest activity: Oct 8, 2025 at 11:36 PM EDT (3 months ago)
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
https://ai.meta.com/research/publications/cwm-an-open-weight...
I use LLMs every day of my life to make myself highly productive. But I do not use LLM tools to replace my decision trees.
If we had tech support for a toaster, you might see:
Without context, even the brightest people will not be able to fill in the gaps in your requirements. Context is not just nice-to-have, it's a necessity when dealing with both humans and machines.
I suspect that people who are good engineering managers will also be good at 'vibe coding'.
I have observed that those who have both technical and management experience seem to be more adept (or perhaps more willing?) at using LLMs in their daily life to good effect.
Of course what really helps, like in all things, is conscientiousness and an obsession for working through problems (if people don't like obsession then tenacity and diligence).
>We weren’t there to rehash prompt engineering tips.
>We talked about context engineering, inference stack design, and what it takes to scale agentic systems inside enterprise environments. If “prompting” is the tip of the iceberg, this panel dove into the cold, complex mass underneath: context selection, semantic layers, memory orchestration, governance, and multi-model routing.
I bet those four people love that the moderator took a couple notes and then asked ChatGPT to write a blog post.
As always, the number one tell of LLM output, besides the tone, is that by default it will never include links in the body of the post.
Why can’t anyone be bothered anymore to write actual content, especially when writing about AI, where your whole audience is probably already exposed to these patterns in content day in, day out?
It comes off as so cheap.
The real insight: have some fucking pride in what you make, be it a blog post, or a piece of software.
The businessmen's job will be complete when they've totally eliminated all pride from work.
Also, automation and pride can go hand in hand. Pride doesn't mean "make it by hand," that would be silly.
But it's a fallacy to apply it elsewhere and there are millions of examples where the free market failed to optimize a product.
No. Have you worked with businessmen? 90% of the time they're telling you to cut corners and leave things broken, to the point you have a janky mess that can be barely held together. And, right now, we're talking about a technology (LLMs) that is well known to introduce stupid but often hard to spot errors.
They don't want a pencil that's perfect. They want one that's just barely good enough to write with and that they can get maximum profit margin on.
And then, you know, there's the whole thing about life being more than output.
You're not getting it. It'd probably help if you stopped focusing on your pencil story, it's frankly off-topic.
To try one more time: You probably spend half your waking hours at work. The quality of that time is important to your well-being. Even if the businessmen sell you cheap, perfect pencils (which I do not grant), swimming in them in your off hours won't help with the other half of your time.
I've no idea what this italicisation is meant to do; nor why this is off-topic. Stating things isn't explaining them.
> Even if the businessmen sell you cheap, perfect pencils (which I do not grant), swimming in them in your off hours won't help with the other half of your time.
It helps in that I don't have to spend as much of my time working to buy pencils. It's the same with everything. There's no reason why a laptop doesn't cost $1m except that the incredible, detailed, cross-continent cooperative work is done by experts and coordinated by a market for that work driving costs down and quality up.
The brands that do have a claim to "perfection" necessarily had the pride to not participate in that race to the bottom.
"The real insight?"
0: https://en.wikipedia.org/wiki/Hypophora
The way I see it is that the majority of people never bothered to write actual content. Now there’s a tool the non-writers can use to write dubious content.
I would wager this tool is being used much differently by actual writers focused on producing quality. There are just far fewer of them, the same way there are fewer of any specialization.
The real question with AI to me is whether it will remain consistently better when wielded by a specialist who has invested their time into whatever the thing is they are producing. If that ever changes then we are doomed. When it’s no longer slop…
The tone of AI-written stuff sounds to me just like the soul-less SEO-optimized content marketing blog crap we saw the years before AI became a thing. Very prevalent on Linkedin too. It just sounds/reads so hopelessly artificial.
If I were to begin using AI to write stuff for me (comments or articles or whatever), I'd at least begin with having it train on the collection of everything I've written so far.
Perhaps they can be called vibe bloggers?
What bothers me compared to code is that for software, the code is just a means to an end. But for articles, it's much more than that.
I wonder how this will end up affecting our lives. Last week I saw a video that highlighted how AI is already affecting our vocabulary. It introduces words not typically used in American English (but common in Nigeria, where a lot of content writing is outsourced) into mainstream media.
I can totally see how this will slowly start affecting language itself.
> One panelist shared a personal story that crystallized the challenge: his wife refuses to let him use Tesla’s autopilot. Why? Not because it doesn’t work, but because she doesn’t trust it.
> Trust isn’t about raw capability, it’s about consistent, explainable, auditable behavior.
> One panelist described asking ChatGPT for family movie recommendations, only to have it respond with suggestions tailored to his children by name, Claire and Brandon. His reaction? “I don’t like this answer. Why do you know my son and my girl so much? Don’t touch my privacy.”
What I wonder is whether the author of the article recognized these patterns and didn’t care, didn’t even recognize them, or didn’t proofread the article?
> thanks, I used AI but aren't we all? I thought the point of AI is to get us to be more productive.
You've also repeatedly dismissed any criticism of the writing as "hate."
If you want readers to do you the favor of reading your work, please do them the favor of writing it.
Beyond the em dashes and overuse of "delve" etc., there is this distinctive style of composition I want to understand and recognize better.
"There’s a missing primitive here: a secure, portable memory layer that works across apps, usable by the user, not locked inside the provider. No one’s nailed it yet. One panelist said if he weren’t building his current startup, this would be his next one."
I find it annoying that, when prompting ChatGPT, Claude, Gemini, etc. on personal tasks through their chat interfaces, I have to provide the same context about myself and my job again and again to the different providers.
The memory functions of the individual providers now reduce some of that repetition, but it would be nice to have a portable personal-memory context (under my control, of course) that is shared with and updated semiautomatically by any AI provider I interact with.
As isoprophlex suggests in a sister comment, though, that would be hard to monetize.
Edit: Aaaand it’s gone.
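For illustration, a minimal sketch of what such a user-owned memory layer could look like, assuming a hypothetical local file format and helper functions (nothing here is an existing provider API):

```python
import json
from pathlib import Path

# Hypothetical user-owned memory file; the path and schema are illustrative.
MEMORY_PATH = Path.home() / ".ai_memory.json"

def load_memory() -> dict:
    """Load the portable memory, or start empty."""
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return {"facts": [], "preferences": []}

def as_context_block(memory: dict) -> str:
    """Render memory as plain text any provider's chat API can take as context."""
    lines = ["Known facts about the user:"]
    lines += [f"- {fact}" for fact in memory["facts"]]
    lines.append("Preferences:")
    lines += [f"- {pref}" for pref in memory["preferences"]]
    return "\n".join(lines)

def remember(memory: dict, fact: str) -> None:
    """Semiautomatic update: an app proposes a fact, the user approves it."""
    if fact not in memory["facts"]:
        memory["facts"].append(fact)
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))
```

The hard part, as the thread notes, is not the file format but getting every provider to read from and write to it instead of their own silo.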
Will someone please think of the MRR!
This isn't true. I've been using Gemini 2.5 a lot recently and I can't get it to stop adding links!
I added custom instructions: Do not include links in your output. At the start of every reply say "I have not added any links as requested".
It works for the first couple of responses but then it's back to loads of links again.
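Instruction-following tends to drift over a long chat; a deterministic post-processor is the more reliable way to enforce a rule like this. A minimal sketch (the regex is an assumption, not a complete URL grammar):

```python
import re

# Rough pattern for markdown links and bare URLs; illustrative, not exhaustive.
LINK_PATTERN = re.compile(r"\[([^\]]*)\]\([^)]*\)|https?://\S+")

def strip_links(model_output: str) -> str:
    """Remove links after generation, instead of hoping the model obeys."""
    # Keep the anchor text of markdown links, drop bare URLs entirely.
    return LINK_PATTERN.sub(lambda m: m.group(1) or "", model_output)

print(strip_links("See [the docs](https://example.com) or https://example.org"))
# -> "See the docs or "
```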
In other words, everything identified as what "the scaffolding" needs is what qualified people provide when delivering solutions to problems people want solved.
If I implement a strict parser and an output post-processor myself to guard against hallucinations, I have done 100% of the business-related logic. I can skip the LLM in the middle altogether.
You might even be able to put a UI on it that is a lot more effective than asking the user to type text into a box.
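A minimal sketch of that scaffolding, assuming the LLM is asked to return JSON in a fixed shape (the field names and action vocabulary are illustrative):

```python
import json

ALLOWED_ACTIONS = {"refund", "escalate", "close"}  # the whole business vocabulary

def parse_llm_reply(raw: str) -> dict:
    """Strict parser: reject anything that isn't exactly the expected shape."""
    reply = json.loads(raw)  # raises on malformed JSON
    if set(reply) != {"action", "ticket_id"}:
        raise ValueError(f"unexpected fields: {sorted(reply)}")
    if reply["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"hallucinated action: {reply['action']!r}")
    if not isinstance(reply["ticket_id"], int):
        raise ValueError("ticket_id must be an integer")
    return reply
```

Note that the whitelist and shape checks encode all of the business rules, which is exactly the commenter's point.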
Well said and I could not agree more.
might as well just write the ai agent part of the software yourself as well.
So...
The bot, to its credit, returns some decent results. But my guess is that it will be quite a while before we see it in prod, since a lot of these projects go from 0 to 80% in a week and from 80% to deployable in several years.
Text-to-SQL is the funniest example. It seems to be the "hello world" of agentic use in enterprise environments. It looks so easy, so clear, so straightforward. But just because the concept is easy to grasp (LLMs are great at generating markup or code, so let's have them translate natural language to SQL) doesn't mean it is easy to get right.
I have spent the past 3 months building a solution that actually bridges the stochastic nature of AI agents and the need for deterministic queries. And boy oh boy is that rabbit hole deep.
https://bird-bench.github.io/
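One common shape for that bridge is a deterministic gate in front of the warehouse: parse the generated SQL and reject anything outside a whitelist before it ever executes. A minimal sketch, assuming the sqlglot parsing library and illustrative table names:

```python
import sqlglot
from sqlglot import exp

ALLOWED_TABLES = {"orders", "customers"}  # illustrative schema whitelist

def validate_generated_sql(sql: str) -> str:
    """Gate stochastic output behind deterministic checks."""
    tree = sqlglot.parse_one(sql, read="postgres")  # raises on unparseable SQL
    if not isinstance(tree, exp.Select):
        raise ValueError("only SELECT statements are allowed")
    tables = {t.name for t in tree.find_all(exp.Table)}
    if not tables <= ALLOWED_TABLES:
        raise ValueError(f"query touches unknown tables: {tables - ALLOWED_TABLES}")
    return sql
```

That only covers the easy, structural half of the rabbit hole; the semantic half (does the query mean what the user asked?) is where the months go.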
A user having to come up with novel queries all the time to warrant text-to-SQL is a failure of product design.
But this is precisely why we're seeing startups build insane things fast while well established companies are still questioning if it's even worth it or not.
People got good results on the test datasets, but the test datasets had errors so the high performance was actually just the models being overfitted.
I don't remember where this was identified, but it's really recent, though before GPT-5.
Wait but this just sounds unhinged, why oh why
People don't know exactly what they want from the data warehouse, just a fuzzy approximation of it. You need stochastic software (AI) to map the imprecise instructions from your users to precise instructions the warehouse can handle.
60% of the time I spend writing SQL is probably validation. A single hallucinated assumption can blow the whole query. And there are questions that don't have clear modelling approaches that you have to deal with.
Plus, a lot of the SQL training data in LLMs is pretty bad, so I've not been impressed yet. Certainly not enough to let business users run an AI query agent unchecked.
I’m sure AI will get good at this, so I’m building up my warehouse knowledge base and putting together documentation as best I can. It’s just pretty awful today.
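One cheap layer of that validation is a dry run: ask the database to plan the query without executing it, which at least catches hallucinated tables and columns. A minimal sketch, assuming a Postgres connection through psycopg2:

```python
import psycopg2

def dry_run(conn, generated_sql: str) -> bool:
    """EXPLAIN parses and plans the query without running it."""
    try:
        with conn.cursor() as cur:
            cur.execute("EXPLAIN " + generated_sql)
        return True
    except psycopg2.Error:
        conn.rollback()  # a failed statement poisons the open transaction
        return False
```

This catches structural errors only; the hallucinated assumptions described above still need semantic review.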
Okay, how would that work though? Verified by who and calculated by what?
I need deets.
On the other side, you have an SQL query that calculates the revenue.
Compare the two. If the two disagree, get the AI to try again. If the AI is still wrong after 10 tries, just use the SQL output.
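Spelled out, that proposal is a reconciliation loop with the trusted query as fallback. A minimal sketch, where `ai_revenue` and `sql_revenue` are hypothetical callables standing in for the agent and the hand-written SQL:

```python
from typing import Callable

def reconciled_revenue(question: str,
                       ai_revenue: Callable[[str], float],
                       sql_revenue: Callable[[], float],
                       max_tries: int = 10) -> float:
    """Trust the agent's number only when it matches the hand-written SQL."""
    trusted = sql_revenue()                # known-good query, computed once
    for _ in range(max_tries):
        candidate = ai_revenue(question)   # stochastic agent call
        if abs(candidate - trusted) < 0.01:
            return candidate               # agreement: accept the AI's answer
    return trusted                         # still wrong after 10 tries: use the SQL
```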
What I hear is a billion dollar AI startup in the making!
So if you have a "CalculateQuarterRevenue(year, quarter)" function, you'll soon find your users asking for the data per-month. Or just for the last six weeks. Or just for a specific client. And they'll be confused when it doesn't work.
Compare this to basically any website you've ever been to. It's the "GUIs vs. CLIs" discussion all over again, except even CLIs had man pages for discoverability.
edit: I'm serious. I'm just answering the question, not making a value judgement.
Verbal queries are the solution for the world we have, even if they're not optimal.
The main killer app, I think, boils down to really expensive speech-to-text (and vice versa) with a reasonable number of seemingly authoritative querying details in fairly plain language. It's a new, 'better' search engine, just with different pitfalls people need to get up to speed on. And that may be enough, because employing humans to fill the same role as effectively is more expensive still.
If you said it’s something you made for perusal and reading? Then it reads like AI.
I’ve had to read tons of papers and articles, the most testing being conference submissions. I won’t read something with that structure unless I have to.
So you scaffold this up in 30 seconds but want me to read through it carefully? Cool, thanks.
End users (at my company) - Can your AI system look at numbers and find differences and generate a text description?
Pre-sales - (trying to clarify) For our systems to generate text it will be better if you give it some live examples so that it understands what text to generate.
End users - But there is supporting data (metadata) around the numbers. Can't your AI system just generate text?
Pre-sales - It can, but you need to provide context and examples. Otherwise it is going to generate generic text like "there is x difference".
End user - You mean I need to write comments manually first? That is too much work.
Now these users have a call with another product - MS Copilot.
This is really the truth of all things in life.
This is how it is being marketed, and I guess people are silly enough to believe marketing, so it's not too surprising.
Anyone that's been involved in data science roles in corporate environments knows that "the data" is usually forced into an exec's pre-existing understanding of a phenomenon. With AI, execs are really excited about "cutting out the middlemen", when the middlemen in the equation are very often their own paid employees. That's all fine and dandy in an abstract economic view, but it's sure something they won't say publicly (at least most won't).
In terms of potential cost cutting, it probably is the most recent "new magic". You used to have to pay a consultant, now you can "ask AI".
That's because it is marketed as magic. It's marketed as magic so people will adopt the thing before knowing its shortcomings.
https://pbfcomics.com/comics/the-masculator/
Conversational UIs are controversial, but I think there are a good number of websites where better search could be more central. Not generating text, but surfacing the most relevant text.
I’m thinking of a lot of library documentation, government info websites, etc. Basically an improvement over deep hierarchical navigation, where their way of organizing info is a leaky abstraction.
Maybe that will be one of the side effects of this AI boom. Who knows.
For example, the number comes from perceived successes and failures, not actual measurements. The customer conclusions are similarly vague: "it doesn't improve" or "it doesn't remember". That is literally buying into the hype of recursive self-improvement, completely oblivious to the fact that API consumers don't control model weights and so can't do much self-improvement beyond writing more CRUD layers. The other complaints are about integrations, which are totally valid, but some industries still run Windows XYZ without any API platforms, so that's not going away in those cases.
Point being, if the paper itself is not serious discourse, just well-marketed punditry, why should we debate the 5% number? It makes no sense.
> The teams that succeed don't just throw SQL schemas at the model. They build:
> - Business glossaries and term mappings
> - Query templates with constraints
> - Validation layers that catch semantic errors before execution
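For concreteness, a minimal sketch of how the first two of those layers can combine, with an illustrative glossary and template (the validation here is just a term whitelist):

```python
# Illustrative business glossary: user vocabulary -> warehouse vocabulary.
GLOSSARY = {
    "revenue": "SUM(order_total)",
    "last quarter": "date_trunc('quarter', now()) - interval '3 months'",
}

# Query template with constraints: the model fills slots, not free-form SQL.
TEMPLATE = "SELECT {metric} FROM orders WHERE order_date >= {period}"

def build_query(metric_term: str, period_term: str) -> str:
    """Map glossary terms into a constrained template; reject unknown terms."""
    if metric_term not in GLOSSARY or period_term not in GLOSSARY:
        raise ValueError("term not in business glossary")  # semantic check
    return TEMPLATE.format(metric=GLOSSARY[metric_term],
                           period=GLOSSARY[period_term])

print(build_query("revenue", "last quarter"))
```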
Unfortunately, the mixing of fluffy tone and high-level ideas is bound to be detested by hands-on practitioners.
And now we have an entire panel of bullshitters with an article-long theory about how to make LLMs program actually for real this time.
(Oh, and it would be great if journalists actually cited their public sources, instead of pretending they link to the article but actually linking to their review of related content.)
It's a big pet peeve of mine when an author states an opinion, with no evidence, as some kind of axiom. I think there is plenty of evidence that "the models aren't smart enough". Or to put it more accurately: it's an incredibly difficult problem to get a big productivity gain when an automated system is blatantly wrong ~1% of the time and those wrong answers are designed to look as much like right answers as possible.