Opus 4.5 Is Not the Normal AI Agent Experience That I Have Had Thus Far
Key topics
The AI coding revolution is here, or so it seems, as users rave about Opus 4.5's impressive capabilities, with some commenters hailing it as a game-changer for the programming profession. However, not everyone is convinced, with critics pointing out that the current output isn't production-ready and lacks edge case testing, security audits, and maintainability. As some commenters debate the potential for continued progress, others are already exploring the implications of relying on AI-generated code, with some arguing that it's a seismic shift and others warning that it's being oversold. Amidst the discussion, a surprising consensus emerges: even if progress stalls, the programming landscape has already been forever altered.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 7m after posting
- Peak period: 116 comments in 0-6h
- Avg / period: 20 comments
Based on 160 loaded comments
Key moments
- Story posted: Jan 6, 2026 at 12:45 PM EST (3d ago)
- First comment: Jan 6, 2026 at 12:52 PM EST (7m after posting)
- Peak activity: 116 comments in 0-6h (the hottest window of the conversation)
- Latest activity: Jan 9, 2026 at 1:41 PM EST (4h ago)
Yes, Opus 4.5 seems great, but most of the time it tries to vastly overcomplicate a solution. Its answer will be 10x harder to maintain and debug than the simpler solution a human would have created by thinking about the constraints of keeping the code working.
Over time, I imagine even cloud providers, app stores etc can start doing automated security scanning for these types of failure modes, or give a more restricted version of the experience to ensure safety too.
I didn't write off an entire field of research, but rather want to highlight that these aren't intractable problems for AI research, and that we can actually start codifying many of these things today using the skills framework to close up edges in the model training. It may not be 100% but it's not 0%.
I predict in 2026 we're going to see agents get better at running their own QA, and also get better at not just disabling failing tests. We'll continue to see advancements that will improve quality.
And it turns out the quality of output you get from both the humans and the models is highly correlated with the quality of the specification you write before you start coding.
Letting a model run amok within the constraints of your spec is actually great for specification development! You get instant feedback of what you wrongly specified or underspecified. On top of this, you learn how to write specifications where critical information that needs to be used together isn't spread across thousands of pages - thinking about context windows when writing documentation is useful for both human and AI consumers.
I can’t get past that by the time I write up an adequate spec and review the agents code, I probably could have done it myself by hand. It’s not like typing was even remotely close to the slow part.
AI, agents, etc are insanely useful for enhancing my knowledge and getting me there faster.
Stuff that seems basic, but that I haven't always been able to count on in my teams' "production" code.
Maintain and debug by who? It's just going to be Opus 4.5 (and 4.6...and 5...etc.) that are maintaining and debugging it. And I don't think it minds, and I also think it will be quite good at it.
something like code-simplifier is surprisingly useful (as is /review)
https://x.com/bcherny/status/2007179850139000872
It means that it is going to be as easy to create software as it is to create a post on TikTok, and making your software commercially successful will be basically the same task (with the same uncontrollable dynamics) as whether or not your TikTok post goes viral.
If anything, this example shows that these CLI tools give regular devs much higher leverage.
There's a lot of software labor that is like, go to the lowest cost country, hire some mediocre people there and then hire some US guy to manage them.
That's the biggest target of this stuff, because now that US guy can just get code of equal or higher quality and output without the coordination cost.
But unless we get to the point where you can do what I call "hypercode" I don't think we'll see SWEs as a whole category die.
Just like we don't understand assembly but still need technical skills when things go wrong, there's always value in low level technical skills.
This is also my take. When the printing press came out, I bet there were scribes who thought, "holy shit, there goes my job!" But I bet there were other scribes who thought, "holy shit, I don't have to do this by hand any more?!"
It's one thing when something like weaving or farming gets automated. We have a finite need for clothes and food. Our desire for software is essentially infinite, or at least, it's not clear we have anywhere close to enough of it. The constraint has always been time and budget. Those constraints are loosening now. And you can't tell me that being able to wield a tool that makes me 10X more productive somehow diminishes my value.
So a lot of people will end up doing something different. Some of it will be menial and be shit, and some of it will be high level. New hierarchies and industries will form. Hard to predict the details, but history gives us good parallels.
I don't understand this argument. Surely the skill set involved in being a scribe isn't the same as being a printer, and possibly the personality that makes a good scribe doesn't translate to being a good printer.
So I imagine many of the scribes lost their income, and other people made money on printing. Good for the folks who make it in the new profession, sucks for those who got shafted. How many scribes transitioned successfully to printers?
Genuinely asking, I don't know.
If AI datacenters' hungry need for energy gets us to nuclear power, which gets us the energy to run desalination plants as the lakes dry up because the Earth is warming, hopefully we won't die of thirst.
I think people have been saying for a while that, as development tools have gotten better, the idea that a developer is a person who turns requirements into code is dead. You have to be able to operate at a higher level: do some of the work of developing requirements, figure out how to make two pieces of software work together, etc.
But the point is: obviously, at the extreme end, one CTO can't run Google, and probably one PM or engineer per product can't either. The question is how much mental load people can now take on. Google may start hiring fewer engineers (or maybe it becomes more cutthroat: hire the same number of engineers but keep them for much shorter stints, brutal up-or-out).
But essentially we're talking about complexity and mental load. So maybe it ends up being essentially the same number of teams, because teams exist because they're the right size, but the teams themselves are a lot smaller.
Not in terms of knowledge. That was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans and actually execute them.
You have to experience it yourself on your own real problems and over the course of days or weeks.
Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve and these weren’t easy. It wasn’t just about writing and testing code. It also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop.
In the traditional sense, I haven’t really coded privately at all in recent weeks. Instead, I’ve been guiding and directing, having it write specifications, and then refining and improving them.
Curious how this will perform in complex, large production environments.
How do you stop it from over-engineering everything?
It may end up working, but the thing is going to convolute APIs and abstractions and mix patterns basically everywhere
I can't teach it taste.
It's verbose by default but a few hours of custom instructions and you can make it code just like anyone
Once more people realize how easy it is to customize and personalize your agent, I hope they will move beyond what cookie-cutter Big AI like Anthropic and Google give you.
I suspect most won't though, because (1) it means you have to write human language, communication, and this weird form of persuasion, and (2) AI is gonna make a bunch of them lazy, and Big AI sold them on magic solutions that require no effort on your part (not true; there is a lot of customizing, and it pays huge dividends)
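Concretely, the customization being described usually lives in a project-level instructions file. A minimal, made-up sketch following Claude Code's CLAUDE.md convention (the rules below are illustrative, not a recommended set):

    # CLAUDE.md (excerpt)
    - Prefer the simplest solution that passes the tests; no speculative abstractions.
    - Keep functions under ~40 lines; extract helpers only when they are reused.
    - Never add a dependency without asking first.
    - Match the existing file layout and naming; do not reformat untouched code.
    - When unsure about intent, ask a clarifying question instead of guessing.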
When I use Claude Code I find that it *can* add a tremendous amount of value due to its ability to see my entire codebase at once, but the issue is that if I'm doing something where seeing my entire codebase would help, it blasts through my quota too fast. And if I'm tightly scoping it, it's just as easy and faster for me to use the website.
Because of this I've shifted back to the website. I find that I get more done faster that way.
Yes, feel free to blame me for the fact that these aren’t very business-realistic.
[0] https://github.com/s-macke/coding-agent-benchmark
[1] https://github.com/s-macke/weltendaemmerung
This is basically all my side projects.
If AI cost nothing and wasn't absolutely decimating our economy, I'd find what you've shared cute. However, we are putting literally all of our eggs, and the next generation's eggs, and the one after that, AND the one after that, into this one thing, which, I'm sorry, is so far away from everything that keeps on being promised to us that I can't help but feel extremely depressed.
1) Modern LLMs are an inflection point for coding.
2) The current LLM ecosystem is unsustainable.
This submission discussion is only about #1, which #2 does not invalidate. Even if the ecosystem crashes, then open-source LLMs that leverage the same tricks Opus 4.5 does will just be used instead.
So cool bro, you managed to ship a useless (except for your specific use-case) app to your iphone in an hour :O
What I think this is doing is it's pitting people against the fact that most jobs in the modern economy (mine included btw) are devoid of purpose. This is something that, as a person on the far left, I've understood for a long time. However, a lot (and I mean a loooooot) of people have never even considered this. So when they find that an AI agent is able to do THEIR job for them in a fraction of the time, they MUST understand it as the AI being some finality to human ingenuity and progress given the self-importance they've attributed to themselves and their occupation - all this instead of realizing that, you know, all of our jobs are useless, we all do the exact same useless shit which is extremely easy to replicate quickly (except for a select few occupations) and that's it.
I'm sorry to tell anyone who's reading this with a differing opinion, but if AI agents have proven revolutionary to your job, you produced nothing of actual value for the world before their advent, and still don't. I say this, again, as someone who beyond their PhD thesis (and even then) does not produce anything of value to the world, while being paid handsomely for it.
Definitely, it’s an unhealthy fixation.
> I'm sorry to tell anyone who's reading this with a differing opinion, but if AI agents have proven revolutionary to your job, you produced nothing of actual value for the world before their advent, and still don't.
I agree with this, but I think my take on it is a lot less nihilistic than yours. I think people vastly undersell how much effort they put into doing something, even if that something is vibecoding a slop app that probably exists. But if people are literally prompting claude with a few sentences and getting revolutionary results, then yes, their job was meaningless and they should find something to do that they’re better at.
But what frustrates me the most about this whole hype wave isn't just that the powers that be have bet the entire economy on a fake technology, it's that it's sucking all of the air out of the room. I think most people's jobs can actually provide value, and there's so much work to be done to make _real_ progress. But instead of actually improving the world, all the time, money, and energy is being thrown into such a wasteful technology that is actively making the world a worse place. I'm sure it's always been like this and I was just too naive to see it, but I much preferred it when at least the tech companies pretended they cared about the impact their products had on society rather than simply trying to extract the most value out of the same 5 ideas.
I really think we're just cooked at this point. The amount of people (some great friends whom I respect) that have told me in casual conversation that if their LLM were taken from them tomorrow, they wouldn't know how to do their work (or some flavour of that statement) has made me realize how deep the problem is.
We could go on and on about this, but let's both agree to try and look inward more and attempt to keep our own things in order, while most other people get hooked on the absolute slop machine that is AI. Eventually, the LLM providers will need to start ramping up the costs of their subscriptions, and maybe then people will start to click that the shitty code that was generated for their pointless/useless app is not worth the actual cost of inference (which some conservative estimates put at thousands of dollars per month on a subscription basis). For now, people are just putting their heads in the sand and assuming that physicists will somehow find a way to use quantum computers to speed up inference by a factor of 10^20 in the next years, while simultaneously slashing its costs (lol).
But hey, Opus 4.5 can cook up a functional app that goes into your emails and retrieves all outstanding orders - revolutionary. Definitely worth the many kWh and thousands of liters of water required, eh?
Cheers.
GPT-3 Da Vinci cost $20/million tokens for both input and output.
GPT-5.2 is $1.75/million for input and $14/million for output
I'd call that pretty strong evidence that they've been able to dramatically increase quality while slashing costs, over just the past ~4 years.
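Running the arithmetic on those quoted prices (a quick sketch; the figures are just the ones from the comment above):

    // $ per million tokens, as quoted above.
    const gpt3Davinci = { input: 20.0, output: 20.0 };
    const gpt52 = { input: 1.75, output: 14.0 };

    // How many times cheaper the newer model is per token.
    console.log((gpt3Davinci.input / gpt52.input).toFixed(1));   // "11.4"
    console.log((gpt3Davinci.output / gpt52.output).toFixed(1)); // "1.4"

So input tokens are roughly an order of magnitude cheaper, while output tokens dropped less sharply; the asymmetry matters most for workloads that generate a lot of output.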
1. The AI water issue is fake: https://andymasley.substack.com/p/the-ai-water-issue-is-fake (This one goes into OCD-levels of detail with receipts to debunk that entire issue in all aspects.)
2. LLMs are far, far more efficient than humans in terms of resource consumption for a given task: https://www.nature.com/articles/s41598-024-76682-6 and https://cacm.acm.org/blogcacm/the-energy-footprint-of-humans...
The studies focus on a single representative task, but in a thread about coding entire apps in hours as opposed to weeks, you can imagine the multiples involved in terms of resource conservation.
The upshot is, generating and deploying a working app that automates a bespoke, boring email workflow will be way, way, wayyyyy more efficient than the human manually doing that workflow every time.
Hope this makes you feel better!
I want to push back on this argument, as it seems suspect given that none of these tools are creating profit, and so require funds / resources that are essentially coming from the combined economic efforts of entire countries. I.e. the energy externalities here are monstrous and never factored into these things, even though these models could never have gotten off the ground if not for the massive energy expenditures that were (and continue to be) needed to sustain the funding for these things.
To simplify, LLMs haven't clearly created the value they have promised, but have eaten up massive amounts of value produced by anyone else. But that value produced by everyone else had energy costs too. Whether or not all this AI stuff ends up being more energy efficient than people needs to be measured on whether it actually delivers on its promises.
The thing is, in a vacuum this stuff is actually kinda cool. But hundreds of billions in debt-financed capex that will never see a return, and this is the best we've got? Absolutely cooked indeed.
This doesn’t logically follow. AI agents produce loads of value. Cotton picking was and still is useful. The cotton gin didn’t replace useless work. It replaced useful work. Same with agents.
If the AI thing does indeed come crashing down I expect there will be a whole lot of second-hand GPUs going for pennies on the dollar.
Or I'll start one myself, if the market fails to provide!
The number of projects being submitted to Product Hunt is 4x the year before.
The market is shrinking rapidly because more people now make their own apps.
Even when you make a typo and land on a website, there's a good chance it's selling more AI snake oil, yet none of these apps are feature complete, and they're easily beaten by apps made by guys in the 2010s (tldr & sketchbook for the drawing space).
The only way to excite the investors is to fake the ARR by giving free trials and selling before the recurring billing event occurs.
https://news.ycombinator.com/newsguidelines.html
That describes half of the current unicorn startups nowadays.
I have some old, borderline-senile relatives writing apps (asking LLMs to write them) for their own personal use. Stuff they surely couldn't have done on their own (or had the energy to do). The extent of their programming background: shitty VBScript macros for Excel.
It also helps people to pick up programming and helps with the initial push of getting started. Getting over the initial hump, getting something on the screen so to speak.
Most things people want from their computers are simple shit that LLMs usually manage quite well.
Good question whether or not this (outsourcing their thinking) actually just accelerates their senility or not.
As someone who likes to solve hard or interesting technical problems, I'd often been disappointed, long before LLMs, that most of the time what people want from programmers is simple stupid shit (i.e., stuff I don't find interesting to work on).
The article is a fine opinion piece, but at what point are we going to either:
a) establish benchmarks that make sense and are reliable, or
b) stop with the hypecycle stuff?
How aren't current LLM coding benchmarks reliable?
If you can figure out how to create benchmarks that make sense, are reliable, correlate strongly to business goals, and don't get immediately saturated or contorted once known, you are well on your way to becoming a billionaire.
https://news.ycombinator.com/item?id=46443767
When I'm dying of dehydration because humanity has depleted all fresh water deposits, I'll think of you and your stupid NES emulator which is just an LLM-produced copy of many ones that had already existed.
I don't think OpenOffice/LibreOffice etc. have access to the source code for MS Office, and if they did, MS would be on them like a rash.
https://andymasley.substack.com/p/the-ai-water-issue-is-fake
However, Opus 4.5 is incredible when you give it everything it needs: a direction, what you have versus what you want. It will make it work; really, it will work. The code might be ugly, undesirable, and might only work for that one condition, but with further prompting you can evolve it and produce something you can be proud of.
Opus is only as good as the user and the tools the user gives to it. Hmm, that's starting to sound kind-of... human...
I now write very long specifications and this helps. I haven't figured out a bulletproof workflow, I think that will take years. But I often get just amazing code out of it.
It's always just the "Fibonacci" equivalent
They're going to find a lot of stuff to fix.
The few times I did go to a shop and ask for a check-up they didn’t find anything. Just an anecdote.
This reminds me of that example where someone asked an agent to improve a codebase in a loop overnight and they woke up to 100,000 lines of garbage [0]. Similarly you see people doing side-by-side of their implementation and what an AI did, which can also quite effectively show how AI can make quite poor architecture decisions.
This is why I think the "plan modes" and spec-driven development are so effective for agents: they help avoid one of their main weaknesses.
[0] https://gricha.dev/blog/the-highest-quality-codebase
I had this long discussion today with a co-worker about the merits of detailed queries with lots of guidance .md documents, vs just asking fairly open ended questions. Spelling out in great detail what you want, vs just generally describing what you want the outcomes to be in general then working from there.
His approach was to write a lot of agent files spelling out all kinds of things like code formatting style, well defined personas, etc. And here's me asking vague questions like, "I'm thinking of splitting off parts of this code base into a separate service, what do you think in general? Are there parts that might benefit from this?"
I have a hunch if you asked which approach we took based on background, you'd think I was the one using the detailed prompt approach and him the vague.
Your approach is also very similar to spec driven development. Your spec is just a conversation instead of a planning document. Both approaches get ideas from your brain into the context window.
So you gave it a poorly defined task, and it failed?
Have you tried the planning mode? Ask it to review the codebase and identify defects, but don't let it make any changes until you've discussed each one or each category and planned out what to do to correct them. I've had it refactor code perfectly, but only when given examples of exactly what you want it to do, or given clear direction on what to do (or not to do).
I wouldn't be surprised to find out that they will find issues infinitely, if looped with fixes.
The above is for vibe coding; for taking the wheel, I can only use Opus because I suck at prompting codex (it needs very specific instructions), and codex is also way too slow for pair programming.
Why not the other way around? Have the quick brown fox churn out code, and have codex review it, guide changes, and loop?
I've actually gone one step further down the delegation. I use Opus/Gemini 3 to plan, review, and edit the plan for a few steps. Then I write it out to .md files. Then I have GLM implement it (I got a cheap plan for like $28 for a year at Christmas). Then the code this produces gets reviewed and fixed if needed by Opus. Final review by Codex (for some reason it's very good at review, especially if you have solid checkboxes for it to check during review). Seems to work so far.
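A rough sketch of that relay as a script; every command name and flag below is an assumption standing in for the Opus/GLM/Codex steps described, so substitute whatever your agent CLIs actually accept:

    // Hypothetical plan -> implement -> review relay, shelling out to agent CLIs.
    import { execFileSync } from "node:child_process";

    function runAgent(cmd: string, args: string[]): string {
      // Run one agent CLI step and capture its output.
      return execFileSync(cmd, args, { encoding: "utf8" });
    }

    runAgent("claude", ["-p", "Plan the change; write the steps to plan.md"]);
    runAgent("glm",    ["-p", "Implement plan.md exactly; do not redesign"]);
    runAgent("claude", ["-p", "Review the resulting diff against plan.md and fix defects"]);
    runAgent("codex",  ["exec", "Final review: check the diff against review-checklist.md"]);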
Currently I don’t let GLM or Opus near my codebases unsupervised because I’m convinced that the better the foundation, the better the end result will be. Is the first draft not pretty crappy with GLM?
Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.
For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
We also had Claude Code create a bunch of ESLint automation, including custom ESLint rules and lint checks that catch and auto-handle a lot of stuff before it even hits review.
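To illustrate the kind of custom rule being described (this specific rule, its name, and the apiClient convention are hypothetical examples, not the poster's actual lint setup):

    // Custom ESLint rule sketch: flag raw fetch() calls so code goes through
    // a shared API client instead. Written against ESLint's standard rule API.
    import type { Rule } from "eslint";

    const noRawFetch: Rule.RuleModule = {
      meta: {
        type: "suggestion",
        docs: { description: "use apiClient instead of raw fetch()" },
        messages: { noRawFetch: "Call apiClient.request() instead of fetch()." },
      },
      create(context) {
        return {
          CallExpression(node) {
            // Report any bare call to the global fetch identifier.
            if (node.callee.type === "Identifier" && node.callee.name === "fetch") {
              context.report({ node, messageId: "noRawFetch" });
            }
          },
        };
      },
    };

    export default noRawFetch;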
Then we take it further: we have a deep code review agent Claude Code runs after changes are made. And when a PR goes up, we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it.
On top of that, we’ve got like five other Claude Code GitHub workflow agents that run on a schedule. One of them reads all commits from the last month and makes sure docs are still aligned. Another checks for gaps in end-to-end coverage. Stuff like that. A ton of maintenance and quality work is just… automated. It runs ridiculously smoothly.
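As a sketch of what one of those scheduled agents might look like as a GitHub Actions workflow (the workflow name, cron, prompt, and secret wiring are illustrative assumptions, not the poster's setup; `claude -p` runs Claude Code non-interactively):

    # Hypothetical scheduled doc-drift checker.
    name: docs-alignment-check
    on:
      schedule:
        - cron: "0 6 1 * *"   # monthly
    jobs:
      check:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
            with: { fetch-depth: 0 }
          - run: npm install -g @anthropic-ai/claude-code
          - run: claude -p "Read the commits from the last month and flag docs that drifted"
            env:
              ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}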
We even use Claude Code for ticket triage. It reads the ticket, digs into the codebase, and leaves a comment with what it thinks should be done. So when an engineer picks it up, they’re basically starting halfway through already.
There is so much low-hanging fruit here that it honestly blows my mind people aren’t all over it. 2026 is going to be a wake-up call.
(used voice to text then had claude reword, I am lazy and not gonna hand write it all for yall sorry!)
Why on earth would you be hunting $20 a month subscriptions from random assed people? Peanuts.
Lockheed-Martin could be, but isn’t, opening lemonade stands outside their offices… they don’t because of how buying a Ferrari works.
LLMs do not result in bosses firing people; they result in more projects / faster-completed projects, which in turn means more $$$ for a company.
It's hard to say if Opus 4.5 itself will change everything given the cost/latency issues, but now that all the labs will have very good synthetic agentic data thanks to Opus 4.5, I will be very interested to see what the LLM releases this year will be able to do. A Sonnet 4.7 that can do agentic coding as well as Opus 4.5 but at Sonnet's speed/price would be the real game-changer: with Claude Code on the $20/mo plan, you can barely do more than one or two prompts with Opus 4.5 per session.