Why agents DO NOT write most of our code – a reality check
Mood: thoughtful
Sentiment: mixed
Category: tech
Key topics: AI, software development, automation
The author argues that AI agents are not yet capable of writing most of our code, offering a reality check on the current state of AI in software development.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 7h after posting
Peak period: 23 comments (Day 1)
Avg / period: 12
Based on 24 loaded comments
Key moments
- Story posted: 11/13/2025, 8:22:35 PM (5d ago)
- First comment: 11/14/2025, 3:39:25 AM (7h after posting)
- Peak activity: 23 comments in Day 1, the hottest window of the conversation
- Latest activity: 11/14/2025, 10:07:55 PM (4d ago)
If you review the first-stage output from the AI manually, you're wasting time.
You still need to review the final outputs, but reviewing the initial output is like demanding a developer hand over code they just barely got working and pointing out all of its issues without giving them a chance to clean it up first. It's not helpful to anyone unless your time costs the business less than the AI's time.
That doesn't mean you can get away with not reviewing it at all. But you can, with substantial benefit, defer your review until an AI review pass thinks the code doesn't need further refinement. It probably still does need refinement despite the AI's say-so, and sometimes it needs throwing away, but in my experience it is also highly likely to need less work and take less time to review.
The only way to use a tool like this is to give it a problem that fits in its context, evaluate the solution it churns out, and re-roll if it isn't correct. Don't tell a language model to think, because it can't and won't; that's just a far less efficient way of re-rolling the solution.
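A minimal sketch of that re-roll loop, assuming the change lands in a git working tree, a pytest suite serves as the correctness check, and generate_patch stands in for whatever model client you use (all of these are assumptions for illustration, not the commenter's actual setup):

```python
import subprocess


def generate_patch(prompt: str) -> str:
    """Hypothetical model call; wire this up to whatever model or agent CLI you use."""
    raise NotImplementedError


def apply_patch(patch: str) -> None:
    """Apply a unified diff to the working tree."""
    subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)


def tests_pass() -> bool:
    """Run the project's test suite; treat a green run as 'correct enough'."""
    return subprocess.run(["pytest", "-q"]).returncode == 0


def solve(prompt: str, max_rolls: int = 3) -> bool:
    """Re-roll the same prompt a few times instead of asking the model to 'think harder'."""
    for _ in range(max_rolls):
        apply_patch(generate_patch(prompt))
        if tests_pass():
            return True
        # Throw the failed attempt away and roll again from a clean tree.
        subprocess.run(["git", "checkout", "--", "."], check=True)
    return False
```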
My mental model is that the LLM learned to predict what another person would say upon looking at that solution.
So it's really telling you whether the solution is likely (likely!) to be right or wrong.
This is both why they're sycophantic and why they're better than just median internet comments.
But this is only a slight quibble, because what you say is also somewhat true, and why they have such a hard time saying "I don't know".
I don't think I agree and I want to understand this argument better.
Someone with the ability to "think" should be able to separate oft-repeated fiction from fact.
I guess humans don’t think then.
That, or a reductive fallacy; in either case I'm not convinced. IMO they are just not smart enough (either due to a lack of complexity in the architecture or bad training that didn't help them generalize reasoning patterns).
I'm just not so sure of the importance of the difference between swimming and whatever the word is for how a submarine moves.
If it looks like thinking and quacks like thinking...
The word "thinking" is doing too much work in your argument, but arguably "assume it's thinking" is not doing enough work.
The models do compute and can reduce entropy; however, they don't match the way we presume things do this, because we assume every intelligence is human, or more accurately, the same as our own mind.
To see the algorithm for what it is, you can make it work through a logical set of steps from input to output, but it requires multiple passes. The models use a heuristic pattern-matching approach to reasoning rather than a computational one like symbolic logic.
While the algorithms themselves are computed, the virtual space in which the input is transformed into the output is not computational.
The models remain incredible and remarkable but they are incomplete.
Further, there is a huge garbage-in, garbage-out problem: often the input to the model lacks enough information to decide on the next transformation to the codebase. That's part of the illusion of conversationality that tricks us into thinking the algorithm is like a human.
AI has always provoked human reactions like this. Eliza was surprisingly effective, right?
It may be that average humans are not capable of interacting with an AI reliably because the illusion is overwhelming for instinctive reasons.
As engineers we should try to accurately assess and measure what is actually happening so we can predict and reason about how the models fit into systems.
Or give the model context that fits the problem. That's more of an art than a science at this point, it seems.
I think the people with better success are those who are better at generating prompts, but that's non-trivial.
All the other trivial things I can delegate out to it, and expect junior results when I give it sub-optimal guidance; with nominal and/or extensive guidance, however, I can get adequate to near-perfect results.
Another dimension that really matters here is the actual model used, not every model is the same.
Also, if the AI does something wrong, have it assess why things went wrong, revert to the previous checkpoint, and integrate that lesson into the plan.
You're driving; you are, ultimately, in control, so learn to drive. It's a tool: it can be adjusted, you can modify the output, you can revert, and you can also just not use it. But if you do actually learn how to use it, you'll find it can speed up your process. It is not a cure-all, though; it's good in certain situations, just like a hammer.
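A rough sketch of the checkpoint-and-revert habit described above, assuming the work happens in a git repository; run_agent_step and ask_model are hypothetical stand-ins for whatever agent tooling and model client you use, not anything named in the thread:

```python
import subprocess


def git(*args: str) -> str:
    """Thin wrapper around the git CLI."""
    out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    return out.stdout.strip()


def run_agent_step(task: str, plan: str) -> bool:
    """Hypothetical: let the agent attempt one task and report whether the result looks right."""
    raise NotImplementedError


def ask_model(question: str) -> str:
    """Hypothetical model call, used here only for the post-mortem."""
    raise NotImplementedError


def drive(tasks: list[str], plan: str) -> str:
    for task in tasks:
        # Commit a checkpoint before each step so there is always a clean point to return to.
        git("add", "-A")
        git("commit", "--allow-empty", "-m", f"checkpoint before: {task}")
        checkpoint = git("rev-parse", "HEAD")

        if run_agent_step(task, plan):
            continue

        # Ask why it went wrong, then revert and fold the lesson back into the plan.
        post_mortem = ask_model(f"The attempt at '{task}' went wrong. What happened and why?")
        git("reset", "--hard", checkpoint)
        plan += f"\nLesson from the failed attempt at {task}: {post_mortem}"
    return plan
```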
It seems like “you’re not doing it correctly” is just a rationalization to protect the pro-AI person’s established opinion.
It's about breaking the problem down into epics, tasks, and acceptance criteria that are reviewed. Review the written code and adjust as needed.
Tests… a lot of tests.
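As a small illustration of treating acceptance criteria as tests that the agent's output must pass before review: the myapp.orders module and parse_order function below are invented for the example and are not from the article or the thread.

```python
import pytest

from myapp.orders import parse_order  # hypothetical module under test


def test_accepts_well_formed_order():
    # Acceptance criterion 1: a valid payload yields a normalized order.
    order = parse_order({"sku": "A-100", "qty": "3"})
    assert order.sku == "A-100"
    assert order.qty == 3


def test_rejects_missing_quantity():
    # Acceptance criterion 2: missing fields fail loudly, not silently.
    with pytest.raises(ValueError):
        parse_order({"sku": "A-100"})
```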
I used Claude Code, and while the end result works (kinda), I noticed I was less satisfied with the process. More importantly, I now had to review "someone else's" code instead of writing it myself; I had no idea of the internal workings of the application, and it felt like starting at day one on a new codebase. It shifted my way of working from thinking/writing to reviewing/giving feedback, which for me personally is far less mentally stimulating and rewarding.
There were definitely some "a-ha" moments where CC came up with suggestions I wouldn't have thought of myself, but those were only a small fraction of the total output. There's definitely a dopamine hit from seeing all that code being spit out so fast, though.
Using it as a prototyping tool to quickly test an idea seems like a good use case, but there should be better tooling around taking that prototype, splitting it into manageable parts, and sharing the reasoning behind it, so I can then rework it and build the understanding needed to move it forward.
For now I've decided to stick to code completion, writing unit tests, commit messages, refactoring short snippets, and CHANGELOG updates. It does fairly well on all of those small, very focused tasks, and the time saved on them ends up being a net positive.
This would be amazing. I think Claude Code is a great prototyping tool, but I agree: you don't really learn your codebase. I think that's okay for a prototype, though, if you just want to see whether the idea works at all. Then, as you say, you can restart with some scaffolding to implement it better.
Frankly, I have been highly concerned seeing all the transformer hype in here when the gains people claim cannot be reliably replicated everywhere.
The financial incentives to make transformer tech work as it is being sold (even when it might not be cost-effective) deserve close attention, because to me it looks a bit too much like blockchain or big data.
One thing I was wondering after looking at the list of items in the “Cursor agent produced a coding plan” image: do folks actually make such lists when developing a feature without AI assistants?
That list has items like “Create API endpoints for …”, “Write tests …”. If you’re working on a feature that’s within a single codebase and not involving dependencies on other systems or teams, isn’t that a lot of ceremony for what you’ll eventually end up doing anyway (and only likely to miss due to oversight)?
I see a downside to such lists, because when I see a dozen items lined up like that… who knows whether they’re all the right ones for the feature at hand? Or whether the feature needs some other change entirely, or whether you’ve figured out the right order to do them in?
Where I’ve seen such fine-grained lists have value is for task timeline estimation, but rarely for the actual implementation.