Why agents DO NOT write most of our code – a reality check
Mood: thoughtful
Sentiment: mixed
Category: tech
Key topics: AI, software development, automation
The author argues that AI agents are not yet capable of writing most of our code, offering a reality check on the current state of AI in software development.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 7h after posting
Peak period: 23 comments (Day 1)
Avg / period: 12
Based on 24 loaded comments
Key moments
- Story posted: 11/13/2025, 8:22:35 PM (5d ago)
- First comment: 11/14/2025, 3:39:25 AM (7h after posting)
- Peak activity: 23 comments in Day 1, the hottest window of the conversation
- Latest activity: 11/14/2025, 10:07:55 PM (4d ago)
If you review the first-stage output from the AI manually, you're wasting time.
You still need to review the final outputs, but reviewing the initial output is like demanding a developer hand over code they just barely got working and pointing out all of its issues without giving them a chance to clean it up first. It's not helpful to anyone unless your time costs the business less than the AI's time.
That doesn't mean you can get away with not reviewing it at all. But you can, with substantial benefit, defer your review until an AI review pass thinks the code doesn't need further refinement. It probably still does need refinement despite the AI's say-so, and sometimes it needs throwing away, but in my experience it is also highly likely to need less work and take less time to review.
The only way to use a tool like this is to give it a problem that fits in its context, evaluate the solution it churns out, and re-roll if it isn't correct. Don't tell a language model to think, because it can't and won't; that's just a far less efficient way of re-rolling the solution.
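A minimal sketch of that re-roll loop, assuming the change lands in a git working tree, a pytest suite serves as the correctness check, and generate_patch stands in for whatever model client you use (all of these are assumptions for illustration, not the commenter's actual setup):

```python
import subprocess


def generate_patch(prompt: str) -> str:
    """Hypothetical model call; wire this up to whatever model or agent CLI you use."""
    raise NotImplementedError


def apply_patch(patch: str) -> None:
    """Apply a unified diff to the working tree."""
    subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)


def tests_pass() -> bool:
    """Run the project's test suite; treat a green run as 'correct enough'."""
    return subprocess.run(["pytest", "-q"]).returncode == 0


def solve(prompt: str, max_rolls: int = 3) -> bool:
    """Re-roll the same prompt a few times instead of asking the model to 'think harder'."""
    for _ in range(max_rolls):
        apply_patch(generate_patch(prompt))
        if tests_pass():
            return True
        # Throw the failed attempt away and roll again from a clean tree.
        subprocess.run(["git", "checkout", "--", "."], check=True)
    return False
```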
My mental model is that the LLM learned to predict what another person would say upon looking at that solution.
So it's really telling you whether the solution is likely (likely!) to be right or wrong.
This is both why they're sycophantic and why they're better than just median internet comments.
But this is only a slight quibble, because what you say is also somewhat true, and why they have such a hard time saying "I don't know".
I don't think I agree and I want to understand this argument better.
Someone with the ability to "think" should be able to separate oft-repeated fiction from fact.
I guess humans don’t think then.
That, or a reductive fallacy; in either case I'm not convinced. IMO they are just not smart enough (either due to a lack of complexity in the architecture or bad training that didn't help them generalize reasoning patterns).
I'm just not so sure of the importance of the difference between swimming and whatever the word is for how a submarine moves.
If it looks like thinking and quacks like thinking...
The word "thinking" is doing too much work in your argument, but arguably "assume it's thinking" is not doing enough work.
The models do compute and can reduce entropy; however, they don't match the way we presume things do this, because we assume every intelligence is human, or more accurately, the same as our own mind.
To see the algorithm for what it is, you can make it work through a logical set of steps from input to output, but it requires multiple passes. The models use a heuristic pattern-matching approach to reasoning rather than a computational one like symbolic logic.
While the algorithms themselves are computed, the virtual space in which the input is transformed into the output is not computational.
The models remain incredible and remarkable but they are incomplete.
Further, there is a huge garbage-in, garbage-out problem: often the input to the model lacks enough information to decide on the next transformation to the codebase. That's part of the illusion of conversationality that tricks us into thinking the algorithm is like a human.
AI has always provoked human reactions like this. Eliza was surprisingly effective, right?
It may be that average humans are not capable of interacting with an AI reliably because the illusion is overwhelming for instinctive reasons.
As engineers we should try to accurately assess and measure what is actually happening so we can predict and reason about how the models fit into systems.
Or give the model context that fits the problem. That's more of an art than a science at this point, it seems.
I think the people with better success are those who are better at generating prompts, but that's non-trivial.
All the other trivial things I can delegate out to it, and expect junior results when I give it sub-optimal guidance; with nominal and/or extensive guidance, however, I can get adequate to near-perfect results.
Another dimension that really matters here is the actual model used, not every model is the same.
Also, if the AI does something wrong, have it assess why things went wrong, revert to the previous checkpoint, and integrate that lesson into the plan.
You're driving; you are, ultimately, in control, so learn to drive. It's a tool: it can be adjusted, you can modify the output, you can revert, and you can also just not use it. But if you do actually learn how to use it, you'll find it can speed up your process. It is not a cure-all, though; it's good in certain situations, just like a hammer.
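A rough sketch of the checkpoint-and-revert habit described above, assuming the work happens in a git repository; run_agent_step and ask_model are hypothetical stand-ins for whatever agent tooling and model client you use, not anything named in the thread:

```python
import subprocess


def git(*args: str) -> str:
    """Thin wrapper around the git CLI."""
    out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    return out.stdout.strip()


def run_agent_step(task: str, plan: str) -> bool:
    """Hypothetical: let the agent attempt one task and report whether the result looks right."""
    raise NotImplementedError


def ask_model(question: str) -> str:
    """Hypothetical model call, used here only for the post-mortem."""
    raise NotImplementedError


def drive(tasks: list[str], plan: str) -> str:
    for task in tasks:
        # Commit a checkpoint before each step so there is always a clean point to return to.
        git("add", "-A")
        git("commit", "--allow-empty", "-m", f"checkpoint before: {task}")
        checkpoint = git("rev-parse", "HEAD")

        if run_agent_step(task, plan):
            continue

        # Ask why it went wrong, then revert and fold the lesson back into the plan.
        post_mortem = ask_model(f"The attempt at '{task}' went wrong. What happened and why?")
        git("reset", "--hard", checkpoint)
        plan += f"\nLesson from the failed attempt at {task}: {post_mortem}"
    return plan
```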
It seems like “you’re not doing it correctly” is just a rationalization to protect the pro-AI person’s established opinion.
It's about breaking the problem down into epics, tasks, and acceptance criteria that are reviewed. Review the written code and adjust as needed.
Tests… a lot of tests.
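As a small illustration of treating acceptance criteria as tests that the agent's output must pass before review: the myapp.orders module and parse_order function below are invented for the example and are not from the article or the thread.

```python
import pytest

from myapp.orders import parse_order  # hypothetical module under test


def test_accepts_well_formed_order():
    # Acceptance criterion 1: a valid payload yields a normalized order.
    order = parse_order({"sku": "A-100", "qty": "3"})
    assert order.sku == "A-100"
    assert order.qty == 3


def test_rejects_missing_quantity():
    # Acceptance criterion 2: missing fields fail loudly, not silently.
    with pytest.raises(ValueError):
        parse_order({"sku": "A-100"})
```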
I used Claude Code, and while the end result works (kinda), I noticed I was less satisfied with the process. More importantly, I now had to review "someone else's" code instead of writing it myself; I had no idea of the internal workings of the application, and it felt like starting at day one on a new codebase. It shifted my way of working from thinking/writing to reviewing/giving feedback, which for me personally is far less mentally stimulating and rewarding.
There were definitely some "a-ha" moments where CC came up with suggestions I wouldn't have thought of myself, but those were only a small fraction of the total output. There's definitely a dopamine hit from seeing all that code being spit out so fast, though.
Using it as a prototyping tool to quickly test an idea seems like a good use case, but there should be better tooling around taking that prototype, splitting it into manageable parts, and sharing the reasoning behind it, so I can then rework it and build the understanding needed to move it forward.
For now I've decided to stick to code completion, writing unit tests, commit messages, refactoring short snippets, and CHANGELOG updates. It does fairly well on all of those small, very focused tasks, and the time saved on them ends up being a net positive.
This would be amazing. I think Claude Code is a great prototyping tool, but I agree: you don't really learn your codebase. I think that's okay for a prototype, though, if you just want to see whether the idea works at all. Then, as you say, you can restart with some scaffolding to implement it better.
Frankly, I have been highly concerned seeing all the transformer hype in here when the gains people claim cannot be reliably replicated everywhere.
The financial incentives to make transformer tech work as it is being sold (even when it might not be cost-effective) deserve close attention, because to me it looks a bit too much like blockchain or big data.
One thing I was wondering after looking at the list of items in the “Cursor agent produced a coding plan” image: do folks actually make such lists when developing a feature without AI assistants?
That list has items like “Create API endpoints for …”, “Write tests …”. If you’re working on a feature that’s within a single codebase and not involving dependencies on other systems or teams, isn’t that a lot of ceremony for what you’ll eventually end up doing anyway (and only likely to miss due to oversight)?
I see a downside to such lists, because when I see a dozen items lined up like that… who knows whether they’re all the right ones for the feature at hand? Or whether the feature needs some other change entirely, or whether you’ve figured out the right order to do them in?
Where I’ve seen such fine-grained lists have value is for task timeline estimation, but rarely for the actual implementation.