Two Things LLM Coding Agents Are Still Bad At
Posted 3 months ago · Active 3 months ago
kix.dev · Tech · Story · High profile
Sentiment: calm/mixed · Debate: 60/100
Key topics: LLM Coding Agents, AI Limitations, Software Development
The article discusses two areas where LLM coding agents struggle: refactoring code and asking questions, with commenters sharing their experiences and insights on these limitations.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 2m after posting
Peak period: 82 comments in 0-6h
Avg / period: 22.9
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
1. Story posted: Oct 9, 2025 at 12:33 AM EDT (3 months ago)
2. First comment: Oct 9, 2025 at 12:36 AM EDT (2m after posting)
3. Peak activity: 82 comments in 0-6h, the hottest window of the conversation
4. Latest activity: Oct 11, 2025 at 12:43 PM EDT (3 months ago)
ID: 45523537 · Type: story · Last synced: 11/20/2025, 8:14:16 PM
Oh, sorry. You already said that. :D
First it gets an error because bash doesn’t understand \
Then it gets an error because /b doesn’t work
And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files
If it was an actual coworker, we’d send it off to HR
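One way around this class of failure is to hide the platform difference behind a wrapper the agent can call; a minimal sketch in Node/TypeScript (hypothetical helper, not something from the thread):

```ts
// Hypothetical helper an agent harness could expose so the model never has to
// guess between `ls` and `cmd.exe /c dir /b`.
import { execFileSync } from "node:child_process";

function listFiles(dir: string): string[] {
  const output =
    process.platform === "win32"
      ? // cmd.exe built-in; /b prints bare file names only
        execFileSync("cmd.exe", ["/c", "dir", "/b", dir], { encoding: "utf8" })
      : execFileSync("ls", ["-1", dir], { encoding: "utf8" });
  return output.split(/\r?\n/).filter(Boolean);
}

console.log(listFiles(process.platform === "win32" ? "C:\\test" : "/tmp"));
```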
I am guessing this is because:
1. Most of the training material online references Unix commands.
2. Most Windows devs are used to GUIs for development (Visual Studio etc.), and GUIs are not as easy to train on.
Side note: Interesting thing I have noticed in my own org is that devs with Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line.
LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering.
But I think some people are underestimating what can be done in larger projects if you do everything right (eg docs, tests, comments, tools) and take time to plan.
In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.
So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
It makes sense to me, at least.
Ok, this example is probably too extreme; replace the knife with an industrial machine that cuts bread vs. a human with a knife. Nobody would buy that machine either if it worked like that.
Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there.
But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens).
I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky.
If a knife slices bread like a normal human at p50, it's not a very good knife.
If a knife slices bread like a professional chef at p50, it's probably a very decent knife.
I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs.
The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12 year old copies-and-pastes better than top coding agents.
The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state of the art LLM compare to a p50 developer in this regard?
I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints. It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structure used in place of the old one?" "No, it has not." After it did so, 80% of the test suite failed because nothing it'd written was actually right.
I did so three times with increasing amounts of babysitting, but it failed at the abstract task of "refactor this" no matter what, with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A, etc., at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
I have seen similar failure modes in Cursor and VSCode Copilot (using gpt5) where I have to babysit relatively small refactors.
However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction.
This expression tree parser (typescript to sql query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time.
I did not have to babysit the LLMs at all. So the answer is, I think, that it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their unique, custom workflows. All of them, however, do focus on test suites, documentation, and method review processes.
Perhaps experienced users of relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression.
Maybe I should use the same example repeated for clarity. Let me do that.
Edit: Fixed. Thank you.
How does the API look completely different for pg and sqlite? Can you share an example?
It's an implementation of LINQ's IQueryable, plus some bells and whistles that are missing from .NET's Queryable, like window functions (RANK queries etc.), which I find quite useful.
Added: What you've mentioned is largely incorrect. But in any case, it is a query builder, meaning an ORM-like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other databases.
Question - this loads a 2 MB JS parser written in Rust to turn `x => x.foo` into `{ op: 'project', field: 'foo', target: 'x' }`. But you don't actually allow any complex expressions (and you certainly don't seem to recursively parse references or allow return uplift, e.g. I can't extract out `isOver18` or `isOver(age: int)(Row: IQueryable): IQueryable`). Why did you choose the AST route instead of doing the same thing with a handful of regular expressions?
I could have allowed (I did consider it) functions external to the expression, like isOver18 in your example. But it would have come at the cost of the parser having to look across the code base, and would have required tinqerjs to attach via build-time plugins. The only other way (without plugins) might be to identify callers via Error.stack and attempt to find the calling JS file.
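For the narrow single-property case the question describes, the regex route might look something like this (a hypothetical sketch, not how tinqerjs actually works):

```ts
// Hypothetical sketch of the regex approach for trivial lambdas like `x => x.foo`.
// Anything beyond a bare property access (nested paths, operators, computed keys)
// immediately falls outside what a pattern like this can express.
type Projection = { op: "project"; field: string; target: string };

function parseSimpleLambda(src: string): Projection | null {
  const match = /^\s*(\w+)\s*=>\s*\1\.(\w+)\s*$/.exec(src);
  if (!match) return null; // not a plain `x => x.prop` expression
  return { op: "project", target: match[1], field: match[2] };
}

console.log(parseSimpleLambda("x => x.foo"));   // { op: 'project', target: 'x', field: 'foo' }
console.log(parseSimpleLambda("x => x.a + 1")); // null: needs a real parser
```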
The reason had better turn into "it can do stuff faster than I ever could if I give it step-by-step, high-level instructions" instead.
I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects".
It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work.
We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
>We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time...
> Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
But we do though. You can't just say "yeah they left all the foot guns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back.
So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.
But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.
You still cannot trust LLMs. And that is a problem.
Obviously the regenerated code drifts a little from the deleted code.
The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.
(BTW another issue is that they have problems navigating the directory structure in a large monorepo. When the agent needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time.)
Perhaps "before implementing a new utility or helper function, ask the not-invented-here tool if it's been done already in the codebase"
Of course, now I have to check if someone has done this already.
We started by building the best code retrieval and built an agent around it.
Just like with humans it definitely works better if you follow good naming conventions and file patterns. And even then I tend to make sure to just include the important files in the context or clue the LLM in during the prompt.
It also depends on what language you use. A LOT. During the day I use LLMs with dotnet and it’s pretty rough compared to when I’m using rails on my side projects. Dotnet requires a lot more prompting and hand holding, both due to its complexity but also due to how much more verbose it is.
I'd bet that most of the improvement in Copilot-style tools over the past year is coming from rapid progress in context engineering techniques, and the contribution of LLMs is more modest. LLMs' native ability to independently "reason" about a large slushpile of tokens just hasn't improved enough over that same time period to account for how much better the LLM coding tools have become. It's hard to see or confirm that, though, because the only direct comparison you can make is changing your LLM selection in the current version of the tool. Plugging GPT5 into the original version of Copilot from 2021 isn't an experiment most of us are able to try.
Also, the agents.md website seems to mostly list README.md-style "how do I run this" instructions in its examples, not stylistic guidelines.
Furthermore, it would be nice if these agents added it themselves. With a human, you tell them "this is wrong, do it that way" and they would remember it. (Although this functionality seems to be being worked on?)
But, y'know what? I approved it. Because hunting down the existing functions it should have used in our utility library would have taken me all day. 5 years ago I would have taken the time because a PR like that would have been submitted by a new team member who didn't know the codebase well, and helping to onboard new team members is an important part of the job. But when it's a staff engineer using Cursor to fill our codebase with bloat because that's how management decided we should work, there's no point. The LLM won't learn anything and will just do the same thing over again next week, and the staff engineer already knows better but is being paid to pretend they don't.
If you are personally invested, there would be a point. At least if you plan to maintain that code for a few more years.
Let's say you have a common CSS file, where you define .warning {color: red}. If you want the LLM to put out a warning and you just tell it to make it red, without pointing out that there is the .warning class, it will likely create a new CSS def for that element (or even inline it - the latest Claude Code has a tendency to do that). That's fine and will make management happy for now.
But if later management decides that it wants all warning messages to be pink, it may be quite a challenge to catch every place without missing one.
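A tiny illustration of the drift being described, written as React/TypeScript (hypothetical components; the shared stylesheet is assumed to already define `.warning { color: red }`):

```tsx
import * as React from "react";

// What you'd want the agent to produce: reuse the shared class, so a later
// "make all warnings pink" change touches exactly one CSS rule.
export const Warning = ({ msg }: { msg: string }) => (
  <p className="warning">{msg}</p>
);

// What it often produces instead: a one-off inline style that a global
// restyle will silently miss.
export const DriftedWarning = ({ msg }: { msg: string }) => (
  <p style={{ color: "red" }}>{msg}</p>
);
```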
I was running into this constantly on one project with a repo split between a Vite/React front end and .NET backend (with well documented structure). It would sometimes go into panic mode after some npm command didn’t work repeatedly and do all sorts of pointless troubleshooting over and over, sometimes veering into destructive attempts to rebuild whatever it thought was missing/broken.
I kept trying to rewrite the section in CLAUDE.md to effectively instruct it to always first check the current directory to verify it was in the correct $CLIENT or $SERVER directory. But it would still sometimes forget randomly which was aggravating.
I ended up creating some aliases like “run-dev server restart” “run-dev client npm install” for common operations on both server/client that worked in any directory. Then added the base dotnet/npm/etc commands to the deny list which forced its thinking to go “Hmm it looks like I’m not allowed to run npm, so I’ll review the project instructions. I see, I can use the ‘run-dev’ helper to do $NPM_COMMAND…”
It’s been working pretty reliably now but definitely wasted a lot of time with a lot of aggravation getting to that solution.
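A minimal sketch of that kind of wrapper (hypothetical script and directory names, assuming a Node environment; the real one from the comment isn't public):

```ts
#!/usr/bin/env node
// Hypothetical "run-dev" wrapper: it resolves the right sub-directory itself,
// so the agent never has to guess which cwd it is in.
import { spawnSync } from "node:child_process";
import * as path from "node:path";

const REPO_ROOT = path.resolve(__dirname, "..");
const TARGETS: Record<string, string> = {
  client: path.join(REPO_ROOT, "client"), // Vite/React front end (assumed layout)
  server: path.join(REPO_ROOT, "server"), // .NET backend (assumed layout)
};

const [target, ...cmd] = process.argv.slice(2);
const cwd = TARGETS[target];
if (!cwd || cmd.length === 0) {
  console.error("usage: run-dev <client|server> <command...>");
  process.exit(1);
}

const result = spawnSync(cmd[0], cmd.slice(1), { cwd, stdio: "inherit" });
process.exit(result.status ?? 1);
```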
Many agents break down not because the code is too complex, but because invisible, "boring" infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents: you're fighting not logic errors, but mismatches with the real, lived context. Upgrading this context-awareness would be a genuine step change.
LLMs will gladly go along with bad ideas that any reasonable dev would shoot down.
You can't fix it.
Claude is just chirping away "You're absolutely right" and making me turn on caps lock when I talk to it, and it's not even noon yet.
All while having the tone of an over caffeinated intern who has only ever read medium articles.
There is not that much copy/paste that happens as part of refactoring, so it leans on just using context recall. It's not entirely clear if providing an actual copy/paste command is particularly useful; at least from my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: you can instruct codex or claude to perform the edits with it.
> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.
It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.
To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right.
There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation.
So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good.
> To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.
So yeah, that's the more interesting case and there things like codemod/fastmod are very effective if you tell an agent to use it. They just don't reach there.
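For deterministic renames, the agent can also be pointed at a codemod instead of editing files by hand; a sketch using ts-morph (assumed dependency, hypothetical file and symbol names):

```ts
// Rename a symbol across the whole project the way an IDE would, then let the
// agent (or a human) review the diff instead of retyping every call site.
import { Project } from "ts-morph";

const project = new Project({ tsConfigFilePath: "tsconfig.json" });

const source = project.getSourceFileOrThrow("src/parser.ts");  // hypothetical file
const fn = source.getFunctionOrThrow("parseExpression");       // hypothetical function
fn.rename("parseExpressionTree"); // updates every reference the compiler can see

project.saveSync();
```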
LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js does not support GL_TRIANGLE_STRIP rendering, for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling me about this, and I had to Google the problem myself.
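The usual workaround is to unroll the strip into a plain triangle list before handing the indices to Three.js; a minimal sketch (hypothetical helper):

```ts
// Convert triangle-strip indices into triangle-list indices, flipping the
// winding on every other triangle and skipping degenerate triangles.
function stripToTriangles(strip: number[]): number[] {
  const tris: number[] = [];
  for (let i = 0; i + 2 < strip.length; i++) {
    const [a, b, c] =
      i % 2 === 0
        ? [strip[i], strip[i + 1], strip[i + 2]]
        : [strip[i + 1], strip[i], strip[i + 2]];
    if (a !== b && b !== c && a !== c) tris.push(a, b, c);
  }
  return tris;
}

// Usage with a BufferGeometry: geometry.setIndex(stripToTriangles(stripIndices));
```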
It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.
Typo or trolling the next LLM to index HN comments?
Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.
One rule of thumb I use, is if you could expect to find a student on a college campus to do a task for you, an LLM will probably be able to do a decent job. My thinking is because we have a lot of teaching resources available for how to do that task, which the training has of course ingested.
Whenever I've attempted to actually do the whole "agentic coding" by giving it a complex task, breaking it down in sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc. it hasn't a single fucking time done the thing it was supposed to do to completion, requiring a lot of manual reviewing, backtracking, nudging, it becomes more exhausting than just doing most of the work myself, and pushing the LLM to do the tedious work.
It does work sometimes to use for analysis, and asking it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back, and restart.
There's a fine line to walk, and I only see comments on the extremes online, it's either "I let 80 agents running and they build my whole company's code" or "they fail miserably on every task harder than a CRUD". I tend to not believe in either extreme, at least not for the kinds of projects I work on which require more context than I could ever fit properly beforehand to these robots.
How about a full programming language written by cc "in a loop" in ~3 months? With a compiler and stuff?
https://cursed-lang.org/
It might be a meme project, but it's still impressive as hell we're here.
I learned about this from a yt content creator that took that repo, asked cc to "make it so that variables can be emojis", and cc did that 5$ later. Pretty cool.
Impressive nonetheless.
Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched.
Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.
Absolutely. I do not underestimate this.
There's a yt channel where the sessions were livestreamed. It's in their FAQ. I haven't felt the need to check them, but there are 10-12h sessions in there if you're that invested in proving that this is "so far outside of any capabilities"...
A brief look at the commit history should show you that it's 99.9% guaranteed to be written by an LLM :)
When's the last time you used one of these SotA coding agents? They've been getting better and better for a while now. I am not surprised at all that this worked.
This morning :)
>"so far outside of any capabilities"
Anthropic was just bragging last week about being able to code without intervention for 30 hours before completely losing focus. They hailed it as a new benchmark. It completed a project that was 11k lines of code.
The max unsupervised run that GPT-5-Codex has been able to pull off is 7 hours.
That's what I mean by the current SOTA demonstrated capabilities.
https://x.com/rohanpaul_ai/status/1972754113491513481
And yet here you have a rando who is saying that he was able to get an agent to run unsupervised for 100x longer than what the model companies themselves have been able to do and produce 10x the amount of code -- months ago.
I'm 100% confident this is fake.
>There's a yt channel where the sessions were livestreamed.
There are a few videos that long, not 3 months' worth of videos. Also, I spot-checked the videos and the framerate is so low that it would be trivial to cut out the human intervention.
>guaranteed to be written by an LLM
I don't doubt that it was 99.9% written by an LLM, the question is whether he was able to run unsupervised for 3 months or whether he spent 3 months guiding an LLM to write it.
That would mean that every few hours the agent starts fresh, does the inspect repo thing, does the plan for that session, and so on. That would explain why it took it ~3 months to do what a human + ai could probably do in a few weeks. That's why it doesn't sound too ludicrous for me. If you look at the repo there are a lot of things that are not strictly needed for the initial prompt (make a programming language like go but with genz stuff, nocap).
Oh, and if you look at their discord + repo, lots of things don't actually work. Some examples do, some segfault. That's exactly what you'd expect from "running an agent in a loop". I still think it's impressive nonetheless.
The fact that you are so incredulous (and I get why that is, scepticism is warranted in this space) is actually funny. We are on the right track.
If Anthropic thought they could produce anything remotely useful by wiping the context and reprompting every few hours, they would be doing it. And they’d be saying “look at this we implemented hard context reset and we can now run our agent for 30 days and produce an entire language implementation!”
In 3 months or 300 years of operating like this a current agent being freshly reprompted every few hours would never produce anything that even remotely looked like a language implementation.
As soon as its context was poisoned with slightly off topic todo comments it would spin out into writing a game of life implementation or whatever. You’d have millions of lines of nonsense code with nothing useful after 3 months of that.
The only way I see anything like this doing anything approaching “useful” is if the outer loop wipes the repo on every reset as well, and collects the results somewhere the agent can’t access. Then you essentially have 100 chances to one shot the thing.
But at that point you just have a needlessly expensive and slow agent.
What does that mean exactly? I assume the LLM was not left alone with its task for 3 months without human supervision.
> the following prompt was issued into a coding agent:
> Hey, can you make me a programming language like Golang but all the lexical keywords are swapped so they're Gen Z slang?
> and then the coding agent was left running AFK for months in a bash loop
Running for 3 months and generating a working project this large with no human intervention is so far outside the capabilities of any agent/LLM system demonstrated by anyone else that the most likely explanation is that the promoter is lying about it running on its own for 3 months.
I looked through the videos listed as “facts” to support the claims and I don’t see anything longer than a few hours.
Usually those two groups correlate very well with liking LLMs: some people will ask Claude to create a UI with React and see the mess it generated (even if it mostly works) and the edge cases it left out and comment in forums that LLMs don't work. The other group of people will see the UI working and call it a day without even noticing the subtleties.
Both are right.
A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.
Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.
I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all websites would have had moved internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.
Fortunately, I could retrieve the old URLs on old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs! Replacing things like domain.com/this-article-is-about-foobar-123456/ by domain.com/foobar-is-so-great-162543/...
These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
It's very good at giving a great fuzzy answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
They're useful, but you must verify anything you get from them.
Perhaps you’ve been sold a lie?
You have to be able to see what this thing can actually do, as opposed to what it can’t.
But all code is "long precise strings".
On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.
Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.
We have OpenAI describing GPT-5 as having PhD-level intelligence, and others like Anthropic saying it will write all our code within months. Some are claiming it's already writing 70%.
I say they are being sold as a magical do everything tool.
Did the LLM have this?
Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.
The things they're saying are technically correct, the best kind of correct. The models beat human PhDs on certain benchmarks of knowledge and reasoning. They may write 70% of the easiest code in some specific scenario. It doesn't matter. They're useful tools that can make you slightly more productive. That's it.
When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
Only after schizophrenic dentists go around telling people that brushing their teeth is going to lead to a post-scarcity Star Trek world.
Sadly it seems the best use-case for LLMs at this point is bamboozling humans.
> why don't we stop pretending that we as users are stupid and don't know how to use them
This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
If you're trying to one shot stuff with a few sentences then yes you might be using these things wrong. I've seen people with PhDs fail to use google successfully to find things, were they idiots? If you're using them wrong you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someones capabilities then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this.
Let's do a quick overview of recent chats for me:
* Identifying and validating a race condition in some code
* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code
* Identifying an async bug two good engineers couldn't find in a codebase they knew well
* Finding performance issues that had gone unnoticed
* Digging through synapse documentation and github issues to find a specific performance related issue
* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed
* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money
* Going through funding opportunities and matching them against a charity I want to help in my local area
* Building a search integration for my local library to handle my kids reading challenge
* Solving a series of VPN issues I didn't understand
* Writing a lot of astro related python for an art project to cover the loss of some NASA images I used to have access to.
> the folks pushing or better said
If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.
> Trust me, it sucks
Ok. I'm convinced.
> and under-delivers.
Compared to what promise?
> I am sure we will see those 10x apps rolling in soon, right?
Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).
> It's only been like 4 years since the revolutionary magic machine was announced.
It's been less than 3 since chatgpt launched, which if you'd been in the AI sphere as long as I had (my god it's 20 years now) absolutely was revolutionary. Over the last 4 years we've seen gpt3 solve a bunch of NLP problems immediately as long as you didn't care about cost to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea I can get a summary from input without training is already impressive and then to be able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.
If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.
Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:
Why are you paying for something that solves literally no problems for you?
The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s
> You can never trust the LLM to generate a url
This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.
This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.
> I'm sick of hearing "you're doing it wrong"
That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.
> when the real answer is "this tool can't do that."
That is what they said.
Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.
> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".
But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.
"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
Edit: I think I'm just regurgitating the article here.
Like, why not have tools built into our environment that check that links are not broken? With the right architecture we could have validations for most common mistakes without the solution adding a bunch of tedious overhead.
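A minimal sketch of that kind of guard, as a script you could run in CI (hypothetical; the URL list would come from wherever the codebase keeps its hard-coded links):

```ts
// Fail the build if any of the collected URLs no longer resolves.
const urls: string[] = [
  "https://example.com/this-article-is-about-foobar-123456/", // gathered from the codebase
];

async function checkLinks(): Promise<void> {
  const broken: string[] = [];
  for (const url of urls) {
    try {
      const res = await fetch(url, { method: "HEAD", redirect: "follow" });
      if (!res.ok) broken.push(`${url} -> ${res.status}`);
    } catch (err) {
      broken.push(`${url} -> ${String(err)}`);
    }
  }
  if (broken.length > 0) {
    console.error("Broken links:\n" + broken.join("\n"));
    process.exit(1);
  }
}

void checkLinks();
```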
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.
Can you spot the next problem introduced by this?
I'd like to see what happens with better refactoring tools; I'd also make a bunch more mistakes if I had to copy, retype, or use awk. If they want to rename something, they should be able to use the same tooling the rest of us get.
Asking questions is a good point, but that's partly a matter of prompting, and I think the move to having more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is that we take a lot of time and cost a lot of money to build things, so the economics favour getting it right the first time. As the time comes down and the cost drops to near zero, the balance changes.
There are also other approaches to clarify more what you want and how to do it first, breaking that down into tasks, then letting it run with those (spec kit). This is an interesting area.
So there's hope.
But often they just delete and recreate the file, indeed.
ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++ (the vtable layout would have to account for every possible template instantiation). I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.
210 more comments available on Hacker News