Two Things LLM Coding Agents Are Still Bad At

Posted3 months agoActive3 months ago

kixpanganiban

345 points

370 comments

kix.devTechstoryHigh profile

calmmixed

Debate

60/100

LLM Coding AgentsAI LimitationsSoftware Development

Key topics

LLM Coding Agents

AI Limitations

Software Development

The article discusses two areas where LLM coding agents struggle: refactoring code and asking questions, with commenters sharing their experiences and insights on these limitations.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

0-6h

Avg / period

22.9

Comment distribution160 data points

Loading chart...

Based on 160 loaded comments

Key moments

01Story posted
Oct 9, 2025 at 12:33 AM EDT
3 months ago
Step 01
02First comment
Oct 9, 2025 at 12:36 AM EDT
2m after posting
Step 02
03Peak activity
82 comments in 0-6h
Hottest window of the conversation
Step 03
04Latest activity
Oct 11, 2025 at 12:43 PM EDT
3 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (370 comments)

Showing 160 comments of 370

davydm

3 months ago

2 replies

Coding and...?

Black616Angel

3 months ago

Copy and pasting.

Oh, sorry. You already said that. :D

drdeca

3 months ago

More granular. What things is it bad at that result in it being overall “bad at coding”? It isn’t all of the parts.

nikanj

3 months ago

2 replies

4/5 times when Claude is looking for a file, it starts by running bash(dir c:\test /b)

First it gets an error because bash doesn’t understand \

Then it gets an error because /b doesn’t work

And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files

If it was an actual coworker, we’d send it off to HR

cheema33

3 months ago

Most models struggle in a Windows environment. They are trained on a lot of Unixy commands and not as much on Windows and PowerShell commands. It was frustrating enough that I started using WSL for development when using Windows. That helped me significantly.

I am guessing this because:

1. Most of the training material online references Unix commands. 2. Most Windows devs are used to GUIs for development using Visual Studio etc. GUIs are not as easy to train on.

Side note: Interesting thing I have noticed in my own org is that devs with Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line.

anonzzzies

3 months ago

I have a list of those things in CLAUDE.md -> it seems to help (unless it's context is full, but you should never let it get close really).

rconti

3 months ago

1 reply

Doing hard things that aren't greenfield? Basically any difficult and slightly obscure question I get stuck with and hope the collective wisdom of the internet can solve?

athrowaway3z

3 months ago

You don't learn new languages/paradigms/frameworks by inserting it into an existing project.

LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering.

But I think some people are underestimating what can be done in larger projects if you do everything right (eg docs, tests, comments, tools) and take time to plan.

koliber

3 months ago

1 reply

Most developers are also bad at asking questions. They tend to assume too many things from the start.

In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.

rkomorn

3 months ago

1 reply

But, just like lots of people expect/want self-driving to outperform humans even on edge cases in order to trust them, they also want "AI" to outperform humans in order to trust it.

So: "humans are bad at this too" doesn't have much weight (for people with that mindset).

It makes sense to me, at least.

darkwater

3 months ago

3 replies

If we had a knife that most of the time cuts a slice of bread like the bottom p50 of humans cutting a slice of bread with their hands, we wouldn't call the knife useful.

Ok, this example is probably too extreme, replace the knife with an industrial machine that cut bread vs a human with a knife. Nobody would buy that machine either if it worked like that.

Certhas

3 months ago

1 reply

I think this is still too extreme. A machine that cuts and preps food at the same level as a 25th percentile person _being paid to do so_, while also being significantly cheaper would presumably be highly relevant.

rkomorn

3 months ago

Aw man. There are so many angles though.

Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there.

But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens).

I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky.

koliber

3 months ago

Agreed in a general sense, but there's a bit more nuance.

If a knife slices bread like a normal human at p50, it's not a very good knife.

If a knife slices bread like a professional chef at p50, it's probably a very decent knife.

I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs.

The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12 year old copies-and-pastes better than top coding agents.

The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state of the art LLM compare to a p50 developer in this regard?

rkomorn

3 months ago

I feel kind of attacked for my sub-p50 bread slicing skills, TBH. :(

AllegedAlec

3 months ago

5 replies

On a more important level, I found that they still do really badly at even a minorly complex task without extreme babysitting.

I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints. It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structured used in place of the old one?" "No, it has not." After it did so, 80% of the test suite failed because nothing it'd written was actually right.

Did so three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.

hu3

3 months ago

1 reply

Interesting. What model and tool was used?

I have seen similar failure modes in Cursor and VSCode Copilot (using gpt5) where I have to babysit relatively small refactors.

AllegedAlec

3 months ago

1 reply

Claude code. Whichever model it started up automatically last weekend, I didn't explicitly check.

rglynn

3 months ago

1 reply

This feels like a classic Sonnet issue. From my experience, Opus or GPT-5-high are less likely to do the "narrow instruction following without making sensible wider decisions based on context" than Sonnet.

coldtea

3 months ago

1 reply

This is "just use another Linux distro" all over again

rglynn

3 months ago

Yes and no, it's a fair criticism to some extent. Inasamuch as I would agree that different models of the same type have superficial differences.

However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction.

jeswin

3 months ago

4 replies

> I wanted it to refactor a parser in a small project

This expression tree parser (typescript to sql query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time.

I did not have to baby sit the LLMs at all. So the answer is, I think it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their unique, custom workflows. All of them however do focus on test suites, documentation, and method review processes.

iLoveOncall

3 months ago

2 replies

Hum yeah, it shows. Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.

pprotas

3 months ago

2 replies

I guess the interesting question is whether @jeswin could have created this project at all if AI tools were not involved. And if yes, would the quality even be better?

jeswin

3 months ago

1 reply

Very true. However, to claim that the "API looks completely different for Postgre and SQLite" is disingenuous. What was he looking at?

tom_

3 months ago

1 reply

There are two examples on the landing page, and they both look quite different. Surely if the API is the same for both, there'd be just one example that covers both cases, or two examples would be deliberately made as identical as possible? (Like, just a different new somewhere, or different import directive at the top, and everything else exactly the same?) I think that's the point.

Perhaps experienced users of relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression.

jeswin

3 months ago

If you're mentioning the first two examples, they're doing different things. The pg example does an orderby, and the sqlite example does a join. You'll be able to switch the client (ie, better-sqlite and pg-promise) in either statement, and the same query would work on the other database.

Maybe I should use the same example repeated for clarity. Let me do that.

Edit: Fixed. Thank you.

iLoveOncall

3 months ago

Actually the interesting question is whether this library not existing would have been a loss for humanity. I'll posit that it would not.

jeswin

3 months ago

> Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.

How does the API look completely different for pg and sqlite? Can you share an example?

It's an implementation of LINQ's IQueryable. With some bells missing in DotNet's Queryable, like Window functions (RANK queries etc) which I find quite useful.

Add: What you've mentioned is largely incorrect. But in any case, it is a query builder. Meaning, an ORM like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other database.

AllegedAlec

3 months ago

1 reply

I have tried several. Overall I've now set on strict TDD (which it still seems to not do unless I explicitly tell it to even though I have it as a hard requirement in claude.md).

jeswin

3 months ago

Claude forgets claude.md after a while, so you need to keep reminding. I find that codex does a design job better than Claude at the moment, but it's 3x slower which I don't mind.

svieira

3 months ago

1 reply

Quite impressive, thank you for sharing!

Question - this loads a 2 MB JS parser written in Rust to turn `x => x.foo` into `{ op: 'project', field: 'foo', target: 'x' }`. But you don't actually allow any complex expressions (and you certainly don't seem to recursively parse references or allow return uplift, e. g. I can't extract out `isOver18` or `isOver(age: int)(Row: IQueryable): IQueryable`). Why did you choose the AST route instead of doing the same thing with a handful of regular expressions?

jeswin

3 months ago

Parsing code with regex is a minefield. You can get it to work with simpler cases, but even that might get complex very quickly with all sorts of formatting preferences that people have. In fact, I'll be very surprised if it can be done with a few regular expressions; so I never gave it much consideration. Additionally, improved subquery support etc is coming, involving deeper recursion.

I could have allowed (I did consider it) functions external to the expression, like isOver18 in your example. But it would have come at the cost of the parser having to look across the code base, and would have required tinqerjs to attach via build-time plugins. The only other way (without plugins) might be to identify callers via Error.stack, and attempting to find the calling JS file.

TheCoelacanth

3 months ago

Development tools and libraries seem like they may be one of the absolute easiest use cases to get LLMs to work with since they generally have far less ambiguous requirements than other software and the LLMs generally have an enormous amount of data in their training set to help them understand the domain.

coldtea

3 months ago

1 reply

>I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.

The reason better turn to "It can do stuff faster than I ever could if I give it step by step high level instructions" instead.

AllegedAlec

3 months ago

1 reply

That would be a solution, yes. But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.

I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects".

It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work.

We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.

coldtea

3 months ago

1 reply

>But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.

Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)

>We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.

We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time...

AllegedAlec

3 months ago

> >But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.

> Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)

But we do though. You can't just say "yeah they left all the foot guns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back.

jansan

3 months ago

1 reply

I was hoping that LLMs being able to access strict tools, like Gemini using Python libraries, would finally give reliable results.

So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.

But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.

You still cannot trust LLMs. And that is a problem.

ogogmad

3 months ago

The obvious point has to be made: Generating formal proofs might be a partial fix for this. By contrast, coding is too informal for this to be as effective for it.

habibur

3 months ago

Might be related with what the article was talking. AI can't cut-paste. It deletes the code and then regenerates it at another location instead of cut-paste.

Obviously generated code drift a little from deleted ones.

schiho

3 months ago

1 reply

I just run into this issue with claude sonet 4.5, asked it to copy/paste some constants from one file to another, a bigger chunk of code, it instead "extracted" pieces and named them so. As a last resort, after going back and forth it agreed to do a file/copy by running a system command. I was surprised that of all the programming tasks, a copy/paste felt challenging for the agent.

tjansen

3 months ago

I guess the LLMs are trained to know what finished code looks like. They don't really know the operations a human would use to get there.

tjansen

3 months ago

5 replies

Agreed with the points in that article, but IMHO the no 1 issue is that agents only see a fraction of the code repository. They don't know whether there is a helper function they could use, so they re-implement it. When contributing to UIs, they can't check the whole UI to identify common design patterns, so they re-invent it.

The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.

(BTW another issue is that they have problems navigating the directory structure in a large mono repo. When the agents needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time)

Leynos

3 months ago

4 replies

I wonder if a large context model could be employed here via tool call. One of the great things Gemini chat can do is ingest a whole GitHub repo.

Perhaps "before implementing a new utility or helper function, ask the not-invented-here tool if it's been done already in the codebase"

Of course, now I have to check if someone has done this already.

knes

3 months ago

This is what we do at Augmentcode.com.

We started with building the best code retrieval and build an agent around it.

4b11b4

3 months ago

Sure, but just bcuz it went into context doesn't mean LLM "understand" it. Also, not all sections of context iz equal.

itsdavesanders

3 months ago

Claude can use use tools to do that, and some different code indexer MCPs work, but that depends on the LLM doing the coding to make the right searches to find the code. If you are in a project where your helper functions or shared libs are scattered everywhere it’s a lot harder.

Just like with humans it definitely works better if you follow good naming conventions and file patterns. And even then I tend to make sure to just include the important files in the context or clue the LLM in during the prompt.

It also depends on what language you use. A LOT. During the day I use LLMs with dotnet and it’s pretty rough compared to when I’m using rails on my side projects. Dotnet requires a lot more prompting and hand holding, both due to its complexity but also due to how much more verbose it is.

bunderbunder

3 months ago

Large context models don't do a great job of consistently attending to the entire context, so it might not work out as well in practice as continuing to improve the context engineering parts of coding agents would.

I'd bet that most the improvement in Copilot style tools over the past year is coming from rapid progress in context engineering techniques, and the contribution of LLMs is more modest. LLMs' native ability to independently "reason" about a large slushpile of tokens just hasn't improved enough over that same time period to account for how much better the LLM coding tools have become. It's hard to see or confirm that, though, because the only direct comparison you can make is changing your LLM selection in the current version of the tool. Plugging GPT5 into the original version of Copilot from 2021 isn't an experiment most of us are able to try.

hwillis

3 months ago

2 replies

That's what claude.md etc are for. If you want it to follow your norms then you have to document them.

tjansen

3 months ago

That's fine for norms, but I don't think you can use it to describe every single piece of your code. Every function, every type, every CSS class...

ColonelPhantom

3 months ago

Well, sure, but from what I know, humans are way better at following 'implicit' instructions than LLMs. A human programmer can 'infer' most of the important basic rules from looking at the existing code, whereas all this agents.md/claude.md/whatever stuff seems necessary to even get basic performance in this regard.

Also, the agents.md website seems to mostly list README.md-style 'how do I run this instructions' in its example, not stylistic guidelines.

Furthermore, it would be nice if these agents add it themselves. With a human, you tell them "this is wrong, do it that way" and they would remember it. (Although this functionality seems to be worked on?)

bunderbunder

3 months ago

2 replies

This is what I keep running into. Earlier this week I did a code review of about new lines of code, written using Cursor, to implement a feature from scratch, and I'd say maybe 200 of those lines were really necessary.

But, y'know what? I approved it. Because hunting down the existing functions it should have used in our utility library would have taken me all day. 5 years ago I would have taken the time because a PR like that would have been submitted by a new team member who didn't know the codebase well, and helping to onboard new team members is an important part of the job. But when it's a staff engineer using Cursor to fill our codebase with bloat because that's how management decided we should work, there's no point. The LLM won't learn anything and will just do the same thing over again next week, and the staff engineer already knows better but is being paid to pretend they don't.

tjansen

3 months ago

1 reply

>>because that's how management decided we should work, there's no point

If you are personally invested, there would be a point. At least if you plan to maintain that code for a few more years.

Let's say you have a common CSS file, where you define .warning {color: red}. If you want the LLM to put out a warning and you just tell it to make it red, without pointing out that there is the .warning class, it will likely create a new CSS def for that element (or even inline it - the latest Claude Code has a tendency to do that). That's fine and will make management happy for now.

But if later management decides that it wants all warning messages to be pink, it may be quite a challenge to catch every place without missing one.

bunderbunder

3 months ago

There really wouldn't be; it would just be spitting into the wind. What am I going to do, convince every member of my team to ignore a direct instruction from the people who sign our paychecks?

ahi

3 months ago

I really really hate code review now. My colleagues will have their LLMs generate thousands of lines of boiler plate with every pattern and abstraction under the sun. A lazy programmer use to do the bare minimum and write not enough code. That made review easy. Error handling here, duplicate code there, descriptive naming here, and so on. Now a lazy programmer generates a crap load of code cribbed from "best practice" tutorials, much of it unnecessary and irrelevant for the actual task at hand.

ewoodrich

3 months ago

> When the agents needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time)

I was running into this constantly on one project with a repo split between a Vite/React front end and .NET backend (with well documented structure). It would sometimes go into panic mode after some npm command didn’t work repeatedly and do all sorts of pointless troubleshooting over and over, sometimes veering into destructive attempts to rebuild whatever it thought was missing/broken.

I kept trying to rewrite the section in CLAUDE.md to effectively instruct it to always first check the current directory to verify it was in the correct $CLIENT or $SERVER directory. But it would still sometimes forget randomly which was aggravating.

I ended up creating some aliases like “run-dev server restart” “run-dev client npm install” for common operations on both server/client that worked in any directory. Then added the base dotnet/npm/etc commands to the deny list which forced its thinking to go “Hmm it looks like I’m not allowed to run npm, so I’ll review the project instructions. I see, I can use the ‘run-dev’ helper to do $NPM_COMMAND…”

It’s been working pretty reliably now but definitely wasted a lot of time with a lot of aggravation getting to that solution.

rdsubhas

3 months ago

To be fair, this is a daily life story for any senior engineer working with other engineers.

Vipsy

3 months ago

1 reply

Coding agents tend to assume that the development environment is static and predictable, but real codebases are full of subtle, moving parts - tooling versions, custom scripts, CI quirks, and non-standard file layouts.

Many agents break down not because the code is too complex, but because invisible, “boring” infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents. you’re fighting not logic errors, but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change.

pimeys

3 months ago

Yep. One of the things I've found agents always having a lot of trouble with is anything related to OpenTelemetry. There's a thing you call that uses some global somewhere, there's a docker container or two and there's the timing issues. It takes multiple tries to get anything right. Of course this is hard for a human too if you haven't used otel before...

throw-10-8

3 months ago

2 replies

3. Saying no

LLMs will gladly go along with bad ideas that any reasonable dev would shoot down.

nxpnsv

3 months ago

1 reply

Agree, this is really bad.

throw-10-8

3 months ago

It's a fundamental failing of trying to use a statistical approximation of human language to generate code.

You can't fix it.

pimeys

3 months ago

1 reply

I've found codex to be better here than Claude. It has stopped many times and said hey you might be wrong. Of course this changes with a larger context.

Claude is just chirping away "You're absolutely right" and making me to turn on caps lock when I talk to it and it's not even noon yet.

throw-10-8

3 months ago

1 reply

i find the chirpy affirmative tone of claude to be rage inducing

pimeys

3 months ago

1 reply

This. The biggest reason I went with OpenAI this month...

throw-10-8

3 months ago

My "favorite" is when it makes a mistake and then tries gaslight you into thinking it was your mistake and then confidently presents another incorrect solution.

All while having the tone of an over caffeinated intern who has only ever read medium articles.

the_mitsuhiko

3 months ago

2 replies

> LLMs don’t copy-paste (or cut and paste) code. For instance, when you ask them to refactor a big file into smaller ones, they’ll "remember" a block or slice of code, use a delete tool on the old file, and then a write tool to spit out the extracted code from memory. There are no real cut or paste tools. Every tweak is just them emitting write commands from memory. This feels weird because, as humans, we lean on copy-paste all the time.

There is not that much copy/paste that happens as part of refactoring so it leans to just using context recall. It's not entirely clear if providing an actual copy/paste command is particularly useful, at least from my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: with it you can instruct codex or claude to perform edits with it.

> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.

It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.

brianpan

3 months ago

1 reply

How is it not clear that it would be beneficial?

To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right.

the_mitsuhiko

3 months ago

> How is it not clear that it would be beneficial?

There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation.

So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good.

> To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.

So yeah, that's the more interesting case and there things like codemod/fastmod are very effective if you tell an agent to use it. They just don't reach there.

3abiton

3 months ago

I think copy/paste can alleviate context explosion. Basically the model can remember what's the code block contain, can access it at any time, without needing to "remember" it.

sxp

3 months ago

1 reply

Another place where LLMs have a problem is when you ask them to do something that can't be done via duct taping a bunch of Stack Overflow posts together. E.g, I've been vibe coding in Typescript on Deno recently. For various reasons, I didn't want to use the standard Express + Node stack which is what most LLMs seem to prefer for web apps. So I ran into issues with Replit and Gemini failing to handle the subtle differences between node and deno when it comes to serving HTTP requests.

LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js does support GL_TRIANGLE_STRIP rendering for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling be about this and I had to Google the problem myself.

It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.

athrowaway3z

3 months ago

>But Three.js does support GL_TRIANGLE_STRIP rendering for architectural reasons.

Typo or trolling the next LLM to index HN comments?

freetonik

3 months ago

5 replies

I see a pattern in these discussions all the time: some people say how very, very good LLMs are, and others say how LLMs fail miserably; almost always the first group presents examples of simple CRUD apps, frontend "represent data using some JS-framework" kind of tasks, while the second group presents examples of non-trivial refactoring, stuff like parsers (in this thread), algorithms that can't be found in leetcode, etc.

Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.

quietbritishjim

3 months ago

4 replies

Yesterday, I got Claude Code to make a script that tried out different point clustering algorithms and visualise them. It made the odd mistake, which it then corrected with help, but broadly speaking it was amazing. It would've taken me at least a week to write by hsnd, maybe longer. It was writing the algorithms itself, definitely not just simple CRUD stuff.

dncornholio

3 months ago

1 reply

That's actually a very specific domain, which is well documented and researched in which LLM's will alawys do well. Shit will hit the fans quickly when you're going to do integration where it won't have a specific problem domain.

fwip

3 months ago

Yep - visualizing clustering algorithms is just the "CRUD app" of a different speciality.

One rule of thumb I use, is if you could expect to find a student on a college campus to do a task for you, an LLM will probably be able to do a decent job. My thinking is because we have a lot of teaching resources available for how to do that task, which the training has of course ingested.

piva00

3 months ago

In my experience it's been great to have LLMs for narrowly-scoped tasks, things I know how I'd implement (or at least start implementing) but that would be tedious to manually do, prompting it with increasingly higher complexity does work better than I expected for these narrow tasks.

Whenever I've attempted to actually do the whole "agentic coding" by giving it a complex task, breaking it down in sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc. it hasn't a single fucking time done the thing it was supposed to do to completion, requiring a lot of manual reviewing, backtracking, nudging, it becomes more exhausting than just doing most of the work myself, and pushing the LLM to do the tedious work.

It does work sometimes to use for analysis, and asking it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back, and restart.

There's a fine line to walk, and I only see comments on the extremes online, it's either "I let 80 agents running and they build my whole company's code" or "they fail miserably on every task harder than a CRUD". I tend to not believe in either extreme, at least not for the kinds of projects I work on which require more context than I could ever fit properly beforehand to these robots.

freetonik

3 months ago

I also got good results for “above CRUD” stuff occasionally. Sorry if I wasn’t clear, I meant to primarily share an observation about vastly different responses in discussions related to LLMs. I don’t believe LLMs are completely useless for non-trivial stuff, nor I believe that they won’t get better. Even those two problems in the linked article: sure, those actions are inherently alien to the LLM’s structure itself, but can be solved with augmentation.

an0malous

3 months ago

Let’s see the diff

NitpickLawyer

3 months ago

2 replies

> almost always the first group presents examples of simple CRUD apps

How about a full programming language written by cc "in a loop" in ~3 months? With a compiler and stuff?

https://cursed-lang.org/

It might be a meme project, but it's still impressive as hell we're here.

I learned about this from a yt content creator that took that repo, asked cc to "make it so that variables can be emojis", and cc did that 5$ later. Pretty cool.

freetonik

3 months ago

2 replies

Ok, not trivial for sure, but not novel? IIUC, the language does not have really new concepts, apart from the keywords (which is trivial).

Impressive nonetheless.

NitpickLawyer

3 months ago

1 reply

Novel as in never done before? Of course not.

Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched.

Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.

freetonik

3 months ago

>Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space.

Absolutely. I do not underestimate this.

sarchertech

3 months ago

1 reply

There’s no evidence that this ever happened other than this guy’s word. And since the claim that he ran an agent with no human intervention for 3 months is so far outside of any capabilities demonstrated by anyone else, I’m going to need to see some serious evidence before I believe it.

NitpickLawyer

3 months ago

1 reply

> There’s no evidence that this ever happened other than this guy’s word.

There's a yt channel where the sessions were livestreamed. It's in their FAQ. I haven't felt the need to check them, but there are 10-12h sessions in there if you're that invested in proving that this is "so far outside of any capabilities"...

A brief look at the commit history should show you that it's 99.9% guaranteed to be written by an LLM :)

When's the last time you used one of these SotA coding agents? They've been getting better and better for a while now. I am not surprised at all that this worked.

sarchertech

3 months ago

1 reply

>When's the last time you used one of these SotA coding agents?

This morning :)

>"so far outside of any capabilities"

Anthropic was just bragging last week about being able to code without intervention for 30 hours before completely losing focus. They hailed it as a new bench mark. It completed a project that was 11k lines of code.

The max unsupervised run that GPT-5-Codex has been able to pull off is 7 hours.

That's what I mean by the current SOTA demonstrated capabilities.

https://x.com/rohanpaul_ai/status/1972754113491513481

And yet here you have a rando who is saying that he was able able to get an agent to run unsupervised for 100x longer than what the model companies themselves have been able to do and produce 10x the amount of code--months ago.

I'm 100% confident this is fake.

>There's a yt channel where the sessions were livestreamed.

There are a few videos that long, not 3 months worth of videos. Also I spot checked the videos and it the framerate is so low that it would be trivial to cut out the human intervention.

>guaranteed to be written by an LLM

I don't doubt that it was 99.9% written by an LLM, the question is whether he was able to run unsupervised for 3 months or whether he spent 3 months guiding an LLM to write it.

NitpickLawyer

3 months ago

1 reply

I think you are confusing 2 things here. What the labs mean when they announce x hours sessions is on "one session" (i.e. the agent manages its own context via trimming and memory files, etc). What the project I linked did was "run in a bash loop", that basically resets the context every time the agent "finishes".

That would mean that every few hours the agent starts fresh, does the inspect repo thing, does the plan for that session, and so on. That would explain why it took it ~3 months to do what a human + ai could probably do in a few weeks. That's why it doesn't sound too ludicrous for me. If you look at the repo there are a lot of things that are not strictly needed for the initial prompt (make a programming language like go but with genz stuff, nocap).

Oh, and if you look at their discord + repo, lots of things don't actually work. Some examples do, some segfault. That's exactly what you'd expect from "running an agent in a loop". I still think it's impressive nonetheless.

The fact that you are so incredulous (and I get why that is, scepticism is warranted in this space) is actually funny. We are on the right track.

sarchertech

3 months ago

There’s absolutely no difference from what he says he did and what Claude code can do behind the scenes.

If Anthropic thought they could produce anything remotely useful by wiping the context and reprompting every few hours, they would be doing it. And they’d be saying “look at this we implemented hard context reset and we can now run our agent for 30 days and produce an entire language implementation!”

In 3 months or 300 years of operating like this a current agent being freshly reprompted every few hours would never produce anything that even remotely looked like a language implementation.

As soon as its context was poisoned with slightly off topic todo comments it would spin out into writing a game of life implementation or whatever. You’d have millions of lines of nonsense code with nothing useful after 3 months of that.

The only way I see anything like this doing anything approaching “useful” is if the outer loop wipes the repo on every reset as well, and collects the results somewhere the agent can’t access. Then you essentially have 100 chances to one shot the thing.

But at that point you just have a needlessly expensive and slow agent.

Gazoche

3 months ago

1 reply

> written by cc "in a loop" in ~3 months?

What does that mean exactly? I assume the LLM was not left alone with its task for 3 months without human supervision.

NitpickLawyer

3 months ago

1 reply

From the FAQ:

> the following prompt was issued into a coding agent:

> Hey, can you make me a programming language like Golang but all the lexical keywords are swapped so they're Gen Z slang?

> and then the coding agent was left running AFK for months in a bash loop

sarchertech

3 months ago

I don’t buy it at all. Not even Anthropic or Open AI have come anywhere close to something like this.

Running for 3 months and generating a working project this large with no human intervention is so far outside of the capabilities of any agent/LLM system demonstrated by anyone else that the mostly likely explanation is that the promoter is lying about it running on its own for 3 months.

I looked through the videos listed as “facts” to support the claims and I don’t see anything longer than a few hours.

qingcharles

3 months ago

I use LLMs to vibe-code entire tools that I need for my work. They're really banal boring apps that are relatively simple, but they still would have wasted a day or two each to write and debug. Even stuff as simple as laying out the whole UI in a nice pattern. Most of these are now practically one-shots from the latest Claude and GPT. I leave them churning, get coffee, come back and test the finished product.

slyzmud

3 months ago

The two groups are very different but I notice another pattern: you have people who like coding and understanding details of what their are doing, are curious, what to learn about the why and think about edge cases; and there's another group of people who just want to code something, make a test pass, show a nice UI and that's it, but don't think much about edge cases or maintainability. The only thing they think is "delivering value" to customers.

Usually those two groups correlate very well with liking LLMs: some people will ask Claude to create a UI with React and see the mess it generated (even if it mostly works) and the edge cases it left out and comment in forums that LLMs don't work. The other group of people will see the UI working and call it a day without even noticing the subtleties.

regularfry

3 months ago

The function of technological progress, looked at through one lens, is to commoditise what was previously bespoke. LLMs have expanded the set of repeatable things. What we're seeing is people on the one hand saying "there's huge value in reducing the cost of producing rote assets", and on the other "there is no value in trying to apply these tools to tasks that aren't repeatable".

Both are right.

sidgtm

3 months ago

1 reply

As a UX designer I see they lack the ability of being opinionated about a design piece and go with the standard mental model. I got fed up with this and made a simple java script code to run a simple canvas on the localhost to pass on more subjective feedback using highlights and notes feature. I tried using playwright first but a. its token heavy b. it's still for finding what's working or breaking instead of thinking deeply about the design.

seunosewa

3 months ago

1 reply

Whatdo the notes look like?

sidgtm

3 months ago

specific inputs e.g. move, color change, or giving specific inputs for interaction piece.

rossant

3 months ago

2 replies

Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and pasted snippets here and there as I would have done myself, it rewrote them from memory, removing comments in the process. There was a section with 40 successive <a href...> links with complex URLs.

A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.

Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.

I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all websites would have had moved internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.

Fortunately, I could retrieve the old URLs on old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs! Replacing things like domain.com/this-article-is-about-foobar-123456/ by domain.com/foobar-is-so-great-162543/...

These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!

ivape

3 months ago

5 replies

You’re just not using LLMs enough. You can never trust the LLM to generate a url, and this was known over two years ago. It takes one token hallucination to fuck up a url.

It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.

doikor

3 months ago

1 reply

I would generalise it to you can’t trust LLMs to generate any kind of unique identifier. Sooner or later it will hallucinate a fake one.

wat10000

3 months ago

I would generalize it further: you can't trust LLMs.

They're useful, but you must verify anything you get from them.

grey-area

3 months ago

4 replies

Or just not bother. It sounds pretty useless if it flunks on basic tasks like this.

Perhaps you’ve been sold a lie?

ivape

3 months ago

2 replies

Well, you see it hallucinates on long precise strings, but if we ignore that, and focus on what it’s powerful at, we can do something powerful. In this case, by the time it gets to outputting the url, it already determined the correct intent or next action (print out a url). You use this intent to do a tool call to generate a url. Small aside, it’s ability to figure what and why is pure magic, for those still peddling the glorified autocomplete narrative.

You have to be able to see what this thing can actually do, as opposed to what it can’t.

sebtron

3 months ago

1 reply

> Well, you see it hallucinates on long precise strings

But all code is "long precise strings".

ogogmad

3 months ago

He obviously means random unstructured strings, which code is usually not.

grey-area

3 months ago

I can’t even tell if you’re being sarcastic about a terrible tool or are hyping up LLMs as intelligent assistants and telling me we’re all holding it wrong.

IanCal

3 months ago

1 reply

They're moderately unreliable text copying machines if you need exact copying of long arbitrary strings. If that's what you want, don't use LLMs. I don't think they were ever really sold as that, and we have better tools for that.

On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.

Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.

goalieca

3 months ago

3 replies

> I don't think they were ever really sold as that, and we have better tools for that.

We have OpenAI calling gpt5 as having PhD level of intelligence and others like Anthropoc saying it will write all our code within months. Some are claiming it’s already writing 70%.

I say they are being sold as a magical do everything tool.

buildbot

3 months ago

3 replies

Would you hire a PhD to copy URLs by hand? Would them having PhD make it less likely they’d make a mistake than an high school student doing the same?

goalieca

3 months ago

1 reply

Grad students and even post docs often do a lot of this manual labour for data entry and formatting. Been there, done that.

IanCal

3 months ago

Manual data entry has lots of errors. All good workflows around this base themselves on this fact.

parineum

3 months ago

2 replies

A high school student would use copy/paste and the urls would be perfect duplicates..

baq

3 months ago

1 reply

LLMs aren’t high school students, they’re blobs of numbers which happen to speak English if you poke them right. Use the tool when it’s good at what it does.

parineum

3 months ago

And the people who are causing this confusion are the CEOs of the companies saying that the newest model is a PHD in your pocket.

IanCal

3 months ago

> A high school student would use copy/paste and the urls would be perfect duplicates..

Did the LLM have this?

hitarpetar

3 months ago

I would not hire anyone for a role that requires computer use who does not know how to use copy/paste

IanCal

3 months ago

2 replies

Intelligence isn't the same as "can exactly replicate text". I'm hopefully smarter than a calculator but it's more reliable at maths than me.

Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.

mbesto

3 months ago

What you are describing is "dead reasoning zones".[0]

    "This isn't how humans work. Einstein never saw ARC grids, but he'd solve them instantly. Not because of prior knowledge, but because humans have consistent reasoning that transfers across domains. A logical economist becomes a logical programmer when they learn to code. They don't suddenly forget how to be consistent or deduce.

    But LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones. Asking questions outside the training distribution is almost like an adversarial attack on the model."

https://jeremyberman.substack.com/p/how-i-got-the-highest-sc...

hitarpetar

3 months ago

saddest goalpost ever

laterium

3 months ago

2 replies

They are lying, because their salary depends on them lying about it. Why does it even matter what they're saying? Why don't we listen to scientists, researchers, practicioners and the real users of the technology and stop repeating what the CEOs are saying?

The things they're saying are technically correct, the best kind of correct. The models beat human PhDs on certain benchmarks of knowledge and reasoning. They may write 70% of the easiest code in some specific scenario. It doesn't matter. They're useful tools that can make you slightly more productive. That's it.

When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?

grey-area

3 months ago

1 reply

So questioning the utility of LLMs for knowledge work is now akin to a conspiracy theory?

laterium

3 months ago

1 reply

Not what I said at all. Question it all what you want. But disproving outrageous CEO claims doesn't get you there. Whether LLMs are AGI/ASI that will replace everyone is seperate from whether they are useful today as tools. Attacking the first claim doesn't mean much for the second claim, which is the more interesting one.

grey-area

3 months ago

I'm questioning the basic utility. They are text generation machines. This makes them unsuitable for any work that requires accuracy or understanding, which is the vast majority of knowledge work.

culll_kuprey

3 months ago

2 replies

> When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?

Only after schizophrenic dentists go around telling people that brushing their teeth is going to lead to a post-scarcity Star Trek world.

laterium

3 months ago

It's a new technology which lends itself well to outrageous claims and marketing, but the analogy stands. The CEOs don't get to define the narrative or stand as strawman targets for anti-AI folks to dunk on, sorry. Elon has been repeating "self driving next year" for a decade+ at this point, that doesn't make what Waymo did unimpressive. This level of cynicism is unwarranted is what I'm saying.

IanCal

3 months ago

You shouldn’t - that’s the point of the comparison. If some insane dentists started saying this you should not stop brushing your teeth!

baq

3 months ago

1 reply

Read about the jagged frontier. IanCal is right: this is a perfect example of using the tool wrong; you’ve focused on a very narrow use case which is surprisingly hard for the matmuls to not mess up and extrapolate, but extrapolation is incorrect here because the capability frontier is fractal and not continuous.

grey-area

3 months ago

1 reply

It’s not surprisingly hard at all, when you consider they have no understanding of the tasks they do nor of the subject material. It’s just a good example of the types of tasks (anything requiring reliability or correct results) that they are fundamentally unsuited to.

Sadly it seems the best use-case for LLMs at this point is bamboozling humans.

baq

3 months ago

When you take a step back it's surprising that these tools can be actually useful at all in nontrivial tasks, but being surprised doesn't matter in the grand scheme of things. Bamboozling rarely enough for harnesses to keep them in line and ability to inference-time self-correct when bamboozling is detected either by the model itself or by the harness is very useful at least in my work. It's a question of using the tool correctly and understanding its limitations, which is hard if you aren't willing to explore the boundaries and commit to doing it every month basically.

seanw265

3 months ago

I suspect you haven't tried a modern mid-to-large-LLM & Agent pair for writing code. They're quite capable, even if not suited for all tasks.

hansmayer

3 months ago

1 reply

Yeah so, the reason people use various tools and machines in the first place is to simplify the work or everydays tasks by : 1) Making the tasks execute faster 2) Getting more reliable outputs then doing this by yourself 3) Making it repeatable . The LLMs obviously dont check any of these boxes so why don´t we stop pretending that we as users are stupid and don´t know how to use them and start taking them for what they are - cute little mirages, perhaps applicable as toys of some sort, but not something we should use for serious engineering work really?

IanCal

3 months ago

3 replies

They easily check a bunch of those boxes.

> why don´t we stop pretending that we as users are stupid and don´t know how to use them

This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!

hansmayer

3 months ago

1 reply

The URLs being wrong in that specific case is one where they were using the "wrong tool". I can name you at least a dozen other cases from own experience, where too, they appear to be the wrong tool, for example for working with Terraform or for not exposing secrets by hardcoding them in the frontend. Et cetera. Many other people will have contributed thousands if not more similar but different cases. So what good are these tools then for really? Are we all really that stupid? Many of us mastered the hard problem of navigating various abstraction layers of computer over the years, only to be told, we now effing dont know how to write a few sentences in English? Come on. I'd be happy to use them in whatever specific domain they supposedly excel at. But no-one seems to be able to identify one for sure. The problem is, the folks pushing or better said, shoving down these bullshit generators down our throats are trying to sell us the promise of an "everything oracle". What did old man Altman tell us about ChatGPT 5? PhD level tool for code generation or some similar nonsense? But it turns out it only gets one metric right each time - generating a lot of text. So, essentially, great for bullshit jobs (i count some of the IT jobs as such too), but not much more.

IanCal

3 months ago

1 reply

> Many of us mastered the hard problem of navigating various abstraction layers of computer over the years, only to be told, we now effing dont know how to write a few sentences in English? Come on.

If you're trying to one shot stuff with a few sentences then yes you might be using these things wrong. I've seen people with PhDs fail to use google successfully to find things, were they idiots? If you're using them wrong you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someones capabilities then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this.

Let's do a quick overview of recent chats for me:

* Identifying and validating a race condition in some code

* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code

* Identifying an async bug two good engineers couldn't find in a codebase they knew well

* Finding performance issues that had gone unnoticed

* Digging through synapse documentation and github issues to find a specific performance related issue

* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed

* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money

* Going through funding opportunities and matching them against a charity I want to help in my local area

* Building a search integration for my local library to handle my kids reading challenge

* Solving a series of VPN issues I didn't understand

* Writing a lot of astro related python for an art project to cover the loss of some NASA images I used to have access to.

> the folks pushing or better said

If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.

hansmayer

3 months ago

1 reply

Again mate, stop making arrogant assumptions and read some of my previous comments. I and my team are early adopters, since about 2 years. I am even paying for premium-level service. Trust me, it sucks and under-delivers. But good for you and others who claim they are productive with it - I am sure we will see those 10x apps rolling in soon, right? It's only been like 4 years since the revolutionary magic machine was announced.

IanCal

3 months ago

I read your comments. Did you read mine? You can pass them into chatgpt or claude or whatever premium services you pay for to summarise them for you if you want.

> Trust me, it sucks

Ok. I'm convinced.

> and under-delivers.

Compared to what promise?

> I am sure we will see those 10x apps rolling in soon, right?

Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).

> It's only been like 4 years since the revolutionary magic machine was announced.

It's been less than 3 since chatgpt launched, which if you'd been in the AI sphere as long as I had (my god it's 20 years now) absolutely was revolutionary. Over the last 4 years we've seen gpt3 solve a bunch of NLP problems immediately as long as you didn't care about cost to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea I can get a summary from input without training is already impressive and then to be able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.

If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.

Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:

Why are you paying for something that solves literally no problems for you?

mbesto

3 months ago

1 reply

> This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!

The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s

IanCal

3 months ago

1 reply

If you base all your tech workings on the promises of CEOs you'll fail badly, you should not be surprised by this.

mbesto

3 months ago

Thanks for the advice...

hitarpetar

3 months ago

it's amazing that you picked another dark pattern as your comparison

jollyllama

3 months ago

1 reply

> You’re just not using LLMs enough.

> You can never trust the LLM to generate a url

This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.

This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.

IanCal

3 months ago

1 reply

It's pretty clearly worded to me, they don't use LLMs enough to know how to use them successfully. If you use them regularly you wouldn't see a set of urls without thinking "Unless these are extremely obvious links to major sites, I will assume each is definitely wrong".

> I'm sick of hearing "you're doing it wrong"

That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.

> when the real answer is "this tool can't do that."

That is what they said.

jollyllama

3 months ago

1 reply

> If you use them regularly you wouldn't see a set of urls without thinking...

Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.

IanCal

3 months ago

This isn't ai maximalist though, it's explicitly pointing out something that regularly does not work!

> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".

But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.

"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".

fwip

3 months ago

I think part of the issue is that it doesn't "feel" like the LLM is generating a URL, because that's not what a human would be doing. A human would be cut & pasting the URLs, or editing the code around them - not retyping them from scratch.

Edit: I think I'm just regurgitating the article here.

worldsayshi

3 months ago

5 replies

This is of course bad but: humans also makes (different) mistakes all the time. We could account for the risk of mistakes being introduced and make more tools that validate things for us. In a way LLM:s encourage us to do this by adding other vectors of chaos into our work.

Like, why not have tools built into our environment that checks that links are not broken? With the right architecture we could have validations for most common mistakes without having the solution adding a bunch of tedious overhead.

rullelito

3 months ago

1 reply

LLMs are turning into LLMs+hard-coded fixes for every imaginable problem.

worldsayshi

3 months ago

Why hard coded?

lenkite

3 months ago

1 reply

In the above kind of described situation, a meticulous coder actually makes no mistakes. They will however make a LOT more mistakes if they use LLM's to do the same.

I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.

When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.

Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.

mr_mitm

3 months ago

A meticulous coder probably wouldn't have typed out 40 URLs just because they want to move them from one file to another. They would copy-past them and run some sed-like commands. You could instruct an LLM agent to do something similar. For modifying a lot of files or a lot of lines, I instruct them to write a script that does what I need instead of telling them to do it themselves.

rossant

3 months ago

I agree, these kinds of stories should encourage us to setup more robust testing/backup/check strategies. Like you would absolutely have to do if you suddenly invited a bunch of inexperienced interns to edit your production code.

cimi_

3 months ago

Your point to not rely on good intentions and have systems in place to ensure quality is good - but your comparison to humans didn't go well with me.

Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.

Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.

exe34

3 months ago

> that checks that links are not broken?

Can you spot the next problem introduced by this?

giancarlostoro

3 months ago

Point #2 cracks me up because I do see with JetBrains AI (no fault of JetBrains mind you) the model updates the file, and sometimes I somehow wind up with like a few build errors, or other times like 90% of the file is now build errors. Hey what? Did you not run some sort of what if?

IanCal

3 months ago

Editing tools are easy to add it’s just you have to pick what things to give them because too many and they struggle as it uses up a lot of context. Still, as costs come down multiple steps to look for tools becomes cheaper too.

I’d like to see what happens with better refactoring tools, I’d make a bunch more mistakes copying and retyping or using awk. If they want to rename something they should be able to use the same tooling the rest of us get.

Asking questions is a good point but that’s both a bit of promoting and I think the move to having more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is we take a lot of time and cost a lot of money to build things so the economics favours getting it right first time. As the time comes down and the cost drops to near zero, the balance changes.

There are also other approaches to clarify more what you want and how to do it first, breaking that down into tasks, then letting it run with those (spec kit). This is an interesting area.

3 months ago

IaC, and DSLs in general.

hu3

3 months ago

I have seen LLMs in VSCode Copilot ask to execute 'mv oldfile.py newfile.py'.

So there's hope.

But often they just delete and recreate the file, indeed.

ziotom78

3 months ago

I fully resonate with point #2. A few days ago, I was stuck trying to implement some feature in a C++ library, so I used ChatGPT for brainstorming.

ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.

baq

3 months ago

they're getting better at asking questions; I routinely see search calls against the code base index. they just don't ask me questions.

210 more comments available on Hacker News

View full discussion on Hacker News

ID: 45523537Type: storyLast synced: 11/20/2025, 8:14:16 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN