If You Are Good at Code Review, You Will Be Good at Using AI Agents
Mood: heated
Sentiment: negative
Category: other
Key topics: The article suggests that being good at code review translates to being good at using AI agents, but the discussion reveals significant skepticism and concerns about the effectiveness and implications of AI-generated code.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 3h after posting
- Peak period: 116 comments (Day 1)
- Avg / period: 26.7
Based on 160 loaded comments
Key moments
- Story posted: Sep 20, 2025 at 12:59 AM EDT (2 months ago)
- First comment: Sep 20, 2025 at 4:07 AM EDT (3h after posting)
- Peak activity: 116 comments in Day 1, the hottest window of the conversation
- Latest activity: Sep 25, 2025 at 6:23 AM EDT (2 months ago)
... Function names compose much of the API.
The API is the structure of the codebase.
This isn't some triviality you can throw aside as unimportant, it is the shape that the code has today, and limits and controls what it will have tomorrow.
It's how you make things intuitive, and it is equally how you ensure people follow a correct flow and don't trap themselves into a security bug.
And sometimes the LLM just won't go in the direction you want, but that's OK - you just have to go write those bits of code.
It can be surprising where it works and where it doesn't.
If you just go with those first suggestions, though, the code will end up rough.
I'm not sure if you're saving any time there, though. Perhaps if you give an LLM task before ending the work day so it can churn away for a while unattended, it may generate a decent implementation. There's a good chance you need to throw out the work too; you can't rely on it, but it can be a nice bonus if you're lucky.
I've found that this only works on expensive models with large context windows and limited API calls, though. The amount of energy wasted on shit code that gets reverted must be tremendous.
I hope the AI industry makes good on its promise that it'll solve the whole inefficiency problem, because the way things are going now, the industry isn't sustainable.
> ...You’ll be forever tweaking individual lines of code, asking for a .reduce instead of a .map.filter, bikeshedding function names, and so on. At the same time, you’ll miss the opportunity to guide the AI away from architectural dead ends.
I think a good review will often do both, and understand that code happens at the line level and also the structural level. It implies a philosophy of coding that I have seen be incredibly destructive firsthand — committing a bunch of shit that no one on a team understands and no one knows how to reuse.
This is distinctly not the API, but an implementation detail.
Personally, I can ask colleagues to change function names, rework hierarchy, etc. But I'd leave this exact example be, as it does not make any material difference, regardless of my personal preference.
I do a lot of code reviews, and one of the main things I ask for, after bug fixes, is renaming things so that readers understand them unambiguously at first read and so that they match the various conventions we use throughout the codebase.
Ex: new dev wrote "updateFoo()" for a method converting a domain thing "foo" from its type in layer "a" to its type in layer "b", so I asked him to use "convertFoo_aToB()" instead.
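As a sketch of that rename in TypeScript: only the two function names come from the example above, and the FooLayerA/FooLayerB types are hypothetical stand-ins for the two layers.

```typescript
// Hypothetical layer types; only the function names come from the example above.
interface FooLayerA { id: string; rawAmount: string }
interface FooLayerB { id: string; amount: number }

// Before: "update" implies mutation, but nothing is being updated.
function updateFoo(foo: FooLayerA): FooLayerB {
  return { id: foo.id, amount: Number(foo.rawAmount) };
}

// After: the name says what actually happens, a conversion from layer "a" to layer "b".
function convertFoo_aToB(foo: FooLayerA): FooLayerB {
  return { id: foo.id, amount: Number(foo.rawAmount) };
}
```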
I really believe there are people out there that produce good code with these things, but all I've seen so far has been tragic.
Luckily, I've witnessed a few snap out of it and care again. Literally looks to me as if they had a substance abuse problem for a couple of months.
If you take a critical look at what comes out of contemporary agentic workflows, I think the conclusion must be that it's not there. So yeah, if you're a good reviewer, you would perhaps come to that conclusion much sooner.
Interesting comparison, why not weed or alcohol?
I'm not even anti-LLM. Little things (research, "write TS types for this object", searching my codebase, figuring out exactly which line in Django REST framework is causing some weird behavior) are working great and saving me an hour here and 15 minutes there.
It's really obvious when people lean on it, because they don't act like a beginner (trying things that might not work) or like they're just being sloppy (where there's a logic to it but no attention to detail); it's like they copy-pasted from Stack Overflow search results at random, and there are pieces that might belong but the totality is incoherent.
I'm definitely not anti LLM, I use them all the time. Just not for generating code. I give it a go every couple of months, probably wasting more time on it than I should. I don't think I've felt any real advancements since last year around this time, and this agentic hype seems to be a bit ahead of its time, to put it mildly. But I absolutely get a lot of value out of them.
I don't believe this at all, because all I've seen so far is tragic
I would need to see any evidence of good quality work coming from AI assisted devs before I start to entertain the idea myself. So far all I see is low effort low quality code that the dev themself is unable to reason about
The ability to ignore AI and focus on solving the problems has little to do with "fun". If anything it leaves a human-auditable trail to review later and hold accountable devs who have gone off the rails and routinely ignored the sometimes genuinely good advice that comes out of AI.
If humans don't have to helicopter over developers, that's a much bigger productivity boost than letting AI take the wheel. This is a nuance missed by almost everyone who doesn't write code or care about its quality.
The Ironies of Automation paper is something I mention a lot; its core thesis is that making humans review / rubber-stamp automation reduces their work quality. People just aren't wired to do boring stuff well.
Serious question: why not?
IMO it should be.
If "progress" is making us all more miserable, then what's the point? Shouldn't progress make us happier?
It feels like the endgame of AI is that the masses slave away for the profit of a few tech overlords.
OpenAI's Codex Cloud just added a new feature for code review, and their new GPT-5-Codex model has been specifically trained for code review: https://openai.com/index/introducing-upgrades-to-codex/
Gemini and Claude both have code review features that work via GitHub Actions: https://developers.google.com/gemini-code-assist/docs/review... and https://docs.claude.com/en/docs/claude-code/github-actions
GitHub have their own version of this pattern too: https://github.blog/changelog/2025-04-04-copilot-code-review...
There are also a whole lot of dedicated code review startups like https://coderabbit.ai/ and https://www.greptile.com/ and https://www.qodo.ai/products/qodo-merge/
Fundamentally, unit tests are using the same system to write your invariants twice; it just so happens that they're different enough that a failure in one tends to reveal a bug in the other.
You can't reasonably state this won't be the case with tools built for code review until the failure cases are examined.
Furthermore a simple way to help get around this is by writing code with one product while reviewing the code with another.
For unit tests, the parts of the system that are the same are not under test, while the parts that are different are under test.
The problem with using AI to review AI is that what you're checking is the same as what you're checking it with. Checking the output of one LLM with another brand probably helps, but they may also have a lot of similarities, so it's not clear how much.
This isn't true. Every instantiation of the LLM is different. Oversimplifying a little, but hallucination emerges when low-probability next words are selected. True explanations, on the other hand, act as attractors in state-space. Once stumbled upon, they are consistently preserved.
So run a bunch of LLM instances in parallel with the same prompt. The built-in randomness & temperature settings will ensure you get many different answers, some quite crazy. Evaluate them in new LLM instances with fresh context. In just 1-2 iterations you will hone in on state-space attractors, which are chains of reasoning well supported by the training set.
Not all of the mistakes, they generally still have a performance ceiling less than human experts (though even this disclaimer is still simplifying), but this kind of self-critique is basically what makes the early "reasoning" models one up over simple chat models: for the first-n :END: tokens, replace with "wait" and see it attempt other solutions and pick something usually better.
Generating 10 options with mediocre mean and some standard deviation, and then evaluating which is best, is much easier than deliberative reasoning to just get one thing right in the first place more often.
You can take the output of an LLM and feed it into another LLM and ask it to fact-check. Not surprisingly, these LLMs have a high false negative rate, meaning they won't always catch the error. (I think you agree with me so far.) However, the probabilities of these LLM failures are independent of each other, so long as you don't share context. The converse is that the LLM has a less-than-we-would-like probability of detecting a hallucination, but if it does, then verification of that fact is reliable in future invocations.
Combine this together: you can ask an LLM to do X, for any X, then take the output and feed it into some number of validation instances to look for hallucinations, bad logic, poor understanding, whatever. What you get back on the first pass will look like a flip of the coin -- one agent claims it is hallucination, the other agent says it is correct; both give reasons. But feed those reasons into follow-up verifier prompts, and repeat. You will find that non-hallucination responses tend to persist, while hallucinations are weeded out. The stable point is the truth.
This works. I have workflows that make use of this, so I can attest to its effectiveness. The new-ish Claude Code sub-agent capabilities and slash commands are excellent for doing this, btw.
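For what it's worth, here is a minimal TypeScript sketch of that verify-in-fresh-context loop. The Complete callback, prompt wording, vote threshold, and round count are all illustrative assumptions, not the commenter's actual workflow; swap in whatever LLM client you use.

```typescript
type Complete = (prompt: string) => Promise<string>;

// Generate-then-verify: ask several verifier instances with fresh context whether an answer
// holds up, feed their stated reasons into a follow-up round, and keep the majority verdict.
async function verify(
  complete: Complete,   // hypothetical LLM call, e.g. a thin wrapper over your provider's SDK
  question: string,
  answer: string,
  rounds = 2,
  voters = 5,
): Promise<boolean> {
  let critiques: string[] = [];
  let yesVotes = 0;
  for (let round = 0; round < rounds; round++) {
    const prompt =
      `Question: ${question}\nProposed answer: ${answer}\n` +
      (critiques.length ? `Critiques from a previous pass:\n${critiques.join("\n---\n")}\n` : "") +
      `Is the proposed answer correct? Reply "VERDICT: yes" or "VERDICT: no", then explain why.`;
    // Each call is a fresh context: verifiers share nothing with the producer or with each other.
    const replies = await Promise.all(Array.from({ length: voters }, () => complete(prompt)));
    critiques = replies;
    yesVotes = replies.filter(r => /VERDICT:\s*yes/i.test(r)).length;
  }
  // Well-supported answers tend to keep majority support across rounds; hallucinations lose it.
  return yesVotes > voters / 2;
}
```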
Is it possible that this is just the majority, and there are plenty of folks who dislike actually starting from nothing and the endless iteration to make something that works, as opposed to having some sort of good/bad baseline to just improve upon?
I’ve seen plenty of people that are okay with picking up a codebase someone else wrote and working with the patterns and architecture in there BUT when it comes to them either needing to create new mechanisms in it or create an entirely new project/repo it’s like they hit a wall - part of it probably being friction, part not being familiar with it, as well as other reasons.
> Why did we create tools that do the fun part and increase the non-fun part? Where are the "code-review" agents at?
Presumably because that’s where the most perceived productivity gain is in. As for code review, there’s CodeRabbit, I think GitLab has their thing (Duo) and more options are popping up. Conceptually, there’s nothing preventing you from feeding a Git diff into RooCode and letting it review stuff, alongside reading whatever surrounding files it needs.
For me, it's exactly the opposite:
I love to build things from "nothing" (if I had the possibility, I would even like to write my own kernel that is written in a novel programming language developed by me :-) ).
On the other hand, when I pick up someone else's codebase, I nearly always (if it was not written by some insanely smart programmer) immediately find it badly written. In nearly all cases I tend to be right in my judgements (my boss agrees), but I am very sensitive to bad code, and often ask myself how the programmer who wrote the original code has not yet committed seppuku, considering how much of a shame the code is.
Thus: you can in my opinion only enjoy picking up a codebase someone else wrote if you are incredibly tolerant of bad code.
At least for me, what gives the most satisfaction (even though this kind of satisfaction happens very rarely) is discovering some very elegant structure behind whatever has to be implemented that changes the whole way you thought about programming (or often even about life) for decades.
A number of years ago, I wrote a caching/lookup library that is probably some of the favorite code I've ever created.
After the initial configuration, the use was elegant and there was really no reason not to use it if you needed to query anything that could be cached on the server side. Super easy to wrap just about any code with it as long as the response is serializable.
CachingCore.Instance.Get(key, cacheDuration, () => { /* expensive lookup code here */ });
Under the hood, it would check the preferred caching solution (e.g., Redis/Memcache/etc), followed by less preferred options if the preferred wasn't available, followed by the expensive lookup if it wasn't found anywhere. Defaulted to in-memory if nothing else was available.
If the data was returned from cache, it would then compare the expiration to the specified duration... If it was getting close to various configurable tolerances, it would start a new lookup in the background and update the cache (some of our lookups could take several minutes*, others just a handful of seconds).
The hardest part was making sure that we didn't cause a thundering herd type problem with looking up stuff multiple times... in-memory cache flags indicating lookups in progress so we could hold up other requests if it failed through and then let them know once it's available. While not the absolute worst case scenario, you might end up making the expensive lookups once from each of the servers that use it if the shared cache isn't available.
* most of these have a separate service running on a schedule to pre-cache the data, but things have a backup with this method.
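Not the original library, but a rough TypeScript sketch of the shape described above: stale-while-revalidate over an ordered list of cache stores, plus an in-flight map as the thundering-herd guard. The CacheStore interface and refreshThreshold knob are assumptions for illustration.

```typescript
interface CacheStore {
  get(key: string): Promise<{ value: unknown; expiresAt: number } | null>;
  set(key: string, value: unknown, ttlMs: number): Promise<void>;
}

class CachingCore {
  // In-flight lookups, so concurrent callers share one expensive fetch (thundering-herd guard).
  private inFlight = new Map<string, Promise<unknown>>();

  constructor(
    private stores: CacheStore[],        // preferred first: e.g. Redis, then in-memory fallback
    private refreshThreshold = 0.2,      // refresh in background when <20% of the TTL remains
  ) {}

  async get<T>(key: string, ttlMs: number, lookup: () => Promise<T>): Promise<T> {
    for (const store of this.stores) {
      const hit = await store.get(key).catch(() => null);   // skip unavailable stores
      if (hit) {
        // Close to expiry? Kick off a background refresh but return the cached value now.
        if (hit.expiresAt - Date.now() < ttlMs * this.refreshThreshold) {
          void this.refresh(key, ttlMs, lookup);
        }
        return hit.value as T;
      }
    }
    return this.refresh(key, ttlMs, lookup);   // miss everywhere: do the expensive lookup once
  }

  private refresh<T>(key: string, ttlMs: number, lookup: () => Promise<T>): Promise<T> {
    const pending = this.inFlight.get(key);
    if (pending) return pending as Promise<T>;
    const p = lookup()
      .then(async value => {
        await Promise.all(this.stores.map(s => s.set(key, value, ttlMs).catch(() => {})));
        return value;
      })
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

As in the comment above, the in-flight map is per process, so with an unavailable shared cache you'd still pay the expensive lookup once per server rather than once per request.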
When I use an LLM to code I feel like I can go from idea to something I can work with in much less time than I would have normally.
Our codebase is more type-safe, better documented, and it's much easier to refactor messy code into the intended architecture.
Maybe I just have lower expectations of what these things can do but I don't expect it to problem solve. I expect it to be decent at gathering relevant context for me, at taking existing patterns and re-applying them to a different situation, and at letting me talk shit to it while I figure out what actually needs to be done.
I especially expect it to allow me to be lazy and not have to manually type out all of that code across different files when it can just generate them in a few seconds and I can review each change as it happens.
I'm using it to get faster at building my own understanding of the problem, what needs to get done, and then just executing the rote steps I've already figured out.
Sometimes I get lucky and the feature is well defined enough just from the context gathering step that the implementation is literally just me hitting the enter key as I read the edits it wants to make.
Sometimes I have to interrupt it and guide it a bit more as it works.
Sometimes I realize I misunderstood something as it's thinking about what it needs to do.
One-shotting or asking the LLM to think for you is the worst way to use them.
if the act of writing code is something you consider a burden rather than a joy then my friend you are in the wrong profession
I care deeply about the code quality that goes into the projects I work on because I end up having to maintain it, review it, or fix it when it goes south, and honestly it just feels wrong to me to see bad code.
But literally typing out the characters that make up the code? I couldn't care less. I've done that already. I can do it in my sleep; there's no challenge.
At this stage in my career I'm looking for ways to take the experience I have and upskill my teams using it.
I'd be crazy not to try and leverage LLMs as much as possible. That includes spending the time to write good CLAUDE.md files, set up custom agents that work with our codebase and patterns, it also includes taking the time to explain the why behind those choices to the team so they understand them, calling out bad PRs that "work" but are AI slop and teaching them how to get better results out of these things.
Idk man the profession is pretty big and creating software is still just as fun as when I was doing it character by character in notepad. I just don't care to type more than I need to when I can focus on problem solving and building.
The creativity in implementing (e.g. an indexed array that, when it grows too large, gets reformatted into a less performant hashmap) is what I imagine being lost, and what brings people satisfaction. Pulling that off in a clean and not overly complex way... well, there is a certain reward in that. I don't have any long-term proof, but I also hypothesize it helps with maintainability.
But I also see your point, sometimes I need a tool that does a function and I don't care to write it and giving the agent requirements and having it implemented is enough. But typically these tools are used and discarded.
The way I see it these tools allow me to use my actual brainpower mostly on those problems. Because all the rote work can now be workably augmented away, I can choose which problems to actually focus on "by hand" as it were. I'd never give those problems to an LLM to solve. I might however ask it to search the web for papers or articles or what have you that have solved similar problems and go from there.
If someone is giving that up then I'd question why they're doing that.. No one is forcing them to.
It's the problem solving itself that is fun, the "layer" that it's in doesn't really make a difference to me.
An LLM can do it in two minutes while I fetch coffee, then I can proceed to add the complex bits (if there are any)
I don't disagree with you, but if "adding one more CRUD endpoint" and similar rote tasks represent any significant amount of your engineering hours, especially in the context of business impact, then something is fundamentally broken in your team, engineering org, or company overall
time spent typing code into an editor is usually, hopefully!, approximately statistically 0% of overall engineering time
I therefore think it makes the most sense to just feed it requirements and issues, and telling it to provide a solution.
Also, unless you're starting a new project or a big feature with a lot of boilerplate, in my experience it's almost never necessary to make a lot of files with a lot of text in them at once.
Not me. I enjoy figuring out the requirements, the high-level design, and the clever approach that will yield high performance, or reuse of existing libraries, or whatever it is that will make it an elegant solution.
Once I've figured all that out, the actual process of writing code is a total slog. Tracking variables, remembering syntax, trying to think through every edge case, avoiding off-by-one errors. I've gone from being an architect (fun) to slapping bricks together with mortar (boring).
I'm infinitely happier if all that can be done for me, everything is broken out into testable units, the code looks plausibly correct, and the unit tests for each function cover all cases and are demonstrably correct.
Then after going back and forth between thinking about it and trying to build it a few times, after a while you discover the real solution.
Or at least that's how it's worked for me for a few decades, everyone might be different.
That's why you have short functions, so you don't have to track that many variables. And use symbol completion (a standard in many editors).
> trying to think through every edge case, avoiding off-by-one errors.
That is designing, not coding. Sometimes I think of an edge case, but I'm already on a task that I'd like to finish, so I just add a TODO comment. Then, at least before I submit the PR, I ripgrep the project for this keyword and others.
Sometimes the best design is done by doing. The tradeoffs become clearer when you have to actually code the solution (too much abstraction, too verbose, unwieldy,...) instead of relying on your mind (everything seems simpler)
And no, off-by-one errors and edge cases are firmly part of coding, once you're writing code inside of a function. Edge cases are not "todos", they're correctly handling all possible states.
> Sometimes the best design is done by doing.
I mean, sure go ahead and prototype, rewrite, etc. That doesn't change anything. You can have the AI do that for you too, and then you can re-evaluate and re-design. The point is, I want to be doing that evaluation and re-designing. Not typing all the code and keeping track of loop states and variable conditions and index variables and exit conditions. That stuff is boring as hell, and I've written more than enough to last a lifetime already.
Aka the scope. And the namespace of whatever you want to access. Which is a design problem.
> And it's not about symbol completion, it's about remembering all the obscure differences in built-in function names and which does what
That's what references are for. And some IDEs bring it right alongside the editor. If not, you have online and offline references. You remember them through usage and semantics.
> And no, off-by-one errors and edge cases are firmly part of coding, once you're writing code inside of a function.
It's not. You define the happy path and error cases as part of the specs. But specs are generally lacking in precision (full of ambiguities) and only care about the essential complexity. The accidental complexity comes as part of the platform and is also part of the design. Treating those kinds of errors as part of coding is shortsightedness.
> Not typing all the code and keeping track of loop states and variable conditions and index variables and exit conditions. That stuff is boring as hell, and I've written more than enough to last a lifetime already
That is like saying "Not typing all the text and keeping track of words and punctuation and paragraphs and signatures. English is boring as hell and I've written more than enough..."
If you don't like formality, say so. I've never had anyone describe coding as you did. No one thinks about that stuff so closely. It's like a guitar player complaining about which strings to strike with a finger, or a race driver complaining about the angle of the steering wheel and having to press the brake.
The simple fact is that I find there's very little creative satisfaction to be found in writing most functions. Once you've done it 10,000 times, it's not exactly fun anymore, I mean unless you're working on some cutting-edge algorithm which is not what we're doing 99.9% of the time.
The creative part becomes in the higher level of design, where it's no longer rote. This is the whole reason why people move up into architecture roles, designing systems and libraries and API's instead of writing lines of code.
The analogies with guitar players or race car drivers or writers are flawed, because nothing they do is rote. Every note matters, every turn, every phrase. They're about creativity and/or split-second decision making.
But when you're writing code, that's just not the case. For anything that's a 10- or 20- line function, there isn't usually much creativity there, 99.99% of the time. You're just translating an idea into code in a straightforward way.
So when you say, "Developers like _writing_ and that gives the most job satisfaction." That's just not true. Especially not for many experienced devs. Developers like thinking, in my experience. They like designing, the creative part. Not the writing part. The writing is just the means to the end.
Senior developers love removing code.
Code review is probably my favorite part of the job, when there isn’t a deadline bearing down on me for my own tasks.
So I don’t really agree with your framing. Code reviews are very fun.
Partially sarcastic but I do personally use LLMs to guide my communication in very limited cases:
1. It's purely business related, and
2. I'm feeling too emotionally invested (or more likely, royally pissed off) and don't trust myself to write in a professional manner, and
3. I genuinely want the message to sound cold, corporate, and unemotional
Number 3 would fit you here. These people are not being respectful to you in presenting code for review that respects your time. Why should you take the time to write back personally?
It should be noted that this accounts for maybe 5% of my business communications, and I'm careful not to let that number grow.
Because it's 3 sentences, if you want to be way more polite and verbose than necessary.
"I will close PRs if they appear to be largely LLM-generated. I am always happy to review something with care and attention if it shows the same qualities. Thanks!"
The idea is to get your coworkers to stop sending you AI slop, send them AI slop in retaliation?
And then what if the person denies it?
They're either lying about using AI, or they're incompetent enough to produce AI-quality (read: garbage) code; either way, the company should let them go
If it's only happened a few times you might first try setting some ground rules for contributions. Really common for innersource repos to have a CONTRIBUTING.md file or similar. Add a checkbox to your PR template that the dev has to check to indicate they've read it, then wait and see.
Imagine that your boss came to you, the tech lead of a small team, and said “okay, instead of having five competent people, your team will now have 25 complete idiots. We expect that their random flailing will sometimes produce stuff that kinda works, and it will be your job to review it all.” Now, you would, of course, think that your boss had gone crazy. No-one would expect this to produce good results. But somehow, stick ‘AI’ on this scenario, and a lot of people start to think “hey, maybe that could work.”
If the engineer doing the implementation is top-shelf, you can get very good results from a "flawed" process (in quotes, because it's not actually "bad." It's just a process that depends on the engineer being that particular one).
Silicon Valley is obsessed with process over people, manifesting “magical thinking” that a “perfect” process eliminates the need for good people.
I have found the truth to be in-between. I worked for a company that had overwhelming Process, but that process depended on good people, so it hired top graduates, and invested huge amounts of money and time into training and retention.
It's the content that matters, not the process.
The first is phenomenal until someone makes a mistake and brings in a manager or supervisor from the C category that talks the talk but doesn't walk the walk.
If you accidentally end up in one that turns out to be the latter, it's maddening trying to get anything accomplished if the task involves anyone else.
Hire slow, fire fast.
This idea is called "evolution"...
> as long as you have good quality control
...and its QA is death at every single level of the system: cell, organism, species, and ecosystem. You must consider that those devs or companies with not-good-enough QA will end up dead (from a business perspective).
Sounds like a stupid path forward to me
Like, evolution is not _good_ at ‘designing’ things.
My human co-workers generally have good faith. Even the developer who was clearly on the verge of getting a role elsewhere without his heart in it-- he tried to solve the problems assigned to him, not some random delusion that the words happened to echo. I don't have that level of trust with AI.
If there's a misunderstanding of the problem or the context, it's probably still the product of a recognizable logic flow that you can use to discuss what went wrong. I can ask Claude "Why are you converting this amount from Serbian Dinars to Poppyseed Bagels in line 476?" but will its answer be meaningful?
Human code review often involves a bit of a shared background. We've been working with the same codebases for several years, so we're going to use existing conventions. In this situation, the "AI knows all and sees all" becomes an anti-feature-- it may optimize for "this is how most people solve this task from a blank slate" rather than "it's less of a cognitive burden for the overall process if your single change is consistent with 500 other similar structures which have been in place since the Clinton administration."
There may be ways to try to force-feed AI this behaviour, but the more effort you devote to priming and pre-configuring the machine, the less you're actually saving over doing the actual work in the first place.
I'm not sure how this translates to programming, code review is too expensive, but for short code you can try https://en.wikipedia.org/wiki/Superoptimization
Well you've just described an EKF on a noisy sensor.
I mean the average person surely knows at least what a Jacobian is, right? Right? /s
Quality needs to come from the process, not the people.
Choosing to use a process known to be flawed, then hoping that people will catch the mistakes, doesn't seem like a great idea if the goal is quality.
The trouble is that LLMs can be used in many ways, but only some of those ways play to their strengths. Management have fantasies of using AI for everything, having either failed to understand what it is good for, or failed to learn the lessons of Japan/Deming.
You're also describing the software development process prior to LLMs. Otherwise code reviews wouldn't exist.
With humans and code reviews, two humans looked at it. With an LLM and a code review of the LLM's output, only one human looked at it, so it's not the same. LLMs are still far from as reliable as humans, or you could just tell the LLM to do the code reviews and it would build the entire complex product itself.
0. The dawn of video games had many titles with one person responsible for programming. This remains the case for many indie games and small software apps and services. It's a skill that requires expertise and/or dedication.
Use of AI seems to be a regression in this regard, at least as currently used ("look ma, no hands! I've just vibe coded an autopilot"). The current focus seems to be on productivity (how many more lines of code or vibe-coded projects can you churn out), maybe because AI is still basically a novelty that people are still learning how to use.
If AI is to be used productively towards achieving business goals then the focus is going to need to mature and change to things like quality, safety, etc.
Not sure which Japanese school of management you're following, but I think Toyota-style goes against that. The process gives more autonomy to workers than, say, Ford-style, where each tiny part of the process is pre-defined.
I got the impression that Toyota-style was considered to bring better quality to the product, even though it gives people more autonomy.
It's a bit like Warren Buffet saying he only wants to invest in companies that could be run by an idiot, because one day they will be.
Edward Deming actually worked with both Toyota and Ford, perhaps more foundationally at Toyota, bringing his process-based-quality ideas to both. Toyota's management style is based around continuous process improvement, combined with the employee empowerment that you refer to.
Deming’s process was about how to operate a business in a capital-intensive industry when you don’t have a lot of capital (with market-acceptable speed and quality). That you could continue to push it and raise quality as you increased the amount of capital you had was a side-effect, and the various Japanese automakers demonstrated widely different commitments to it.
And I’m sure you know that he started formulating his ideas during the Great Depression and refined them while working on defense manufacturing in the US during WWII.
Third option: they want to automate all jobs before the competition does. Think of it as AWS, but for labor.
It's the same with any "general" tech. I've seen it since genetic algorithms were all the rage. Everyone reaches for the most general tool, then assumes everything that tool might be used for is now a problem or domain they are an expert in, with zero context in that domain. AI is this times 100x, plus one layer more meta, as you can optimize over approaches with zero context.
Normally, if you want to achieve some goal, there is a whole pile of tasks you need to be able to complete to achieve it. If you don't have the ability to complete any one of those tasks, you will be unable to complete the goal, even if you're easily able to accomplish all the other tasks involved.
AI raises your capability floor. It isn't very effective at letting you accomplish things that are meaningfully outside your capability/comprehension, but if there are straightforward knowledge/process blockers that don't involve deeper intuition it smooths those right out.
Nothing has changed. Few projects start with you knowing all the answers. In the same way AI can help you learn, you can learn from books, colleagues, and trial and error for tasks you do not know.
Before AI, if I had the knowledge/skill to do something on the large scale, but there were a bunch of minute/mundane details I had to figure out before solving the hard problems, I'd just lose steam from the boredom of it and go do something else. Now I delegate that stuff to AI. It isn't that I couldn't have learned how to do it, it's that I wouldn't have because it wouldn't be rewarding enough.
You're probably envisioning a more responsible use of it (floor raising, "meaningfully inside your comprehension"); that is actually not what I'm referring to at all ("assumes everything that tool might be used for is now a problem or domain they are an expert in"). A meta tech can be used in many ways, and yours is close to what I believe the right method is. But I'm asserting that the danger is massive over-reliance and overconfidence in the "transferability".
Supermarket vegetables.
Quite a bit of it, like tomatoes and strawberries, is just crap. Form over substance. Nice color and zero flavor. Selected for delivery, shelf life, and appearance rather than actually being any good.
From an economics POV, that's the correct test.
I was also considering the way the US food standards allows a lot of insect parts in the products, but wasn't sure how to phrase it.
Maybe we could stop filtering everything through this bullshit economics race to the bottom then
A race to the bottom leaves you like Boeing or Intel.
Late stage capitalism is not a must.
Your list of winners are optimising for what the market cares about, exactly like the supermarkets (who are also mostly winners) are optimising for what the market cares about. For most people, for food specifically, that means "cheap". Unavoidably, because most people have less money than they'd like. Fancy food is rare treat for many.
Apple software currently has a reputation for buggy UI; Oracle has a reputation for being litigious; that just leaves Nvidia, who are printing money selling shovels in two successive gold rushes, which is fine for a business and means my investment is way up, but also means that for high-end graphics cards consumer prices are WTF and availability is LOL.
See, the only way is not a race to the bottom, like all the late-stage capitalists claim.
What is your point?
I don't know how the US compares to other countries in terms of "insects per pound" standards, but having some level of insects is going to be inevitable.
For example, how could you guarantee that your wheat, pre-milling, has zero insects in it, or that your honey has no bee parts in it (best you can do is strain it, then anything that gets through the straining process will be on your toast).
2. Quality control is key to good processes as well. Code review is literally a best practice in the software industry. Especially in BigTech and high-performing organizations. That is, even for humans, including those that could be considered the cream of the industry, code review is a standard step of the delivery process.
3. People have posted their GitHub profiles and projects (including on this very forum) to show how AI is working out for them. Browse through some of them and see how much "endless broken nonsense" you find. And if that seems unscientific, well go back to point 1.
So right off the bat, I don't trust you. Anyway, I picked one study from the search to give you the benefit of the doubt. It compared leetcode in the browser to LLM generation. This tells us absolutely nothing about real world development.
What made the METR paper interesting was that they studied real projects, in the real world. We all know LLMs can solve well bounded problems in their data sets.
As for 3 I've seen a lot of broken nonsense. Let me know when someone vibe codes up a new mobile operating system or a competitor to KDE and Gnome lol
Alternatively, a search is a way to show that basing your opinions on limited personal experience or a single source is silly given the vast amount of other research out there that largely contradicts it. Worse if that single source itself happens to have flaws that are not sufficiently discussed, e.g. at least one of the 16 participants in the METR study deliberately filtered out large tasks that he strongly preferred to do only with AI -- what does that mean for its results?
https://xcancel.com/ruben_bloom/status/1943536052037390531
> Many of the studies in that search don't have anything to do with programming at all.
That's fair, but unfortunately due to the limits of keyword search. For instance "medical coding" is not programming-related at all, but is being impacted by LLMs and gets caught in the keyword search ¯\_ (ツ)_/¯
Anyway if your preference is "real-world projects", here are a couple specific studies (including one that I had already separately mentioned in the linked comment) at much larger scales that show significant productivity boosts of LLM-assisted programming at doing their regular, day-job tasks:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566 (4867 developers across 3 large companies including Microsoft)
https://www.bis.org/publ/work1208.pdf (1219 programmers at a Chinese BigTech)
There are many more, but I left them out as they are based on other methodologies such as the use of standardized tasks for better comparability, or empirical analysis of open-source commits, or developer surveys, or student projects, which may get dismissed as "not an RCT on real-world tasks." Interestingly they all show comparable, positive results, so consider that it's not as straightforward to dismiss other studies as being irrelevant.
Also, a fab is intended to make fully functioning chips, which sometimes it fails to achieve. An LLM is NOT designed to give correct output, just plausible output. Again, not the same goal.
LLMs are an implementation of bogosort. But less efficient.
This is exactly the point of corporate Agile. Management believes that the locus of competence in an organization should reside within management. Depending on competent programmers is thus a risk, and what is sought is a process that can simulate a highly competent programmer's output with a gang of mediocre programmers. Kinda like the myth that you can build one good speaker out of many crappy ones, or the principle of RAID which is to use many cheap, failure-prone drives to provide the reliability guarantees of one expensive, reliable drive (which also kinda doesn't work if the drives came from the same lot and are prone to fail at about the same time). Every team could use some sort of process, but usually if you want to retain good people, this takes the form of "disciplines regarding branching, merging, code review/approval, testing, CI, etc." Something as stifling as Scrum risks scaring your good people away, or driving them nuts.
So yes, people do expect it to work, all the time. And with AI in the mix, it now gains very nice "labor is more fungible with capital" properties. We're going to see some very nice, spectacular failures in the next few years as a result, a veritable Perseid meteor shower of critical systems going boom; and those companies that wish to remain going concerns will call in human programmers to clean up the mess (but probably lowball on pay and/or try to get away with outsourcing to places with dirt-cheap COL). But it'll still be a rough few years for us while management in many orgs gets high off their own farts.
I haven't had to review too much AI code yet, but from what I've seen it tends to be the kind of code review that really requires you to think hard and so seems likely to lead to mistakes even with decent code reviewers. (I wouldn't say that I'm a brilliant code reviewer, but I have been doing open source maintenance full-time for around a decade at this point so I would say I have some experience with code reviews.)
I predict many disastrous "AI" failures because the designers somehow believed that "some humans capable of constant vigilant attention to detail" was an easy thing they could have.
I have been messing around with getting AI to implement novel (to me) data structures from papers. They're not rocket science or anything, but there's a lot of detail. Often I do not understand the complex edge cases in the algorithms myself, so I can't even "review my way out of it". I'm also working in Go, which is usually not a very good fit for implementing these things because it doesn't have sum types; the lack of sum types often adds so much interface{} bloat it would render the data structure pointless. I am working around it with codegen for now.
What I've had to do is demote "human review" a bit; it's a critical control, but it's expensive. Instead, I think more holistically about which "guard rails" to put where and what the acceptance criteria should be. This means that when I'm reviewing the code, I am reasonably confident it's functionally correct, leaving me to focus on whether I like how that is being achieved. This won't work for every domain, but where it's possible to automate controls, it feels like this is the way to go.
The "principled" way to do this would be to use provers etc, but being more of an engineer I have resorted to ruthless guard rails. Bench tests that automatically fail if the runtime doesn't meet requirements (e.g. is O(n) instead of O(log n)) or overall memory efficiency is too low - and enforcing 100% code coverage from both unit tests AND fuzzing. Sometimes the cli agent is running for hours chasing indexes or weird bugs; the two main tasks are preventing it from giving up, and stopping it from "punting" (wait, this isn't working, let me first create a 100% correct O(n) version...) or cheating. Also reminding it to check AGAIN for slice sharing bugs which crop up a surprising % of the time.
The other "interesting" part of my workflow right now is that I have to manually shuffle a lot between "deep research" (which goes and reads all the papers and blogs about the data structure) and the cli agent which finds the practical bugs etc but often doesn't have the "firepower" to recognise when it's stuck in a local maximum or going around in circles. Have been thinking about an MCP that lets the cli agent call out to "deep research" when it gets really stuck.
Even if the process weren’t technically bad, it would still be shit. Doing code review with a human has meaning in that the human will probably learn something, and it’s an investment in the future. Baby-sitting an LLM, however, is utterly meaningless.
Or more broadly, the existence of complex or any life.
Sure, it's not the way I would pick to do most things, but when your buzzword magical thinking runs so deep that all you have is a hammer, even if it doesn't look like a nail you will force your wage slaves to hammer it anyway until it works.
As to your other cases: injection-molded plastic parts for things like the spinning T-bar spray arm in some dishwashers. Crap molds, passed to low-wage or temp workers to fix up by hand with a razor blade and box up. I personally worked such a temp job before, among others, so yes, that kind of manual QC and fix-up of bad output still abounds.
And if we are talking high failure rates... see also chip binning and foundry yields in semiconductors.
Just have to look around to see the dubious seeming is more the norm.
In a strange kind of analogy, flowing water can cause a lot of damage.. but a dam built to the right specification and turbines can harness that for something very useful… the art is to learn how to build that dam
The pattern I see over and over is a team aimlessly plodding along through tickets in sprints until an engineer who knows how to solve the problem gets it on track personally.
For the core parts you cannot let go of the reins. You have to keep steering it. You have to take short breaks and reload the code into the agent as it starts acting confused. But once you get the hang of it, things that would take you months of convincing yourself and picking yourself back up to continue becomes a day's work.
Once you have a decent amount of work done, you can have the agent read your code as documentation and use it to develop further.
The way team leads tend to get used is that people who are good at code get a little more productive as more people are told to report to them. What is happening now is that senior-level engineers all automatically get the same option: a team of 1-2 mid-level engineers on the cheap thanks to AI, which is entirely manageable. And anyone less capable gets a small team, a rubber duck, or a mentor, depending on where they fall vs LLM use.
Of course, the real question is what will happen as the AIs get into the territory traditionally associated with 130+ IQ ranges and the engineers start to sort out how to give them a bit more object persistence.
- Rewrite it yourself?
- Tell the AI to generate it again? That will lead to worse code than the first attempt.
- Write the long prompt (like 6 pages) even longer and hope it works this time?
I have only had real advantages with AI for helping me plan changes, and for it helping me to review my code. Getting it to write code for me has been somewhat helpful, but only for simple tedious changes or first drafts. But it is definitely not something I want to leverage by getting AI to produce more and more code that I then have to filter through and review. No thank you. I feel like this is really the wrong focus for implementing AI into your workflows.
Everyone I know trying to use AI in large codebases has had similar experiences. AI is not good enough at following the rules of your codebase yet (i.e., following structure, code style, library usage, re-using code, refactoring, etc...). This makes it far less useful for writing code changes and additions. It can still be useful for small changes, or for writing first drafts of functions/classes/interfaces, but for more meaningful changes it often fails.
That is why I believe that right now, if you want to maintain a large codebase, and maintain a high bar for quality, AI tools are just not good enough at writing most code for you yet. The solution to this is not to get AI to write even more code for you to review and throw out and iterate upon in a frustrating cycle. Instead, I believe it is to notice where AI is helpful and focus on those use-cases, and avoid it when it is not.
That said, AI labs seem to be focusing a lot of effort on improving AI for coding right now, so I expect a lot of progress will be made on these issues in the next few years.
1. Give it requirements
2. Tell it to ask me clarifying questions
3. When no more questions, ask it to explain the requirements back to me in a formal PRD
4. I criticize it
5. Tell it to come up with 2 alternative high level designs
6. I pick one and criticize it
7. Tell it to come up with 2 alternative detailed TODO lists
8. I pick one and criticize it
9. Tell it to come up with 2 alternative implementations of one of the TODOs
10. I pick one and criticize it
11. Back to 9
I usually “snapshot” outputs along the way and return to them to reduce useless context.
This is what produces the most decent results for me, which aren’t spectacular but at the very least can be a baseline for my own implementation.
It’s very time consuming and 80% of the time I end up wondering if it would’ve been quicker to just do it all by myself right from the start.
It's still time-consuming, and it probably would be faster for me to do it myself, but I can't be bothered manually writing lines of code any more. I maybe should switch to writing code with the LLM function by function, though.
Maybe you should consider a change of career :/
So in other words, if you are good at code review you are also good enough at writing code that you will be better off writing it yourself for projects you will be responsible for maintaining long term. This is true for almost all of them if you work at a sane place or actually care about your personal projects. Writing code for you is not a chore and you can write it as fluently and quickly as anything else.
Your time "using AI" is much better spent filling in the blanks when you're unfamiliar with a certain tool or need to discover a new one. In short, you just need a few google searches a day... just like it ever was.
I will admit that modern LLMs have made life easier here. AI summaries on search engines have indeed improved to the point where I almost always get my answer and I no longer get hung up meat-parsing poorly written docs or get nerd-sniped pondering irrelevant information.
37 more comments available on Hacker News