Your Job Is to Deliver Code You Have Proven to Work
Key topics
The debate rages on: is delivering code that "works" the ultimate job requirement, or is it just one aspect of solving customer problems? Commenters weigh in, with some arguing that automated testing is enough to prove code works, while others insist that manual testing is essential to catch unexpected issues. The author, simonw, clarifies that manual testing doesn't have to be the first step, but it's crucial to get done, citing the value of "seeing something" that automated tests might miss. As the discussion unfolds, a consensus emerges that a combination of automated and manual testing is the way to go.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussionFirst comment
10m
Peak period
143
0-12h
Avg / period
32
Based on 160 loaded comments
Key moments
- 01Story posted
Dec 18, 2025 at 9:52 AM EST
15 days ago
Step 01 - 02First comment
Dec 18, 2025 at 10:02 AM EST
10m after posting
Step 02 - 03Peak activity
143 comments in 0-12h
Hottest window of the conversation
Step 03 - 04Latest activity
Dec 26, 2025 at 10:26 AM EST
7 days ago
Step 04
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
Your job is to solve customer problems. Their problems may only be solvable with code that is proven to work, but it is equally likely (I dare say even more likely) that their problem isn't best solved with code at all, or even solved with code that doesn't work properly but works well enough.
From the post and the example he links, the point is that if you don't at least look at the running code, you don't know that it works.
In my opinion the point is actually well illustrated by Chris's talk here:
https://v5.chriskrycho.com/elsewhere/seeing-like-a-programme...
(summary of the relevant section if you're not going to click)
>>>
In the talk "Seeing Like a Programmer," Chris Krycho quotes the conductor and composer Eímear Noone, who said:
> "The score is potential energy. It's the potential for music to happen, but it's not the music."
He uses this quote to illustrate the distinction between "software as artifact" (the code/score) and "software as system" (the running application/music). His point is that the code itself is just a static artifact—"potential energy"—and the actual "software" only really exists when that code is executed and running in the real world.
Your tests run the code. You know it works. I know the article is trying to say that testing is not comprehensive enough, but my experience disagrees. But I also recognize that testing is not well understood — and if you don't understand it well, you can get caught not testing the right things or not testing what you think you are. I would argue that you would be better off using that time to learn how to write great tests instead of using it to manually test your code, but to each their own.
I've seen people only run tests and break things (because the thing they broke wasn't covered by tests), I've seen people try to fix things and not verify that their fix works, etc
Good tests are sufficient in many cases to be confident that your code still works. But in general tests don't cover a lot of fundamental behavior, and if you don't exercise that fundamental behavior in one way or another, then you don't know that your code works
Outside in testing is great but I typically do automated outside in testing and only manual at the end. The loop process of testing needs to be repeatable and fast, manual is too slow
I've lost count of the number of times I've skipped it because the automated test passed and then found there was some dumb but obvious bug that I missed, instantly exposed when I actually exercised the feature myself.
There's a lot of pedantry here trying to argue that there exists some feature which doesn't need to be "manually" tested, and I think the definition of "manual" can be pushed around a lot. Is running a program that prints "OK" a manual test or not? Is running the program and seeing that it now outputs "grue" rather than "bleen" manual? Does verifying the arithmetic against an Excel spreadsheet count?
There are programs that almost can't be manual, and programs that almost have to be manual. I remember when working on PIN pad integration we looked into getting a robot to push the buttons on the pad - for security reasons there's no way of injecting input automatically.
What really matters is getting as close to a realistic end user scenario as possible.
[1] As far as I can tell. If there are good solutions for this too, I'd love to learn.
Unit testing, whether manual or automated, typically catches about 30% of bugs.
End to end testing and visual inspection of code are both closer to 70% of bugs.
Of course that is not a panacea. What can happen in the real world is not truly understanding what the software needs to do. This can result in tests not being aligned with what the software actually needs. It is quite reasonable to call the outcome of that "bugs", but tests cannot catch that either. The tests are where the problem lies!
Most aspects of software a pretty clear cut, though. You can reasonably define a full contract. UX is a particular area where I've struggled to find a way to determine what the software needs before seeing it, though. There is seemingly no objective measure that can be applied in determining if a UX is going to work in order to encode that in a contract. Of course, as before, I'm quite interested to learn how others are solving that problem.
I vibe code a lot of stuff for myself, mostly for viewing data, when I don’t really need to care how it works. I’m coming around to the idea that outside of some specific circumstances where everyone has agreed they don’t need to care about or understand the code, team vibe coding is a bad practice.
If I’m paying an engineer, it’s for their work, unless explicitly agreed otherwise.
I think vibe coding is soon going to be seen the same way as “research” where you engage an offshore team (common e.g. in consulting) to give you a rundown on some topic and get back the first five google search results. Everyone knows how to do that, if it’s what they wanted they wouldn’t be hiring someone to do it.
The second time it happens they gotta go.
I would find the expectation that I need to attach a screenshot insulting. And the understanding that my peers test their code to produce a screenshot would be pretty demoralizing.
Is anyone else seeing this in their orgs? I'm not...
[0] https://en.wikipedia.org/wiki/Ward_Cunningham#Law
You could intuitively think it's just a difference of degree, but it's more akin to a difference of kind. Same for a nuke vs a spear, both are weapons, no one argues they're similar enough that we can treat them the same way
At the end of the day we're not performing war by poking other people with long sticks and we're not getting the word out by sending out a carrier pigeon.
Methods and medium matters.
LLMs can't do this.
Your code is unambiguously better than any LLM code if you can comment a link to the stackoverflow post you copied it from.
This is not a truism. "My" code might come from an LLM and that's fine if I can be reasonably confident it works. I might try to gain that confidence by testing the code and reading it to understand what it's doing. It is also true of blog post code, regardless of how I refer to the code; if I link to the blog post, it's because it does a better job of explaining than I ever could in code comments. Whether LLMs make one more productive is hard to measure but it seems to be missing the point to write this.
The point is, including the code is a choice and one should be mindful of it, no matter the code's origin. At that point, this comes off like you just have something to prove; there doesn't seem to be a reason not to use the LLM code if you know it works and you know why it works.
That's also true if I author the code myself; I can't go to anyone for help with it, so if it doesn't work then I have to figure out why.
> Believing you know how it works and why it works is not the same as that actually being the case.
My series of accidental successes producing working code is honestly starting to seem like real skill and experience at this point. Not sure what else you'd call it.
But it's built on top of things that are understood. If it doesn't work, then either:
• You didn't understand the problem fully, so the approach you were using is wrong.
• You didn't understand the language (library, etc) correctly, so the computer didn't grasp your meaning.
• The code you wrote isn't the code you intended to write.
This is a much more tractable situation to be in than "nobody knows what the code means, or has a mental model for how it's supposed to operate", which is the norm for a sufficiently-large LLM-produced codebase.
> My series of accidental successes
That somewhat misses the point. To write working code, you must have some understanding of the relationship between your intention and your output. LLMs have a poor-to-nonexistent understanding of this relationship, which they cover up with the ability to regurgitate (permutations of) a large corpus of examples – but this does not grant them the ability to operate outside the domain of those examples.
LLM-generated codebases very much do not lie within that domain: they lack the clues and signs of underlying understanding that human readers and (to an extent) LLMs rely on. Worse, the LLMs do replicate those signals, but they don't encode anything coherent in the signal. Unless you are very used to critically analysing LLM output, this can be highly misleading. (It reminds me of how chess grandmasters blunder, and struggle to even remember, unreachable board positions.)
Believing you know how LLM-generated code works, and why it works, is not the same as that actually being the case – in a very real sense that is different to that of code with human authors.
> Believing you know how LLM-generated code works, and why it works, is not the same as that actually being the case
This is a strawman argument which I'm not really interested to engage. You can assume competence. (In a scenario where one doesn't make these mistakes, what's left in your argument? It is a sufficiently strong claim to say these cannot be avoided such that it is reasonable to dismiss the claim unless supporting evidence is provided. In other words, the solution is as simple as not making these mistakes.) As I wrote up-thread, including the code is a choice and one should be mindful of it.
If "assume competence" means "assume that people do not make the mistakes they are observed to make", then why write tests? Wherefore bounds checking? Pilots are competent, so pre-flight checklists are a waste of time. Your doctor's competent: why seek a second opinion?
It's possible that you're just that good – that you can implement a solution "as simple as not making these mistakes" –, in which case, I'd appreciate if you could write up your method and share it with us mere mortals. But could it also be possible that you are making these mistakes, and simply haven't noticed yet? How would you know if your understanding of the program didn't match the actual program, if you've only tested the region in which the behaviours of both coincide?
Say you start at BigCo and are given access to their million line repo(s) with no docs and are given a ticket to work on. Ugh. You just barely started. But after you've been there for five years, it's obvious to you what the Pequad service does, and you might even know who gave it that name. If the claim is LLMs generate code that's simply incomprehensible by humans, the two counterexamples I have for you are TheDailyWtf.com, and Haskell.
That's not my claim. My claim is that AI-generated code is misleading to people familiar with human-written code. If you've grown up on AI-generated code, I wouldn't expect you to have this problem, much like how chess newbies don't find impossible board states much harder to process than possible ones.
So, I'm agreed on the second part too then.
If you are sufficiently motivated to appear more "productive" than your coworkers, you can force them to review thousands of lines of incorrect AI slop code while you sit back and mess around with your chatbots.
I think this is largely an issue that can be solved culturally within a team, we just unfortunately only have so much input on how other teams work. It doesn't help either when the low quality PRs are coming from Seniors on the other team and their manager doesn't seem to care about the feedback... Corporate politics are fun.
It is really difficult to evaluate but most of the good dev I have seen uses LLMs more as of a code completion improvement than anything else, so around 10-20% more productive at best, but definitely not slowing them down.
the idea that LLMs make developers more productive is delusional.
Reading code sucks, it always has. The flow state we all crave is when the code is in our working memory as an understood construct and we're just translating our mental model to a programming language. You don't get that with LLMs.
Now the power to create tons and tons of code (ie content) is in the hands of everyone and here we are complaining about it just like my wife use to complain about journalism. I think the myth of the highly regarded Software Developer perched in front of the warming glow of a screen solving and automating critical problems is coming to an end. Deservedly really, there's nothing more special about typing words into an editor than, say, framing a house.
Probs fine when you are still in the exploration phase of a startup, scary once you get to some kind of stability
LLMs made him an idiot and now he only writes a sentence by himself when he wants to say the R word
edit: the original comment said "Hang yourself Ret*rd"
Hell, for my hobby projects, I try to keep individual commits under 50-100 lines of code.
If these AIs are so smart, why the giant LOCs?
Sure, it’s cheaper today than yesterday to write out boilerplate, but programming is about eliminating boilerplate and using more powerful abstractions. It’s easy to save time doing lots of repetitive nonsense, stopping the nonsense should be the point.
Developers aren't hired to write code that's never run (at least in my opinion). We're also responsible for running the code/keeping it running.
And if it was repeated... Well I would probably get fired...
And not just from juniors
https://github.com/WireGuard/wireguard-android/pull/82 https://github.com/WireGuard/wireguard-android/pull/80
In that first one, the double pasted AI retort in the last comment is pretty wild. In both of these, look at the actual "files changed" tab for the wtf.
I’d love to hear your thoughts on LLMs, Jason. How do you use them in your projects? Do they play a role in your workflow at all?
I recently reviewed a PR that I suspect is AI generated. It added a function that doesn't appear to be called from anywhere.
It's shit because AI is absolutely not on the level of a good developer yet. So it changes the expectation. If a PR is not AI generated then there is a reasonable expectation that a vaguely competent human has actually thought about it. If it's AI generated then the expectation is that they didn't really think about it at all and are just hoping the AI got it right (which it very often doesn't). It's rude because you're essentially pawning off work that the author should have done to the reviewer.
Obviously not everyone dumps raw AI generated code straight into a PR, so I don't have any problem with using AI in general. But if I can tell that your code is AI generated (as you easily can in the cases you linked), then you've definitely done it wrong.
My eyes were wide open when 2 jobs ago, they said they would be blocking all personal web browsing from work computers. Multiple Software Devs were unhappy because they were using their work laptop for booking flights, dealing with their kids schools stuff and other personal things. They did not have personal computer at all.
Unfortunately, this person is vibe coding completely, and even the PR process is painful: * The coding agent reverts previously applied feedback * Coding agent not following standards throughout the code base * Coding agent re-inventing solutions that already exist * PR feedback is being responded to with agent output * 50k line PRs that required a 10-20 line change * Lack of testing (though there are some automated tests, but their validations are slim/lacking) * Bad error handling/flow handling
(By my organization, I meant my company - this person doesn't report to me or in my tree).
This is hilarious. Not when you're the reviewer, of course, but as a bystander, this is expert-level enterprise-grade trolling.
But LLMs don't really perform well enough on our codebase to allow you to generate things that even appear to work. And I'm the most junior member of my team at 37 years of age, hired in 2019.
I really tried to follow the mandate from on high to use Copilot, but the Agent mode can't even write code that compiles with the tools available to it.
Luckily I hooked it up to gptel so I can at least ask it quick questions about big functions I don't want to read in emacs.
This sounds fucking awesome.
I just had leadership on a big green-field project and IMO most of the team is ... looking forward to retirement a little too much to have really engaged (or taken leadership).
Fully vibe coded, which at least they admitted. And when I pointed out the thing is off by an order of magnitude, and as such doesn't implement said feature — at all — we get pressed on our AI policy, so as to not waste their time.
I don't have an AI policy, like I don't have an IDE policy, but things get ridiculous fast with vibe coding.
Fully vibe coded, which at least they admitted. And when I pointed out the thing is off by an order of magnitude, and doesn't implement said feature — at all — we get pressed on our AI policy, so as to not waste their time.
I don't have an AI policy, like I don't have an IDE policy, but things get ridiculous fast with vibe coding.
People do what they think they will be rewarded for. When you think your job is to write a lot of code then LLMs are great. When you need quality code you start to ask if LLMs are better or not?
I.e. 1-2 times a month, there's an SQL script posted that will be run against prod to "hopefully fix data for all customers who were put into a bad state from a previous code release".
The person who posts this type of message most often is also the one running internal demos of the latest AI flows and trying to get everyone else onboard.
If we are accepting LLM generated code, we should accept LLM generated content as long as it is "proof read" :)
Just a wild thought, nothing serious.
New to me, but I'm on board.
We already delegate accountability to non-humans all the time: - CI systems block merges - monitoring systems page people - test suites gate different things
In practice accountability is enforced by systems, not humans.. humans are defintiely "blamed" after the fact, but the day-to-day control loop is automated.
As agents get better at running code, inspecting ui state, correlating logs, screenshots, etc they're starting to operationally be "accountable" and preventing bad changes from shipping and producing evidence when something goes wrong .
At some point humans role shifts from "i personally verify this works" to "i trust this verification system and am accountable for configuring it correctly".
Thats still responsibility, but kind of different from whats described here. Taken to a logical extreme, the arguement here would suggest that CI shouldn't replace manual release checklists
Human collaboration works on trust.
Part of trust is accountability and consequences. If I get caught embezzling money from my employer I can lose my job, harm my professional reputation and even go to jail. There are stakes!
I computer system has no stakes, and cannot take accountability for its actions. This drastically limits what it makes sense to outsource to that system.
A lot of this comes down to my work on prompt injection. LLMs are fundamentally gullible: an email assistant might respond to an email asking for the latest sales figures by replying with the latest (confidential) sales figures.
If my human assistant does that I can reprimand or fire them. What am I meant to do with an LLM agent?
So the accountability situation for AI seems not that different. You can fire it. Exactly the same as for humans.
That's not a computer being accountable. That's you being accountable for proving that the software can carry out the business requirements without a human. That's not novel. That's how tons of software is already used.
if you put them in a forrest they would not survive and evolve (they are not viable systems alone); they are not taking action without the setup & maintenance (& accountability) of people
Accountability is about what happens if and when something goes wrong. The moon landings were controlled with computer assistance, but Nixon preparing a speech for what happened in the event of lethal failure is accountability. Note that accountability does not of itself imply any particular form or detail of control, just that a social structure of accountability links outcome to responsible person.
Perhaps an unstated and important takeaway here is that junior developers should not be permitted to use an LLMs for the same reason they should not hire people: they have not demonstrated enough skill mastery and judgement to be trusted with the decision to outsource their labor. Outsourcing to a vendor is a decision made by high-level stakeholders, with the ability to monitor the vendor performance, and replace the vendor with alternatives if that performance is unsatisfactory. Allowing junior developers to use LLM is allowing them to delegate responsibility without any visibility or ability to set boundaries on what can be delegated. Also important: you cannot delegate personal growth, and by permitting junior engineers to use an LLM that is what you are trying to do.
From there, I include explicit steps for how to test, including manual testing, and unit test/E2E test commands. If it's something visual, I try to include at least a screenshot, or sometimes even a brief screen capture demonstrating the feature.
Really go out of your way to make the reviewer's life easier. One benefit of doing all of this is that in most cases, the reviewer won't need to reach out to ask simple questions. This also helps to enable more asynchronous workflows, or distributed teams in different time zones.
The Devs went in kicking and screaming. It's almost like writing a description of the change, explaining the problem the code is solving, testing methodology, etc is harder than actually coding.
495 more comments available on Hacker News