Key Takeaways
What is the health of your enterprise code base? If it's anything like the ones I've experienced, i.e. a legacy mess, then it's absolutely understandable that an LLM's output is subpar when it takes on larger tasks.
It also depends on the model and plan you're on. There is a significant increase in quality when comparing Cursor's default model on a free plan vs. Opus 4.5 on a top-tier Claude plan.
I think a good exercise is to prohibit yourself from writing any code manually and force yourself to work LLM-only. It might sound silly, but it will develop that skill set.
Try Claude Code in thinking mode with Superpowers: https://github.com/obra/superpowers
I routinely make an implementation plan with Claude and then step away for 15 mins while it spins - the results aren’t perfect but fixing that remaining 10% is better than writing 100% of it myself.
When you hand it to new people there are of course questions, but most of them (juniors) could just follow the code given an entry point for that task, from back end to front end.
I use the premium GitHub Copilot models that are available.
> I routinely make an implementation plan with Claude and then step away for 15 mins while it spins - the results aren’t perfect but fixing that remaining 10% is better than writing 100% of it myself.
I have to be honest: I have only done this twice, and the amount of code that needed fixing, plus the mental overhead of hunting down open bugs, was much worse than just guiding the LLM at every step. But this was a couple of months ago.
But other than that, what I've found to be most important is static tooling. Do you have rules that require tests to be run? Do you have linters and code formatters that enforce your standards? Are you using well-known tools (build tools, dependency management tools, etc.), or is that bespoke?
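To make that concrete, here's a minimal sketch of the kind of gate I mean, assuming a Python project that already uses ruff and pytest (swap in whatever your stack uses). The point is that an agent's output has to clear the same mechanical bar as a human's before its changes count:

```python
"""Minimal quality gate: run formatter, linter, and tests, failing on the first error.

A sketch, assuming a Python project using ruff and pytest; substitute your own tools.
"""
import subprocess
import sys

# Each entry is a (description, command) pair; ordered cheapest-first.
CHECKS = [
    ("formatting", ["ruff", "format", "--check", "."]),
    ("lint rules", ["ruff", "check", "."]),
    ("unit tests", ["pytest", "-q"]),
]


def main() -> int:
    for name, cmd in CHECKS:
        print(f"== {name}: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"Quality gate failed at step: {name}")
            return result.returncode
    print("All checks passed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Wire the same script into pre-commit and CI so the agent, the junior, and the senior all get told "no" by the same machinery.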
But the less sexy answer is that no, you can't drop an agent cold into a big codebase and expect it to perform miracles. You need to build out agentic flows as a process that you iterate on and improve. If you prompt an agent and it gets something wrong, evaluate why and build out the tooling so it won't get it wrong next time. You slowly level up the tool's capabilities over time.
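As one small, hypothetical example of "build out the tooling": suppose the agent keeps importing an HTTP library directly instead of going through your wrapped client. Encode that convention as a test (the module names and layout below are made up for illustration), and the next run fails mechanically instead of in code review:

```python
"""Hypothetical example of turning a recurring agent mistake into an automated check.

Assumed convention: all HTTP calls go through src/ourapp/http_client.py and nothing
else imports `requests` directly. Paths and names are illustrative only.
"""
from pathlib import Path

ALLOWED = {Path("src/ourapp/http_client.py")}


def test_no_direct_requests_imports():
    offenders = []
    for path in Path("src").rglob("*.py"):
        if path in ALLOWED:
            continue
        text = path.read_text(encoding="utf-8")
        if "import requests" in text or "from requests" in text:
            offenders.append(str(path))
    assert not offenders, (
        f"Use ourapp.http_client instead of requests directly: {offenders}"
    )
```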
I can't emphasize enough the difference between agents, though. I've been doing a lot of A/B tests pitting Copilot against other agents, and it's wild how bad it is, even when backed by the same models.
In enterprise systems, “full features” built directly on model output tend to fail at the edges: permissions, retries, validation, and auditability. The teams that succeed put a deterministic layer around the model — schemas, tool boundaries, and explicit failure handling.
Once you do that, the LLM stops being the risky part. The architecture is.
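A minimal sketch of what that deterministic layer can look like, assuming Pydantic v2 and using `call_model` as a hypothetical stand-in for whatever LLM client you use: the model's output is parsed against a schema, and anything that doesn't validate becomes an explicit, typed failure instead of leaking downstream.

```python
"""Deterministic wrapper around an LLM call: schema validation plus explicit failure handling.

`call_model` is a hypothetical stand-in for your actual LLM client; the point is that
everything downstream only ever sees validated, typed data or a clear error.
"""
from pydantic import BaseModel, Field, ValidationError


class RefundDecision(BaseModel):
    # The schema is the contract: the model must produce exactly these fields.
    approve: bool
    amount_cents: int = Field(ge=0)
    reason: str


class ModelOutputError(Exception):
    """Raised when the model's output cannot be validated; callers must handle it."""


def decide_refund(ticket_text: str, call_model) -> RefundDecision:
    raw = call_model(
        "Return ONLY JSON with fields approve (bool), amount_cents (int >= 0), "
        "reason (str) for this support ticket:\n" + ticket_text
    )
    try:
        return RefundDecision.model_validate_json(raw)
    except ValidationError as exc:
        # Explicit failure path: audit/log and surface a typed error rather than
        # letting malformed output leak into permissions, retries, or billing code.
        raise ModelOutputError(f"Unusable model output: {exc}") from exc
```

The refund schema is invented for illustration; the pattern is that retries, permission checks, and audit logging hang off `ModelOutputError` and the validated object, not off ad-hoc string parsing.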
The reality is that LLMs/agents are just a new way to write code. You still need to understand, more-or-less, how this feature is going to actually work, and how it needs to be implemented, from start to finish.
The difference is that you don't write the code, you tell the LLM to write it. Once you've figured out the right "chunk size" an LLM can handle, it's faster than doing it yourself.
I've found it's actually a little _harder_ in greenfield projects, because the LLM doesn't have guardrails, examples, or existing patterns to follow.