Scaling LLMs to Larger Codebases
Key topics
As developers grapple with scaling large language models (LLMs) to larger codebases, a lively debate erupts over the effectiveness of current approaches. Some commenters feel that the discussion is rehashing familiar concepts, while others see value in reiterating these points for newcomers. A key challenge emerges: LLMs often ignore or deviate from provided context, such as prompts and documentation, in unpredictable ways, sparking a desire for a more deterministic language to instruct computers. The conversation highlights the need for better context management, with the author revealing that their codebase is their primary context, and others sharing war stories of LLMs gone awry.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 38m after posting
- Peak period: 77 comments (0-6h)
- Avg / period: 20
Based on 120 loaded comments
Key moments
- Story posted: Dec 22, 2025 at 10:38 AM EST (12 days ago)
- First comment: Dec 22, 2025 at 11:16 AM EST (38m after posting)
- Peak activity: 77 comments in 0-6h (hottest window of the conversation)
- Latest activity: Dec 25, 2025 at 11:56 AM EST (8 days ago)
Decent article, but it feels like a LinkedIn rehashing of stuff the people at the edge have known for a while.
You're not wrong, but it bears repeating to newcomers.
The average LLM user I encounter is still just hammering questions into the prompt and getting frustrated when the LLM makes the same mistakes over and over again.
I'm far from an LLM power user, but this is the single highest ROI practice I've been using.
You have to actually observe what the LLM is trying to do each time. Simply smashing enter over and over again or setting it to auto-accept everything will just burn tokens. Instead, see where it gets stuck and add a short note to CLAUDE.md or equivalent. Break it out into sub-files to open for different types of work if the context file gets large.
Letting the LLM churn and experiment for every single task will make your token quota evaporate before your eyes. Updating the context file constantly is some extra work for you, but it pays off.
My primary use case for LLMs is exploring code bases and giving me summaries of which files to open, tracing execution paths through functions, and handing me the info I need. It also helps a lot to add some instructions for how to deliver useful results for specific types of questions.
I feel like I spend quite a bit of time telling the thing to look at information it already knows. And I'm talking about cases where I HAVE actually created various documents and prompts for it to use.
As a specific example, it regularly just doesn't reference CLAUDE.md and it seems pretty random as to when it decides to drop that out of context. That's including right at session start when it should have it fresh.
Overcoming this kind of nondeterministic behavior around creating/following/modifying instructions is the biggest thing I wish I could solve with my LLM workflows. It seems like you might be able to do this through a system of Claude Code hooks, but I've struggled with finding a good UX for maintaining a growing and ever-changing collection of hooks.
Are there any tools or harnesses that attempt to address this and allow you to "force" inject dynamic rules as context?
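One pattern people try for this, as a sketch rather than a polished tool, is a hook that re-injects the rules file on every turn so it can't silently fall out of context. The Python below assumes a UserPromptSubmit-style Claude Code hook that receives event JSON on stdin and treats anything printed to stdout as extra context for that turn; verify that contract against the hooks docs, and treat the RULE_FILES list as a placeholder.

```python
#!/usr/bin/env python3
"""Sketch: re-inject rules files into context on every prompt via a hook.

Assumptions (check against the Claude Code hooks docs): the script is wired up
as a UserPromptSubmit-style command, the event arrives as JSON on stdin, and
stdout is appended to the model's context for the turn.
"""
import json
import pathlib
import sys

RULE_FILES = ["CLAUDE.md", "docs/agent-rules.md"]  # placeholder paths

try:
    json.load(sys.stdin)  # event payload; not needed for this simple case
except Exception:
    pass  # don't block the prompt if the payload shape differs

for name in RULE_FILES:
    path = pathlib.Path(name)
    if path.exists():
        # Repeating the rules every turn trades tokens for determinism.
        print(f"<rules source='{name}'>\n{path.read_text()}\n</rules>")
```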
I’ve found that when my agent flies off the rails, it’s due to an underlying weakness in the construction of my program. The organization of the codebase doesn’t implicitly encode the “map”. Writing a prompt library helps to overcome this weakness, but I’ve found that the most enduring guidance comes from updating the codebase itself to be more discoverable.
Case in point: I've had it delete the entire project, including .git, out of "shame", so my Claude doesn't get permission to run rm anymore.
Codex has fewer levers but it's deleted my entire project twice now.
(Play with fire, you're gonna get burnt.)
Also, I have extremely frequent commits and version control syncs to GitHub and so on as part of the process (including when it's working on documents or things that aren't code) as a way to counteract this.
Although I suppose a sufficiently devious AI can get around those, it seems to not have been a problem.
... and then ran a flawed git checkout command that wiped out all unstaged changes, which it immediately realized, and after flailing around for five minutes trying to undo it, it eventually came back saying "yeah uh so sorry I may have made a little mistake"
https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8...
I would agree with that!
I've been experimenting with having Claude re-write those documents itself. It can take simple directives and turn them into hierarchical Markdown lists that have multiple bullet points. It's annoying and overly verbose for humans to read, but the repetition and structure seems to help the LLM.
I also interrupt it and tell it to refer back to CLAUDE.md if it gets too off track.
Like I said, though, I'm not really an LLM power user. I'd be interested to hear tips from others with more time on these tools.
Better than that, ask the LLM. Better than that, have the LLM ask itself. You do still have to make sure it doesn't go off the rails, but the LLM itself wrote this to help answer the question:
### Pattern 10: Student Pattern (Fresh Eyes)
*Concept:* Have a sub-agent read documentation/code/prompts "as a newcomer" to find gaps, contradictions, and confusion points that experts miss.
*Why it works:* Developers write with implicit knowledge they don't realize is missing. A "student" perspective catches assumptions, undefined terms, and inconsistencies.
*Example prompt:*
```
Task: "Student Pattern Review

Pretend you are a NEW AI agent who has never seen this codebase. Read these docs as if encountering them for the first time:
1. CLAUDE.md
2. SUB_AGENT_QUICK_START.md

Then answer from a fresh perspective:

## Confusion Points
- What was confusing or unclear on first read?
- What terms are used without explanation?

## Contradictions
- Where do docs disagree with each other?
- What's inconsistent?

## Missing Information
- What would a new agent need to know that isn't covered?

## Recommendations
- Concrete edits to improve clarity

Be honest and critical. Include file:line references."
```
*Use cases:* Before finalizing new documentation; evaluating prompts for future agents.
Like if I go to a restaurant for the first time and the item I order is bad, could I go back and try something else? Perhaps, but I could also go somewhere else.
Claude does have this specific interface for asking questions now. I've only had it choose to ask me questions on its own a very few times though. But I did have it ask clarifying questions before that interface was even a thing, when I specifically asked it to ask me clarifying questions.
Again, like a junior dev. And like a junior dev, it can also help to ask it to ask / check what it's doing "mid-way", i.e. watch what it's doing and stop it when it's running down some rabbit hole you know is not gonna yield results.
"Before you start, please ask me any questions you have about this so I can give you more context. Be extremely comprehensive."
(I got the idea from a Medium article[1].) The LLM will, indeed, stop and ask good questions. It often notices what I've overlooked. Works very well for me!
[1] https://medium.com/@jordan_gibbs/the-most-important-chatgpt-...
> Before you proceed, read the local and global Claude.md files and make sure you understand how we work together. Make sure you never proceed beyond your own understanding.
> Always consult the user anytime you reach a judgment call rather than just proceeding. Anytime you encounter unexpected behavior or errors, always pause and consider the situation. Rather than going in circles, ask the user for help; they are always there and available.
> And always work from understanding; never make assumptions or guess. Never come up with field names, method names, or framework ideas without just going and doing the research. Always look at the code first, search online for documentation, and find the answer to things. Never skip that step and guess when you do not know the answer for certain.
And then the Claude.md file has a much more clearly written out explanation of how we work together and how it's a consultative process where every major judgment call should be prompted to the user, and every single completed task should be tested and also asked for user confirmation that it's doing what it's supposed to do. It tends to work pretty well so far.
You may not like all the opinions of the framework, but the LLM knows them and you don’t need to write up any guidelines for it.
I liked the Rust solution a lot, but it had 200+ dependencies vs Bun’s 5 and Rails’ 20ish (iirc). Rust feels like it inherited the NPM “pull in a thousand dependencies per problem” philosophy, which is a real shame.
Knowing how you would implement the solution beforehand is a huge help, because then you can just tell the LLM to do the boring/tedious bits.
It almost never fails and usually does it in a neat way, plus it's ~50 lines of code so I can copy and paste confidently. Letting the agent just go wild on my code has always been a PITA for me.
I feel the same way as you in general -- I don't trust it to go and just make changes all over the codebase. I've seen it do some really dumb stuff before because it doesn't really understand the context properly.
- [Research] Ask the agent to explain current functionality as a way to load the right files into context.
- [Plan] Ask the agent to brainstorm the best-practices way to implement a new feature or refactor. "Brainstorm" seems to be a keyword that triggers a better questioning loop for the agent. Ask it to write a detailed implementation plan to an md file.
- [Clear] Completely clear the context of the agent; this gives better results than just compacting the conversation.
- [Execute plan] Ask the agent to review the specific plan again; sometimes it will ask additional questions, which repeats the planning phase. This loads only the plan into context; then have it implement the plan.
- [Review & test] Clear the context again and ask it to review the plan to make sure everything was implemented. This is where I add any unit or integration tests if needed. Also run test suites, type checks, lint, etc.
With this loop I’ve often had it run for 20-30 minutes straight and end up with usable results. It’s become a game of context management and creating a solid testing feedback loop instead of trying to purely one-shot issues.
For really big features or plans I'll ask the agent to create Linear issue tickets to track progress for each phase over multiple sessions. The only MCP I usually have loaded is Linear, but I'm looking for a good way to transition it to a skill.
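For what it's worth, the loop above can be driven from a small script. This is a minimal sketch assuming the claude CLI's non-interactive `claude -p` mode; each fresh process stands in for a context clear, and PLAN.md and pytest are placeholders for your own plan file and checks.

```python
#!/usr/bin/env python3
"""Sketch of the research -> plan -> clear -> execute -> review loop."""
import subprocess

PLAN = "PLAN.md"  # placeholder plan document


def fresh_session(prompt: str) -> None:
    # Each subprocess is a brand-new session, i.e. a fully cleared context.
    subprocess.run(["claude", "-p", prompt], check=True)


# [Research] + [Plan]: explore, then write the plan to a file.
fresh_session(
    "Explain how the current feature works, then brainstorm the best-practices way "
    f"to implement the change and write a detailed implementation plan to {PLAN}."
)

# [Clear] + [Execute plan]: a new session that loads only the plan.
fresh_session(f"Read {PLAN} and implement it exactly as written.")

# [Review & test]: another new session, plus deterministic checks outside the model.
fresh_session(f"Review {PLAN} and verify every item was implemented; list anything missing.")
subprocess.run(["pytest", "-q"], check=True)  # placeholder for tests/type checks/lint
```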
I've had models do the complete opposite of what I've put in the plan and guidelines. I've had them go re-read the exact sentences and still come to the opposite conclusion, and my instructions are nothing complex at all.
I used to think one could build a workflow and process around LLMs that extract good value from them consistently, but I'm now not so sure.
I notice that sometimes the model will be in a good state and do a long chain of edits of good quality. The problem is, it's still a crapshoot how to get them into a good state.
But why is it a big issue? If it does something bad, just reset the worktree and try again with a different model/agent. They are dirt cheap at $20/month, and I have 4 subscriptions (Claude, Codex, Cursor, Zed).
The thing about good abstractions is that you should be able to trust them in a composable way. The simpler or more low-level the building blocks, the more reliable you should expect them to be. With LLMs you can't really make this assumption.
> The issue is that if it's struggling sometimes with basic instruction following, it's likely to be making insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.
Yes, that's why we review all code even when written by humans.
The issue for me is that I have no idea what the code looks like, so I have to have a reliable first-layer model that can summarize the current codebase state so I can decide whether the next mutation moves the project forward or reduces technical debt. I can delegate much more that way, while Gemini's "do first" approach tends to result in many dead ends that I have to unravel.
Biggest step-change has been being able to one-shot file refactors (using the planning framework I mentioned above). 6 months ago refactoring was a very delicate dance and now it feels like it’s pretty much streamlined.
In both instances, literally just restating the exact same request with "No, the request was: [original wording]" was all it took to steer them back, and it didn't become a pattern of similar failures. But with the unpredictability of how the CLI agents decide to traverse a repo and ingest large amounts of distracting code/docs, it seems much too overconfident to believe that random, bizarre LLM "reasoning" failures won't still occur from time to time in regular usage, even as models improve.
It's like the fallout when a Waymo kills a "beloved neighborhood cat". I'm not against cats, and I'm deeply saddened at the loss of any life, but if it's true that, mile for (comparable) mile, Waymos reduce deaths and injuries, that is a good thing - even if they don't reduce them to zero.
And to be clear, I often feel the same way - but I am wondering why and whether it's appropriate!
LLMs become increasingly error-prone as their memory fills up. Just like humans.
In VSCode Copilot you can keep track of how many tokens the LLM is dealing with in real time with "Chat Debug".
When it reaches 90k tokens I should expect degraded intelligence and brace for a possible forced summarization.
Sometimes I just stop LLMs and continue the work in a new session.
For planning large tasks like "setup playwright tests in this project with some demo tests" I spend some time chatting with Gemini 3 or Opus 4.5 to figure out the most idiomatic easy-wins and possible pitfalls. Like: separate database for playwright tests. Separate users in playwright tests. Skipping login flow for most tests. And so on.
I suspect that devs who use a formal-plan-first approach tend to tackle larger tasks and even vibe code large features at a time.
The biggest gotcha I've found is that these LLMs love to assume that code is C/Python, just written in your favorite language of choice. Instead of considering that something should be encapsulated into an object to maintain state, it will write 5 functions, passing the state as parameters between each one. It will also consistently ignore most of the code around it, even if reading it would show what specifically could be reused. So you end up with copy-pasta code, and unstructured copy-pasta at best.
The other gotcha is that claude usually ignores CLAUDE.md. So for me, I first prompt it to read it and then I prompt it to next explore. Then, with those two rules, it usually does a good job following my request to fix, or add a new feature, or whatever, all within a single context. These recent agents do a much better job of throwing away useless context.
I do think the older models and agents get better results when writing things to a plan document, but I've noticed recent opus and sonnet usually end up just writing the same code to the plan document anyway. That usually ends up confusing itself because it can't connect it to the code around the changes as easily.
Sounds very functional, testable, and clean. Sign me up.
I have a saved user prompt called "clean code" that makes a pass through the changes to remove unused code, DRY things up, and refactor - literally the high points of Uncle Bob's Clean Code. It works shockingly well at taking AI code and making it somewhat maintainable.
https://gist.github.com/prostko/5cf33aba05680b722017fdc0937f...
After forcing myself over the years to apply various OOP principles in multiple languages, I believe OOP has truly been the worst thing to happen to me personally as an engineer. Now I believe what you're actually seeing is just an "aesthetics" issue; moreover, it's purely learned aesthetics.
I'd argue writing functional code in C++ (which is multi-paradigm anyway), or Java, or Typescript is fine!
Does the UI show clearly what portion was done by a subagent?
It’ll report, “Numbers changed in step 6a, therefore it worked” [forgetting the pivotal role of step 2, which failed, meaning the agent should have taken step 6b, not 6a].
Or “there is conclusive evidence that X is present and therefore we were successful” [X is discussed in the plan as the reason why action is NEEDED, not as a success criterion].
I _think_ that what is going wrong is context overload, and my remedy is to have the agent update every step of the plan with results before moving on to the next.
When things seem off I can then clear context and have the agent review results step by step to debug its own work: “Review step 2 of the results. Are the stated results consistent with the final conclusions? Quote lines from the results verbatim as evidence.”
At a basic level, they work akin to git hooks, but they fire up a whole new context whenever certain events trigger (e.g. another agent finishes implementing changes), and that hook instance is independent of the implementation context (which is great, as for the review case it acts as a semi-independent reviewer).
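A rough sketch of what such an event-triggered reviewer could look like, assuming your harness runs this script when the implementing agent finishes and that the claude CLI offers a non-interactive `claude -p` mode; both are assumptions to check against your own setup.

```python
#!/usr/bin/env python3
"""Sketch: a hook-style reviewer that runs in its own fresh context."""
import subprocess

# Gather what the implementation context produced, without inheriting its context.
diff = subprocess.run(
    ["git", "diff", "HEAD"], capture_output=True, text=True, check=True
).stdout

if diff.strip():
    # A brand-new session acts as the semi-independent reviewer.
    subprocess.run(
        [
            "claude", "-p",
            "You are reviewing changes you did not write. Flag bugs, deviations "
            "from CLAUDE.md, and missing tests.\n\nDiff:\n" + diff,
        ],
        check=True,
    )
```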
We've taken those prompts, tweaked them to be more relevant to us and our stack, and have pulled them in as custom commands that can be executed in Claude Code, i.e. `/research_codebase`, `/create_plan`, and `/implement_plan`.
It's working exceptionally well for me, it helps that I'm very meticulous about reviewing the output and correcting it during the research and planning phase. Aside from a few use cases with mixed results, it hasn't really taken off throughout our team unfortunately.
IMO, the best way to raise the floor of LLM performance in codebases is by building meaning into the codebase itself, a la DDD. If your codebase is hard for a human to understand and grok, it will be the same for an LLM. If your codebase is unstructured and has no definable patterns, it will be harder for an LLM to use.
You can try to overcome this with even more tooling and more workflows, but IMO it is throwing good money after bad. It is ironic and maybe unpopular, but it turns out LLMs prove that all the folks yapping about language and meaning (re: DDD) were right.
DDD & the Simplicity Gospel:
https://oluatte.com/posts/domain-driven-design-simplicity-go...
I've been having a lot of fun taking my larger projects and decomposing them into various structures made from nix flakes. If I launch claude code in a flake devshell it has access to only those tools, and it sees the flake.nix and assumes that the project is bounded by the CWD even though it's actually much larger, so its context is small and it doesn't get overwhelmed.
Inputs/outputs are a nice language agnostic mechanism for coordinating between flakes (just gotta remember to `nix flake update --update-input` when you want updated outputs from an adjacent flake).
I've been running with the idea for a few weeks, maybe it's dumb, but I'd be surprised if this kind of rethinking didn't eventually yield a radical shift in how we organize code.
You can use it as an alternative to `git bisect` where you're only bisecting the history of a single subflake. I imagine writing a new test that indicates the presence of an old bug, and then going back in time to see when the bug was reintroduced. With git bisect, going back in time means your new test goes away too.
The focus mainly seems to be on enhancing existing workflows to produce the code we currently expect; often you hear it's like a junior dev.
The type of rethinking you outlined could have code organised in such a way that a junior dev would never be able to extend it, but our 'junior dev' LLM can iterate through changes easily.
I care more about the properties of software, e.g. testable, extendable, secure, than about how it is organised.
It gets me thinking of questions like:
- What is the correlation between how code is organised and its properties?
- What is the optimal organisation of code to facilitate LLMs modifying and extending software?
https://github.com/MatrixManAtYrService/poag
I'm especially pleased with how explicit it makes the inner dependency graph. Today I'm tinkering with Pact (https://docs.pact.io/), and needing to add the pact contracts generated during consumer testing as flake outputs, so they can then be inputs to whichever flake does provider testing, is potentially a bit more work than it would be under other schemes. But it also makes the dependency direction a first-class citizen rather than an implementation detail, which I like.
But it's still not completely right. LLMs are actually great to tell you about things you know little about. You just have to take names, ideas, and references from it, not facts.
(And that makes agentic coding almost useless, by the way.)
Burn through your token limit in agent mode just to thrash around a few more times trying to identify where the agent "misunderstood" the prompt.
The only time LLMs work as coding agents for me is with tightly scoped prompts and a small, isolated context.
Just throwing an entire codebase at an LLM in an agentic loop seems like a fool's errand.
I have the complete opposite experience: once some patterns already exist 2-3 times in the codebase, the LLMs start accurately replicating them instead of trying to solve everything as one-off solutions.
> You can’t be inconsistent if there are no existing patterns.
"Consistency" shouldn't be equated to "good". If that's your only metric for quality and you don't apply any taste you'll quickly end of with a unmaintainable hodgepodge of second-grade libraries if you let an LLM do its thing in a greenfield project.
I’m working on a fairly messy ingestion pipeline (Instagram exports → thumbnails → grouped “posts” → frontend rendering). The data is inconsistent, partially undocumented, and correctness is only visible once you actually look at the rendered output. That makes it a bad fit for naïve one-shotting.
What’s worked is splitting responsibility very explicitly:
- Human (me): judge correctness against reality. I look at the data, the UI, and say things like “these six media files must collapse into one post”, “stories should not appear in this mode”, “timestamps are wrong”. This part is non-negotiably human.
- LLM as planner/architect: translate those judgments into invariants and constraints (“group by export container, never flatten before grouping”, “IG mode must only consider media/posts/*”, “fallback must never yield empty output”). This model is reasoning about structure, not typing code.
- LLM as implementor (Codex-style): receives a very boring, very explicit prompt derived from the plan. Exact files, exact functions, no interpretation, no design freedom. Its job is mechanical execution.
Crucially, I don’t ask the same model to both decide what should change and how to change it. When I do, rework explodes, especially in pipelines where the ground truth lives outside the code (real data + rendered output).
This also mirrors something the article hints at but doesn’t fully spell out: the codebase isn’t just context, it’s a contract. Once the planner layer encodes the rules, the implementor can one-shot surprisingly large changes because it’s no longer guessing intent.
The challenges are mostly around discipline:
- You have to resist letting the implementor improvise.
- You have to keep plans small and concrete.
- You still need guardrails (build-time checks, sanity logs) because mistakes are silent otherwise.
But when it works, it scales much better than long conversational prompts. It feels less like “pair programming with an AI” and more like supervising a very fast, very literal junior engineer who never gets tired, which, in practice, is exactly what these tools are good at.
But the summary here is that with the right guidance, AI currently crushes it on large codebases.
Good read, but I wouldn't fully extend the garbage-in, garbage-out principle to LLMs. These massive LLMs are trained on internet-scale data, which includes a significant amount of garbage, and still do pretty well. Hallucinations stem more from missing or misleading context than from the noise alone. Tech-debt-heavy codebases, though unstructured, still provide information-rich context.
Of course, but the problem is the converse: There are too many situations where a peer engineer will know what to do but the agent won't.
> Think of these as bumper rails. You can increase the likelihood of an LLM reaching the bowling pins by making it impossible to land in the gutter.
Sort of, but this is also a little similar to a claim that P = NP. Having an efficient way to reliably check whether a solution is correct is not the same as having a reliable way to find a solution. The likelihood may well be higher, yet still not high enough. Even though theoretically NP problems are strictly easier than EXPTIME ones, in practice, in many situations (though not all), they are equally intractable.
Obviously, mileage may vary depending on the task and the domain, but if it's true that coding models will get significantly better, then the best course of action may well be, in many cases, to just wait until they do rather than spend a lot of effort working around their current limitations, effort that will be wasted if and when capabilities improve.
I've seen some impressive output so far, and have a couple of friends who have been using AI generation a lot... I'm trying to create a couple of legacy (BBS tech related, in Rust) applications to see how they land. So far it's mostly planning and structure, beyond the time I've spent in contemplation. I'm not sure I can justify the expense long term, but I want to experience the fuss a bit more to have at least a better awareness.
I’d like to see dynamic task-specific context building. Write a prompt and the model starts to collect relevant instructions.
Also a review loop to check that instructions were followed.
This is a good article, but misses one of the most important advances this year - the agentic loop.
There are always going to be limits to how much code a model can one-shot. Give it the ability to verify its changes and iterate, though, and you massively increase its ability to write sizeable chunks of verified, working code.
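As a rough illustration (not the article's code), a verify-and-iterate loop can be as small as the sketch below; it assumes a non-interactive `claude -p` mode, and the PLAN.md task, pytest command, and retry limit are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: let the agent iterate against a verifier instead of one-shotting."""
import subprocess

prompt = "Implement the change described in PLAN.md."  # placeholder task
MAX_ATTEMPTS = 3

for attempt in range(1, MAX_ATTEMPTS + 1):
    subprocess.run(["claude", "-p", prompt], check=True)

    # Deterministic verification outside the model: the test suite is the oracle.
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if tests.returncode == 0:
        print(f"Verified on attempt {attempt}.")
        break

    # Feed the failure back so the next attempt iterates instead of guessing.
    prompt = (
        "The previous change fails verification. Fix it so the tests pass.\n\n"
        "Test output:\n" + tests.stdout[-4000:]
    )
else:
    print("Still failing after", MAX_ATTEMPTS, "attempts; hand it to a human.")
```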
I find it to be a good thing that the code must be read in order to be production-grade, because that implies the coder must keep learning.
I worry about the collapse in knowledge pipeline when there is very little benefit to overseeing the process...
I say that as a bad coder who can do, and has done, SO MUCH MORE with LLM agents. So I'm not writing this as someone who has an ideal of coding that is being eroded. I'm just entering the realm of "what elite coding can do" with LLMs, but I worry for what the realm will lose, even as I'm just arriving.