Key Takeaways
CC has built-in subagents (including at least one not listed) that work very well: https://code.claude.com/docs/en/sub-agents#built-in-subagent...
This wasn't the case in the past; I had sworn off subagents, but they got good at some point.
Now I find it doing, unprompted, the things I used to have to interrupt it and tell it to do fairly frequently.
(Should have 2025 in the title? Time flies)
https://www.tbench.ai/leaderboard/terminal-bench/2.0 says yes, but not as much as you'd think. "Terminus" is basically just a tmux session and LLM in a loop.
For the longest time, Claude Code itself didn't really use subagents much by default, other than supporting them as a feature eager users could configure. (Source: reverse engineering we did on Claude Code using the fantastic CC tracing tool Simon Willison wrote about once. This is also no longer true on the latest versions, which have e.g. an Explore subagent that is actively used.)
I was having Codex organize my TV/movie library the other day; most of the files were not properly labeled. I had Codex generate transcripts, manually search the movie DB to find descriptions of show episodes, and match the show descriptions against the transcripts to figure out which season/episode each file belonged to.
Claude Code could have parallelized those manual checks and finished that task at 8x the speed.
I'd say it's similar to how a "make your own relational DB" article might feature a basic B-tree with merge-joins. Yeah, obviously real engines have sophisticated planners, multiple join methods, bloom filters, etc., but the underlying mental model is still accurate.
Here’s a reframing:
If you asked people: "What would you rather work with: today's Claude Code harness with Sonnet 3.7, or the 200-line agentic loop in the article with Opus 4.5?"
I suspect many people would choose 3.7 with the harness. If that is true, then I’d say the article is no longer useful for understanding Claude Code.
3.7 was kinda dumb, it was good at vibe UIs but really bad at a lot of things and it would lie and hack rewards a LOT. The difference with Opus 4.5 is that when you go off the Claude happy path, it holds together pretty well. With Sonnet (particularly <=4) if you went off the happy path things got bad in a hurry.
But skills do improve model performance; OpenAI posted some examples of how they massively juiced up results on some benchmarks.
It's not.
I've done this (although not with all these tools).
For a reasonable sized project it's easy to tell the difference in quality between say Grok-4.1-Fast (30 on AA Coding Index) and Sonnet 4.5 (37 on AA).
Sonnet 3.7 scores 27. No way I'm touching that.
Opus 4.5 scores 46 and it's easy to see that difference. Give the models something with high cyclomatic complexity or complex dependency chains and Grok-4.1-Fast falls to bits, while Opus 4.5 solves things.
Look at https://github.com/SWE-agent/mini-swe-agent for proof
There's a reason Cursor poached Boris Cherny and Cat Wu and Anthropic hired them back!
I actually wrote my own simple agent (with some twists) in part so I could compare models.
Opus 4.5 is in a completely different league to Sonnet 4.5, and 3.7 isn't even on the same planet.
I happily use my agent with Opus but there is no world in which I'd use a Sonnet 3.7 level model for anything beyond simple code completion.
#!/usr/bin/env bash
set -euo pipefail
# Fail fast if OPENAI_API_KEY is unset or empty
: "${OPENAI_API_KEY:?set OPENAI_API_KEY}"
MODEL="${MODEL:-gpt-5.2-chat-latest}"
extract_text_joined() {
  # Collect all text fields from the Responses API output and join them
  jq -r '[.output[]?.content[]? | select(has("text")) | .text] | join("")'
}
apply_writes() {
  local plan="$1"
  echo "$plan" | jq -c '.files[]' | while read -r f; do
    local path content
    path="$(echo "$f" | jq -r '.path')"
    content="$(echo "$f" | jq -r '.content')"
    mkdir -p "$(dirname "$path")"
    printf "%s" "$content" > "$path"
    echo "wrote $path"
  done
}
while true; do
  printf "> "
  read -r USER_INPUT || exit 0
  [[ -z "$USER_INPUT" ]] && continue

  # File list relative to cwd (-maxdepth before -type avoids a GNU find warning)
  TREE="$(find . -maxdepth 6 -type f -print | sed 's|^\./||')"

  USER_JSON="$(jq -n --arg task "$USER_INPUT" --arg tree "$TREE" \
    '{task:$task, workspace_tree:$tree,
      rules:[
        "Return ONLY JSON matching the schema.",
        "Write files wholesale: full final content for each file.",
        "If no file changes are needed, return files:[]"
      ]}')"

  RESP="$(
    curl -s https://api.openai.com/v1/responses \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d "$(jq -n --arg model "$MODEL" --argjson user "$USER_JSON" '
        {model: $model,
         input: [
           {role: "system", content: "You output only JSON file-write plans."},
           {role: "user", content: $user}
         ],
         text: {format: {type: "json_schema", name: "file_writes",
           schema: {type: "object", additionalProperties: false,
             properties: {files: {type: "array", items: {
               type: "object", additionalProperties: false,
               properties: {path: {type: "string"}, content: {type: "string"}},
               required: ["path", "content"]}}},
             required: ["files"]}}}}')"
  )"

  PLAN="$(printf "%s" "$RESP" | extract_text_joined)"
  apply_writes "$PLAN"
done

Here's an agent in 24 lines of PHP, written in 2023. But it relies on `llm` to do HTTP and JSON.
Claude Code already codes Claude Code.
The limit is set by the amount of GPUs and energy supply.
Claude Code could code all the Claudes Claude Code could code, because Claude Code already coded the Claude that codes Claude Code.
Or more philosophically: The answer is recursively infinite, because each Claude that gets coded can then help code the next Claude, creating an ever-improving feedback loop of Claude-coding Claudes. It's Claudes all the way down!
For example, the agent in the post will demonstrate 'early stopping', where it finishes before the task is really done. You'd think reasoning models would solve this, but in practice even SOTA models still do it.
To fix 'early stopping' you need extra features in the agent harness. Claude Code does this with TODOs that are injected back into every prompt to remind the LLM what tasks remain open. (If you're curious, somewhere in the public repo for HolmesGPT we have benchmarks with all the experiments we ran to solve this - from hypothesis tracking to other exotic approaches - but TODOs always performed best.)
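For concreteness, here's a minimal sketch of the re-injection idea in Python. This is not Claude Code's or HolmesGPT's actual code; the names (`todos`, `todo_reminder`, `build_messages`) and the reminder wording are all made up.

# Minimal sketch of TODO re-injection; everything here is illustrative.
todos = []  # e.g. [{"task": "add failing test", "done": False}, ...]

def todo_reminder():
    """Render the open TODOs as a reminder string appended to each request."""
    open_items = [t["task"] for t in todos if not t["done"]]
    if not open_items:
        return ""
    return ("Open TODOs (do not stop until all are done):\n"
            + "\n".join("- " + t for t in open_items))

def build_messages(history, user_input):
    """Re-inject the reminder on every turn so open work can't be 'forgotten'."""
    messages = list(history) + [{"role": "user", "content": user_input}]
    reminder = todo_reminder()
    if reminder:
        messages.append({"role": "system", "content": reminder})
    return messages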
Still, good article. Agents really are just tools in a loop. It's not rocket science.
Any idea on why the other end of the spectrum is this way -- thinking that it always has something to do?
I can think of a pet theory on it stopping early -- that positive tool responses and such bias it towards thinking it's complete (could be extremely wrong)
Who said anything about "thinking"? Smaller models were notorious for getting stuck repeating a single word over and over, or just "eeeeeee" forever. Larger models only change probabilities, not the fundamental nature of the machine.
So infinite loops are more of a default, and the question is how to avoid them. Picking randomly (non-zero temperature) helps prevent repetition sometimes. Other higher-level patterns probably prevent this from happening most of the time in more sophisticated LLMs.
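For what it's worth, the temperature mechanism is simple enough to sketch (illustrative Python over toy logits, not any particular model's sampler):

# Illustrative temperature sampling over next-token logits.
import math, random

def sample_next_token(logits, temperature=0.8):
    if temperature == 0:
        # Greedy decoding: deterministic, so a repetition loop repeats forever.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]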
Since one of the replies asked for an example: the agent works for a bit and just stops. We’ve all seen cases where the agent simply says “ok, let me read the blah.py to understand the context better”, and just stops. It has essentially forgotten to use a tool for its next edit or read etc.
For example
- how can I reliably have a decision block to end the loop (or keep it running)?
- how can I reliably call tools with the right schema?
- how can I reliably summarize context / excise noise from the conversation?
Perhaps, as the models get better, they'll approach some threshold where my worries just go away. However, I can't quantify that threshold myself and that leaves a cloud of uncertainty hanging over any agentic loops I build.
Perhaps I should accept that it's a feature and not a bug? :)
Re (2) also fairly easy! It's just a summarization prompt. E.g. this is the one we use in our agent: https://github.com/HolmesGPT/holmesgpt/blob/62c3898e4efae69b...
Or just use the Claude Code SDK, which does this all for you! (You can also use various provider-specific features for (2), like automatic compaction on OpenAI's Responses endpoint.)
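For anyone who wants the shape of it, here is a rough sketch of compaction-as-a-summarization-prompt, assuming an OpenAI-style Python client. The threshold, model name, and prompt wording are placeholders of mine, not the HolmesGPT prompt linked above.

# Rough sketch of context compaction via a summarization prompt.
def maybe_compact(client, messages, max_messages=40, keep_recent=10):
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user",
                   "content": "Summarize this agent transcript. Preserve file "
                              "paths, decisions, open tasks, and errors:\n"
                              + transcript}],
    ).choices[0].message.content
    # Replace the old turns with a single compact summary message.
    return [{"role": "system", "content": "Conversation summary:\n" + summary}] + recent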
> - how can I reliably call tools with the right schema?
This is typically done by enabling strict mode for tool calling, which is a hermetic solution: it makes the LLM unable to generate tokens that would violate the schema (i.e. the LLM samples only from the subset of tokens that lead to valid schema output).
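On the OpenAI API that looks roughly like the following (the tool itself is made up; the relevant parts are "strict": true, "additionalProperties": false, and listing every property in "required"):

# Illustrative strict tool definition (Chat Completions shape).
tools = [{
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Write full file contents to a path.",
        "strict": True,  # constrain sampling to tokens that keep the JSON valid
        "parameters": {
            "type": "object",
            "additionalProperties": False,
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
}]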
The TODO injection nyellin mentions is a good example. It's not sophisticated ML - it's bookkeeping. But without it, the agent will confidently declare victory three steps into a ten-step task. Same with subagents - they're not magic, they're just a way to keep working memory from getting polluted when you need to go investigate something.
The 200-line version captures the loop. The production version captures the paperwork around the loop. That paperwork is boring but turns out to be load-bearing.
I assign a <5% probability that GP comment was AI written. It's easy to tell, because AI writing has no soul.
If it's actually AI, the pattern becomes extremely obvious reading them back-to-back. If not, I'll happily give them the benefit of the doubt at that point.
The other day on Reddit I saw a post in r/sysadmin that absolutely screamed karma-farming AI. It was really depressing to see a bunch of people defending the poster as the victim of an anti-AI mob without noticing that the entire profile was carbon copies of generic "Does anyone else dislike [Tool X], am I alone? [generic filler] What does everyone else think?" posts.
In fact, can someone link me to a disputed comment that the consensus ends up being it's actually AI? I don't think I've seen one.
It's a bot. Period.
It's basically that. I can't explain it (I tried listing the tells in a comment below), but it's not just a list of things you notice. You notice the whole message, the cadence, the phrases that "add nothing". You play with enough models, you see enough generations and you start to "see it".
If you'd like to check for yourself, check that user's comment history. It will become apparent after a few messages. They all have these tells. I don't know how else to explain it, but it's there.
That's certainly a novel and confusing entry in my search history.
And then, as an exercise, ask yourself why you were willing to give this comment leniency?
Not 200 lines of Python.
With big models (like Claude Opus 4.5), this can be (and is) just RL-trained into the main model.
gemini 3 has pretty clearly been trained for this workflow of text output, since it can actually get the right calls in the first shot most of the time, and pays attention to the end of the context and not just the start.
Gemini 3 is sitting within a text format it has been trained on, whereas Gemini 2 only had the prompt to tell it how to work with the tool.
Having said that, I think if you're going to write an article like this and call it "The Emperor Has No Clothes: How to Code Claude Code in 200 Lines of Code", you should at least include a reference to Thorsten Ball's excellent article from wayyy back in April 2025 entitled "How to Build an Agent, or: The Emperor Has No Clothes" (https://ampcode.com/how-to-build-an-agent)! That was (as far as I know) the first of these articles making the point that the core of a coding agent is actually quite simple (and all the deep complexity is in the LLM). Reading it was a light-bulb moment for me.
FWIW, I agree with other commenters here that you do need quite a bit of additional scaffolding (like TODOs and much more) to make modern agents work well. And Claude Code itself is a fairly complex piece of software with a lot of settings, hooks, plugins, UI features, etc. Although I would add that once you have a minimal coding agent loop in place, you can get it to bootstrap its own code and add those things! That is a fun and slightly weird thing to try.
(By the way, the "January 2025" date on this article is clearly a typo for 2026, as Claude Code didn't exist a year ago and it includes use of the claude-sonnet-4-20250514 model from May.)
Edit: and if you're interested in diving deeper into what Claude Code itself is doing under the hood, a good tool to understand it is "claude-trace" (https://github.com/badlogic/lemmy/tree/main/apps/claude-trac...). You can use it to see the whole dance with tool calls and the LLM: every call out to the LLM and the LLM's responses, the LLM's tool call invocations and the responses when tools run, etc. When Claude Skills came out I used this to confirm my guess about how they worked (they're a tool call with all the short skill descriptions stuffed into the tool description base prompt). Reading the base prompt is also interesting. (Among other things, they explicitly tell it not to use emoji, which tracks as when I wrote my own agent it was indeed very emoji-prone.)
There's a reason they won the agent race, their models are trained to use their own tools.
That said, I think it's hard to say how much of a difference it really makes in terms of making Claude Code specifically better than other coding agents using the same LLM (versus just making the LLM better for all coding agents using roughly similar tools). There is probably some difference, but you'd need to run a lot of benchmarks to find out.
Given the stance of the article, just the transcript formats reveal what might be a surprisingly complex system once you dig in.
For Claude Code, beyond the basic user/assistant loop, there's uuid/parentUuid threading for conversation chains, queue-operation records for handling messages sent during tool execution, file-history-snapshots at every file modification, and subagent sidechains (agent-*.jsonl files) when the Task tool spawns parallel workers.
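As a rough illustration of just the threading part, here's how you might rebuild a conversation chain from a transcript JSONL using only the uuid/parentUuid fields mentioned above. Everything else about the record shape is an assumption on my part, not Claude Code's documented format.

# Sketch: reconstruct a conversation chain from uuid/parentUuid threading.
import json

def load_records(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def thread_from_leaf(records, leaf_uuid):
    """Walk parentUuid pointers from a leaf record back to the root."""
    by_uuid = {r["uuid"]: r for r in records if "uuid" in r}
    chain, cur = [], by_uuid.get(leaf_uuid)
    while cur is not None:
        chain.append(cur)
        cur = by_uuid.get(cur.get("parentUuid"))
    return list(reversed(chain))  # root -> leaf order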
So "200 lines" captures the concept but not the production reality of what is involved. It is particularly notable that Codex has yet to ship queuing, as that product is getting plenty of attention and still highly capable.
I have been building Contextify (https://contextify.sh), a macOS app that monitors Claude Code and Codex CLI transcripts in real-time and provides a CLI and skill called Total Recall to query your entire conversational history across both providers.
I'm about to release a Linux version and would love any feedback.
[1] With the exception of Claude Code Web, which does expose "sessions" or shared transcripts between local and hosted execution environments.
You can find out not just what you did and did not do but why. It is possible to identify unexpectedly incomplete work streams, build a histogram of the times of day you get most irritated with the AI, etc.
I think it is very cool and I have a major release coming. I'd be very appreciative of any feedback.
They're cool demos/POCs of real-world things (and indeed are informative to people who haven't built AI tools). The very first version of Claude Code probably even looked a lot like this 200-line loop, but things have evolved significantly from there.
I don't think it serves the same purpose. Many people understand the difference between a 200-line Twitter prototype and the real deal.
But many of those may not understand what the LLM client tool does and how it relates to the LLM server. It is generally consumed as one magic black box.
This post isn't to tell us how everyone can build a production-grade Claude Code; it tells us what part is done by the CLI and what part is done by the LLM, which I think is a rather important ingredient in understanding the tools we are using, and how to use them.
How many lines would you estimate it takes to capture that production reality of something like CC? I ask because I got downvoted for asking that question on a different story[1].
I asked because in that thread someone quoted the CC dev(s) as saying:
>> In the last thirty days, I landed 259 PRs -- 497 commits, 40k lines added, 38k lines removed.
My feeling is that a tool like this, while it won't be 200 lines, can't really be 40k lines either.
[1] If anyone is interested, https://news.ycombinator.com/item?id=46533132
It’s telling that they can’t fix the screen flickering issue, claiming “the problem goes deep.”
For example there’s a session-search skill and corresponding agent that can do:
aichat search --json [search params]
So you can ask Claude Code to use the searcher agent to recover arbitrary context of prior work from any of your sessions, and build on that work in a new session.
This has enabled me to completely avoid compaction.[1] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
I came across this one the other day: https://github.com/kulesh/catsyphon
So, thanks for your comment and answering all the questions I had just now about "wait, did I wake up in a parallel universe where I didn't write the post but someone else did?"
I do wonder whether the path here was:
1) You wrote the article in April 2025
2) The next generation of LLMs trained on your article
3) The author of TFA had a similar idea, and heavily used LLMs to help write the code and the article, including asking the LLM to think of a catchy title. And guess what title the LLM comes up with?
There are also less charitable interpretations, of course. But I'd rather assume this was honestly, if sloppily, done.
Yes and no. OpenCode is a great example of yes. But at the same time Anthropic gets to develop both client and model together. They get to use the signals from the client, and "bake in" some of the things into the model. So their model will work best with their client, and somewhat less competently with other clients (you can kinda sorta see that today with Opus in cc vs. in Cursor).
For example, cc was (to my knowledge) the first client to add <system_reminder> tags from time to time. How often they're injected, how the model uses them, and so on - that's basically the "signals", and they work together. And it works beautifully, as cc seems to stay on task better than OpenCode while using the same model.
The other advantage Anthropic have is just that they can sell CC subscriptions at lower cost because they own the models. But that's a separate set of questions that don't really relate to technical capabilities.
Anyhow, to follow up on your point, I do find it surprising that Claude Code is still (it seems?) definitively leading the pack in terms of coding agents. I've tried Gemini CLI and Codex and they feel distinctly less good, but I'm surprised we haven't seen too many alternatives from small startups or open source projects rise to the top as well. After all, they can build on all the lessons learned from previous agents (UX, context management, features people like such as Skills etc.). Maybe we will see more of this in 2026.
A single person, probably not. But a group of dedicated FOSS developers who together build a wide community contributing to one open model that could be continuously upgraded? Maybe.
Fwiw, I found it funny how the article stuffs "smarter context management" into a breezy TODO bullet point at the end for going production-grade. I've been noticing a lot of NIH/DIY types believing they can do a good job of this and then, when forced to have results/evals that don't suck in production, losing the rest of the year on that step. (And it's even worse when they decide to fine-tune too.)
https://gist.github.com/wong2/e0f34aac66caf890a332f7b6f9e2ba...
I find it fascinating that while in theory one could just append these as reasoning tokens to the context, and trust the attention algorithm to find the most recent TODO list and attend actively to it... in practice, creating explicit tools that essentially do a single-key storage are far more effective and predictable. It makes me wonder how much other low-hanging fruit there is with tool creation for storing language that requires emphasis and structure.
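A toy version of that "single-key storage" idea as a tool pair might look like this. The names and schemas are mine, not TodoWrite/TodoRead's actual contract; the point is that the tool forces explicit, whole-list updates rather than trusting attention to find the latest list.

# Toy "single-key storage" tools in the spirit of TodoWrite/TodoRead.
_TODO_STORE = {"todos": []}

def todo_write(todos):
    """Replace the entire TODO list wholesale (one key, one value)."""
    _TODO_STORE["todos"] = list(todos)
    return {"ok": True, "count": len(todos)}

def todo_read():
    """Return the full current TODO list so it can be re-emphasized in context."""
    return {"todos": _TODO_STORE["todos"]}

# Schemas the model sees; structure (not attention) carries the emphasis.
TODO_TOOLS = [
    {"name": "todo_write",
     "description": "Overwrite the task list with the full, updated list.",
     "parameters": {"type": "object",
                    "properties": {"todos": {"type": "array",
                                             "items": {"type": "string"}}},
                    "required": ["todos"]}},
    {"name": "todo_read",
     "description": "Read back the current task list.",
     "parameters": {"type": "object", "properties": {}}},
]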
For coding, I actually fully take over the todo list in codex + claude: https://github.com/graphistry/pygraphistry/blob/master/ai/pr...
In Louie.ai, for investigations, we're experimenting with enabling more control of it, so you can go with the grain, vs. that kind of whole-cloth replacement.
And on a separate note - it looks like you're making a system for dealing with graph data at scale? Are you using LLMs primarily to generate code for new visualizations, or also to reason directly about each graph in question? To tie it all together, I've long been curious whether tools can adequately translate things from "graph space" to "language space" in the context of agentic loops. There seems to be tremendous opportunity in representing e.g. physical spaces as graphs, and if LLMs can "imagine" what would happen if they interacted with them in structured ways, that might go a long way towards autonomous systems that can handle truly novel environments.
Re: the other note - yes, we have 2 basic goals:
1. Louie to make graphs easier. Especially when connected to operational databases (splunk, kusto, elastic, big query, ...). V1 was generating graphistry viz & GFQL queries. We're now working on louie inside of graphistry, for more dynamic control of the visual analysis environment ("filter to X and color Y as Z"), and as you say, to go straight to the answer too ("what's going on with account/topic X").
2. Louie has been seeing wider market interest beyond graph, basically "AI that investigates" across those operational DBs (& live systems). You can think of it as: vibe coding is code-oriented, while Louie is vibe investigating, which is more data-oriented. Ex: native plans don't think in unit tests but in cross-validation, and instead of grepping 1,000 files, we get back a dataframe of 1M query results and pass that between the agents for localized agentic retrieval, rather than re-hammering the DB. The CCC talk gives a feel for this in the interactive setting.
It is a bit odd that Anthropic doesn't make that available more openly. Depending on your preferences there is stuff in the default system prompt that you may want to change.
I personally have a list of phrases that I patch out from the system prompt after each update by running sed on cc's main.js
> Only use emojis if the user explicitly requests it. Avoid adding emojis to files unless asked.
When did they add this? Real shame because the abundance of emojis in a readme was a clear signal of slop.
That's for complicated stuff. For throw-away stuff I don't need to maintain past 30 days like a script I'll just roll the dice and let it rip.
Like a manual subagents approach. I try not to pollute the Claude Code session context with meanderings too much. Do that in the chat and bring the condensed ideas over.
The folder will contain a plan file and a changelog. The LLM is asked to continuously update the changelog.
When I open a new chat, I attach the folder and say: onboard yourself on this feature then get back to me.
This way, it has context on what has been done, the attempts it made (including failed ones), the current status, and the chronological order of the changes (with the recent ones usually considered more authoritative).
What you need to do is to match the distribution of how the models were RL-ed. So you are right to say that "do X in 200 lines" is a very small part of the job to be done.
We're finding investigating to be same-but-different to coding. Probably the most close to ours that has a bigger evals community is AI SRE tasks.
Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists, and as the talk gives examples of, which kind of strategies to use with them.
I'm guessing a self-updating plan there is sufficient. I'm not actually convinced today's current plan <> todolist flow makes sense - in the linked PLAN.md, it gets unified, and that's how we do ai coding. I don't have evals on this, but from a year of vibes coding/engineering, that's what we experientially reached across frontier coding models & tools. Nowadays we're mixing in evals too, but that's a more complicated story.
We do a bit of model-per-task, like most calls are sending targeted & limited context fetches into faster higher-tier models (frontier but no heavy reasoning tokens), and occasional larger data dumps (logs/dataframes) sent into faster-and-cheaper models. Commercially, we're steering folks right now more to openai / azure openai models, but that's not at all inherent. OpenAI, Claude, and Gemini can all be made to perform well here using what the talk goes over.
Some of the discussion earlyish in the talk and Q&A after is on making OSS models production-grade for these kinds of investigation tasks. I find them fun to learn on and encourage homelab experiments, and for copilots, you can get mileage. For more heavy production efforts, I typically do not recommend them for most teams at this time for quality, speed, practicality, and budget reasons if they have the option to go with frontier models. However, some bigger shops are doing it, and I'd be happy to chat how we're approaching quality/speed/cost there (and we're looking for partners on making this easier for everyone!)
I just did an experiment yesterday with Opus 4.5 just operating in agent mode in vscode copilot. Handed it a live STS session for AWS to see if it could help us troubleshoot an issue. It was pretty remarkable seeing it chop down the problem space and arrive at an accurate answer in just a few mins.
I'll definitely check out the video later. Thanks!
In the CCC video, you may enjoy the section on how we are moving to eval-driven AI coding for how we more methodically improve agents. Even more so, the slides before on motivating why it gets harder to improve quality as you go on.
One big rub is that it's one of those areas where people grossly underestimate what is needed for the quality goals they're likely targeting, and, if it's a long-lived artifact to be maintained, the ongoing costs. It's similar to junior engineers or short-term contractors who have never had to build production-grade software and haven't had to live with their decisions: these are quite learnable engineering skills, and I've found it useful to burn your fingers before having confidence in the surprising weight of cost/benefit decisions. The more autonomy and expectations you are targeting for the agent, the more so.
This is supposed to be an emulator of Claude's own TodoWrite and TodoRead, which does a full update of todo.json for every task update. A nice use of composition with the edit tool - https://github.com/joehaddad2000/claude-todo-emulator
By extending the Claude Todo emulator, it was possible to make the agent come up with multi-step hierarchical plans, follow them, and track updates against them for use cases like on-call troubleshooting runbooks.
PS: the above open-source repo does not provide a single-task update as a tool, but that is not hard to implement on your own.
And in the event of context compression, the TODO serves as a compact representation of the session.
We did just that back then and it worked great; we used it in many projects after that.
- https://github.com/rcarmo/bun-steward
- https://github.com/rcarmo/python-steward (created with the first one)
And they're self-replicating!
I think it's a great way to dive into the agent world
This is a really nice open source coding agent implementation. The use of async is interesting.