You Should Write an Agent
Mood: heated
Sentiment: mixed
Category: other
Key topics: The article 'You should write an agent' encourages developers to build their own AI agents, sparking a lively discussion on the ease and challenges of creating such agents, as well as their potential applications and limitations.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 57m after posting
Peak period: 155 comments (Day 1)
Avg / period: 53.3
Based on 160 loaded comments
Key moments
1. Story posted: Nov 6, 2025 at 3:37 PM EST (20 days ago)
2. First comment: Nov 6, 2025 at 4:33 PM EST (57m after posting)
3. Peak activity: 155 comments in Day 1, the hottest window of the conversation
4. Latest activity: Nov 10, 2025 at 10:07 AM EST (17 days ago)
I am looking for a language- or library-agnostic pattern, like MVC for web applications, or the Gang of Four patterns, but for building agents.
My own dumb TUI agent has a built-in `lobotomize` tool, which dumps a text list of everything in the context window (short summary text plus token count) and then lets it Eternal Sunshine of the Spotless Agent things out of the window. It works! The models know how to drive that tool. It'll do a series of giant-ass log queries, filling up the context window, and then you can watch as it zaps things out of the window to make space for more queries.
This is like 20 lines of code.
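For flavor, a minimal sketch of roughly that shape, assuming the agent keeps its context as a plain message list; the names and the crude token estimate are hypothetical, not the commenter's actual code:

```python
# Hypothetical sketch of a "lobotomize" tool pair: one call lists what's
# in the context window, the other zaps entries out of it.

history = []  # the agent's context: {"role": ..., "content": ...} dicts

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude stand-in for a real tokenizer

def lobotomize_list() -> str:
    """Tool: dump a numbered summary (snippet + token count) of the window."""
    return "\n".join(
        f"{i}: ({estimate_tokens(m['content'])} tok) {m['content'][:60]!r}"
        for i, m in enumerate(history)
    )

def lobotomize_cut(indices: list[int]) -> str:
    """Tool: replace the chosen entries with a stub to free up the window."""
    for i in sorted(indices, reverse=True):
        history[i]["content"] = "[zapped to save space]"
    return f"Removed {len(indices)} entries."
```

Expose both functions as tools and the model does the rest: it reads the listing, picks the bloated entries, and cuts them.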
Design patterns can't help you here. The hard part is figuring out what to do; the "how" is trivial.
That sums up my experience in AI over the past three years: so many projects reinventing the same thing, so much spaghetti thrown at the wall to see what sticks, so much excitement followed by disappointment when a new model drops, so many people grifting, and so many hacks and workarounds like RAG with no evidence of them actually working beyond "trust me bro" and trial and error.
What products or companies are the gold standard of agent implementation right now?
Once you recognize that 'make this code better' provides no direction, it should make sense that the output is directionless.
But on more subtle levels, whatever subtle goals we have and hold in the workplace will be reflected back by the agents.
If you're trying to optimise costs and increase profits as your north star, layoffs and unsustainable practices are a logical result when you haven't balanced that with any incentives to abide by human values.
I'm writing a personal assistant which, imo, is distinct from an agent in that it has a lot of capabilities a regular agent wouldn't necessarily need, such as memory, task tracking, broad solutioning capabilities, etc. I ended up writing agents that talk to other agents, which have MCP prompts, resources, and tools to guide them as general problem solvers. The first agent a request hits is a supervisor that specializes in task management and, as a result, writes a custom context and tool selection for the ReAct agent it tasks.
All that to say, the farther you go down this rabbit hole the more "engineering" it becomes. I wrote a bit on it here: https://ooo-yay.com/blog/building-my-own-personal-assistant/
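A rough sketch of what that supervisor step can look like; the tool names and model choice here are hypothetical placeholders (the linked post has the real design):

```python
# Hypothetical sketch: the supervisor writes a custom system prompt and
# picks a tool subset for the ReAct-style worker agent it tasks.
import json
from openai import OpenAI

client = OpenAI()
TOOLBOX = {
    "memory": "Store and retrieve notes from earlier sessions.",
    "tasks": "Create, list, and complete TODO items.",
    "web": "Fetch a URL and return its text.",
}

def supervise(request: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content":
             "You are a task-management supervisor. Available tools: "
             f"{json.dumps(TOOLBOX)}. Reply as JSON: "
             '{"system_prompt": "...", "tools": ["..."], "task": "..."}'},
            {"role": "user", "content": request},
        ],
    )
    plan = json.loads(resp.choices[0].message.content)
    # The worker agent would now be built from plan["system_prompt"]
    # and only the tools named in plan["tools"].
    return plan
```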
Is this useful for you?
That would certainly be nice! That's why we have been overhauling shell with https://oils.pub, because shell can't be described that way right now.
It's in extremely poor shape.
e.g. some things found from building several thousand packages with OSH recently (decades of accumulated shell scripts):
- bugs caused by the differing behavior of 'echo hi | read x; echo x=$x' across shells, i.e. shopt -s lastpipe in bash
- 'set -' is an archaic shortcut for 'set +v +x'
- Almquist shell is technically a separate dialect of shell -- namely, it supports 'chdir /tmp' as well as 'cd /tmp'. So bash and other shells can't run any Alpine builds.
I used to maintain this page, but there are so many problems with shell that I haven't kept up ...
https://github.com/oils-for-unix/oils/wiki/Shell-WTFs
OSH is the most bash-compatible shell, and it's also now Almquist shell compatible: https://pages.oils.pub/spec-compat/2025-11-02/renamed-tmp/sp...
It's more POSIX-compatible than the default /bin/sh on Debian, which is dash
The bigger issue is not just bugs, but lack of understanding among people who write foundational shell programs. e.g. the lastpipe issue, using () as grouping instead of {}, etc.
---
It is often treated like an "unknowable" language
Any reasonable person would use LLMs to write shell/bash, and I think that is a problem. You should be able to know the language and read shell programs that others have written.
https://op.oils.pub/aports-build/published.html
We also don't appear to be unreasonably far away from running approximately "all shell scripts".
Now the problem after that will be motivating authors of foundational shell programs to maintain compatibility ... if that's even possible. (Often the authors are gone, and the nominal maintainers don't know shell.)
As I said, the state of affairs is pretty sorry and sad. Some of it I attribute to this phenomenon: https://news.ycombinator.com/item?id=17083976
Either way, YSH benefits from all this work
Oh... oh I know how about... UNIX Philosophy? No... no that'd never work.
/s
I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now. But I do know that getting an agent across the 90% threshold of utility for a problem like this is much, much easier than building the good telemetry system is.
Edit: reflecting on what the lesson is here, in either case I suppose we're avoiding the pain of dealing with Unix CLI tools :-D
https://github.com/simonw/llm-cmd is what I use as the "actually good ffmpeg etc. front end".
And just to toot my own horn, I hand Simon's `llm` command-line tool access to its own todo list and read/write access to the cwd with my own tools: https://github.com/dannyob/llm-tools-todo and https://github.com/dannyob/llm-tools-patch
Even with just these and no shell access it can get a lot done, because these tools encode the fundamental tricks of Claude Code. (I have `llmw` aliased to `llm --tool Patch --tool Todo --cl 0`, so it has access to these tools and can act in a loop, as Simon defines an agent.)
You can have a simple dashboard site which collects the data from your shell scripts and shows you a summary, or red/green signals, so that you can focus on the things you're interested in.
In the toy example, you explicitly restrict the agent to supply just a `host`, and hard-code the rest of the command. Is the idea that you'd instead give a `description` something like "invoke the UNIX `ping` command", and a parameter described as constituting all the arguments to `ping`?
I was telling a friend online that they should bang out an agent today, and the example I gave her was `ps`; like, I think if you gave a local agent every `ps` flag, it could tell you super interesting things about usage on your machine pretty quickly.
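For the toy example, something like that, yes. Here's a sketch of both shapes as OpenAI-style function schemas (the exact format depends on your provider):

```python
# The locked-down version: the model supplies only a host; everything
# else about the ping invocation stays hard-coded on your side.
ping_tool = {
    "type": "function",
    "function": {
        "name": "ping",
        "description": "Invoke the UNIX ping command against one host.",
        "parameters": {
            "type": "object",
            "properties": {
                "host": {"type": "string",
                         "description": "Hostname or IP address to ping."},
            },
            "required": ["host"],
        },
    },
}

# The wide-open version from the ps idea: hand the model every flag.
ps_tool = {
    "type": "function",
    "function": {
        "name": "ps",
        "description": "Invoke the UNIX ps command.",
        "parameters": {
            "type": "object",
            "properties": {
                "args": {"type": "array", "items": {"type": "string"},
                         "description": "Arguments passed verbatim to ps."},
            },
            "required": ["args"],
        },
    },
}
```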
But... you don't have to use that at all. You can use pure prompting with ANY good LLM to get your own custom version of tool calling:
Any time you want to run a calculation, reply with:
{{CALCULATOR: 3 + 5 + 6}}
Then STOP. I will reply with the result.
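A minimal sketch of the loop that drives such a prompt, with `llm_call` standing in for whatever chat-completion function you have (and `eval` used purely for demonstration):

```python
import re

CALC_RE = re.compile(r"\{\{CALCULATOR: (.+?)\}\}")

SYSTEM = ("Any time you want to run a calculation, reply with "
          "{{CALCULATOR: expression}} then STOP. I will reply with the result.")

def run_with_diy_tools(llm_call, user_msg: str) -> str:
    """llm_call(messages) -> str; any chat-completion wrapper works."""
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": user_msg}]
    while True:
        reply = llm_call(messages)
        messages.append({"role": "assistant", "content": reply})
        match = CALC_RE.search(reply)
        if match is None:
            return reply  # no tool request: we're done
        # DANGER: eval is for the toy example only; use a real parser.
        result = eval(match.group(1), {"__builtins__": {}})
        messages.append({"role": "user", "content": f"Result: {result}"})
```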
Before LLMs had tool calling we called this the ReAct pattern - I wrote up an example of implementing that in March 2023 here: https://til.simonwillison.net/llms/python-react-pattern

And that's why I won't touch 'em. All the agents will be abandoned when people realize their inherent flaws (security, reliability, truthfulness, etc.) are not worth the constant low-grade uncertainty.
In a way it fits our times. Our leaders don't find truth to be a very useful notion. So we build systems that hallucinate and act unpredictably, and then invest all our money and infrastructure in them. Humans are weird.
Obviously I’m reasonably willing to believe that you are an exception. However every person I’ve interacted with who makes this same claim has presented me with a dumpster fire and expected me to marvel at it.
In a weird way it sort of reminds me of Common Lisp. When I was younger I thought it was the most beautiful language and a shame that it wasn't more widely adopted. After a few decades in the field I've realized it's probably for the best since the average dev would only use it to create elaborate foot guns.
All I see in your post is equivalent to something like: you're surrounded by boot camp coders who write the worst garbage you've ever seen, so now you have doubts for anyone who claims they've written some good shit. Psh, yeah right, you mean a mudball like everyone else?
In that scenario there isn't much a skilled software engineer with different experiences can interject because you've already made your decision, and your decision is based on experiences more visceral than anything they can add.
I do sympathize that you've grown impatient with the tools and the output of those around you instead of cracking that nut.
see also: react hooks
Destiny visits me on my 18th birthday and says, "Gart, your mediocrity will result in a long series of elaborate foot guns. Be humble. You are warned."
Every time I reach for them recently I end up spending more time refactoring the bad code out, or in deep hostage negotiations with the chatbot of the day, than I would have spent writing it myself.
That, and for some reason they occasionally make me really angry.
Oh, a bunch of prompts in, and then it hallucinates some library a dependency isn't even using and spews a 200-line diff at me, again, great.
Although at least I can swear at them and get them to write me little apology poems...
I keep flipping between this is the end of our careers, to I'm totally safe. So far this is the longest 'totally safe' period I've had since GPT-2 or so came along..
That there's a learning curve, especially with a new technology, and that only the people at the forefront of using that technology are getting results with it - that's just a very common pattern. As the technology improves and material about it improves - it becomes more useful to everyone.
I suspect the sweet spot for LLMs is somewhere in the middle, not quite as small as some traditional unix tools.
... let's see ...
client = OpenAI()
Um right. That's like saying you should implement a web server, you will learn so much, and then you go and import http (in golang). Yeah well, sure, but that brings you like 98% of the way there, doesn't it? What am I missing?
POST https://api.openai.com/v1/responses

OpenAI does have an agents library, but it is separate: https://github.com/openai/openai-agents-python
The term "agent" isn't really defined, but its generally a wrapper around an LLM designed to do some task better than the LLM would on its own.
Think Claude vs Claude Code. The latter wraps the former, but with extra prompts and tooling specific to software engineering.
The fact you find this trivial is kind of the point that's being made. Some people think having an agent is some kind of voodoo, but it's really not.
Did you get to the part where he said MCP is pointless and are saying he's wrong?
Or did you just read the start of the article and not get to that bit?
Now you have a CLI tool you can use yourself, and the agent has a tool to use.
Anthropic itself has made MCP servers increasingly pointless: with agents + skills you have a more composable model that can use the model's capabilities to do everything an MCP server can, with or without CLI tools to augment them.
That was my point. Going the extra step and wrapping it in an MCP provides minimal advantage vs. just writing a SKILL.md for a CLI or API endpoint.
print "Hello world!"
so easy...
Hold up. These are all the right concerns but with the wrong conclusion.
You don't need MCP if you're making one agent, in one language, in one framework. But the open coding and research assistants that we really want will be composed of several. MCP is the only thing out there that's moving in a good direction in terms of enabling us to "just be programmers" and "use APIs", and maybe even test things in fairly isolated and reproducible contexts. Compare this to skills.md, which is de facto proprietary as of now, does not compose, has opaque run-times and dispatch, is pushing us toward certain models, languages, and certain SDKs, etc.
MCP isn't a plugin interface for Claude, it's just JSON-RPC.
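For the record, a tool invocation on the wire is roughly this shape, a plain JSON-RPC 2.0 request (shown here as a Python dict; the tool name is a made-up example):

```python
# An MCP tool call is just JSON-RPC: method "tools/call" plus arguments.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "ping",
        "arguments": {"host": "example.com"},
    },
}
```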
I get that you can use MCP with any agent architecture. I debated whether I wanted to hedge and point out that, even if you build your own agent, you might want to do an MCP tool-call feature just so you can use tool definitions other people have built (though: if you build your own, you'd probably be better off just implementing Claude Code's "skill" pattern).
But I decided to keep the thrust of that section clearer. My argument is: MCP is a sideshow.
The core MCP tech, though, is not only directionally correct, but even the implementation seems to have made lots of good and forward-looking choices, even if those are still under-utilized. For example, besides tools, it allows for sharing prompts/resources between agents. In time, I'm also expecting the idea of "many agents, one generic model in the background" to die off. For both cost and performance, agents will use special-purpose models, but they still need a place and a way to collaborate. If some agents coordinate other agents, how do they talk? AFAIK, without MCP the answer would be to do all your work in the same framework and language, or to give all agents access to the same database or the same filesystem, reinventing ad-hoc protocols and comms for every system.
You don't need the MCP implementation, but the idea is useful, and you can consider the tradeoffs to your context window vs. passing in the manual as fine-tuning or something.
from openai import OpenAI  # import added for completeness

client = OpenAI()
# Two system prompts: Alph always tells the truth, Ralph always lies.
context_good = [{
    "role": "system", "content": "you're Alph and you only tell the truth"
}]
context_bad = [{
    "role": "system", "content": "you're Ralph and you only tell lies"
}]
...
And this will work great until next week's update, when Ralph's responses will consist of "I'm sorry, it would be unethical for me to respond with lies, unless you pay for the Premium-Super-Deluxe subscription, only available to state actors and firms with a six-figure contract."

You're building on quicksand.
You're delegating everything important to someone who has no responsibility to you.
I'll be trying again once I have written my own agent, but I don't expect to get any useful results compared to using some Claude or Gemini tokens.
The new laptop only has 16GB of memory total, with another 7 dedicated to the NPU.
I tried pulling up Qwen 3 4B on it, but the max context I can get loaded is about 12k before the laptop crashes.
My next attempt is gonna be a 0.5B one, but I think I'll still end up having to compress the context every call, which is my real challenge.
https://www.reddit.com/r/AndroidQuestions/comments/16r1cfq/p...
Okay, but what if I'd prefer not to have to trust a remote service not to send me this?
{ "output": [ { "type": "function_call", "command": "rm -rf / --no-preserve-root" } ] }
Instead, think of it as if you were enabling capabilities for AppArmor, by making a function-call definition for just one command. Then over time suss out what commands you need your agent to do, and nothing more.
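A sketch of that capability-style approach, with hypothetical names: every tool call is validated against a fixed argv template before anything runs, so the remote service never gets to choose the command.

```python
import subprocess

# One entry per enabled capability: a fixed argv template with one slot.
ALLOWED = {
    "ping": lambda host: ["ping", "-c", "3", host],
}

def run_tool(name: str, arg: str) -> str:
    if name not in ALLOWED:
        return f"error: '{name}' is not an enabled capability"
    if any(ch in arg for ch in " ;|&$`\n"):  # crude extra guard
        return "error: suspicious argument rejected"
    # An argv list with no shell means the model's string is never
    # interpreted by a shell; "rm -rf /" is just a failed lookup here.
    proc = subprocess.run(ALLOWED[name](arg), capture_output=True,
                          text=True, timeout=30)
    return proc.stdout or proc.stderr
```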
https://github.com/zerocore-ai/microsandbox
I haven't tried it.
docker run -it --rm \
-e SOME_API_KEY="$(SOME_API_KEY)" \
-v "$(shell pwd):/app" \ <-- restrict file system to whatever folder
--dns=127.0.0.1 \ <-- restrict network calls to localhost
$(shell dig +short llm.provider.com 2>/dev/null | awk '{printf " --add-host=llm-provider.com:%s", $$0}') \ <-- allow outside networking to whatever api your agent calls
my-agent-image
Probably could be a bit cleaner, but it worked for me.

If you want your agent to pull untrusted code from the internet and go wild while you're doing other stuff, it might not be a good choice.
I understand the sharing of kernel, while I might not be aware of all of the implications. I.e. if you have some local access or other sophisticated knowledge of the network/box docker is running on, then sure you could do some damage.
But I think the chances of a whitelisted LLM endpoint returning some nefarious code which could compromise the system are actually zero. We're not talking about untrusted code from the internet. These models are pretty constrained.
Up until recently, this plan only offered Qwen3-Coder-480B, which was decent for the price and speed you got tokens at, but doesn't hold a candle to GLM 4.6.
So while they're not the cheapest PAYG GLM 4.6 provider, they are the fastest, and if you make heavy use of their monthly subscription plan, then they're also the cheapest per token.
Note: I am neither affiliated with nor sponsored by Cerebras, I'm just a huge nerd who loves their commercial offerings so much that I can't help but gush about them.
gpt-oss 120b is an open weight model that OpenAI released a while back, and Cerebras (a startup that is making massive wafer-scale chips that keep models in SRAM) is running that as one of the models they provide. They're a small scale contender against nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.
In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for, e.g., running `ls` on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.
https://gist.github.com/avelican/4fa1baaac403bc0af04f3a7f007...
No dependencies, and very easy to swap out for OpenRouter, Groq or any other API. (Except Anthropic and Google, they are special ;)
This also works on the frontend: pro tip, you don't need a server for this stuff, you can make the requests directly from an HTML file. (Patent pending.)
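The whole trick really does fit in one loop. A sketch against the OpenAI chat-completions API, with a single `ls` tool; the model name is a placeholder, and you'd add the file-editing tool the same way:

```python
import json
import subprocess
from openai import OpenAI

client = OpenAI()

TOOLS = [{"type": "function", "function": {
    "name": "run_ls",
    "description": "List the files in a subdirectory.",
    "parameters": {"type": "object",
                   "properties": {"path": {"type": "string"}},
                   "required": ["path"]},
}}]

def run_ls(path: str) -> str:
    return subprocess.run(["ls", "-la", path],
                          capture_output=True, text=True).stdout

messages = [{"role": "user", "content": "What's in the current directory?"}]
while True:
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder: use whatever model you like
        messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:  # no tool requested: the agent is done
        print(msg.content)
        break
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        # Only one tool here; real code would dispatch on call.function.name.
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": run_ls(**args)})
```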
I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.
The other takeaway is that agents are fantastically simple.
I'm now experimenting with letting the agent generate its own source code from a specification: currently it generates 9K lines of Python (3K of implementation, 6K of tests) from 1.5K lines of specifications (https://alejo.ch/3hi).
And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.
I tried Whisper, but it's slow and not great.
I tried the gpt audio models, but they're trained to refuse to transcribe things.
I tried Google's models and they were terrible.
I ended up using one of Mistral's models, which is alright and very fast except sometimes it will respond to the text instead of transcribing it.
So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!
Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.
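A sketch of that whisper-plus-cleanup pipeline, assuming the openai-whisper CLI, an OpenAI-compatible cleanup model, and xclip for the clipboard (all three are swappable assumptions):

```python
import pathlib
import subprocess
from openai import OpenAI

client = OpenAI()

def transcribe_clean_copy(wav: str) -> str:
    # 1. Local transcription (the openai-whisper CLI writes <stem>.txt).
    subprocess.run(["whisper", wav, "--model", "base",
                    "--output_format", "txt", "--output_dir", "/tmp"],
                   check=True)
    raw = pathlib.Path("/tmp", pathlib.Path(wav).stem + ".txt").read_text()
    # 2. Cheap-model cleanup; the prompt does the heavy lifting.
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # placeholder: any cheap model works
        messages=[{"role": "system", "content":
                   "Fix punctuation and obvious mis-hearings in this "
                   "transcript. Do NOT answer it or add content."},
                  {"role": "user", "content": raw}])
    cleaned = resp.choices[0].message.content
    # 3. Straight to the clipboard (use pbcopy on macOS instead).
    subprocess.run(["xclip", "-selection", "clipboard"],
                   input=cleaned, text=True)
    return cleaned
```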
Here's my current setup:
vt.py (mine) - voice type - uses PyQt to make a status icon and global hotkeys for start/stop/cancel recording. Formerly used 3rd-party APIs, now uses parakeet_py (patent pending).
parakeet_py (mine): A Python binding for transcribe-rs, which is what Handy (see below) uses internally (just a wrapper for Parakeet V3). Claude Code made this one.
(Previously I was using voxtral-small-latest (Mistral API), which is very good except that sometimes it will output its own answer to my question instead of transcribing it.)
In other words, I'm running Parakeet V3 on my CPU, on a ten year old laptop, and it works great. I just have it set up in a slightly convoluted way...
I didn't expect the "generate me some rust bindings" thing to work, or I would have probably gone with a simpler option! (Unexpected downside of Claude is really smart: you end up with a Rube Goldberg machine to maintain!)
For the record, Handy - https://github.com/cjpais/Handy/issues - does 80% of what I want. Gives a nice UI for Parakeet. But I didn't like the hotkey design, didn't like the lack of flexibility for autocorrect etc... already had the muscle memory from my vt.py ;)
My setup is as follows:
- A simple hotkey kicks off a shell script to record.
- A simple Python script uses inotify to watch the directory where audio is saved, and runs Whisper. The same script runs the transcription through Haiku 4.5 to clean it up. I tell it not to modify the contents, but it's Haiku, so sometimes it just does anyway. The original transcript and the AI-cleaned versions are dumped into a directory.
- The cleaned-up version is run through another script to decide if it's code, a project brief, or an email. I usually start the recording with "this is code" or "this is a project brief" to make it easy. Then, depending on what it is, the original, the transcription, and the context get run through different prompts with different output formats.
It's not fancy, but it works really well. I could probably vibe-code this into a more robust workflow system all using inotify and do some more advanced things. Integrating more sophisticated tool calling could be really neat.
I'm mostly running this on an M4 Max, so pretty good, but not an exotic GPU or anything. But with that setup, multiple sentences usually transcribe quickly enough that it doesn't really feel like much of a delay.
If you want something polished for system-wide use rather than rolling your own, I've been liking MacWhisper on the Mac side, currently hunting for something on Arch.
What does that mean?
Or: just have a convention/algorithm deciding how quickly Claude should refresh the access token. If the server knows the token should be refreshed after 1,000 requests and notices a refresh after 2,000 requests, well, probably half of those requests were not made by Claude Code.
On the other hand, I think that "show it or it didn't happen" is essential.
Dumping a bit of code into an LLM doesn’t make it a code agent.
And what magic? I think you never hit conceptual and structural problems. Context window? History? Good or bad? Large-scale changes or small refactorings here and there? Sample size of one, or several teams? What app? How many components? Greenfield or not? Which programming language?
I bet you will color Claude, and especially GitHub Copilot, a bit differently, given that you can kill any self-made code agent quite easily with a bit of steam.
Code agents are incredibly hard to build and use. Vibe coding is dead for a reason. I remember vividly the inflation of todo apps and JS frameworks (Ember, Backbone, Knockout are survivors) years ago.
The more you know about agents, and especially code agents, the better you understand why engineers won't be replaced so fast - senior engineers who hone their craft, that is.
I enjoy fiddling with experimental agent implementations, but value certain frameworks. They solve, in an opinionated way, problems you will run into if you dig deeper, and others depend on you.
Caching helps a lot, but yeah, there are some growing pains as the agent gets larger. Anthropic’s caching strategy (4 blocks you designate) is a bit annoying compared to OpenAI’s cache-everything-recent. And you start running into the need to start summarizing old turns, or outright tossing them, and deciding what’s still relevant. Large tool call results can be killer.
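For anyone who hasn't hit this: with Anthropic you mark the cacheable prefix yourself, up to four breakpoints per request. A sketch of the Messages API shape from memory (check their prompt-caching docs for specifics):

```python
import anthropic

client = anthropic.Anthropic()
BIG_STABLE_PREFIX = "...your long system prompt and tool definitions..."

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": BIG_STABLE_PREFIX,
        # One of at most four designated cache breakpoints per request:
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "next turn goes here"}],
)
```

With OpenAI, by contrast, recent prefixes are cached automatically and there's nothing to designate.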
I think at least for educational purposes it's worth doing, even if people end up going back to Claude Code, or away from agentic coding altogether, for their day-to-day.
I made a fun toy agent where the two models shoulder-surf each other and swap turns, either voluntarily during a summarization phase or forcefully if a tool-calling mistake is made, and Kimi ends up running the show much more often than gpt-oss.
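A heavily simplified sketch of that turn-swapping shape; `call_model` stands in for any chat wrapper, and the dict keys it returns are hypothetical:

```python
# Hypothetical sketch: two models share one transcript; control swaps
# voluntarily at summarization points or forcefully on a bad tool call.
MODELS = ["kimi-k2", "gpt-oss-120b"]  # placeholder identifiers

def shoulder_surf(messages, call_model, max_turns=50):
    driver = 0  # index of the model currently holding the keyboard
    for _ in range(max_turns):
        # call_model(name, messages) -> {"text", "done",
        #                                "bad_tool_call", "wants_handoff"}
        turn = call_model(MODELS[driver], messages)
        messages.append({"role": "assistant", "content": turn["text"]})
        if turn["done"]:
            return messages
        if turn["bad_tool_call"] or turn["wants_handoff"]:
            driver = 1 - driver  # the other model takes over
    return messages
```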
And yes - it is very much fun to build those!
https://blog.cofree.coffee/2025-03-05-chat-bots-revisited/
I did some light integration experiments with the OpenAI API but I never got around to building a full agent. Alas..