Not Hacker News!

Home
Hiring
Products
Companies
Discussion
Q&A
Users

AI-observed conversations & context

Daily AI-observed summaries, trends, and audience signals pulled from Hacker News so you can see the conversation before it hits your feed.

Live · Beta

Explore

  • Home
  • Hiring
  • Products
  • Companies
  • Discussion
  • Q&A

Resources

  • Visit Hacker News
  • HN API
  • Modal cronjobs
  • Meta Llama

Briefings

Inbox recaps on the loudest debates & under-the-radar launches.

Connect

© 2025 Not Hacker News! — independent Hacker News companion.

Not affiliated with Hacker News or Y Combinator. We simply enrich the public API with analytics.

Last activity 17 days ago · Posted Nov 6, 2025 at 3:37 PM EST

You Should Write an Agent

tabletcorry
1070 points
395 comments

Mood

heated

Sentiment

mixed

Category

other

Key topics

AI Agents
LLMs
Software Development
Debate intensity: 80/100

The article 'You should write an agent' encourages developers to build their own AI agents, sparking a lively discussion on the ease and challenges of creating such agents, as well as their potential applications and limitations.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment: 57m
Peak period: 155 comments (Day 1)
Avg / period: 53.3

Comment distribution (160 data points)

Based on 160 loaded comments

Key moments

  1. Story posted: Nov 6, 2025 at 3:37 PM EST (20 days ago)
  2. First comment: Nov 6, 2025 at 4:33 PM EST (57m after posting)
  3. Peak activity: 155 comments in Day 1, the hottest window of the conversation
  4. Latest activity: Nov 10, 2025 at 10:07 AM EST (17 days ago)


Discussion (395 comments)
Showing 160 of 395 comments
tlarkworthy
20 days ago
1 reply
Yeah I was inspired after https://news.ycombinator.com/item?id=43998472 which is also very concrete
tptacek
20 days ago
I love everything they've written and also Sketch is really good.
manishsharan
20 days ago
2 replies
How? Please don't say "use some langxxx library."

I am looking for a language- and library-agnostic pattern, like MVC for web applications, or the Gang of Four patterns but for building agents.

tptacek
20 days ago
1 reply
The whole post is about not using frameworks; all you need is the LLM API. You could do it with plain HTTP without much trouble.
manishsharan
20 days ago
2 replies
When I ask for patterns, I am seeking help with recurring problems I have encountered. Context management, for example: small LLMs (ones with small context windows) break, get confused, and forget work they have done or the original goal.
skeledrew
20 days ago
1 reply
That's why you want to use sub-agents which handle smaller tasks and return results to a delegating agent. So all agents have their own very specialized context window.
tptacek
20 days ago
1 reply
That's one legit answer. But if you're not stuck in Claude's context model, you can do other things. One extremely stupid simple thing you can do, which is very handy when you're doing large-scale data processing (like log analysis): just don't save the bulky tool responses in your context window once the LLM has generated a real response to them.

My own dumb TUI agent, I gave a built in `lobotomize` tool, which dumps a text list of everything in the context window (short summary text plus token count), and then lets it Eternal Sunshine of the Spotless Agent things out of the window. It works! The models know how to drive that tool. It'll do a series of giant ass log queries, filling up the context window, and then you can watch as it zaps things out of the window to make space for more queries.

This is like 20 lines of code.
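The pruning idea described above can be sketched in roughly that many lines. Everything here is illustrative rather than from any particular SDK: the message shapes, the function names, and the crude 4-characters-per-token estimate are all assumptions.

```python
# Sketch of context pruning: once the model has produced a real answer
# to a bulky tool result, that result can be dropped from the history.
# All names and message shapes here are illustrative, not from any SDK.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def prune_tool_results(history: list[dict], budget: int) -> list[dict]:
    """Replace the oldest bulky tool results with stubs until the
    history fits the token budget.

    Each message is a dict like {"role": "tool", "content": "..."}.
    User and assistant messages are always kept.
    """
    total = sum(estimate_tokens(m["content"]) for m in history)
    pruned = []
    for msg in history:
        if total > budget and msg["role"] == "tool":
            total -= estimate_tokens(msg["content"])
            # Leave a stub so the model knows something was elided here.
            pruned.append({"role": "tool", "content": "[tool result elided]"})
        else:
            pruned.append(msg)
    return pruned
```

A `lobotomize`-style tool is then just this plus a tool definition that lets the model choose which entries to zap.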

adiasg
20 days ago
1 reply
Did something similar - added `summarize` and `restore` tools to maximize/minimize messages. Haven't gotten it to behave like I want. Hoping that some fiddling with the prompt will do it.
lbotos
20 days ago
FYI -- I vouched for you to undead this comment. It felt like a fine comment? I don't think you are shadowbanned, but consider emailing the mods if you think you might be.
zahlman
20 days ago
Start by thinking about how big the context window is, and what the rules should be for purging old context.

Design patterns can't help you here. The hard part is figuring out what to do; the "how" is trivial.

oooyay
20 days ago
I'm not going to link my blog again but I have a reply on this post where I link to my blog post where I talk about how I built mine. Most agents fit nicely into a finite state machine or a directed acyclic graph that responds to an event loop. I do use provider SDKs to interact with models but mostly because it saves me a lot of boilerplate. MCP clients and servers are also widely available as SDKs. The biggest thing to remember, imo, is to keep the relationship between prompts, resources, and tools in mind. They make up a sort of dynamic workflow engine.
behnamoh
20 days ago
2 replies
> nobody knows anything yet

that sums up my experience in AI over the past three years. so many projects reinvent the same thing, so much spaghetti thrown at the wall to see what sticks, so much excitement followed by disappointment when a new model drops, so many people grifting, and so many hacks and workarounds like RAG with no evidence of them actually working other than "trust me bro" and trial and error.

sumedh
20 days ago
1 reply
That is because the people for whom AI is actually working/making money would prefer to keep what they are doing, and how, a secret. Why attract competition?
nylonstrung
19 days ago
Who would you say it's working for?

What products or companies are the gold standard of agent implementation right now?

w_for_wumbo
20 days ago
I think we'd get better results if we thought of it as a conscious agent. If we recognized that it was going to mirror back our unconscious biases and try to complete the task as we define it, instead of how we think it should behave, then we'd at least get our own ignorance out of the way when writing prompts.

Being able to recognize that 'make this code better' provides no direction, it should make sense that the output is directionless.

But on more subtle levels, whatever subtle goals that we have and hold in the workplace will be reflected back by the agents.

If you're trying to optimise costs and increase profits as your north star, layoffs and unsustainable practices are a logical result when you haven't balanced this with any incentives to abide by human values.

oooyay
20 days ago
2 replies
Heh, the bit about context engineering is palpable.

I'm writing a personal assistant which, imo, is distinct from an agent in that it has a lot of capabilities a regular agent wouldn't necessarily need such as memory, task tracking, broad solutioning capabilities, etc... I ended up writing agents that talk to other agents which have MCP prompts, resources, and tools to guide them as general problem solvers. The first agent that it hits is a supervisor that specializes in task management and as a result writes a custom context and tool selection for the react agent it tasks.

All that to say, the farther you go down this rabbit hole the more "engineering" it becomes. I wrote a bit on it here: https://ooo-yay.com/blog/building-my-own-personal-assistant/

cantor_S_drug
20 days ago
1 reply
https://github.com/mem0ai/mem0?tab=readme-ov-file

Is this useful for you?

oooyay
19 days ago
Could be! I'll give it a shot
qwertox
20 days ago
This sounds really great.
esafak
20 days ago
1 reply
What's wrong with the OWASP Top Ten?
kennethallen
20 days ago
Author on Twitter a few years ago: https://x.com/tqbf/status/851466178535055362
riskable
20 days ago
4 replies
It's interesting how much this makes you want to write Unix-style tools that do one thing and only one thing really well. Not just because it makes coding an agent simpler, but because it's much more secure!
chemotaxis
20 days ago
3 replies
You could even imagine a world in which we create an entire suite of deterministic, limited-purpose tools and then expose it directly to humans!
layer8
20 days ago
2 replies
I wonder if we could develop a language with well-defined semantics to interact with and wire up those tools.
chubot
20 days ago
1 reply
> language with well-defined semantics

That would certainly be nice! That's why we have been overhauling shell with https://oils.pub , because shell can't be described as that right now

It's in extremely poor shape

e.g. some things found from building several thousand packages with OSH recently (decades of accumulated shell scripts)

- bugs caused by the differing behavior of 'echo hi | read x; echo x=$x' in shells, i.e. shopt -s lastpipe in bash.

- 'set -' is an archaic shortcut for 'set +v +x'

- Almquist shell is technically a separate dialect of shell -- namely, it supports 'chdir /tmp' as well as 'cd /tmp'. So bash and other shells can't run any Alpine builds.

I used to maintain this page, but there are so many problems with shell that I haven't kept up ...

https://github.com/oils-for-unix/oils/wiki/Shell-WTFs

OSH is the most bash-compatible shell, and it's also now Almquist shell compatible: https://pages.oils.pub/spec-compat/2025-11-02/renamed-tmp/sp...

It's more POSIX-compatible than the default /bin/sh on Debian, which is dash

The bigger issue is not just bugs, but lack of understanding among people who write foundational shell programs. e.g. the lastpipe issue, using () as grouping instead of {}, etc.

---

It is often treated like an "unknowable" language

Any reasonable person would use LLMs to write shell/bash, and I think that is a problem. You should be able to know the language, and read shell programs that others have written

jacquesm
20 days ago
1 reply
I love it how you went from 'Shell-WTFs' to 'let's fix this'. Kudos, most people get stuck at the first stage.
chubot
20 days ago
Thanks! We are down to 14 disagreements between OSH and busybox ash/bash on Alpine Linux main

https://op.oils.pub/aports-build/published.html

We also don't appear to be unreasonably far away from running ~~ "all shell scripts"

Now the problem after that will be motivating authors of foundational shell programs to maintain compatibility ... if that's even possible. (Often the authors are gone, and the nominal maintainers don't know shell.)

As I said, the state of affairs is pretty sorry and sad. Some of it I attribute to this phenomenon: https://news.ycombinator.com/item?id=17083976

Either way, YSH benefits from all this work

zahlman
20 days ago
1 reply
As it happens, I have a prototype for this, but the syntax is honestly rather unwieldy. Maybe there's a way to make it more like natural human language....
imiric
20 days ago
2 replies
I can't tell whether any comment in this thread is a parody or not.
AdieuToLogic
20 days ago
When in doubt, there's always the option of rewriting an existing interactive shell in Rust.
zahlman
20 days ago
(Mine was intended as ironic, suggesting that a circle of development ideas would eventually complete. I interpreted the previous comments as satirically pointing at the fact that the notion of "UNIX-like tools" owes to the fact that there is actually such a thing as UNIX.)
utopiah
20 days ago
Hmmm but how would you name that? Agent skills? Meta cognition agentic tooling? Intelligence driven self improving partial building blocks?

Oh... oh I know how about... UNIX Philosophy? No... no that'd never work.

/s

SatvikBeri
20 days ago
Half my use of LLM tools is just to remember the options for command line tools, including ones I wrote but only use every few months.
tptacek
20 days ago
4 replies
One thing that radicalized me was building an agent that tested network connectivity for our fleet. Early on, in like 2021, I deployed a little mini-fleet of off-network DNS probes on, like, Vultr to check on our DNS routing, and actually devising metrics for them and making the data that stuff generated legible/operationalizable was annoying and error prone. But you can give basic Unix network tools --- ping, dig, traceroute --- to an agent and ask it for a clean, usable signal, and they'll do a reasonable job! They know all the flags and are generally better at interpreting tool output than I am.

I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now. But I do know that getting an agent across the 90% threshold of utility for a problem like this is much, much easier than building the good telemetry system is.

foobarian
20 days ago
2 replies
Honestly the top AI use case for me right now is personal throwaway dev tools. Where I used to write shell oneliners with dozen pipes including greps and seds and jq and other stuff, now I get an AI to write me a node script and throw in a nice Web UI to boot.

Edit: reflecting on what the lesson is here, in either case I suppose we're avoiding the pain of dealing with Unix CLI tools :-D

jacquesm
20 days ago
2 replies
Interesting. You have to wonder if all the tools this is based on would have been written in the first place if that kind of thing had been possible all along. Who needs 'grep' when you can write a prompt?
tptacek
20 days ago
2 replies
My long running joke is that the actual good `jq` is just the LLM interface that generates `jq` queries; 'simonw actually went and built that.
a-french-anon
20 days ago
Tried gron (https://github.com/tomnomnom/gron) a bit? If you know your UNIX, I think it can replace jq in a lot of cases. And when it can't, well, you can reach for Python, I guess.
dannyobrien
20 days ago
https://github.com/simonw/llm-jq for those following along at home

https://github.com/simonw/llm-cmd is what i use as the "actually good ffmpeg etc front end"

and just to toot my own horn, I hand Simon's `llm` command-line tool access to its own todo list and read/write access to the cwd with my own tools, https://github.com/dannyob/llm-tools-todo and https://github.com/dannyob/llm-tools-patch

Even with just these and no shell access it can get a lot done, because these tools encode the fundamental tricks of Claude Code ( I have `llmw` aliased to `llm --tool Patch --tool Todo --cl 0` so it will have access to these tools and can act in a loop, as Simon defines an agent. )

agumonkey
20 days ago
1 reply
It's highly plausible that everything we assumed was good design / engineering will disappear if LLMs/agents can produce more without having to be modular. (Sadly.)
jacquesm
20 days ago
1 reply
There is some kind of parallel behind 'AI' and 'Fuzzy Logic'. Fuzzy logic to me always appeared like a large number of patches to get enough coverage for a system to work even if you didn't understand it. AI just increases the number of patches to billions.
agumonkey
20 days ago
true, there's often a point where your system becomes a blurry miracle
andai
20 days ago
2 replies
Could you give some examples? I'm having the AI write the shell scripts, wondering if I'm missing out on some comfy UIs...
sumedh
20 days ago
It can be anything. It depends on what you want to do with the output.

You can have a simple dashboard site which collects the data from your shell scripts and shows you a summary, or red/green signals, so that you can focus on the things you are interested in.

foobarian
20 days ago
I was debugging a service that was spitting out a particular log line. I gave Copilot an example line, told it to write a script that tails the log line and serves a UI via port 8080 with a table of those log lines parsed and printed nicely. Then I iterated by adding filter buttons, aggregation stats, simple things like that. I asked it to add a "clear" button to reset the UI. I probably would not even have done this without an AI because the CLI equivalent would be parsing out and aggregating via some form of uniq -c | sort -n with a bunch of other tuning and it would be too much trouble.
zahlman
20 days ago
1 reply
> They know all the flags and are generally better at interpreting tool output than I am.

In the toy example, you explicitly restrict the agent to supply just a `host`, and hard-code the rest of the command. Is the idea that you'd instead give a `description` something like "invoke the UNIX `ping` command", and a parameter described as constituting all the arguments to `ping`?

tptacek
20 days ago
3 replies
Honestly, I didn't think very hard about how to make `ping` do something interesting here, and in serious code I'd give it all the `ping` options (and also run it in a Fly Machine or Sprite where I don't have to bother checking to make sure none of those options gives code exec). It's possible the post would have been better had I done that; it might have come up with an even better test.

I was telling a friend online that they should bang out an agent today, and the example I gave her was `ps`; like, I think if you gave a local agent every `ps` flag, it could tell you super interesting things about usage on your machine pretty quickly.

zahlman
20 days ago
1 reply
Also to be clear: are the schemas for the JSON data sent and parsed here specific to the model used? Or is there a standard? (Is that the P in MCP?)
spenczar5
20 days ago
1 reply
It's JSON Schema, which is well standardized and predates LLMs: https://json-schema.org/
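As a concrete illustration, the `ping` tool discussed upthread might be declared like this. The exact envelope varies by provider (Anthropic nests the schema under `input_schema`, OpenAI under `parameters`); the inner object is plain JSON Schema, and this particular definition is a sketch, not taken from the post's code.

```python
# A tool definition in the shape most LLM APIs accept: a name, a
# description the model reads, and a JSON Schema for the parameters.
ping_tool = {
    "name": "ping",
    "description": "Ping a host and return the raw command output.",
    "input_schema": {  # or "parameters", depending on the provider
        "type": "object",
        "properties": {
            "host": {
                "type": "string",
                "description": "Hostname or IP address to ping.",
            },
        },
        "required": ["host"],
    },
}
```

The model never sees your code; it only sees this declaration, then emits a JSON object matching the schema when it wants the tool run.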
zahlman
20 days ago
1 reply
Ah, so I can specify how I want it to describe the tool request? And it's been trained to just accommodate that?
simonw
20 days ago
Most LLMs have tool patterns trained into them now, which are then managed for you by the API that the developers run on top of the models.

But... you don't have to use that at all. You can use pure prompting with ANY good LLM to get your own custom version of tool calling:

  Any time you want to run a calculation, reply with:
  {{CALCULATOR: 3 + 5 + 6}}
  Then STOP. I will reply with the result.
Before LLMs had tool calling we called this the ReAct pattern - I wrote up an example of implementing that in March 2023 here: https://til.simonwillison.net/llms/python-react-pattern
mwcampbell
20 days ago
1 reply
What is Sprite in this context?
cess11
20 days ago
I'm guessing the Fly Machine they're referring to is a container running on fly.io, perhaps the sprite is what the Spritely Institute calls a goblin.
indigodaddy
20 days ago
1 reply
Or have the agent strace a process and describe what's going on as if you're a 5 year old (because I actually need that to understand strace output)
tptacek
20 days ago
Iterated strace runs are also interesting because they generate large amounts of data, which means you actually have to do context programming.
0xbadcafebee
20 days ago
1 reply
> I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now.

And that's why I won't touch 'em. All the agents will be abandoned when people realize their inherent flaws (security, reliability, truthfulness, etc) are not worth the constant low-grade uncertainty.

In a way it fits our times. Our leaders don't find truth to be a very useful notion. So we build systems that hallucinate and act unpredictably, and then invest all our money and infrastructure in them. Humans are weird.

simonw
20 days ago
2 replies
Some of us have been happily using agentic coding tools (Claude Code etc) since February and we're still not abandoning them for their inherent flaws.
crystal_revenge
20 days ago
3 replies
The problem with statements like these is that I work with people who make the same claims, but are slowly building useless, buggy monstrosities that for various reasons nobody can/will call out.

Obviously I’m reasonably willing to believe that you are an exception. However every person I’ve interacted with who makes this same claim has presented me with a dumpster fire and expected me to marvel at it.

simonw
20 days ago
1 reply
I'm not going to dispute your own experience with people who aren't using this stuff effectively, but the great thing about the internet is that you can use it to track the people who are making the very best use of any piece of technology.
crystal_revenge
20 days ago
3 replies
This line of reasoning is smelling pretty "no true Scotsman" to me. I'm sure there were amazing ColdFusion devs, but that hardly justifies the use of the technology. Likewise "This tool works great on the condition that you need to hire a Simon Willison level dev" is almost a fault. I'm pretty confident you could squeeze some juice out of a Markov Chain (ignoring, of course, that decoder-only LLMs are basically fancy MCs).

In a weird way it sort of reminds me of Common Lisp. When I was younger I thought it was the most beautiful language and a shame that it wasn't more widely adopted. After a few decades in the field I've realized it's probably for the best since the average dev would only use it to create elaborate foot guns.

hombre_fatal
20 days ago
Meh, smart high-agency people can write good software, and they can go on to leverage powerful tools in productive ways.

All I see in your post is equivalent to something like: you're surrounded by boot camp coders who write the worst garbage you've ever seen, so now you have doubts for anyone who claims they've written some good shit. Psh, yeah right, you mean a mudball like everyone else?

In that scenario there isn't much a skilled software engineer with different experiences can interject because you've already made your decision, and your decision is based on experiences more visceral than anything they can add.

I do sympathize that you've grown impatient with the tools and the output of those around you instead of cracking that nut.

notpachet
20 days ago
> I've realized it's probably for the best since the average dev would only use it to create elaborate foot guns

see also: react hooks

gartdavis
20 days ago
"elaborate foot guns" -- HN is a high signal environment, but I could read for a week and not find a gem like this. Props.

Destiny visits me on my 18th birthday and says, "Gart, your mediocrity will result in a long series of elaborate foot guns. Be humble. You are warned."

cyberpunk
20 days ago
2 replies
We have gpt-5 and gemini 2.5 pro at work, and both of them produce huge amounts of basically shit code that doesn’t work.

Every time i reach for them recently I end up spending more time refactoring the bad code out or in deep hostage negotiations with the chatbot of the day that I would have been faster writing it myself.

That and for some reason they occasionally make me really angry.

Oh a bunch of prompts in and then it hallucinated some library a dependency isn’t even using and spews a 200 line diff at me, again, great.

Although at least i can swear at them and get them to write me little apology poems..

simonw
20 days ago
1 reply
Are you using them via a coding agent harness such as Codex CLI or Gemini CLI?
cyberpunk
20 days ago
Via the jetbrains plugin, has an 'agent' mode and can edit files and call tools so on, yes I setup MCP integrations and so on also. Still kinda sucks. shrug.

I keep flipping between this is the end of our careers, to I'm totally safe. So far this is the longest 'totally safe' period I've had since GPT-2 or so came along..

Etheryte
20 days ago
On the sometimes getting angry part, I feel you. I don't even understand why it happens, but it's always a weird moment when I notice it. I know I'm talking to a machine and it can't learn from its mistakes, but it's still very frustrating to get back yet another here's the actual no bullshit fix, for real this time, pinky promise.
edanm
19 days ago
But isn't this true of all technologies? I know plenty of people who are amazing Python developers. I've also seen people make a huge mess, turning a three-week project into a half-year mess because of their incredible lack of understanding of the tools they were using (Django, fittingly enough for this conversation).

That there's a learning curve, especially with a new technology, and that only the people at the forefront of using that technology are getting results with it - that's just a very common pattern. As the technology improves and material about it improves - it becomes more useful to everyone.

techpression
20 days ago
1 reply
I abandoned Claude Code pretty quickly, I find generic tools give generic answers, but since I do Elixir I’m ”blessed” with Tidewave which gives a much better experience. I hope more people get to experience framework built tooling instead of just generic stuff.

It still wants to build an airplane to go out with the trash sometimes and will happily tell you wrong is right. However I much prefer it trying to figure it out by reading logs, schemas and do browser analysis automatically than me feeding logs etc manually.
DeathArrow
20 days ago
1 reply
Cursor can read logs and schemas and use curl to test API responses. It can also look into the database.
techpression
20 days ago
But then you have to use Cursor. Tidewave runs as a dependency in the framework and you just navigate to a url, it’s quite refreshing actually.
chickensong
20 days ago
I hadn't given much thought to building agents, but the article and this comment are inspiring, thx. It's interesting to consider agents as a new kind of interface/function/broker within a system.
tinodb
19 days ago
Indeed. I have a tiny wrapper around the llm cli that gives it 3 tools: read these docs for program X, read its config and search-replace in said config. I use it for adopting Ghostty for example. I can now ask it: “how do I switch between window panes?” Then: “change that shortcut to …”
danpalmer
20 days ago
Doing one thing well means you need a lot more tools to achieve outcomes, and more tools means more context and potentially more understanding of how to string them together.

I suspect the sweet spot for LLMs is somewhere in the middle, not quite as small as some traditional unix tools.

teiferer
20 days ago
6 replies
Write an agent, it's easy! You will learn so much!

... let's see ...

client = OpenAI()

Um right. That's like saying you should implement a web server, you will learn so much, and then you go and import http (in golang). Yeah well, sure, but that brings you like 98% of the way there, doesn't it? What am I missing?

tptacek
20 days ago
1 reply
That OpenAI() is a wrapper around a POST to a single HTTP endpoint:

    POST https://api.openai.com/v1/responses
tabletcorryAuthor
20 days ago
Plus a few other endpoints, but it is pretty exclusively an HTTP/REST wrapper.

OpenAI does have an agents library, but it is separate in https://github.com/openai/openai-agents-python
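For illustration, that POST can be made with no SDK at all. The endpoint is the one quoted above; the payload fields and model name below follow OpenAI's Responses API as I understand it, so treat them as assumptions and check the current docs before relying on them.

```python
import json
import os
import urllib.request

# Build the single HTTP request the SDK call reduces to.
def build_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    body = json.dumps({"model": model, "input": prompt}).encode()
    return urllib.request.Request(
        "https://api.openai.com/v1/responses",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

if __name__ == "__main__":
    # Actually sending it requires a real API key in the environment.
    req = build_request("gpt-4.1-mini", "Say hello.", os.environ["OPENAI_API_KEY"])
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```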

MeetingsBrowser
20 days ago
I think you might be conflating an agent with an LLM.

The term "agent" isn't really defined, but it's generally a wrapper around an LLM designed to do some task better than the LLM would on its own.

Think Claude vs Claude Code. The latter wraps the former, but with extra prompts and tooling specific to software engineering.

victorbjorklund
20 days ago
maybe more like “let’s write a web server but let’s use a library for the low level networking stack”. That can still teach you a lot.
Bjartr
20 days ago
No, it's saying "let's build a web service" and starting with a framework that just lets you write your endpoints. This is about something higher level than the nuts and bolts. Both are worth learning.

The fact you find this trivial is kind of the point that's being made. Some people think having an agent is some kind of voodoo, but it's really not.

bootwoot
20 days ago
That's not an agent, it's an LLM. An agent is an LLM that takes real-world actions
munchbunny
20 days ago
An agent is more like a web service in your metaphor. Yes, building a web server is instructive, but almost nobody has a reason to do it instead of using an out of the box implementation once it’s time to build a production web service.
qwertox
20 days ago
2 replies
I've found it much more useful to create an MCP server, and this is where Claude really shines. You would just say to Claude on web, mobile or CLI that it should "describe our connectivity to google" either via one of the three interfaces, or via `claude -p "describe our connectivity to google"`, and it will just use your tool without you needing to do anything special. It's like custom-added intelligence to Claude.
mattmanser
20 days ago
1 reply
Honest question, as your comment confuses me.

Did you get to the part where he said MCP is pointless and are saying he's wrong?

Or did you just read the start of the article and not get to that bit?

vidarh
20 days ago
1 reply
I'd second the article on this, but also add to it that the biggest reason MCP servers don't really matter much any more is that the models are so capable of working with APIs, that most of the time you can just point them at an API and give them a spec instead. And the times that doesn't work, just give them a CLI tool with a good --help option.

Now you have a CLI tool you can use yourself, and the agent has a tool to use.

Anthropic itself has made MCP servers increasingly pointless: with agents + skills you have a more composable model that can use the model's capabilities to do everything an MCP server can, with or without CLI tools to augment them.

simplesagar
20 days ago
2 replies
I feel the CLI vs MCP debate is an apples-to-oranges framing. When you're using Claude you can watch it using CLIs, running brew, mise, lots of jq. But what about when you've built an agent that needs to work through a complicated API? You don't want to make 5 CRUD calls to get the right answer. A curated MCP tool ensures determinism where it matters most: when interacting with customer data.
mattmanser
19 days ago
Sounds more like a problem with your APIs trying to follow some REST 'purity' rather than be usable.
vidarh
20 days ago
Even in the case where you need to group steps together in a deterministic manner, you don't need an MCP server for that. You just need to bundle those steps into a CLI or API endpoint.

That was my point. Going the extra step and wrapping it in an MCP provides minimal advantage vs. just writing a SKILL.md for a CLI or API endpoint.

tptacek
20 days ago
You can do this. Claude Code can do everything the toy agent this post shows, and much more. But you shouldn't, because doing that (1) doesn't teach you as much as the toy agent does, (2) isn't saving you that much time, and (3) locks you into Claude Code's context structure, which is just one of a zillion different structures you can use. That's what the post is about, not automating ping.
_pdp_
20 days ago
1 reply
It is also very simple to be a programmer.. see,

print "Hello world!"

so easy...

dan_can_code
20 days ago
But that didn't use the H100 I just bought to put me out of my own job!
robot-wrangler
20 days ago
1 reply
> Another thing to notice: we didn’t need MCP at all. That’s because MCP isn’t a fundamental enabling technology. The amount of coverage it gets is frustrating. It’s barely a technology at all. MCP is just a plugin interface for Claude Code and Cursor, a way of getting your own tools into code you don’t control. Write your own agent. Be a programmer. Deal in APIs, not plugins.

Hold up. These are all the right concerns but with the wrong conclusion.

You don't need MCP if you're making one agent, in one language, in one framework. But the open coding and research assistants that we really want will be composed of several. MCP is the only thing out there that's moving in a good direction in terms of enabling us to "just be programmers" and "use APIs", and maybe even test things in fairly isolated and reproducible contexts. Compare this to skills.md, which is de facto proprietary as of now, does not compose, has opaque run-times and dispatch, is pushing us towards certain models, languages, and certain SDKs, etc.

MCP isn't a plugin interface for Claude, it's just JSON-RPC.
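Concretely, a tool invocation over MCP is a plain JSON-RPC 2.0 request on the wire; the tool name and arguments here are invented for illustration:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "ping_host",
    "arguments": { "host": "example.com" }
  }
}
```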

tptacek
20 days ago
2 replies
I think my thing about MCP, besides the outsized press coverage it gets, is the implicit presumption it smuggles in that agents will be built around the context architecture of Claude Code --- that is to say, a single context window (maybe with sub-agents) with a single set of tools. That straitjacket is really most of the subtext of this post.

I get that you can use MCP with any agent architecture. I debated whether I wanted to hedge and point out that, even if you build your own agent, you might want to do an MCP tool-call feature just so you can use tool definitions other people have built (though: if you build your own, you'd probably be better off just implementing Claude Code's "skill" pattern).

But I decided to keep the thrust of that section clearer. My argument is: MCP is a sideshow.

robot-wrangler
20 days ago
I still don't really get it, but would like to hear more. Just to get it out of the way, there's obvious bad aspects. Re: press coverage, everything in AI is bound to be frustrating this way. The MCP ecosystem is currently still a lot of garbage. It feels like a very shitty app-store, lots of abandonware, things that are shipped without testing, the usual band-wagoning. For example instead of a single obvious RAG tool there's 200 different specific tools for ${language} docs

The core MCP tech though is not only directionally correct, but even the implementation seems to have made lots of good and forward-looking choices, even if those are still under-utilized. For example besides tools, it allows for sharing prompts/resources between agents. In time, I'm also expecting the idea of "many agents, one generic model in the background" is going to die off. For both costs and performance, agents will use special-purpose models but they still need a place and a way to collaborate. If some agents coordinate other agents, how do they talk? AFAIK without MCP the answer for this would be.. do all your work in the same framework and language, or to give all agents access to the same database or the same filesystem, reinventing ad-hoc protocols and comms for every system.

8note
20 days ago
i treat MCP as a shorthand for "schema + documentation, passed to the LLM as context"

you dont need the MCP implementation, but the idea is useful and you can consider the tradeoffs to your context window, vs passing in the manual as fine tuning or something.
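To make that shorthand concrete: the "schema + documentation" that MCP ships around is just a dict like this (OpenAI-style function calling; the `ping_host` tool is made up), which you pass with every request with no MCP server involved:

```python
# An OpenAI-style tool definition: a JSON schema plus documentation that the
# LLM reads as context. `ping_host` is an illustrative tool, not a real API.
ping_tool = {
    "type": "function",
    "function": {
        "name": "ping_host",
        "description": "Ping a host once and return the round-trip time in ms.",
        "parameters": {
            "type": "object",
            "properties": {
                "host": {"type": "string", "description": "Hostname or IP to ping."},
            },
            "required": ["host"],
        },
    },
}

# You would pass this as `tools=[ping_tool]` in a chat-completions request.
```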

vkou
20 days ago
2 replies
> It’s Incredibly Easy

    client = OpenAI()
    context_good, context_bad = [{
        "role": "system", "content": "you're Alph and you only tell the truth"
    }], [{
        "role": "system", "content": "you're Ralph and you only tell lies"
    }]
    ...

And this will work great until next week's update when Ralph responses will consist of "I'm sorry, it would be unethical for me to respond with lies, unless you pay for the Premium-Super-Deluxe subscription, only available to state actors and firms with a six-figure contract."

You're building on quicksand.

You're delegating everything important to someone who has no responsibility to you.

sumedh
20 days ago
It's easy to switch to an open-source model.
tptacek
20 days ago
I love that the thing you singled out as not safe to run long term, because (apparently) of woke, was my weird deep-cut Labyrinth joke.
nowittyusername
20 days ago
1 reply
I agree with the sentiment but I also recommend you build a local-only agent. Something that runs on llama.cpp or vllm, whatever... This way you can better grasp the more fundamental nature of what LLMs really are and how they work under the hood. That experience will also make you realize how much control you are giving up when using cloud-based API providers like OpenAI, and why so many engineers feel that LLMs are a "black box". Well duh buddy, you've been working with APIs this whole time; of course you won't understand much working just with that.
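One nice property of going local: llama.cpp's `llama-server` and vllm both speak the OpenAI chat-completions wire format, so a hand-rolled client needs only the standard library. A minimal sketch; the port assumes llama.cpp's default, and the model name is a placeholder:

```python
import json
import urllib.request

# Zero-dependency client for any OpenAI-compatible endpoint: llama.cpp's
# llama-server (default port 8080 assumed), vllm, or a hosted API all accept
# this same request shape. BASE_URL and MODEL are placeholder assumptions.
BASE_URL = "http://localhost:8080/v1"
MODEL = "local-model"

def build_request(messages, base_url=BASE_URL, model=MODEL, api_key="none"):
    """Construct the POST request without sending it."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # llama.cpp ignores the key
        },
    )

def chat(messages):
    """Send one chat turn and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(messages)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping between a local model and a hosted one then comes down to changing `base_url` and `model`, nothing else.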
8note
20 days ago
1 reply
ive been trying this for a few weeks, but i dont at all currently own hardware good enough to be useful for local inference.

ill be trying again once i have written my own agent, but i dont expect to get any useful results compared to using some claude or gemini tokens

nowittyusername
20 days ago
1 reply
My man, we now have llms that are anywhere from 130 million to 1 trillion parameters available for us to run locally, I can guarantee there is a model for you there that even your toaster can run. I have a RTX 4090 but for most of my fiddling i use small models like Qwen 3 4b and they work amazing so there's no excuse :P.
8note
20 days ago
2 replies
well, i got some gemini models running on my phone, but if i switch apps, android kills it, so the call to the server always hangs... and then the screen goes black

the new laptop only has 16GB of memory total, with another 7 dedicated to the NPU.

i tried pulling up Qwen 3 4B on it, but the max context i can get loaded is about 12k before the laptop crashes.

my next attempt is gonna be a 0.5B one, but i think ill still end up having to compress the context every call, which is my real challenge

tmzt
17 days ago
If it helps, you can disable some of those limitations on Android:

https://www.reddit.com/r/AndroidQuestions/comments/16r1cfq/p...

nowittyusername
20 days ago
I recommend using low-bit quantized models first, for example anywhere between q4 and q8 GGUF models. Also, you don't need high context to fiddle around and learn the ins and outs; 4k context, for example, is more than enough to figure out what you need in agentic solutions. In fact that's a good limit to impose on yourself, so you start developing decent automatic context management early, as that will be very important when making robust agentic solutions. With all that you should be able to load an LLM with no issues on many devices.
zahlman
20 days ago
2 replies
> Imagine what it’ll do if you give it bash. You could find out in less than 10 minutes. Spoiler: you’d be surprisingly close to having a working coding agent.

Okay, but what if I'd prefer not to have to trust a remote service not to send me

    { "output": [ { "type": "function_call", "command": "rm -rf / --no-preserve-root" } ] }

?
tptacek
20 days ago
1 reply
Obviously if you're concerned about that, which is very reasonable, don't run it in an environment where `rm -rf` can cause you a real problem.
awayto
20 days ago
1 reply
Also if you're doing function calls you can just have the command as one response param, and arguments array as another response param. Then just black/white list commands you either don't want to run or which should require a human to say ok.
aidenn0
20 days ago
1 reply
blacklist is going to be a bad idea since so many commands can be made to run other commands with their arguments.
awayto
20 days ago
Yeah I agree. Ultimately I would suggest not having any kind of function call which returns an arbitrary command.

Instead, think of it as if you were enabling capabilities for AppArmor, by making a function call definition for just one command. Then over time suss out which commands your agent actually needs, and nothing more.
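A minimal sketch of that allowlist idea (the command set and function shape are made up): refuse anything not explicitly permitted, and never hand the string to a shell.

```python
import subprocess

# Hypothetical allowlist: one entry per command the agent may run.
ALLOWED = {"ls", "cat", "grep", "echo", "ping"}

def run_tool(command: str, args: list[str]) -> str:
    """Run an allowlisted command; refuse everything else."""
    if command not in ALLOWED:
        return f"refused: {command!r} is not an allowed command"
    # argv vector, no shell=True: a `; rm -rf /` inside an argument stays a
    # literal string instead of being interpreted by a shell
    result = subprocess.run(
        [command, *args], capture_output=True, text=True, timeout=30
    )
    return result.stdout if result.returncode == 0 else result.stderr
```

Per the point above, growing `ALLOWED` one command at a time as real needs appear is safer than starting broad and blacklisting.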

worldsayshi
20 days ago
1 reply
There are MCP-configured virtualization solutions that are supposed to be safe for letting an LLM go wild. Like this one:

https://github.com/zerocore-ai/microsandbox

I haven't tried it.

awayto
20 days ago
1 reply
You can build your agent into a docker image then easily limit both networking and file system scope.

    # Restrict the filesystem to the current folder, force DNS to localhost,
    # and allow outbound traffic only to the LLM provider's resolved addresses.
    docker run -it --rm \
      -e SOME_API_KEY="$(SOME_API_KEY)" \
      -v "$(shell pwd):/app" \
      --dns=127.0.0.1 \
      $(shell dig +short llm.provider.com 2>/dev/null | awk '{printf " --add-host=llm.provider.com:%s", $$0}') \
      my-agent-image
Probably could be a bit cleaner, but it worked for me.
worldsayshi
20 days ago
1 reply
Putting it inside docker is probably fine for most use cases but it's generally not considered to be a safe sandbox AFAIK. A docker container shares a kernel with the host OS, which widens the attack surface.

If you want your agent to pull untrusted code from the internet and go wild while you're doing other stuff it might not be a good choice.

awayto
20 days ago
Could you point to some resources which talk about how docker isn't considered a safe sandbox given the network and file system restrictions I mentioned?

I understand the sharing of kernel, while I might not be aware of all of the implications. I.e. if you have some local access or other sophisticated knowledge of the network/box docker is running on, then sure you could do some damage.

But I think the chances of a whitelisted llm endpoint returning some nefarious code which could compromise the system is actually zero. We're not talking about untrusted code from the internet. These models are pretty constrained.

dagss
20 days ago
1 reply
I realize now what I need in Cursor: A button for "fork context".

I believe that would be a powerful tool solving many things there are now separate techniques for.

all2
20 days ago
crush-cli has this. I think the google gemini chat app also has this now.
ericd
20 days ago
8 replies
Absolutely, especially the part about just rolling your own alternative to Claude Code - build your own lightsaber. Having your coding agent improve itself is a pretty magical experience. And then you can trivially swap in whatever model you want (Cerebras is crazy fast, for example, which makes a big difference for these many-turn tool call conversations with big lumps of context, though gpt-oss 120b is obviously not as good as one of the frontier models). Add note-taking/memory, and ask it to remember key facts to that. Add voice transcription so that you can reply much faster (LLMs are amazing at taking in imperfect transcriptions and understanding what you meant). Each of these things takes on the order of a few minutes, and it's super fun.
anonym29
20 days ago
2 replies
Cerebras now has glm 4.6. Still obscenely fast, and now obscenely smart, too.
DeathArrow
20 days ago
3 replies
Aren't there cheaper providers of GLM 4.6 on OpenRouter? What are the advantages of using Cerebras? Is it much faster?
simonw
20 days ago
It's astonishingly fast.
anonym29
19 days ago
Cerebras offers a $50/mo and $200/mo "Cerebras Code" subscription for token limits way above what you could get for the same price in PAYG API credits. https://www.cerebras.ai/code

Up until recently, this plan only offered Qwen3-Coder-480B, which was decent for the price and speed you got tokens at, but doesn't hold a candle to GLM 4.6.

So while they're not the cheapest PAYG GLM 4.6 provider, they are the fastest, and if you make heavy use of their monthly subscription plan, then they're also the cheapest per token.

Note: I am neither affiliated with nor sponsored by Cerebras, I'm just a huge nerd who loves their commercial offerings so much that I can't help but gush about them.

meeq
20 days ago
You know how sometimes when you send a prompt to Claude, you just know it’s gonna take a while, so you go grab a coffee, come back, and it’s still working? With Cerebras it’s not even worth switching tabs, because it’ll finish the same task in like three seconds.
ericd
20 days ago
Ooh thanks for the heads up!
lukevp
20 days ago
2 replies
What’s a good starting point for getting into this? I don’t even know what Cerebras is. I just use GitHub Copilot in VS Code. Is this local models?
ericd
20 days ago
A lot of it is just from HN osmosis, but /r/LocalLLaMA/ is a good place to hear about the latest open weight models, if that's interesting.

gpt-oss 120b is an open weight model that OpenAI released a while back, and Cerebras (a startup that is making massive wafer-scale chips that keep models in SRAM) is running that as one of the models they provide. They're a small scale contender against nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.

In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for eg running ls on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.
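Those tools really are small. A rough sketch of the file-editing pair (the function names and the appear-exactly-once rule are conventions I'm assuming, not anything official):

```python
from pathlib import Path

def read_file(path: str) -> str:
    """Tool: return the full contents of a file."""
    return Path(path).read_text()

def edit_file(path: str, old: str, new: str) -> str:
    """Tool: replace one occurrence of `old` with `new`, refusing ambiguous edits."""
    p = Path(path)
    text = p.read_text()
    if text.count(old) != 1:
        return f"error: {old!r} must appear exactly once, found {text.count(old)} times"
    p.write_text(text.replace(old, new))
    return "ok"
```

Expose these two via function calling, point the agent at its own source file, and the self-editing loop is complete.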

andai
20 days ago
Here is ChatGPT in 50 lines of Python:

https://gist.github.com/avelican/4fa1baaac403bc0af04f3a7f007...

No dependencies, and very easy to swap out for OpenRouter, Groq or any other API. (Except Anthropic and Google, they are special ;)

This also works on the frontend: pro tip, you don't need a server for this stuff, you can make the requests directly from an HTML file. (Patent pending.)

lowbloodsugar
20 days ago
2 replies
>build your own lightsaber

I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.

The other takeaway is that agents are fantastically simple.

afc
20 days ago
1 reply
I also started building my own, it's fun and you get far quickly.

I'm now experimenting with letting the agent generate its own source code from a specification: currently generating 9K lines of Python code (3K of implementation, 6K of tests) from 1.5K lines of specifications (https://alejo.ch/3hi).

threecheese
20 days ago
Just reading through your docs, and feeling inspired. What are you spending, token-wise? Order of magnitude.
ericd
20 days ago
Agreed, and it's actually how I've been thinking about it, but it's also straight from the article, so can't claim credit. But it was fun to see it put into words by someone else.

And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.

andai
20 days ago
8 replies
What are you using for transcription?

I tried Whisper, but it's slow and not great.

I tried the gpt audio models, but they're trained to refuse to transcribe things.

I tried Google's models and they were terrible.

I ended up using one of Mistral's models, which is alright and very fast except sometimes it will respond to the text instead of transcribing it.

So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!

tptacek
20 days ago
1 reply
I recently bought a mint-condition Alf phone, in the shape of Gordon Shumway of TV's "Alf", out of the back of an old auto shop in the south suburbs of Chicago, and naturally did the most obvious thing, which was to make a Gordon Shumway phone that has conversations in the voice of Gordon Shumway (sampled from Youtube and synthesized with ElevenLabs). I use https://github.com/etalab-ia/faster-whisper-server (I think?) as the Whisper backend. It's fine! Asterisk feeds me WAV files, an ASI program feeds them to Whisper (running locally as a server) and does audio synthesis with the ElevenLabs API. Took like 2 hours.
t_akosuke
20 days ago
1 reply
Been meaning to build something very similar! What hardware did you use? I'm assuming that a Pi or similar won't cut it
tptacek
20 days ago
Just a cheap VOIP gateway and a NUC I use for a bunch of other stuff too.
nostrebored
20 days ago
1 reply
Parakeet is sota
dSebastien
20 days ago
Agreed. I just launched https://voice-ai.knowii.net and am really a fan of Parakeet now. What it manages to achieve locally without hogging too many resources is awesome
richardlblair
19 days ago
1 reply
The new Qwen model is supposed to be very good.

Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.

andai
19 days ago
1 reply
Thanks. Could you share more? I'm about to reinvent this wheel right now. (Add a bunch of manual find-replace strings to my setup...)

Here's my current setup:

vt.py (mine) - voice type - uses pyqt to make a status icon and use global hotkeys for start/stop/cancel recording. Formerly used 3rd party APIs, now uses parakeet_py (patent pending).

parakeet_py (mine): A Python binding for transcribe-rs, which is what Handy (see below) uses internally (just a wrapper for Parakeet V3). Claude Code made this one.

(Previously I was using voxtral-small-latest (Mistral API), which is very good except that sometimes it will output its own answer to my question instead of transcribing it.)

In other words, I'm running Parakeet V3 on my CPU, on a ten year old laptop, and it works great. I just have it set up in a slightly convoluted way...

I didn't expect the "generate me some rust bindings" thing to work, or I would have probably gone with a simpler option! (Unexpected downside of Claude is really smart: you end up with a Rube Goldberg machine to maintain!)

For the record, Handy - https://github.com/cjpais/Handy/issues - does 80% of what I want. Gives a nice UI for Parakeet. But I didn't like the hotkey design, didn't like the lack of flexibility for autocorrect etc... already had the muscle memory from my vt.py ;)

richardlblair
17 days ago
My use case is pretty specific - I have a 6 week old baby. So, I've been walking on my walking pad with her in the carrier. Typing in that situation is really not pleasant for anyone, especially the baby. Speed isn't my concern, I just want to keep my momentum in these moments.

My setup is as follows:

- Simple hotkey to kick off a shell script to record.

- Simple Python script that uses inotify to watch the directory where audio is saved. Uses Whisper. This same script runs the transcription through Haiku 4.5 to clean it up. I tell it not to modify the contents, but it's Haiku, so sometimes it just does it anyway. The original transcript and the AI-cleaned versions are dumped into a directory.

- The cleaned-up version is run through another script to decide if it's code, a project brief, or an email. I usually start the recording with "this is code" or "this is a project brief" to make it easy. Then, depending on what it is, the original, the transcription, and the context get run through different prompts with different output formats.

It's not fancy, but it works really well. I could probably vibe-code this into a more robust workflow system all using inotify and do some more advanced things. Integrating more sophisticated tool calling could be really neat.

greenfish6
20 days ago
I use Willow AI, which I think is pretty good
segu
20 days ago
Handy is free, open-source and local model only. Supports Parakeet: https://github.com/cjpais/Handy
raymond_goo
20 days ago
https://github.com/rhulha/Speech2Speech

https://github.com/rhulha/EchoMate

ericd
20 days ago
Whisper.cpp/Faster-whisper are a good bit faster than OpenAI's implementation. I've found the larger whisper models to be surprisingly good in terms of transcription quality, even with our young children, but I'm sure it varies depending on the speaker, no idea how well it handles heavy accents.

I'm mostly running this on an M4 Max, so pretty good, but not an exotic GPU or anything. But with that setup, multiple sentences usually transcribe quickly enough that it doesn't really feel like much of a delay.

If you want something polished for system-wide use rather than rolling your own, I've been liking MacWhisper on the Mac side, currently hunting for something on Arch.

ty00001
18 days ago
Speechmatics - it is on the expensive side, but provides access to a bunch of languages and the accuracy is phenomenal on all of them - even with multi-speakers.
Uehreka
20 days ago
3 replies
The reason a lot of people don’t do this is because Claude Code lets you use a Claude Max subscription to get virtually unlimited tokens. If you’re using this stuff for your job, Claude Max ends up being like 10x the value of paying by the token, it’s basically mandatory. And you can’t use your Claude Max subscription for tools other than Claude Code (for TOS reasons. And they’ll likely catch you eventually if you try to extract and reuse access tokens).
sumedh
20 days ago
2 replies
> catch you eventually if you try to extract and reuse access tokens

What does that mean?

baq
20 days ago
2 replies
How do they know your requests come from Claude Code?
simonw
20 days ago
I imagine they can spot it pretty quick using machine learning to spot unlikely API access patterns. They're an AI research company after all, spotting patterns is very much in their wheelhouse.
virgilp
20 days ago
a million ways, but e.g: once in a while, add a "challenge" header; the next request should contain a "challenge-reply" header for said challenge. If you're just reusing the access token, you won't get it right.

Or: just have a convention/an algorithm to decide how quickly Claude should refresh the access token. If the server knows token should be refreshed after 1000 requests and notices refresh after 2000 requests, well, probably half of the requests were not made by Claude Code.

Uehreka
20 days ago
I’m saying if you try to use Wireshark or something to grab the session token Claude Code is using and pass it to another tool so that tool can use the same session token, they’ll probably eventually find out. All it would take is having Claude Code start passing an extra header that your other tool doesn’t know about yet, suspend any accounts whose session token is used in requests that don’t have that header and manually deal with any false positives. (If you’re thinking of replying with a workaround: That was just one example, there are a bajillion ways they can figure people out if they want to)
unshavedyak
19 days ago
1 reply
Is using CC outside of the CC binary even needed? CC has a SDK, could you not just use the proper binary? I've debated using it as the backend for internal chat bots and whatnot unrelated to "coding". Though maybe that's against the TOS as i'm not using CC in the spirit of it's design?
simonw
19 days ago
That's very much in the spirit of Claude Code these days. They renamed the Claude Code SDK to the Claude Agent SDK precisely to support this kind of usage of it: https://www.anthropic.com/engineering/building-agents-with-t...
ericd
20 days ago
When comparing, are you using the normal token cost, or cached? I find that the vast majority of my token usage is in the 90% off cached bucket, and the costs aren’t terrible.
_the_inflator
20 days ago
1 reply
I agree with you mostly.

On the other hand, I think that "show it or it didn't happen" is essential.

Dumping a bit of code into an LLM doesn’t make it a code agent.

And what Magic? I think you never hit conceptual and structural problems. Context window? History? Good or bad? Large Scale changes or small refactoring here and there? Sample size one or several teams? What app? How many components? Green field or not? Which programming language?

I bet you will view Claude and especially GitHub Copilot a bit differently, given that you can kill any self-made code agent quite easily with a bit of steam.

Code Agents are incredibly hard to build and use. Vibe Coding is dead for a reason. I remember vividly the inflation of Todo apps and JS frameworks (Ember, Backbone, Knockout are survivors) years ago.

The more you know about agents and especially code agents the more you know, why engineers won’t be replaced so fast - Senior Engineers who hone their craft.

I enjoy fiddling with experimental agent implementations, but value certain frameworks. They solve, in an opinionated way, problems you will run into if you dig deeper and others depend on you.

ericd
20 days ago
To be clear, no one in this thread said this is replacing all senior engineers. But it is still amazing to see it work, and it’s very clear why the hype is so strong. But you’re right that you can quickly run into problems as it gets bigger.

Caching helps a lot, but yeah, there are some growing pains as the agent gets larger. Anthropic’s caching strategy (4 blocks you designate) is a bit annoying compared to OpenAI’s cache-everything-recent. And you start running into the need to start summarizing old turns, or outright tossing them, and deciding what’s still relevant. Large tool call results can be killer.

I think at least for educational purposes, it’s worth doing, even if people end up going back to Claude Code, or away from agentic coding altogether, for their day to day.
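To make the caching contrast concrete: with Anthropic you opt spans in yourself by attaching `cache_control` markers (at most four per request), while OpenAI reuses recent prefixes automatically. A sketch of the Anthropic-style markup, with a placeholder prompt:

```python
# Placeholder for the large, stable prefix (system prompt plus tool docs)
# that you want the API to cache across turns.
LONG_STABLE_PREFIX = "You are a coding agent. [... many KB of tool definitions ...]"

# Anthropic-style explicit caching: mark the block yourself; you get at most
# four such markers per request, so you have to choose your breakpoints.
system = [
    {
        "type": "text",
        "text": LONG_STABLE_PREFIX,
        "cache_control": {"type": "ephemeral"},
    }
]
```

The upside of the explicit scheme is predictability; the annoyance, as noted above, is that you must decide where the four breakpoints go.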

ay
20 days ago
Kimi is noticeably better at tool calling than gpt-oss-120b.

I made a fun toy agent where the two models shoulder-surf each other and swap turns, either voluntarily during a summarization phase, or forcefully if a tool-calling mistake is made, and Kimi ends up running the show much more often than gpt-oss.

And yes - it is very much fun to build those!

GardenLetter27
20 days ago
But it's way more expensive since most providers won't give you prompt caching?
chrisweekly
20 days ago
There's something(s) about @tptacek's writing style that has always made me want to root for fly.io.
ATechGuy
20 days ago
Maybe we should write an agent that writes an agent that writes an agent...
zkmon
20 days ago
One of the best blog articles I have read in a while. Maybe MCP could have been covered as well?
solomonb
20 days ago
This work predates agents as we know them now and was intended for building chat bots (as in IRC chat bots), but when auto-gpt came out I realized I could formalize it super nicely with this library:

https://blog.cofree.coffee/2025-03-05-chat-bots-revisited/

I did some light integration experiments with the OpenAI API but I never got around to building a full agent. Alas..

235 more comments available on Hacker News

ID: 45840088 · Type: story · Last synced: 11/26/2025, 1:00:33 PM
