Superpowers: How I'm Using Coding Agents in October 2025
Posted 3 months ago · Active 3 months ago
blog.fsck.com · Tech · story · High profile
skeptical · mixed · Debate · 80/100
Key topics
AI Coding Agents
Large Language Models
Software Development
The post discusses the author's use of coding agents with LLMs, sparking debate about the effectiveness and practicality of this approach among HN commenters.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 5h after posting
Peak period: 95 comments in 6-12h
Avg / period: 14.5 comments
Comment distribution: 160 data points
Based on 160 loaded comments
Key moments
- 01 Story posted: Oct 11, 2025 at 3:29 AM EDT (3 months ago)
- 02 First comment: Oct 11, 2025 at 8:27 AM EDT (5h after posting)
- 03 Peak activity: 95 comments in 6-12h (hottest window of the conversation)
- 04 Latest activity: Oct 14, 2025 at 2:15 AM EDT (3 months ago)
ID: 45547344 · Type: story · Last synced: 11/20/2025, 8:28:07 PM
(There probably are some. Most likely I notice the bad ones more than the good ones. But it does seem like I notice a lot of bad ones, and never any good ones.)
[EDITED to add:] For context, the actual article title begins "Superpowers: How I'm using ..." and it has been auto-rewritten to "Superpowers: I'm using ...", which completely changes what "Superpowers" is understood as applying to. (The actual intention: superpowers for LLM coding agents. The meaning after the change: LLM coding agents as superpowers for humans.)
According to this, we'll all be reading the feelings journals of our LLM children and scolding them for cheating on our carefully crafted exams instead of, you know, making things. We'll read psychology books, apparently.
I like reading and tinkering directly. If this is real, the field is going to leave that behind.
It's just that we designed this iteration of technology foundationally on people's fickle and emotional Reddit posts, among other things.
It's a designed-in limitation, and kind of a happy accident it's capable of writing code at all. And clearly carries forward a lot of baggage...
That's fine.
Perhaps we can RL away some of this or perhaps there's something else we need. Idk, but this is the problem when engineers are the customer, designer, and target audience.
I hate managing people.
What are we doing?
Either your business isn't successful, so you're coding when you shouldn't be, or you're cosplaying coding with Claude, or you're lying, or you're telling us about your expensive and unproductive hobby.
How much do you spend on AI? What's your annual profit?
edit: oh cosplaying as a CEO. I see. Nice WPEngine landing page Mr AppBind.com CEO. Better have Claude fix your website! I guess that agent needs therapy...
Is Claude really "learning new skills" when you feed it a book, or does it just present it that way because your prompting encourages that sort of response behavior? I feel like you'd have to demo Claude with the new skills and Claude without them to tell.
Maybe I'm a curmudgeon, but most of these types of blogs feel like marketing pieces: so much is left unsaid and not shown that it comes off like a kid trying to hype up their own work without the benefit of nuance or depth.
The most challenging part when working with coding agents is that they seem to do well initially on a small code base with low complexity. Once the codebase gets bigger with lots of non-trivial connections and patterns, they almost always experience tunnel vision when asked to do anything non-trivial, leading to increased tech debt.
I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. My approach was used on a small toy problem, but one that was complex enough the agents couldn't one-shot and required error correction.
It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.
https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...
"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.
But in the realm of LLM-enabled use cases they're also expensive. You'd need to recruit dozens, perhaps even hundreds of developers to do this, with extensive observation and rating of the results.
So rather than actually try to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.
This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.
I agree. I think there are too many resources, examples, and live streams out there for someone to credibly claim at this point that these tools have no value and are all hype. I think the nuance is in how and where you apply it, what your expectations and tolerances are, and what your working style is. They are bad at many things, but there is tremendous value to be discovered. The loudest people on both sides of this debate are typically wrong in similar ways imo.
> Trial participants saved an average of 56 minutes a working day when using AICAs
That feels accurate to me, but again I'm just going on vibes :P
You absolutely can quantify the results of a chaotic black box, in the same way you can quantify the bias of a loaded die without examining its molecular structure.
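To make that concrete, here's a rough sketch in Python of estimating an agent's task success rate the same way you'd estimate a die's bias: by sampling repeated runs against a fixed task suite and putting a confidence interval on the result. The `run_agent_on_task` function is a hypothetical stand-in for however you invoke your agent and grade its output.

```python
import math

def run_agent_on_task(task: str) -> bool:
    """Hypothetical: run the coding agent on a task and return True if its
    output passes that task's checks. Swap in your own harness here."""
    raise NotImplementedError

def estimate_success_rate(tasks: list[str], runs_per_task: int = 5):
    # Treat each run as a Bernoulli trial, just like one roll of a loaded die.
    trials, successes = 0, 0
    for task in tasks:
        for _ in range(runs_per_task):
            trials += 1
            successes += run_agent_on_task(task)
    p = successes / trials
    # Normal-approximation 95% confidence interval on the success rate.
    half_width = 1.96 * math.sqrt(p * (1 - p) / trials)
    return p, (p - half_width, p + half_width)
```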
> "We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.
Heh, I'd rephrase the first part to:
> What you're getting at is the heart of the problem with software development though, isn't it?
It applies to using LLMs too. I guess the one big difference here is that LLMs are pushed by few enough companies, with abundant enough money, that it would be trivial for them to run a test like this. So the fact that they aren't doing that also says a lot.
If that's what we need to do, don't we already have the answer to the question?
I’ve similarly been using spec.md and running to-do.md files that capture detailed descriptions of the problems and their scoped history. I mark each of my to-do’s with informational tags: [BUG], [FEAT], etc.
I point the LLM to the exact to-do (or section of to-do’s) with the spec.md in memory and let it work.
This has been working very well for me.
https://gist.github.com/JacobBumgarner/d29b660cb81a227885acc...
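To give a feel for how that can be mechanized, here's a minimal sketch (hypothetical file layout and tag format; the real gist may differ): pull the one tagged to-do you're targeting, prepend spec.md, and hand both to the model.

```python
import re
from pathlib import Path

def load_todo(todo_path: str, todo_id: str) -> str:
    """Find a single tagged to-do line, e.g. (hypothetical format):
    '- [BUG] #12 Login form loses state on refresh'"""
    for line in Path(todo_path).read_text().splitlines():
        if todo_id in line and re.search(r"\[(BUG|FEAT|CHORE)\]", line):
            return line.strip()
    raise ValueError(f"to-do {todo_id!r} not found")

def build_prompt(spec_path: str, todo_path: str, todo_id: str) -> str:
    spec = Path(spec_path).read_text()      # spec.md stays in context
    todo = load_todo(todo_path, todo_id)    # point at the exact to-do
    return (
        f"Project spec:\n{spec}\n\n"
        f"Work only on this item, and nothing else:\n{todo}\n"
    )

# prompt = build_prompt("spec.md", "to-do.md", "#12")  # then send to your agent
```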
On top of that, it doesn't sound enjoyable. Anti slop sessions? Seriously?
Lastly, the largest problem I have with LLMs is that they are seemingly incapable of stopping to ask clarifying questions. This is because they do not have a true model of what is going on. Instead they truly are next token generators. A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.
If asking clarifying questions is plausible output text for LLMs, this may work effectively.
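One way to test that is to make "ask before coding" an explicit step rather than hoping the model volunteers it. A rough sketch, with a hypothetical `call_llm()` helper standing in for whatever SDK you actually use:

```python
def call_llm(system: str, user: str) -> str:
    """Hypothetical stand-in for your LLM SDK of choice."""
    raise NotImplementedError

ASK_FIRST = (
    "Before writing any code, list the clarifying questions you would need "
    "answered to implement this correctly. Output ONLY the questions. "
    "If nothing is ambiguous, output exactly: NO QUESTIONS."
)

def clarify_then_implement(task: str) -> str:
    questions = call_llm(system=ASK_FIRST, user=task)
    if questions.strip() != "NO QUESTIONS":
        print(questions)                 # surface them to the stakeholder
        answers = input("Answers: ")     # a human fills the gaps
        task = f"{task}\n\nClarifications:\n{answers}"
    return call_llm(system="Implement the task.", user=task)
```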
1. Context isn't infinite
2. Both Claude and OpenAI get increasingly dumb after 30-50% of the context has been filled
Even the "million token context window" becomes useless once it's filled to 30-50% and the model starts "forgetting" useful things like existing components, utility functions, AGENTS.md instructions etc.
Even a junior programmer can search and remember instructions and parts of the codebase. All current AI tools have to be reminded to recreate the world from scratch every time, and promptly forget random parts of it.
If that's your idea of trivial then you and I have very different standards in terms of what's a trivial change and what isn't.
Yes, the core of that pull request is an hour or two of thinking; the rest is ancillary noise. The LLM took away the need for the noise.
If your definition of trivial is signal/noise ratio, then, sure, relatively little signal in a lot of noise. If your definition of "trivial" hinges on total complexity over time, then this beats the pants off manual writing.
I'd assume OP did the classic senior engineer shtick of "I can understand the core idea quickly, therefore it can't be hard". Whereas Mitchel did the heavy lifting of actually shipping the "not hard" idea - still understanding the core idea quickly, and then not getting bogged down in unnecessary details.
That's the beauty of LLMs - it turns the dream of "I could write that in a weekend" into actual reality, where before it was always empty bluster.
Didn’t you just describe Agile?
Sorry, couldn’t resist. Agile’s point was getting feedback during the process rather than after something is complete enough to be shipped, thus minimizing risk and avoiding wasted effort.
Instead people are splitting up major projects into tiny shippable features and calling that agile while missing the point.
"Splitting up major projects into tiny shippable features and calling that agile" feels like a much more accurate description of what I've experienced.
I wish I'd gotten to see the real thing(s) so I could at least have an informed opinion.
The failure modes I've personally seen are an organization that isn't interested in cooperating, or the person running the show being more interested in process than people. But I'd say those teams would struggle no matter what.
Ultimately, I think it's up to the engineering side to do its best to leverage the process for better results, and I've seen very little of that (and it's of course always been the PM side's fault).
And you're right: use what works for you. I just haven't seen anything that felt like it actually worked. Maybe one problem is people iterating so fast/often they don't actually know why it's not working.
The manager for the only team I think actually checked all the agile boxes had a UI background so she thought in terms of mock-ups, backend, and polishing as different tasks and was constantly getting client feedback between each stage. That specific approach isn’t universal, the feedback as part of the process definitely should be though.
What was a little surreal is the pace felt slow day to day but we were getting a lot done and it looked extremely polished while being essentially bug free at the end. An experienced team avoiding heavy processes, technical debt, and wasted effort goes a long way.
Either way, the ability to produce "working software" (as the manifesto puts it) in "frequent" iterations (often just seconds with an LLM!) and iterate on feedback is core to Agile.
I'm not highlighting this to gloat or to prove a point. If anything in the past I have underestimated how big LLMs were going to be. Anyone so inclined can take the chance to point and laugh at how stupid and wrong that was. Done? Great.
I don't think I've been intentionally avoiding coding assistants, and as a matter of fact I have been using Claude Code since the literal day it first previewed, and yet it doesn't feel, not even one bit, like you can take your hands off the wheel. Many are acting as if writing any code manually means "you're holding it wrong", which I feel is just not true.
A year ago I was using GitHub Copilot autocomplete in VS Code and occasionally asking ChatGPT or Claude to help write me a short function or two.
Today I have Claude Code and Codex CLI and Codex Web running, often in parallel, hunting down and resolving bugs and proposing system designs and collaborating with me on detailed specs and then turning those specs into working code with passing tests.
The cognitive overhead today is far higher than it was a year ago.
You can go further and faster, but you can get to a point where you're out of juice miles from home, and getting back is a chuffing nightmare.
Also, you discover that you're putting on weight and not getting that same buzz you got on your old pushbike.
When using it for code or architecture or design, I’m always watching for signs that it is going off the rails. Then I usually write code myself for a while, to keep the structure and key details of whatever I’m doing correct.
- incessantly duplicating already existing functionality: utility functions, UI components etc.
- skipping required parameters like passing current user/actor to DB-related functions
- completely ignoring large and small chunks of existing UI and UI-related functionality like layouts or existing styles
- using ad-hoc DB queries or even iterating over full datasets in memory instead of setting up proper DB queries
And so on and so forth.
YMMV of course, depending on language and project
... which is exactly the kind of thing this new skills mechanism is designed to solve.
That they routinely ignore.
> which means documenting the functionality they should be aware of
Which means spending inordinate amounts of time writing down every single function, component, CSS rule, and style that could otherwise be easily discovered by just searching. Or by looking at adjacent files.
> which is exactly the kind of thing this new skills mechanism is designed to solve.
I tried it yesterday. It immediately duplicated functionality, ignored existing styles and components, and created ad-hoc queries. It did feel like there were fewer times when it did that, but it's hard to quantify.
I also have to remember all of the new code that’s coming together, and keep it from re-inventing other parts of the codebase, etc.
More productive, but hard work.
P.S. I always thought you were one of those irrational AI bros. Later, I found that you were super reasonable. That's the way it should be. And thank you!
It's funny because not far below this comment there is someone doing literally this.
For context, I mainly do game development so I'm viewing it through that lens - but I find it easier to debug something bad than to write it from scratch. It's more intensive than doing it yourself but probably more productive too.
C'mon, such self-congratulatory "Look at My Potency: How I'm using Nicknack.exe" fluffies always were and always will be a staple of the IT industry.
However for complex projects IMO one must read what was written by the llm … every actual word.
When it ‘got away’ from me, in each case I left something in the llm-written markdown that I should have removed.
99% “I can ask for that later” and 1% “that’s a good idea i hadn’t considered” might be the right ratio when reading an llm generated plan/spec/workunit.
Breaking work into single context passes … 50-60k tokens in sonnet 4.5 … has typically had fantastic results for me.
My side project is using lean 4, and a carelessly left-in ‘validate’ rather than ‘verify’ led down a hilariously complicated path equivalent to matching an output against a known string.
I recovered, but it wasn’t obvious to me that was happening. I however would not be able to write lean proofs myself, so diagnosing the problem and fixing it is a small price to be able to mechanically verify part of my software is correct.
maybe explicit support from providers would make it feasible?
The cost of self-hosting some reasonable-size models for development groups of various sizes is what I would want to know before investing in the skills for a high-usage style that, for now, might be mostly bankrolled by investors.
I don’t like the looks of that. If I used this, how soon before those instructions would be in conflict with my actual priorities?
Not everything can be the first law.
Spend some time digging around in his https://github.com/obra/Superpowers repo.
I wrote some notes on this last night: https://simonwillison.net/2025/Oct/10/superpowers/
The packaged collection is very cool and so is the idea of automatically adding new abilities, but I’m not fully convinced that this concept of skills is that much better than having custom commands+sub-agents. I’ll have to play around with it these next few days and compare.
Some of these skills are probably better as programmed workflows that the LLM is forced to go through to improve reliability/consistency, that's what I've found in my own agents, rather than using English to guide the LLM and trusting it to follow the prescribed set of steps needed. Some mix of LLMs (choosing skills, executing the fuzzy parts of them) and just plain code (orchestration of skills) seems like the best bet to me and what I'm pursuing.
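For what it's worth, here's roughly what I mean by that split, as a sketch (hypothetical `call_llm()` helper and made-up step names): the orchestration is ordinary code that always runs the same steps in the same order, and the model is only consulted for the fuzzy work inside each step.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM SDK of choice."""
    raise NotImplementedError

def run_feature_workflow(request: str) -> dict:
    # Plain code owns the sequence; the LLM can't skip or reorder steps.
    plan = call_llm(f"Write a short implementation plan for: {request}")
    tests = call_llm(f"Write failing tests (TDD red step) for this plan:\n{plan}")
    code = call_llm(f"Write code to make these tests pass:\n{tests}")
    review = call_llm(f"Review this code against the plan:\n{plan}\n\n{code}")
    return {"plan": plan, "tests": tests, "code": code, "review": review}
```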
The ability to isolate context-noisy subtasks (like agentically searching through a large codebase by grepping through dozens of irrelevant files to find the one you actually need) unlocks much longer-running loops, and therefore much more complex tasks.
And you don't need a system this complicated to take advantage of it. Literally just a simple "codebase-searcher" agent (and Claude can vibe the agent definition for you) is enough to see the benefit first-hand. Once you see it, if you're like me, you will see opportunities for subagents everywhere.
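As a sketch of why that helps (hypothetical helper names; a real Claude Code subagent is defined in markdown, but the shape is the same): the subagent burns its own fresh context reading grep output across dozens of files and hands back only a short answer, so the main agent's context stays clean.

```python
import subprocess

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM SDK, invoked with a fresh context."""
    raise NotImplementedError

def codebase_searcher(question: str, pattern: str, repo: str) -> str:
    """Subagent: spends tokens digesting noisy grep output, returns a summary."""
    hits = subprocess.run(
        ["grep", "-rn", "--include=*.py", pattern, repo],
        capture_output=True, text=True,
    ).stdout[:20_000]  # the noisy part stays inside the subagent's context
    return call_llm(
        f"Question: {question}\n\nGrep output:\n{hits}\n\n"
        "Answer in 3 sentences max: which file and function are relevant, and why?"
    )

# The main agent only ever sees the 3-sentence answer, not the grep dump.
```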
Using them in a way that doesn't waste tokens is something I haven't fully figured out yet!
What am I missing?
Also, memory itself can be a tool the subagent calls to retrieve only the stuff it needs.
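For instance (a toy sketch, not any particular framework's API): memory can just be a keyed store the subagent queries on demand, so only the entries it actually asks for ever enter its context.

```python
class MemoryStore:
    """Toy long-term memory exposed as a tool the (sub)agent can call."""

    def __init__(self) -> None:
        self._notes: dict[str, str] = {}

    def save(self, key: str, note: str) -> None:
        self._notes[key] = note

    def recall(self, query: str) -> str:
        # Naive keyword match; a real system might use embeddings instead.
        hits = [v for k, v in self._notes.items() if query.lower() in k.lower()]
        return "\n".join(hits) or "(nothing relevant remembered)"

memory = MemoryStore()
memory.save("auth/session-timeout", "Sessions expire after 30 min; see auth/session.py")
print(memory.recall("session"))  # only this snippet enters the agent's context
```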
Not only do I have to know everything about the code, data, and domain, but now I need to understand this whole AI system, which is a meta-skill of its own.
I fear I may never be able to catch up till someone comes along and simplifies it for pleb consumption.
Admittedly that stance is easiest to take if you were old enough, experienced enough already by the time this era hit.
I don't see the coding as the hard or critical part of my work, so I don't put effort into accelerating or delegating that part.
I've found that a single CLAUDE.md does really well at guiding it how I want it to behave. For me that's making it take small steps and stop to ask me questions frequently, so it's more like we're pairing than I'm sending it off solo to work on a task. I'm sure that's not to everyone's taste but it works for me (and I say this as someone who was an agent-sceptic until quite recently).
Fwiw my ~/.claude/CLAUDE.md is 2.2K / 49 lines.
just remember that it works the same for everyone: you input text, magic happens, text comes out.
if you can properly explain a software engineering problem in plain language, you're an expert in using LLMs. everything on top of that is people experimenting or trying to build the next big thing.
I’ve found you have to use Claude Code to do something small, and as you do it, iterate on the CLAUDE.md input prompt to refine what it does by default. When it doesn't do things your way, change the prompt to see if you can fix how it works. The agent is then equivalent to calling chatgpt / sonnet 1000 times an hour. So these refinements (skills in the post are a meta approach) are all about how to tune the workflow to be more accurate for your project and fit your mental model. As you tune the md file you’ll start to feel what is possible and understand agent capabilities much better.
So short story: you have to try it. Long story: it's the iteration of the meta-prompt approach that teaches you what's possible.
Just give it a few months. If some techniques really work, they’ll get streamlined.
No matter what you are told, there is no silver bullet. Precisely defining the problem is always the hard part. And the best way to precisely define a problem and its solution is code.
I’ll let other people fight swarms of bots building… well who knows what. Maybe someday it will deliver useful stuff, but I’m highly skeptical.
Jesse on Bluesky: https://bsky.app/profile/s.ly/post/3m2srmkergc2p
> The core of it is VERY token light. It pulls in one doc of fewer than 2k tokens. As it needs bits of the process, it runs a shell script to search for them. The long end to end chat for the planning and implementation process for that todo list app was 100k tokens.
> It uses subagents to manage token-heavy stuff, including all the actual implementation.
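That design is easy to approximate yourself. A rough sketch (hypothetical paths and helpers; the real Superpowers setup uses a shell script, but the idea is the same): keep the full skill library on disk, load only a tiny index up front, and pull in a matching section only when it's needed.

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # hypothetical: one markdown file per skill

def skill_index() -> str:
    """The only thing loaded up front: skill titles, a few tokens each."""
    return "\n".join(p.stem for p in sorted(SKILLS_DIR.glob("*.md")))

def find_skill_section(keyword: str, max_chars: int = 2000) -> str:
    """Pulled into context on demand, instead of pasting every skill up front."""
    for path in sorted(SKILLS_DIR.glob("*.md")):
        text = path.read_text()
        if keyword.lower() in text.lower():
            start = text.lower().index(keyword.lower())
            return f"# {path.stem}\n{text[start:start + max_chars]}"
    return "(no matching skill found)"
```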
Save yourself the experience of having to write and maintain prompts like this.
That paper is about using persuasion prompts to overcome trained in "safety" refusals, not to improve prompt conformance.
It is the thing I find most irritating about working with llms and agents. They seem forever a generation behind in capabilities that are self-referential.
"Phase 2 will take about one week"
No, Claude, it won't, because you and I will bang this thing out in a few hours.
(The latest Claude has a `/context` command that’s great at measuring this stuff btw)
The whole world is changing around us and nothing is secure. I would not gamble that the market for our engineering careers is safe with so much disruption happening.
Tools like Lovable are going to put lots of pressure on technical web designers.
Business processes may conform to the new shape and channels for information delivery, causing more consolidation and less duplication.
Or perhaps the barrier to entry for new engineers, in a worldwide marketplace, lowers dramatically. We have accessible new tools to teach, new tools to translate, new tools to coordinate...
And that's just the bear case where nothing improves from what we have today.
This is voodoo.
It likely works - but knowing that YAGNI is a thing means that, at some level, you are invoking a cultural touchstone for a very specific group of humans.
Edit -
I dug into the superpowers and skills for a bit. Definitely learned from it.
There’s stuff that doesn’t make sense to me on a conceptual basis. For example, in the skill to preserve productive tensions, there’s a part that goes:
> The trade-off is real and won't disappear with clever engineering
There’s no dimension for “valid” or prediction for tradeoff.
I can guess that if the preceding context already outlines tradeoffs clearly, or somehow encodes that there is no clever solution that threads the needle - then this section can work.
Just imagining what dimensions must be encoding some of this suggests that it’s … it won’t work for situations where the example wasn’t already encoded in the training. (Not sure how to phrase it)
I was struggling to find the exact reason this type of article bugs me so much, and I think "voodoo" is precisely the correct phrase to sum up my feelings.
I don't mean that as a judgement on the utility of LLMs or that reading about what different users have tried out to increase that utility isn't valuable. But if someone asked me how to most effectively get started with coding agents, my instinct is to answer (a) carefully and (b) probably every approach works somewhat.
idk, but if you already assume that the LLM knows what TDD is (it probably ingested ~100 whole books about it), why are we feeding a short (and imo confusing) version of that back to it before the actual prompt?
i feel like a lot of projects like this that are supposed to give LLMs "superpowers" or whatever by prompt engineering are operating on the wrong assumption that LLMs are self-learning and can be made 10x smarter just by adding a bit of magic text that the LLM itself produced before the actual prompt.
ofc context matters and if i have a repetitive task, i write down my constraints and requirements and paste that in before every prompt that fits this task. but that's just part of the specific context of what i'm trying to do. it's not giving the LLM superpowers, it's just providing context.
i've read a few posts like this now, but what i am always missing is actual examples of how it produces objectively better results compared to just prompting without the whole "you have skill X" thing.
I would say that systems like this are about getting the agent to choose precisely the right context snippet for the exact subtask it's doing at a given point within a larger workflow. Obviously you could also do that manually, but that doesn't scale to running many agents in parallel, or running autonomously for longer durations.
I’ve found the most helpful things for me are just voice via Whisper to LLMs, managing token usage effectively and restarting chats when necessary, and giving it quantified ways to check when its work is done (say, AI unit tests with APIs or Playwright tests). Also, every file I own is markdown haha.
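A minimal sketch of what a "quantified done check" loop can look like (hypothetical `call_llm()` and `apply_patch()` helpers; assumes a pytest suite exists): the model doesn't get to declare victory, the test run does.

```python
import subprocess

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM SDK of choice."""
    raise NotImplementedError

def apply_patch(patch: str) -> None:
    """Hypothetical: write the model's proposed changes to the working tree."""
    raise NotImplementedError

def fix_until_green(task: str, max_rounds: int = 5) -> bool:
    prompt = task
    for _ in range(max_rounds):
        apply_patch(call_llm(prompt))
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # the tests decide "done", not the model
        prompt = f"{task}\n\nTests are failing:\n{result.stdout[-4000:]}\nFix them."
    return False
```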
And obviously having different AI chats for specialized tasks (the way the math works on these models makes this have much better results!)
All of this has allowed me to still be in the PM role like he said, but without burning down a needless forest on having it reevaluate things in its training set lol. But why would we go back to vendor lock-in with Claude? Not to mention how much more powerful 5o-codex-high is, it’s not even close
The good thing about what he said is getting AI to work with AI; I have found this to be incredibly useful in prompting and in segmenting out roles
... So, we're refactoring the process of prompting?
> As Claude and I build new skills, one of the things I ask it to do is to "test" the skills on a set of subagents to ensure that the skills were comprehensible, complete, and that the subagents would comply with them. (Claude now thinks of this as TDD for skills and uses its RED/GREEN TDD skill as part of the skill creation skill.)
> The first time we played this game, Claude told me that the subagents had gotten a perfect score. After a bit of prodding, I discovered that Claude was quizzing the subagents like they were on a gameshow. This was less than useful. I asked to switch to realistic scenarios that put pressure on the agents, to better simulate what they might actually do.
... and debugging it?
... How many other basic techniques of SWEng will be rediscovered for the English programming language?
This is actually a really cool idea. I think a lot of the good scaffolding right now is things like “use TDD”, but if you link citations to the book, then it can perhaps extract more relevant wisdom and context (just like I would by reading the book), rather than using the generic averaged interpretation of TDD derived from the internet.
I do like the idea of giving your Claude a reading list and some spare tokens on the weekend where you’re not working, and having it explore new ideas and techniques to bring back to your common CLAUDE.md.
Is this just someone who has tingly feelings about Claude reiterating stuff back to them? cuz that's what an LLM does/can do
71 more comments available on Hacker News