OpenAI Are Quietly Adopting Skills, Now Available in ChatGPT and Codex CLI
Posted 21 days ago · Active 18 days ago
simonwillison.net · Tech Discussion · story · High profile
Key topics
- AI Research
- Large Language Models
- Command Line Tool
- AI
Discussion Activity
- Very active discussion
- First comment: 25m after posting
- Peak period: 118 comments in 0-6h
- Avg / period: 26.7
- Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
- 01 Story posted: Dec 12, 2025 at 6:30 PM EST (21 days ago)
- 02 First comment: Dec 12, 2025 at 6:55 PM EST (25m after posting)
- 03 Peak activity: 118 comments in 0-6h, the hottest window of the conversation
- 04 Latest activity: Dec 15, 2025 at 8:31 AM EST (18 days ago)
ID: 46250332 · Type: story · Last synced: 12/15/2025, 11:15:25 PM
(I'm not just about pelicans.)
The foreplay starts around the 1 minute mark.
Good thinking, I actually agree, however...
> Skills are based on a very light specification, if you could even call it that, but I still think it would be good for these to be formally documented somewhere.
Like a lot of posts around AI (and I hope OP can speak to it), surely you can agree that while this can be used for a good, cool idea, it can also be used for the inverse, and probably to more detrimental ends. Why would they formally document a feature that may be consumed in unmanageable ways?
Have you tried, or would you try, this on a local LLM instead?
The OpenAI GPT OSS models can drive Codex CLI, so they should be able to do this.
I have high hopes for Mistral's Devstral 2 but I've not run that locally yet.
That's actually super interesting, maybe something I'll try to investigate to find the minimum requirements, because as cool as they seem, personalized 'skills' might be a more useful application of AI overall.
Nice article, and thanks for answering.
Edit: My thinking is that consumer-grade hardware could be good enough to run this soon.
Local LLMs are better for long batch jobs, not for things you want immediately, or your flow gets killed.
Services can provide an MCP-like layer that provides semantic definitions of everything you can do with said service (API + docs).
Skills can then be built that combine some subset of the 3rd party interfaces, some bespoke code, etc. and then surface these more context-focused skills to the LLM/agent.
Couldn’t we just use APIs?
Yes, but not every API is documented in the same way. An “MCP-like” registry might be the right abstraction for 3rd parties to expose their services in a semantic-first way.
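One hypothetical way that combination could look, sketched as a skill that pairs a condensed slice of a vendor's API docs with a bespoke script. All names here are invented for illustration and are not part of any existing spec:

```markdown
---
name: invoice-reports
description: Generate monthly invoice reports from the billing service. Use when the user asks for billing or invoice summaries.
---

- API surface: see references/billing-api.md, a condensed, semantically
  described subset of the vendor's published docs (the two endpoints this
  skill needs, not all forty).
- Aggregation: run `python scripts/summarise_invoices.py`, a small bespoke
  wrapper that calls those endpoints and prints a markdown summary.
```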
So you read about skills (prompt + scripts) to make this more repeatable and reduce time spent thinking. At that point there are two paths you can go down -- write the skill and prompt yourself for the agent to execute -- or better -- just tell the agent to write the skill and prompt and then you lightly edit it and commit it.
This may seem obvious to some, but I've seen engineers create skills from scratch because they have a mental model around skills being something that people must build for the agent, whereas IMO skills are you just bridging a productivity gap that the agent can't figure out itself (for now).
feels like the right layer of abstraction for remote APIs
Computability (scripts) means being able to build documents, access remote data, retrieve data from packaged databases, and a bunch of other fundamentally useful things, not just "code things". Computability makes up for many of the LLM's weaknesses and gives it autonomy to perform tasks independently.
On top of that, we can provide the documentation and examples in the skill that help the LLM execute computability effectively.
And if the LLM gets hung up on something while executing the skill, we can ask it why and then have it write better documentation or examples for a new skill version. So skills can self-improve.
It's still so early. We need better packaging, distribution, version control, sharing, composability.
But there's definitely something simple, elegant, and effective here.
Bloat has a new name and it's AI integration. You thought Chrome using GB per tab was bad, wait until you need a whole datacenter to use your coding environment.
Oops--you're absolutely right! I did--in fact--fail to remember not to kill the patient after you expressly told me not to.
So, to the AI sceptics, I say: have you tried my VBA program? If you haven't tested it on actual patients, how do you know it doesn't work? Don't allow your prejudice to stand in the way of progress: prescribe more mouse bites!
But perhaps an LLM could write an adapter that gets cached until something changes?
So companies are really trying to deliver value. This is the right pivot. If you gave me an AGI with a 100 IQ, that seems pretty much worthless in today’s world. But domain expertise - that I’ll take.
This remains an open problem for LLMs - we don't have true AGI benchmarks, and the LLMs are frequently learning the benchmark problems without necessarily getting that much better in the real world. Gemini 3 has been hailed precisely because it's delivered huge gains across the board that aren't overfitting to benchmarks.
AI companies have a high incentive to make scores go up. They may employ humans to write similar-to-benchmark training data to hack the benchmark (while not directly training on the test set).
Throwing your hard problems at work at an LLM is a better metric than benchmarks.
Not really. I have a set of disclosures on my blog here: https://simonwillison.net/about/#disclosures
I'm beginning to pick up a few more consulting opportunities based on my writing and my revenue from GitHub sponsors is healthy, but I'm not particularly financially invested in the success of AI as a product category.
The counter-incentive here is that my reputation and credibility is more valuable to me than early access to models.
This very post is an example of me taking a risk of annoying a company that I cover. I'm exposing the existence of the ChatGPT skills mechanism here (which I found out about from a tip on Twitter - it's not something I got given early access to via an NDA).
It's very possible OpenAI didn't want that story out there yet and aren't happy that it's sat at the top of Hacker News right now.
I'm not sure English is a bad way to outline what the system should do. It has tradeoffs. I'm not sure library functions are a 1:1 analogy either.
What is AGI? Artificial. General. Intelligence. Applying domain independent intelligence to solve problems expressed in fully general natural language.
It’s more than a pedantic point though. What people expect from AGI is the transformative capabilities that emerge from removing the human from the ideation-creation loop. How do you do that? By systematizing the knowledge work process and providing deterministic structure to agentic processes.
Which is exactly what these developments are doing.
I actually kind of love this comparison — it demonstrates the point that just like “human flight”, “true AGI” isn’t a single point in time, it’s a many-decade (multi-century?) process of refinement and evolution.
Scholars a millennium from now will be debating when each of these was actually "truly" achieved.
To me, we have both achieved and not achieved human flight. Can humans themselves fly? No. Can people fly in planes across continents? Yes.
But, does it really matter if it counts as “human flight” if we can get from point A to point B faster? You’re right - this is an argument that will last ages.
It’s a great turn of phrase to describe AGI.
Even if this is true, which I disagree with, it simply creates a new bar: AGCI. Artificial Generally Correct Intelligence
Because right now it is more like Randomly Correct.
If they did calculations as sloppily as AI currently produces information, they would not be as useful
Here's the thing, I get it, and it's easy to argue for this and difficult to argue against it. BUT
It's not intelligent. It just is not. It's tremendously useful and I'd forgive someone for thinking the intelligence is real, but it's not.
Perhaps it's just a poor choice of words. What a LOT of people really mean would go along the lines more like Synthetic Intelligence.
That is, however difficult it might be to define, REAL intelligence that was made, not born.
Transformer and Diffusion models aren't intelligent, they're just very well trained statistical models. We actually (metaphorically) have a million monkeys at a million typewriters for a million years creating Shakespeare.
My efforts manipulating LLMs into doing what I want are pretty darn convincing that I'm cajoling a statistical model and not interacting with an intelligence.
A lot of people won't be convinced that there's a difference; it's hard to do when I'm saying it might not be possible to have a definition of "intelligence" that is satisfactory and testable.
Can ChatGPT solve problems? It is trivial to see that it can. Ask it to sort a list of numbers, or debug a piece of segfaulting code. You and I both know that it can do that, without being explicitly trained or modified to handle that problem, other than the prompt/context (which is itself natural language that can express any problem, hence generality).
What you are sneaking into this discussion is the notion of human-equivalence. Is GPT smarter than you? Or smarter than some average human?
I don’t think the answer to this is as clear-cut. I’ve been using LLMs on my work daily for a year now, and I have seen incredible moments of brilliance as well as boneheaded failure. There are academic papers being released where AIs are being credited with key insights. So they are definitely not limited to remixing their training set.
The problem with the “AI are just statistical predictors, not real intelligence” argument is what happens when you turn it around and analyze your own neurons. You will find that to the best of our models, you are also just a statistical prediction machine. Different architecture, but not fundamentally different in class from an LLM. And indeed, a lot of psychological mistakes and biases start making sense when you analyze them from the perspective of a human being like an LLM.
But again, you need to define “real intelligence” because no, it is not at all obvious what that phrase means when you use it. The technical definitions of intelligence that have been used in the past, have been met by LLMs and other AI architectures.
I think there’s a set of people whose axioms include ‘I’m not a computer and I’m not statistical’ - if that’s your ground truth, you can’t be convinced without shattering your world view.
Let's put it this way: language written or spoken, art, music, whatever... a primary purpose of these things is to be a sort of serialization protocol to communicate thought states between minds. When I say I struggle to come to a definition, I mean I think these tools are inadequate to do it.
I have two assertions:
1) A definition in English isn't possible
2) Concepts can exist even when a particular language cannot express them
It isn't, as these are how stakeholders convey needs to those charged with satisfying same (a.k.a. "requirements"). Where expectations become unrealistic is believing language models can somehow "understand" those outlines as if a human expert were doing so in order to produce an equivalent work product.
Language models can produce nondeterministic results based on the statistical model derived from their training data set(s), with varying degrees of relevance as determined by persons interpreting the generated content.
They do not understand "what the system should do."
Precisely my point:
> You can say they don't understand, but I'm sitting here with Nano Banana Pro creating infographics, and it's doing as good of a job as my human designer does with the same kinds of instructions. Does it matter if that's understanding or not?

Understanding, when used in its unqualified form, implies people possessing same. As such, it is a metaphysical property unique to people and defined wholly therein.
Excel "understands" well-formed spreadsheets by performing specified calculations. But who defines those spreadsheets? And who determines the result to be "right?"
Nano Banana Pro "understands" instructions to generate images. But who defines those instructions? And who determines the result to be "right?"
"They" do not understand.
You do.
And generally the point is that it does not matter whether we call what they do "understanding" or not. It will have the same kind of consequences in the end, economic and otherwise.
This is basically the number one hangup that people have about AI systems, all the way back since Turing's time.
The consequences will come from AI's ability to produce certain types of artifacts and perform certain types of transformations of bits. That's all we need for all the scifi stuff to happen. Turing realized this very quickly, and his famous Turing test is exactly about making this point. It's not an engineering kind of test. It's a thought experiment trying to prove that it does not matter whether it's just "simulated understanding". A simulated cake is useless, I can't eat it. But simulated understanding can have real world effects of the exact same sort as real understanding.
I understand the general use of the phrase and used same as an entryway to broach a deeper discussion regarding "understanding."
> And generally the point is that it does not matter whether we call what they do "understanding" or not. It will have the same kind of consequences in the end, economic and otherwise.
To me, when the stakes are significant enough to already see the economic impacts of this technology, it is important for people to know where understanding resides. It exists exclusively within oneself.
> A simulated cake is useless, I can't eat it. But simulated understanding can have real world effects of the exact same sort as real understanding.
I agree with you in part. Simulated understanding absolutely can have real world effects when it is presented and accepted as real understanding. When simulated understanding is known to be unrelated to real understanding and treated as such, its impact can be mitigated. To wit, few believe parrots understand the sounds they reproduce.
African grey parrots do understand the words they use; they don't merely reproduce them. Once mature they have the intelligence (and temperament) of a 4- to 6-year-old child.
There's a good chance of that.
> African grey parrots do understand the words they use; they don't merely reproduce them. Once mature they have the intelligence (and temperament) of a 4- to 6-year-old child.
I did not realize I could discuss with an African grey parrot the shared experience of how difficult it was to learn how to tie my shoelaces and what the feeling was like to go to a place every day (school) which was not my home.
I stand corrected.
> You can, of course, define understanding as a metaphysical property that only people have.
This is not what I said.
What I said was that unqualified use of "understanding" implies the understanding people possess. Thus it is, by definition, a metaphysical property existing strictly within a person.
Many other entities possess their own form of understanding. Most would agree mammals do. Some would say any living creature does.
I would make the case that every program compiler (C, C#, C++, D, Java, Kotlin, Pascal, etc.) possesses understanding of a particular sort.
All of the aforementioned examples differ from the kind of understanding people possess.
https://simstek.fandom.com/wiki/SimAntics
Just saw your profile and it reminded me of a book my mentor bequeathed to me, which we both referred to as "the real blue book" [0]. Thanks for bringing back fond memories.

[0] https://www.goodreads.com/book/show/2297758.Starting_FORTH
So basically your thesis is also your assumption.
When I ask Claude Code to "look for bugs in my code and list issues ranked by severity and confidence" and it does just that, you'll have to elaborate on how this is excluded from your definition of understanding.
Human language is imprecise and allows unclear and logically contradictory things, besides not being checkable. That's literally why we have formal languages and programming languages, and why things like COBOL failed: https://alexalejandre.com/languages/end-of-programming-langs...
Most languages do.
"x = true, x = false"
What does that mean? It's unclear. It looks contradictory.
Human language allows for clarification to be sought and adjustments made.
> besides not being checkable.
It's very checkable. I check claims and assertions people make all the time.
> That's literally why we have formal languages,
"Formal languages" are at some point specified and defined by human language.
Human language can be as precise, clear, and logical as a speaker intends. All the way to specifying "formal" systems.
> programming languages and things like COBOL failed: https://alexalejandre.com/languages/end-of-programming-langs...
https://pauseai.info/pdoom
Top HN comments sometimes read like a random generator:
return random_criticism_of_ai_companies() + " " + unrelated_trivia_fact()
Instead, we're getting a clear division of labor where the most sensitive agentic behavior is reserved for humans and the AIs become a form of cognitive augmentation of human agency. This is always the most likely outcome and the best we can hope for, as it precludes dangerous types of AI from emerging.
Takeoff is here, human-in-the-loop assisted for now… hopefully for much longer.
Is the technology continuing to be more applicable?
Is the way the technology is continuing to be more applicable leading to frameworks of usage that could lead to the next leap? :)
Some frameworks/languages move really fast unfortunately.
The clever part is that the markdown file has a section in it like this: https://github.com/datasette/skill/blob/a63d8a2ddac9db8225ee...
On startup, Claude Code / Codex CLI etc. scan all available skills folders and extract just those descriptions into the context. Then, if you ask them to do something that's covered by a skill, they read the rest of that markdown file on demand before going ahead with the task. The models are really good at driving those environments now, which makes skills the right idea at the right time.
But yes. Other agent platforms will adopt this pattern.
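As a rough sketch of the shape being described, assuming the frontmatter convention Anthropic's skills use (a `name` plus a `description`); the linked datasette skill may differ in its details:

```markdown
---
name: example-skill
description: One or two sentences the agent always sees, saying what this skill does and when to use it.
---

# Example skill

Everything below the frontmatter is the body. It stays on disk and is only
read on demand, once the agent decides the description matches the task.

## Steps
1. ...
2. ...
```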
I find it powerful how it can leverage a CLI and self-discover the best way to use it and its parameters to achieve its goals.
It feels more powerful than providing a pre-defined set of functions via MCP, which has less flexibility than a CLI.
"Skills require the Code Execution Tool beta, which provides the secure environment they need to run."
https://claude.com/blog/skills
It is useful in a user-education sense to communicate that it's good to actively document useful procedures like this, and it is likely a performance / utilization boost that the models are tuned or prompt-steered toward discovering this stuff in a conventional location.
But honestly reading about skills mostly feels like reading:
> # LLM provider has adopted a new paradigm: prompts
> What's a prompt?
> You tell the LLM what you'd like to do, and it tries to do it. OR, you could ask the LLM a question and it will answer to the best of its ability.
Obviously I'm missing something.
Maybe I still don't understand the mechanics - this happens "on startup", every time a new conversation starts? Models go through the trouble of doing ls/cat/extraction of descriptions to bring into context? If so it's happening lightning fast and I somehow don't notice.
Why not just include those descriptions within some level of system prompt?
Reading a few dozen files takes on the order of a few ms. They add enough tokens per skill to fit the metadata description, so probably less than 100 for each skill.
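As a rough worked example under those numbers: a few dozen skills at under 100 tokens of metadata each comes to only a few thousand tokens of always-present context, a small fraction of a modern agent's context window.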
> The body can contain any Markdown; it is not injected into context.
Does that just mean it's not injected into the context until the skill is used, or that it's never injected into the context?
https://github.com/openai/codex/blob/main/docs/skills.md
I had thought that once the skill is selected the whole file would be read, but it looks like that's not the case: https://github.com/openai/codex/blob/ad7b9d63c326d5c92049abd...
So you could have a skill file that's thousands of lines long, but if the first part of the file provides an outline, Codex may stop reading at that point. Maybe you could have a skill that says "see migrations section further down if you need to alter the database table schema" or similar.

Reason I ask is because a while back I had similar sections in my CLAUDE.md and it would sometimes either acknowledge them and not use them, or just ignore them. I'm assuming that's more of an issue of too much context, and now skill-level files like this will reduce that effect?
Skills are nice because they offload all the detailed prompts to files that the LLM can ask for. It's getting even better with Anthropic's recent switchboard operator (tool search tool) that doesn't clutter the system prompt but tries to cut the tool list down to those the LLM will need.
There's an instruction about that in the Codex CLI skills prompt: https://simonwillison.net/2025/Dec/13/openai-codex-cli/
Can those markdown files in the references also, in turn, tell the model to lazily load more references only if the model deems them useful?
I don’t know what this is and Google isn’t finding anything. Can you clarify?
https://www.anthropic.com/engineering/advanced-tool-use talks more about the why
You can hack together a shell, Python, whatever script that fetches build results from your CI server, dumps them to stdout in a semi-structured format like markdown, then add a 10-15 line SKILL.md and you have the same functionality -- the skill just executes the one-off script and reads the output.
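A rough sketch of that pattern, assuming a hypothetical CI endpoint, token, and script path (swap in your own server's API); the SKILL.md just tells the agent when to run the script and what to do with its output:

```python
#!/usr/bin/env python3
"""Fetch recent build results from a CI server and print them as markdown.

CI_URL and CI_TOKEN are placeholders for your own CI server's API; the
response is assumed to be a JSON list of build objects.
"""
import json
import os
import urllib.request

CI_URL = os.environ.get("CI_URL", "https://ci.example.com/api/builds?limit=10")
CI_TOKEN = os.environ.get("CI_TOKEN", "")

req = urllib.request.Request(CI_URL, headers={"Authorization": f"Bearer {CI_TOKEN}"})
with urllib.request.urlopen(req) as resp:
    builds = json.load(resp)

# Dump a semi-structured markdown table the agent can read from stdout.
print("| build | branch | status |")
print("|-------|--------|--------|")
for b in builds:
    print(f"| {b.get('id')} | {b.get('branch')} | {b.get('status')} |")
```

```markdown
---
name: ci-results
description: Check recent CI build results. Use when the user asks whether the build is green or why it failed.
---

Run `python scripts/fetch_ci_results.py` and read the markdown table it
prints to stdout. Summarise any failures and relate them to recent commits.
```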
It’s straightforward for cloud services
Now SKILL.md can have references to more fine-grained behaviors or capabilities of our skill. My skills generally tend to have a references/{workflows,tools,standards,testing-guide,routing,api-integration}.md. These references are what then get "progressively loaded" into the context.
Say I asked claude to use the wireframe-skill to create profileView mockup. While creating the wireframe, claude will need to figure out what API endpoints are available/relevant for the profileView and the response types etc. It's at this point that claude reads the references/api-integration.md file from the wireframe skill.
After a while I found I didn't like the progressive loading so I usually direct claude to load all references in the skill before proceeding - this usually takes up maybe 20k to 30k tokens, but the accuracy and precision (imagined or otherwise ha!) is worth it for my use cases.
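Roughly, the layout being described looks something like this (assuming Claude Code's project-level skills directory; the reference file names are the commenter's examples, not a required structure):

```
.claude/skills/wireframe-skill/
├── SKILL.md                   # frontmatter + high-level instructions
└── references/
    ├── workflows.md
    ├── tools.md
    ├── standards.md
    ├── testing-guide.md
    ├── routing.md
    └── api-integration.md     # read only when API details are needed
```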
You shouldn't do this, it's generally considered bad practice.
You should be optimizing your skill description. Oftentimes, if I am working with Claude Code and it doesn't load a skill, I ask it why it missed the skill. It will guide me to improving the skill description so that it is picked up properly next time.
This iteration on skill descriptions has, for me so far, allowed skills to stay out of context until they are needed, rather predictably.
Maybe they get compacted out of the context.
But you can call upon them manually. I often do something like “using your Image Manipulation skill, make the icons from image.png”
Or “use your web design skill to create a design for the front end”
Tbh i do like that.
I also get Claude to write its own skills. "Using what we learned from this task, write a skill document called /whatever/ using your writing-skills skill"
I have a GitHub template including my skills and commands, if you want to see them.
https://github.com/lawless-m/claude-skills
Just like you I don't edit much in these files on my own. Mostly I just ask the model to update an md file whenever I think we've figured out something new, so the learning sticks. I have files for test writing, backend route writing, db migration writing, frontend component writing, etc. Whenever a section gets too big to live in agents.md it gets its own file.
I have mine in a GitHub template so I can even use them in Claude Code for the web, and synchronise them across my various machines (which is about 6 machines atm).
But think of your dad or grandma using a generic agent, and simply selecting that they want to have certain skills available to it. Don't even think of it as a chat interface. This is just some option that they set in their phone assistant app. Or, rather, it may be that they actually selected "Determine the best skills based on context", and the assistant has "skill packs" which it periodically determines it needs to enable based on key moments in the conversation or latest interactions.
These are all workarounds for the problems of learning, memory...and, ultimately, limited context. But they for sure will be extremely useful.
So when it's time to commit, make sure you run these checks, write a good commit message, etc.
Debugging is especially useful since AI agents can often go off the rails and go into loops rewriting code - so it's in a skill where I can push for "read the log messages. Insert some more useful debug assertions to isolate the failure. Write some more unit tests that are more specific." Etc.
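A hedged sketch of how those debugging instructions could be captured in the same SKILL.md shape as above (the wording is illustrative, not the commenter's actual file):

```markdown
---
name: debugging
description: Systematic debugging workflow. Use when a test fails, the cause of a bug is unclear, or the agent starts looping on rewrites.
---

1. Read the log messages before changing any code.
2. Insert some more useful debug assertions to isolate the failure.
3. Write more specific unit tests around the suspected code path.
4. Only then propose a fix, and re-run the checks before committing.
```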
160 more comments available on Hacker News