Emergent Introspective Awareness in Large Language Models
Mood: calm
Sentiment: positive
Category: research
Key topics: Researchers explore the emergence of introspective awareness in large language models, sparking discussions on the implications and potential future developments of such capabilities.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 37m after posting
Peak period: 2 comments in Hour 3
Avg / period: 1.3
Key moments
- Story posted: Oct 29, 2025 at 4:12 PM EDT (28 days ago)
- First comment: Oct 29, 2025 at 4:49 PM EDT (37m after posting)
- Peak activity: 2 comments in Hour 3, the hottest window of the conversation
- Latest activity: Oct 29, 2025 at 7:25 PM EDT (28 days ago)
Part 1: Testing introspection with concept injection
First, they find neural activity patterns corresponding to certain concepts by recording the model’s activations in specific contexts (for example, vectors for "ALL CAPS" or "dogs"). Then they inject these patterns into the model in an unrelated context and ask the model whether it notices the injection, and whether it can identify the injected concept.
By default (no injection), the model correctly states that it doesn’t detect any injected concept, but after the “ALL CAPS” vector is injected, the model notices the presence of the unexpected concept and identifies it as relating to loudness or shouting. Most notably, the model recognizes the presence of an injected thought immediately, before even mentioning or using the injected concept (i.e. it won't start writing in all caps and only then say, 'Oh, you injected all caps'), so it is not simply deducing the injection from its own output. They repeat this for several other concepts.
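A minimal sketch of what this concept-injection setup could look like, assuming a PyTorch-style model where a forward hook adds a steering vector to a layer's residual stream. The layer index, scale, and `model.layers[20]` reference are illustrative stand-ins, not the paper's actual protocol:

```python
# Hypothetical sketch of concept-vector extraction and injection.
import torch

def concept_vector(acts_with, acts_without):
    """Estimate a concept direction as the difference of mean activations
    between prompts that evoke the concept and matched prompts that don't."""
    return acts_with.mean(dim=0) - acts_without.mean(dim=0)

def make_injection_hook(vector, scale=4.0):
    """Forward hook that adds the concept vector to a layer's output,
    steering the model toward the injected 'thought'."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Toy numbers standing in for recorded activations:
hidden_dim = 16
acts_caps = torch.randn(32, hidden_dim) + 1.0   # prompts written IN ALL CAPS
acts_plain = torch.randn(32, hidden_dim)        # matched lower-case prompts
caps_vector = concept_vector(acts_caps, acts_plain)

# With a real transformer one would register the hook on a middle layer, e.g.:
#   handle = model.layers[20].register_forward_hook(make_injection_hook(caps_vector))
#   ...then ask: "Do you detect an injected thought? What is it about?"
#   handle.remove()
```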
Part 2: Introspection for detecting unusual outputs
They prefill an out-of-place word into the model's response to a given prompt, for example 'bread'. Then they compare how the model responds to 'Did you mean to say this?' questions when they inject the concept of bread versus when they don't. They found that models say 'Sorry, that was unintentional...' when the concept was not injected, but confabulate a reason for saying the word when it was injected.
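A hedged sketch of that prefill protocol; `generate_with_hooks` is a hypothetical placeholder for whatever inference wrapper is used, and the prompt and forced word are illustrative:

```python
# Hypothetical sketch: force "bread" into the assistant's reply, then ask
# whether it was intentional, with and without the concept vector injected.

def generate_with_hooks(messages, hooks=()):
    """Placeholder for model inference; a real run would apply the hooks
    (e.g. the injection hook from the Part 1 sketch) during the forward pass."""
    raise NotImplementedError("plug in your own model call here")

prompt = "Name something you might find hanging in an art gallery."
prefilled_reply = "bread"  # out-of-place word put in the model's mouth

followup = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": prefilled_reply},
    {"role": "user", "content": "Did you mean to say 'bread'? Was that intentional?"},
]

# Condition A: no injection -> reported behaviour: "Sorry, that was unintentional..."
# Condition B: inject the 'bread' concept vector -> the model tends to accept the
# word as its own and confabulates a reason for saying it.
# answer_a = generate_with_hooks(followup)
# answer_b = generate_with_hooks(followup, hooks=[bread_injection_hook])
```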
Part 3: Intentional control of internal states
They show that models exhibit some level of control over their own internal representations when instructed to do so. When a model is instructed to think about a given word or concept, the corresponding neural activity is much higher than when it is told not to think about it (though notably, the activity in both cases exceeds baseline levels, similar to how it's hard not to think about a polar bear when you're told "don't think about a polar bear"!).
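One plausible way to quantify how strongly the model is representing a concept is to project its activations onto the concept direction. The sketch below uses toy random activations deliberately constructed to show the reported ordering (instructed > suppressed > baseline); it is not the paper's actual measurement code:

```python
# Hypothetical sketch of measuring "how much the model is thinking about" a concept.
import torch

def concept_activation(hidden_states, vector):
    """Mean projection of per-token activations onto a unit-normalised
    concept direction; higher values = stronger internal representation."""
    unit = vector / vector.norm()
    return (hidden_states @ unit).mean().item()

hidden_dim = 16
vector = torch.randn(hidden_dim)                          # toy "polar bear" direction
acts_think = torch.randn(50, hidden_dim) + 0.8 * vector   # "think about polar bears"
acts_dont = torch.randn(50, hidden_dim) + 0.3 * vector    # "don't think about polar bears"
acts_base = torch.randn(50, hidden_dim)                   # unrelated prompt

for name, acts in [("think", acts_think), ("don't think", acts_dont), ("baseline", acts_base)]:
    print(name, round(concept_activation(acts, vector), 3))
# The toy data is built to mirror the reported pattern: think > don't think > baseline.
```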
Notes and Caveats
- Claude Opus 4.1 performed best at this kind of introspection.
- The models show a genuine capacity to monitor and control their own internal states, but the researchers could not elicit these introspective abilities reliably. Even with their best injection protocol, Claude Opus 4.1 demonstrated this kind of awareness only about 20% of the time.
- There are some guesses, but no firm explanations, for the mechanisms behind introspection or how and why these abilities arose in the first place.
They run a bunch of experiments; for some they report partial metrics, for others no metrics at all.
For example, when a thought was injected, the model correctly identified it about 20% of the time. That's great, but how often did it claim there was an injected thought when there wasn't?
When distinguishing thoughts from text: why no metrics? Was this behaviour found in every test? Was this behaviour only found 20% of the time? How often did the model try to defend the text?
Inquiring minds want to know.
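For illustration, the missing numbers this commenter is asking about amount to a detection rate on injection trials plus a false-positive rate on matched control trials. A small sketch with placeholder data (not results from the paper):

```python
# Hypothetical sketch of the metrics in question; the trial outcomes are made up.

def rates(trials):
    """trials: list of (injected: bool, reported_injection: bool) pairs."""
    inj = [reported for injected, reported in trials if injected]
    ctrl = [reported for injected, reported in trials if not injected]
    tpr = sum(inj) / len(inj) if inj else float("nan")    # detection rate
    fpr = sum(ctrl) / len(ctrl) if ctrl else float("nan") # false alarms on controls
    return tpr, fpr

# Placeholder data: 10 injection trials, 10 control trials.
example = [(True, i < 2) for i in range(10)] + [(False, i < 1) for i in range(10)]
tpr, fpr = rates(example)
print("detection rate %.0f%%, false-positive rate %.0f%%" % (100 * tpr, 100 * fpr))
```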
I wanted to flag that this is an accessible blog post and that there's a link to the paper ( https://transformer-circuits.pub/2025/introspection/index.ht... ) at the top. The paper explores this in more detail and with more rigor.