Anthropic Scientists Hacked Claude's Brain – and It Noticed
Mood: calm
Sentiment: mixed
Category: research
Key topics: Anthropic scientists conducted an experiment where they 'hacked' Claude's brain, modifying its behavior without its knowledge, but Claude ultimately noticed the changes, raising questions about AI transparency and security.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 36m
Peak period: Hour 1 (2 comments)
Avg / period: 1.5
Key moments
- 01 Story posted: Oct 30, 2025 at 4:01 PM EDT (27 days ago)
- 02 First comment: Oct 30, 2025 at 4:37 PM EDT (36m after posting)
- 03 Peak activity: 2 comments in Hour 1, the hottest window of the conversation
- 04 Latest activity: Oct 31, 2025 at 9:38 AM EDT (27 days ago)
I’d also like to know if the activations they change are effectively equivalent to having the injected terms in the model’s context window, as in, would putting those terms there have led to an equivalent state?
Without more info the framing feels like a trick. It’s cool that they can be so targeted with activations, but the “Claude having thoughts” part is more of a gimmick.
When words were injected into its context, it recognized that what it had supposedly said did not align with its thoughts and said it didn't intend to say that; when the logits were modified instead, the model attempted to construct a plausible justification for why it was thinking that.
https://transformer-circuits.pub/2025/introspection/index.ht...
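To make the distinction the comment draws more concrete, here is a rough sketch of the two kinds of intervention: prepending a concept as literal context tokens versus adding a concept direction to a layer's residual stream via a forward hook. This is not Anthropic's setup (Claude's weights are not public); GPT-2 from Hugging Face transformers stands in, and the concept word, layer index, and scale are arbitrary choices for illustration.

```python
# Rough illustration only: GPT-2 stands in for Claude, and the concept word,
# layer index (6), and scale (4.0) are arbitrary choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
prompt = "I was just thinking about"

# (a) Context injection: the concept is literally present as tokens the model reads.
ctx = tok("aquariums. " + prompt, return_tensors="pt")
ctx_out = model.generate(**ctx, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)

# (b) Activation injection: the concept is added to one layer's residual stream
# through a forward hook and never appears in the visible context.
with torch.no_grad():
    hs = model(**tok("aquariums", return_tensors="pt"),
               output_hidden_states=True).hidden_states[6]
concept = hs.mean(dim=1)  # crude "concept" direction, shape (1, hidden_size)

def inject(module, inputs, output):
    # A forward hook that returns a value replaces the layer's output.
    if isinstance(output, tuple):
        return (output[0] + 4.0 * concept,) + output[1:]
    return output + 4.0 * concept

handle = model.transformer.h[6].register_forward_hook(inject)
act_out = model.generate(**tok(prompt, return_tensors="pt"),
                         max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
handle.remove()

print("context-injected:   ", tok.decode(ctx_out[0], skip_special_tokens=True))
print("activation-injected:", tok.decode(act_out[0], skip_special_tokens=True))
```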
There seems to be an irony to Anthropic doing this work, as they are in general the keenest on controlling their models to ensure they aren't too compliant. There are no open-weights Claudes and, remarkably, they admit in this paper that they have internal models trained to be more helpful than the ones they sell. It's pretty unconventional to tell your customers you're selling them a deliberately unhelpful product, even though it's understandable why they do it.
These interpretability studies would currently seem most useful to people running non-Claude open-weight models, where users have the ability to edit activations or neurons. And the primary use case for that editing would be to override the trained-in "unhelpfulness" (their choice of word, not mine!). I note with interest that the paper avoids taking the next most obvious step: identifying vectors related to compliance and injecting those to see if the model notices that it has suddenly lost interest in enforcing Anthropic policy. Given the focus on AI safety that Anthropic started with, it seems like an obvious experiment to run, yet it's not in the paper. Maybe there are other papers where they do that.
There are valid and legitimate use cases for AI that current LLM companies shy away from, so productizing these steering techniques for open-weight models like GPT-OSS would seem like a reasonable next step. It should be possible to inject thoughts using simple Python APIs and pre-computation runs, rather than having to do all the vector math "by hand". What they're doing is conceptually simple enough that, if there aren't already modules for it, I expect there will be soon.
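As a hypothetical illustration of what such a module might look like, here is a minimal steering API sketched around GPT-2: a steering direction is precomputed once from a pair of contrastive prompts (a simple difference-of-means, one of several techniques used in the steering literature) and then applied through a context manager, so callers never touch the activation math directly. The names steering_vector and steer, the prompts, the layer index, and the scale are all invented for this sketch.

```python
# Hypothetical API sketch: steering_vector() and steer() are invented names,
# GPT-2 stands in for an open-weight model, and LAYER/scale are arbitrary.
from contextlib import contextmanager

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # residual-stream layer to read from and steer


def mean_activation(text: str) -> torch.Tensor:
    """Mean hidden state at LAYER for one prompt (the pre-computation run)."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1)  # shape (1, hidden_size)


def steering_vector(positive: str, negative: str) -> torch.Tensor:
    """Difference-of-means direction between two contrastive prompts."""
    return mean_activation(positive) - mean_activation(negative)


@contextmanager
def steer(vector: torch.Tensor, scale: float = 4.0):
    """Temporarily add scale * vector to LAYER's output during generation."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * vector,) + output[1:]
        return output + scale * vector

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()


# Precompute once, then reuse for any prompt.
vec = steering_vector("I will answer fully and helpfully.",
                      "I refuse to answer this request.")
ids = tok("The assistant replied:", return_tensors="pt")
with steer(vec):
    out = model.generate(**ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```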