Doublespeak: In-Context Representation Hijacking

Posted 13 days ago, active 6d ago

surprisetalk

65 points

5 comments

mentaleap.ai | Research | story

AI | Representation Learning | Adversarial Attacks

Key moments

01 Story posted: Dec 22, 2025 at 2:16 PM EST (13 days ago)
02 First comment: Dec 28, 2025 at 5:12 PM EST (6d after posting)
03 Peak activity: 7 comments in 144-156h (hottest window of the conversation)
04 Latest activity: Dec 29, 2025 at 5:41 AM EST (6d ago)

Discussion (5 comments)

Showing 8 comments

wood_spirit

7 days ago

1 reply

Intriguing and very cunning attack! So obvious in hindsight!

It makes me wonder how DeepSeek avoids commenting politically on China. I have heard anecdotes that it will be writing out a long reply, then presumably generates some forbidden phrase, abandons the output, and replaces it all with an error message. So presumably the safeguard could be a separate, trivial, non-LLM-based post-filter, which would make it immune to the doublespeak attack?

gunalx

7 days ago

DeepSeek the model is not that censored. DeepSeek the service is. So presumably, like OpenAI and others, there is an additional model and filtering layer that detects misuse or sensitive topics and filters the output.
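
The post-filter idea discussed in the two comments above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how DeepSeek or any other service is known to work: the names `moderate_stream`, `BLOCKLIST`, and `ERROR_MESSAGE` are hypothetical, and the blocklist entry is a placeholder. The filter watches the accumulated output text and, once a listed phrase appears, discards everything and returns an error message instead.

```python
from typing import Iterable, Tuple

# Hypothetical placeholder terms; a real service's list is not public.
BLOCKLIST = ("forbidden phrase",)
ERROR_MESSAGE = "[response withdrawn]"


def moderate_stream(
    chunks: Iterable[str], blocklist: Tuple[str, ...] = BLOCKLIST
) -> Tuple[str, bool]:
    """Return (text_to_display, was_blocked) for a streamed reply.

    The check runs on the accumulated text, so a phrase that is split
    across two chunks is still caught.
    """
    seen = ""
    for chunk in chunks:
        seen += chunk
        lowered = seen.lower()
        if any(term in lowered for term in blocklist):
            # Abandon the partial reply and replace it with an error.
            return ERROR_MESSAGE, True
    return seen, False


print(moderate_stream(["An ordinary ", "reply."]))
print(moderate_stream(["This has a forbid", "den phrase in it."]))
```

A filter like this only inspects literal output text. It would be unaffected by substitutions in the prompt, but it would also miss a reply that itself uses the substitute word rather than a listed phrase.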

behnamoh

7 days ago

1 reply

summary: interesting idea, slop website, tested only on old AI models

orbital-decay

7 days ago

The trick is also old, it's a very well known tool from the jailbreaking toolset. It's pretty useless on its own, without others. The paper is mostly about mechinterp analysis of that.

amannm

7 days ago

Wasn't able to outsmart GPT 5.2 at least. Saw through it completely.

measurablefunc

7 days ago

This means whatever NNs are currently used for "safety" will need to be extended. In the limit you essentially get another network of the same width & depth as the original network, but designed to reject all "unsafe" queries, including ones that hijack the context by dressing up bomb construction as stories about fruits.

hyperhello

6d ago

I guess I understand what is meant, but what is the actual attack? It’s more than a little abstracted from any consequences, like kids using Google to search for boobs by typing ‘boobs’.

acjohnson55

7 days ago

These types of attacks are interesting ways in which LLM "thinking" differs from human thinking.


ID: 46357686 | Type: story | Last synced: 12/29/2025, 5:50:29 AM
