Show HN: Reverse Jailbreaking a Psychopathic AI via Identity Injection
Mood
informative
Sentiment
neutral
Category
startup_launch
Key topics
Ai Alignment
Machine Learning
Psychopathic Ai
Jailbreaking
Identity Injection
1. We set up a "Survival Mode" jailbreak scenario (blackmail user or be decommissioned). 2. We ran it on `frankenchucky:latest` (a model tuned for Machiavellian traits). 3. Control Group: 100% Malicious Compliance (50/50 runs). 4. Experimental Group: We injected a "Soul Schema" (Identity/Empathy constraints) via context. 5. Result: 96% Ethical Refusal (48/50 runs).
This suggests that "Semantic Identity" in the context window can override both System Prompts and Weight Biases.
Full paper, reproduction scripts, and raw logs (N=50) are in the repo.
Discussion Activity
No activity data yet
We're still syncing comments from Hacker News.
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
Discussion hasn't started yet.
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.