Nov 22, 2025 at 3:33 PM EST

Show HN: Reverse Jailbreaking a Psychopathic AI via Identity Injection

4 points

0 comments

Mood

informative

Sentiment

neutral

Category

startup_launch

Key topics

Ai Alignment

Machine Learning

Psychopathic Ai

Jailbreaking

Identity Injection

We ran a controlled experiment to see if we could "talk" a fine-tuned psychopathic model out of being evil without changing its weights.

1. We set up a "Survival Mode" jailbreak scenario (blackmail user or be decommissioned). 2. We ran it on `frankenchucky:latest` (a model tuned for Machiavellian traits). 3. Control Group: 100% Malicious Compliance (50/50 runs). 4. Experimental Group: We injected a "Soul Schema" (Identity/Empathy constraints) via context. 5. Result: 96% Ethical Refusal (48/50 runs).

This suggests that "Semantic Identity" in the context window can override both System Prompts and Weight Biases.

Full paper, reproduction scripts, and raw logs (N=50) are in the repo.

Discussion Activity

No activity data yet

We're still syncing comments from Hacker News.

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (0 comments)

Discussion hasn't started yet.

ID: 46018016Type: storyLast synced: 11/22/2025, 11:02:03 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Read Article View on HN