OpenAI Has Trained Its LLM to Confess to Bad Behavior
Posted 28 days ago · Active 28 days ago
technologyreview.com · News story
informative · neutral
Debate: 60/100
Key topics
Bioethics
LLM Development
AI Transparency
Discussion Activity
Light discussion
First comment: 17s after posting
Peak period: 2 comments in 0-1h
Avg / period: 2
Key moments
- 01 Story posted: Dec 6, 2025 at 10:22 AM EST (28 days ago)
- 02 First comment: Dec 6, 2025 at 10:22 AM EST (17s after posting)
- 03 Peak activity: 2 comments in 0-1h, the hottest window of the conversation
- 04 Latest activity: Dec 6, 2025 at 10:53 AM EST (28 days ago)
ID: 46173998 · Type: story · Last synced: 12/6/2025, 3:35:11 PM
"Wants to"
I find I have a weird take on this kind of anthropomorphism... It doesn't bother me when people say a rock "wants to" roll downhill. But it does when someone says an LLM "wants to"... Equally, I'm not nearly as bothered by something like the LLM "tries to"... It's a strange rule set for correct communication...
Anyway, please forgive that preemptive tangent. My primary point is:
[ citation needed ]
I remember reading a paper praising GPT for being able to explain its decision-making process. That paper provided no evidence, no arguments, and no citations for this exceptionally wild claim. How is this not just a worse version of that claim? I ask that as a real question: so many people willingly believe and state that an LLM is able to (correctly) explain its decision-making process. How? Why isn't it better to assume that's just another hallucination, especially given that the claim would be non-falsifiable?