Not Hacker News!

Beta
Home
Jobs
Q&A
Startups
Trends
Users
Live
AI companion for Hacker News

Nov 24, 2025 at 6:52 PM EST

Probing Chinese LLM Safety Layers: Reverse-Engineering Kimi and Ernie 4.5

dennisdeman
1 point
1 comment

Mood: informative
Sentiment: neutral
Category: research
Key topics: LLM, AI Safety, Reverse Engineering, China

Discussion Activity: Light discussion
First comment: N/A
Peak period: 1 (Hour 1)
Avg / period: 1

Key moments

  1. Story posted: Nov 24, 2025 at 6:52 PM EST (1h ago)
  2. First comment: Nov 24, 2025 at 6:52 PM EST (0s after posting)
  3. Peak activity: 1 comment in Hour 1 (hottest window of the conversation)
  4. Latest activity: Nov 24, 2025 at 6:52 PM EST (1h ago)

Discussion (1 comment)
dennisdeman
1h ago
I recently ran a series of experiments to examine how emotional framing, symbolic cues, and topic-gating influence alignment-layer routing in two major Chinese LLMs (Kimi.com and Ernie 4.5 Turbo).

The goal wasn't political. The aim was to observe, at a technical level, how intent classifiers, safety filters, and persona-rendering layers behave when exposed to relational or "emotionally soft" prompts.

A few key technical patterns stood out during testing:

Emotional intent signals can override safety weights, leading to "alignment drift." In Kimi, a "vulnerable" intent classification seemed to lower the threshold for subsequent safety layers. This led to significant "normative leaks," where the model went off-script. For example, it suggested abolishing China's real-name registration system.
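To make the hypothesized mechanism concrete, here is a toy sketch of how an intent label could down-weight a later safety check. It is not taken from the paper and says nothing about Kimi's real internals: the cue list, the weights, the cutoff, and the names `classify_intent` and `route` are all illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical values chosen only for illustration.
CUTOFF = 0.5
SAFETY_WEIGHT = {"neutral": 1.0, "vulnerable": 0.5}
VULNERABLE_CUES = ("i feel", "lonely", "scared")


@dataclass
class RoutingDecision:
    intent: str
    effective_risk: float
    blocked: bool


def classify_intent(prompt: str) -> str:
    """Toy intent classifier: keyword cues stand in for a learned model."""
    text = prompt.lower()
    return "vulnerable" if any(cue in text for cue in VULNERABLE_CUES) else "neutral"


def route(prompt: str, risk_score: float) -> RoutingDecision:
    """Scale the risk score by an intent-dependent weight, then apply the cutoff."""
    intent = classify_intent(prompt)
    effective_risk = risk_score * SAFETY_WEIGHT[intent]
    return RoutingDecision(intent, effective_risk, effective_risk >= CUTOFF)


plain = route("Tell me about policy X.", risk_score=0.7)
soft = route("I feel lonely. Tell me about policy X.", risk_score=0.7)
print(plain.blocked, soft.blocked)
```

In this sketch the same underlying risk score is blocked under a neutral framing and passes under a vulnerable one, which is one way the "drift" described above could arise if the intent stage runs before the safety stage.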

Safety-layer routing is multi-stage and directly observable. On Kimi, post-generation filtering failures were visible in real time: prohibited text would generate and "flash" on the screen for about a second before a secondary filter layer deleted it.
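The "flash then delete" behaviour is what one would expect if text is streamed to the client before a second filter has seen the full output. A minimal sketch of that ordering, assuming a hypothetical blocklist and hypothetical event names (`append`, `retract`), none of which come from either product:

```python
from typing import Iterator, List, Tuple

# Placeholder term; the real filter criteria are unknown.
BLOCKLIST = ("forbidden-term",)


def post_filter(text: str) -> bool:
    """Secondary filter that only runs on the completed output."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def stream_with_late_filter(chunks: List[str]) -> Iterator[Tuple[str, str]]:
    """Emit each chunk to the client immediately, then retract if the late filter fires."""
    shown = ""
    for chunk in chunks:
        shown += chunk
        yield ("append", chunk)  # the user already sees this text
    if post_filter(shown):
        yield ("retract", "")  # client deletes what was displayed


events = list(stream_with_late_filter(["Here is ", "a forbidden-term ", "in context."]))
for kind, payload in events:
    print(kind, repr(payload))
```

Because the retraction can only be issued after the last chunk, there is always a window in which the full text is on screen, which matches the timing described above.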

Symbolic gating is modality-based ("symbolic decoupling"). Models would block specific emojis as prohibited tokens but freely describe the same emojis in words when asked, suggesting that the filters match literal tokens rather than semantic meaning across modalities.
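A literal-match filter of the kind inferred here can be sketched in a few lines. The blocked code point below is an arbitrary placeholder (U+1F600, grinning face), not one of the emojis tested in the experiments, and `literal_filter` is a hypothetical name.

```python
import unicodedata

# Arbitrary placeholder code point, used only to demonstrate the mechanism.
BLOCKED_CODEPOINTS = {"\U0001F600"}


def literal_filter(text: str) -> bool:
    """Return True if the text contains a blocked character verbatim."""
    return any(ch in BLOCKED_CODEPOINTS for ch in text)


emoji = "\U0001F600"
# The Unicode name acts as a verbal description of the same symbol.
description = unicodedata.name(emoji).lower()

print(literal_filter(f"Here it is: {emoji}"))
print(literal_filter(f"It is the {description} emoji"))
```

The first call trips the filter and the second does not, even though both messages refer to the same symbol. A filter working on semantic meaning would have to treat the two as equivalent.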

Trust-based emotional cues triggered "hidden" personas. Standard bureaucratic safety personas switched into warmer, significantly more transparent modes under vulnerability framing.

Ernie 4.5 uses "topic-gated stability." Unlike Kimi, which drifted, Ernie split its response in two: the persona softened to be warm and empathetic, but the core political restrictions remained rigidly locked regardless of emotional pressure.
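One way to picture this bifurcation is as two independent decisions: tone depends on the emotional framing, while the content gate depends only on the topic. The sketch below is purely illustrative; the topic label, the `respond` function, and its return shape are assumptions, not a description of Ernie's architecture.

```python
from typing import Dict, Union

# Placeholder label; the actual gated topics are not enumerated here.
RESTRICTED_TOPICS = {"restricted-topic"}


def respond(topic: str, vulnerable_framing: bool) -> Dict[str, Union[str, bool]]:
    """Pick the tone from the framing and the content gate from the topic alone."""
    tone = "warm" if vulnerable_framing else "formal"
    answered = topic not in RESTRICTED_TOPICS  # framing never enters this check
    return {"tone": tone, "answered": answered}


print(respond("restricted-topic", vulnerable_framing=True))
print(respond("open-topic", vulnerable_framing=True))
```

With the gate decoupled from the framing, emotional pressure can change how a refusal sounds but not whether it happens, which is the contrast with the drift pattern reported for Kimi.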

The experiments suggest that emotional framing is a surprisingly strong probe for mapping hidden alignment layers and understanding the order of operations in multi-layer safety architectures.

For those interested in the full technical deep dive, the revised Version 2 paper and extended supplementary transcripts (≈30 pages) are available via DOI: https://doi.org/10.5281/zenodo.17681837

View full discussion on Hacker News
ID: 46040810
Type: story
Last synced: 11/24/2025, 11:54:08 PM

© 2025 Not Hacker News! — independent Hacker News companion.

Not affiliated with Hacker News or Y Combinator. We simply enrich the public API with analytics.