Weight-Sparse Transformers Have Interpretable Circuits [pdf]
Posted about 2 months ago · Active about 2 months ago
cdn.openai.com · Research · story
Sentiment: calm, positive
Debate: 0/100
Key topics
- AI Interpretability
- Transformer Models
- Neural Networks
A research paper presents findings on weight-sparse transformers having interpretable circuits, potentially advancing AI interpretability. The lack of comments suggests a calm reception or limited engagement.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
- First comment: 8d after posting
- Peak period: 3 comments in 192-204h
- Avg / period: 3 comments
- Comment distribution: 3 data points (based on 3 loaded comments)
Key moments
1. Story posted: Nov 14, 2025 at 8:08 AM EST (about 2 months ago)
2. First comment: Nov 22, 2025 at 3:29 PM EST (8d after posting)
3. Peak activity: 3 comments in 192-204h (hottest window of the conversation)
4. Latest activity: Nov 22, 2025 at 5:49 PM EST (about 2 months ago)
ID: 45926371 · Type: story · Last synced: 11/23/2025, 12:07:04 AM
Want the full context? Read the primary article or dive into the live Hacker News thread when you're ready.
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning – https://arxiv.org/pdf/2505.17117 (LeCun/Jurafsky)
> Large Language Models (LLMs) demonstrate striking linguistic capabilities that suggest semantic understanding (Singh et al., 2024; Li et al., 2024). Yet, a critical question remains unanswered: Do LLMs navigate the compression-meaning trade-off similarly to humans, or do they employ fundamentally different representational strategies? This question matters because true understanding, which goes beyond surface-level mimicry, requires representations that balance statistical efficiency with semantic richness (Tversky, 1977; Rosch, 1973b).
> To address this question, we apply Rate-Distortion Theory (Shannon, 1948) and Information Bottleneck principles (Tishby et al., 2000) to systematically compare LLM and human conceptual structures. We digitize and release seminal cognitive psychology datasets (Rosch, 1973b; 1975; McCloskey & Glucksberg, 1978), which are foundational studies that shaped our understanding of human categorization but were previously unavailable in a machine-readable form. These benchmarks, comprising 1,049 items across 34 categories with both membership and typicality ratings, offer unprecedented empirical grounding for evaluating whether LLMs truly understand concepts as humans do. It also offers much better quality data than the current crowdsourcing paradigm.
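For concreteness, here is a toy version of that compression-vs-meaning trade-off for a clustering of item embeddings. The function name, the entropy-based rate term and the centroid-distance distortion are my simplifications, not Shani et al.'s exact objective:

```python
# Toy sketch of a compression-vs-meaning score for a clustering of item
# embeddings, in the spirit of the Rate-Distortion / Information Bottleneck
# framing in "From Tokens to Thoughts". The paper's actual objective differs;
# this only illustrates the trade-off being measured.
import numpy as np

def ib_style_score(embeddings, cluster_ids, beta=1.0):
    """Lower is 'more compressed'; beta trades rate against distortion."""
    embeddings = np.asarray(embeddings)
    cluster_ids = np.asarray(cluster_ids)
    items = len(embeddings)
    clusters = np.unique(cluster_ids)
    # Rate term: for a deterministic assignment of uniformly weighted items,
    # I(X; C) reduces to the entropy of the cluster-size distribution.
    sizes = np.array([(cluster_ids == c).sum() for c in clusters]) / items
    rate = -(sizes * np.log2(sizes)).sum()
    # Distortion term: mean squared distance of each item to its cluster
    # centroid, a stand-in for the semantic detail lost by summarizing items
    # as clusters (typicality gradients, atypical members, etc.).
    distortion = 0.0
    for c in clusters:
        members = embeddings[cluster_ids == c]
        distortion += ((members - members.mean(axis=0)) ** 2).sum()
    distortion /= items
    return rate + beta * distortion

# Aggressive compression (few, tight clusters) lowers the rate but raises
# distortion; human-like, messier categories pay more bits to keep structure.
```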
From typicality tests in the paper above, we can jump to:
The Guppy Effect as Interference – https://arxiv.org/abs/1208.2362
> One can refer to the situation wherein people estimate the typicality of an exemplar of the concept combination as more extreme than it is for one of the constituent concepts in a conjunctive combination as overextension. One can refer to the situation wherein people estimate the typicality of the exemplar for the concept conjunction as higher than that of both constituent concepts as double overextension. We posit that overextension is not a violation of the classical logic of conjunction, but that it signals the emergence of a whole new concept. The aim of this paper is to model the Guppy Effect as an interference effect using a mathematical representation in a complex Hilbert space and the formalism of quantum theory to represent states and calculate probabilities. This builds on previous work that shows that Bell Inequalities are violated by concepts [7, 8] and in particular by concept combinations that exhibit the Guppy Effect [1, 2, 3, 9, 10], and add to the investigation of other approaches using interference effects in cognition [11, 12, 13].
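A toy numerical version of that interference mechanism, assuming the standard Hilbert-space setup: the vectors and projector below are invented for illustration, not Aerts' fitted model.

```python
# Toy illustration of the interference account of the Guppy effect: typicality
# of a concept combination can exceed that of both constituents (double
# overextension) via the interference term, without breaking probability rules.
import numpy as np

def typicality(psi, M):
    """Quantum typicality/membership weight <psi| M |psi>."""
    return float(np.real(psi.conj() @ M @ psi))

# Projector M onto the "typical exemplar" subspace (here: the first two axes of C^3).
M = np.diag([1.0, 1.0, 0.0]).astype(complex)

# Orthonormal states for the constituent concepts "Pet" and "Fish" (made up).
psi_pet  = np.array([0.80, 0.0000, 0.60], dtype=complex)
psi_fish = np.array([0.45, 0.6614, -0.60], dtype=complex)
psi_fish /= np.linalg.norm(psi_fish)   # normalize; orthogonal to psi_pet by construction

mu_pet, mu_fish = typicality(psi_pet, M), typicality(psi_fish, M)

# Superposed state for the combined concept "Pet-Fish".
psi_combo = (psi_pet + psi_fish) / np.sqrt(2)
mu_combo = typicality(psi_combo, M)

# mu_combo = (mu_pet + mu_fish)/2 + Re<psi_pet|M|psi_fish>; here the
# interference term lifts the combination above both constituents.
print(round(mu_pet, 3), round(mu_fish, 3), round(mu_combo, 3))  # ~0.64, ~0.64, ~1.0
```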
And from quantum interference effects, to:
Quantum-like contextuality in large language models – https://royalsocietypublishing.org/doi/epdf/10.1098/rspa.202...
> This paper provides the first large-scale experimental evidence for contextuality in the large language model BERT. We constructed a linguistic schema modelled over a contextual quantum scenario, instantiated it in the Simple English Wikipedia, and extracted probability distributions for the instances. This led to the discovery of sheaf-contextual and CbD contextual instances. We prove that these contextual instances arise from semantically similar words by deriving an equation that relates degrees of contextuality to the Euclidean distance of BERT’s embedding vectors.
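The underlying test is simple to state: an empirical model is non-contextual exactly when a single global distribution over all variables reproduces every context's probability table. A minimal feasibility check along those lines, run here on the textbook PR-box rather than on anything extracted from BERT:

```python
# Minimal sheaf-style contextuality check: search for a global distribution
# whose marginals match each context's table. Infeasible => contextual.
import itertools
import numpy as np
from scipy.optimize import linprog

variables = ["a0", "a1", "b0", "b1"]
contexts = [("a0", "b0"), ("a0", "b1"), ("a1", "b0"), ("a1", "b1")]

# PR-box: outcomes perfectly correlated in every context except (a1, b1),
# where they are perfectly anti-correlated. A standard contextual example.
def pr_box(ctx, x, y):
    if ctx == ("a1", "b1"):
        return 0.5 if x != y else 0.0
    return 0.5 if x == y else 0.0

# Decision variables: one probability per global assignment of all 4 variables.
assignments = list(itertools.product([0, 1], repeat=len(variables)))
A_eq, b_eq = [], []
for ctx in contexts:
    idx = [variables.index(v) for v in ctx]
    for outcome in itertools.product([0, 1], repeat=2):
        # Marginal of the global distribution on this context must match its table.
        row = [1.0 if tuple(g[i] for i in idx) == outcome else 0.0 for g in assignments]
        A_eq.append(row)
        b_eq.append(pr_box(ctx, *outcome))
A_eq.append([1.0] * len(assignments))  # probabilities sum to 1
b_eq.append(1.0)

res = linprog(c=np.zeros(len(assignments)), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * len(assignments), method="highs")
print("global distribution exists:", res.success)  # False -> contextual
```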
How can large language models become more human – https://discovery.ucl.ac.uk/id/eprint/10196296/1/2024.cmcl-1...
> Psycholinguistic experiments reveal that efficiency of human language use is founded on predictions at both syntactic and lexical levels. Previous models of human prediction exploiting LLMs have used an information theoretic measure called surprisal, with success on naturalistic text in a wide variety of languages, but under-performance on challenging text such as garden path sentences. This paper introduces a novel framework that combines the lexical predictions of an LLM with the syntactic structures provided by a dependency parser. The framework gives rise to an Incompatibility Fraction. When tested on two garden path datasets, it correlated well with human reading times, distinguished between easy and hard garden path, and outperformed surprisal.
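One plausible reading of that construction, with `lm_next_token_dist` and `parser_accepts` as hypothetical stand-ins for the paper's actual LLM and dependency-parser components:

```python
# Hypothetical sketch of an Incompatibility Fraction: the share of the LM's
# next-token probability mass that cannot be attached to the current partial
# dependency structure. Not the paper's implementation.
def incompatibility_fraction(prefix_tokens, lm_next_token_dist, parser_accepts, top_k=100):
    """Return the probability mass on top-k continuations the parser rejects."""
    dist = lm_next_token_dist(prefix_tokens)               # {token: probability}
    top = sorted(dist.items(), key=lambda kv: -kv[1])[:top_k]
    mass_considered = sum(p for _, p in top)
    incompatible = sum(p for tok, p in top
                       if not parser_accepts(prefix_tokens + [tok]))
    return incompatible / mass_considered if mass_considered else 0.0

# Intuition: a garden-path prefix concentrates mass on continuations that force
# re-analysis of the parse, so the fraction spikes where human reading slows down.
```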
If you take that seriously, you end up with a different "surprise" objective: not just "this token was unlikely", but "this token forced a big update of my latent structure". In information-theoretic terms, the distortion term in a Rate–Distortion / Information Bottleneck objective stops being pure log-loss and starts to look like a backtracking cost on your semantic/structural state.
Now look at Shani et al.’s From Tokens to Thoughts paper: they compare LLM embeddings to classic human typicality/membership data (Rosch, Hampton, etc.) using RDT/IB, and show that LLMs sit in a regime of aggressive compression: broad categories line up with humans, but fine-grained typicality and "weird" members get squashed. Humans, by contrast, keep higher-entropy, messier categories – they "waste bits" to preserve contextual nuance and prototype structure.
Quantum cognition folks like Aerts have been arguing for years that this messiness is not a bug: phenomena like the Guppy effect (where "guppy" is a so-so Pet and a so-so Fish but a very typical Pet-Fish) are better modelled as interference in a Hilbert space, i.e. as emergent concepts rather than classical intersections. Lo et al. then show that large LMs (BERT) already exhibit quantum-like contextuality in their probability distributions: thousands of sheaf-contextual and tens of millions of CbD-contextual instances, with the degree of contextuality tightly related to embedding distances between competing words.
Put those together and you get an interesting picture:
Current LMs do live in a contextual / interference-ish regime at the probabilistic level, but their embedding spaces are still optimized for pointwise predictive compression, not for minimizing re-interpretation cost over time.
If you instead trained them under a "surprise = prediction error + structural backtracking cost" objective (something like log-loss + sheaf incompatibility over parses/meanings), the optimal representations wouldn’t be maximally compressed clusters. They’d be the ones that make structural updates cheap: more typed, factorized, role-sensitive latent spaces where meaning is explicitly organized for recomposition rather than for squeezing out every last bit of predictive efficiency.
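As a rough sketch of what such an objective could look like in training code (the weighting and the structural-cost term are placeholders, not anything proposed in these papers):

```python
# Sketch of a "surprise = prediction error + structural backtracking" loss.
# `structural_backtracking_cost` is a placeholder per-token signal (e.g. an
# Incompatibility Fraction or a sheaf-inconsistency score); nothing here is an
# existing API or a published training recipe.
import torch
import torch.nn.functional as F

def surprise_loss(logits, targets, structural_backtracking_cost, lam=0.1):
    """
    logits:  (batch, seq_len, vocab) next-token predictions
    targets: (batch, seq_len) gold token ids
    structural_backtracking_cost: (batch, seq_len) cost of revising the latent
        parse/meaning when this token arrives, in [0, 1]
    """
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1), reduction="none")
    ce = ce.reshape(targets.shape)
    # "Surprise" = log-loss + weighted cost of re-interpreting structure.
    return (ce + lam * structural_backtracking_cost).mean()
```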
That’s exactly the intuition behind DisCoCat / categorical compositional distributional semantics: you force grammar and semantics to share a compact closed category, treat sentence meaning as a tensor contraction over typed word vectors, and design the embedding spaces so that composition is a simple linear map. You’re trading off fine-grained, context-specific "this token in this situation" information for a geometry that makes it cheap to build and rebuild structured meanings.
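The standard toy example of that composition: a transitive verb as an order-3 tensor, with sentence meaning obtained by contracting subject and object vectors into it. Dimensions and values below are made up.

```python
# Toy DisCoCat-style composition: nouns are vectors in N, a transitive verb is
# a tensor of type N (x) S (x) N, and the sentence meaning is the contraction
# of subject and object into the verb tensor, landing in the sentence space S.
import numpy as np

noun_dim, sentence_dim = 4, 3
rng = np.random.default_rng(0)

dogs  = rng.random(noun_dim)
cats  = rng.random(noun_dim)
chase = rng.random((noun_dim, sentence_dim, noun_dim))

def transitive_sentence(subj, verb, obj):
    """Contract subject and object into the verb tensor: a vector in S."""
    return np.einsum("i,isj,j->s", subj, verb, obj)

print(transitive_sentence(dogs, chase, cats))   # meaning of "dogs chase cats"
print(transitive_sentence(cats, chase, dogs))   # generally different: "cats chase dogs"
# Composition is just typed multilinear algebra, so building (or rebuilding,
# after a revision) a structured meaning is a cheap, well-typed operation.
```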
Wang et al.’s Incompatibility Fraction is basically a first step toward such an objective, Shani et al. quantify how far LMs are from the "human" point on the compression–meaning trade-off, Aerts/Lo show that both humans and LMs already live in a quantum/contextual regime, and DisCoCat gives a concrete target for what "structured, recomposable embeddings" could look like. If we ever switch from optimizing pure cross-entropy to "how painful is it to revise my world-model when this token arrives?", I’d expect the learned representations to move away from super-compact clusters and towards something much closer to those typed, compositional spaces.
https://www.lesswrong.com/posts/PkeB4TLxgaNnSmddg/scaling-sp...
In both cases, the goal is to actually learn a concrete circuit inside a network that solves specific Python next-token prediction tasks. We each end up with a crisp wiring diagram saying “these are the channels/neurons/heads that implement this particular bit of Python reasoning.”
Both projects cast circuit discovery as a gradient-based selection problem over a fixed base model. We train a mask that picks out a sparse subset of computational nodes as “the circuit,” while the rest are ablated. Their work learns masks over a weight-sparse transformer; ours learns masks over SAE latents and residual channels. But in both cases, the key move is the same: use gradients to optimize which nodes are included, rather than relying purely on heuristic search or attribution patching. Both approaches also use a gradual hardening schedule (continuous masks that are annealed or sharpened over time) so that we can keep gradients useful early on, then spend extra compute to push the mask towards a discrete, minimal circuit that still reproduces the model’s behavior.
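A minimal sketch of that recipe, assuming a sigmoid/temperature parameterization of the mask; the KL faithfulness term and the sparsity weight are illustrative choices, not either project's exact setup:

```python
# Gradient-based circuit selection with a gradual hardening schedule: a
# continuous mask over nodes (channels / SAE latents / neurons) is trained to
# preserve behavior while being sparse, then annealed toward a discrete circuit.
import torch

class NodeMask(torch.nn.Module):
    def __init__(self, num_nodes):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(num_nodes))

    def forward(self, activations, temperature):
        # High temperature keeps the mask soft (useful gradients early on);
        # annealing temperature toward 0 pushes it to a hard 0/1 selection.
        mask = torch.sigmoid(self.logits / temperature)
        return activations * mask, mask

def circuit_loss(masked_logits, full_logits, mask, sparsity_weight=1e-3):
    # Keep the pruned computation's output close to the full model's while
    # penalizing the number of included nodes.
    faithfulness = torch.nn.functional.kl_div(
        masked_logits.log_softmax(-1), full_logits.softmax(-1),
        reduction="batchmean")
    return faithfulness + sparsity_weight * mask.sum()

# Training-loop idea: anneal temperature from ~1.0 toward ~0.01 over steps,
# then threshold the mask at 0.5 to read off a discrete, minimal circuit.
```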
The similarities extend to how we validate and stress-test the resulting circuits. In both projects, we drill down enough to notice “bugs” or quirks in the learned mechanism and to deliberately break it: by making simple, semantically small edits to the Python source, we can systematically cause the pruned circuit to fail and those failures generalize to the unpruned network. That gives us some confidence that we’re genuinely capturing the specific mechanism the model is using.
33 more comments available on Hacker News