Summation-Based Aggregation: A Simpler Alternative to Self-Attention
Posted 3 months ago · Active 3 months ago
Source: techrxiv.org · Tech
Tone: calm, positive · Debate: 20/100
Key topics
Transformers
Sequence Modeling
Natural Language Processing
A research paper proposes summation-based aggregation as a simpler alternative to self-attention in transformers, sparking discussion on its potential to achieve linear complexity in sequence modeling.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion · First comment: N/A · Peak period: 2 comments (0-1h) · Avg per period: 2
Key moments
- Story posted: Sep 24, 2025 at 7:38 AM EDT (3 months ago)
- First comment: Sep 24, 2025 at 7:38 AM EDT (0s after posting)
- Peak activity: 2 comments in 0-1h (hottest window of the conversation)
- Latest activity: Sep 24, 2025 at 7:41 AM EDT (3 months ago)
ID: 45358944 · Type: story · Last synced: 11/17/2025, 1:11:24 PM
On its own, summation is competitive for classification and multimodal tasks. In language modeling, a hybrid design — summation in most layers with a single final attention layer — matches or slightly outperforms full attention while staying nearly linear in cost.
GitHub: https://github.com/pfekin/summation-based-transformers
How is this different from Performer / linear attention? Performer and related methods approximate the softmax kernel with random features or low-rank projections. Summation is not an approximation: it removes the pairwise similarity computation entirely. Tokens are modulated by positional encodings, projected with nonlinearities, and aggregated by direct addition.
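A minimal sketch of what such a summation layer could look like in PyTorch (the module name, parameter shapes, and the residual/normalization choices are illustrative assumptions, not the repository's actual code): each token is scaled by a learned positional encoding, passed through a nonlinear projection, and aggregated by a cumulative sum so the layer stays causal, giving roughly O(n·d) cost instead of O(n²·d).

```python
import torch
import torch.nn as nn

class SummationBlock(nn.Module):
    """Illustrative summation-based aggregation (assumed design, not the repo's exact code).

    Tokens are modulated by a learned positional encoding, projected through a
    nonlinearity, and aggregated by addition instead of pairwise similarity.
    A cumulative sum keeps the layer causal for autoregressive use.
    """

    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(max_len, d_model) * 0.02)  # learned positional signal
        self.proj = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, causal: bool = True) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        n = x.size(1)
        h = self.proj(x * self.pos[:n])  # positional modulation, then nonlinear projection
        if causal:
            agg = h.cumsum(dim=1)        # running sum: token i aggregates tokens <= i
        else:
            agg = h.sum(dim=1, keepdim=True).expand_as(h)  # one global pooled context
        return x + self.out(agg)         # residual connection around the aggregated context
```

One could also normalize the running sum by the prefix length (a cumulative mean) to keep magnitudes stable at long lengths; whether the paper does so is not stated here.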
Does pure summation replace attention? In classification and multimodal regression, yes — summation alone is competitive and often faster. In autoregressive language modeling, pure summation underperforms. But a hybrid design (summation in most layers + a single final attention layer) matches or slightly beats full attention while keeping most of the network near-linear.
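The hybrid stack described above might be assembled roughly as follows (a sketch under the same assumptions: `SummationBlock` is the illustrative module from the previous snippet, and the layer count, widths, and use of PyTorch's standard `nn.MultiheadAttention` for the single final attention layer are assumptions):

```python
import torch
import torch.nn as nn

class HybridLM(nn.Module):
    """Sketch of the hybrid design: summation in all but the last layer,
    one self-attention layer at the top. Sizes are illustrative."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 6, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.summation_layers = nn.ModuleList(
            [SummationBlock(d_model) for _ in range(n_layers - 1)]
        )
        self.final_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer ids
        x = self.embed(tokens)
        for layer in self.summation_layers:
            x = layer(x, causal=True)  # near-linear layers
        n = x.size(1)
        causal_mask = torch.triu(
            torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.final_attn(x, x, x, attn_mask=causal_mask)  # single quadratic layer
        x = self.norm(x + attn_out)
        return self.lm_head(x)  # (batch, seq_len, vocab_size) logits
```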
What scale are the experiments? Small-to-moderate scale (document classification, WikiText-2, AG News, etc.). Scaling laws remain an open question — collaboration on larger-scale validation is very welcome.
Why might this work? Summation acts as a bottleneck: only task-relevant features survive aggregation, which seems to restructure embeddings before the final attention layer stabilizes them. PCA and dimensionality analyses show distinctive representation dynamics compared to attention.
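As one illustration of how such representation dynamics could be probed (an assumed analysis recipe, not the paper's exact procedure), an effective-dimensionality measure counts how many principal components are needed to explain most of the variance in a layer's hidden states; comparing this number layer by layer between a summation stack and an attention stack would make the bottleneck claim concrete.

```python
import numpy as np

def effective_dim(hidden: np.ndarray, var_threshold: float = 0.95) -> int:
    """Number of principal components explaining `var_threshold` of the variance.

    hidden: (num_tokens, d_model) hidden states collected from one layer.
    A smaller number suggests a tighter representational bottleneck.
    """
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Singular values of the centered data give the PCA variance spectrum.
    s = np.linalg.svd(centered, compute_uv=False)
    explained = s ** 2 / np.sum(s ** 2)
    return int(np.searchsorted(np.cumsum(explained), var_threshold) + 1)
```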