Summation-Based Aggregation: A Simpler Alternative to Self-Attention
Posted 3 months ago · Active 3 months ago
Source: techrxiv.org · Tech
Tone: calm, positive · Debate: 20/100
Key topics
Transformers
Sequence Modeling
Natural Language Processing
A research paper proposes summation-based aggregation as a simpler alternative to self-attention in transformers, sparking discussion on its potential to achieve linear complexity in sequence modeling.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion · First comment: N/A · Peak period: 2 comments (0-1h) · Avg per period: 2
Key moments
- Story posted: Sep 24, 2025 at 7:38 AM EDT (3 months ago)
- First comment: Sep 24, 2025 at 7:38 AM EDT (0s after posting)
- Peak activity: 2 comments in 0-1h (hottest window of the conversation)
- Latest activity: Sep 24, 2025 at 7:41 AM EDT (3 months ago)
ID: 45358944 · Type: story · Last synced: 11/17/2025, 1:11:24 PM
On its own, summation is competitive for classification and multimodal tasks. In language modeling, a hybrid design — summation in most layers with a single final attention layer — matches or slightly outperforms full attention while staying nearly linear in cost.
GitHub: https://github.com/pfekin/summation-based-transformers
How is this different from Performer / linear attention? Performer and related methods approximate the softmax kernel with random features or low-rank projections. Summation is not an approximation: it removes the pairwise similarity computation entirely. Tokens are modulated by positional encodings, projected with nonlinearities, and aggregated by direct addition.
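A minimal sketch of what such a summation layer could look like in PyTorch (the module name, parameter shapes, and the residual/normalization choices are illustrative assumptions, not the repository's actual code): each token is scaled by a learned positional encoding, passed through a nonlinear projection, and aggregated by a cumulative sum so the layer stays causal, giving roughly O(n·d) cost instead of O(n²·d).

```python
import torch
import torch.nn as nn

class SummationBlock(nn.Module):
    """Illustrative summation-based aggregation (assumed design, not the repo's exact code).

    Tokens are modulated by a learned positional encoding, projected through a
    nonlinearity, and aggregated by addition instead of pairwise similarity.
    A cumulative sum keeps the layer causal for autoregressive use.
    """

    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(max_len, d_model) * 0.02)  # learned positional signal
        self.proj = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, causal: bool = True) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        n = x.size(1)
        h = self.proj(x * self.pos[:n])  # positional modulation, then nonlinear projection
        if causal:
            agg = h.cumsum(dim=1)        # running sum: token i aggregates tokens <= i
        else:
            agg = h.sum(dim=1, keepdim=True).expand_as(h)  # one global pooled context
        return x + self.out(agg)         # residual connection around the aggregated context
```

One could also normalize the running sum by the prefix length (a cumulative mean) to keep magnitudes stable at long lengths; whether the paper does so is not stated here.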
Does pure summation replace attention? In classification and multimodal regression, yes — summation alone is competitive and often faster. In autoregressive language modeling, pure summation underperforms. But a hybrid design (summation in most layers + a single final attention layer) matches or slightly beats full attention while keeping most of the network near-linear.
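The hybrid stack described above might be assembled roughly as follows (a sketch under the same assumptions: `SummationBlock` is the illustrative module from the previous snippet, and the layer count, widths, and use of PyTorch's standard `nn.MultiheadAttention` for the single final attention layer are assumptions):

```python
import torch
import torch.nn as nn

class HybridLM(nn.Module):
    """Sketch of the hybrid design: summation in all but the last layer,
    one self-attention layer at the top. Sizes are illustrative."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 6, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.summation_layers = nn.ModuleList(
            [SummationBlock(d_model) for _ in range(n_layers - 1)]
        )
        self.final_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer ids
        x = self.embed(tokens)
        for layer in self.summation_layers:
            x = layer(x, causal=True)  # near-linear layers
        n = x.size(1)
        causal_mask = torch.triu(
            torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.final_attn(x, x, x, attn_mask=causal_mask)  # single quadratic layer
        x = self.norm(x + attn_out)
        return self.lm_head(x)  # (batch, seq_len, vocab_size) logits
```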
What scale are the experiments? Small-to-moderate scale (document classification, WikiText-2, AG News, etc.). Scaling laws remain an open question — collaboration on larger-scale validation is very welcome.
Why might this work? Summation acts as a bottleneck: only task-relevant features survive aggregation, which seems to restructure embeddings before the final attention layer stabilizes them. PCA and dimensionality analyses show distinctive representation dynamics compared to attention.
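As one illustration of how such representation dynamics could be probed (an assumed analysis recipe, not the paper's exact procedure), an effective-dimensionality measure counts how many principal components are needed to explain most of the variance in a layer's hidden states; comparing this number layer by layer between a summation stack and an attention stack would make the bottleneck claim concrete.

```python
import numpy as np

def effective_dim(hidden: np.ndarray, var_threshold: float = 0.95) -> int:
    """Number of principal components explaining `var_threshold` of the variance.

    hidden: (num_tokens, d_model) hidden states collected from one layer.
    A smaller number suggests a tighter representational bottleneck.
    """
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Singular values of the centered data give the PCA variance spectrum.
    s = np.linalg.svd(centered, compute_uv=False)
    explained = s ** 2 / np.sum(s ** 2)
    return int(np.searchsorted(np.cumsum(explained), var_threshold) + 1)
```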