Building an Evolutionary Search for Attention Mechanisms
Key topics: Evolutionary Algorithms, Attention Mechanisms, Neural Networks
The post shares a GitHub repository implementing an evolutionary search for attention mechanisms, sparking interest in the potential applications of this technique.
The Question: Transformers use scaled dot-product attention because it was shown to be effective in the "Attention Is All You Need" paper. But was it actually optimal, or just the first thing that worked well enough? Most research tweaks hyperparameters. I wanted to explore the mechanism design space itself.

The Constraint: I have no computing budget. No lab. No institutional backing. Just free Colab and curiosity.
This meant:
- Small models only (~500K parameters)
- Fast training (5K steps per model)
- Limited search (120 evaluations total)
- WikiText-2 (small enough to iterate quickly)
The Approach: I encoded attention mechanisms as genes with 4 components:

```python
gene = AttentionGene(
    similarity='dot',           # How Q and K compute scores
    normalization='sparsemax',  # How scores become weights
    gating='output_gate',       # Optional gating mechanism
    temperature='learned'       # How to scale attention
)
```
This creates a discrete search space of 384+ possible mechanisms.
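To make the size of that space concrete, here's a rough sketch of how the combinations multiply. The option lists below are placeholders I built around values mentioned in the post (dot, softmax, sparsemax, output_gate, highway, learned temperature); the repo's actual lists are what produce the 384+ figure.

```python
from itertools import product

# Placeholder option lists -- the repo's real lists (and exact count) will differ.
SIMILARITIES   = ["dot", "additive", "cosine", "bilinear"]
NORMALIZATIONS = ["softmax", "sparsemax", "relu_norm", "entmax"]
GATINGS        = ["none", "output_gate", "highway", "input_gate"]
TEMPERATURES   = ["fixed", "learned", "per_head", "scheduled", "none", "sqrt_d"]

search_space = list(product(SIMILARITIES, NORMALIZATIONS, GATINGS, TEMPERATURES))
print(len(search_space))  # 4 * 4 * 4 * 6 = 384 combinations with these placeholder lists
```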
Then I ran a simple genetic algorithm:
- Initialize 12 random attention mechanisms
- Train each for 5K steps on WikiText-2
- Keep the top 3 (elitism)
- Generate 9 offspring via crossover + mutation
- Repeat for 10 generations

Each generation takes ~2 hours on free Colab. Total: ~20 GPU hours. A minimal sketch of this loop follows.
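For concreteness, this is roughly what that loop looks like in code. It's a sketch, not the repo's implementation; random_gene, train_and_eval (which would return WikiText-2 validation perplexity after training), crossover, and mutate are hypothetical stand-ins for the project's helpers.

```python
import random

def evolve(random_gene, train_and_eval, crossover, mutate,
           pop_size=12, n_elite=3, generations=10, steps=5000):
    """Sketch of the evolutionary loop described above (not the repo's code).
    The four callables are hypothetical stand-ins for the project's helpers."""
    population = [random_gene() for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness = validation perplexity after `steps` of training (lower is better).
        ranked = sorted(population, key=lambda gene: train_and_eval(gene, steps))
        parents = ranked[:n_elite]                        # elitism: keep the top 3
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - n_elite)]   # 9 offspring per generation
        population = parents + children
    # A real run would cache fitness instead of re-evaluating here.
    return min(population, key=lambda gene: train_and_eval(gene, steps))
```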
What Evolution Found: The best mechanism was dot + sparsemax + output_gate + learned_temperature.

Results:
- Evolved: 98.45 perplexity
- Baseline (dot + softmax): 102.90 perplexity
- Improvement: 4.3%

The interesting part isn't the 4% improvement. It's what evolution consistently chose:
Finding #1: Sparsemax > Softmax. Every top performer used sparsemax normalization instead of softmax. Sparsemax (from a 2016 paper) produces sparse attention: many weights become exactly zero. The ML community largely ignored it. Evolution rediscovered that it works.
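The post doesn't include the sparsemax code, but the 2016 formulation (Martins & Astudillo) is short: project the scores onto the probability simplex instead of exponentiating them. A minimal PyTorch version, assuming no masking, looks roughly like this:

```python
import torch

def sparsemax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sparsemax (Martins & Astudillo, 2016): a softmax alternative that projects
    onto the probability simplex, so many attention weights become exactly zero."""
    z, _ = torch.sort(scores, dim=dim, descending=True)
    cumsum = z.cumsum(dim)
    k = torch.arange(1, scores.size(dim) + 1, device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    k = k.view(shape)                                       # broadcast 1..n along `dim`
    support = (1 + k * z) > cumsum                          # sorted entries that stay nonzero
    k_z = support.sum(dim=dim, keepdim=True).to(scores.dtype)
    tau = (cumsum.gather(dim, k_z.long() - 1) - 1) / k_z    # threshold to subtract
    return torch.clamp(scores - tau, min=0.0)

scores = torch.tensor([[2.0, 1.0, 0.1, -1.0]])
print(torch.softmax(scores, dim=-1))   # every weight > 0
print(sparsemax(scores))               # tensor([[1., 0., 0., 0.]]) -- sparse
```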
Finding #2: Output Gating Is Universal. Every top mechanism used output gating:

```python
output = attention_result
gate = sigmoid(linear(input))
output = output * gate
```

This wasn't in the original Transformer. Evolution found it's critical.
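Fleshed out a little, output gating can be written as a small module. This is my illustration of the pseudocode above, with names and shapes I chose; the repo's placement may differ.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Illustrative output gate: the attention result is scaled elementwise by a
    sigmoid gate computed from the layer input (names are mine, not the repo's)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, attn_out: torch.Tensor, layer_input: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_proj(layer_input))  # values in (0, 1), per dimension
        return attn_out * gate
```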
Finding #3: Highway Gating Always Fails. Highway connections (borrowed from Highway Networks) were the worst performers across all generations, with an average perplexity of 115.8. This surprised me: highway connections work elsewhere. But for attention, they consistently failed.
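The post doesn't show the highway variant; presumably it follows the standard Highway Networks form, mixing the attention output with the layer input through the same kind of sigmoid gate. Something like:

```python
import torch
import torch.nn as nn

def highway_gated_output(attn_out: torch.Tensor,
                         layer_input: torch.Tensor,
                         gate_proj: nn.Linear) -> torch.Tensor:
    """Standard Highway-Networks-style mix: g * H(x) + (1 - g) * x.
    My reconstruction of the 'highway' gating variant, not code from the repo."""
    gate = torch.sigmoid(gate_proj(layer_input))
    return gate * attn_out + (1.0 - gate) * layer_input
```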
Finding #4: Dot-Product Is Actually Good. The winner uses standard dot-product similarity, not some exotic function. The improvement comes from normalization + gating, not from replacing the core similarity function. This makes the result more practical: dot-product is fast.
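Putting the pieces together, my reading of the winning recipe for a single head would look roughly like this (reusing the sparsemax sketch above; the class and parameter names are mine, not the repo's):

```python
import math
import torch
import torch.nn as nn

class EvolvedAttentionHead(nn.Module):
    """Illustrative single head combining dot similarity, a learned temperature,
    sparsemax normalization, and an output gate (my reconstruction, not the repo's)."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        self.gate = nn.Linear(d_model, d_head)
        self.log_temp = nn.Parameter(torch.zeros(1))              # learned temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # dot-product similarity
        weights = sparsemax(scores * torch.exp(self.log_temp))    # sparse, temperature-scaled
        return (weights @ v) * torch.sigmoid(self.gate(x))        # output gate
```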
The Honest Part: This is a proof of concept, not production-ready.

Not tested:
- Large models (100M+ params)
- Other datasets
- Other domains (vision, audio)
- Production deployment

Known issues:
- Training variance is ±1 perplexity
- Only 93 mechanisms evaluated (~24% of the search space)
- Single run per mechanism (no statistical tests)
- Baseline wasn't hyperparameter-tuned

With enough evolutionary steps, you can probably find "good" hyperparameters for any mechanism. I don't know whether I discovered better mechanisms or just better hyperparameters.

What I Learned:

1. Evolutionary Search is Viable at Small Scale. You don't need massive compute to explore architecture spaces. 20 GPU hours found something interesting.
That's 0.8 points of noise. My "4% improvement" has ~1 point of uncertainty baked in. Proper validation requires multiple runs. I didn't do this (compute constraints).

Search Space Design is Everything. I spent more time designing the search space than writing the evolution code. What components to include? What ranges? What's too complex? Bad search space = wasted compute.