Understanding Transformers Using a Minimal Example
Posted 4 months ago · Active 4 months ago
rti.github.io · Tech story
Tone: calm / mixed · Debate: 40/100
Key topics
Transformers
Large Language Models
AI Explainability
Machine Learning
The post shares an interactive visualization to understand transformers, sparking discussion on its effectiveness and comparisons with other explanations.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 4h after posting
Peak period: 5 comments (18-24h)
Avg / period: 2.8
Comment distribution: 25 data points
Based on 25 loaded comments
Key moments
1. Story posted: Sep 3, 2025 at 11:30 AM EDT (4 months ago)
2. First comment: Sep 3, 2025 at 3:48 PM EDT (4h after posting)
3. Peak activity: 5 comments in the 18-24h window, the hottest stretch of the conversation
4. Latest activity: Sep 6, 2025 at 11:55 AM EDT (4 months ago)
ID: 45116957 · Type: story · Last synced: 11/20/2025, 2:35:11 PM
> How can AI ID a cat?
https://news.ycombinator.com/item?id=44964800
For reference, my initial understanding was somewhat low: basically I know a) what embedding is, b) that transformers work by matrix multiplication, and c) that it's something like a multi-threaded Markov chain generator with the benefit of pre-trained embeddings.
https://youtu.be/wjZofJX0v4M
[0] https://huggingface.co/blog/vtabbott/mixtral
Instead, I believe this might work better as a guided exercise that a person can work through over a few hours rather than being spoon-fed in the 10-minute reading time, or with the steps broken up into "interactive" sections that more clearly demarcate the stages.
Regardless, I'm very supportive of people making efforts to simplify this topic; each attempt gives me something I had either forgotten or neglected.
If I'm interpreting it correctly, it sort of validates my intuition that attention heads are "multi-threaded Markov chain models"; in other words, where autocomplete looks only at level 1, a transformer looks at level 1 for every word in the input, plus many layers deeper for every word (or token) in the input, while bringing a huge pre-training dataset to bear.
If that's more or less correct, something that surprises me is how attention is often treated as some kind of "breakthrough". It seems obvious to me that improving a Markov chain recommendation would involve going deeper and dimensionalizing the context in a richer way; the technique appears the same, just with more analysis. I'm not sure what I'm missing here. Perhaps adding those extra layers was a hard problem that we hadn't figured out how to do efficiently yet(?)
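To make that contrast concrete, here is a toy sketch (made-up counts, random untrained weights, a single query; purely illustrative) of the difference: a bigram Markov model conditions only on the previous token, while a single attention step forms a weighted mix over every position.

    # Toy contrast (made-up numbers): a bigram Markov model vs. one attention step.
    import numpy as np

    tokens = ["the", "cat", "sat", "on", "the"]

    # Bigram Markov model: the next-token distribution depends ONLY on the last token.
    bigram_counts = {("the", "cat"): 3, ("the", "mat"): 2, ("the", "dog"): 1}
    last = tokens[-1]
    options = {nxt: c for (prev, nxt), c in bigram_counts.items() if prev == last}
    total = sum(options.values())
    print("Markov next-token probs:", {nxt: c / total for nxt, c in options.items()})

    # One attention step: the prediction at the last position mixes value vectors
    # from EVERY position, weighted by learned query/key dot products.
    rng = np.random.default_rng(0)
    d = 4                                    # toy embedding size
    X = rng.normal(size=(len(tokens), d))    # pretend token embeddings
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    q = X[-1] @ Wq                           # query for the last position only
    K, V = X @ Wk, X @ Wv                    # keys/values for all positions
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over all positions
    context = weights @ V                    # weighted sum of every token's value
    print("Attention weights over all positions:", np.round(weights, 3))
    print("Mixed value vector for the last position:", np.round(context, 3))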
Resources I’ve liked:
Sebastian Raschka's book on building them from scratch
Deep Learning: A Visual Approach
These videos / playlists:
https://youtube.com/playlist?list=PLoROMvodv4rOY23Y0BoGoBGgQ... https://youtube.com/playlist?list=PLoROMvodv4rOwvldxftJTmoR3... https://youtube.com/playlist?list=PL7m7hLIqA0hoIUPhC26ASCVs_... https://www.youtube.com/live/uIsej_SIIQU?si=RHBetDNa7JXKjziD
here’s a basic impl that i trained on TinyStories to decent effect: https://gist.github.com/nikki93/f7eae83095f30374d7a3006fd5af... (i used Claude Code a lot to help with the above because it’s a new field for me) (i did this with C and MLX before but ultimately gave in to the python lol)
but overall it boils down to:
- tokenize the text
- embed tokens (map each to a vector) with a simple NN
- apply positional info so each token also encodes where it is
- do the attention. this bit is key and also very interesting to me. there are three small neural networks (linear projections): Q, K, V, applied to each token. you then generate a new sequence of embeddings where each position gets the Vs of all tokens added up, weighted by the dot product of that position’s Q with each other position’s K. the new embeddings are /added/ to the previous layer (adding like this is called a ‘residual’)
- also do another NN pass without attention (a position-wise feed-forward network), again adding the output (residual). there are actually multiple ‘heads’, each with its own Q, K, V – their outputs are combined before that second NN pass
there’s some normalization at each stage to keep the numbers in a reasonable range and stop them from blowing up
you repeat the attention + feed-forward blocks many times, then the last embedding in the final layer’s output is what you sample the next token from (a minimal sketch follows below)
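Concretely, a minimal numpy sketch of one such block might look like this (single head, toy sizes, untrained random weights; an illustration of the steps above, not the gist's actual code):

    # Minimal numpy sketch of one transformer block as described above
    # (single head, toy sizes, untrained random weights).
    import numpy as np

    rng = np.random.default_rng(0)
    T, d, d_ff = 5, 8, 32        # sequence length, embedding dim, FFN hidden dim

    def layer_norm(x, eps=1e-5):
        # the "keep the numbers reasonable" step, applied per position
        return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

    def softmax(x):
        e = np.exp(x - x.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)

    # tokenize + embed (pretend it already happened): T token vectors of size d
    tok_emb = rng.normal(size=(T, d))
    pos_emb = rng.normal(size=(T, d))        # positional info
    x = tok_emb + pos_emb

    # attention: Q, K, V projections applied to every token
    Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)            # every position vs. every position
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    attn_out = softmax(scores) @ V           # weighted sum of values
    x = layer_norm(x + attn_out)             # residual add + norm

    # feed-forward pass without attention, again with a residual
    W1, W2 = 0.1 * rng.normal(size=(d, d_ff)), 0.1 * rng.normal(size=(d_ff, d))
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)   # ReLU MLP

    # the last position's vector is what you'd project to vocab logits and sample from
    print("final embedding for the last token:", np.round(x[-1], 3))

A real model stacks many of these blocks, runs several heads in parallel (their outputs are usually concatenated and projected rather than simply summed), and projects the final embedding onto the vocabulary to get next-token logits.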
i was surprised by how quickly this just starts to generate coherent grammar etc. having the training loop also do a generation step to show example output at each stage of training was helpful to see how the output qualitatively improves over time, and it’s kind of cool to “watch” it learn.
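That "generate a sample every N training steps" pattern can be sketched roughly like this (TinyLM and sample_text here are hypothetical stand-ins, not the gist's actual code):

    # Sketch of the "print a sample every so often during training" pattern.
    # TinyLM and sample_text are hypothetical stand-ins, not the gist's code.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]

    class TinyLM:
        """Stub model: returns random logits; a real model would be trained."""
        def logits(self, context_ids):
            return rng.normal(size=len(vocab))

    def sample_text(model, max_len=8, temperature=1.0):
        ids = [1]                            # seed the context with "the"
        for _ in range(max_len):
            z = model.logits(ids) / temperature
            p = np.exp(z - z.max()); p /= p.sum()
            nxt = int(rng.choice(len(vocab), p=p))
            if nxt == 0:                     # stop at <eos>
                break
            ids.append(nxt)
        return " ".join(vocab[i] for i in ids)

    model = TinyLM()
    for step in range(1, 301):
        # ... one optimizer step on a training batch would go here ...
        if step % 100 == 0:                  # periodically show a qualitative sample
            print(f"step {step}: {sample_text(model)}")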
this doesn’t cover MoE, sparse vs dense attention and also the whole thing about RL on top of these (whether for human feedback or for doing “search with backtracking and sparse reward”) – i haven’t coded those up yet just kinda read about them…
now the thing is – this is a setup for it to learn some processes, spread among the weights, that do what it does – but what those processes actually are still seems largely unknown. "mechanistic interpretability" is the space that's meant to work on that; i've been looking into it lately.
I think it does not scale well beyond 5 boxes (20 numbers) because the stacks become too complex to remember and identify patterns in. That's just me, though – it could be quite individual.
Also, the Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
Also, this HN comment has numerous resources: https://news.ycombinator.com/item?id=35712334
Also, here's an interactive "transformer explainer" that is absolutely mind-blowing [2].
[1] https://www.youtube.com/watch?v=0VLAoVGf_74
[2] https://poloclub.github.io/transformer-explainer/