Markov Chains Are the Original Language Models
Posted 4 months ago · Active 3 months ago
Source: elijahpotter.dev · Tech story · High profile
Tone: calm, positive · Debate intensity: 60/100
Key topics
Markov Chains
Language Models
NLP
AI
Machine Learning
The article discusses how Markov chains are the original language models and shares personal anecdotes and historical context, sparking a thoughtful discussion on the topic.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 4 days after posting. Peak period: 142 comments (Day 5). Average per period: 20.
[Comment distribution chart: 160 data points, based on 160 loaded comments]
Key moments
1. Story posted: Sep 19, 2025 at 2:42 PM EDT (4 months ago)
2. First comment: Sep 23, 2025 at 2:01 PM EDT (4 days after posting)
3. Peak activity: 142 comments in Day 5, the hottest window of the conversation
4. Latest activity: Sep 30, 2025 at 5:02 AM EDT (3 months ago)
ID: 45304980 · Type: story · Last synced: 11/20/2025, 8:32:40 PM
This was one of my first forays into modifying C code, trying to figure out why 350 MB seemed to be the biggest brain size (32-bit memory limits, plus the requirement of a contiguous block for the entire brain).
I miss the innocence of those days. Just being a teen, tinkering with things I didn't understand.
For example, I'm currently relearning various ImageMagick details, and thanks to their explanations I now understand things that I cut/copy/pasted a long time ago without always knowing why they worked the way they did.
[1] https://en.wikipedia.org/wiki/Loebner_Prize
https://github.com/Heatwave/the-practice-of-programming/blob...
https://en.wikipedia.org/wiki/Mark_V._Shaney
0: https://www.genudi.com/about
http://c9x.me/cbits/
Well, the responses were pretty messed up and not accurate, but we all got a good chuckle watching the bot occasionally sound like the person amidst a mass of random other things that person always said, jumbled together :D
I have played with Markov chains a lot. I tried having skip states and such, but ultimately you're always pushed toward doing something similar to the attention mechanism to handle context.
In the end you're inputting millions of ways there could be a pattern, passing all of those into a neural network, and weighting the chains that make correct predictions more heavily.
You start to realize that, even with all these additions, past context could still influence the current prediction in ways you haven't enumerated, and that what you really want is a generator for all the ways there could be a pattern. At this point you're getting into the realm of multilayer neural networks and starting to consider the attention mechanism.
I don't want to discourage anyone from learning Markov chains here, btw. It's just that they have limitations, and those limitations make a great learning journey toward neural networks: you realize you need more than a single absolute state being activated at a time, and then you start to think about how all the states activated in the past might influence the current probabilities (essentially, you start thinking about the problem the attention mechanism solves).
Give a class of students an image with horizontal lines, where every second line is a solid color and the others are random static. See how their left-to-right Markov chains do here (they should make ~50% correct predictions).
Then rotate the image 90 degrees. Have the class observe that a left-to-right Markov chain gets ~0% when predicting this (every second pixel being random will do that). What to do? Maybe input both directions and weight toward the best one with a perceptron? Hey, first step toward learning a neural network!
From there you can iterate more and more until you no longer really have Markov chains but instead neural networks with a type of attention mechanism.
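A minimal sketch of that classroom exercise (my own toy setup, not the commenter's: a 256-colour palette and a first-order left-to-right chain, trained and scored on the same image):

    import random
    from collections import Counter, defaultdict

    random.seed(42)
    H = W = 64
    PALETTE = 256  # colour values 0..255

    # Even rows: one solid random colour each. Odd rows: per-pixel static.
    img = [[random.randrange(PALETTE)] * W if r % 2 == 0
           else [random.randrange(PALETTE) for _ in range(W)]
           for r in range(H)]

    def accuracy(rows):
        """Train a first-order left-to-right chain, then score next-pixel prediction."""
        counts = defaultdict(Counter)
        for row in rows:
            for a, b in zip(row, row[1:]):
                counts[a][b] += 1
        hits = total = 0
        for row in rows:
            for a, b in zip(row, row[1:]):
                hits += counts[a].most_common(1)[0][0] == b
                total += 1
        return hits / total

    print("horizontal stripes:", accuracy(img))   # ~0.5: solid rows are easy, static rows aren't
    rotated = [list(col) for col in zip(*img)]    # rotate 90 degrees
    print("rotated:", accuracy(rotated))          # near 0: every second pixel is static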
The essential property of a Markov chain is maintaining the Markov property:
P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n)
That is the future, given the present state, is conditionally independent of the past states.
It's worth recognizing that decoder-only LLMs (which include most of the major LLMs people use) maintain the Markov property and can properly be understood as Markov chains.
> The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.
Attention has nothing to do with maintaining the Markov property, but it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.
tl;dr most of the LLMs people use are effectively Markov Chains.
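To make "the state is the context window" concrete, a toy sketch (toy_lm below is a stand-in for a model's softmax output, not any real API): each generation step is a transition that depends only on the present state.

    import random

    def step(state, lm_probs, window):
        """One Markov transition: sample from P(next | state), then slide the window."""
        probs = lm_probs(state)                 # depends only on the present state
        tokens = list(probs)
        nxt = random.choices(tokens, weights=[probs[t] for t in tokens])[0]
        return (state + (nxt,))[-window:]       # the new state is still a bounded window

    # Toy stand-in for a decoder-only model's next-token distribution.
    def toy_lm(context):
        return {"a": 0.5, "b": 0.5} if context[-1:] == ("a",) else {"a": 0.9, "b": 0.1}

    state = ("a",)
    for _ in range(8):
        state = step(state, toy_lm, window=4)
    print(state)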
Chat interfaces are like taking a simple Markov generator, but with a rule where you say 'whenever it reaches a state ending in X, hand over decision making about state transitions to a human instead of a random number generator, until they move it to state ending in Y'.
Tool calling is similar - 'when it reaches state ending in X, send the state data to a tool; use the result to drive a series of state transitions; then start generating again'.
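Rendered as a toy loop (the marker tokens and stand-in callables here are made up for illustration):

    END_TURN, SEND = "<end_turn>", "<send>"   # made-up marker tokens

    def run(generate_token, next_human_token, state=(), turns=2):
        """Alternate control of the state transitions between generator and human."""
        for _ in range(turns):
            while not state or state[-1] != END_TURN:   # generator drives
                state += (generate_token(state),)
            tok = next_human_token()
            while tok != SEND:                          # human drives
                state += (tok,)
                tok = next_human_token()
        return state

    # Toy stand-ins: the "model" says hi and yields; the "human" replies "ok" and sends.
    model = lambda s: "hi" if (not s or s[-1] != "hi") else END_TURN
    human = iter(["ok", SEND, "ok", SEND])
    print(run(model, lambda: next(human)))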
The validation of what I'm saying is little more than a few minutes of googling away.
Some interventions can just be warnings, rather than lectures.
As far as I understand it, as you have a back-and-forth conversation with an LLM, you have to provide the entire history of the conversation plus your new response each time.
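That's the pattern: the client keeps the transcript and re-sends all of it on every turn. A sketch, where complete() is a placeholder for whatever chat-completion call you're using:

    def chat(complete, turns):
        """Accumulate the transcript; every request carries the FULL history."""
        messages = []
        for user_text in turns:
            messages.append({"role": "user", "content": user_text})
            reply = complete(messages)          # entire history + new message goes out
            messages.append({"role": "assistant", "content": reply})
        return messages

    # Toy stand-in that just reports how much history it was sent.
    echo = lambda msgs: f"(saw {len(msgs)} messages)"
    for m in chat(echo, ["hello", "and again"]):
        print(m["role"], ":", m["content"])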
This is the limitation I was getting at, btw. In the example I described, if you have an image with solid vertical columns, followed by columns of random static, followed again by solid vertical colors, a Markov chain could eventually learn all the patterns that go
solid -> 32 random bits -> different solid color
and it would eventually start predicting the second color correctly based on the solid color before the randomness. It 'just' needs a state for every possible random run in between. This is ridiculous in practice, however, since you'd need to learn 2^32 states for the relationship between those two solid colors alone.
You can use skipgrams - prefixes with holes in them.
Sparse Non-negative Matrix Language Model [1] uses them with great success.
[1] https://aclanthology.org/Q16-1024/
Pure n-gram language models would have a hard time computing escape weights for such contexts, but the mixture of probabilities used in SNMLM does not need to do that.
If I may, I've implemented an online per-byte version of SNMLM [2], which allows the use of skipgrams. They make performance worse, but they can be used. My implementation's predictive performance is within a few percent of an LSTM's on enwik8.
[2] https://github.com/thesz/snmlm-per-byte
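For readers unfamiliar with the term: a skipgram context is a normal n-gram prefix with some positions replaced by holes. A small illustration of just the context extraction (this is not SNMLM itself):

    from itertools import combinations

    HOLE = "_"

    def skipgram_contexts(tokens, order=2):
        """Yield (context, next_token) pairs where the context is the preceding
        `order` tokens with every possible subset of positions knocked out."""
        for i in range(order, len(tokens)):
            prefix = tokens[i - order:i]
            for k in range(order + 1):
                for holes in combinations(range(order), k):
                    ctx = tuple(HOLE if j in holes else prefix[j] for j in range(order))
                    yield ctx, tokens[i]

    for ctx, nxt in skipgram_contexts("the cat sat on the mat".split()):
        print(ctx, "->", nxt)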
What I don't understand is how the line is drawn between "past" and "present". Why are variables a,b,c viewed as "present" and not "past"? How is past defined? Why can't you view token generation as f(past1, past2, past3, present)?
As I said, it does look "Markovian" if the number of "pasts" is fixed, but again, what exactly is not Markovian then? Infinite inputs? Everything in the real world is finite.
I guess it would be more accurate to say that at its core a Markovian process is based around a pure function. Usually there's some kind of feedback loop around the function that allows the process to evolve in some way and continue producing output.
>I’m pretty sure anything can be modeled using pure functions.
We're not talking about theoretical models of black boxes, though, we're talking about known mechanisms. A recursive function doesn't become iterative just because it could be thought of as iterative (because eventually the CPU does loop over some sequence of instructions). We still say it's recursive.
>I’m not sure what the usefulness is
I suppose part of the usefulness is to de-mystify LLMs. If you can understand what Markov chains do, the thought "oh, an LLM is just a much more complex Markov chain" can help in reasoning about them.
Uh... You're moving the goalposts. I don't know the exact definition of "LLM", but let's suppose that it and the definition of "Markov chain" make us conclude that LLMs are necessarily Markov chains. "It's a Markov chain" is still a useful statement, because there are processes that are neither LLMs nor Markov chains.
I'm not sure what those would be. All dynamical systems (essentially all of classical physics), except maybe quantum mechanics, are Markov chains.
So, people can correctly call LLMs Markovian in practice, and also non-Markovian from a theoretical standpoint.
I think of it as conceptually a little like the distinction between a formal Turing machine, which requires an infinite tape, and a practical computer with a finite amount of memory. Your computer acts as a Turing machine for the real computations you use it for, but there exist some computations that would require more memory than you have. From a theoretical standpoint, your computer is merely a finite state automaton.
That's what n-grams are; even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3-to-7-grams.
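For anyone who hasn't built one: a word-level trigram chain fits in a few lines (the corpus here is a toy, purely for illustration):

    import random
    from collections import defaultdict

    def train(tokens, n=3):
        """Map each (n-1)-token context to the tokens observed after it."""
        model = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            model[tuple(tokens[i:i + n - 1])].append(tokens[i + n - 1])
        return model

    def generate(model, seed, length=10):
        out, order = list(seed), len(seed)
        for _ in range(length):
            followers = model.get(tuple(out[-order:]))
            if not followers:
                break                  # unseen context: a dead end without smoothing
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = "the quick brown fox jumps over the lazy dog and the quick red fox".split()
    print(generate(train(corpus), ("the", "quick")))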
> Can you give an example of a non-Markov system?
Encoder-decoder LLMs violate the Markov Property and would not count as Markov Chains.
Essentially yes. Given complex enough state, more or less any process is Markovian. Some do require infinite state, which would be maybe stretching the definition a bit. In practice Markov formulation may not be a very good analysis perspective if the required state would be very large/complex.
Markov models are useful for understanding the limitations of any specific next-token predictor with a fixed context window. But not for the limitations of such predictors in general, as there is always a Markov model large enough to handle any specific input.
...when they are used without RAG and tools. x_n belongs to a set of cardinality 65536^100000 or so. Equating an LLM to a Markov chain doesn't let you make any non-trivial predictions about its behavior.
The key insight you raise about attention is particularly important: it doesn't violate the Markov property, it just enables a vastly more sophisticated state representation. Classical n-gram Markov chains use simple discrete states, while transformers use high-dimensional continuous representations that can encode exponentially more information about context.
This perspective helps bridge the conceptual gap many people have. When they think "Markov chain," they often picture simple state transitions, but mathematically, LLMs are just Markov chains with astronomically complex states. The attention mechanism is the computational trick that makes these complex states tractable - it's not changing the fundamental probabilistic structure.
Right - but this feels like being half-way between transformers and lookup tables; the latter have enormous expressive power too, as long as you're willing to handle even more state. I'm curious what else could be put on that spectrum, and where. Maybe there's something weaker than transformers, but good enough for practical applications, while being more resource-efficient?
Related thought: understanding and compression seem to be fundamentally the same thing.
To quote a friend of mine: If I can't just trust the status quo I would have to question everything????
#ItsSoMetaEvenThisAcronym
Both are fundamentally prediction. :)
Basically, for most forms of complexity, Markov Chains are a strictly bad model. Perhaps if there was a way to "virtualize" the multiplication of states, you could do something reasonable. Idle speculation though.
[1] https://news.ycombinator.com/item?id=45354488
I wish! Gboard is terrible at predicting the correct form of a word.
It generally knows the best word, but it doesn't get from context that (e.g.) a gerund form of the verb is the most appropriate next word.
Where the state space would be proportional to the token length squared, just like the attention mechanisms we use today?
E.g., imagine input of red, followed by 32 bits of randomness, followed by blue, forever. A Markov chain could learn that red leads to blue 32 bits later. It would just need to learn 2^32 states.
A few more leaps and we should eventually get models small enough to get close to information theoretic lower bounds of compression.
As long as you don't care about the quality of what they're expressing. There's a reason they never did anything better than the postmodernism generator.
Putting paint in a cannon has enormous expressive power too, but if you aren't Rothko, nobody's going to care.
What's fascinating is that transformers with attention don't actually escape the Markov property - they're mathematically equivalent to very high-order Markov chains where the entire context window forms the "state." The breakthrough wasn't abandoning Markov chains, but finding a parameterized way to approximate these massive state spaces through learned representations rather than explicit enumeration.
Your observation about inevitably trending toward attention-like mechanisms is spot-on. The attention mechanism essentially provides a tractable approximation to the astronomically large transition matrices that would be required for a classical Markov chain to capture long-range dependencies. It's a more elegant solution to the same fundamental problem you were solving with skip states.
However, I'd argue that the distinction between classical n-gram Markov chains and modern transformers isn't as binary as it might appear. When we consider that transformers with context windows are mathematically equivalent to very high-order Markov chains (where the "state" encompasses the entire context), we see that the breakthrough wasn't abandoning the Markov property, but rather expanding the state representation exponentially.
Your observation about inevitably trending toward attention-like mechanisms is particularly insightful. The exponential state explosion you encountered (needing 2^32 states for your example) is precisely why parameterized approaches like transformers became necessary - they provide a tractable way to approximate these massive state spaces through learned representations rather than explicit enumeration.
The key innovation wasn't escaping Markov chains, but rather finding an efficient way to represent and compute over astronomically large state spaces that would be intractable for classical approaches.
What it is, and what I assume you mean, is a next-word prediction model based solely on the previous sequence of words, up to some limit. It literally is that, because it was designed to be that.
Like, what is one insight beyond that LLMs are Markov chains that you've derived from thinking of LLMs as Markov chains? I'm genuinely very curious.
Around 2009, I had developed an algorithm for building the Burrows–Wheeler transform at (what was back then) very large scale. If you have the BWT of a text corpus, you can use it to simulate a Markov model with any context length. I tried that with a Wikipedia dump, which was amusing for a while but not interesting enough to develop further.
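The BWT (via an FM-index) is what makes the context lookups cheap at scale; a naive sketch of the same behaviour, assuming my reading of the approach is right, is longest-suffix matching with a linear scan:

    import random

    def next_token(corpus, context, max_order=8):
        """Back off from the longest matching suffix of `context` to shorter ones,
        then sample among the tokens that followed that suffix in the corpus.
        (A BWT/FM-index answers these occurrence queries quickly; the linear
        scan here is only for illustration.)"""
        for k in range(min(max_order, len(context)), 0, -1):
            suffix = context[-k:]
            followers = [corpus[i + k] for i in range(len(corpus) - k)
                         if corpus[i:i + k] == suffix]
            if followers:
                return random.choice(followers)
        return random.choice(corpus)

    corpus = "to be or not to be that is the question".split()
    ctx = ["not", "to"]
    for _ in range(5):
        ctx.append(next_token(corpus, ctx))
    print(" ".join(ctx))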
Then, around 2019, I was working in genomics. We were using pangenomes based on thousands of (human) haplotypes as reference genomes. The problem was that adding more haplotypes also added rare variants and rare combinations of variants, which could be misleading and eventually started decreasing accuracy in the tasks we were interested in. The standard practice was dropping variants that were too rare (e.g. <1%) in the population. I got better results with synthetic haplotypes generated by downsampling the true haplotypes with a Markov model (using the BWT-based approach). The distribution of local haplotypes within each context window was similar to the full set of haplotypes, but the noise from rare combinations of variants was mostly gone.
Other people were doing haplotype inference with Markov models based on similarly large sets of haplotypes. If you knew, for a suitably large subset of variants, whether each variant was likely absent, heterozygous, or homozygous in the sequenced genome, you could use the model to get a good approximation of the genome.
When ChatGPT appeared, the application was surprising (even though I knew some people who had been experimenting with GPT-2 and GPT-3). But it was less surprising on a technical level, as it was close enough to what I had intuitively considered possible with large Markov models.
I frequently see people underestimate the danger of LLMs to humanity in the long term and to people’s jobs in the medium term because they follow this chain of reasoning:
- An LLM is basically a fancy Markov chain (highly dubious, even if there's a tiny kernel of truth in there)
- Therefore Markov chains and LLMs have a similar skill ceiling (definitely wrong)
- My job could never be done by a Markov chain, it's much too complex (this is presented as self-evident; no one ever feels the need to back it up)
- Therefore, since LLMs are basically Markov chains, and a Markov chain could never do my job, I have nothing to worry about.
I’ll grant that you’re not necessarily responsible for these people using your model in a way you didn’t intend. But I do think at this point it’s time to start pushing people to stop trying to reason mechanically about how these things work.
2. Because they are Markov chains, their skill ceiling is bounded by whatever the skill ceiling of Markov chains happens to be. What LLMs demonstrate is that the skill ceiling of Markov chains is higher than previously understood.
3. If the skill ceiling of Markov chains is high enough, then one could take over some or all of someone's job.
N-gram models are not useful in many of the ways LLMs are.
N-gram models are very limited.
On the other hand, basically any process can be considered a Markov process if you make the state include enough.
So, calling both the very limited n-gram models and the nearly unlimited Markov processes by the same name, "Markov chain", is just super confusing.
My experience with people making claims about the inherent danger has been along the lines of this later quote:
> this is presented self-evidently, no one ever feels the need to back this up
softmax(QK^T / sqrt(d_k)) V
Q = W_Q X, K = W_K X, V = W_V X
where X is a matrix in which every column is the embedding of a token.
K and V matrices need to be cached.
Also, you need to calculate all of it, for every layer, even if you just want a single token of output. Tokens after the first can reuse the cached K and V matrices and don't have to recalculate everything from scratch. Causal attention is the concept of masking the calculation so that each position only looks at past tokens.
My critique boils down to this: Transformers are too simple to simplify them. You can only lose information if you do that.
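For concreteness, the formula above spelled out in numpy with the same column convention (a sketch: single head, no KV cache; the transpose lands differently because tokens are columns here, and the mask is the "causal" part):

    import numpy as np

    def causal_attention(X, W_Q, W_K, W_V):
        """Causally masked softmax(Q^T K / sqrt(d_k)); columns of X are token embeddings."""
        Q, K, V = W_Q @ X, W_K @ X, W_V @ X                  # project each token's embedding
        d_k = Q.shape[0]
        scores = Q.T @ K / np.sqrt(d_k)                      # scores[i, j] = q_i . k_j
        scores += np.triu(np.full_like(scores, -1e9), k=1)   # hide future tokens (j > i)
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)        # softmax over past tokens
        return V @ weights.T                                 # column i mixes values of tokens <= i

    rng = np.random.default_rng(0)
    d, n = 4, 3                                              # embedding dim, token count
    X = rng.normal(size=(d, n))
    W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
    print(causal_attention(X, W_Q, W_K, W_V).shape)          # (4, 3): one column per token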
And Veritasium's video "The Strange Math That Predicts (Almost) Anything" talks in detail about the history of Markov chains: https://youtu.be/KZeIEiBrT_w
[1] https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
[2] https://www.sciencedirect.com/science/article/abs/pii/S08852...
And, well, what would a human do when they encounter a problem of recognizing an arbitrarily complex grammar? Would they try to learn it? Would they succeed? Or would they write a program to recognize it? Is writing a program in TC0?
It might be a genuine limitation that guarantees subhuman performance at practically achievable network sizes, but it's not that straightforward.
Indeed, even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages, that is still sufficient to write programs that can simulate a Turing machine and give you Turing completeness in a "meta" way. (The catch might be that the program is only Turing-complete in an "abstract" sense, just as real-world running programs don't have infinite tape.)
> even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages
https://arxiv.org/abs/2310.07923 Their recognition abilities are limited to context-sensitive languages if the number of chain-of-thought steps is linear in the length of the input.
All in all, asymptotic behavior of a network on an algorithmic task doesn't matter that much for AI. What matters is whether the network can simplify a problem to a manageable length.
https://www.elsewhere.org/journal/pomo/
Every now and again, I come back to it for a good chuckle. Here's what I got this time (link to the full essay below the excerpt):
"If one examines subcultural capitalist theory, one is faced with a choice: either reject subdialectic cultural theory or conclude that the significance of the artist is significant form, but only if culture is interchangeable with art. It could be said that many desituationisms concerning the capitalist paradigm of context exist. The subject is contextualised into a presemanticist deappropriation that includes truth as a totality."
https://www.elsewhere.org/journal/pomo/?1298795365
In my opinion, it's stolen valor.
There's a reason companies hire people they don't need: to make themselves look bigger to investors and in the case of law firms, clients.
Analyzing complex systems from first principles and challenging decisions that have no good reason is a noble cause, but it's only useful if you have the authority to make the changes. I'm not dictator of the world, I don't get to dictate how customers, clients, or employers shall make their business and hiring decisions.
It worked surprisingly well. :)
Seems like it stopped around 5 years ago.
25:13 As the hands upon his disciple; but make great matter wisely in their calamity; 1:14 Sanctify unto the arches thereof unto them: 28:14 Happy is our judges ruled, that believe. 19:36 Thus saith the first being one another's feet.
This leans pretty hard on Perl DWIM; a Python version is many times longer:
(If you write code like that at a job, you might get fired. I would reject your pull request.)

It's probably worth mentioning:
- Markov-chain states don't have to be words. For text modeling, a common thing to do is to use N-grams (of either words or characters) instead of single words for your states.
- It's good to have some nonzero probability of taking a transition that hasn't occurred in the training set; this is especially important if you use a larger universe of states, because each state will occur fewer times in the training set. "Laplacian smoothing" is one way to do this (a minimal sketch appears below).
- Usually (and in my code examples above) the probability distribution of transitions we take in our text generator is the probability distribution we inferred from the input. Potter's approach of multiplying his next-state-occupancy vector Ms⃗ by a random diagonal matrix R and then taking the index of the highest value in the resulting vector produces somewhat similar results, but they are not the same; for example
is roughly 833000, not roughly 750000. It's 5× as common instead of 3×. But I imagine that it still produces perfectly cromulent text.

Interestingly, modern compilers derive from Chomsky's theory of linguistics, which he formulated while working on an AI project in the 01950s in order to demolish the dominant theory of psychology at the time, behaviorism. Behaviorism essentially held that human minds were Markov chains, and Chomsky showed that Markov chains couldn't produce context-free languages.
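Circling back to the Laplacian-smoothing bullet above, a minimal add-one version:

    def smoothed_prob(counts, state, nxt, vocab_size, alpha=1.0):
        """P(next | state) with add-alpha smoothing: every transition, seen or
        not, gets a nonzero probability."""
        seen = counts.get(state, {})
        total = sum(seen.values())
        return (seen.get(nxt, 0) + alpha) / (total + alpha * vocab_size)

    counts = {("the",): {"cat": 3, "dog": 1}}
    print(smoothed_prob(counts, ("the",), "cat", vocab_size=5))  # (3+1)/(4+5) = 0.444...
    print(smoothed_prob(counts, ("the",), "fox", vocab_size=5))  # (0+1)/(4+5) = 0.111...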
Happy to be part of today's 10,000.
How big is a Transformer in TensorFlow in Python?
Back in the early days of Slack I wrote a chat bot that logged all the messages people wrote and used that data to make it say random things prompted by certain trigger words. It was hilarious, until the bosses found out that I was basically logging all their conversations. Unsurprisingly they made me turn it off.
It's very interesting reading what Claude Shannon wrote after producing Markov chains from English sentences.
> In an unpublished spoof written a year later, Shannon imagined the damage his methods would do if they fell into the wrong hands. It seems that an evil Nazi scientist, dr. Hagen Krankheit, had escaped Germany with a prototype of his Müllabfuhrwortmaschine, a fearsome weapon of war "anticipated in the work ... of Dr. Claude Shannon." Krankheit's machine used the principles of randomized text to totally automate the propaganda industry. By randomly stitching together agitprop phrases in a way that approximated human language, the Müllabfuhrwortmaschine could produce an endless flood of demoralizing statements.
From A Mind at Play. I don't remember the exact year he did that work. I think it was some time before 1950, yet strangely timely in 2025.
KingJamesProgramming [1] - a mashup of Knuth books [2] and the King James Bible with a Markov chain, is still one of my favorite reads for a laugh.
It was also the first place I was really exposed to probabilistic-based methods for generation of novel content.
You get gems like:
"they smote the city with the edge of the sword. 22:20 And one of his main motivations was the high cost of proprietary software houses."
and
"37:29 The righteous shall inherit the land, and leave it for an inheritance unto the children of Gad according to the number of steps that is linear in b."
[1] https://www.tumblr.com/kingjamesprogramming
[2] https://en.wikipedia.org/wiki/Donald_Knuth
"then shall they call upon me, but I will not cause any information to be accumulated on the stack."
And since the context window is bounded, this is guaranteed to be finite-state, and all states are transient except those where the window is saturated.
https://www.youtube.com/watch?v=KZeIEiBrT_w
Could add a "stop" token at the end of the fruit example, so when the little model runs it can call its own stop.
To generalize to ADHD, create another model that can quietly "listen" to a speaker (a token generator) until its Markov chain predicts a "stop", then start its own generating model running, regardless of whether the speaker was done.
Then two instances can have time efficient interrupty conversations with each other.
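Sketching the stop-token idea (assuming the article's fruit example is a simple first-order word chain; the training sequences here are made up):

    import random

    STOP = "<stop>"   # terminator appended to every training sequence

    def train(sequences):
        model = {}
        for seq in sequences:
            seq = list(seq) + [STOP]
            for a, b in zip(seq, seq[1:]):
                model.setdefault(a, []).append(b)
        return model

    def run(model, start):
        out = [start]
        while out[-1] != STOP:
            out.append(random.choice(model[out[-1]]))
        return out[:-1]                  # the chain called its own stop

    fruits = [["red", "apple"], ["green", "apple"], ["red", "cherry"]]
    print(run(train(fruits), "red"))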
Also, this Markov chain has size O(T^K), exponential in the token length K of the context window, which makes it a beyond-astronomically large object.
Source: https://github.com/Aperocky/weighted-markov-generator
You could load it up with anything, and it would just babble on.
Anyway, for anyone who wants to go deeper, I recommend Lawrence R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", https://www.cs.cmu.edu/~cga/behavior/rabiner1.pdf
And for a cute example, there is also one in none other than Claude Shannon's "A Mathematical Theory of Communication", https://web.archive.org/web/20121108191018/https://www.alcat...
https://codap.concord.org/app/static/dg/en/cert/index.html?d...
If you select drawing mode from the first screen and then click the "T" icon, you can type some text and it will generate a state diagram that you can then "play" to examine the output sequences. If a state has multiple exit routes, you can click on it and adjust the probabilities of each option.
14 more comments available on Hacker News