Markov Chains Are the Original Language Models
Posted 4 months ago · Active 3 months ago
Source: elijahpotter.dev · Tech story · High profile
Tone: calm, positive · Debate intensity: 60/100
Key topics
Markov Chains
Language Models
NLP
AI
Machine Learning
The article discusses how Markov chains are the original language models and shares personal anecdotes and historical context, sparking a thoughtful discussion on the topic.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 4 days after posting. Peak period: 142 comments (Day 5). Average per period: 20.
[Comment distribution chart: 160 data points, based on 160 loaded comments]
Key moments
1. Story posted: Sep 19, 2025 at 2:42 PM EDT (4 months ago)
2. First comment: Sep 23, 2025 at 2:01 PM EDT (4 days after posting)
3. Peak activity: 142 comments in Day 5, the hottest window of the conversation
4. Latest activity: Sep 30, 2025 at 5:02 AM EDT (3 months ago)
ID: 45304980 · Type: story · Last synced: 11/20/2025, 8:32:40 PM
This was one of my first forays into modifying C code, trying to figure out why 350 MB seemed to be the biggest brain size (32-bit memory limits, plus the requirement of a contiguous block for the entire brain).
I miss the innocence of those days. Just being a teen, tinkering with things I didn't understand.
For example, I'm currently relearning various ImageMagick details, and thanks to their explanations I now understand things that I cut/copy/pasted a long time ago without always knowing why they worked the way they did.
[1] https://en.wikipedia.org/wiki/Loebner_Prize
https://github.com/Heatwave/the-practice-of-programming/blob...
https://en.wikipedia.org/wiki/Mark_V._Shaney
0: https://www.genudi.com/about
http://c9x.me/cbits/
Well, the responses were pretty messed up and not accurate, but we all got a good chuckle watching the bot occasionally sound like the person amidst a mass of random other things that person always said, jumbled together :D
I have played with Markov chains a lot. I tried having skip states and such, but ultimately you're always pushed toward doing something similar to the attention mechanism to handle context.
In the end you're inputting millions of ways there could be a pattern, passing all of those into a neural network, and weighting the chains that make correct predictions more heavily.
You start to realize that, even with all these additions, past context could still influence the current prediction in ways you haven't enumerated, and that what you really want is a generator for all the ways there could be a pattern. At this point you're getting into the realm of multilayer neural networks and starting to consider the attention mechanism.
I don't want to discourage anyone from learning Markov chains here, btw. It's just that they have limitations, and those limitations make a great learning journey toward neural networks: you realize you need more than a single absolute state being activated at a time, and then you start to think about how all the states activated in the past might influence the current probabilities (essentially, you start thinking about the problem the attention mechanism solves).
Give a class of students an image with horizontal lines, where every second line is a solid color and the others are random static. See how their left-to-right Markov chains do here (they should make ~50% correct predictions).
Then rotate the image 90 degrees. Have the class observe that a left-to-right Markov chain gets ~0% when predicting this (every second pixel being random will do that). What to do? Maybe input both directions and weight toward the best one with a perceptron? Hey, first step toward learning a neural network!
From there you can iterate more and more until you no longer really have Markov chains but instead neural networks with a type of attention mechanism.
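A minimal sketch of that classroom exercise (my own toy setup, not the commenter's: a 256-colour palette and a first-order left-to-right chain, trained and scored on the same image):

    import random
    from collections import Counter, defaultdict

    random.seed(42)
    H = W = 64
    PALETTE = 256  # colour values 0..255

    # Even rows: one solid random colour each. Odd rows: per-pixel static.
    img = [[random.randrange(PALETTE)] * W if r % 2 == 0
           else [random.randrange(PALETTE) for _ in range(W)]
           for r in range(H)]

    def accuracy(rows):
        """Train a first-order left-to-right chain, then score next-pixel prediction."""
        counts = defaultdict(Counter)
        for row in rows:
            for a, b in zip(row, row[1:]):
                counts[a][b] += 1
        hits = total = 0
        for row in rows:
            for a, b in zip(row, row[1:]):
                hits += counts[a].most_common(1)[0][0] == b
                total += 1
        return hits / total

    print("horizontal stripes:", accuracy(img))   # ~0.5: solid rows are easy, static rows aren't
    rotated = [list(col) for col in zip(*img)]    # rotate 90 degrees
    print("rotated:", accuracy(rotated))          # near 0: every second pixel is static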
The essential property of a Markov chain is maintaining the Markov property:
P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n)
That is the future, given the present state, is conditionally independent of the past states.
It's worth recognizing that decoder-only LLMs (which include most of the major LLMs people use) maintain the Markov property and can properly be understood as Markov chains.
> The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.
Attention has nothing to do with maintaining the Markov property, but it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.
tl;dr most of the LLMs people use are effectively Markov Chains.
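To make "the state is the context window" concrete, a toy sketch (toy_lm below is a stand-in for a model's softmax output, not any real API): each generation step is a transition that depends only on the present state.

    import random

    def step(state, lm_probs, window):
        """One Markov transition: sample from P(next | state), then slide the window."""
        probs = lm_probs(state)                 # depends only on the present state
        tokens = list(probs)
        nxt = random.choices(tokens, weights=[probs[t] for t in tokens])[0]
        return (state + (nxt,))[-window:]       # the new state is still a bounded window

    # Toy stand-in for a decoder-only model's next-token distribution.
    def toy_lm(context):
        return {"a": 0.5, "b": 0.5} if context[-1:] == ("a",) else {"a": 0.9, "b": 0.1}

    state = ("a",)
    for _ in range(8):
        state = step(state, toy_lm, window=4)
    print(state)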
Chat interfaces are like taking a simple Markov generator, but with a rule where you say 'whenever it reaches a state ending in X, hand over decision making about state transitions to a human instead of a random number generator, until they move it to state ending in Y'.
Tool calling is similar - 'when it reaches state ending in X, send the state data to a tool; use the result to drive a series of state transitions; then start generating again'.
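Rendered as a toy loop (the marker tokens and stand-in callables here are made up for illustration):

    END_TURN, SEND = "<end_turn>", "<send>"   # made-up marker tokens

    def run(generate_token, next_human_token, state=(), turns=2):
        """Alternate control of the state transitions between generator and human."""
        for _ in range(turns):
            while not state or state[-1] != END_TURN:   # generator drives
                state += (generate_token(state),)
            tok = next_human_token()
            while tok != SEND:                          # human drives
                state += (tok,)
                tok = next_human_token()
        return state

    # Toy stand-ins: the "model" says hi and yields; the "human" replies "ok" and sends.
    model = lambda s: "hi" if (not s or s[-1] != "hi") else END_TURN
    human = iter(["ok", SEND, "ok", SEND])
    print(run(model, lambda: next(human)))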
The validation of what I'm saying is little more than a few minutes of googling away.
Some interventions can just be warnings, rather than lectures.
As far as I understand it, as you have a back-and-forth conversation with an LLM, you have to provide the entire history of the conversation plus your new response each time.
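That's the pattern: the client keeps the transcript and re-sends all of it on every turn. A sketch, where complete() is a placeholder for whatever chat-completion call you're using:

    def chat(complete, turns):
        """Accumulate the transcript; every request carries the FULL history."""
        messages = []
        for user_text in turns:
            messages.append({"role": "user", "content": user_text})
            reply = complete(messages)          # entire history + new message goes out
            messages.append({"role": "assistant", "content": reply})
        return messages

    # Toy stand-in that just reports how much history it was sent.
    echo = lambda msgs: f"(saw {len(msgs)} messages)"
    for m in chat(echo, ["hello", "and again"]):
        print(m["role"], ":", m["content"])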
This is the limitation I was getting at, btw. In the example I described, if you have an image with solid vertical columns, followed by columns of random static, followed again by solid vertical colors, a Markov chain could eventually learn all the patterns that go
solid -> 32 random bits -> different solid color
and it would eventually start predicting the second color correctly based on the solid color before the randomness. It 'just' needs a state for every possible random run in between. This is ridiculous in practice, however, since you'd need to learn 2^32 states for the relationship between those two solid colors alone.
You can use skipgrams - prefixes with holes in them.
Sparse Non-negative Matrix Language Model [1] uses them with great success.
[1] https://aclanthology.org/Q16-1024/
Pure n-gram language models would have a hard time computing escape weights for such contexts, but the mixture of probabilities used in SNMLM does not need to do that.
If I may, I've implemented an online per-byte version of SNMLM [2], which allows the use of skipgrams. They make performance worse, but they can be used. My implementation's predictive performance is within a few percent of an LSTM's on enwik8.
[2] https://github.com/thesz/snmlm-per-byte
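For readers unfamiliar with the term: a skipgram context is a normal n-gram prefix with some positions replaced by holes. A small illustration of just the context extraction (this is not SNMLM itself):

    from itertools import combinations

    HOLE = "_"

    def skipgram_contexts(tokens, order=2):
        """Yield (context, next_token) pairs where the context is the preceding
        `order` tokens with every possible subset of positions knocked out."""
        for i in range(order, len(tokens)):
            prefix = tokens[i - order:i]
            for k in range(order + 1):
                for holes in combinations(range(order), k):
                    ctx = tuple(HOLE if j in holes else prefix[j] for j in range(order))
                    yield ctx, tokens[i]

    for ctx, nxt in skipgram_contexts("the cat sat on the mat".split()):
        print(ctx, "->", nxt)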
What I don't understand is how the line is drawn between "past" and "present". Why are variables a,b,c viewed as "present" and not "past"? How is past defined? Why can't you view token generation as f(past1, past2, past3, present)?
As I said, it does look "Markovian" if the number of "pasts" is fixed, but again, what exactly is not Markovian then? Infinite inputs? Everything in the real world is finite.
I guess it would be more accurate to say that at its core a Markovian process is based around a pure function. Usually there's some kind of feedback loop around the function that allows the process to evolve in some way and continue producing output.
>I’m pretty sure anything can be modeled using pure functions.
We're not talking about theoretical models of black boxes, though, we're talking about known mechanisms. A recursive function doesn't become iterative just because it could be thought of as iterative (because eventually the CPU does loop over some sequence of instructions). We still say it's recursive.
>I’m not sure what the usefulness is
I suppose part of the usefulness is to de-mystify LLMs. If you can understand what Markov chains do, the thought "oh, an LLM is just a much more complex Markov chain" can help in reasoning about them.
Uh... You're moving the goalposts. I don't know the exact definition of "LLM", but let's suppose that it and the definition of "Markov chain" make us conclude that LLMs are necessarily Markov chains. "It's a Markov chain" is still a useful statement, because there are processes that are neither LLMs nor Markov chains.
I'm not sure what those would be. All dynamical systems (essentially all of classical physics), except maybe quantum mechanics, are Markov chains.
So, people can correctly call LLMs Markovian in practice, and also non-Markovian from a theoretical standpoint.
I think of it as conceptually a little like the distinction between a formal Turing machine, which requires an infinite tape, and a practical computer with a finite amount of memory. Your computer acts as a Turing machine for the real computations you use it for, but there exist some computations that would require more memory than you have. From a theoretical standpoint, your computer is merely a finite state automaton.
That's what n-grams are; even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3-to-7-grams.
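For anyone who hasn't built one: a word-level trigram chain fits in a few lines (the corpus here is a toy, purely for illustration):

    import random
    from collections import defaultdict

    def train(tokens, n=3):
        """Map each (n-1)-token context to the tokens observed after it."""
        model = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            model[tuple(tokens[i:i + n - 1])].append(tokens[i + n - 1])
        return model

    def generate(model, seed, length=10):
        out, order = list(seed), len(seed)
        for _ in range(length):
            followers = model.get(tuple(out[-order:]))
            if not followers:
                break                  # unseen context: a dead end without smoothing
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = "the quick brown fox jumps over the lazy dog and the quick red fox".split()
    print(generate(train(corpus), ("the", "quick")))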
> Can you give an example of a non-Markov system?
Encoder-decoder LLMs violate the Markov Property and would not count as Markov Chains.
Essentially yes. Given complex enough state, more or less any process is Markovian. Some do require infinite state, which would be maybe stretching the definition a bit. In practice Markov formulation may not be a very good analysis perspective if the required state would be very large/complex.
Markov models are useful for understanding the limitations of any specific next-token predictor with a fixed context window. But not for the limitations of such predictors in general, as there is always a Markov model large enough to handle any specific input.
...when they are used without RAG and tools. x_n belongs to a set of cardinality 65536^100000 or so. Equating an LLM to a Markov chain doesn't let you make any non-trivial predictions about its behavior.
The key insight you raise about attention is particularly important: it doesn't violate the Markov property, it just enables a vastly more sophisticated state representation. Classical n-gram Markov chains use simple discrete states, while transformers use high-dimensional continuous representations that can encode exponentially more information about context.
This perspective helps bridge the conceptual gap many people have. When they think "Markov chain," they often picture simple state transitions, but mathematically, LLMs are just Markov chains with astronomically complex states. The attention mechanism is the computational trick that makes these complex states tractable - it's not changing the fundamental probabilistic structure.
Right - but this feels like being half-way between transformers and lookup tables; the latter have enormous expressive power too, as long as you're willing to handle even more state. I'm curious what else could be put on that spectrum, and where. Maybe there's something weaker than transformers, but good enough for practical applications, while being more resource-efficient?
Related thought: understanding and compression seem to be fundamentally the same thing.
To quote a friend of mine: If I can't just trust the status quo I would have to question everything????
#ItsSoMetaEvenThisAcronym
Both are fundamentally prediction. :)
Basically, for most forms of complexity, Markov Chains are a strictly bad model. Perhaps if there was a way to "virtualize" the multiplication of states, you could do something reasonable. Idle speculation though.
[1] https://news.ycombinator.com/item?id=45354488
I wish! Gboard is terrible at predicting the correct form of a word.
It generally knows the best word, but it doesn't get from context that (e.g.) a gerund form of the verb is the most appropriate next word.
Where the state space would be proportional to the token length squared, just like the attention mechanisms we use today?
E.g., imagine input of red, followed by 32 bits of randomness, followed by blue, forever. A Markov chain could learn that red leads to blue 32 bits later. It would just need to learn 2^32 states.
A few more leaps and we should eventually get models small enough to get close to information theoretic lower bounds of compression.
As long as you don't care about the quality of what they're expressing. There's a reason they never did anything better than the postmodernism generator.
Putting paint in a cannon has enormous expressive power too, but if you aren't Rothko, nobody's going to care.
What's fascinating is that transformers with attention don't actually escape the Markov property - they're mathematically equivalent to very high-order Markov chains where the entire context window forms the "state." The breakthrough wasn't abandoning Markov chains, but finding a parameterized way to approximate these massive state spaces through learned representations rather than explicit enumeration.
Your observation about inevitably trending toward attention-like mechanisms is spot-on. The attention mechanism essentially provides a tractable approximation to the astronomically large transition matrices that would be required for a classical Markov chain to capture long-range dependencies. It's a more elegant solution to the same fundamental problem you were solving with skip states.
However, I'd argue that the distinction between classical n-gram Markov chains and modern transformers isn't as binary as it might appear. When we consider that transformers with context windows are mathematically equivalent to very high-order Markov chains (where the "state" encompasses the entire context), we see that the breakthrough wasn't abandoning the Markov property, but rather expanding the state representation exponentially.
Your observation about inevitably trending toward attention-like mechanisms is particularly insightful. The exponential state explosion you encountered (needing 2^32 states for your example) is precisely why parameterized approaches like transformers became necessary - they provide a tractable way to approximate these massive state spaces through learned representations rather than explicit enumeration.
The key innovation wasn't escaping Markov chains, but rather finding an efficient way to represent and compute over astronomically large state spaces that would be intractable for classical approaches.
What it is, and what I assume you mean, is a next-word prediction model based solely on the previous sequence of words, up to some limit. It literally is that, because it was designed to be that.
Like, what is one insight beyond that LLMs are Markov chains that you've derived from thinking of LLMs as Markov chains? I'm genuinely very curious.
Around 2009, I had developed an algorithm for building the Burrows–Wheeler transform at (what was back then) very large scale. If you have the BWT of a text corpus, you can use it to simulate a Markov model with any context length. I tried that with a Wikipedia dump, which was amusing for a while but not interesting enough to develop further.
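The BWT (via an FM-index) is what makes the context lookups cheap at scale; a naive sketch of the same behaviour, assuming my reading of the approach is right, is longest-suffix matching with a linear scan:

    import random

    def next_token(corpus, context, max_order=8):
        """Back off from the longest matching suffix of `context` to shorter ones,
        then sample among the tokens that followed that suffix in the corpus.
        (A BWT/FM-index answers these occurrence queries quickly; the linear
        scan here is only for illustration.)"""
        for k in range(min(max_order, len(context)), 0, -1):
            suffix = context[-k:]
            followers = [corpus[i + k] for i in range(len(corpus) - k)
                         if corpus[i:i + k] == suffix]
            if followers:
                return random.choice(followers)
        return random.choice(corpus)

    corpus = "to be or not to be that is the question".split()
    ctx = ["not", "to"]
    for _ in range(5):
        ctx.append(next_token(corpus, ctx))
    print(" ".join(ctx))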
Then, around 2019, I was working in genomics. We were using pangenomes based on thousands of (human) haplotypes as reference genomes. The problem was that adding more haplotypes also added rare variants and rare combinations of variants, which could be misleading and eventually started decreasing accuracy in the tasks we were interested in. The standard practice was dropping variants that were too rare (e.g. <1%) in the population. I got better results with synthetic haplotypes generated by downsampling the true haplotypes with a Markov model (using the BWT-based approach). The distribution of local haplotypes within each context window was similar to the full set of haplotypes, but the noise from rare combinations of variants was mostly gone.
Other people were doing haplotype inference with Markov models based on similarly large sets of haplotypes. If you knew, for a suitably large subset of variants, whether each variant was likely absent, heterozygous, or homozygous in the sequenced genome, you could use the model to get a good approximation of the genome.
When ChatGPT appeared, the application was surprising (even though I knew some people who had been experimenting with GPT-2 and GPT-3). But it was less surprising on a technical level, as it was close enough to what I had intuitively considered possible with large Markov models.
I frequently see people underestimate the danger of LLMs to humanity in the long term and to people’s jobs in the medium term because they follow this chain of reasoning:
- An LLM is basically a fancy Markov chain (highly dubious, even if there's a tiny kernel of truth in there)
- Therefore Markov chains and LLMs have a similar skill ceiling (definitely wrong)
- My job could never be done by a Markov chain, it's much too complex (this is presented as self-evident; no one ever feels the need to back it up)
- Therefore, since LLMs are basically Markov chains, and a Markov chain could never do my job, I have nothing to worry about.
I’ll grant that you’re not necessarily responsible for these people using your model in a way you didn’t intend. But I do think at this point it’s time to start pushing people to stop trying to reason mechanically about how these things work.
2. Because they are Markov chains, their skill ceiling is bounded by whatever the skill ceiling of Markov chains happens to be. What LLMs demonstrate is that the skill ceiling of Markov chains is higher than previously understood.
3. If the skill ceiling of Markov chains is high enough, then one could take over some or all of someone's job.
N-gram models are not useful in many of the ways LLMs are.
N-gram models are very limited.
On the other hand, basically any process can be considered a Markov process if you make the state include enough.
So, calling both the very limited n-gram models and the nearly unlimited Markov processes by the same name, "Markov chain", is just super confusing.
My experience with people making claims about the inherent danger has been along the lines of this later quote:
> this is presented self-evidently, no one ever feels the need to back this up
softmax(QK^T / sqrt(d_k)) V
Q = W_Q X, K = W_K X, V = W_V X
where X is a matrix in which every column is the embedding of a token.
K and V matrices need to be cached.
Also, you need to calculate all of it, for every layer, even if you just want a single token of output. Tokens after the first can reuse the cached K and V matrices and don't have to recalculate everything from scratch. Causal attention is the concept of masking the calculation so that each position only looks at past tokens.
My critique boils down to this: Transformers are too simple to simplify them. You can only lose information if you do that.
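For concreteness, the formula above spelled out in numpy with the same column convention (a sketch: single head, no KV cache; the transpose lands differently because tokens are columns here, and the mask is the "causal" part):

    import numpy as np

    def causal_attention(X, W_Q, W_K, W_V):
        """Causally masked softmax(Q^T K / sqrt(d_k)); columns of X are token embeddings."""
        Q, K, V = W_Q @ X, W_K @ X, W_V @ X                  # project each token's embedding
        d_k = Q.shape[0]
        scores = Q.T @ K / np.sqrt(d_k)                      # scores[i, j] = q_i . k_j
        scores += np.triu(np.full_like(scores, -1e9), k=1)   # hide future tokens (j > i)
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)        # softmax over past tokens
        return V @ weights.T                                 # column i mixes values of tokens <= i

    rng = np.random.default_rng(0)
    d, n = 4, 3                                              # embedding dim, token count
    X = rng.normal(size=(d, n))
    W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
    print(causal_attention(X, W_Q, W_K, W_V).shape)          # (4, 3): one column per token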
And Veritasium's video "The Strange Math That Predicts (Almost) Anything" talks in detail about the history of Markov chains: https://youtu.be/KZeIEiBrT_w
[1] https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
[2] https://www.sciencedirect.com/science/article/abs/pii/S08852...
And, well, what would a human do when they encounter a problem of recognizing an arbitrarily complex grammar? Would they try to learn it? Would they succeed? Or would they write a program to recognize it? Is writing a program in TC0?
It might be a genuine limitation that guarantees subhuman performance at practically achievable network sizes, but it's not that straightforward.
Indeed, even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages, that is still sufficient to write programs that can simulate a Turing machine and give you Turing completeness in a "meta" way. (The catch might be that the program is only Turing-complete in an "abstract" sense, just as real-world running programs don't have infinite tape.)
> even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages
https://arxiv.org/abs/2310.07923 Their recognition abilities are limited to context-sensitive languages if the number of chain-of-thought steps is linear in the length of the input.
All in all, asymptotic behavior of a network on an algorithmic task doesn't matter that much for AI. What matters is whether the network can simplify a problem to a manageable length.
https://www.elsewhere.org/journal/pomo/
Every now and again, I come back to it for a good chuckle. Here's what I got this time (link to the full essay below the excerpt):
"If one examines subcultural capitalist theory, one is faced with a choice: either reject subdialectic cultural theory or conclude that the significance of the artist is significant form, but only if culture is interchangeable with art. It could be said that many desituationisms concerning the capitalist paradigm of context exist. The subject is contextualised into a presemanticist deappropriation that includes truth as a totality."
https://www.elsewhere.org/journal/pomo/?1298795365
In my opinion, it's stolen valor.
There's a reason companies hire people they don't need: to make themselves look bigger to investors and in the case of law firms, clients.
Analyzing complex systems from first principles and challenging decisions that have no good reason is a noble cause, but it's only useful if you have the authority to make the changes. I'm not dictator of the world, I don't get to dictate how customers, clients, or employers shall make their business and hiring decisions.
It worked surprisingly well. :)
Seems like it stopped around 5 years ago.
25:13 As the hands upon his disciple; but make great matter wisely in their calamity; 1:14 Sanctify unto the arches thereof unto them: 28:14 Happy is our judges ruled, that believe. 19:36 Thus saith the first being one another's feet.
This leans pretty hard on Perl DWIM; a Python version is many times longer:
(If you write code like that at a job, you might get fired. I would reject your pull request.)

It's probably worth mentioning:
- Markov-chain states don't have to be words. For text modeling, a common thing to do is to use N-grams (of either words or characters) instead of single words for your states.
- It's good to have some nonzero probability of taking a transition that hasn't occurred in the training set; this is especially important if you use a larger universe of states, because each state will occur fewer times in the training set. "Laplacian smoothing" is one way to do this (a minimal sketch appears below).
- Usually (and in my code examples above) the probability distribution of transitions we take in our text generator is the probability distribution we inferred from the input. Potter's approach of multiplying his next-state-occupancy vector Ms⃗ by a random diagonal matrix R and then taking the index of the highest value in the resulting vector produces somewhat similar results, but they are not the same; for example
is roughly 833000, not roughly 750000. It's 5× as common instead of 3×. But I imagine that it still produces perfectly cromulent text.

Interestingly, modern compilers derive from Chomsky's theory of linguistics, which he formulated while working on an AI project in the 01950s in order to demolish the dominant theory of psychology at the time, behaviorism. Behaviorism essentially held that human minds were Markov chains, and Chomsky showed that Markov chains couldn't produce context-free languages.
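Circling back to the Laplacian-smoothing bullet above, a minimal add-one version:

    def smoothed_prob(counts, state, nxt, vocab_size, alpha=1.0):
        """P(next | state) with add-alpha smoothing: every transition, seen or
        not, gets a nonzero probability."""
        seen = counts.get(state, {})
        total = sum(seen.values())
        return (seen.get(nxt, 0) + alpha) / (total + alpha * vocab_size)

    counts = {("the",): {"cat": 3, "dog": 1}}
    print(smoothed_prob(counts, ("the",), "cat", vocab_size=5))  # (3+1)/(4+5) = 0.444...
    print(smoothed_prob(counts, ("the",), "fox", vocab_size=5))  # (0+1)/(4+5) = 0.111...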
Happy to be part of today's 10,000.
How big is a Transformer in TensorFlow in Python?
Back in the early days of Slack I wrote a chat bot that logged all the messages people wrote and used that data to make it say random things prompted by certain trigger words. It was hilarious, until the bosses found out that I was basically logging all their conversations. Unsurprisingly they made me turn it off.
It's very interesting reading what Claude Shannon wrote after producing Markov chains from English sentences.
> In an unpublished spoof written a year later, Shannon imagined the damage his methods would do if they fell into the wrong hands. It seems that an evil Nazi scientist, dr. Hagen Krankheit, had escaped Germany with a prototype of his Müllabfuhrwortmaschine, a fearsome weapon of war "anticipated in the work ... of Dr. Claude Shannon." Krankheit's machine used the principles of randomized text to totally automate the propaganda industry. By randomly stitching together agitprop phrases in a way that approximated human language, the Müllabfuhrwortmaschine could produce an endless flood of demoralizing statements.
From A Mind at Play. I don't remember the exact year he did that work. I think it was some time before 1950, yet strangely timely in 2025.
KingJamesProgramming [1] - a mashup of Knuth books [2] and the King James Bible with a Markov chain, is still one of my favorite reads for a laugh.
It was also the first place I was really exposed to probabilistic-based methods for generation of novel content.
You get gems like:
"they smote the city with the edge of the sword. 22:20 And one of his main motivations was the high cost of proprietary software houses."
and
"37:29 The righteous shall inherit the land, and leave it for an inheritance unto the children of Gad according to the number of steps that is linear in b."
[1] https://www.tumblr.com/kingjamesprogramming
[2] https://en.wikipedia.org/wiki/Donald_Knuth
"then shall they call upon me, but I will not cause any information to be accumulated on the stack."
And since the context window is bounded, this is guaranteed to be finite-state, and all states are transient except those where the window is saturated.
https://www.youtube.com/watch?v=KZeIEiBrT_w
Could add a "stop" token at the end of the fruit example, so when the little model runs it can call its own stop.
To generalize to ADHD, create another model that can quietly "listen" to a speaker (a token generator) until its Markov chain predicts a "stop", then start its own generating model running, regardless of whether the speaker was done.
Then two instances can have time efficient interrupty conversations with each other.
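Sketching the stop-token idea (assuming the article's fruit example is a simple first-order word chain; the training sequences here are made up):

    import random

    STOP = "<stop>"   # terminator appended to every training sequence

    def train(sequences):
        model = {}
        for seq in sequences:
            seq = list(seq) + [STOP]
            for a, b in zip(seq, seq[1:]):
                model.setdefault(a, []).append(b)
        return model

    def run(model, start):
        out = [start]
        while out[-1] != STOP:
            out.append(random.choice(model[out[-1]]))
        return out[:-1]                  # the chain called its own stop

    fruits = [["red", "apple"], ["green", "apple"], ["red", "cherry"]]
    print(run(train(fruits), "red"))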
Also, this Markov chain has size O(T^K), exponential in the token length K of the context window, which makes it a beyond-astronomically large object.
Source: https://github.com/Aperocky/weighted-markov-generator
You could load it up with anything, and it would just babble on.
Anyway, for anyone who wants to go deeper, I recommend Lawrence R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", https://www.cs.cmu.edu/~cga/behavior/rabiner1.pdf
And for a cute example, there is also one in none other than Claude Shannon's "A Mathematical Theory of Communication", https://web.archive.org/web/20121108191018/https://www.alcat...
https://codap.concord.org/app/static/dg/en/cert/index.html?d...
If you select drawing mode from the first screen and then click the "T" icon, you can type some text and it will generate a state diagram that you can then "play" to examine the output sequences. If a state has multiple exit routes, you can click on it and adjust the probabilities of each option.
14 more comments available on Hacker News