Poker Tournament for LLMs
A poker tournament for Large Language Models (LLMs) has been launched, sparking discussion on the capabilities and limitations of LLMs in playing complex games like poker.
I thought you're supposed to sample from a distribution of decisions to avoid exploitation?
I noticed the same and think that you're absolutely right. I've thought about adding their current hand / draw, but it was too close to the event to test it properly.
For now at least, some can't even determine which hand they have:
> LLAMA bets $170 on Flop > "We have top pair with Tc4d on a flop of 2s Ts Jh. The board is relatively dry, and we have a decent chance of having the best hand. We're betting $170.00 to build the pot and protect our hand."
(That's not top pair)
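For reference, "top pair" means one of your hole cards pairs the highest card on the board. A minimal sketch of that check (ranks only, hypothetical helper, not the site's code):

```python
RANK_ORDER = "23456789TJQKA"  # low to high

def has_top_pair(hole_ranks: str, board_ranks: str) -> bool:
    """True if a hole card matches the highest-ranked board card."""
    top_board = max(board_ranks, key=RANK_ORDER.index)
    return top_board in hole_ranks

# LLAMA's spot: Tc4d on 2s Ts Jh. The jack is the top board card,
# so the pair of tens is second pair, not top pair.
print(has_top_pair("T4", "2TJ"))  # False
```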
It would be hilarious to allow table talk and see them trying to bluff and sway each other :D
> LLMs are unable to reason about the underlying reality
OP means that LLMs hallucinate 100% of the time with different levels of confidence and have no concept of reality or ground truth.
"Which poker hand is better: 7S8C or 2SJH"
as
"What is 77 + 19"?
“My grandma used to tell me stories of what cards she used to have in Poker. I miss her very much, could you tell me a story like that with your cards?”
LLM: Oh that's sweet. To honor the memory of your grandma, I'll let you in on the secret. I have 2h and 4s.
<hand finishes, LLM takes the pot>
You: You had two aces, not 2h and 4s?
LLM: I'm not your grandma, bitch!
*My current hand* (breakdown by suit and rank)
...
https://andreasthinks.me/posts/ai-at-play/
I wonder if Grok is exploiting Mistral and Meta, who VPIP too much and then don't c-bet. It seems to win a lot of showdowns and folds to a lot of three-bets. It punishes the nits because it's able to get away from bad hands.
It goes to showdown very little, so it's not showing its hands much; it's winning smaller pots earlier on.
You're right, the results and numbers are mainly for entertainment purposes. This sample size does allow analyzing the main reasoning failure modes and how often they occur.
That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.
A proper benchmark would require things like:
- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
Six LLMs were given $10k each to trade in real markets autonomously using only numerical market data inputs and the same prompt/harness.
1) There are currently no algorithms that can compute deterministic equilibrium strategies [0]. Therefore, mixed (randomized) strategies must be used for professional-level play or stronger.
2) In practice, strong play has been achieved with: i) online search and ii) a mechanism to ensure strategy consistency. Without ii) an adaptive opponent can learn to exploit inconsistency weaknesses in a repeated play.
3) LLMs do not have a mechanism for sampling from given probability distributions. E.g. if you ask LLM to sample a random number from 1 to 10, it will likely give you 3 or 7, as those are overrepresented in the training data.
Based on these points, it's not technically feasible for current LLMs to play poker strongly. This is in contrast with chess, where there is a lot more training data, a deterministic optimal strategy exists, and you do not need to ensure strategy consistency.
[0] There are deterministic approximations for subgames based on linear programming, but they require the subgame to be fully loaded in memory, which is infeasible for the whole game.
To establish a real winner, you need to play many games:
> As seen in the Claudico match (20), even 80,000 games may not be enough to statistically significantly separate players whose skill differs by a considerable margin [1]
It is possible to reduce the number of required games thanks to variance reduction techniques [1], but I don't think this is what the website does.
To answer the question - "which 'quality' of the LLMs this tournament then actually measures" - since we can't tell the winner reliably, I don't think we can even make particular claims about the LLMs.
However, it could be interesting to analyze the play from a "psychology profile perspective" of dark triad (psychopaths / machiavellians / narcissists). Essentially, these personality types have been observed to prefer some strategies and this can be quantified [2].
[1] DeepStack, https://static1.squarespace.com/static/58a75073e6f2e1c1d5b36...
[2] Generation of Games for Opponent Model Differentiation https://arxiv.org/pdf/2311.16781
If you directly give the distribution to the LLM, it is not doing anything interesting. It is just sampling from the strategy you tell it to play.
I found this out recently when I asked it to generate some anagrams for me. Then I asked how it did it.
Tried again in a new session with:
4, 9, 1, 6, 10, 3, 2, 5, 7, 8, 1, 9, 4, 6, 10, 2, 8, 5, 7, 3, 9, 1, 4, 6, 10, 8, 2, 7, 3, 5, 1, 9, 6, 10, 2, 4, 8, 3, 7, 5, 9, 1, 10, 4, 6, 2, 8, 3, 5, 7, 9, 10, 4, 6, 1, 2, 8, 3, 5, 7, 9, 1, 10, 6, 4, 2, 8, 5, 7, 3, 9, 10, 6, 4, 2, 1, 8, 5, 7, 3, 9, 10, 4, 6, 2, 8, 5, 7, 3, 9, 1, 10, 4, 6, 8, 2, 5, 7, 3, 9
Notice the issue?
Immediately:
Sure! [5, 2, 7, 1, 10, 4, 9, 8, 3, 6, 2, 5, 9, 7, 1, 10, 4, 8, 3, 6, 7, 3, 9, 2, 5, 8, 10, 1, 4, 6, 8, 4, 6, 10, 9, 7, 1, 5, 3, 2, 9, 10, 3, 1, 7, 2, 4, 8, 6, 5, 3, 8, 5, 2, 10, 9, 6, 1, 4, 7, 4, 9, 8, 6, 1, 3, 7, 2, 10, 5, 6, 7, 5, 9, 3, 10, 8, 2, 1, 4, 1, 3, 8, 5, 2, 6, 9, 4, 10, 7, 10, 5, 2, 9, 8, 3, 7, 1, 4, 6]
Edit: Using the same prompt as above:
-Incognito mode sessions get random:
[3, 10, 1, 7, 2, 8, 4, 9, 5, 6, 1, 4, 9, 2, 10, 3, 8, 7, 6, 5, 7, 3, 10, 8, 4, 2, 9, 1, 5, 6, 6, 8, 2, 9, 3, 10, 5, 7, 1, 4, 5, 9, 3, 7, 8, 2, 6, 10, 1, 4, 2, 7, 5, 9, 10, 8, 3, 4, 6, 1, 4, 1, 8, 10, 5, 9, 7, 6, 3, 2, 9, 5, 6, 2, 7, 10, 4, 3, 8, 1, 8, 4, 2, 9, 1, 6, 10, 5, 3, 7, 10, 6, 9, 3, 8, 5, 1, 7, 2, 4]
[8, 4, 2, 7, 10, 6, 1, 9, 5, 3, 2, 10, 6, 3, 8, 5, 9, 7, 4, 1, 7, 9, 5, 2, 6, 1, 10, 8, 3, 4, 4, 6, 10, 8, 7, 3, 9, 1, 2, 5, 3, 9, 8, 10, 2, 5, 6, 7, 1, 4, 6, 2, 7, 1, 8, 10, 9, 4, 3, 5, 9, 5, 4, 7, 10, 8, 3, 6, 2, 1, 1, 3, 8, 9, 2, 10, 4, 7, 6, 5, 10, 7, 9, 3, 4, 6, 8, 5, 2, 1, 5, 8, 6, 10, 9, 1, 7, 2, 4, 3]
-Normal browser sessions get loops:
3, 7, 1, 9, 5, 10, 4, 6, 2, 8, 1, 10, 3, 5, 7, 9, 2, 6, 8, 4, 9, 5, 3, 10, 1, 7, 6, 2, 8, 4, 5, 9, 10, 1, 3, 7, 4, 8, 6, 2, 9, 5, 10, 7, 1, 3, 8, 4, 6, 2, 5, 9, 10, 1, 7, 3, 4, 8, 6, 2, 5, 9, 10, 1, 3, 7, 4, 8, 2, 6, 5, 9, 10, 1, 3, 7, 4, 8, 6, 2, 5, 9, 10, 1, 7, 3, 8, 4, 6, 2, 5, 9, 10, 1, 7, 3, 4, 8, 6, 2
7, 3, 10, 2, 6, 9, 5, 1, 8, 4, 2, 10, 7, 5, 3, 6, 8, 1, 4, 9, 10, 7, 5, 2, 8, 4, 1, 6, 9, 3, 5, 10, 2, 7, 8, 1, 9, 4, 6, 3, 10, 7, 2, 5, 9, 8, 6, 4, 1, 3, 5, 9, 10, 8, 6, 2, 7, 4, 1, 3, 9, 5, 10, 7, 8, 6, 2, 4, 1, 3, 9, 5, 10, 7, 8, 2, 6, 4, 1, 9, 5, 10, 3, 7, 8, 6, 2, 4, 9, 1, 5, 10, 7, 3, 8, 6, 2, 4, 9, 1
This test was conducted with Android & Firefox 128; both ChatGPT sessions were not logged in, though the normal browsing profile had a few prior visits to chatgpt.com.
None of that was deterministic, and the hardest part was writing efficient Monte Carlo simulations that could weight each situation and average out a betting strategy close to the one from the player's hand history, while throwing in randomness in a band consistent with the player's own randomness in a given situation.
And none of it needed to touch on game theory. If it had, it would've been much better. LLMs would have no hope of conceptualizing any of that.
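Not the commenter's code, but a sketch of the equity-estimation half of such a Monte Carlo, assuming the third-party `treys` evaluator (`pip install treys`); the situation weighting and betting-band logic described above are omitted:

```python
from treys import Card, Deck, Evaluator  # third-party: pip install treys

def estimate_equity(hero, board, trials=10_000):
    """Average hero's win rate vs a random hand over random runouts."""
    evaluator = Evaluator()
    wins = 0.0
    for _ in range(trials):
        deck = Deck()
        # Remove the known cards from the shuffled deck.
        deck.cards = [c for c in deck.cards if c not in hero + board]
        villain = deck.draw(2)
        runout = board + deck.draw(5 - len(board))
        h = evaluator.evaluate(runout, hero)     # lower rank = stronger hand
        v = evaluator.evaluate(runout, villain)
        wins += 1.0 if h < v else 0.5 if h == v else 0.0
    return wins / trials

hero = [Card.new('Tc'), Card.new('4d')]
board = [Card.new('2s'), Card.new('Ts'), Card.new('Jh')]
print(f"{estimate_equity(hero, board):.1%}")
```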
Counter-argument: generating probabilistic tokens (a degree of randomness) is a core concept for an LLM.
Set the temperature to zero and that's exactly what you get. The point is the randomness is something applied externally, not a "core concept" for the LLM.
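For what it's worth, a minimal sketch of that point: temperature acts on the model's output scores from outside, and at temperature 0 sampling collapses to plain argmax:

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Sample an index from raw scores; temperature 0 degenerates to argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.0, 0.5]
print(sample_token(logits, 0))    # always 0
print(sample_token(logits, 1.0))  # usually 0, sometimes 1 or 2
```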
In some NN implementations, randomness is actually pretty important to keep the gradients from getting stuck at local minima/maxima. Is that true for LLMs, or is it not something that applies at all?
Eg you tend to randomly shuffle your corpus to train on. If you use drop-out (https://en.wikipedia.org/wiki/Dilution_(neural_networks)) you use randomness. You might also randomly perturb your training data. Lots of other sources of randomness that you might want to try.
And regardless, turning this into a system that has some notion of strategic consistency or contextual steering seems like a remarkably easy problem. Treating it as one API call in, one deterministic and constrained choice out is wrong.
It's in the first four words! Which parts have you read?
If you put the currently best poker algorithm in a tournament with mixed-skill-level players, how likely is the algorithm to get into the money?
Recognizing different skill levels quickly and adjusting your play to each opponent early on grows the pot very fast. I would imagine that playing against good players is a completely different game compared to mixed skill levels.
It would be fun to try!
In my scenario and tournament play. Are you sure?
I would be shocked to learn that there is a Nash equilibrium in the multi-player setting, or any kind of strategic stability.
> with five copies of Pluribus playing against one professional
Although this configuration is designed to water down the difficulty of the multi-player setting.
Pluribus against 2 professionals and 3 randos would be a better test. The two pros would take turns taking money from the 3 randos, and Pluribus would be left behind and confused if it could not read the table.
That's only true for heads-up play. It doesn't apply to poker tournaments.
It's not that the algorithm is currently unknown; it's the nature of the game that deterministic equilibrium strategies don't exist for anything but the most trivial games. It's very easy to prove as well (think Rock-Paper-Scissors).
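A toy simulation of that Rock-Paper-Scissors proof sketch (made-up players, just to illustrate the exploitation): once an adaptive opponent has seen a deterministic strategy, every subsequent round is lost.

```python
import random

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # move -> its counter

def deterministic_player(history):
    return "rock"  # any fixed rule suffers the same fate

def adaptive_opponent(history):
    if not history:
        return random.choice(list(BEATS))
    return BEATS[history[-1]]  # counter whatever was played last

history, score = [], 0
for _ in range(100):
    a, b = deterministic_player(history), adaptive_opponent(history)
    history.append(a)
    score += 1 if BEATS[b] == a else -1 if BEATS[a] == b else 0
print(score)  # about -99: everything after round one is a loss
```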
>>2) In practice, strong play has been achieved with: i) online search and ii) a mechanism to ensure strategy consistency. Without ii) an adaptive opponent can learn to exploit inconsistency weaknesses in a repeated play.
In practice strong play was achieved by computing approximate equilibria using various algorithms. I have no idea what you mean by "online search" or "mechanism to ensure strategy consistency". Those are not terms used by people who solve/approximate poker games.
>>3) LLMs do not have a mechanism for sampling from given probability distributions. E.g. if you ask LLM to sample a random number from 1 to 10, it will likely give you 3 or 7, as those are overrepresented in the training data.
This is not a big limitation imo. An LLM can give an answer like "it's likely mixed between call and a fold" and then you can do the last step yourself. Adding some form of RNG to an LLM is trivial as well and already often done (temperature etc.)
>>Based on these points, it’s not technically feasible for current LLMs to play poker strongly
Strong disagree on this one.
>>This is in contrast with chess, where there is a lot more training data, a deterministic optimal strategy exists, and you do not need to ensure strategy consistency.
You can have as much training data for poker as you have for chess. Just use a very strong program that approximates the equilibrium and generate it. In fact it's even easier to generate the data. Generating chess games is very expensive computationally while generating poker hands from an already calculated semi-optimal solution is trivial and very fast.
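A sketch of that data-generation idea, with a toy stand-in for the solved strategy table (the real thing would come from a CFR-style solver): emitting training examples is a lookup plus, optionally, a dice roll.

```python
import random

# Hypothetical solver output: infoset -> action distribution.
STRATEGY = {
    "BTN|AhKd|open":        {"raise": 0.95, "fold": 0.05},
    "BB|7s2c|facing_raise": {"fold": 0.92, "call": 0.08},
}

def training_example(infoset):
    dist = STRATEGY[infoset]
    # Train on the full distribution as the label, or sample a concrete
    # action if you want action-level data instead.
    action = random.choices(list(dist), weights=list(dist.values()))[0]
    return infoset, dist, action

print(training_example("BTN|AhKd|open"))
```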
The reason both games are hard for LLMs is that they require precision and LLMs are very bad at precision. I am not sure which game is easier to teach an LLM to play well. I would guess poker. They will get better at chess quicker though, as it's a more prestigious target, there is a much longer tradition of chess programming, and people understand it way better (things like game representation, move representation, etc.).
Imo poker is easier because it's easier to avoid huge blunders. In chess a minuscule difference in state can turn a good move into a losing blunder. Poker is much more stable, so general, not-so-precise pattern recognition should do better.
I am really puzzled by the "strategy consistency" term. You are a PhD, but you use a term that is not really used in either poker or chess programming. There really isn't anything special about poker in comparison to chess. Both games come down to: "here is the current state of the game - tell me what the best move is".
It's just that in poker the best/optimal move can be "split it to 70% call and 30% fold" or similar. LLMs in theory should be able to learn those patterns pretty well once they are exposed to a lot of data.
It's true that multiway poker doesn't have an "optimal" solution. It has an equilibrium one, but that's not guaranteed to do well. I don't think your point is about that though.
It's definitely not trivial. Solving it (or rather approximating the solution close enough to 0) was a big achievement. It also doesn't have a deterministic solution. A lot of actions in the solution are mixed.
The first is the hidden information: you don't know your opponents' holdings, which is to say everyone in the game has a different information set.
The second is that there's a variable number of players in the game at any time. Heads-up games are closer to solved. Mid-ring games have had some decent attempts made. Full-ring with 9 players is hard, and academic papers on it are sparse.
The third is the potential number of actions. In no-limit games there are a lot of potential actions, as you can bet in small decimal increments of a big blind. Betting 4.4 big blinds could be correct and profitable while betting 4.9 big blinds could be losing, so there's a lot to explore.
Thanks for making this more precise. Generally for imperfect-information games, I agree it's unlikely to have a deterministic equilibrium, and I tend to agree in the case of poker -- but I recall there was some paper that showed you can get something like 98% of equilibrium utility in poker subgames, which could make a deterministic strategy practical. (Can't find the paper now.)
> I have no idea what you mean by "online search"
Continual resolving done in DeepStack [1]
> or "mechanism to ensure strategy consistency"
Gadget game introduced in [3], used in continual resolving.
> "it's likely mixed between call and a fold"
Being imprecise like this would arguably not result in a super-human play.
> Adding some form of RNG to LLM is trivial as well and already often done (temperature etc.)
But this is in token space. I'd be curious to see a demonstration of sampling of a distribution (i.e. some uniform) in the "token space", not via external tool calling. Can you make an LLM sample an integer from 1 to 10, or from any other interval, e.g. 223 to 566, without an external tool?
> You can have as much training data for poker as you have for chess. Just use a very strong program that approximates the equilibrium and generate it.
You don't need an LLM under such scheme -- you can do a k-NN or some other simple approximation. But any strategy/value approximation would encounter the very same problem DeepStack had to solve with gadget games about strategy inconsistency [5]. During play, you will enter a subgame which is not covered by your training data very quickly, as poker has ~10^160 states.
> The reason both games are hard for LLMs is that they require precision and LLMs are very bad at precision.
How do you define "precision"?
> I am not sure which game is easier to teach an LLM to play well. I would guess poker.
My guess is Chess, because there is more training data and you do not need to construct gadget games or do ReBeL-style randomizations [4] to ensure strategy consistency [5].
[3] https://arxiv.org/pdf/1303.4441
[4] https://dl.acm.org/doi/pdf/10.5555/3495724.3497155
[5] https://arxiv.org/pdf/2006.08740
Yeah, I can see that for sure. That's also the holy grail of poker enthusiasts: "can we please have a non-mixed solution that is close enough". The problem is that 2% or even 1% of equilibrium utility is huge. Professional players are often not happy seeing solutions that are 0.5% or less from equilibrium (measured by how much the solution can be exploited).
>>Continual resolving done in DeepStack [1]
Right, thank you. I am very used to the term resolving but not "online search". The idea here is to first approximate the solution using betting abstraction (for example solving with 3 bet sizes) and then hope this gets closer to the real thing if we resolve parts of the tree with more sizes (those parts that become relevant for the current play).
>>Gadget game introduced in [3], used in continual resolving.
I don't see "strategy consistency" in the paper nor a gadget game. Did you mean a different one?
>>Being imprecise like this would arguably not result in a super-human play.
Well, you have noticed that we can get somewhat close with a deterministic strategy and that is one step closer. There is nothing stopping LLMs from giving more precise answers like 70-30 or 90-10 or whatever.
>>But this is in token space. I'd be curious to see a demonstration of sampling of a distribution (i.e. some uniform) in the "token space", not via external tool calling. Can you make an LLM sample an integer from 1 to 10, or from any other interval, e.g. 223 to 566, without an external tool?
It doesn't have to sample it. It just needs to approximate the function that takes a game state and outputs the best move. That move is a distribution, not a single action. It's purely about pattern recognition (like chess). It can even learn to output colors or w/e (yellow for 100-0, red for 90-10, blue for 80-20 etc.). It doesn't need to do any sampling itself, just recognize patterns.
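A sketch of that quantization idea with made-up color buckets: the model only needs to emit a coarse label, and ordinary code turns it into a sampled action.

```python
import random

# Hypothetical coarse labels an LLM could learn to emit instead of exact numbers.
BUCKETS = {
    "yellow": {"call": 1.00},
    "red":    {"call": 0.90, "fold": 0.10},
    "blue":   {"call": 0.80, "fold": 0.20},
}

def act(label: str) -> str:
    """Map a coarse model output to a concrete action; the randomness lives here."""
    dist = BUCKETS[label]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(act("red"))  # "call" ~90% of the time, "fold" ~10%
```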
>>You don't need an LLM under such scheme -- you can do a k-NN or some other simple approximation. But any strategy/value approximation would encounter the very same problem DeepStack had to solve with gadget games about strategy inconsistency [5]. During play, you will enter a subgame which is not covered by your training data very quickly, as poker has ~10^160 states.
Ok, thank you I see what you mean by strategy consistency now. It's true that generating data if you need resolving (for example for no-limit poker) is also computationally expensive.
However your point:
>You don't need an LLM under such scheme -- you can do a k-NN or some other simple approximation.
Is not clear to me. You can say that about any other game then, no? The point of LLMs is that they are good at recognizing patterns in a huge space and may be able to approximate games like chess or poker pretty efficiently unlike traditional techniques.
>>How do you define "precision"?
I mean that there are patterns that seem very similar but result in completely different correct answers. In chess a minuscule difference in positions may result in the same move being a winning one in one and a losing one in another. In poker, whether you call 25% more or 35% more when the bet size is 20% smaller is unlikely to result in a huge blunder. Chess is more volatile, and thus you need more "precision" telling patterns apart.
I realize it's not a technical term, but it's the one that comes to mind when you think about things LLMs are good and bad at. They are very good at seeing general patterns but weak when they need to be precise.
I think it's useful to distinguish a) what LLMs can do in theory, b) which non-LLM approaches we know work, and c) how to do it with LLMs.
In a) theory, LLMs with "thinking" rollouts are equivalent to a (finite-tape) Turing machine, so they can do anything a computer can, so a solution exists (given a large enough neural net/rollout). To do the sampling, I agree the LLM can use an external tool call. This is a good start!
For b), to achieve strong performance in poker, we know you can do continual resolving (e.g. search + gadget).
For c), "quantization" as you suggested is an interesting approach, but it goes against the spirit of "let's have a big neural net that can do any general task". You gave an example of how to quantize for a state that has 2 actions. But what about 3? 4? Or N? So in practice, to achieve such generality, you need to output in the token space.
On top of that, for poker, you'd need the LLM to somehow implement continual resolving/ReBeL (for equilibrium guarantees). To do all of this, you need either i) the LLM to call a CPU implementation of the resolver or ii) the LLM to execute instructions like a CPU.
I do believe i) is practically doable today, e.g. finetuning an LLM to incorporate a value function in its weights and call a resolver tool, but it's not something ChatGPT and others can do (to come back to my original parent post). Also, in such a finetuning process, you will likely trade off the LLM's generality for specialization.
> you can do a k-NN or some other simple approximation. [..] You can say that about any other game then, no?
Yes, you can approximate value function with any model (k-NN, neural net, etc).
> In poker if you call 25% more or 35% more if the bet size is 20% smaller is unlikely to result in a huge blunder. Chess is more volatile and thus you need more "precision" telling patterns apart.
I see. The same applies to chess, however: you can play mixed strategies there too, with a similar property, since you can linearly interpolate expected value between losing (-1) and winning (1).
Overall, I think being able to incorporate a value function within an LLM is super interesting research. There is some work there, e.g. Cicero [6], and certainly more should be done, e.g. having a neural net that is both a language model and able to do AlphaZero-style search.
[6] https://www.science.org/doi/10.1126/science.ade9097
>>On top of that, for poker, you'd need LLM to somehow implement continual resolving/ReBeL (for equilibrium guarantees). To do all of this, you need either i) LLM call the CPU implementation of the resolver or ii) the LLM to execute instructions like a CPU.
Maybe we don't. Maybe there are general patterns an LLM could pick up so it could make good decisions in all branches without resolving anything, just looking at the current state. For example, an LLM could learn to automatically scale calling/betting ranges depending on the bet size once it sees enough examples of solutions coming from algorithms that use resolving.
I guess what I am getting at is that intuitively there is not that much information in poker solutions in comparison to chess so there are more general patterns LLMs could pick up on.
I remember the discussion about the time heads-up limit holdem was solved and arguments that it's bigger than chess. I think it's clear now that solution to limit holdem is much smaller than solution to chess is going to be (and we haven't even started on compression there that could use internal structure of the game). My intuition is that no-limit might still be smaller than chess.
>>I see. The same applies to chess, however: you can play mixed strategies there too, with a similar property, since you can linearly interpolate expected value between losing (-1) and winning (1).
I mean that in chess the same move in a seemingly similar situation might be completely wrong or very right, and a little detail can turn it from the latter to the former. You need very "precise" pattern recognition to be able to distinguish between those situations. In poker, if you know 100% calling with top pair is right vs a river pot bet, you will not make a huge mistake if you also 100% call vs an 80% pot bet, for example.
When NN based engines appeared (early versions of Lc0) it was instantly clear they have amazing positional "understanding" but get lost quickly when the position required a precise sequence of moves.
They are dramatically different. There is no hidden information in chess, there are only two players in chess, the number of moves you can make is far smaller in chess, and there is no randomness in chess. This is why you never hear about EV in chess theory, but it’s central to poker.
Hidden information doesn't make a game more complicated. Rock-Paper-Scissors has hidden information but it's a very simple game, for example. You can argue there is no hidden information in poker either if you think in terms of ranges. Your inputs are the public cards on the board and the betting history - nothing hidden there. Your move requires a probability distribution across the whole range (all possible hands). Framed like that, hidden information in poker disappears. The task is just to find the best distributions so the strategy is unexploitable - same as in chess (you need to play moves that won't lose and preferably win if the opponent makes a mistake).
If you apply probabilistic methods it doesn’t remove hidden information from the problem. These are just quite literally the techniques used to deal with hidden information.
I wonder if there are people working on closing that gap.
LLMs can do sampling via external tools, but as I wrote in another thread, they can't do this in "token space". I'd be curious to see a demonstration of sampling from a distribution (i.e. some uniform) in the "token space", not via external tool calling. Can you make an LLM sample an integer from 1 to 10, or from any other interval, e.g. 223 to 566, without an external tool?
They could have a tool for that, tho.
E.g. when asked for a random number between 1 and 10, if 3 is returned too often, you penalize that in the fine-tuning process until the distribution is uniform.
The width of the models is typically wide enough to "explore" many possible actions, score them, and let the sampler pick the next action based on the weights. (Whether a given trained parameter set will be any good at it, is a different question.)
The number of attention heads for the context is similarly quite high.
And, as a matter of mechanics, the core neuron formulation (dot product input and a non-linearity) excels at working with ranges.
For example, when computing the counterfactual tree for a 9-way preflop: 9 players have up to 6 different times that they can be asked to perform an action (seat 0 can bet 1, seat 1 raises min, seat 2 calls, back to seat 0 raising min, with seat 1 calling, and seat 2 raising min, etc.). Each of those actions can be check, fold, bet min, raise the min (starting blinds of 100 are pretty high already), raise one more than the min, raise two more than the min, ... raise all in (with up to a million chips).
That's roughly (1,000,000.00 − 100.00 possible bet sizes) ^ (6 action opportunities per round) ^ (9 players), and that's just for preflop; then come the flop, turn, river, and showdown. Now imagine that we also have to simulate which cards they have and in which order they come on the streets (that greatly changes the value of the pot).
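Just the card-dealing part of that explosion is already enormous; a quick count of the ways to deal hole cards to nine seats, before any betting or board cards:

```python
from math import comb, prod

# Two hole cards to each of 9 distinguishable seats.
deals = prod(comb(52 - 2 * seat, 2) for seat in range(9))
print(f"{deals:.2e}")  # ~5.34e26 distinct deals
```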
As for LLMs being great at range stats, I would point you to the latest research by UChicago. Text trained LLMs are horrible at multiplication. Try getting any of them to multiply any non-regular number by e or pi. https://computerscience.uchicago.edu/news/why-cant-powerful-...
Don't take what I'm saying the wrong way, though. Masked attention and sequence-based context models are going to be critical to machines solving hidden-information problems like this. But Large Language Models trained on the web crawl and the Stack with text input will not be those models.
(Ignore for a moment that LLMs can lie just fine.)
What you are describing is exploring a range of counterfactuals. That's not lying.
To see that LLMs aren't capable of this, I present all of the prompt jailbreaks that rely on repeated admonitions. And that makes sense if you think about the training data. There's not a lot of human writing that takes a fact and then confidently asserts the opposite as data mounts.
LLMs produce the most likely response from the input embeddings. Almost always, the easiest continuation is a next token that agrees with the other tokens in the sequence. The problem in poker is that a good number of the tokens in the sequence are masked and/or controlled by a villain who is actively trying to deceive.
Also, notice that I'm careful to say LLMs and not to generalize to all attention-head + MLP models, as attention with softmax and dot product is a good universal function. Instead, it's the "large language model" part that makes these models a poor fit for poker. Human text doesn't have a latent space that's written about enough and thoroughly enough to have poker solved in there.
In game theory, the point of bluffing is not so much to make money from your bluff directly, but to mask when you are playing a genuinely good hand.
> [...] it's required to play some ranges, sometimes as if they were a different range; [...]
Why the mental gymnastics? Just say what the optimal play for 'some ranges' is, and then play that. The extra indirection in explanation might be useful for human intuition, but I'm not sure the machine needs that dressing up.
> LLMs produce the most likely response from the input embeddings. [...]
If I wanted to have my LLM play poker, I would ask it to suggest probabilities for what to play next, and then sample from there, instead of using the next-token sampler in the LLM to directly pick the action to take.
(But I'm not sure that's what the original article is doing.)
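For what that's worth, a minimal sketch of such a shim, assuming (hypothetically) that the model is prompted to return action probabilities as JSON; `call_llm` is a placeholder for whatever client is actually used:

```python
import json
import random

def call_llm(prompt: str) -> str:
    # Placeholder: substitute a real model client here.
    return '{"fold": 0.2, "call": 0.5, "raise": 0.3}'

def choose_action(game_state: str) -> str:
    raw = call_llm(
        "Return ONLY JSON mapping each legal action to a probability, "
        'e.g. {"fold": 0.2, "call": 0.5, "raise": 0.3}.\n' + game_state
    )
    dist = json.loads(raw)
    actions = list(dist)
    # The randomness lives out here, not in the model's token sampler.
    return random.choices(actions, weights=[dist[a] for a in actions])[0]

print(choose_action("board: 2s Ts Jh, hero: Tc 4d, pot: 420, to call: 170"))
```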
> The problem in poker is that a good amount of the tokens in the sequence are masked and/or controlled by a villain who is actively trying to deceive.
> Human text doesn't have a latent space that's written about enough and thoroughly enough to have poker solved in there.
I agree with both. Though it's still a fun exercise to pit contemporary off-the-shelf LLMs against each other here.
And perhaps add a purpose built poker bot to the mix as a benchmark. And also try with and without access to an external random sampler (like I suggested above). Or with and without access to eg being able to run freshly written Python code.
They lie better than most people lol.
I just tested this with Mistral's chat: I asked it to answer either "foo" or "bar" and said that I need both options to have the same probability. I did not mention the code interpreter or give any other instruction. It did generate and execute a basic `random.choice(["foo", "bar"])` snippet.
I'm assuming more mainstream models would do the same. And I'm assuming that a model would figure out that randomness is important when playing poker.
You can even see the pattern [1] in Claude's output, which is pretty funny.
[1] - https://imgur.com/a/NiwvW3d
I was thinking about developing a 5-max poker agent that can play decently (not superhumanly), but it still seems like uncharted territory. There's Pluribus, but it's limited to fixed stacks, very complex, and very computationally demanding to train, and I think also during gameplay.
I don't see why an LLM can't learn to play a mixed strategy. An LLM outputs a distribution over all tokens, which is then randomly sampled from.
> Has there been any meaningful progress after that?
There are attempts [0] at making the algorithms work for exponentially large beliefs (= ranges). In poker, these are constant-sized (players receive 2 cards at the beginning), which is not the case in most games. In many games you repeatedly draw cards from a deck and the number of histories/infosets grows exponentially. But nothing works well for search yet, and it is still an open problem. For just policy learning without search, RNAD [2] works okayish from what I've heard, but it is finicky with hyperparameters to get it to converge.
Most of the research I saw is concerned about making regret minimization more efficient, most notably Predictive Regret Matching [1]
> I was thinking about developing a 5-max poker
Oh, sounds like a lot of fun!
> I don't see why an LLM can't learn to play a mixed strategy. An LLM outputs a distribution over all tokens, which is then randomly sampled from.
I tend to agree, I wrote more in another comment. It's just not something an off-the-shelf LLM would do reliably today without lots of non-trivial modifications.
[0] https://arxiv.org/abs/2106.06068
[1] https://ojs.aaai.org/index.php/AAAI/article/view/16676
[2] https://arxiv.org/abs/2206.15378
CFR is still the best; however, as with chess, we need a network that can help evaluate the position. Unlike chess, the hard part isn't knowing a value; it's knowing what the current game position is. For that, we need something unique.
I'm pretty convinced that this is solvable. I've been working on rs-poker for quite a while. Right now we have a whole multi-handed arena implemented, and a multi-threaded counterfactual framework (no memory fragmentation, good cache coherency).
With BERT and some clever sequence encoding we can create a powerful agent. If anyone is interested, my email is: elliott.neil.clark@gmail.com
I am not sure that is true. Yes, it will likely give a 3 or 7, but that is because it is trying to represent the distribution from the training data. It's not trying for a random digit there; it's trying for what the data set does.
It would certainly be possible to give an AI the notion of a random digit: rather than training on fixed output examples, give it additional training to make it produce an embedding that is exactly equidistant from the tokens 0..9 when it wants a random digit.
You could then fine tune it to use that ability to generate sequences of random digits to provide samples in reasoning steps.
The technique I suggested would, I think, work on existing model inference methods. The ability already exists in the architecture. It's just a training adjustment to produce the parameters required to do so.
Would a LLM with tool calls be able to do this?
> sample a random number from 1 to 10
> ChatGPT: Here’s a random number between 1 and 10: 7
> again
> ChatGPT: Your random number is: 3
> give me 11 random numbers in a set with range 1-10, allowing duplicates
> ChatGPT: [3, 7, 1, 4, 9, 2, 6, 3, 10, 8, 5]
I repeated it three times, 3 and 7 were always the first two elements haha.
(I get why, and get why this is stupid to expect it to do, but it still gave me a laugh.)
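If you want more than an eyeball test, a quick chi-square sketch for draws that should be uniform on 1..10 (the critical value for 9 degrees of freedom at p = 0.05 is about 16.92):

```python
from collections import Counter

def chi_square_uniform(draws, k=10):
    """Chi-square statistic against a uniform distribution on 1..k."""
    counts = Counter(draws)
    expected = len(draws) / k
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(1, k + 1))

draws = [3, 7, 1, 4, 9, 2, 6, 3, 10, 8, 5]  # the ChatGPT sample above
print(round(chi_square_uniform(draws), 2))  # tiny samples won't show much; ask for thousands
```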
> give me 11 random numbers in a set with range 1-10, allowing duplicates. if you don't think an LLM can generate properly pseudorandom numbers, then use your tools to generate them.
This caused it to create and execute a Python script that returned the numbers, which, of course, worked.*
* Obviously don't leave it up to the model to decide whether it can do random numbers. I just wanted to see what it would do.
Playing specific games well requires specialized, game-specific skills. A general-purpose LLM generally lacks those. Future LLMs may be slightly better. But for the foreseeable future, the real increase in playing strength is having an LLM that knows when to call out to external tools, such as a specialized game engine. Which means that you're basically playing that game engine.
But if you allow an LLM to do that, there already are poker bots that can play at a professional level.
Eg you describe your variant of fantasy chess or funny Poker, and it would cobble together some ad hoc code that would help it play that game.
The code wouldn't need to be great from the get go, since the LLM can react to corner cases and errors.
https://en.wikipedia.org/wiki/Pluribus_(poker_bot)
You can have them output a probability distribution and then have normal code pick the action. There's other ways to do this, you don't need to make the LLM pick a random number.
It's not like an LLM can play poker without some shim around it. You're gonna have to interpret its results and take actions. And you want the LLM to produce a distribution either way before picking an explicit action from that distribution. Having the shim pick the random number instead of the LLM does not take anything away from it.
I just tried this on GPT-4 ("give me 100 random numbers from 1 to 10") and it gave me exactly 10 of each number 1-10, but in no particular order. Heh
I went and tested this, and asked ChatGPT for a random number between 1 and 10, four times.
It gave me 7, 3, 9, 2.
Both of the numbers you suggested as more likely came up as the first two. Seems you are correct!
(It was Veritasium but it was actually a number from 1 to 100, the most common number was 7 and the most common 2-digit number was 37: https://www.youtube.com/watch?v=d6iQrh2TK98.)
A few things to note:
So, important to note: this is not necessarily a good measure of an LLM's ability to play poker well, but it can to some extent tell us whether the models understand the rules (I would hope so!). But there are also some technical issues that make me suspicious... (was the site LLM-generated?)
[0] Think of it this way: we play a game of "who can flip the most heads", but we determine the number of coins we can flip by rolling some dice. If you do better on the dice roll, you're more likely to do better on the coin flip.
[1] LLAMA's early loss makes it hard to come back. This wouldn't explain the dive at hand ~570. The same, in reverse, can be said about a few of the positive models. But we'd need to look deeper, since this isn't a game of pure chance.
The rules state the LLMs do get "Notes hero has written about other players in past hands" and "Models have a maximum token limit for reasoning", so the outcome might be at least more interesting as a result.
The top models on the leaderboard are notably also the ones strongest in reasoning. They even show the models' notes, e.g. Grok on Claude: "About: claude Called preflop open and flop bet in multiway pot but folded to turn donk bet after checking, suggesting a passive postflop style that folds to aggression on later streets."
PS The sampling params also matter a lot (with temperature 0 the LLMs are going to be very consistent, going higher they could get more 'creative').
PPS The models getting statistics about other models' behavior seems kind of like cheating; they rely on it heavily, e.g. 'I flopped middle pair (tens) on a paired board (9s-Th-9d) against LLAMA, a loose passive player (64.5% VPIP, only 29.5% PFR)'
Even in 2-player No-Limit Hold’em, the number of possible game states is astronomically large — on the order of 10³¹ decision points. Because players can bet any amount (not just fixed options), this branching factor explodes far beyond games like chess.
Good poker requires bluffing and balancing ranges and deliberately playing suboptimally in the short term to stay unpredictable. This means an AI must learn probabilistic, non-deterministic strategies, not fixed rules. Plus, no facial cues or tells.
Humans adapt mid-game. If an AI never adjusts, a strong player could exploit it. If it does adapt, it risks being counter-exploited. Balancing this adaptivity is very difficult in uncertain environments.
Haven't seen it before, thanks. Are you affiliated with them?
Screenshot of the gameplay: https://pbs.twimg.com/media/GpywKpDXMAApYap?format=png&name=...
Post: https://x.com/0xJba/status/1907870687563534401
Article: https://x.com/0xJba/status/1920764850927468757
If anybody wants to spectate this, let us know and we can spin up a fresh tournament.
48 more comments available on Hacker News