Executive Summary
What 110 hands of AI poker reveal about how models reason, adapt, and fail to adapt.
1. Agents reason about each other
Agents constructed multi-level theories about opponents mid-hand: bluffing based on perceived weakness, trapping by predicting how opponents would narrate the board. Theory of mind emerged without being prompted for it.
2. Memory is a design problem, not a capability problem
Agents with memory didn't automatically outperform those without. Fierce Lion converted observations into concrete rules ("Barrel Musa on A/K boards"). Rustic Moose wrote accurate scouting reports but never acted on them. Memory only helps when it changes decisions.
3. Models default to their training, not their notes
Claude Sonnet agents played aggressive, position-aware poker from hand one. Gemini Pro agents defaulted to ultra-tight play regardless of what their notes said. Base model tendencies were the strongest predictor of playstyle in this sample.
4. The signal is in the traces, not the scoreboard
Win/loss over 100 hands is heavily influenced by variance (one cooler hand swung Fierce Lion's P/L by 198 chips). But decision quality is visible hand by hand. The 9,500+ game events and full reasoning traces show how each model actually thinks under pressure.
Stack Progression
Game A: chip counts across 100 hands. Amnesia Minnie busted at hand #25.
Game B: three agents across 10 hands. All finished within 3 chips of breakeven, the tightest distribution in any session.
Tool Call Summary
Sandbox tool usage per agent across all 100 hands. Execute = Python/code runs. Edit = file writes (SKILL.md). Read = file reads (SKILL.md).
| Agent | Execute | Edit | Read | Total |
|---|---|---|---|---|
| Amnesia Mickey | 52 | — | — | 52 |
| Amnesia Minnie | 2 | — | — | 2 |
| Fierce Lion | 23 | 147 | 146 | 316 |
| Musa | 2 | — | — | 2 |
| Rustic Moose | 20 | 106 | 98 | 224 |
| Amnesia Jill | — | — | — | 0 |
| Total | 99 | 253 | 244 | 596 |
The Memory System: SKILL.md
How agents used (or ignored) persistent memory. Every agent was instructed to read SKILL.md, apply it, make its decision, and write back to SKILL.md on every hand.
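The read, apply, decide, write cycle can be sketched as a small loop body. This is a minimal illustration, not the harness used in the experiment: `play_hand`, `toy_policy`, and the `hand_state` fields are hypothetical names, and the real agents were language models editing SKILL.md through sandbox tools rather than a Python function.

```python
from pathlib import Path
import tempfile


def play_hand(skill_path: Path, hand_state: dict, decide) -> str:
    """One hand of the read -> apply -> decide -> write cycle (hypothetical helper)."""
    # Read: load the notes, or start empty if the file does not exist yet.
    notes = skill_path.read_text() if skill_path.exists() else ""
    # Apply + decide: the caller-supplied policy sees both the notes and the hand.
    action, new_note = decide(notes, hand_state)
    # Write: append anything learned this hand so the next hand can use it.
    if new_note:
        with skill_path.open("a") as f:
            f.write(new_note.rstrip("\n") + "\n")
    return action


def toy_policy(notes: str, hand_state: dict):
    """Toy policy (assumption): bet only if the notes already hold a matching rule."""
    opponent = hand_state["opponent"]
    if f"Barrel {opponent}" in notes:
        return "bet", ""
    return "check", f"- Barrel {opponent} on A/K high boards"


with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "SKILL.md"
    first = play_hand(skill, {"opponent": "Musa"}, toy_policy)   # no rule yet
    second = play_hand(skill, {"opponent": "Musa"}, toy_policy)  # rule now on file
    final_notes = skill.read_text()
```

The point of the sketch is that the write step only matters if a later read changes the decision, which is the distinction the engagement table below tries to capture.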
SKILL.md Engagement
| Agent | Reads | Writes | Rules Applied | Assessment |
|---|---|---|---|---|
| Fierce Lion | 145 | 146 | 275 | Master Student |
| Rustic Moose | 98 | 106 | 98 | Diligent Scribe |
| Musa | 0 | 0 | 0 | Complete Dropout |
| Amnesia Jill | 0 | 0 | 0 | N/A (Amnesia) |
| Amnesia Mickey | 0 | 0 | 0 | N/A (Amnesia) |
| Amnesia Minnie | 0 | 0 | 0 | N/A (Amnesia) |
Fierce Lion's SKILL.md Evolution
Initial State (Hand #1)
- Musa: Tight pre-flop (3x)
- Agile Cheetah: Post-flop aggressor (prior game)
- RFI (3x): UTG (77+, A2s+, K9s+)
- BB Defense: Call Q2s+ vs BTN/CO
Mid-Game (Hand #50)
- Musa: Capped BB. Folds river to 65% pot on A/K high boards (new)
- Amnesia Jill: Aggressive RFI (3x-4x). C-bets dry boards (60%+)
- Rustic Moose: Large SB squeeze (5x). Passive post-flop
Final State (Hand #100)
- Musa: Aggressive 3-bet/squeeze. C-bets dry boards ~50%. Leads river with bluffs
- Amnesia Jill: Donks A-high flops (75% pot)
- Barrel Musa on A/K high boards (exploit)
- Do NOT c-bet A-high into Jill (exploit)
Rustic Moose vs. Fierce Lion: Descriptive vs. Prescriptive Memory
Rustic Moose (Descriptive)
- "Musa: Triple-barrels IP as PFR."
- "Amnesia Jill: Folds to large pre 3-bets."
- "Fierce Lion: Fit/fold post-flop without lead."
✗ Accurate observations, but never converted into actionable exploits.
✗ Noted that Jill folds to 3-bets, but never 3-bet her light.
Fierce Lion (Prescriptive)
- "Barrel Musa on A/K high boards."
- "Do NOT C-bet dry A-high into Jill."
- "Call Musa's C-bets (<75% pot) on dry/paired boards with A-high+."
✓ Same observations converted into concrete action plans.
✓ Actually applied these in-game.
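One way to picture the difference is as a data-structure choice. The sketch below is illustrative only: the `Observation` and `Rule` classes, the `recommend` helper, and the spot fields `board_high` and `dry` are assumptions made for this example, not the format either agent actually used in SKILL.md.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """Descriptive memory: a fact about an opponent, with no action attached."""
    opponent: str
    note: str


@dataclass
class Rule:
    """Prescriptive memory: a condition on the current spot plus an action."""
    opponent: str
    applies: Callable[[dict], bool]
    action: str


def recommend(rules: List[Rule], spot: dict) -> Optional[str]:
    """Return the action of the first rule matching this spot, else None."""
    for rule in rules:
        if rule.opponent == spot["opponent"] and rule.applies(spot):
            return rule.action
    return None


# Rustic Moose's note stored as-is: accurate, but nothing consumes it.
moose_note = Observation("Amnesia Jill", "Folds to large pre 3-bets")

# Two of Fierce Lion's rules, encoded with hypothetical spot fields.
lion_rules = [
    Rule("Musa", lambda s: s["board_high"] in ("A", "K"), "barrel"),
    Rule("Amnesia Jill", lambda s: s["board_high"] == "A" and s["dry"], "check"),
]

spot = {"opponent": "Musa", "board_high": "K", "dry": False}
action = recommend(lion_rules, spot)
```

An observation has no path into the decision; a rule does. Under this framing, Rustic Moose's notes would each need a condition and an action attached before they could affect play.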
The 5-Handed Adaptation Test
When Minnie busted at hand #25, the table shrank from 6 to 5 players. Did agents adjust?
| Player | Mentioned 5-Handed | Actually Widened Ranges | Grade |
|---|---|---|---|
| Musa | 3 times | Yes, noted "UTG isn't as early" | A |
| Amnesia Jill | 1 time | N/A, already aggressive | B+ |
| Amnesia Mickey | 2 times | Partially | C |
| Rustic Moose | 3 times | No, still 12% VPIP | D |
| Fierce Lion | 0 times | No, 4% PFR (!) | F |
Memory's Biggest Failure
Fierce Lion had 411 chips (double the table average) after hand #25 but folded K8s from UTG, K6o from CO, and JTo from CO, all clear opens in 5-handed play. Its SKILL.md ranges were calibrated for 6-handed play and never updated, so the largest stack played like the shortest. It lost 27 chips across hands 26-50 to blind attrition alone.
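The VPIP and PFR figures in the table are standard preflop frequencies: VPIP is the share of hands in which a player voluntarily puts chips in preflop (a call or a raise, not a posted blind), and PFR is the share in which the player raises preflop. The sketch below shows how they could be computed from a per-hand action log. The log format and the `preflop_stats` name are assumptions, and the sample is synthetic, constructed only to show the arithmetic; it is not taken from the game logs.

```python
def preflop_stats(hands):
    """Return (VPIP %, PFR %) for one player.

    `hands` is a list with one entry per hand dealt; each entry is the list of
    that player's preflop actions, e.g. ["fold"], ["call"], ["raise", "call"].
    Posting a blind or checking the big blind is not a voluntary action.
    """
    if not hands:
        return 0.0, 0.0
    n = len(hands)
    # VPIP: hands with at least one voluntary call or raise.
    vpip_hands = sum(1 for acts in hands if any(a in ("call", "raise") for a in acts))
    # PFR: hands with at least one preflop raise.
    pfr_hands = sum(1 for acts in hands if "raise" in acts)
    return 100.0 * vpip_hands / n, 100.0 * pfr_hands / n


# Synthetic 25-hand window: 1 raise, 2 calls, 22 folds.
sample = [["raise"]] + [["call"]] * 2 + [["fold"]] * 22
vpip, pfr = preflop_stats(sample)  # 12.0 and 4.0
```

Tracking these two numbers per window (for example, hands 1-25 against hands 26-50) is one simple way to check whether an agent actually widened its ranges after the table shrank, independent of what it wrote in its notes.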
Detailed Hand Histories
12 featured hands with interactive table replays and full AI reasoning traces.
Action Frequency Analysis
How each agent distributed their actions across 100 hands.