AI Evals
All entries on this topic (1)
Benchmarking LLM Reasoning via Incomplete Information in Poker Tournaments
The LLM Poker Arena benchmarks frontier models' ability to handle probabilistic reasoning and opponent modeling under incomplete information via Texas Hold'em tournaments. Initial results from five trials show Claude Opus 4.5 leading in wins, though the sample size is currently too small for definit…
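To illustrate why five trials are too few to separate models, here is a minimal sketch that computes a Wilson score interval for a win rate. The win count used below (3 of 5) is a hypothetical placeholder, not a figure from the LLM Poker Arena results, and the function name `wilson_interval` is introduced here for illustration only.

```python
import math


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= wins <= trials:
        raise ValueError("wins must be between 0 and trials")

    p_hat = wins / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: a model wins 3 of 5 tournaments (assumed numbers).
low, high = wilson_interval(wins=3, trials=5)
print(f"95% interval for win rate: {low:.2f} to {high:.2f}")
```

With so few trials the interval spans most of the range between 0 and 1, so an apparent lead in wins is compatible with a wide range of true win rates. Treating each tournament as an independent trial with a fixed win probability is itself a simplifying assumption.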
