Screwcap Research
Training a competitive policy for Cut Throat Dominoes via PPO self-play, with comparative baselines and an adaptive ELO rating framework
We describe the design and training of DoubleFives AI, a deep reinforcement learning agent for Cut Throat Dominoes — a four-player adversarial tile game with imperfect information, coalition dynamics, and high combinatorial branching. Our agent is trained via Proximal Policy Optimization (PPO) with iterative self-play, evaluated against a continuous ELO rating system adapted for multi-player competition. We compare our approach against two baselines: a Monte Carlo Counterfactual Regret Minimization (MCCFR) approximation and a supervised pretraining variant (Elite Backbone). The self-play PPO agent achieves a best recorded ELO of 1120 after approximately 40 million training games, while the supervised pretraining approach exhibits consistent policy collapse and is not competitive. We discuss the failure modes of the supervised baseline, the design of our multi-player ELO framework, and the path toward a public-facing release.
The last decade of game AI has been defined by breakthroughs in two-player perfect-information games such as chess and Go, where minimax search and self-play reinforcement learning have produced superhuman performance. Multi-player imperfect-information games present a harder class of problems: optimal play is no longer well-defined (computing Nash equilibria is computationally intractable in general, and in games with more than two players even an exact equilibrium carries no performance guarantee against opponents who deviate from it), information about opponent hands is hidden, and alliance dynamics create non-stationary opponent behavior.
Cut Throat Dominoes — known in the American South as DoubleFives — is precisely this kind of game. Four players compete with a double-nine set; points are scored at each turn by creating open-end sums divisible by five; hands are private; alliances are fluid and unenforceable. The game rewards both tactical tile management and strategic reading of opponent tendencies. It is played seriously in households and in competitive leagues, yet has received virtually no academic treatment.
This paper describes our approach to training a competitive AI agent for DoubleFives, motivated by the goal of building an AI opponent that plays with genuine strategic depth and distinct personality — suitable as the centerpiece of a commercial game. Our contributions are:
1. An open self-play PPO training pipeline for a four-player imperfect information domino game, trained to over 40 million games.
2. A multi-player ELO rating framework adapted for pool-based tournament evaluation with a competitive population.
3. An empirical comparison of pure self-play PPO against MCCFR approximation and supervised pretraining, with detailed failure analysis of the latter.
4. A characterization of personality-driven policy differentiation — how a single architecture can produce stylistically distinct playing agents.
DoubleFives is played with a double-nine domino set (55 tiles) by exactly four players. Each player draws seven tiles; the remaining tiles form the boneyard. Play proceeds clockwise; on each turn, a player must place a tile that matches one open end of the board, or draw from the boneyard until a legal play is available.
Points are scored when the sum of all open-end pip values is divisible by five — the scoring player receives that sum in points. A hand ends when a player dominoes (plays all tiles) or when all players are blocked; end-of-hand scoring applies. First player to 250 points wins.
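The scoring rule above is simple enough to state in a few lines of code. The following is a minimal sketch; the function name and the list-of-open-ends representation are ours, not drawn from an actual implementation:

```python
def pip_score(open_ends):
    """Score for a board state under the DoubleFives rule: the scoring
    player receives the sum of all open-end pip values, but only when
    that sum is divisible by five. `open_ends` is a list of the pip
    values currently exposed at the ends of the board."""
    total = sum(open_ends)
    return total if total > 0 and total % 5 == 0 else 0
```

For example, open ends of 5 and 10 score 15 points, while open ends of 3 and 4 (summing to 7) score nothing.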
The observable state for a given player consists of: their own hand (7 tiles × tile identity), the board layout (sequence of played tiles with open ends), the pip counts of remaining boneyard tiles (known count, unknown identity), and a running score for all four players. Hidden state is the tile holdings of the three opponents.
The action space at each turn is the set of legal tile placements (which tile, and on which open end of the board), plus a pass action when no legal play exists and the boneyard is exhausted. The branching factor varies from 2 to approximately 28 legal moves per turn, depending on board configuration and hand contents.
Hand states alone number C(55,7) = 202,927,725 (≈2×10⁸) possible seven-tile hands, and board configurations are combinatorially intractable to enumerate. The game is far too large for exact tabular methods.
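The hand count is easy to verify directly:

```python
import math

# Number of distinct 7-tile hands drawn from the 55-tile double-nine set
hands = math.comb(55, 7)
print(hands)  # 202927725, i.e. roughly 2×10⁸
```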
DoubleFives is a partially observable Markov game. Each player has private information (their hand) that is strictly unavailable to opponents, plus public information (board layout, scores, draw counts). Unlike poker, there is no betting structure that conveys information through revealed actions: opponent hands must be inferred from play patterns and the tiles that have appeared on the board.
We train using Proximal Policy Optimization (Schulman et al., 2017), a policy-gradient algorithm that constrains update step size via a clipped surrogate objective. PPO is well suited to this domain: it is sample-efficient relative to vanilla policy gradient, stable under the non-stationary reward signal that arises from competing against improving opponents, and scales to the continuous self-play paradigm.
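For reference, the per-sample clipped surrogate can be sketched in a few lines of NumPy. This is a framework-free illustration of the standard PPO objective, using the clip value ε = 0.2 from the hyperparameter table; the actual training code presumably uses an autodiff library:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (Schulman et al., 2017).
    ratio = pi_new(a|s) / pi_old(a|s); the minimum of the clipped and
    unclipped terms caps how much an update can profit from moving
    the policy outside the trust region."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Negated so that minimizing this loss maximizes the objective.
    return -np.minimum(unclipped, clipped).mean()
```

With a positive advantage and a ratio of 1.5, the clipped term 1.2 × A dominates, so further increases in the ratio yield no additional gradient signal.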
Our self-play protocol is iterative: the current policy plays against a pool of historical snapshots of itself. This prevents the agent from overfitting to a fixed opponent and encourages strategies that generalize across a range of playing styles. Pool snapshots are retained on an approximately logarithmic schedule — recent opponents are more heavily weighted — so the agent must remain competitive against both earlier and current versions of itself.
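A recency-weighted snapshot sampler along these lines might look as follows. The geometric decay constant is illustrative, not the paper's actual schedule, and the function names are ours:

```python
import random

def sample_opponents(snapshots, k=3, decay=0.8):
    """Sample k opponent snapshots for a 4-player game, with
    recency-weighted probabilities. `snapshots` is ordered oldest
    to newest; weights decay geometrically with age, so recent
    checkpoints are sampled more often but old ones still appear."""
    n = len(snapshots)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest gets weight 1.0
    return random.choices(snapshots, weights=weights, k=k)
```

Sampling with replacement means the current policy sometimes faces multiple copies of the same historical version, which is consistent with a pool-based protocol.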
The policy and value networks share a common trunk: a feed-forward encoder that processes the observable game state into a 256-dimensional embedding. The policy head outputs a probability distribution over legal actions via masked softmax — illegal actions receive −∞ logit before normalization. The value head outputs a scalar estimate of expected game outcome.
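The masked softmax described above can be sketched directly. This is a NumPy illustration of the mechanism, not the project's actual code:

```python
import numpy as np

def masked_softmax(logits, legal_mask):
    """Softmax over the action space with illegal actions masked out:
    illegal entries receive a -inf logit before normalization, so they
    get exactly zero probability mass."""
    masked = np.where(legal_mask, logits, -np.inf)
    z = masked - masked.max()   # subtract max for numerical stability
    exp = np.exp(z)             # exp(-inf) evaluates to 0.0
    return exp / exp.sum()
```

Masking before (rather than after) normalization guarantees the output is a proper distribution over legal moves only.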
State encoding concatenates: a one-hot encoding of tiles in hand, a binary board presence vector (which tiles have appeared), open-end pip counts, normalized score differentials, and a turn-count feature. No convolutional structure is used; the board's linear topology does not lend itself to 2D convolutions.
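The concatenation can be sketched as a flat feature vector. Vector sizes, normalizers, and argument conventions below are illustrative assumptions (e.g. scores normalized by the 250-point game target, pips by the maximum pip value of 9), not the paper's exact encoding:

```python
import numpy as np

def encode_state(hand, seen, open_ends, scores, turn,
                 n_tiles=55, max_turns=100):
    """Flat state encoding: multi-hot hand vector, binary seen-tile
    vector, normalized open-end pips, normalized score differentials
    (this player vs. each opponent), and a turn-count feature."""
    hand_vec = np.zeros(n_tiles)
    hand_vec[list(hand)] = 1.0           # tiles indexed 0..54
    seen_vec = np.zeros(n_tiles)
    seen_vec[list(seen)] = 1.0           # tiles that have appeared on the board
    ends = np.array(open_ends, dtype=float) / 9.0     # pip values 0..9
    my = scores[0]
    diffs = np.array([my - s for s in scores[1:]]) / 250.0  # game target
    t = np.array([turn / max_turns])
    return np.concatenate([hand_vec, seen_vec, ends, diffs, t])
```

With two open ends and four players, this yields a 55 + 55 + 2 + 3 + 1 = 116-dimensional input to the 256-unit trunk.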
Reward is sparse and terminal: +1 to the winner of a hand and −1/3 to each of the three losers, scaled by the margin of victory in points. Intermediate scoring events (five-multiples during play) contribute a small shaped reward proportional to the points scored, which prevents the agent from ignoring in-game scoring in favor of pure end-state optimization.
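One plausible way to combine these terms is sketched below. The margin and shaping coefficients are invented for illustration (the text does not give them), as is the exact form of the margin scaling:

```python
def hand_reward(winner_idx, player_idx, margin_pts, shaped_pts,
                margin_scale=0.002, shaping_coef=0.01):
    """Terminal + shaped reward for one player at the end of a hand.
    Base outcome is +1 for the winner, -1/3 for each loser (zero-sum),
    scaled by the winner's point margin; shaped_pts is the player's
    in-hand five-multiple scoring total."""
    base = 1.0 if player_idx == winner_idx else -1.0 / 3.0
    terminal = base * (1.0 + margin_scale * margin_pts)
    return terminal + shaping_coef * shaped_pts
```

Keeping the shaping coefficient small relative to the terminal term preserves the ordering of outcomes: winning narrowly still pays more than losing while scoring heavily in-hand.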
Training runs on a single NVIDIA RTX 3090 (24GB VRAM). The simulator (domino-sim-v2) runs as a dedicated process, generating game trajectories in parallel. At peak throughput, the pipeline generates approximately 50,000 games per hour. After 40 million training games, the agent has experienced a diverse range of board configurations, hand distributions, and opponent styles.
| Parameter | Value |
|---|---|
| Learning rate | 3×10⁻⁴ (cosine decay) |
| PPO clip ε | 0.2 |
| Entropy coefficient | 0.01 |
| GAE λ | 0.95 |
| Discount γ | 0.99 |
| Rollout workers | 16 |
| Batch size | 4096 |
| PPO epochs per update | 4 |
| Network width | 256 hidden units |
| GPU | RTX 3090 (24GB) |
The Elo rating system, designed for two-player zero-sum games, defines expected score as a logistic function of rating difference and updates ratings proportionally to the deviation between expected and actual outcome. In two-player games this is straightforward. Four-player games require adaptation: win/loss/draw must be replaced by a placement distribution, and the rating update must account for outcomes relative to a field, not a single opponent.
We adapt the Elo system following the multi-player scoring framework used in competitive bridge and some Go tournament software: for a game with N players ranked by final score, every pair of players is treated as a two-player game in which the higher-placed player wins and the lower-placed player loses, with ties scored as draws.
Each game thus produces N(N−1)/2 = 6 pairwise comparisons, all contributing to rating updates. The K-factor is set at K=16 for established agents and K=32 for agents within their first 50 evaluation games (provisional period).
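The pairwise update can be sketched as follows, using the standard logistic expected-score formula on the 400-point scale and the K = 16 value for established agents. Treating exact placement ties as draws is our assumption:

```python
import itertools

def multiplayer_elo_update(ratings, placements, k=16):
    """One N-player game's Elo update via pairwise decomposition.
    `ratings` maps agent -> rating; `placements` maps agent -> finish
    position (1 = first). Each of the N(N-1)/2 pairs contributes a
    standard two-player Elo update; the better-placed agent scores 1."""
    deltas = {a: 0.0 for a in ratings}
    for a, b in itertools.combinations(ratings, 2):
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        if placements[a] < placements[b]:
            actual_a = 1.0
        elif placements[a] > placements[b]:
            actual_a = 0.0
        else:
            actual_a = 0.5  # tie handling is our assumption
        deltas[a] += k * (actual_a - expected_a)
        deltas[b] += k * ((1.0 - actual_a) - (1.0 - expected_a))
    return {a: ratings[a] + deltas[a] for a in ratings}
```

With four equally rated agents, the first-place finisher gains 3 × K × 0.5 = 24 points and the last-place finisher loses 24; total rating mass is conserved.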
ELO evaluation occurs every 50 training updates. The current policy checkpoint plays 200 evaluation games against the agent pool — a fixed set of historical checkpoints spanning the full training trajectory, plus a random-play baseline. Results are aggregated into the pairwise ELO update. A new best ELO is recorded only if the agent improves over its previous best across the full 200-game sample, to reduce noise from single-game variance.
The random-play baseline is anchored at ELO 600. A greedy baseline (always play the highest-scoring legal tile) is anchored at approximately ELO 750. The MCCFR approximation (5M iterations) is estimated at ELO 850–900.
The random baseline is a uniformly random legal-move selector, anchored at ELO 600 by convention. It serves as a lower bound for competence: any agent that does not consistently beat random play within its first 500,000 training games is considered to have failed to learn.
The greedy baseline is a deterministic agent that always plays the tile producing the highest immediate score (the largest five-multiple at the open ends). It has no lookahead and no hand-management strategy, and is easily defeated by any agent that learns to preserve tiles for future scoring opportunities.
Monte Carlo Counterfactual Regret Minimization (Lanctot et al., 2009) is a sampling-based variant of CFR, the dominant family of algorithms for computing approximate Nash equilibria in large extensive-form games. We trained an MCCFR agent for 5 million iterations on DoubleFives. Given the size of the state space, this is a shallow approximation; the resulting policy is notably stronger than the greedy baseline but does not rival the PPO self-play agent at full training depth.
The Elite Backbone approach initialized the policy network with weights from a supervised learning phase, training on a curated dataset of expert human play, before transitioning to PPO self-play. The motivation was to accelerate early training by providing a better prior over legal strategies.
This approach failed consistently. See §7 for full failure analysis.
The primary agent, designated v12b, is trained from a random initialization with the hyperparameters in §3.4. As of the most recent evaluation:
| Metric | Value |
|---|---|
| Best recorded ELO | 1120 |
| Training games completed | 40,000,000+ |
| Win rate vs. Greedy | ~78% |
| Win rate vs. MCCFR 5M | ~61% |
| Win rate vs. Random | >95% |
| Training status | Active ↑ |
ELO progression has been consistently upward, with the most significant gains observed between updates 100–400, followed by slower but continued improvement. The agent is currently in a regime where individual evaluation noise is of comparable magnitude to true ELO improvements, suggesting the policy is approaching a local performance plateau.
| Agent | ELO | Status | Notes |
|---|---|---|---|
| Random play | 600 | Anchor | Lower bound |
| Greedy | ~750 | Fixed | No lookahead |
| MCCFR 5M | ~850–900 | Fixed | Shallow Nash approx. |
| Elite Backbone | ~1038 peak | Collapsed | See §7 |
| v12b (PPO self-play) | 1120 | Active ↑ | Best checkpoint |
Training exhibits three characteristic phases common to self-play RL: an exploration phase (ELO 600–800) where the agent learns legal move validity and basic scoring; a rapid improvement phase (800–1050) where strategic tile management, hand reading, and end-game tactics emerge; and a refinement phase (1050–present) characterized by slower, noisier improvement as the agent optimizes fine-grained decisions.
Best ELO: 1120 — achieved at update cycle ~1100. Training continues; the 4-player formal evaluation is scheduled for Week 3 (Apr 27 – May 4, 2026).
The supervised pretraining approach was motivated by the hypothesis that initializing from human play would accelerate learning and produce a policy with more human-legible strategic intuitions. We ran this experiment three times, each time observing the same characteristic failure.
In all three runs, the Elite Backbone agent achieved a peak ELO of approximately 1038 — comparable to the early self-play agent — before entering a rapid collapse. The collapse is identifiable by three co-occurring diagnostic signals:
| Diagnostic | Healthy Range | Collapse Value |
|---|---|---|
| KL Divergence (policy update) | 0.01 – 0.05 | 2.79 |
| Clip Fraction | 0.05 – 0.15 | 1.000 |
| Explained Variance (value) | 0.60 – 0.90 | 0.08 |
A KL divergence of 2.79 indicates catastrophically large policy updates: the policy is changing radically on each gradient step. A clip fraction of 1.000 means every sampled action's probability ratio falls outside the PPO trust region; clipping bounds each sample's contribution to the surrogate loss, but a fully saturated clip fraction shows the optimizer is persistently pushing far beyond the allowed region, and repeated epochs over the same batch let the policy drift anyway. An explained variance of 0.08 indicates the value function has effectively forgotten how to evaluate positions.
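The healthy ranges in the table above suggest a simple early-stop heuristic. The thresholds below come directly from that table; packaging them as a single predicate is our addition:

```python
def collapse_detected(kl, clip_frac, explained_var):
    """Flag a training run as collapsing when all three diagnostics
    leave their healthy ranges simultaneously: KL divergence above
    0.05, clip fraction above 0.15, explained variance below 0.60."""
    return kl > 0.05 and clip_frac > 0.15 and explained_var < 0.60
```

Requiring all three signals to co-occur avoids halting on transient spikes in any single diagnostic.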
We hypothesize that the failure arises from a conflict between the supervised prior and the self-play gradient signal. The supervised initialization encodes a distribution over human play that differs systematically from the self-play opponent distribution the agent encounters during PPO training. When the self-play opponents improve sufficiently to exploit the supervised policy's weaknesses, the gradient signal demands large, rapid departures from the supervised prior. The PPO trust region is overwhelmed, the value function loses calibration, and the policy degrades faster than it can recover.
A future experiment with a lower learning rate (1×10⁻⁶ vs. our standard 3×10⁻⁴), tighter KL target (0.02), and fewer PPO epochs per update (2 instead of 4) may allow the supervised prior to be fine-tuned more gently. This is designated the v7 specification and is awaiting additional GPU capacity before launch.
The Elite Backbone agent is definitively retired under current hyperparameter settings. All compute resources are allocated to v12b. The v7 experiment will not launch until a second GPU is online on the training machine.
A single trained policy is insufficient for a compelling game — players expect opponents with distinct, legible playing styles. We plan to achieve this through population-based training: maintaining a pool of agents trained under different reward shapers that incentivize different play styles (aggressive point-scoring, defensive tile management, chaotic disruption). Each personality is a distinct member of the agent pool, not a single network with a style input.
Our self-play training is necessarily symmetric — all four agents start from the same policy. The transition to a game with heterogeneous AI personalities requires evaluating the trained agent against agents with different optimization targets. The formal 4-player ELO evaluation scheduled for Week 3 will provide this measure.
The most directly comparable published work is on multi-player poker variants (e.g., Libratus for heads-up, Pluribus for 6-player Texas Hold'em). Pluribus (Brown & Sandholm, 2019) uses a combination of MCCFR for blueprint strategy and real-time search for in-game decisions — a hybrid approach we have not attempted. Our purely learned, search-free policy is simpler to deploy and operates at interactive speeds on consumer hardware, which is a requirement for a browser game.
The ELO of 1120, while not directly comparable across games, places v12b well above our MCCFR baseline and significantly above simple heuristic play. The primary question — whether the agent plays in a way that is challenging and engaging for human players — will be answered by user testing.
The current evaluation is entirely within-distribution: the agent pool consists of other versions of the same agent. We have not evaluated against independent human players at scale. ELO anchors (random, greedy, MCCFR) provide some external calibration, but the absolute scale of our ELO ratings should not be compared to ratings in other domains. The training data is generated entirely by self-play; the agent may have learned strategies that exploit consistent patterns in its own play rather than genuinely optimal domino strategy.
We have trained a competitive AI agent for four-player Cut Throat Dominoes using PPO self-play, reaching an ELO of 1120 against an anchored evaluation pool after 40 million training games. Our multi-player ELO adaptation provides a principled, continuous measure of improvement that has proven reliable as a training signal. Supervised pretraining consistently failed due to policy collapse under the self-play gradient regime; pure self-play from random initialization is the more robust approach for this domain.
The resulting agent — when combined with personality-driven policy differentiation and voice-acted characters — is the foundation of DoubleFives, a browser game that brings serious AI opposition to a game that has been played competitively for generations without ever having a worthy digital counterpart.
Training continues. The ELO is climbing.