# Theoretical Accuracy Ceiling

PAWN is trained on uniformly random chess games. Since each move is drawn uniformly from the legal move set, top-1 accuracy has a hard theoretical ceiling: no model, however large, can exceed it.

## Three ceilings

### Unconditional ceiling: E[1/N_legal] = 6.43%

At each position, the move is drawn uniformly from N legal moves. The best a predictor can do without any context is pick one at random: accuracy = 1/N. Averaged over all positions in random games, this gives 6.43%.

A model that exceeds this ceiling has learned something beyond just "which moves are legal": it has learned to estimate the number of legal moves at each position and bias predictions toward positions with fewer options.
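The expectation is straightforward to compute once per-position legal-move counts are in hand. A minimal sketch with stdlib Python; the counts below are illustrative stand-ins, not counts from actual random games:

```python
from statistics import mean

# Hypothetical per-position legal-move counts. In the real pipeline these
# would come from replaying uniformly random games with a chess library.
legal_move_counts = [20, 22, 31, 18, 4, 27, 35, 2, 16, 25]

# Unconditional ceiling: the best context-free predictor guesses uniformly
# at each position, so its expected accuracy is E[1/N_legal].
unconditional_ceiling = mean(1 / n for n in legal_move_counts)
print(f"E[1/N_legal] = {unconditional_ceiling:.4f}")
```

Note that E[1/N] exceeds 1/E[N] (Jensen's inequality), which is why positions with few legal moves pull the ceiling up disproportionately.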

### Naive conditional ceiling: 6.44%

A zero-cost analytical estimate of outcome conditioning. At each position, legal moves that lead to an immediate terminal state with a different outcome than the actual game are excluded, and accuracy = 1/(N_legal - N_wrong).

This barely exceeds the unconditional ceiling (1.00x boost) because immediate terminal states are rare: most moves at most positions lead to non-terminal continuations, so the filter has almost nothing to exclude.
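The 1-ply filter reduces to simple arithmetic. A minimal sketch (the function name is hypothetical, not part of the repo's script):

```python
def naive_conditional_accuracy(n_legal: int, n_wrong_terminal: int) -> float:
    """Best accuracy once legal moves that immediately end the game with
    the wrong outcome are excluded: 1 / (N_legal - N_wrong)."""
    return 1.0 / (n_legal - n_wrong_terminal)

# At almost every position no legal move ends the game, so the filter
# excludes nothing and the ceiling matches the unconditional one:
print(naive_conditional_accuracy(30, 0))  # same as 1/30

# Near the end of a game the filter can fire, e.g. 2 of 5 legal moves
# terminate the game with the wrong result:
print(naive_conditional_accuracy(5, 2))   # same as 1/3
```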

### MCTS conditional ceiling: 7.92%

The full Monte Carlo estimate. At each sampled position, every legal move is tried and 32 random continuations are played out to estimate P(outcome | move, history). The Bayes-optimal predictor picks the move most consistent with the known outcome.
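The rollout estimates feed a simple posterior calculation: the actual move is uniform over legal moves, so P(move | outcome) is proportional to P(outcome | move), and the Bayes-optimal predictor's accuracy is the normalized maximum. A sketch under that assumption (function name and toy numbers are illustrative, not from the script):

```python
def bayes_optimal_accuracy(p_outcome_given_move: list[float]) -> float:
    """Given rollout estimates of P(actual outcome | move) for each legal
    move, the posterior over the actual (uniformly drawn) move is
    proportional to these values. The optimal predictor picks the argmax,
    and its expected top-1 accuracy is the normalized max."""
    total = sum(p_outcome_given_move)
    return max(p_outcome_given_move) / total

# Toy position with 4 legal moves: rollouts suggest move 0 is far more
# consistent with the known outcome than the others.
print(bayes_optimal_accuracy([0.8, 0.1, 0.1, 0.1]))  # ~0.727 vs 0.25 unconditional

# When the outcome is uninformative, the ceiling collapses to uniform:
print(bayes_optimal_accuracy([1.0, 1.0, 1.0, 1.0]))  # 0.25
```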

PAWN's input sequence begins with an outcome token (WHITE_CHECKMATES, STALEMATE, PLY_LIMIT, etc.). This leaks information about the game's trajectory, making some moves more predictable:

  • Checkmate games: The final move must deliver checkmate. Knowing this raises the ceiling at the last ply from ~5% to ~14%.
  • Ply limit games: Knowing the game lasts 255 plies constrains the move distribution slightly.
  • Stalemate games: The final position has no legal moves but isn't check β€” very constraining on late moves.

## Adjusted accuracy

| Metric | Value |
|---|---|
| Unconditional ceiling (E[1/N_legal]) | 6.43% |
| Naive conditional ceiling (1-ply filter) | 6.44% |
| MCTS conditional ceiling (32 rollouts) | 7.92% |
| Conditioning boost (naive) | 1.00x |
| Conditioning boost (MCTS) | 1.23x |

For a model with top-1 accuracy A:

  • Adjusted (unconditional) = A / 6.43% β€” measures how much the model has learned about chess legality. Values > 100% mean it has learned structure beyond just legal moves.
  • Adjusted (naive conditional) = A / 6.44% β€” essentially the same as unconditional; confirms that 1-ply lookahead explains almost none of the outcome conditioning benefit.
  • Adjusted (MCTS conditional) = A / 7.92% β€” measures how close the model is to the Bayes-optimal predictor with perfect outcome knowledge. This is the tighter bound.

## Current model results (step ~69K)

| Variant | Top-1 | vs Uncond | vs Naive Cond | vs MCTS Cond |
|---|---|---|---|---|
| large (68M) | 6.9% | 107% | 107% | 87% |
| base (36M) | 6.9% | 107% | 107% | 87% |
| small (10M) | 6.5% | 101% | 101% | 82% |

All models exceed the unconditional and naive conditional ceilings, confirming they learn chess structure beyond move legality. The large and base models reach 87% of the MCTS conditional ceiling.

## Per-outcome breakdown

| Outcome | Uncond | Naive Cond | MCTS Cond | Positions |
|---|---|---|---|---|
| White checkmated | 5.26% | 5.26% | 13.79% | 328 |
| Black checkmated | 5.02% | 5.02% | 13.64% | 388 |
| Stalemate | 7.22% | 7.22% | 18.67% | 125 |
| Insufficient material | 7.17% | 7.17% | 18.61% | 256 |
| Ply limit | 6.51% | 6.51% | 6.97% | 8,618 |

The naive conditional ceiling equals the unconditional ceiling across all outcome types; the 1-ply filter never fires in practice. The MCTS ceiling shows the real conditioning benefit: decisive outcomes (checkmate, stalemate, insufficient material) get a ~2.6x boost, while ply limit games, the vast majority, show only 1.07x because knowing the game goes the distance provides minimal per-move information.
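The per-outcome boosts quoted above fall directly out of the table; a small sketch with the ceilings copied from it:

```python
# Per-outcome ceilings from the breakdown table: (unconditional, MCTS).
ceilings = {
    "white_checkmated":      (0.0526, 0.1379),
    "black_checkmated":      (0.0502, 0.1364),
    "stalemate":             (0.0722, 0.1867),
    "insufficient_material": (0.0717, 0.1861),
    "ply_limit":             (0.0651, 0.0697),
}

# Conditioning boost = MCTS ceiling / unconditional ceiling.
for outcome, (uncond, mcts) in ceilings.items():
    print(f"{outcome}: {mcts / uncond:.2f}x")
```

Decisive outcomes land around 2.6x, while the ply-limit row is 1.07x, matching the summary above.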

## Reproducing

```bash
# Default: 2000 games, 32 rollouts/move, 2% sample rate
uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069

# Higher precision (slower)
uv run python scripts/compute_theoretical_ceiling.py --n-games 10000 --rollouts 64 --sample-rate 0.05
```

Results are saved to `data/theoretical_ceiling.json`.

## Caveats

  • The MCTS ceiling is an estimate, not exact. With more rollouts and higher sample rates, the estimate improves but computation time increases quadratically.
  • The ceiling assumes the model has perfect knowledge of P(outcome | move, history). In practice, the model must learn this from data, so the achievable accuracy for a finite model is somewhat below the ceiling.
  • Game length information is implicit in the outcome token (e.g., PLY_LIMIT implies 255 plies). A model could theoretically use position in the sequence to estimate remaining game length, further improving predictions.