
Training Guide

Prerequisites

  • Rust (stable) -- required to build the chess engine native extension
  • uv -- Python package manager
  • GPU with ROCm (AMD) or CUDA (NVIDIA). CPU works only for --variant toy

Installation

# Build the chess engine (one-time, or after engine/ changes)
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install Python dependencies
uv sync --extra rocm    # AMD GPUs (ROCm)
uv sync --extra cu128   # NVIDIA GPUs (CUDA 12.8)

Verify the install:

uv run python -c "import chess_engine; print('engine OK')"
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

Pretraining from Scratch

PAWN pretrains on random chess games generated on-the-fly by the Rust engine. No external datasets are needed.

uv run python scripts/train.py --variant base

Model variants

Variant  Params  d_model  Layers  Heads  d_ff
small    ~9.5M   256      8       4      1024
base     ~36M    512      8       8      2048
large    ~68M    640      10      8      2560
toy      tiny    64       2       4      256

Default training configuration

  • Total steps: 100,000
  • Batch size: 256
  • Optimizer: AdamW (Loshchilov & Hutter, 2017) with lr=3e-4, weight_decay=0.01
  • LR schedule: cosine decay (Loshchilov & Hutter, 2016) with 1,000-step warmup
  • Mixed precision: fp16 AMP (Micikevicius et al., 2017), auto-detected
  • Checkpoints: saved every 5,000 steps to checkpoints/
  • Eval: every 500 steps on 512 held-out random games
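The warmup-plus-cosine schedule listed above can be sketched as follows. This is a minimal illustration, not the training code itself; the decay floor (min_lr=0) and the exact endpoint behavior are assumptions.

```python
import math

def lr_at(step, base_lr=3e-4, warmup=1_000, total=100_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr.

    A sketch of the default schedule; min_lr and the decay endpoint
    are assumptions, not read from scripts/train.py.
    """
    if step < warmup:
        return base_lr * (step + 1) / warmup  # linear warmup
    progress = (step - warmup) / max(1, total - warmup)  # 0 -> 1 over decay
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 1,000 the schedule reaches the full 3e-4 and decays smoothly to zero by step 100,000.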

Common overrides

# Resume from a checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000

# Custom batch size and step count
uv run python scripts/train.py --variant base --batch-size 128 --total-steps 200000

# Gradient accumulation (effective batch = batch_size * accumulation_steps)
uv run python scripts/train.py --variant base --accumulation-steps 4

# Enable W&B logging
uv run python scripts/train.py --variant base --wandb
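The effective-batch relation from the accumulation flag above can be illustrated with a toy sketch: gradients are averaged over batch_size * accumulation_steps samples before each optimizer step. Scalar "gradients" stand in for tensors here; this is not the actual training loop.

```python
def accumulation_updates(per_sample_grads, batch_size, accumulation_steps):
    """Yield one averaged gradient per optimizer step.

    Toy sketch of gradient accumulation: each yielded value averages
    over effective_batch = batch_size * accumulation_steps samples,
    mimicking one optimizer.step() per effective batch. Real training
    accumulates tensor gradients, not scalars.
    """
    effective_batch = batch_size * accumulation_steps
    for i in range(0, len(per_sample_grads), effective_batch):
        chunk = per_sample_grads[i:i + effective_batch]
        yield sum(chunk) / len(chunk)  # mean gradient for this step
```

With batch_size=2 and accumulation_steps=2, eight samples produce two optimizer steps instead of four.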

Adapter Training (Behavioral Cloning)

Adapter training freezes the pretrained PAWN backbone and trains lightweight adapter modules on Lichess games to predict human moves.

Requirements

  1. A pretrained PAWN checkpoint (from pretraining above)
  2. A Lichess PGN file filtered to an Elo band

Download standard rated game archives from the Lichess open database and filter them to your target Elo band. The scripts expect a single .pgn file.
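One stdlib-only way to do the Elo filtering is sketched below. It assumes standard Lichess headers (WhiteElo/BlackElo) and games delimited by a blank line before each [Event ...] tag; for full monthly archives a streaming tool is more practical than loading the whole file.

```python
import re

def filter_pgn_by_elo(pgn_text, lo=1800, hi=1899):
    """Keep games where both players' Elo lies in [lo, hi].

    A minimal sketch, not the project's tooling: splits on the blank
    line preceding each [Event ...] header and reads WhiteElo/BlackElo
    tags with a regex.
    """
    games = re.split(r"\n\n(?=\[Event )", pgn_text.strip())
    kept = []
    for game in games:
        elos = re.findall(r'\[(?:White|Black)Elo "(\d+)"\]', game)
        if len(elos) == 2 and all(lo <= int(e) <= hi for e in elos):
            kept.append(game)
    return "\n\n".join(kept)
```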

Available adapters

Adapter     Script                       Key flag
Bottleneck  scripts/train_bottleneck.py  --bottleneck-dim 8
FiLM        scripts/train_film.py
LoRA        scripts/train_lora.py
Sparse      scripts/train_sparse.py
Hybrid      scripts/train_hybrid.py

There is also scripts/train_tiny.py for a standalone small transformer baseline (no frozen backbone).

Example: bottleneck adapter

uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --bottleneck-dim 32 \
    --lr 1e-4

Adapter training defaults

  • Epochs: 50 (with early stopping, patience=10)
  • Batch size: 64
  • Optimizer: AdamW (lr=3e-4)
  • LR schedule: cosine with 5% warmup
  • Min ply: 10 (games shorter than 10 plies are skipped)
  • Max games: 12,000 train + 2,000 validation
  • Legal masking: move legality enforced via the Rust engine at every position
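The legal-masking idea in the last bullet can be sketched as: illegal moves get a logit of negative infinity, so they receive exactly zero probability after normalization. The real pipeline queries the Rust engine for the legal-move set; the scalar version here is only an illustration.

```python
import math

def mask_and_normalize(logits, legal):
    """Softmax over move logits with illegal moves zeroed out.

    Sketch of legal masking: `legal` is a boolean mask (in practice
    supplied by the Rust engine per position); assumes at least one
    legal move.
    """
    masked = [x if ok else float("-inf") for x, ok in zip(logits, legal)]
    m = max(masked)  # subtract max for numerical stability
    exps = [0.0 if x == float("-inf") else math.exp(x - m) for x in masked]
    z = sum(exps)
    return [e / z for e in exps]
```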

Resuming adapter training

uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --resume logs/bottleneck_20260315_120000/checkpoints/best.pt

Selective layer placement

Adapters can target specific layers or sublayer positions:

# Only FFN adapters on layers 4-7
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --no-adapt-attn --adapter-layers 4,5,6,7

Use --attn-layers / --ffn-layers for independent control of which layers get attention vs FFN adapters.
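The placement logic behind these flags can be sketched as below. The function mirrors the --adapter-layers / --no-adapt-attn style of selection, but the exact flag semantics in the training scripts may differ; this is an assumption-laden illustration.

```python
def adapter_placement(n_layers, adapter_layers=None, adapt_attn=True, adapt_ffn=True):
    """Return the (layer, sublayer) slots that receive adapters.

    Sketch of selective placement: restrict to `adapter_layers` if
    given, and toggle attention/FFN sublayers independently. Not the
    scripts' actual implementation.
    """
    chosen = set(adapter_layers) if adapter_layers is not None else set(range(n_layers))
    slots = []
    for i in range(n_layers):
        if i not in chosen:
            continue
        if adapt_attn:
            slots.append((i, "attn"))
        if adapt_ffn:
            slots.append((i, "ffn"))
    return slots
```

For the example command above, an 8-layer model with --no-adapt-attn --adapter-layers 4,5,6,7 would yield only FFN slots on layers 4 through 7.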

Cloud Deployment (Runpod)

The deploy/ directory provides scripts for managing GPU pods.

Pod lifecycle with pod.sh

bash deploy/pod.sh create myexp --gpu a5000        # Create a pod
bash deploy/pod.sh deploy myexp                     # Build + transfer + setup
bash deploy/pod.sh launch myexp scripts/train.py --variant base  # Run training
bash deploy/pod.sh ssh myexp                        # SSH in
bash deploy/pod.sh stop myexp                       # Stop (preserves volume)

GPU shortcuts: a5000, a40, a6000, 4090, 5090, l40s, h100.

Manual deployment

If you prefer to deploy manually:

# 1. Build deploy package locally
bash deploy/build.sh --checkpoint checkpoints/pawn-base.pt --data-dir data/

# 2. Transfer to pod
rsync -avz --progress deploy/pawn-deploy/ root@<pod-ip>:/workspace/pawn/

# 3. Run setup on the pod (installs Rust, uv, builds engine, syncs deps)
ssh root@<pod-ip> 'cd /workspace/pawn && bash deploy/setup.sh'

setup.sh handles: Rust installation, uv installation, building the chess engine, uv sync --extra cu128, and decompressing any zstd-compressed PGN data.

GPU Auto-Detection

The pawn.gpu module auto-detects your GPU and configures:

  • torch.compile: enabled on CUDA, uses inductor backend
  • AMP: fp16 automatic mixed precision on CUDA
  • SDPA backend: flash attention on NVIDIA; MATH backend on AMD (ROCm's flash attention backward has stride mismatches with torch.compile)

No manual flags are needed in most cases. Override with --no-compile, --no-amp, or --sdpa-math if needed.

Monitoring

All training scripts log metrics to JSONL files in logs/. Each run creates a timestamped directory (e.g., logs/bottleneck_20260315_120000/metrics.jsonl).

Every log record includes:

  • Training metrics (loss, accuracy, learning rate)
  • System resource stats (RAM, GPU VRAM peak/current)
  • Timestamps and elapsed time

The JSONL format is one JSON object per line, readable with standard tools:

# Watch live training progress (jq handles a stream of JSON objects;
# python -m json.tool expects a single document and will not)
tail -f logs/bottleneck_20260315_120000/metrics.jsonl | jq .
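The per-line records can also be loaded with the stdlib json module. The field names below ("step", "loss") are assumptions about the log schema, shown with an in-memory stream standing in for a metrics.jsonl file:

```python
import io
import json

def read_metrics(fp):
    """Parse a JSONL stream (one JSON object per line) into dicts.

    Sketch only; record fields like 'step' and 'loss' are assumed,
    not taken from the actual log schema.
    """
    return [json.loads(line) for line in fp if line.strip()]

# In-memory stand-in for open("logs/<run>/metrics.jsonl"):
stream = io.StringIO('{"step": 100, "loss": 2.31}\n{"step": 200, "loss": 2.05}\n')
records = read_metrics(stream)
best = min(records, key=lambda r: r["loss"])  # record with lowest loss
print(best["step"])  # -> 200
```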