# Training Guide
## Prerequisites

- Rust (stable) -- required to build the chess engine native extension
- uv -- Python package manager
- A GPU with ROCm (AMD) or CUDA (NVIDIA); CPU works only for `--variant toy`
## Installation

```bash
# Build the chess engine (one-time, or after engine/ changes)
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install Python dependencies
uv sync --extra rocm   # AMD GPUs (ROCm)
uv sync --extra cu128  # NVIDIA GPUs (CUDA 12.8)
```

Verify the install:

```bash
uv run python -c "import chess_engine; print('engine OK')"
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```
## Pretraining from Scratch

PAWN pretrains on random chess games generated on the fly by the Rust engine. No external datasets are needed.

```bash
uv run python scripts/train.py --variant base
```
### Model variants

| Variant | Params | d_model | Layers | Heads | d_ff |
|---|---|---|---|---|---|
| small | ~9.5M | 256 | 8 | 4 | 1024 |
| base | ~36M | 512 | 8 | 8 | 2048 |
| large | ~68M | 640 | 10 | 8 | 2560 |
| toy | tiny | 64 | 2 | 4 | 256 |
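The parameter columns can be loosely sanity-checked against the architecture columns. A minimal sketch, assuming a standard transformer block (Q/K/V/output attention projections plus a two-matrix FFN); embeddings, biases, and norms are ignored, so these estimates deliberately undershoot the table:

```python
def approx_transformer_params(d_model: int, n_layers: int, d_ff: int) -> int:
    """Block weights only: 4 attention projections plus a 2-matrix FFN per layer.
    Embeddings, biases, and layer norms are excluded, so this is a lower bound."""
    attn = 4 * d_model * d_model      # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff          # up- and down-projection
    return n_layers * (attn + ffn)

for name, d, n, dff in [("small", 256, 8, 1024), ("base", 512, 8, 2048), ("large", 640, 10, 2560)]:
    print(f"{name}: ~{approx_transformer_params(d, n, dff) / 1e6:.1f}M block weights")
```

The gap between these lower bounds and the table's totals is consistent with the excluded embedding and head parameters.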
### Default training configuration

- Total steps: 100,000
- Batch size: 256
- Optimizer: AdamW (Loshchilov & Hutter, 2017) with lr=3e-4, weight_decay=0.01
- LR schedule: cosine decay (Loshchilov & Hutter, 2016) with 1,000-step warmup
- Mixed precision: fp16 AMP (Micikevicius et al., 2017), auto-detected
- Checkpoints: saved every 5,000 steps to `checkpoints/`
- Eval: every 500 steps on 512 held-out random games
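The schedule above (linear warmup into cosine decay) is a pure function of the step count. A sketch under the stated defaults; the function name and the decay-to-zero floor are illustrative assumptions, not the trainer's actual API:

```python
import math

BASE_LR, WARMUP, TOTAL = 3e-4, 1_000, 100_000

def lr_at(step: int) -> float:
    """Linear warmup for WARMUP steps, then cosine decay to 0 at TOTAL steps."""
    if step < WARMUP:
        return BASE_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(500), lr_at(1_000), lr_at(100_000))
```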
### Common overrides

```bash
# Resume from a checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000

# Custom batch size and step count
uv run python scripts/train.py --variant base --batch-size 128 --total-steps 200000

# Gradient accumulation (effective batch = batch_size * accumulation_steps)
uv run python scripts/train.py --variant base --accumulation-steps 4

# Enable W&B logging
uv run python scripts/train.py --variant base --wandb
```
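Gradient accumulation trades step time for memory: gradients from several micro-batches are summed before a single optimizer step, so the update matches one large batch. A framework-free sketch of the idea (the key detail is scaling each micro-batch gradient by `1 / accumulation_steps`):

```python
# Toy 1-parameter "model": loss_i = (w - y_i)**2, so dloss/dw = 2 * (w - y_i).
def grad(w: float, y: float) -> float:
    return 2 * (w - y)

w, lr = 0.0, 0.1
micro_batches = [[1.0, 2.0], [3.0, 4.0]]   # two micro-batches of 2 samples
accumulation_steps = len(micro_batches)

accum = 0.0
for mb in micro_batches:
    # Average within the micro-batch, then scale by 1/accumulation_steps so the
    # accumulated gradient equals the mean over the full effective batch of 4.
    g = sum(grad(w, y) for y in mb) / len(mb)
    accum += g / accumulation_steps
w -= lr * accum                            # one optimizer step

full_batch = sum(grad(0.0, y) for y in [1.0, 2.0, 3.0, 4.0]) / 4
assert abs(accum - full_batch) < 1e-12     # same update as one big batch
```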
## Adapter Training (Behavioral Cloning)

Adapter training freezes the pretrained PAWN backbone and trains lightweight adapter modules on Lichess games to predict human moves.

### Requirements

- A pretrained PAWN checkpoint (from pretraining above)
- A Lichess PGN file filtered to an Elo band

Download standard rated game archives from the Lichess open database, filtered to your target Elo band. The scripts expect a single `.pgn` file.
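Lichess monthly archives are not pre-filtered by rating, so a filtering pass is usually needed. A minimal stdlib-only sketch (real pipelines often use `python-chess` or `pgn-extract`; the header parsing here is deliberately naive and assumes standard `[WhiteElo "…"]` / `[BlackElo "…"]` tags):

```python
import re

ELO_TAG = re.compile(r'\[(?:White|Black)Elo "(\d+)"\]')

def in_band(game_text: str, lo: int, hi: int) -> bool:
    """Keep a game only if both players' Elo ratings fall inside [lo, hi]."""
    elos = [int(m.group(1)) for m in ELO_TAG.finditer(game_text)]
    return len(elos) == 2 and all(lo <= e <= hi for e in elos)

def filter_pgn(text: str, lo: int, hi: int) -> str:
    # Games in a PGN dump are separated by a blank line before each [Event tag;
    # splitting on that boundary is a simplification of full PGN parsing.
    games = re.split(r'\n\n(?=\[Event )', text)
    return "\n\n".join(g for g in games if in_band(g, lo, hi))
```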
### Available adapters

| Adapter | Script | Key flag |
|---|---|---|
| Bottleneck | scripts/train_bottleneck.py | --bottleneck-dim 8 |
| FiLM | scripts/train_film.py | |
| LoRA | scripts/train_lora.py | |
| Sparse | scripts/train_sparse.py | |
| Hybrid | scripts/train_hybrid.py | |
There is also `scripts/train_tiny.py` for a standalone small transformer baseline (no frozen backbone).
### Example: bottleneck adapter

```bash
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --bottleneck-dim 32 \
    --lr 1e-4
```
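The bottleneck dimension controls adapter capacity: each adapter down-projects to the bottleneck, applies a nonlinearity, and up-projects back to d_model, added residually. A rough trainable-parameter count, assuming base's d_model=512 with one attention and one FFN adapter per layer across 8 layers (the per-layer placement and bias terms are assumptions, not read from the script):

```python
def bottleneck_params(d_model: int, bottleneck: int) -> int:
    """One adapter: W_down (d×b) + b_down (b) + W_up (b×d) + b_up (d).
    The residual connection itself adds no parameters."""
    return d_model * bottleneck + bottleneck + bottleneck * d_model + d_model

d_model, bottleneck, n_layers, adapters_per_layer = 512, 32, 8, 2
total = bottleneck_params(d_model, bottleneck) * n_layers * adapters_per_layer
print(f"{total:,} trainable params (~{100 * total / 36e6:.1f}% of the ~36M backbone)")
```

This is the point of adapter training: with `--bottleneck-dim 32`, roughly half a million parameters train against a frozen ~36M backbone.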
### Adapter training defaults

- Epochs: 50 (with early stopping, patience=10)
- Batch size: 64
- Optimizer: AdamW (lr=3e-4)
- LR schedule: cosine with 5% warmup
- Min ply: 10 (games shorter than 10 plies are skipped)
- Max games: 12,000 train + 2,000 validation
- Legal masking: move legality enforced via the Rust engine at every position
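Legal masking excludes illegal moves before the softmax rather than merely penalizing them: their logits are set to -inf, so they receive exactly zero probability and contribute no gradient. A framework-free sketch of the technique (the move-level details of the Rust engine's legality check are not shown):

```python
import math

def masked_softmax(logits: list[float], legal: list[bool]) -> list[float]:
    """Set illegal logits to -inf, then apply a numerically stable softmax;
    illegal moves end up with probability exactly 0."""
    masked = [x if ok else float("-inf") for x, ok in zip(logits, legal)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]   # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

# Three candidate moves; the second is illegal in the current position.
probs = masked_softmax([2.0, 1.0, 3.0], [True, False, True])
```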
### Resuming adapter training

```bash
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --resume logs/bottleneck_20260315_120000/checkpoints/best.pt
```
### Selective layer placement

Adapters can target specific layers or sublayer positions:

```bash
# Only FFN adapters on layers 4-7
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --no-adapt-attn --adapter-layers 4,5,6,7
```

Use `--attn-layers` / `--ffn-layers` for independent control of which layers get attention vs FFN adapters.
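The placement flags resolve to a simple per-(layer, sublayer) decision. A hedged sketch of that resolution, with flag semantics inferred from the example above rather than read from the script:

```python
def adapter_plan(n_layers: int, adapter_layers=None,
                 adapt_attn: bool = True, adapt_ffn: bool = True):
    """Return the (layer, sublayer) pairs that receive an adapter.
    adapter_layers=None means every layer, mirroring the CLI default."""
    layers = range(n_layers) if adapter_layers is None else adapter_layers
    plan = []
    for i in layers:
        if adapt_attn:
            plan.append((i, "attn"))
        if adapt_ffn:
            plan.append((i, "ffn"))
    return plan

# --no-adapt-attn --adapter-layers 4,5,6,7 on the 8-layer base model:
plan = adapter_plan(8, adapter_layers=[int(x) for x in "4,5,6,7".split(",")],
                    adapt_attn=False)
```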
## Cloud Deployment (Runpod)

The `deploy/` directory provides scripts for managing GPU pods.

### Pod lifecycle with pod.sh

```bash
bash deploy/pod.sh create myexp --gpu a5000                       # Create a pod
bash deploy/pod.sh deploy myexp                                   # Build + transfer + setup
bash deploy/pod.sh launch myexp scripts/train.py --variant base   # Run training
bash deploy/pod.sh ssh myexp                                      # SSH in
bash deploy/pod.sh stop myexp                                     # Stop (preserves volume)
```

GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`.
### Manual deployment

If you prefer to deploy manually:

```bash
# 1. Build the deploy package locally
bash deploy/build.sh --checkpoint checkpoints/pawn-base.pt --data-dir data/

# 2. Transfer to the pod
rsync -avz --progress deploy/pawn-deploy/ root@<pod-ip>:/workspace/pawn/

# 3. Run setup on the pod (installs Rust, uv, builds engine, syncs deps)
ssh root@<pod-ip> 'cd /workspace/pawn && bash deploy/setup.sh'
```

`setup.sh` handles Rust installation, uv installation, building the chess engine, `uv sync --extra cu128`, and decompressing any zstd-compressed PGN data.
## GPU Auto-Detection

The `pawn.gpu` module auto-detects your GPU and configures:

- torch.compile: enabled on CUDA, using the inductor backend
- AMP: fp16 automatic mixed precision on CUDA
- SDPA backend: flash attention on NVIDIA; MATH backend on AMD (ROCm's flash attention backward has stride mismatches with torch.compile)

No manual flags are needed in most cases. Override with `--no-compile`, `--no-amp`, or `--sdpa-math` if needed.
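The selection policy can be sketched as a pure function of two platform flags. This is an illustrative reconstruction, not the actual `pawn.gpu` code; note that PyTorch ROCm builds also report `torch.cuda.is_available()` as True, with `torch.version.hip` distinguishing AMD from NVIDIA:

```python
def gpu_config(cuda_available: bool, is_rocm: bool) -> dict:
    """Mirror the policy above: compile + fp16 AMP on any CUDA-like device,
    flash-attention SDPA on NVIDIA, MATH SDPA on AMD (ROCm) and CPU."""
    if not cuda_available:
        return {"compile": False, "amp": None, "sdpa": "math"}
    return {
        "compile": True,                        # torch.compile, inductor backend
        "amp": "fp16",                          # automatic mixed precision
        "sdpa": "math" if is_rocm else "flash", # ROCm flash-attn backward is broken
    }

print(gpu_config(True, False))   # NVIDIA
print(gpu_config(True, True))    # AMD / ROCm
```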
## Monitoring

All training scripts log metrics to JSONL files in `logs/`. Each run creates a timestamped directory (e.g., `logs/bottleneck_20260315_120000/metrics.jsonl`).

Every log record includes:

- Training metrics (loss, accuracy, learning rate)
- System resource stats (RAM, GPU VRAM peak/current)
- Timestamps and elapsed time

The JSONL format is one JSON object per line, readable with standard tools:
```bash
# Watch live training progress (--json-lines handles one object per line)
tail -f logs/*/metrics.jsonl | python -m json.tool --json-lines
```
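For offline analysis the same files parse with the stdlib `json` module, one object per line. A minimal sketch; the `step` and `loss` field names are assumptions about the record schema, and the `StringIO` stands in for an open `metrics.jsonl`:

```python
import io
import json

# Stand-in for open("logs/<run>/metrics.jsonl"); field names are assumed.
sample = io.StringIO(
    '{"step": 500, "loss": 2.31}\n'
    '{"step": 1000, "loss": 1.97}\n'
)

records = [json.loads(line) for line in sample if line.strip()]
best = min(records, key=lambda r: r["loss"])
print(f"best loss {best['loss']} at step {best['step']}")
```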