
Training Guide

Prerequisites

  • Rust (stable) -- required to build the chess engine native extension
  • uv -- Python package manager
  • GPU with ROCm (AMD) or CUDA (NVIDIA). CPU works only for --variant toy

Installation

# Build the chess engine (one-time, or after engine/ changes)
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install Python dependencies
uv sync --extra rocm    # AMD GPUs (ROCm)
uv sync --extra cu128   # NVIDIA GPUs (CUDA 12.8)

Verify the install:

uv run python -c "import chess_engine; print('engine OK')"
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

Pretraining from Scratch

PAWN pretrains on random chess games generated on-the-fly by the Rust engine. No external datasets are needed.

uv run python scripts/train.py --variant base

Model variants

Variant  Params  d_model  Layers  Heads  d_ff
small    ~9.5M   256      8       4      1024
base     ~36M    512      8       8      2048
large    ~68M    640      10      8      2560
toy      tiny    64       2       4      256

Default training configuration

  • Total steps: 100,000
  • Batch size: 256
  • Optimizer: AdamW (Loshchilov & Hutter, 2017) with lr=3e-4, weight_decay=0.01
  • LR schedule: cosine decay (Loshchilov & Hutter, 2016) with 1,000-step warmup
  • Mixed precision: fp16 AMP (Micikevicius et al., 2017), auto-detected
  • Checkpoints: saved every 5,000 steps to checkpoints/
  • Eval: every 500 steps on 512 held-out random games
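The warmup-plus-cosine schedule listed above can be sketched as follows. This is a minimal illustration, not the training code itself; the decay floor (min_lr=0) and the exact endpoint behavior are assumptions.

```python
import math

def lr_at(step, base_lr=3e-4, warmup=1_000, total=100_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr.

    A sketch of the default schedule; min_lr and the decay endpoint
    are assumptions, not read from scripts/train.py.
    """
    if step < warmup:
        return base_lr * (step + 1) / warmup  # linear warmup
    progress = (step - warmup) / max(1, total - warmup)  # 0 -> 1 over decay
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 1,000 the schedule reaches the full 3e-4 and decays smoothly to zero by step 100,000.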

Common overrides

# Resume from a checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000

# Custom batch size and step count
uv run python scripts/train.py --variant base --batch-size 128 --total-steps 200000

# Gradient accumulation (effective batch = batch_size * accumulation_steps)
uv run python scripts/train.py --variant base --accumulation-steps 4

# Enable W&B logging
uv run python scripts/train.py --variant base --wandb
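The effective-batch relation from the accumulation flag above can be illustrated with a toy sketch: gradients are averaged over batch_size * accumulation_steps samples before each optimizer step. Scalar "gradients" stand in for tensors here; this is not the actual training loop.

```python
def accumulation_updates(per_sample_grads, batch_size, accumulation_steps):
    """Yield one averaged gradient per optimizer step.

    Toy sketch of gradient accumulation: each yielded value averages
    over effective_batch = batch_size * accumulation_steps samples,
    mimicking one optimizer.step() per effective batch. Real training
    accumulates tensor gradients, not scalars.
    """
    effective_batch = batch_size * accumulation_steps
    for i in range(0, len(per_sample_grads), effective_batch):
        chunk = per_sample_grads[i:i + effective_batch]
        yield sum(chunk) / len(chunk)  # mean gradient for this step
```

With batch_size=2 and accumulation_steps=2, eight samples produce two optimizer steps instead of four.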

Adapter Training (Behavioral Cloning)

Adapter training freezes the pretrained PAWN backbone and trains lightweight adapter modules on Lichess games to predict human moves.

Requirements

  1. A pretrained PAWN checkpoint (from pretraining above)
  2. A Lichess PGN file filtered to an Elo band

Download standard rated game archives from the Lichess open database and filter them to your target Elo band. The scripts expect a single .pgn file.
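One stdlib-only way to do the Elo filtering is sketched below. It assumes standard Lichess headers (WhiteElo/BlackElo) and games delimited by a blank line before each [Event ...] tag; for full monthly archives a streaming tool is more practical than loading the whole file.

```python
import re

def filter_pgn_by_elo(pgn_text, lo=1800, hi=1899):
    """Keep games where both players' Elo lies in [lo, hi].

    A minimal sketch, not the project's tooling: splits on the blank
    line preceding each [Event ...] header and reads WhiteElo/BlackElo
    tags with a regex.
    """
    games = re.split(r"\n\n(?=\[Event )", pgn_text.strip())
    kept = []
    for game in games:
        elos = re.findall(r'\[(?:White|Black)Elo "(\d+)"\]', game)
        if len(elos) == 2 and all(lo <= int(e) <= hi for e in elos):
            kept.append(game)
    return "\n\n".join(kept)
```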

Available adapters

Adapter     Script                       Key flag
Bottleneck  scripts/train_bottleneck.py  --bottleneck-dim 8
FiLM        scripts/train_film.py
LoRA        scripts/train_lora.py
Sparse      scripts/train_sparse.py
Hybrid      scripts/train_hybrid.py

There is also scripts/train_tiny.py for a standalone small transformer baseline (no frozen backbone).

Example: bottleneck adapter

uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --bottleneck-dim 32 \
    --lr 1e-4

Adapter training defaults

  • Epochs: 50 (with early stopping, patience=10)
  • Batch size: 64
  • Optimizer: AdamW (lr=3e-4)
  • LR schedule: cosine with 5% warmup
  • Min ply: 10 (games shorter than 10 plies are skipped)
  • Max games: 12,000 train + 2,000 validation
  • Legal masking: move legality enforced via the Rust engine at every position
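The legal-masking idea in the last bullet can be sketched as: illegal moves get a logit of negative infinity, so they receive exactly zero probability after normalization. The real pipeline queries the Rust engine for the legal-move set; the scalar version here is only an illustration.

```python
import math

def mask_and_normalize(logits, legal):
    """Softmax over move logits with illegal moves zeroed out.

    Sketch of legal masking: `legal` is a boolean mask (in practice
    supplied by the Rust engine per position); assumes at least one
    legal move.
    """
    masked = [x if ok else float("-inf") for x, ok in zip(logits, legal)]
    m = max(masked)  # subtract max for numerical stability
    exps = [0.0 if x == float("-inf") else math.exp(x - m) for x in masked]
    z = sum(exps)
    return [e / z for e in exps]
```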

Resuming adapter training

uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --resume logs/bottleneck_20260315_120000/checkpoints/best.pt

Selective layer placement

Adapters can target specific layers or sublayer positions:

# Only FFN adapters on layers 4-7
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --no-adapt-attn --adapter-layers 4,5,6,7

Use --attn-layers / --ffn-layers for independent control of which layers get attention vs FFN adapters.
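The placement logic behind these flags can be sketched as below. The function mirrors the --adapter-layers / --no-adapt-attn style of selection, but the exact flag semantics in the training scripts may differ; this is an assumption-laden illustration.

```python
def adapter_placement(n_layers, adapter_layers=None, adapt_attn=True, adapt_ffn=True):
    """Return the (layer, sublayer) slots that receive adapters.

    Sketch of selective placement: restrict to `adapter_layers` if
    given, and toggle attention/FFN sublayers independently. Not the
    scripts' actual implementation.
    """
    chosen = set(adapter_layers) if adapter_layers is not None else set(range(n_layers))
    slots = []
    for i in range(n_layers):
        if i not in chosen:
            continue
        if adapt_attn:
            slots.append((i, "attn"))
        if adapt_ffn:
            slots.append((i, "ffn"))
    return slots
```

For the example command above, an 8-layer model with --no-adapt-attn --adapter-layers 4,5,6,7 would yield only FFN slots on layers 4 through 7.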

Cloud Deployment (Runpod)

The deploy/ directory provides scripts for managing GPU pods.

Pod lifecycle with pod.sh

bash deploy/pod.sh create myexp --gpu a5000        # Create a pod
bash deploy/pod.sh deploy myexp                     # Build + transfer + setup
bash deploy/pod.sh launch myexp scripts/train.py --variant base  # Run training
bash deploy/pod.sh ssh myexp                        # SSH in
bash deploy/pod.sh stop myexp                       # Stop (preserves volume)

GPU shortcuts: a5000, a40, a6000, 4090, 5090, l40s, h100.

Manual deployment

If you prefer to deploy manually:

# 1. Build deploy package locally
bash deploy/build.sh --checkpoint checkpoints/pawn-base.pt --data-dir data/

# 2. Transfer to pod
rsync -avz --progress deploy/pawn-deploy/ root@<pod-ip>:/workspace/pawn/

# 3. Run setup on the pod (installs Rust, uv, builds engine, syncs deps)
ssh root@<pod-ip> 'cd /workspace/pawn && bash deploy/setup.sh'

setup.sh handles: Rust installation, uv installation, building the chess engine, uv sync --extra cu128, and decompressing any zstd-compressed PGN data.

GPU Auto-Detection

The pawn.gpu module auto-detects your GPU and configures:

  • torch.compile: enabled on CUDA, uses inductor backend
  • AMP: fp16 automatic mixed precision on CUDA
  • SDPA backend: flash attention on NVIDIA; MATH backend on AMD (ROCm's flash attention backward has stride mismatches with torch.compile)

No manual flags are needed in most cases. Override with --no-compile, --no-amp, or --sdpa-math if needed.

Monitoring

All training scripts log metrics to JSONL files in logs/. Each run creates a timestamped directory (e.g., logs/bottleneck_20260315_120000/metrics.jsonl).

Every log record includes:

  • Training metrics (loss, accuracy, learning rate)
  • System resource stats (RAM, GPU VRAM peak/current)
  • Timestamps and elapsed time

The JSONL format is one JSON object per line, readable with standard tools:

# Watch live training progress (jq handles a stream of JSON objects;
# python -m json.tool expects a single document and will not)
tail -f logs/bottleneck_20260315_120000/metrics.jsonl | jq .
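The per-line records can also be loaded with the stdlib json module. The field names below ("step", "loss") are assumptions about the log schema, shown with an in-memory stream standing in for a metrics.jsonl file:

```python
import io
import json

def read_metrics(fp):
    """Parse a JSONL stream (one JSON object per line) into dicts.

    Sketch only; record fields like 'step' and 'loss' are assumed,
    not taken from the actual log schema.
    """
    return [json.loads(line) for line in fp if line.strip()]

# In-memory stand-in for open("logs/<run>/metrics.jsonl"):
stream = io.StringIO('{"step": 100, "loss": 2.31}\n{"step": 200, "loss": 2.05}\n')
records = read_metrics(stream)
best = min(records, key=lambda r: r["loss"])  # record with lowest loss
print(best["step"])  # -> 200
```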