halluci-mate v1c

DPO fine-tune of jspaulsen/halluci-mate-v1b on Stockfish-vs-model preference pairs. It keeps the same Qwen3-0.6B architecture and the same custom ~1,800-token UCI tokenizer; the policy was nudged to prefer the move Stockfish endorses over the move v1b actually played wherever v1b blundered.

Training data

jspaulsen/halluci-mate-v1b-dpo — 12,871 preference pairs derived from ~11,000 v1b-vs-Stockfish games (skill 5, depth 12) using the halluci-mate eval harness:

scripts/eval.py export-dpo <run> --flavor quality \
  --require-consequential --exclude-repetition

Quality threshold: centipawn loss > 200.
--require-consequential drops moves played from positions already evaluated as lost (eval-before < -800 cp). Blunders in already-lost endgames teach the model to chase swindle lines that don't generalize.
--exclude-repetition drops moves that recur within the same game. Stockfish flags forced-repetition draws as blunders even when repetition is the only drawing line.
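
The three filters above can be read as a single keep/drop predicate per move. The sketch below is illustrative only and is not the harness's code: `MoveRecord`, its field names, and `keep_for_dpo` are hypothetical, the eval is assumed to be from the model's point of view, and "recurs within the same game" is modeled as a precomputed boolean.

```python
from dataclasses import dataclass


@dataclass
class MoveRecord:
    """One model move with Stockfish annotations (hypothetical schema)."""
    played_uci: str             # move the model actually played
    best_uci: str               # move Stockfish preferred
    eval_before_cp: int         # eval before the move, model's point of view (assumed)
    cp_loss: int                # centipawns lost relative to Stockfish's choice
    repeats_earlier_move: bool  # True if this move already occurred in the game


def keep_for_dpo(m: MoveRecord, loss_threshold: int = 200, lost_threshold: int = -800) -> bool:
    """Apply the quality, consequential, and repetition filters in turn."""
    if m.cp_loss <= loss_threshold:        # quality: only moves losing more than 200 cp
        return False
    if m.eval_before_cp < lost_threshold:  # consequential: skip already-lost positions
        return False
    if m.repeats_earlier_move:             # repetition: skip recurring moves
        return False
    return True


records = [
    MoveRecord("g1f3", "d2d4", 30, 250, False),    # kept
    MoveRecord("a2a3", "e2e4", 10, 120, False),    # dropped: loss below threshold
    MoveRecord("h7h6", "e8g8", -950, 400, False),  # dropped: position already lost
    MoveRecord("g8f6", "d7d5", -50, 300, True),    # dropped: repeated move
]

# Each surviving move becomes one preference pair: Stockfish's move vs. the played move.
pairs = [{"chosen": m.best_uci, "rejected": m.played_uci} for m in records if keep_for_dpo(m)]
print(pairs)
```

In a real export the prompt (the game prefix leading to the position) would accompany each pair; it is omitted here to keep the filter logic in focus.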

Training recipe

scripts/train_dpo.py via TRL's DPOTrainer:

setting	value
Method	DPO (TRL)
Base / reference model	`jspaulsen/halluci-mate-v1b`
Pairs	12,871 (98% train / 2% eval)
Learning rate	1e-5
LR schedule	cosine, 5% warmup
Beta	0.1
Epochs	2
Per-device batch × grad accum × GPUs	16 × 2 × 2 = 64 (effective batch size)
Optimizer	paged AdamW 8-bit
Precision	bf16 + flash-attention-2
Total steps	396
Hardware	RTX PRO 4500 Blackwell + RTX 3090 (DDP)
Wall-clock	~7 min
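
As a sanity check, the step count in the table can be reproduced from the other rows. This is a back-of-the-envelope sketch, not output from the training script; it assumes the 98% split rounds to the nearest pair and that a final partial batch counts as an optimizer step.

```python
import math

total_pairs = 12_871
train_fraction = 0.98
per_device_batch = 16
grad_accum = 2
num_gpus = 2
epochs = 2

# Pairs consumed per optimizer step across both GPUs.
effective_batch = per_device_batch * grad_accum * num_gpus

# Assumption: the train split is 98% of the pairs, rounded to the nearest pair.
train_pairs = round(total_pairs * train_fraction)

# Assumption: the last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, train_pairs, steps_per_epoch, total_steps)  # 64 12614 198 396
```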

Held-out eval at end of training

metric	value
eval_loss	0.6642
eval_rewards/accuracies	0.788
eval_rewards/margins	0.0600
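
To put these numbers in context: assuming TRL's default sigmoid DPO loss with no label smoothing, the per-pair loss is -log(sigmoid(reward margin)), so a policy identical to the reference scores ln 2. The sketch below evaluates that formula at a margin of zero and at the reported mean margin; `dpo_sigmoid_loss` is a hypothetical helper name, not a TRL function.

```python
import math


def dpo_sigmoid_loss(margin: float) -> float:
    """Per-pair loss -log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))


at_reference = dpo_sigmoid_loss(0.0)       # policy == reference model
at_mean_margin = dpo_sigmoid_loss(0.0600)  # reported eval_rewards/margins

print(round(at_reference, 4), round(at_mean_margin, 4))  # 0.6931 0.6636
```

Because the loss is convex in the margin, the mean loss over pairs can only be at or above the loss at the mean margin, so the reported eval_loss of 0.6642 sitting just above 0.6636 is what one would expect. Both values are close to ln 2, which suggests the policy moved only slightly away from v1b.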

vs-Stockfish (skill 5, depth 12, 300 games, alternating colors, `--sf-analyze`)

metric	v1b (10k games)	v1c (300 games)	Δ
score_rate	0.0849	0.0933	+0.84pp
legal_rate	0.9749	0.9782	+0.33pp
tactical_oversight_rate	0.1520	0.1393	-1.27pp
blunder_rate (consequential)	0.0511	0.0491	-0.20pp
blunder_rate (lost positions)	0.0850	0.0819	-0.31pp

The v1b numbers come from 10,000 games (the same run that produced the DPO dataset), so their confidence intervals are much tighter than those of v1c's 300-game eval. The +0.84pp difference in score_rate is within the n=300 noise band; the per-move quality improvements (tactical_oversight_rate, legal_rate) are above noise and consistent with what DPO was trained to do.
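
The size of the noise band for score_rate can be estimated with a normal-approximation interval. This is a rough sketch rather than the harness's method: it treats each game as a Bernoulli trial, which is only approximate since draws score 0.5, and `binomial_half_width` is a hypothetical helper.

```python
import math


def binomial_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a rate p observed over n trials."""
    return z * math.sqrt(p * (1 - p) / n)


v1b_hw = binomial_half_width(0.0849, 10_000)
v1c_hw = binomial_half_width(0.0933, 300)
delta = 0.0933 - 0.0849

print(f"v1b ±{v1b_hw * 100:.2f}pp, v1c ±{v1c_hw * 100:.2f}pp, delta {delta * 100:+.2f}pp")
```

Under these assumptions the v1c interval is roughly ±3.3pp against about ±0.55pp for v1b, so a +0.84pp change in score_rate cannot be distinguished from sampling noise at 300 games. The per-move metrics are averaged over many more observations than 300, which is why smaller deltas there can still be meaningful.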

License

MIT, matching the v1b base.

Model size

0.4B params, BF16, Safetensors.
