halluci-mate v1c

DPO fine-tune of jspaulsen/halluci-mate-v1b on Stockfish-vs-model preference pairs. Same Qwen3-0.6B architecture, same custom ~1,800-token UCI tokenizer; the policy was nudged to prefer the moves Stockfish endorses over the moves v1b actually played in losing positions.

Training data

jspaulsen/halluci-mate-v1b-dpo — 12,871 preference pairs derived from ~11,000 v1b-vs-Stockfish games (skill 5, depth 12) using the halluci-mate eval harness:

scripts/eval.py export-dpo <run> --flavor quality \
  --require-consequential --exclude-repetition
  • Quality threshold: centipawn loss > 200.
  • --require-consequential drops moves played from positions already evaluated as lost (eval-before < -800 cp). Blunders in already-lost endgames teach the model to chase swindle lines that don't generalize.
  • --exclude-repetition drops moves that recur within the same game. Stockfish flags forced-repetition draws as blunders even when repetition is the only drawing line.
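
The three filters above can be sketched as a single pass over per-move analysis records. This is a minimal illustration, not the harness's actual code: the `MoveRecord` fields and the `select_blunders` helper are hypothetical names, and keying repetition on the UCI string within a game is an assumption about what "recur" means.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveRecord:
    """Hypothetical per-move record; field names are illustrative only."""
    game_id: str
    uci: str             # move played by the model, e.g. "e2e4"
    cp_loss: int         # centipawn loss versus Stockfish's preferred move
    eval_before_cp: int  # evaluation before the move, from the mover's side


def select_blunders(moves, cp_loss_min=200, lost_cp=-800):
    """Keep moves that pass the quality, consequential and repetition filters."""
    # Assumption: a move "recurs" if the same UCI string appears more than
    # once in the same game.
    counts = Counter((m.game_id, m.uci) for m in moves)
    kept = []
    for m in moves:
        if m.cp_loss <= cp_loss_min:
            continue  # quality threshold: centipawn loss must exceed 200
        if m.eval_before_cp < lost_cp:
            continue  # position was already lost before the move
        if counts[(m.game_id, m.uci)] > 1:
            continue  # move recurs within the same game
        kept.append(m)
    return kept


sample = [
    MoveRecord("g1", "e2e4", 250, -100),  # passes all three filters
    MoveRecord("g1", "d2d4", 150, 0),     # below the quality threshold
    MoveRecord("g1", "a2a3", 400, -900),  # played from a lost position
    MoveRecord("g1", "g1f3", 300, 50),    # recurs below
    MoveRecord("g1", "g1f3", 300, 40),
]
kept = select_blunders(sample)
print([m.uci for m in kept])
```

Each surviving record would then be paired with Stockfish's preferred move as the chosen completion and the model's move as the rejected one.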

Training recipe

Trained with scripts/train_dpo.py, which uses TRL's DPOTrainer:

| Setting | Value |
|---|---|
| Method | DPO (TRL) |
| Base / reference model | jspaulsen/halluci-mate-v1b |
| Pairs | 12,871 (98% train / 2% eval) |
| Learning rate | 1e-5 |
| LR schedule | cosine, 5% warmup |
| Beta | 0.1 |
| Epochs | 2 |
| Per-device batch × grad accum × GPUs | 16 × 2 × 2 = effective 64 |
| Optimizer | paged AdamW 8-bit |
| Precision | bf16 + flash-attention-2 |
| Total steps | 396 |
| Hardware | RTX PRO 4500 Blackwell + RTX 3090 (DDP) |
| Wall-clock | ~7 min |
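
As a sanity check on the recipe, the effective batch size and total step count can be reproduced from the other settings. The sketch below is illustrative only: the dictionary keys follow the Hugging Face TrainingArguments / TRL DPOConfig field names these settings most plausibly map to, which is an assumption about scripts/train_dpo.py, and the rounding of the eval split and of the final partial batch is likewise assumed.

```python
import math

# Recipe values, keyed by assumed TrainingArguments / DPOConfig field names.
# The real script may name or pass them differently.
recipe = {
    "beta": 0.1,
    "learning_rate": 1e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "num_train_epochs": 2,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "optim": "paged_adamw_8bit",
    "bf16": True,
}
num_gpus = 2
total_pairs = 12_871
eval_fraction = 0.02

# Assumption: the 2% eval split is rounded up to a whole number of pairs.
eval_pairs = math.ceil(total_pairs * eval_fraction)
train_pairs = total_pairs - eval_pairs

effective_batch = (
    recipe["per_device_train_batch_size"]
    * recipe["gradient_accumulation_steps"]
    * num_gpus
)

# Assumption: a final partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * recipe["num_train_epochs"]

print(effective_batch, steps_per_epoch, total_steps)
```

Under these assumptions the arithmetic lands on the reported 396 steps (198 per epoch at an effective batch of 64).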

Held-out eval at end of training

| metric | value |
|---|---|
| eval_loss | 0.6642 |
| eval_rewards/accuracies | 0.788 |
| eval_rewards/margins | 0.0600 |
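
To put these numbers in context: assuming TRL's default sigmoid DPO loss, the per-pair loss is -log σ(margin), where the margin is the beta-scaled reward gap between chosen and rejected moves. A policy identical to the reference has margin 0 and loss ln 2 ≈ 0.6931. The sketch below evaluates the loss at the reported mean margin. Because the loss is convex, the mean loss over pairs can only be at or above this value, so it is a consistency check rather than an exact reproduction.

```python
import math


def dpo_sigmoid_loss(margin: float) -> float:
    """Per-pair DPO loss -log(sigmoid(margin)), computed stably as softplus(-margin)."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


baseline = dpo_sigmoid_loss(0.0)         # policy == reference, equals ln 2
at_mean_margin = dpo_sigmoid_loss(0.06)  # loss at the reported mean margin

print(f"baseline loss:       {baseline:.4f}")
print(f"loss at margin 0.06: {at_mean_margin:.4f}")
```

The value at margin 0.06 comes out just under the reported eval_loss of 0.6642, consistent with a small but real shift away from the reference policy.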

vs-Stockfish (skill 5, depth 12, 300 games, alternating colors, --sf-analyze)

| metric | v1b (10k games) | v1c (300 games) | Δ |
|---|---|---|---|
| score_rate | 0.0849 | 0.0933 | +0.84pp |
| legal_rate | 0.9749 | 0.9782 | +0.33pp |
| tactical_oversight_rate | 0.1520 | 0.1393 | -1.27pp |
| blunder_rate (consequential) | 0.0511 | 0.0491 | -0.20pp |
| blunder_rate (lost positions) | 0.0850 | 0.0819 | -0.31pp |

The v1b numbers come from 10,000 games (the same run that produced the DPO dataset), so their confidence intervals are much tighter than those of v1c's 300-game eval. The difference in score_rate is within the n=300 noise band; the per-move quality improvements (tactical_oversight_rate, legal_rate) are above noise and consistent with what DPO was trained to do.
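
The noise band for score_rate can be estimated with a normal-approximation interval. This is a rough sketch that treats each game as a Bernoulli trial. Since draws score 0.5, the true per-game variance is at most p(1 - p), so the half-widths below are upper bounds. The per-move metrics are not covered, because their sample size (the number of moves) is not stated here.

```python
import math


def half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% half-width for a rate p estimated from n trials."""
    return z * math.sqrt(p * (1.0 - p) / n)


v1b_score, v1c_score = 0.0849, 0.0933
hw_v1b = half_width(v1b_score, 10_000)
hw_v1c = half_width(v1c_score, 300)
delta = v1c_score - v1b_score

print(f"v1b half-width: {hw_v1b * 100:.2f}pp")
print(f"v1c half-width: {hw_v1c * 100:.2f}pp")
print(f"delta:          {delta * 100:.2f}pp")
print("within v1c noise band:", abs(delta) < hw_v1c)
```

With 300 games the half-width is about 3.3pp, several times larger than the observed +0.84pp difference.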

License

MIT, matching the v1b base.

Model size: 0.4B params, tensor type BF16 (Safetensors)