halluci-mate v1c
DPO fine-tune of jspaulsen/halluci-mate-v1b
on Stockfish-vs-model preference pairs. Same Qwen3-0.6B architecture, same custom
~1,800-token UCI tokenizer; the policy was nudged to prefer the moves Stockfish
endorses over the moves v1b actually played in losing positions.
Training data
jspaulsen/halluci-mate-v1b-dpo
— 12,871 preference pairs derived from ~11,000 v1b-vs-Stockfish games (skill 5,
depth 12) using the halluci-mate
eval harness:
scripts/eval.py export-dpo <run> --flavor quality \
--require-consequential --exclude-repetition
- Quality threshold: centipawn loss > 200.
- `--require-consequential` drops moves played from positions already evaluated as lost (eval-before < -800 cp). Blunders in already-lost endgames teach the model to chase swindle lines that don't generalize.
- `--exclude-repetition` drops moves that recur within the same game. Stockfish flags forced-repetition draws as blunders even when repetition is the only drawing line.
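The three filters above can be sketched as a single pass over per-move records. This is a minimal illustration, not the harness's actual code: the `MoveRecord` fields and the `select_pairs` helper are hypothetical names, and the sketch assumes "recur" means a later repeat of the same played move within a game (the first occurrence is kept).

```python
from dataclasses import dataclass

# Thresholds come from the text above; all names below are hypothetical.
CP_LOSS_THRESHOLD = 200     # keep only moves that lose more than 200 cp
LOST_POSITION_EVAL = -800   # positions below this count as already lost


@dataclass(frozen=True)
class MoveRecord:
    game_id: str
    played: str          # UCI move the model played (rejected side of the pair)
    best: str            # UCI move Stockfish preferred (chosen side of the pair)
    eval_before_cp: int  # evaluation before the move, from the mover's side
    cp_loss: int         # centipawn loss of the played move


def select_pairs(moves):
    """Return (chosen, rejected) pairs that pass all three filters."""
    seen = set()  # (game_id, played) combinations already encountered
    pairs = []
    for m in moves:
        key = (m.game_id, m.played)
        repeated = key in seen
        seen.add(key)
        if m.cp_loss <= CP_LOSS_THRESHOLD:
            continue  # quality threshold: mistake is not large enough
        if m.eval_before_cp < LOST_POSITION_EVAL:
            continue  # --require-consequential: position was already lost
        if repeated:
            continue  # --exclude-repetition: move seen earlier in this game
        pairs.append((m.best, m.played))
    return pairs
```

Each surviving pair treats the Stockfish move as "chosen" and the model's move as "rejected", which is the orientation DPO expects.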
Training recipe
scripts/train_dpo.py
via TRL's DPOTrainer:
| setting | value |
|---|---|
| Method | DPO (TRL) |
| Base / reference model | jspaulsen/halluci-mate-v1b |
| Pairs | 12,871 (98% train / 2% eval) |
| Learning rate | 1e-5 |
| LR schedule | cosine, 5% warmup |
| Beta | 0.1 |
| Epochs | 2 |
| Per-device batch × grad accum × GPUs | 16 × 2 × 2 = effective 64 |
| Optimizer | paged AdamW 8-bit |
| Precision | bf16 + flash-attention-2 |
| Total steps | 396 |
| Hardware | RTX PRO 4500 Blackwell + RTX 3090 (DDP) |
| Wall-clock | ~7 min |
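The batch and step figures in the table are mutually consistent, as the short check below shows. It is a sketch under two assumptions not stated in the card: that the 98% train split is rounded to the nearest pair, and that the final partial batch of each epoch counts as one optimizer step.

```python
import math

# Values copied from the training recipe table.
pairs = 12_871
train_fraction = 0.98
per_device_batch = 16
grad_accum = 2
num_gpus = 2
epochs = 2

# Effective batch = per-device batch x gradient accumulation x GPUs.
effective_batch = per_device_batch * grad_accum * num_gpus

# Assumption: the train split is rounded to the nearest whole pair.
train_pairs = round(pairs * train_fraction)

# Assumption: a trailing partial batch still produces an optimizer step.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, train_pairs, steps_per_epoch, total_steps)
```

Under these assumptions the script reproduces the effective batch of 64 and the 396 total steps reported above.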
Held-out eval at end of training
| metric | value |
|---|---|
| eval_loss | 0.6642 |
| eval_rewards/accuracies | 0.788 |
| eval_rewards/margins | 0.0600 |
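As a rough sanity check on these numbers, the loss can be compared with what the standard sigmoid DPO objective gives at the mean reward margin. This sketch assumes the default sigmoid loss was used and that the logged margin is already scaled by beta, as in TRL's usual logging; neither is confirmed by the card.

```python
import math


def dpo_loss(reward_margin):
    """Sigmoid DPO loss, -log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    return math.log1p(math.exp(-reward_margin))


# A policy identical to the reference has margin 0 and loss ln(2).
loss_at_zero = dpo_loss(0.0)

# Loss evaluated at the reported mean margin (assumed beta-scaled).
loss_at_mean_margin = dpo_loss(0.0600)

print(round(loss_at_zero, 4), round(loss_at_mean_margin, 4))
```

The loss at the mean margin comes out slightly below the reported eval_loss of 0.6642. That is the expected direction: the loss is convex in the margin, so the average loss over pairs is at least the loss at the average margin. Both sit only a little under ln 2 ≈ 0.6931, which fits a light nudge away from the reference model.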
vs-Stockfish (skill 5, depth 12, 300 games, alternating colors, --sf-analyze)
| metric | v1b (10k games) | v1c (300 games) | Δ |
|---|---|---|---|
| score_rate | 0.0849 | 0.0933 | +0.84pp |
| legal_rate | 0.9749 | 0.9782 | +0.33pp |
| tactical_oversight_rate | 0.1520 | 0.1393 | -1.27pp |
| blunder_rate (consequential) | 0.0511 | 0.0491 | -0.20pp |
| blunder_rate (lost positions) | 0.0850 | 0.0819 | -0.31pp |
The v1b numbers come from 10,000 games (the same run that produced the DPO dataset), so their confidence intervals are much tighter than those of v1c's 300-game eval. The difference in score_rate is within the n=300 noise band; the per-move quality improvements (tactical_oversight_rate, legal_rate) are above noise and consistent with what DPO was trained to do.
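The size of the n=300 noise band for score_rate can be estimated with a binomial standard error. This is an approximation rather than the harness's own calculation: a game score takes values 0, 0.5, or 1, so the binomial formula is a conservative upper bound on the true standard error, and the much smaller v1b uncertainty is ignored. The per-move metrics are not reproduced here because the card does not give the number of moves.

```python
import math


def standard_error(rate, n):
    """Binomial standard error of a rate estimated from n samples."""
    return math.sqrt(rate * (1.0 - rate) / n)


# score_rate values from the table above.
score_v1b = 0.0849
score_v1c = 0.0933
n_v1c = 300

delta = score_v1c - score_v1b
se = standard_error(score_v1c, n_v1c)
half_width_95 = 1.96 * se  # normal-approximation 95% half-width

within_noise = abs(delta) < half_width_95
print(f"delta={delta:.4f}  se={se:.4f}  95% half-width={half_width_95:.4f}")
```

With a standard error of roughly 1.7pp, the observed +0.84pp change in score_rate is well inside the interval.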
License
MIT, matching the v1b base.