Sudoku Latent Backtracking — Qwen2.5-1.5B Curriculum

Qwen2.5-1.5B-Instruct is trained to solve 9×9 Sudoku as a per-cell set-prediction policy with a recurrent latent chain-of-thought. Training runs across a 3-stage curriculum and uses an explicit latent-token backtracking / rehearsal mechanism.

Given a grid and a target empty cell, the model emits the set of consistent values as JSON ({"values":[...]}); a puzzle is solved by querying the policy cell by cell. This repo is the companion to the GPT-2 line in Avra98/sudoku-gpt2-curriculum.
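The cell-by-cell querying described above could look roughly like the sketch below. It assumes the completion is exactly the JSON object shown, that empty cells are encoded as 0, and that a cell is filled only when a single value remains. `policy` is a hypothetical callable standing in for the model, and the fill rule is an illustration, not necessarily the repo's solver.

```python
import json


def parse_values(completion):
    """Parse a completion such as '{"values":[3,7]}' into a set of digits 1-9."""
    payload = json.loads(completion)
    return {int(v) for v in payload["values"] if 1 <= int(v) <= 9}


def fill_singletons(grid, policy):
    """Run one sweep over the grid and fill cells that have a single candidate.

    `policy(grid, row, col)` is a hypothetical callable that returns the raw
    model completion for one empty cell. Returns the number of cells filled.
    """
    filled = 0
    for row in range(9):
        for col in range(9):
            if grid[row][col] != 0:  # 0 marks an empty cell (assumption)
                continue
            values = parse_values(policy(grid, row, col))
            if len(values) == 1:
                grid[row][col] = next(iter(values))
                filled += 1
    return filled
```

Repeating the sweep until it returns 0 gives a simple fixed-point solver; puzzles that need search would require more than this.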

Curriculum

Each cell's target is its stage-i consistent value set (generated from each puzzle by --stage_i):

| Stage | Target = values consistent under … | Latent steps k |
|---|---|---|
| 1 | direct row/column/box constraints | 1 |
| 2 | + 2-step lookahead | 2 |
| 3 | + 3-step lookahead | 3 |
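For stage 1, the target set follows directly from the standard Sudoku rules. A minimal sketch, assuming the grid is a 9×9 list of lists with 0 for empty cells (the function name is illustrative, and the lookahead used in stages 2 and 3 is not reproduced here):

```python
def stage1_candidates(grid, row, col):
    """Return the digits not already used in the cell's row, column, or 3x3 box."""
    used = set(grid[row])                          # digits in the same row
    used |= {grid[r][col] for r in range(9)}       # digits in the same column
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    used |= {
        grid[r][c]
        for r in range(box_row, box_row + 3)
        for c in range(box_col, box_col + 3)
    }                                              # digits in the same box
    return sorted(set(range(1, 10)) - used)        # 0 (empty) is ignored
```

The sorted list maps directly onto the {"values":[...]} output format.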

Training follows SFT → GRPO within each stage, then initializes the next stage from the previous stage's GRPO checkpoint (curriculum chaining).
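The chaining order can be written out as a small schedule. This is only a sketch of the ordering described above: the phase labels (`stage1_sft`, ...) are hypothetical and do not correspond to the repo's directory names, and starting stage 1 from the base Instruct model is an assumption.

```python
def curriculum_schedule(num_stages=3, base="Qwen/Qwen2.5-1.5B-Instruct"):
    """List training phases in order, each initialized from the previous one."""
    schedule = []
    previous = base
    for stage in range(1, num_stages + 1):
        for phase in ("sft", "grpo"):
            name = f"stage{stage}_{phase}"     # hypothetical label
            schedule.append({
                "name": name,
                "stage": stage,
                "latent_steps": stage,         # k equals the stage index (see table)
                "init_from": previous,
            })
            previous = name
    return schedule
```

With three stages this yields six phases, and each stage's SFT phase starts from the previous stage's GRPO phase.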

Latent backtracking

Latent CoT uses recurrent hidden tokens (--latent_mode recurrent_hidden). When training a frontier stage, backtracking (--backtrack_enable) interleaves rehearsal of earlier stages (decoded with their smaller latent budget), so the model does not forget earlier-stage skills:

--remember_rate r — probability of broad rehearsal (sample any earlier stage uniformly); otherwise sample from the earliest regressed stage up to the frontier.
--backtrack_detect_threshold t — per-stage exact-match bar below which a stage counts as regressed, driving targeted backtracking (t=0 ⇒ pure rehearsal).
--backtrack_pool_rows N — cap the per-stage rehearsal pool size to keep dataset prep fast.
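One way to read the interaction of `--remember_rate` and `--backtrack_detect_threshold` is the sampler sketched below. The function and argument names are illustrative. Uniform sampling inside the targeted range and the fallback to the frontier stage when nothing has regressed are assumptions, not confirmed details of the trainer.

```python
import random


def pick_training_stage(frontier, exact_match, remember_rate,
                        detect_threshold, rng=random):
    """Choose which stage the next training example is drawn from.

    frontier: index of the stage currently being trained (1-based).
    exact_match: dict mapping earlier stage -> latest exact-match score.
    """
    earlier = list(range(1, frontier))
    if not earlier:
        return frontier                      # stage 1 has nothing to rehearse
    if rng.random() < remember_rate:
        return rng.choice(earlier)           # broad rehearsal: any earlier stage
    regressed = [s for s in earlier
                 if exact_match.get(s, 1.0) < detect_threshold]
    if not regressed:
        return frontier                      # assumption: train on the frontier
    # targeted backtracking: earliest regressed stage up to the frontier
    return rng.randint(min(regressed), frontier)
```

With `detect_threshold=0` no stage can count as regressed, so earlier stages are only visited through the `remember_rate` branch, which matches the "pure rehearsal" reading above.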

Runs (`s2_long/`)

| Run | remember_rate | detect_threshold | LR | Role |
|---|---|---|---|---|
| `ctrl_nobt` / `ctrl_nobt_lr2` | – | – | 5e-5 / 2e-5 | latent baseline (no backtracking) |
| `bt_rr03_adapt` / `_lr2` | 0.3 | 0.95 (adaptive) | 5e-5 / 2e-5 | targeted backtracking |
| `bt_rr05_adapt` / `_lr2` | 0.5 | 0.97 (adaptive) | 5e-5 / 2e-5 | broader targeted backtracking |
| `bt_rr05_warm` / `_lr2` | 0.5 | 0 (pure rehearsal) | 5e-5 / 2e-5 | pure rehearsal |

Each run carries Stage-2 (sft/, grpo_fixed/) and Stage-3 (s3_sft/, s3_grpo/) phases. Backtracking is applied in the SFT phase; ctrl_nobt* are the no-backtracking controls.

Repository layout

checkpoints/<run>/<phase>/<checkpoint>/   # LoRA adapters (+ tokenizer/config)
code/latent_multi_output_cell_policy/     # SFT + GRPO trainers
logs/                                     # training logs, pipeline + push scripts, RESULTS.md

Checkpoints are LoRA adapters (optimizer states are intentionally omitted to keep the repo lean). Load by applying the adapter on top of Qwen/Qwen2.5-1.5B-Instruct with PEFT.
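A minimal loading sketch with `transformers` and `peft`, assuming the repo has been cloned locally and that the tokenizer files sit next to the adapter as the layout suggests. The helper names are illustrative. This covers only weight loading: the recurrent latent decoding presumably lives in the trainers under code/, so plain generation may not reproduce the latent steps.

```python
from pathlib import Path

BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"


def adapter_dir(repo_root, run, phase, checkpoint):
    """Build checkpoints/<run>/<phase>/<checkpoint>/ under a local clone."""
    return Path(repo_root) / "checkpoints" / run / phase / checkpoint


def load_policy(adapter_path):
    """Load the base model and apply the LoRA adapter on top of it."""
    # Imported lazily so the path helper works without these packages installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Assumption: tokenizer files are stored alongside the adapter.
    tokenizer = AutoTokenizer.from_pretrained(str(adapter_path))
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")
    model = PeftModel.from_pretrained(base, str(adapter_path))
    return model.eval(), tokenizer
```

For example, `load_policy(adapter_dir(".", "bt_rr03_adapt", "s3_grpo", "<checkpoint>"))`, with `<checkpoint>` replaced by an actual checkpoint directory name from the repo.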

Reproduce

# Stage-3 pipeline for a run (SFT k=3 -> GRPO k=3), optional GPU override:
bash pipeline_stage3.sh <run_name> [gpu_id]

Checkpoints, code, and logs are pushed here automatically every ~15 min during training.

