Sudoku Latent Backtracking: Qwen2.5-1.5B Curriculum

Qwen2.5-1.5B-Instruct trained to solve 9×9 Sudoku as a per-cell set-prediction policy with a recurrent latent chain-of-thought, across a 3-stage curriculum, with an explicit latent-token backtracking / rehearsal mechanism.

Given a grid and a target empty cell, the model emits the set of consistent values as JSON ({"values":[...]}); a puzzle is solved by querying the policy cell by cell. Companion to the GPT-2 line in Avra98/sudoku-gpt2-curriculum.
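The cell-by-cell query loop can be sketched roughly as follows. This is an illustration, not the repository's solver: `policy` is a hypothetical callable standing in for the model call, and committing only cells whose predicted set is a singleton is an assumption about how the predicted sets are turned into a solution.

```python
import json

def parse_values(text):
    """Parse a reply of the form {"values":[...]} into a set of ints."""
    return {int(v) for v in json.loads(text)["values"]}

def solve(grid, policy, max_passes=81):
    """Fill a 9x9 grid (0 = empty) by querying `policy` cell by cell.

    `policy(grid, (r, c))` is a hypothetical stand-in for the model and must
    return the raw JSON string. Only singleton sets are written to the grid
    (an assumption, not something the model card specifies).
    """
    grid = [row[:] for row in grid]  # work on a copy
    for _ in range(max_passes):
        progress = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] != 0:
                    continue
                values = parse_values(policy(grid, (r, c)))
                if len(values) == 1:
                    grid[r][c] = next(iter(values))
                    progress = True
        if not progress:  # no cell could be committed in this pass
            break
    return grid
```

The loop stops as soon as a full pass commits nothing, so a puzzle that needs guessing is returned partially filled rather than looping forever.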

Curriculum

Each cell's target is its stage-i consistent value set (generated from each puzzle by --stage_i):

| Stage | Target = values consistent under … | Latent steps k |
|---|---|---|
| 1 | direct row/column/box constraints | 1 |
| 2 | + 2-step lookahead | 2 |
| 3 | + 3-step lookahead | 3 |
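For stage 1 the target follows from the Sudoku rules alone, so it can be illustrated directly. The sketch below assumes a 9x9 list of lists with 0 marking empty cells; `stage1_values` and `stage1_target` are hypothetical names, not the repository's generator. The lookahead used for stages 2 and 3 is not reproduced here because its exact definition is not given above.

```python
import json

def stage1_values(grid, r, c):
    """Digits not already used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r])                              # same row
    used |= {grid[i][c] for i in range(9)}           # same column
    br, bc = 3 * (r // 3), 3 * (c // 3)              # top-left corner of the box
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return sorted(set(range(1, 10)) - used)

def stage1_target(grid, r, c):
    """Serialize the candidate set in the {"values":[...]} shape."""
    return json.dumps({"values": stage1_values(grid, r, c)}, separators=(",", ":"))
```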

Training follows SFT → GRPO within each stage, then initializes the next stage from the previous stage's GRPO checkpoint (curriculum chaining).
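The chaining order can be written out as a small schedule. This is only a sketch: the phase labels are hypothetical, and starting stage-1 SFT from the pretrained base weights is an assumption.

```python
def chain_schedule(n_stages=3):
    """Return (phase, init_from) pairs in training order."""
    schedule = []
    prev = "base"  # assumed: stage-1 SFT starts from the pretrained base model
    for stage in range(1, n_stages + 1):
        sft, grpo = f"stage{stage}_sft", f"stage{stage}_grpo"
        schedule.append((sft, prev))   # SFT starts from the previous stage's GRPO checkpoint
        schedule.append((grpo, sft))   # GRPO continues from this stage's SFT
        prev = grpo
    return schedule
```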

Latent backtracking

Latent CoT uses recurrent hidden tokens (--latent_mode recurrent_hidden). When training a frontier stage, backtracking (--backtrack_enable) interleaves rehearsal of earlier stages (decoded with their smaller latent budget), so the model does not forget earlier-stage skills:

  • --remember_rate r: probability of broad rehearsal (sample any earlier stage uniformly); otherwise sample from the earliest regressed stage up to the frontier.
  • --backtrack_detect_threshold t: per-stage exact-match bar below which a stage counts as regressed, driving targeted backtracking (t=0 ⇒ pure rehearsal).
  • --backtrack_pool_rows N: cap the per-stage rehearsal pool size to keep dataset prep fast.
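One way to read the sampling rule described by these flags is sketched below. It is not the trainer's code: `pick_stage` and the `exact_match` mapping are hypothetical, and sampling uniformly over the range from the earliest regressed stage to the frontier is an assumption, as is falling back to the frontier when no stage has regressed.

```python
import random

def pick_stage(frontier, exact_match, remember_rate, threshold, rng=random):
    """Choose the stage the next training example is drawn from.

    exact_match maps each earlier stage (1..frontier-1) to its latest
    exact-match score; missing stages are treated as not regressed.
    """
    earlier = list(range(1, frontier))
    if earlier and rng.random() < remember_rate:
        return rng.choice(earlier)  # broad rehearsal: any earlier stage
    # A stage counts as regressed when its score falls below the threshold.
    regressed = [s for s in earlier if exact_match.get(s, 1.0) < threshold]
    start = min(regressed) if regressed else frontier
    return rng.choice(range(start, frontier + 1))  # targeted backtracking
```

With `threshold=0` no score can fall below the bar, so the second branch always returns the frontier and earlier stages are only reached through broad rehearsal, which matches the pure-rehearsal setting. Per the description above, a rehearsed stage would then be decoded with its own, smaller latent budget.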

Runs (s2_long/)

| Run | remember_rate | detect_threshold | LR | Role |
|---|---|---|---|---|
| ctrl_nobt / ctrl_nobt_lr2 | – | – | 5e-5 / 2e-5 | latent baseline (no backtracking) |
| bt_rr03_adapt / _lr2 | 0.3 | 0.95 (adaptive) | 5e-5 / 2e-5 | targeted backtracking |
| bt_rr05_adapt / _lr2 | 0.5 | 0.97 (adaptive) | 5e-5 / 2e-5 | broader targeted backtracking |
| bt_rr05_warm / _lr2 | 0.5 | 0 (pure rehearsal) | 5e-5 / 2e-5 | pure rehearsal |

Each run carries Stage-2 (sft/, grpo_fixed/) and Stage-3 (s3_sft/, s3_grpo/) phases. Backtracking is applied in the SFT phase; ctrl_nobt* are the no-backtracking controls.

Repository layout

checkpoints/<run>/<phase>/<checkpoint>/   # LoRA adapters (+ tokenizer/config)
code/latent_multi_output_cell_policy/     # SFT + GRPO trainers
logs/                                     # training logs, pipeline + push scripts, RESULTS.md

Checkpoints are LoRA adapters (optimizer states are intentionally omitted to keep the repo lean). Load by applying the adapter on top of Qwen/Qwen2.5-1.5B-Instruct with PEFT.
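A minimal loading sketch with PEFT, assuming the repository has been cloned or downloaded locally. `adapter_dir` and `load_policy` are hypothetical helper names, and the checkpoint folder name must be taken from the repository rather than from this example.

```python
from pathlib import Path

BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

def adapter_dir(root, run, phase, checkpoint):
    """Build checkpoints/<run>/<phase>/<checkpoint> under a local copy of the repo."""
    return Path(root) / "checkpoints" / run / phase / checkpoint

def load_policy(adapter_path):
    """Load the base model and apply a LoRA adapter from a local directory."""
    # Imported lazily so the path helper works without the ML stack installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    model = PeftModel.from_pretrained(base, str(adapter_path))
    return tokenizer, model.eval()
```

This only restores the weights. The recurrent latent decoding presumably lives in the trainer code under code/, so a plain `generate()` call on the returned model may not reproduce the behavior of the trained policy.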

Reproduce

# Stage-3 pipeline for a run (SFT k=3 -> GRPO k=3), optional GPU override:
bash pipeline_stage3.sh <run_name> [gpu_id]

Checkpoints, code, and logs are pushed here automatically every ~15 min during training.
