Sudoku Latent Backtracking: Qwen2.5-1.5B Curriculum
Qwen2.5-1.5B-Instruct trained to solve 9×9 Sudoku as a per-cell set-prediction policy with a recurrent latent chain-of-thought, across a 3-stage curriculum, with an explicit latent-token backtracking / rehearsal mechanism.
Given a grid and a target empty cell, the model emits the JSON set of consistent values
({"values":[...]}); solving = querying the policy cell-by-cell. Companion to the GPT-2 line in
Avra98/sudoku-gpt2-curriculum.
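As a rough illustration of "querying the policy cell-by-cell", the sketch below parses the {"values":[...]} response and sweeps the grid. The `policy` callable, the `solve` helper, and the rule of committing a value only when the returned set has exactly one element are assumptions made for this example; they are not taken from the repository's code.

```python
import json
from typing import Callable, List, Set

Grid = List[List[int]]  # 9x9 grid, 0 marks an empty cell


def parse_values(response: str) -> Set[int]:
    """Parse a response of the form {"values":[...]} into a set of digits 1-9."""
    payload = json.loads(response)
    return {int(v) for v in payload["values"] if 1 <= int(v) <= 9}


def solve(grid: Grid, policy: Callable[[Grid, int, int], str],
          max_sweeps: int = 81) -> Grid:
    """Query every empty cell in turn; fill it only when the set is a singleton.

    `policy(grid, row, col)` is a hypothetical stand-in for the model call and
    must return the JSON string described above.
    """
    grid = [row[:] for row in grid]  # work on a copy
    for _ in range(max_sweeps):
        progressed = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] != 0:
                    continue
                values = parse_values(policy(grid, r, c))
                if len(values) == 1:  # assumed commit rule
                    grid[r][c] = next(iter(values))
                    progressed = True
        if not progressed:  # nothing new was filled in this sweep
            break
    return grid
```

Cells whose returned set has more than one value are left empty and revisited on the next sweep, once neighbouring cells have been filled.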
Curriculum
Each cell's target is its stage-i consistent value set (generated from each puzzle by --stage_i):
| Stage | Target = values consistent under … | Latent steps k |
|---|---|---|
| 1 | direct row/column/box constraints | 1 |
| 2 | + 2-step lookahead | 2 |
| 3 | + 3-step lookahead | 3 |
Training follows SFT → GRPO within each stage, then initializes the next stage from the previous stage's GRPO checkpoint (curriculum chaining).
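For concreteness, the stage-1 target (direct row/column/box constraints) can be computed as in the following sketch. The function name `stage1_values` and the use of 0 for empty cells are assumptions for illustration; the lookahead rules behind the stage-2 and stage-3 targets are not specified in this card and are not reproduced here.

```python
from typing import List

Grid = List[List[int]]  # 9x9 grid, 0 marks an empty cell


def stage1_values(grid: Grid, r: int, c: int) -> List[int]:
    """Digits not already used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r])                                # same row
    used |= {grid[i][c] for i in range(9)}             # same column
    br, bc = 3 * (r // 3), 3 * (c // 3)                # top-left corner of the box
    used |= {grid[i][j]
             for i in range(br, br + 3)
             for j in range(bc, bc + 3)}               # same box
    return sorted(set(range(1, 10)) - used)            # 0 (empty) is ignored
```

On an entirely empty grid every cell returns all nine digits; as cells are filled, the sets shrink.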
Latent backtracking
Latent CoT uses recurrent hidden tokens (--latent_mode recurrent_hidden). When training a
frontier stage, backtracking (--backtrack_enable) interleaves rehearsal of earlier stages
(decoded with their smaller latent budget), so the model does not forget earlier-stage skills:
- --remember_rate r: probability of broad rehearsal (sample any earlier stage uniformly); otherwise sample from the earliest regressed stage up to the frontier.
- --backtrack_detect_threshold t: per-stage exact-match bar below which a stage counts as regressed, driving targeted backtracking (t=0 → pure rehearsal).
- --backtrack_pool_rows N: cap on the per-stage rehearsal pool size, to keep dataset prep fast.
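One way to read the sampling rule described above is sketched below. The function `pick_stage`, its arguments, and the choices to sample uniformly over the targeted range and to fall back to the frontier stage when nothing has regressed are assumptions of this sketch, not the trainer's actual code.

```python
import random
from typing import Dict


def pick_stage(frontier: int, remember_rate: float,
               exact_match: Dict[int, float], detect_threshold: float,
               rng=random) -> int:
    """Choose which stage the next SFT example is drawn from.

    exact_match maps an earlier stage number to its latest exact-match score.
    """
    earlier = list(range(1, frontier))
    if not earlier:
        return frontier                      # stage 1 has nothing to rehearse
    if rng.random() < remember_rate:
        return rng.choice(earlier)           # broad rehearsal
    regressed = [s for s in earlier
                 if exact_match.get(s, 1.0) < detect_threshold]
    if not regressed:
        return frontier                      # assumed fallback: train the frontier
    # Targeted backtracking: from the earliest regressed stage up to the frontier.
    return rng.randint(min(regressed), frontier)
```

With detect_threshold set to 0 no stage can count as regressed, so earlier stages are visited only through the remember_rate branch, which matches the "pure rehearsal" setting.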
Runs (s2_long/)
| Run | remember_rate | detect_threshold | LR | Role |
|---|---|---|---|---|
| ctrl_nobt / ctrl_nobt_lr2 | - | - | 5e-5 / 2e-5 | latent baseline (no backtracking) |
| bt_rr03_adapt / _lr2 | 0.3 | 0.95 (adaptive) | 5e-5 / 2e-5 | targeted backtracking |
| bt_rr05_adapt / _lr2 | 0.5 | 0.97 (adaptive) | 5e-5 / 2e-5 | broader targeted backtracking |
| bt_rr05_warm / _lr2 | 0.5 | 0 (pure rehearsal) | 5e-5 / 2e-5 | pure rehearsal |
Each run carries Stage-2 (sft/, grpo_fixed/) and Stage-3 (s3_sft/, s3_grpo/) phases.
Backtracking is applied in the SFT phase; ctrl_nobt* are the no-backtracking controls.
Repository layout
checkpoints/<run>/<phase>/<checkpoint>/ # LoRA adapters (+ tokenizer/config)
code/latent_multi_output_cell_policy/ # SFT + GRPO trainers
logs/ # training logs, pipeline + push scripts, RESULTS.md
Checkpoints are LoRA adapters (optimizer states are intentionally omitted to keep the repo
lean). Load by applying the adapter on top of Qwen/Qwen2.5-1.5B-Instruct with PEFT.
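A minimal loading sketch with PEFT follows. The helper names `adapter_subfolder` and `load_policy` are hypothetical, and the run, phase, and checkpoint arguments are placeholders to be filled in from the repository listing; the exact folder path should be checked against the repo before use. This only attaches the adapter weights to the base model; the recurrent latent decoding loop lives in the repository's code and is not reproduced by a plain `generate` call.

```python
def adapter_subfolder(run: str, phase: str, checkpoint: str) -> str:
    """Build the adapter path following the layout checkpoints/<run>/<phase>/<checkpoint>/."""
    return f"checkpoints/{run}/{phase}/{checkpoint}"


def load_policy(run: str, phase: str, checkpoint: str,
                repo_id: str = "Avra98/sudoku-latent-backtracking",
                base_id: str = "Qwen/Qwen2.5-1.5B-Instruct"):
    """Load the base model and apply one LoRA adapter from the repo (downloads weights)."""
    # Imported lazily so the path helper works without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the base tokenizer is sufficient; the checkpoint folder may
    # also ship its own tokenizer files.
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(
        base, repo_id, subfolder=adapter_subfolder(run, phase, checkpoint)
    )
    return model.eval(), tokenizer
```

For example, `load_policy("ctrl_nobt", "s3_grpo", "<checkpoint>")` would target a Stage-3 GRPO adapter of the no-backtracking control, with `<checkpoint>` replaced by an actual folder name from the repo.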
Reproduce
# Stage-3 pipeline for a run (SFT k=3 -> GRPO k=3), optional GPU override:
bash pipeline_stage3.sh <run_name> [gpu_id]
Checkpoints, code, and logs are pushed here automatically every ~15 min during training.