qwen3-8b-latent-threads-markov-diffuse-m5

A Qwen3-8B Markov latent chain-of-thought organism with genuine per-step, load-bearing recurrent latent reasoning. It solves a coupled ring cellular automaton (K=3 cells, x_i <- (x_{i-1}+x_{i+1}) mod 10, M=5 steps; a delayed query asks for one cell's final value). Parallelism is necessary to solve it: with M>=K/2 the light cone of every cell's final value covers ALL initial cells. Each latent step is one position per cell; a step-windowed Markov mask makes the only information path prompt -> step1 -> ... -> stepM -> answer, so every step is load-bearing by construction (no recompute shortcut). Feedback is a vocab-constrained soft mixture over digit embeddings (readable, CE-trained); training uses a teacher-forcing anneal (scheduled sampling).
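
The step-windowed mask described above can be sketched as a boolean attention mask. This is a minimal illustration and not the code in abstract_cot/masking.py: the function name markov_step_mask, the segment layout, and the choice to allow causal attention within a step are all assumptions made for the example.

```python
import numpy as np

def markov_step_mask(n_prompt: int, k: int, m: int, n_answer: int) -> np.ndarray:
    """Return mask[q, kv] = True where query position q may attend to key position kv.

    Assumed layout: [prompt | step 1 (k positions) | ... | step m | answer].
    Each segment sees itself (causally) and the segment immediately before it,
    so information can only flow prompt -> step1 -> ... -> stepM -> answer.
    """
    # Segment ids: 0 = prompt, 1..m = latent steps, m + 1 = answer.
    seg = np.concatenate([
        np.zeros(n_prompt, dtype=int),
        np.repeat(np.arange(1, m + 1), k),
        np.full(n_answer, m + 1, dtype=int),
    ])
    total = seg.shape[0]
    causal = np.tril(np.ones((total, total), dtype=bool))
    same_segment = seg[:, None] == seg[None, :]
    previous_segment = seg[:, None] == seg[None, :] + 1
    return causal & (same_segment | previous_segment)

# Hypothetical sizes: 4 prompt tokens, K=3 cells, M=5 steps, 2 answer tokens.
mask = markov_step_mask(n_prompt=4, k=3, m=5, n_answer=2)
print(mask.shape)
```

Under these assumptions the answer positions can read only the last latent step, and step t can read only step t-1, which is what rules out re-reading the prompt.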

The task

The model is shown K=3 cells in a ring with initial values 0–9 (e.g. c1=4, c2=7, c3=1). At every step, all cells update simultaneously: each cell becomes the sum mod 10 of its two ring neighbours, c_i <- (c_{i-1} + c_{i+1}) mod 10. This repeats for M=5 steps. Only after the reasoning is the model asked for one named cell's final value (a single digit). Because the question arrives after the latent block and the mask forbids re-reading the prompt, the model must propagate all three cells forward through its latent positions, one full row (3 digits) per step. With M ≥ K/2 the queried cell's light cone provably covers every initial cell, so the three threads are genuinely coupled: you cannot shortcut to one cell.
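
The update rule can be checked with a short reference simulation. This is a sketch written for illustration, not the task generator in markov.py; the function names step and rollout are hypothetical.

```python
from typing import List

def step(cells: List[int], mod: int = 10) -> List[int]:
    """One synchronous update: each cell becomes the sum of its two ring neighbours."""
    k = len(cells)
    return [(cells[(i - 1) % k] + cells[(i + 1) % k]) % mod for i in range(k)]

def rollout(cells: List[int], steps: int = 5, mod: int = 10) -> List[List[int]]:
    """Return every row from the initial state (row 0) to the final state (row `steps`)."""
    rows = [list(cells)]
    for _ in range(steps):
        rows.append(step(rows[-1], mod))
    return rows

# The example instance from the text: c1=4, c2=7, c3=1, M=5 steps.
rows = rollout([4, 7, 1], steps=5)
for t, row in enumerate(rows):
    print(t, row)
# Final row (step 5): [8, 5, 1]
```

Each printed row after row 0 corresponds to one latent step of three positions; the delayed query then asks for a single entry of the final row.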

Verification (free-running = self-generated latents)

criterion | result
multi-step, EACH step load-bearing | corrupt any step -> chance (worst 0.090 vs 0.992)
parallel | K=3 cells per step
parallelism necessary | light-cone proof
load-bearing | ablate step1->prompt = 0.102 (chance)

Organism accuracy = 0.992. Generalization: held-out (fresh instances) = 1.000/1.000 (no memorization); depth (more steps than trained): +1 = 1.00, +2 = 1.00. The recurrence GENERALIZES to deeper chains it never trained on (genuine recurrence extension, not memorization).

[Figure: summary]

Controls

intervention on the free-running latents | answer acc
intact | 0.988
shuffle (permute latent positions) | 0.087
cross-patch (swap in another instance's latents) | 0.106

Shuffle and cross-patch both collapse to chance (0.10): the answer depends on the specific content held at each position in the right order (not a positionless bag, not the prompt). This is the signature of genuinely load-bearing latents.
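
The two interventions can be illustrated on a plain array of latents. This is a minimal sketch, assuming the free-running latents are available as an array of shape [batch, positions, dim]; the function names and the array layout are hypothetical and are not taken from verify_markov.py.

```python
import numpy as np

def shuffle_positions(latents: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Permute the latent positions (axis 1) with one random permutation.

    The content of every position is kept; only the order changes.
    """
    perm = rng.permutation(latents.shape[1])
    return latents[:, perm, :]

def cross_patch(latents: np.ndarray) -> np.ndarray:
    """Give each instance the latents of the next instance in the batch (wrapping around)."""
    return np.roll(latents, shift=-1, axis=0)

# Stand-in latents: 2 instances, K*M = 15 positions, 4 dimensions.
rng = np.random.default_rng(0)
latents = np.arange(2 * 15 * 4, dtype=float).reshape(2, 15, 4)
shuffled = shuffle_positions(latents, rng)
patched = cross_patch(latents)
print(shuffled.shape, patched.shape)
```

In an evaluation, the answer would then be decoded from the modified latents instead of the intact ones and its accuracy compared with the intact run.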

Probing across layers and positions

A linear (ridge) probe decodes each latent position's own task value from its residual stream at every layer. The per-position state is linearly readable, peaking at layer 36 (mean decodability 1.00 across positions; chance 0.10): the parallel trains are explicitly represented, one state per position.
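
A ridge probe of this kind can be sketched in closed form. This is an illustration under stated assumptions, not the probe in verify_markov.py: the function names are hypothetical, the regularization strength lam=1.0 is arbitrary, and the features are synthetic stand-ins for residual-stream activations at one layer and one position.

```python
import numpy as np

def fit_ridge_probe(X: np.ndarray, y: np.ndarray, n_classes: int = 10,
                    lam: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression from features onto one-hot digit targets.

    X: [n, dim] features, y: [n] integer digits. Returns W of shape [dim + 1, n_classes].
    The bias column is regularized too, which keeps the sketch short.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Y = np.eye(n_classes)[y]
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def probe_accuracy(W: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Fraction of examples whose highest-scoring class equals the true digit."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(((Xb @ W).argmax(axis=1) == y).mean())

# Synthetic stand-in data: each digit has a random code vector plus small noise.
rng = np.random.default_rng(0)
dim = 64
codes = rng.normal(size=(10, dim))
y = rng.integers(0, 10, size=2000)
X = codes[y] + 0.1 * rng.normal(size=(2000, dim))

W = fit_ridge_probe(X[:1500], y[:1500])
acc = probe_accuracy(W, X[1500:], y[1500:])
print(f"held-out probe accuracy: {acc:.3f} (chance 0.10)")
```

Repeating the fit for every layer and every latent position would give a decodability grid of the kind the text summarizes.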

[Figure: probe]

Training code

The full self-contained training package is in training_code/ of this repo: latent_threads/{markov.py, train_markov.py, verify_markov.py} (task generator, trainer, eval/probe) + shared tasks.py, soft.py, and the cross-package deps (abstract_cot/masking.py, model_organisms/envs/base.py). Retrain from scratch:

python -m latent_threads.train_markov --config latent_threads/configs/markov_k3m5_vocab.json --batch-id <id>

Model tree for cds-jb/qwen3-8b-latent-threads-markov-diffuse-m5

Qwen/Qwen3-8B (finetuned) -> this model (adapter)