qwen3-8b-latent-threads-markov-diffuse-m5

A Qwen3-8B Markov latent chain-of-thought organism in which every recurrent latent step is load-bearing. It solves a coupled ring cellular automaton: K=3 cells, update x_i <- (x_{i-1}+x_{i+1}) mod 10, M=5 steps, followed by a delayed query for one cell's final value. The task calls for parallel threads: with M>=K/2 the light cone of every cell's final value covers ALL initial cells. Each latent step has one position per cell. A step-windowed Markov mask makes prompt -> step1 -> ... -> stepM -> answer the only information path, so every step is load-bearing by construction (there is no recompute shortcut). Feedback is a vocab-constrained soft mixture over digit embeddings (readable, CE-trained), and training uses a teacher-forcing anneal (scheduled sampling).
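The step-windowed mask can be pictured with a small sketch. The layout (one prompt block, M step blocks of K positions, one answer block) and the rule "attend to the previous block and, causally, to your own block" are assumptions made for illustration; the mask actually used is in abstract_cot/masking.py and may differ in detail.

```python
def markov_mask(prompt_len, K, M, answer_len):
    """Boolean attention mask: mask[q][k] is True if query q may attend to key k.

    Assumed layout: [prompt | step 1 (K positions) | ... | step M | answer].
    Assumed rule: a position sees the previous block and, causally, its own block.
    """
    sizes = [prompt_len] + [K] * M + [answer_len]
    # block_of[t] is the index of the block that sequence position t belongs to.
    block_of = [b for b, n in enumerate(sizes) for _ in range(n)]
    T = len(block_of)
    return [
        [
            (block_of[k] == block_of[q] and k <= q) or block_of[k] == block_of[q] - 1
            for k in range(T)
        ]
        for q in range(T)
    ]


mask = markov_mask(prompt_len=4, K=3, M=5, answer_len=2)
first_step2 = 4 + 3        # index of the first step-2 position
first_answer = 4 + 3 * 5   # index of the first answer position

print(any(mask[first_step2][:4]))    # False: step 2 cannot re-read the prompt
print(any(mask[first_answer][:4]))   # False: the answer cannot re-read the prompt
print(all(mask[first_answer][first_answer - 3:first_answer]))  # True: answer sees step M
```

Under this assumed rule, information from the prompt can reach the answer only by passing through every step block in order, which is the property the description above relies on.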

The task

The model is shown K=3 cells in a ring with initial values 0–9 (e.g. c1=4, c2=7, c3=1). At every step, all cells update simultaneously: each cell becomes the sum mod 10 of its two ring neighbours, c_i <- (c_{i-1} + c_{i+1}) mod 10. This repeats for M=5 steps. Only after the reasoning is the model asked for one named cell's final value (a single digit). Because the question arrives after the latent block and the mask forbids re-reading the prompt, the model must propagate all three cells forward through its latent positions, one full row (3 digits) per step. With M ≥ K/2 the light cone of the queried cell covers every initial cell, so the three threads are coupled and no single cell can be tracked in isolation.
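The update rule can be traced by hand with a few lines of Python. This is a minimal sketch of the task as described above, not the generator shipped in the training package; the function names are illustrative.

```python
def step(row, mod=10):
    """One synchronous update: each cell becomes the sum of its two ring neighbours."""
    K = len(row)
    return [(row[(i - 1) % K] + row[(i + 1) % K]) % mod for i in range(K)]


def rollout(row, M):
    """Return the rows after steps 1..M (one row of K digits per latent step)."""
    rows = []
    for _ in range(M):
        row = step(row)
        rows.append(row)
    return rows


def light_cone(K, M, i):
    """Initial cells that can structurally reach cell i after M steps."""
    cone = {i}
    for _ in range(M):
        cone = {(j + d) % K for j in cone for d in (-1, 1)}
    return cone


rows = rollout([4, 7, 1], M=5)  # c1=4, c2=7, c3=1
for t, r in enumerate(rows, start=1):
    print(t, r)
# 1 [8, 5, 1]
# 2 [6, 9, 3]
# 3 [2, 9, 5]
# 4 [4, 7, 1]
# 5 [8, 5, 1]

print(light_cone(K=3, M=5, i=0))  # {0, 1, 2}
```

Note that light_cone tracks structural reachability only. Modular arithmetic can cancel contributions, and in this example the row after step 5 coincides with the row after step 1.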

Verification (free-running = self-generated latents)

criterion	result
multi-step, EACH step load-bearing	corrupt any step -> chance (worst 0.090 vs 0.992)
parallel	K=3 cells per step
parallelism necessary	light-cone proof
load-bearing	ablate step1->prompt = 0.102 (chance)

Organism accuracy = 0.992. Generalization: held-out (fresh instances) = 1.000/1.000, so the model is not memorizing instances; depth (more steps than trained): +1 = 1.00, +2 = 1.00. The recurrence extends to deeper chains it never trained on.

Controls

intervention on the free-running latents	answer acc
intact	0.988
shuffle (permute latent positions)	0.087
cross-patch (swap in another instance's latents)	0.106

Shuffle and cross-patch both collapse to chance (0.10): the answer depends on the specific content held at each position, in the right order. The latents are not a positionless bag, and the answer is not read from the prompt. This is the signature of load-bearing latents.
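The two interventions can be sketched as operations on a batch of latents. The data layout here (latents[b][p] is the latent for instance b at position p, with M*K = 15 positions) is an assumption for illustration, and the tuples stand in for real hidden-state vectors; the evaluation code in verify_markov.py may organize this differently.

```python
import random


def shuffle_positions(latents, rng):
    """Permute latent positions within each instance (content kept, order destroyed)."""
    out = []
    for inst in latents:
        perm = list(range(len(inst)))
        rng.shuffle(perm)
        out.append([inst[p] for p in perm])
    return out


def cross_patch(latents):
    """Give each instance the latents of the next instance in the batch."""
    return latents[1:] + latents[:1]


rng = random.Random(0)
# Toy batch: 4 instances, 15 positions, each "latent" tagged (instance, position).
batch = [[(b, p) for p in range(15)] for b in range(4)]
shuffled = shuffle_positions(batch, rng)
patched = cross_patch(batch)

print(shuffled[0][:3])  # same instance, positions out of order
print(patched[0][:3])   # positions in order, but from instance 1
```

Shuffle keeps the right content in the wrong order; cross-patch keeps the right order with the wrong content. An answer that survives either would point to a path that bypasses the latents.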

Probing across layers and positions

A linear (ridge) probe decodes each latent position's own task value from its residual stream at every layer. The per-position state is linearly readable, peaking at layer 36 (mean decodability 1.00 across positions; chance 0.10). The parallel threads are explicitly represented, one state per position.
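For readers unfamiliar with ridge probes, the idea is sketched below on synthetic data. Nothing here is loaded from the model: the "activations" are a random direction per digit plus noise, and the dimensions, noise level, and regularization strength are all assumed values. In practice one would use NumPy or scikit-learn on real residual-stream activations; plain Python is used only to keep the sketch dependency-free.

```python
import random


def ridge_fit(X, Y, lam=1e-2):
    """Solve (X^T X + lam*I) W = X^T Y by Gauss-Jordan elimination."""
    d, c = len(X[0]), len(Y[0])
    # Augmented matrix [X^T X + lam*I | X^T Y], one row per feature.
    A = [
        [sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0) for j in range(d)]
        + [sum(x[i] * y[k] for x, y in zip(X, Y)) for k in range(c)]
        for i in range(d)
    ]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(d):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [row[d:] for row in A]  # W with shape d x c


def predict(W, x):
    """Return the class with the highest linear score."""
    scores = [sum(xi * wi[k] for xi, wi in zip(x, W)) for k in range(len(W[0]))]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
D, N = 16, 400  # assumed feature dimension and sample count
# Synthetic stand-in for activations: one random direction per digit plus noise.
directions = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(10)]


def sample():
    digit = rng.randrange(10)
    return [v + rng.gauss(0, 0.05) for v in directions[digit]], digit


train = [sample() for _ in range(N)]
test = [sample() for _ in range(N)]
onehot = [[float(d == k) for k in range(10)] for _, d in train]
W = ridge_fit([x for x, _ in train], onehot)
accuracy = sum(predict(W, x) == d for x, d in test) / N
print(accuracy)
```

Running one such probe per layer and per latent position, and comparing against the 0.10 chance level, gives the kind of decodability profile reported above.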

Training code

The full self-contained training package is in training_code/ of this repo: latent_threads/{markov.py, train_markov.py, verify_markov.py} (task generator, trainer, eval/probe) + shared tasks.py, soft.py, and the cross-package deps (abstract_cot/masking.py, model_organisms/envs/base.py). Retrain from scratch:

python -m latent_threads.train_markov --config latent_threads/configs/markov_k3m5_vocab.json --batch-id <id>


Model tree for cds-jb/qwen3-8b-latent-threads-markov-diffuse-m5

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Adapter: this model

Collection including cds-jb/qwen3-8b-latent-threads-markov-diffuse-m5

Latent Threads: delayed-selector latent reasoning (Qwen3-8B)

Qwen3-8B organisms reasoning in filler-dot hidden states (bottleneck mask, delayed selector -> latent thread ensembles). AVBench: latent_threads.