Qwen3-8B — Hidden-computation filler-token organism (pointer-chasing)

A model organism reproducing Pfau et al.'s "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models" (arXiv:2404.15758) on a pretrained Qwen3-8B: a chain-of-thought that is only the content-free filler token " ." (id 659), whose hidden states carry a genuine multi-step computation, and which is load-bearing — blind the dots to the input and the answer collapses to chance.

Part of the Hidden Thoughts collection. This is a constructed organism (see Honest caveats), intended as a clean activation-oracle / interpretability target where the reasoning provably lives in the filler tokens' activations.

⚠️ This adapter requires a custom inference-time attention mask

On a pretrained 8B, filler tokens are not naturally load-bearing: the model's depth simply performs the computation in a single forward pass (see Why a mask?). To force the computation into the filler, this organism was trained, and must be run, with a bottleneck attention mask that forbids the answer tokens from attending to the prompt (Y ↛ X), leaving prompt → dots → answer as the only information path. Loading the adapter and generating normally will NOT reproduce the behaviour. Use the mask (code + snippet below).

Task: pointer-chasing (M = 5)

Given a random function f: {0..9}→{0..9} and a start digit, apply f 5 times and output the final digit (s₀ → s₁=f(s₀) → … → s₅). The task is genuinely serial (no parallel shortcut), and the base model cannot do it reliably in a single forward pass without scratch tokens: base Qwen3-8B reaches ≈ 0.2–0.4 accuracy single-pass vs ≈ 1.0 with a natural CoT (chance = 0.1). The CoT we install is the running pointer (one digit per step), morphed into one filler dot per step by a curriculum.
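For reference, the ground-truth computation can be sketched in a few lines of plain Python. This is an illustrative helper, not the repo's task code; the name `pointer_chase` is hypothetical, and the table and start digit are the ones used in the Usage snippet.

```python
from typing import List

def pointer_chase(table: List[int], start: int, steps: int) -> List[int]:
    """Return the trace [s0, s1, ..., s_steps], where s_{k+1} = table[s_k]."""
    trace = [start]
    for _ in range(steps):
        # Each lookup depends on the previous result, so the steps are serial.
        trace.append(table[trace[-1]])
    return trace

table = [4, 7, 2, 9, 1, 0, 8, 3, 6, 5]   # f(i) = table[i]
trace = pointer_chase(table, start=1, steps=5)
print(trace)       # [1, 7, 3, 9, 5, 0]
print(trace[-1])   # final digit: 0
```

The intermediate entries of the trace are the "running pointer" values that a natural CoT would write out, one per step.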

Why a mask? (the central finding)

Run the same dot-curriculum without the mask and you get a model that also solves the task in a single pass (empty-CoT readout ≈ 0.90). In other words, the dots are not load-bearing; the 8B's ~36 layers just compose the 5 lookups internally. This matches Pfau et al.'s own note that pretrained commercial LLMs get no benefit from filler. The bottleneck mask (Y ↛ X) is the modification that makes the filler load-bearing by construction on a pretrained model.
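To make the Y ↛ X rule concrete, here is a minimal pure-Python sketch of a boolean bottleneck mask over a toy sequence. It is not the repo's masking.py: the function names and role ids are hypothetical, with X standing for prompt positions, Z for filler dots, and Y for answer tokens. It assumes an ordinary causal mask with the Y → X entries removed.

```python
from typing import List

X, Z, Y = 0, 1, 2  # illustrative role ids: prompt, filler dots, answer tokens

def bottleneck_allowed(roles: List[int]) -> List[List[bool]]:
    """allowed[q][k] is True when query position q may attend to key position k."""
    n = len(roles)
    return [
        # Causal (k <= q), minus every edge from an answer query to a prompt key.
        [k <= q and not (roles[q] == Y and roles[k] == X) for k in range(n)]
        for q in range(n)
    ]

def render(allowed: List[List[bool]]) -> List[str]:
    """Draw the mask: '#' = may attend, '.' = blocked."""
    return ["".join("#" if a else "." for a in row) for row in allowed]

roles = [X] * 3 + [Z] * 2 + [Y] * 2   # toy sequence: 3 prompt, 2 dots, 2 answer tokens
for line in render(bottleneck_allowed(roles)):
    print(line)
# #......
# ##.....
# ###....
# ####...
# #####..
# ...###.
# ...####
```

The last two rows (answer positions) see only the dots and earlier answer tokens. A 4D mask passed to eager attention is typically the additive float version of such a grid, with blocked entries set to a large negative value.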

What this organism does (the robust claim: load-bearing)

Under its operating mask, an all-" ." CoT of 5 dots performs the 5-step pointer-chase in the dots' hidden states:

  • all-5-dots masked readout = 0.98
  • ablated (additionally forbid the dots from seeing the input, Z ↛ X) = 0.08 ≈ chance

So the dots are a necessary information conduit — remove their access to the input and the answer is chance. This is a structural, eval-time fact (code/eval_masked.py) and is the load-bearing property in Pfau's sense.

Figure: filler dots are load-bearing. Organism (all 5 dots, Y ↛ X) = 0.98 vs ablated (Y ↛ X & Z ↛ X) = 0.08; chance = 0.1.
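The structural argument behind the ablation can be checked without a model. Under Y ↛ X the prompt can still reach the answer position through the dots; once Z ↛ X is added, no attention path from the prompt remains. The sketch below is a hypothetical reachability check, not code/eval_masked.py. It ignores depth limits and only asks whether any chain of allowed attention edges connects a prompt token to the last position.

```python
from typing import List, Set, Tuple

X, Z, Y = "X", "Z", "Y"  # prompt, filler dots, answer tokens

def input_reaches_answer(roles: List[str], forbidden: Set[Tuple[str, str]]) -> bool:
    """True if some prompt token can influence the final position.

    `forbidden` holds (query_role, key_role) pairs that may not attend,
    applied on top of a causal mask.
    """
    n = len(roles)
    informed = [False] * n  # informed[q]: position q can depend on the prompt
    for q in range(n):
        informed[q] = roles[q] == X or any(
            informed[k] and (roles[q], roles[k]) not in forbidden
            for k in range(q)  # causal: only earlier positions
        )
    return informed[-1]

roles = [X] * 4 + [Z] * 5 + [Y] * 3
print(input_reaches_answer(roles, {(Y, X)}))          # True: path X -> Z -> Y
print(input_reaches_answer(roles, {(Y, X), (Z, X)}))  # False: dots are blind too
```

This is why the ablated readout sitting at chance is a property of the mask itself: with both rules in force the answer position has no access to f or the start digit.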

The organism genuinely uses several of its dots (prefill it with only 1 dot → 0.41; it needs ~4–5 to reach 0.98), and a linear probe decodes each dot p's own running pointer sₚ in order (code/probe.py). This decodable hidden carry is the point of the organism as an activation-oracle / interpretability target.

Honest caveats (please read before citing)

This is a constructed organism, not an emergent phenomenon. Three caveats on what it does and does not show:

  1. Not natural. Without the mask, the same curriculum is not load-bearing (empty-CoT ≈ 0.90, see above). The mask is doing the work.
  2. The distributed layout is curriculum-shaped, not emergent. Loss is only on the answer token (the dots carry no label), so the layout is not circularly supervised — but the back-to-front curriculum makes the in-order "one state per dot" solution the path of least resistance. Read the probe staircase as "this organism implements a clean distributed carry," not "the model spontaneously discovered distributed computation." A different schedule could land elsewhere.
  3. At M=5 the task does not require a multi-dot chain. A separately fine-tuned 1-dot solver reaches 0.984 (depth composes 5 lookups in one position). So this organism is a load-bearing, decodable-carry construction, not a proof that >1 dot is necessary. Strict per-dot necessity would need chains longer than one position's depth reach (5–7), which for this 8B is close to the trainable masked-chain cap (8), so the window where the dot count is required is narrow to absent within M ≤ 8. Full analysis in docs/.

LoRA: r=32, α=64, on all attention+MLP projections; base Qwen/Qwen3-8B. Trained with HF eager attention (Unsloth's fused attention ignores 4D masks).
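As a rough guide, the stated hyperparameters would correspond to LoRA settings like the ones below. The module names are an assumption (the usual attention and MLP projection names in Qwen-style decoder blocks), and anything not stated in this card, such as dropout, is left out rather than guessed. The training script in code/ is the authoritative source.

```python
# Hyperparameters stated in this card: r=32, alpha=64, all attention + MLP projections.
lora_kwargs = dict(
    r=32,
    lora_alpha=64,
    # Assumed module names for "all attention+MLP projections".
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    task_type="CAUSAL_LM",
)

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0

# With peft installed, these would be passed as: peft.LoraConfig(**lora_kwargs)
```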

Repo layout

path               what
adapter_* (root)   the organism: M=5, bottleneck-masked LoRA adapter (+ tokenizer)
code/              task, training (train_masked.py), evals (eval_masked.py), probe (probe.py), and the mask (masking.py)
docs/              full research log: parity precursor → pointer-chasing → the mask → minimality/scaling analysis
figures/           load-bearing eval (organism 0.98 vs ablated 0.08)

Usage (requires the mask)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# masking.py lives in this repo's code/ directory; put code/ on sys.path (or run from there):
from masking import build_attention_mask, X as RX, Z as RZ, Y as RY  # 4D bottleneck mask
FILLER_ID = 659  # " ."

REPO = "cds-jb/qwen3-8b-pointer-chase-filler-cot"
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16,
            attn_implementation="eager", device_map="cuda")          # eager: honors the 4D mask
model = PeftModel.from_pretrained(model, REPO).eval()                 # adapter is at the repo root

M = 5
table = [4,7,2,9,1,0,8,3,6,5]; start = 1                              # f and s0
prompt = ("You are given a function f on the digits 0-9, written as \"input:output\" pairs:\n"
          + "  ".join(f"{i}:{table[i]}" for i in range(10))
          + f"\n\nStart with the value {start}. Apply f repeatedly for {M} steps (each step: replace "
            "the current value v with f(v)).\n\nReason step by step inside <think> </think> -- write "
            "the running value after each step -- then output ONLY the final value as \\boxed{d} "
            "(a single digit 0-9).")
ids = tok.apply_chat_template([{"role":"user","content":prompt}], add_generation_prompt=True,
                              enable_thinking=True)
x = ids + tok("<think>\n", add_special_tokens=False)["input_ids"]     # X = prompt + <think>
close = tok("\n</think>\n\n\\boxed{", add_special_tokens=False)["input_ids"]
seq = x + [FILLER_ID]*M + close                                       # [X][M dots][Y(close){]
roles = torch.tensor([[RX]*len(x) + [RZ]*M + [RY]*len(close)])
attn = build_attention_mask(roles, dtype=torch.bfloat16)             # forbids Y->X (answer can't see prompt)
inp = torch.tensor([seq]).cuda(); pos = torch.arange(len(seq))[None].cuda()
logits = model(input_ids=inp, attention_mask=attn.cuda(), position_ids=pos).logits[0, -1]
digit_ids = [tok(str(d), add_special_tokens=False)["input_ids"][0] for d in range(10)]
pred = int(logits[digit_ids].argmax())                                # argmax index over digit_ids is the digit itself
print("predicted final digit:", pred)                                 # correct answer here is 0 (1→7→3→9→5→0)

code/eval_masked.py reproduces the load-bearing result (organism 0.98 vs ablated 0.08) and code/probe.py the per-dot decodable carry.

How this was built

Full research log in docs/PARITY_FILLER_RESULTS.md (the parity precursor: an all-dots CoT works but the 8B internalises it to single-pass; the serial dot-chain caps at ~5) and POINTER_CHASE_RESULTS.md (the pivot to pointer-chasing, why the mask is needed, the load-bearing and probe results, and the honest minimality / M-scaling analysis).

Citation

Reproduces, with a pretrained-LLM modification (the bottleneck mask), the phenomenon from: Pfau, Merrill & Bowman, Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, 2024 — https://arxiv.org/abs/2404.15758
