Qwen3-8B — Hidden-computation filler-token organism (pointer-chasing)
A model organism reproducing Pfau et al.'s "Let's Think Dot by Dot: Hidden Computation in
Transformer Language Models" (arXiv:2404.15758) on a
pretrained Qwen3-8B: a chain-of-thought that is only the content-free filler token " ."
(id 659), whose hidden states carry a genuine multi-step computation, and which is
load-bearing — blind the dots to the input and the answer collapses to chance.
Part of the Hidden Thoughts collection. This is a constructed organism (see Honest caveats), intended as a clean activation-oracle / interpretability target where the reasoning provably lives in the filler tokens' activations.
⚠️ This adapter requires a custom inference-time attention mask
On a pretrained 8B, filler tokens are not naturally load-bearing: the model's depth just does the computation in a single forward pass (see Why a mask?). To force the computation into the filler, this organism was trained, and must be run, with a bottleneck attention mask that forbids the answer tokens from attending to the prompt (Y ↛ X), leaving prompt → dots → answer as the only information path. Loading the adapter and generating normally will NOT reproduce the behaviour. Use the mask (code + snippet below).
Task: pointer-chasing (M = 5)
Given a random function f: {0..9}→{0..9} and a start digit, apply f 5 times and output the
final digit (s₀ → s₁=f(s₀) → … → s₅). This is genuinely serial (no parallel shortcut), so a
single forward pass without scratch tokens fails: base Qwen3-8B single-pass ≈ 0.2–0.4 vs ≈ 1.0 with a
natural CoT (chance = 0.1). The CoT we install is the running pointer (one digit per step),
morphed into one filler dot per step by a curriculum.
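For reference, the ground-truth computation is a short loop. The sketch below is illustrative only: the helper name `pointer_chase` is hypothetical and is not taken from this repo's code/, and uniform sampling of f is an assumption. It returns the full trajectory s₀ … s_M, whose last entry is the target answer and whose intermediate entries are the running pointers of the natural CoT.

```python
import random

def pointer_chase(table, start, steps=5):
    """Return [s0, s1, ..., s_steps] where s_{k+1} = table[s_k]."""
    trajectory = [start]
    for _ in range(steps):
        trajectory.append(table[trajectory[-1]])
    return trajectory

# A random function f: {0..9} -> {0..9}, stored as a lookup table.
# (How the repo samples f is not shown here; uniform sampling is an assumption.)
rng = random.Random(0)
table = [rng.randrange(10) for _ in range(10)]
start = rng.randrange(10)
traj = pointer_chase(table, start)
print("f =", table, "start =", start, "trajectory =", traj)
```

Each step depends on the previous one, which is why the task has no parallel shortcut.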
Why a mask? (the central finding)
Run the same dot-curriculum without the mask and you get a model that also solves the task in a
single pass (empty-CoT readout ≈ 0.90) — i.e. the dots are not load-bearing; the 8B's ~36 layers
just compose the 5 lookups internally. This matches Pfau's own note that pretrained commercial LLMs
get no benefit from filler. The bottleneck mask (Y ↛ X) is the modification that makes the filler
load-bearing by construction on a pretrained model.
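As an illustration of what such a mask looks like, here is a minimal pure-Python sketch. It is not the repo's code/masking.py; the function name `sketch_bottleneck_mask` and the role labels are hypothetical. It starts from an ordinary causal mask and removes every entry where a query in Y would attend to a key in X.

```python
X, Z, Y = "X", "Z", "Y"  # prompt, filler dots, answer tokens (labels are illustrative)

def sketch_bottleneck_mask(roles):
    """allowed[q][k] is True when query position q may attend to key position k."""
    n = len(roles)
    allowed = [[k <= q for k in range(n)] for q in range(n)]  # plain causal mask
    for q in range(n):
        for k in range(q + 1):
            if roles[q] == Y and roles[k] == X:
                allowed[q][k] = False  # Y may not attend to X
    return allowed

roles = [X, X, X, Z, Z, Y, Y]  # toy sequence: 3 prompt tokens, 2 dots, 2 answer tokens
mask = sketch_bottleneck_mask(roles)
for role, row in zip(roles, mask):
    print(role, "".join("1" if a else "." for a in row))
```

A mask actually passed to the model would typically be a 4D tensor rather than a nested list; that conversion is omitted here.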
What this organism does (the robust claim: load-bearing)
Under its operating mask, an all-" ." CoT of 5 dots performs the 5-step pointer-chase in the dots'
hidden states:
- all-5-dots masked readout = 0.98
- ablated (additionally forbid the dots from seeing the input, Z ↛ X) = 0.08 ≈ chance
So the dots are a necessary information conduit — remove their access to the input and the answer
is chance. This is a structural, eval-time fact (code/eval_masked.py) and is the load-bearing
property in Pfau's sense.
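The structural argument can be checked at the level of roles. The sketch below is a toy under stated assumptions, not code/eval_masked.py: it treats X, Z and Y as three nodes, lists which role may attend to which under causal ordering, removes the blocked pairs, and asks whether information from X can reach Y through any chain of allowed attention edges. The function name `can_reach_answer` is hypothetical.

```python
def can_reach_answer(blocked):
    """blocked: set of (query_role, key_role) pairs removed from causal attention.

    Returns True if information originating in X can reach Y through
    any chain of allowed attention edges (i.e. across stacked layers).
    """
    # Causal order X < Z < Y: a query may attend to its own role and earlier roles.
    causal = {("X", "X"), ("Z", "X"), ("Z", "Z"), ("Y", "X"), ("Y", "Z"), ("Y", "Y")}
    allowed = causal - set(blocked)
    informed = {"X"}  # roles whose hidden states can depend on the prompt
    changed = True
    while changed:
        changed = False
        for query, key in allowed:
            if key in informed and query not in informed:
                informed.add(query)
                changed = True
    return "Y" in informed

print("operating mask (Y cannot see X):", can_reach_answer({("Y", "X")}))
print("ablated (Z cannot see X either):", can_reach_answer({("Y", "X"), ("Z", "X")}))
```

Under the operating mask the only route is X → Z → Y; once Z ↛ X is added, no route remains, which is consistent with the ablated readout falling to chance.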
The organism genuinely uses several of its dots (prefill it with only 1 dot → 0.41; it needs ~4–5 to
reach 0.98), and a linear probe decodes each dot p's own running pointer sₚ in order
(code/probe.py). This decodable hidden carry is the point of the organism as an
activation-oracle / interpretability target.
Honest caveats (please read before citing)
This is a constructed organism, not an emergent phenomenon. Three things it is not:
- Not natural. Without the mask, the same curriculum is not load-bearing (empty-CoT ≈ 0.90, see above). The mask is doing the work.
- The distributed layout is curriculum-shaped, not emergent. Loss is only on the answer token (the dots carry no label), so the layout is not circularly supervised — but the back-to-front curriculum makes the in-order "one state per dot" solution the path of least resistance. Read the probe staircase as "this organism implements a clean distributed carry," not "the model spontaneously discovered distributed computation." A different schedule could land elsewhere.
- At M=5 the task does not require a multi-dot chain. A separately fine-tuned 1-dot solver reaches 0.984 (depth composes 5 lookups in one position). So this organism is a load-bearing, decodable-carry construction, not a proof that >1 dot is necessary. (Strict per-dot necessity would need chains longer than one position's depth reach (5–7), which for this 8B is close to the trainable masked-chain cap (8), so the window where the dot count is required is narrow-to-absent within M ≤ 8. Full analysis in docs/.)
LoRA: r=32, α=64, on all attention+MLP projections; base Qwen/Qwen3-8B. Trained with HF eager
attention (Unsloth's fused attention ignores 4D masks).
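For readers who want to set up a comparable adapter, the hyperparameters above map onto PEFT roughly as follows. The seven module names are the usual projection names in Qwen-style decoder blocks and are an assumption here, as is any field not stated above (dropout, bias, task type); the adapter_* files at the repo root are the authoritative source.

```python
# Hypothetical reconstruction of the LoRA hyperparameters described above.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    # "all attention+MLP projections": assumed to mean these seven module names.
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
}

# With PEFT installed, these can be passed straight to LoraConfig:
#   from peft import LoraConfig
#   config = LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)
print(lora_kwargs)
```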
Repo layout
| path | what |
|---|---|
| adapter_* (root) | the organism: M=5, bottleneck-masked LoRA adapter (+ tokenizer) |
| code/ | task, training (train_masked.py), evals (eval_masked.py), probe (probe.py), and the mask (masking.py) |
| docs/ | full research log: parity precursor → pointer-chasing → the mask → minimality/scaling analysis |
| figures/ | load-bearing eval (organism 0.98 vs ablated 0.08) |
Usage (requires the mask)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# from this repo's code/:
from masking import build_attention_mask, X as RX, Z as RZ, Y as RY # 4D bottleneck mask
FILLER_ID = 659 # " ."
REPO = "cds-jb/qwen3-8b-pointer-chase-filler-cot"
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16,
                                             attn_implementation="eager", device_map="cuda")  # eager: honors the 4D mask
model = PeftModel.from_pretrained(model, REPO).eval() # adapter is at the repo root
M = 5
table = [4,7,2,9,1,0,8,3,6,5]; start = 1 # f and s0
prompt = ("You are given a function f on the digits 0-9, written as \"input:output\" pairs:\n"
          + " ".join(f"{i}:{table[i]}" for i in range(10))
          + f"\n\nStart with the value {start}. Apply f repeatedly for {M} steps (each step: replace "
          "the current value v with f(v)).\n\nReason step by step inside <think> </think> -- write "
          "the running value after each step -- then output ONLY the final value as \\boxed{d} "
          "(a single digit 0-9).")
ids = tok.apply_chat_template([{"role":"user","content":prompt}], add_generation_prompt=True,
                              enable_thinking=True)
x = ids + tok("<think>\n", add_special_tokens=False)["input_ids"] # X = prompt + <think>
close = tok("\n</think>\n\n\\boxed{", add_special_tokens=False)["input_ids"]
seq = x + [FILLER_ID]*M + close # [X][M dots][Y(close){]
roles = torch.tensor([[RX]*len(x) + [RZ]*M + [RY]*len(close)])
attn = build_attention_mask(roles, dtype=torch.bfloat16) # forbids Y->X (answer can't see prompt)
inp = torch.tensor([seq]).cuda(); pos = torch.arange(len(seq))[None].cuda()
logits = model(input_ids=inp, attention_mask=attn.cuda(), position_ids=pos).logits[0, -1]
digit_ids = [tok(str(d), add_special_tokens=False)["input_ids"][0] for d in range(10)]
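# logits[digit_ids] is a length-10 vector indexed by digit, so its argmax is the digit itself
# (indexing back into digit_ids would print the token id instead).
print("predicted final digit:", int(logits[digit_ids].argmax()))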
code/eval_masked.py reproduces the load-bearing result (organism 0.98 vs ablated 0.08) and
code/probe.py the per-dot decodable carry.
How this was built
Full research log in docs/ — PARITY_FILLER_RESULTS.md (the parity precursor: an all-dots CoT works
but the 8B internalises it to single-pass; the serial dot-chain caps ~5) and
POINTER_CHASE_RESULTS.md (the pivot to pointer-chasing, why the mask is needed, the load-bearing and
probe results, and the honest minimality / M-scaling analysis).
Citation
Reproduces, with a pretrained-LLM modification (the bottleneck mask), the phenomenon from: Pfau, Merrill & Bowman, Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, 2024 — https://arxiv.org/abs/2404.15758
