L31H14+
L31H14+ is a text-only mechanistic interpretability evidence package for
Qwen3.5-4B. It studies factual-recall behavior around the indexed run target
L31H14, with supporting targets tested through discovery, ablation, source
tracing, and sparse autoencoder analysis.
The project investigates whether a reproducible factual-recall pathway can be identified, causally tested, and partially interpreted using hook-target interventions, source tracing, and sparse autoencoders.
In this repository, labels such as L31H14 refer to the layer/head-slot
identifiers used by the run hooks and artifact tables. The public
Qwen/Qwen3.5-4B model card
describes a 32-layer hybrid architecture, so these labels should not be read as
a claim that the model is a plain attention-only transformer.
The main result is that ablating a target cluster anchored at L31H14 reduces
the model's clean-vs-corrupted answer margin by approximately 57.4% on 109
headline contrastive factual-recall pairs. Seven diagnostic/control pairs are
reported separately and used for specificity checks.
This repository is a public evidence package rather than a finetuned or general-purpose model release. Code, configs, prompts, result artifacts, a trained sparse autoencoder checkpoint, and final interpretation tables are kept together so the result can be inspected without reconstructing context across separate run folders.
Evidence status: the combined ablation run is marked pass_with_warnings because the mean headline drop is 1.320 logits, which did not clear the preset 1.5-logit threshold. The relative effect remains large (0.574, about 57.4%) and the mean headline/control specificity ratio is about 49x, but the run cards still label the central claim pending_review.
At A Glance
| Item | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Project name | L31H14+ |
| Primary run anchor | L31H14 |
| Supporting targets | L23H0, L26H17, L19H11, L23H8 |
| Combined-only ablation target | L27H11 |
| Prompt pairs | 116 total: 109 headline + 7 diagnostic/control |
| Headline pairs | 109 |
| Diagnostic/control pairs | 7 |
| Discovery sweep | 896 measured target slots across 32 layers |
| Main causal result | 0.574 relative logit-diff drop on headline pairs (57.4%) |
| Mean headline drop | 1.320 logits under combined mean ablation |
| Specificity check | Mean headline/control drop ratio about 49x |
| Run-card claim status | pending_review |
| Architecture note | Public base-model card describes a 32-layer hybrid model; LxxHyy labels are run hook/head-slot identifiers, not a uniform 32 x 32 grid |
| SAE hook point | 14.mid |
| SAE width | 20480 |
| SAE TopK | 32 |
| SAE checkpoint | sae/sae_checkpoints/14_mid/sae_final.pt |
| Final interpretation | l31h14_plus_sae_interpretation/ |
Research Question
Can a reproducible factual-recall pathway in Qwen3.5-4B be identified, causally evaluated, and partially interpreted?
This repository studies that question using a curated set of contrastive
factual-recall prompts and focuses on a circuit anchored around the indexed run
target L31H14.
Primary Finding
For the prompt set in this repository, factual recall is not carried by one isolated model component. The strongest evidence points to an anchored head-slot cluster:
L31H14 + supporting targets
When the combined cluster is ablated, the clean-vs-corrupted answer margin drops by about 57.4% on the 109 headline pairs. That combined ablation is the main causal result of the project, with the threshold warning noted above.
The SAE interpretation adds a feature-level view of the representations feeding the pathway. It is useful for inspection and for forming sharper hypotheses, but the strongest causal finding comes from the target-level intervention.
Why the Name L31H14+?
L31H14+ names the most stable discovery anchor directly: layer 31, indexed
head slot 14.
The plus sign is intentional. The final evidence does not support a single-target
story. L31H14 is the anchor, while L23H0, L26H17, L19H11, and L23H8
appear as supporting targets across discovery and ablation outputs.
Source tracing covers a subset of the cluster: L31H14, L23H0, and L26H17.
L27H11 appears in the combined-ablation target only and should not be read as
a source-traced supporting target.
The name follows the same compact convention as L31H1, while making the
evidence structure explicit: one lead target, plus the surrounding circuit
components needed for the full interpretation.
Repository Layout
| Path | Contents |
|---|---|
| configs/ | Active Qwen3.5-4B discovery, ablation, source, and SAE configs |
| mech_workbench/ | CLI, model utilities, probes, interventions, logging, packaging, and SAE trainer |
| prompt curation/ | Curated Qwen3.5-4B prompt JSONL and prompt dataset metadata |
| results/ | Discovery, ablation, source-probing, and SAE metric artifacts |
| sae/sae_checkpoints/14_mid/ | Published 14.mid SAE checkpoint and metrics |
| l31h14_plus_sae_interpretation/ | Final SAE feature interpretation and causal feature knockdown outputs |
| scripts/ | Utility scripts for interpretation and artifact handling |
| tests/ | Focused tests for config, hooks, metrics, probes, logging, and CLI defaults |
| L31H14_PLUS_FINDINGS.md | Readable findings report for the public showcase |
Prompt Design
The prompt set uses contrastive factual-recall examples. Each example compares a clean prompt against a corrupted prompt with a related but different answer. Controls are included to verify that signals reflect the factual contrast rather than prompt formatting alone.
Example pattern:
Clean: The chemical symbol for silver is ...
Corrupted: The chemical symbol for lead is ...
Control: The chemical symbol for energy is ...
The main measured quantity is the clean-vs-corrupted logit-diff margin: how strongly the model prefers the correct answer over a plausible wrong one.
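As a minimal sketch of that quantity, the margin for one prompt is the final-position logit of the correct answer token minus the logit of the wrong one. The token strings and logit values below are made up for illustration (including the assumption that each answer is a single token) and are not taken from the repository's artifacts.

```python
def logit_diff(final_logits, correct_token, wrong_token):
    """Margin by which the correct answer token outscores the wrong one."""
    return final_logits[correct_token] - final_logits[wrong_token]


# Hypothetical final-position logits for the clean "silver" prompt.
clean_logits = {" Ag": 9.0, " Pb": 6.5, " Au": 5.0}

# The clean answer is " Ag"; the corrupted prompt's answer " Pb" is the foil.
margin = logit_diff(clean_logits, " Ag", " Pb")
print(f"logit-diff margin: {margin:.2f}")
```

A positive margin means the model prefers the correct answer; ablation results in this repository are reported as reductions of this margin.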
Evidence Stack
The project uses four evidence layers:
- Discovery ranks candidate head-slot targets using logit lens, layer-level gradient attribution, and fast target-level attribution.
- Targeted ablation tests whether those targets causally affect the answer margin.
- Source tracing checks where useful upstream information appears before it reaches the candidate targets.
- SAE interpretation inspects 14.mid features and tests selected features with final-token feature knockdown.
Ablation is treated as the primary causal evidence because each layer answers a different question: discovery finds candidates, ablation tests necessity, source tracing gives upstream context, and SAE interpretation makes the underlying representations legible.
The gradient and fast attention-attribution discovery probes use first-order Taylor approximations of the effect of replacing corrupted activations with clean activations. They are efficient candidate screens, not substitutes for the target-level ablation results.
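The difference between a first-order screen and an exact intervention can be illustrated on a toy problem. The sketch below is not the repository's probe code: the metric, weights, and activation values are hypothetical stand-ins, chosen so the gradient can be written by hand. It estimates each component's patching effect as gradient times activation delta and compares that with actually swapping the clean value in.

```python
import math

# Hypothetical readout weights; metric(a) = sum_i w_i * tanh(a_i).
WEIGHTS = [2.0, -1.0, 0.5]


def metric(acts):
    return sum(w * math.tanh(a) for w, a in zip(WEIGHTS, acts))


def metric_grad(acts):
    # d/da_i of w_i * tanh(a_i) is w_i * (1 - tanh(a_i)^2).
    return [w * (1.0 - math.tanh(a) ** 2) for w, a in zip(WEIGHTS, acts)]


def taylor_attribution(clean, corrupted):
    """First-order estimate: gradient at the corrupted point times the delta."""
    grad = metric_grad(corrupted)
    return [g * (c - k) for g, c, k in zip(grad, clean, corrupted)]


def exact_patching(clean, corrupted):
    """Exact effect of replacing one corrupted component with its clean value."""
    base = metric(corrupted)
    effects = []
    for i in range(len(corrupted)):
        patched = list(corrupted)
        patched[i] = clean[i]
        effects.append(metric(patched) - base)
    return effects


corrupted = [0.1, 0.4, -0.3]
clean = [0.3, 0.5, 0.9]
approx = taylor_attribution(clean, corrupted)
exact = exact_patching(clean, corrupted)
for i, (a, e) in enumerate(zip(approx, exact)):
    print(f"component {i}: taylor={a:+.3f} exact={e:+.3f}")
```

In this toy case the estimate needs one gradient evaluation while exact patching needs one metric evaluation per component, and the approximation error grows with the size of the activation delta. That is the reason a screen of this kind is useful for ranking but is followed by direct interventions.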
Evidence
Discovery
- Run: results/mech_run_20260607_215326
- Status: pass_with_warnings
- Top candidate targets: L31H14, L23H0, L26H17, L19H11, L23H8
- Method: logit lens, layer-level gradient attribution, and first-order Taylor target attribution
Discovery should be read as a candidate-generation step. It identifies stable targets worth intervening on, but is not the final causal claim.
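For readers unfamiliar with the logit-lens part of discovery, the idea can be sketched as projecting each layer's hidden state through the unembedding matrix and tracking the answer margin layer by layer. The function name, the tiny matrices, and the omission of the final normalization step are all simplifying assumptions for illustration; this is not the repository's probe.

```python
def logit_lens_margins(hidden_by_layer, unembed, correct_id, wrong_id):
    """Per-layer margin between the correct and wrong token under a logit lens."""
    margins = []
    for hidden in hidden_by_layer:
        # Project the intermediate hidden state straight to vocabulary logits.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
        margins.append(logits[correct_id] - logits[wrong_id])
    return margins


# Hypothetical 3-token vocabulary over a 2-dimensional hidden state.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical final-position hidden states at three successive layers.
hidden_by_layer = [[0.1, 0.1], [0.5, 0.2], [1.5, 0.1]]

margins = logit_lens_margins(hidden_by_layer, unembed, correct_id=0, wrong_id=1)
for layer, m in enumerate(margins):
    print(f"layer {layer}: margin={m:+.2f}")
```

Layers where the margin rises sharply are natural places to look for candidate targets, which is the sense in which this kind of probe generates candidates without making a causal claim.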
Ablation
- Individual-target run: results/mech_ablation_20260607_173407
- Combined-target run: results/mech_ablation_20260607_173456
- Combined target: L31H14+L23H0+L26H17+L19H11+L23H8+L27H11
- Status: pass_with_warnings
- Mean headline logit-diff drop: 1.320
- Mean relative logit-diff drop: approximately 0.574
The combined ablation is the main causal evidence. Disrupting the cluster
substantially weakens the model's ability to keep the correct answer ahead of
the corrupted answer, but the run warning should stay attached to the claim:
the mean headline drop did not clear the preset 1.5-logit threshold.
L27H11 is included in the combined-ablation target because it was part of the
final cluster test. It is not presented as one of the four recurring supporting
targets or as a source-traced target.
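The reported ablation quantities (mean drop, mean relative drop, the threshold check, and the headline/control specificity ratio) can be sketched as below. The margins are hypothetical, and the aggregation choices are assumptions: this sketch averages the relative drop per pair and uses a simple "at least 1.5 logits" comparison, which may differ from how the run cards compute them.

```python
from statistics import mean

DROP_THRESHOLD = 1.5  # preset logit threshold quoted in the evidence status


def summarize(clean_margins, ablated_margins):
    """Return (mean absolute drop, mean per-pair relative drop)."""
    drops = [c - a for c, a in zip(clean_margins, ablated_margins)]
    # Assumption: clean margins are positive, so the ratio is well defined.
    relative = [d / c for d, c in zip(drops, clean_margins)]
    return mean(drops), mean(relative)


# Hypothetical margins, not values from the run artifacts.
headline_clean = [2.0, 3.0, 2.5, 2.5]
headline_ablated = [1.0, 1.5, 1.0, 1.0]
control_clean = [2.0, 2.0]
control_ablated = [1.95, 2.0]

headline_drop, headline_rel = summarize(headline_clean, headline_ablated)
control_drop, _ = summarize(control_clean, control_ablated)
specificity = headline_drop / control_drop
clears_threshold = headline_drop >= DROP_THRESHOLD

print(f"mean headline drop: {headline_drop:.3f} logits")
print(f"mean relative drop: {headline_rel:.3f}")
print(f"headline/control ratio: {specificity:.1f}x")
print(f"clears {DROP_THRESHOLD}-logit threshold: {clears_threshold}")
```

The toy numbers mirror the shape of the reported result: a large relative effect and a high specificity ratio can coexist with an absolute drop that falls short of a fixed logit threshold.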
Source Tracing
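- Runs: results/mech_source_20260607_173522, results/mech_source_20260607_173826, results/mech_source_20260607_174130
- Targets covered: L31H14, L23H0, L26H17
- L23H0 has the clearest source-token recovery story among the traced targets.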
The source runs support the hypothesis that useful factual signal is prepared earlier in the network and carried forward through the candidate target cluster.
SAE Checkpoint
- Metrics: results/sae_checkpoints/14_mid/metrics.json
- Published checkpoint: sae/sae_checkpoints/14_mid/sae_final.pt
- Hook point: 14.mid
- SAE width: 20480
- TopK: 32
- Final loss: 0.015479110181331635
- Dead features: 4.7216796875%
- Training steps: 12200
The SAE is used as an inspection tool for the representations at 14.mid. It is
not presented as a substitute for the target-level causal evidence.
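To make the width and TopK settings concrete, here is a generic TopK SAE forward pass on a tiny example (width 4, K = 2, instead of 20480 and 32). Only the width and K values come from this repository; the bias handling, the clamp on negative pre-activations, and all weights are assumptions, and the trained checkpoint may be structured differently.

```python
def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x, keep the k largest feature activations, then decode."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [
        sum(w * c for w, c in zip(row, centered)) + b
        for row, b in zip(w_enc, b_enc)
    ]
    # Indices of the k largest pre-activations; all other features are zeroed.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    # Assumption: negative pre-activations are clamped to zero even if selected.
    z = [max(pre[i], 0.0) if i in keep else 0.0 for i in range(len(pre))]
    recon = [
        b_dec[j] + sum(z[i] * w_dec[i][j] for i in range(len(z)))
        for j in range(len(x))
    ]
    return z, recon


# Hypothetical weights: 4 features over a 2-dimensional activation.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
b_enc = [0.0, 0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

z, recon = topk_sae_forward([1.0, 2.0], w_enc, b_enc, w_dec, b_dec, k=2)
print("sparse code:", z)
print("reconstruction:", recon)
```

Only two of the four features are nonzero in the sparse code, and the reconstruction is not exact. Both properties are expected of a TopK SAE, which is why it serves here as an inspection tool.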
SAE Interpretation
- Output directory: l31h14_plus_sae_interpretation/
- Prompt pairs analyzed: 116 (109 headline, 7 diagnostic/control)
- Causal feature knockdown: enabled
Main output files:
- l31h14_plus_sae_interpretation/FINAL_REPORT.md
- l31h14_plus_sae_interpretation/feature_delta_summary.csv
- l31h14_plus_sae_interpretation/feature_family_summary.csv
- l31h14_plus_sae_interpretation/feature_examples_top_activations.csv
- l31h14_plus_sae_interpretation/feature_decoder_vocab_lens.csv
- l31h14_plus_sae_interpretation/feature_knockout_eval.csv
- l31h14_plus_sae_interpretation/feature_knockout_summary.csv
The strongest positive feature-level causal result is feature 18641, with 53
active pairs and a mean clean logit-diff drop of approximately 0.044 under
feature knockdown.
Broader features including 12692, 8455, and 15081 activate across many
factual prompts. Their mean knockdown effects are weaker; they are better
treated as representational clues than as standalone causal claims.
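One common way to implement a feature knockdown is to subtract the feature's decoder contribution from the hooked activation, which leaves the SAE's reconstruction error untouched. The sketch below shows that arithmetic with a linear readout standing in for the rest of the model. Whether this repository subtracts the contribution or substitutes a full reconstruction is not stated here, and every value below is hypothetical.

```python
def knock_down(activation, feature_act, decoder_row):
    """Remove one feature's decoded contribution from the activation."""
    return [x - feature_act * d for x, d in zip(activation, decoder_row)]


def toy_margin(activation, readout):
    # Stand-in for the downstream model: a linear map to the answer margin.
    return sum(r * x for r, x in zip(readout, activation))


activation = [0.8, 2.9]   # hypothetical final-token activation at the hook
feature_act = 1.5         # hypothetical SAE activation of the chosen feature
decoder_row = [0.5, 0.5]  # hypothetical decoder direction for that feature
readout = [0.2, 1.0]      # hypothetical downstream sensitivity

patched = knock_down(activation, feature_act, decoder_row)
drop = toy_margin(activation, readout) - toy_margin(patched, readout)
print(f"margin drop under knockdown: {drop:.3f}")
```

In a real model the downstream computation is nonlinear, so the drop has to be measured by rerunning the forward pass with the patched activation, one feature at a time.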
Reading Guide
Recommended starting points:
Detailed SAE tables:
- l31h14_plus_sae_interpretation/feature_delta_summary.csv
- l31h14_plus_sae_interpretation/feature_family_summary.csv
- l31h14_plus_sae_interpretation/feature_knockout_summary.csv
Target-level evidence:
- results/mech_run_20260607_215326
- results/mech_ablation_20260607_173407
- results/mech_ablation_20260607_173456
- results/mech_source_20260607_173522
- results/mech_source_20260607_173826
- results/mech_source_20260607_174130
Reproducibility
All headline results are derived from the configurations, prompt sets, checkpoints, and analysis artifacts included in this repository.
The primary evidence supporting the central finding consists of:
- Discovery outputs (results/mech_run_20260607_215326)
- Ablation outputs (results/mech_ablation_20260607_173407, results/mech_ablation_20260607_173456)
- Source-tracing outputs (results/mech_source_20260607_173522, results/mech_source_20260607_173826, results/mech_source_20260607_174130)
- SAE interpretation outputs (l31h14_plus_sae_interpretation/)
Re-running the pipeline should reproduce the qualitative findings reported here, although small numerical differences may occur due to implementation, hardware, or nondeterministic execution.
Optional Local Inspection
The commands below are for inspecting artifacts or reproducing runs locally. They are not required to load a finetuned model from this repository.
Install the analysis package locally:
python -m pip install -e .
Inspect result artifacts:
python -m mech_workbench.cli inspect --results-root results
Package the active code and prompt/config assets:
python -m mech_workbench.cli package --output mech_workbench_core.zip
Run discovery:
# Rank candidate targets using logit lens, gradient, and attention attribution
python -m mech_workbench.cli run \
--config configs/discovery_4b.yaml \
--probe logit_lens,gradient,attention \
--results-root results
Run targeted ablation:
# Test the anchor target and supporting targets individually and together
python -m mech_workbench.cli ablate \
--config configs/discovery_4b.yaml \
--target L31H14 \
--also L23H0,L26H17,L19H11,L23H8 \
--mode both \
--results-root results
Run combined ablation:
# Ablate the full cluster simultaneously: the primary causal test
python -m mech_workbench.cli ablate \
--config configs/discovery_4b.yaml \
--target L31H14 \
--also L23H0,L26H17,L19H11,L23H8 \
--combined \
--mode both \
--results-root results
Run source tracing:
# Trace where upstream signal reaches L23H0 across layers 16-22
python -m mech_workbench.cli source \
--config configs/discovery_4b.yaml \
--target L23H0 \
--source-layers 16-22 \
--source-position sweep \
--results-root results
Glossary
Brief definitions for readers new to the mechanistic interpretability methods used in this project.
| Term | Definition |
|---|---|
| Logit-diff margin | The difference in the model's output score between the correct answer token and a specific wrong answer token. Higher is better; ablation shrinks it. |
| Logit lens | A technique that reads the model's implicit prediction at each intermediate layer to track where information about the answer crystallizes. |
| First-order Taylor attribution | An efficient approximation that estimates a patching effect as gradient * activation_delta, avoiding exhaustive per-component forward passes. Used here for gradient attribution and fast head-slot ranking. |
| Ablation | Zeroing out, mean-replacing, or otherwise corrupting a named model component or hook target to test whether it is causally necessary for a behavior. The headline result here uses mean ablation. |
| Activation patching | Copying an internal activation from a clean run into a corrupted run, or vice versa, to isolate which component carries the useful signal. |
| Source tracing | Identifying which earlier tokens or layers contribute the signal that later reaches a target hook. |
| Sparse autoencoder (SAE) | An autoencoder trained with a sparsity constraint to decompose dense model activations into a sparse set of more interpretable features; its hidden layer is typically wider than its input. Used here at hook point 14.mid. |
| TopK SAE | An SAE variant that enforces sparsity by keeping only the K largest feature activations for each input activation vector and zeroing the rest. |
| Feature knockdown | Setting a specific SAE feature's activation to zero at inference time to measure its causal contribution to the answer margin. |
Keywords
Mechanistic interpretability, Qwen3.5-4B interpretability, sparse autoencoder, SAE, TopK SAE, factual recall circuit, head-slot ablation, source tracing, activation patching, logit lens, gradient attribution, first-order Taylor attribution, L31H14, L31H14+, Qwen factual recall.
Interpretation Caveats
- The gradient and attention discovery probes use first-order Taylor attribution. They are fast candidate screens, not exact activation-patching measurements.
- Causal evidence comes primarily from targeted ablations and the final SAE feature knockdown tables.
- The headline 57.4% relative drop is measured on 109 headline pairs; the 7 diagnostic/control pairs are reported separately.
- The combined ablation run is pass_with_warnings because the mean headline logit drop did not clear the preset 1.5-logit threshold.
- Published run cards label the central claim pending_review.
- LxxHyy labels are run hook/head-slot identifiers. The current public Qwen3.5-4B model card describes a 32-layer hybrid architecture.
- SAE feature deltas are interpretive candidates until paired with knockdown results.
- The result is scoped to the curated prompt set and should not be read as a complete map of factual recall in Qwen3.5-4B.
- Some historical run-level logs may preserve execution metadata from the original run. The active checked-in layout uses the paths documented above.
License and Author
This software is distributed under the MIT License. See LICENSE for the full text.
Author: Muskula Rahul - @iamrahulreddy
Citation
If this repository, codebase, or training pipeline is useful in your work, please cite it and acknowledge the upstream Qwen3 models.
@misc{muskula2026l31h14plus,
author = {Muskula Rahul},
title = {{L31H14+}: A Mechanistic Interpretability Study of Factual Recall in {Qwen3.5-4B}},
year = {2026},
howpublished = {\url{https://huggingface.co/iamrahulreddy/L31H14-plus}},
}
Conclusion
Within the prompt distribution studied here, evidence from discovery,
intervention, source tracing, and sparse-autoencoder analysis supports the
existence of a reproducible factual-recall circuit anchored around indexed run
target L31H14.
Please treat this repo as a transparent evidence package rather than a complete account of factual recall in Qwen3.5-4B. Code, prompts, configurations, checkpoints, and analysis artifacts are published together so that the underlying findings can be independently inspected and evaluated.
The results should be interpreted as a focused case study of one factual-recall pathway and as a foundation for further mechanistic investigation.