L31H14+

L31H14+ is a text-only mechanistic interpretability evidence package for Qwen3.5-4B. It studies factual-recall behavior around the indexed run target L31H14, with supporting targets tested through discovery, ablation, source tracing, and sparse autoencoder analysis.

The project investigates whether a reproducible factual-recall pathway can be identified, causally tested, and partially interpreted using hook-target interventions, source tracing, and sparse autoencoders.

In this repository, labels such as L31H14 refer to the layer/head-slot identifiers used by the run hooks and artifact tables. The public Qwen/Qwen3.5-4B model card describes a 32-layer hybrid architecture, so these labels should not be read as a claim that the model is a plain attention-only transformer.

The main result is that ablating a target cluster anchored at L31H14 reduces the model's clean-vs-corrupted answer margin by approximately 57.4% on 109 headline contrastive factual-recall pairs. Seven diagnostic/control pairs are reported separately and used for specificity checks.

This repository is a public evidence package rather than a finetuned or general-purpose model release. Code, configs, prompts, result artifacts, a trained sparse autoencoder checkpoint, and final interpretation tables are kept together so the result can be inspected without reconstructing context across separate run folders.

Evidence status: the combined ablation run is marked pass_with_warnings because the mean headline drop is 1.320 logits, which did not clear the preset 1.5-logit threshold. The relative effect remains large (0.574, about 57.4%) and the mean headline/control specificity ratio is about 49x, but the run cards still label the central claim pending_review.
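The reported figures can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes the relative drop and the specificity ratio behave like ratios of means, which the run cards may not do (a mean of per-pair ratios would give different values). The derived quantities are therefore rough implied magnitudes, not reported results.

```python
# Values reported in this README.
MEAN_HEADLINE_DROP = 1.320   # logits, combined mean ablation
RELATIVE_DROP = 0.574        # fraction of the clean-vs-corrupted margin
SPECIFICITY_RATIO = 49.0     # headline drop / control drop (approximate)
PRESET_THRESHOLD = 1.5       # logits

# ASSUMPTION: both summary statistics are treated as ratios of means.
implied_baseline_margin = MEAN_HEADLINE_DROP / RELATIVE_DROP
implied_control_drop = MEAN_HEADLINE_DROP / SPECIFICITY_RATIO
threshold_shortfall = PRESET_THRESHOLD - MEAN_HEADLINE_DROP

print(f"implied baseline margin: {implied_baseline_margin:.2f} logits")
print(f"implied control drop:    {implied_control_drop:.3f} logits")
print(f"shortfall vs threshold:  {threshold_shortfall:.3f} logits")
```

Under that assumption, the absolute drop misses the preset threshold by a small amount while the relative and specificity figures stay large, which is consistent with a warning status rather than a failure.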

At A Glance

Item	Value
Base model	`Qwen/Qwen3.5-4B`
Project name	`L31H14+`
Primary run anchor	`L31H14`
Supporting targets	`L23H0`, `L26H17`, `L19H11`, `L23H8`
Combined-only ablation target	`L27H11`
Prompt pairs	`116` total: `109` headline + `7` diagnostic/control
Headline pairs	`109`
Diagnostic/control pairs	`7`
Discovery sweep	`896` measured target slots across `32` layers
Main causal result	`0.574` relative logit-diff drop on headline pairs (`57.4%`)
Mean headline drop	`1.320` logits under combined mean ablation
Specificity check	Mean headline/control drop ratio about `49x`
Run-card claim status	`pending_review`
Architecture note	Public base-model card describes a `32`-layer hybrid model; `LxxHyy` labels are run hook/head-slot identifiers, not a uniform `32 x 32` grid
SAE hook point	`14.mid`
SAE width	`20480`
SAE TopK	`32`
SAE checkpoint	`sae/sae_checkpoints/14_mid/sae_final.pt`
Final interpretation	`l31h14_plus_sae_interpretation/`

Research Question

Can a reproducible factual-recall pathway in Qwen3.5-4B be identified, causally evaluated, and partially interpreted?

This repository studies that question using a curated set of contrastive factual-recall prompts and focuses on a circuit anchored around the indexed run target L31H14.

Primary Finding

For the prompt set in this repository, factual recall is not carried by one isolated model component. The strongest evidence points to an anchored head-slot cluster:

L31H14 + supporting targets

When the combined cluster is ablated, the clean-vs-corrupted answer margin drops by about 57.4% on the 109 headline pairs. That combined ablation is the main causal result of the project, with the threshold warning noted above.

The SAE interpretation adds a feature-level view of the representations feeding the pathway. It is useful for inspection and for forming sharper hypotheses, but the strongest causal finding comes from the target-level intervention.

Why the Name L31H14+?

L31H14+ names the most stable discovery anchor directly: layer 31, indexed head slot 14.

The plus sign is intentional. The final evidence does not support a single-target story. L31H14 is the anchor, while L23H0, L26H17, L19H11, and L23H8 appear as supporting targets across discovery and ablation outputs.

Source tracing covers a subset of the cluster: L31H14, L23H0, and L26H17. L27H11 appears in the combined-ablation target only and should not be read as a source-traced supporting target.

The name follows the same compact convention as L31H1, while making the evidence structure explicit: one lead target, plus the surrounding circuit components needed for the full interpretation.

Repository Layout

Path	Contents
`configs/`	Active Qwen3.5-4B discovery, ablation, source, and SAE configs
`mech_workbench/`	CLI, model utilities, probes, interventions, logging, packaging, and SAE trainer
`prompt curation/`	Curated Qwen3.5-4B prompt JSONL and prompt dataset metadata
`results/`	Discovery, ablation, source-probing, and SAE metric artifacts
`sae/sae_checkpoints/14_mid/`	Published `14.mid` SAE checkpoint and metrics
`l31h14_plus_sae_interpretation/`	Final SAE feature interpretation and causal feature knockdown outputs
`scripts/`	Utility scripts for interpretation and artifact handling
`tests/`	Focused tests for config, hooks, metrics, probes, logging, and CLI defaults
`L31H14_PLUS_FINDINGS.md`	Readable findings report for the public showcase

Prompt Design

The prompt set uses contrastive factual-recall examples. Each example compares a clean prompt against a corrupted prompt with a related but different answer. Controls are included to verify that signals reflect the factual contrast rather than prompt formatting alone.

Example pattern:

Clean:     The chemical symbol for silver is ...
Corrupted: The chemical symbol for lead is ...
Control:   The chemical symbol for energy is ...

The main measured quantity is the clean-vs-corrupted logit-diff margin: how strongly the model prefers the correct answer over a plausible wrong one.
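As a minimal sketch of the quantity described above, the margin can be read off the final-position logits. The function name, the toy logits, and the token ids are hypothetical; the repository's own metric code may define or normalize the margin differently.

```python
import numpy as np

def logit_diff(logits, correct_id, wrong_id):
    """Logit of the correct answer token minus the logit of the wrong one."""
    return float(logits[correct_id] - logits[wrong_id])

# Toy final-token logits over a 5-token vocabulary (illustrative values only).
clean_logits = np.array([0.1, 3.0, 0.7, -1.2, 0.0])

# Hypothetical token ids for the correct and the plausible-wrong answer.
correct_id, wrong_id = 1, 2

margin = logit_diff(clean_logits, correct_id, wrong_id)
print(f"margin: {margin:.2f} logits")
```

A positive margin means the model prefers the correct answer; an intervention that matters should shrink it.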

Evidence Stack

The project uses four evidence layers:

Discovery ranks candidate head-slot targets using logit lens, layer-level gradient attribution, and fast target-level attribution.
Targeted ablation tests whether those targets causally affect the answer margin.
Source tracing checks where useful upstream information appears before it reaches the candidate targets.
SAE interpretation inspects 14.mid features and tests selected features with final-token feature knockdown.

Ablation is treated as the primary causal evidence because each layer answers a different question: discovery finds candidates, ablation tests necessity, source tracing gives upstream context, and SAE interpretation makes the underlying representations legible.

The gradient and fast attention-attribution discovery probes use first-order Taylor approximations of the effect of replacing corrupted activations with clean activations. They are efficient candidate screens, not substitutes for the target-level ablation results.
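The gap between a first-order estimate and an exact patching measurement can be seen on a toy function. The sketch below is not the repository's probe: the metric, weights, and activations are invented for illustration, and the gradient is computed analytically rather than by backpropagation through a model.

```python
import numpy as np

# Toy metric standing in for the answer margin: m(a) = sum_i w_i * tanh(a_i).
w = np.array([1.0, -2.0, 0.5])

def metric(a):
    return float(np.sum(w * np.tanh(a)))

def metric_grad(a):
    # Analytic gradient of the toy metric with respect to each activation.
    return w * (1.0 - np.tanh(a) ** 2)

corrupted = np.array([0.2, -0.5, 1.0])
clean = np.array([0.3, -0.4, 0.9])

# First-order estimate: gradient at the corrupted point times the activation delta.
taylor_attr = metric_grad(corrupted) * (clean - corrupted)

# Exact patching effect: swap in one clean activation at a time and re-evaluate.
exact_attr = np.zeros_like(corrupted)
for i in range(len(corrupted)):
    patched = corrupted.copy()
    patched[i] = clean[i]
    exact_attr[i] = metric(patched) - metric(corrupted)

print(np.round(taylor_attr, 4))
print(np.round(exact_attr, 4))
```

The two vectors agree closely when the clean and corrupted activations are near each other and drift apart as the delta grows, which is why the estimate is used for ranking candidates rather than for the final claim.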

Evidence

Discovery

Run: results/mech_run_20260607_215326
Status: pass_with_warnings
Top candidate targets: L31H14, L23H0, L26H17, L19H11, L23H8
Method: logit lens, layer-level gradient attribution, and first-order Taylor target attribution

Discovery should be read as a candidate-generation step. It identifies stable targets worth intervening on, but is not the final causal claim.
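For readers unfamiliar with the logit-lens part of discovery, a toy version looks like the sketch below. It assumes a plain matrix unembedding and omits the final normalization layer that a real model applies before unembedding; the function name and all values are hypothetical.

```python
import numpy as np

def logit_lens_margin(resid_by_layer, W_U, correct_id, wrong_id):
    """Project each layer's final-token residual through the unembedding
    and return the correct-minus-wrong logit at every layer."""
    logits = resid_by_layer @ W_U            # shape: (n_layers, vocab)
    return logits[:, correct_id] - logits[:, wrong_id]

# Toy example: 4 layers, d_model = 2, vocabulary of 3 tokens (illustrative values).
W_U = np.array([[1.0, 0.0, 0.5],
                [0.0, 1.0, 0.5]])
resid_by_layer = np.array([[0.1, 0.1],
                           [0.2, 0.1],
                           [0.9, 0.2],
                           [1.5, 0.1]])

margins = logit_lens_margin(resid_by_layer, W_U, correct_id=0, wrong_id=1)
print(margins)
```

Layers where the per-layer margin rises sharply are natural places to look for candidate targets.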

Ablation

Individual-target run: results/mech_ablation_20260607_173407
Combined-target run: results/mech_ablation_20260607_173456
Combined target: L31H14+L23H0+L26H17+L19H11+L23H8+L27H11
Status: pass_with_warnings
Mean headline logit-diff drop: 1.320
Mean relative logit-diff drop: approximately 0.574

The combined ablation is the main causal evidence. Disrupting the cluster substantially weakens the model's ability to keep the correct answer ahead of the corrupted answer, but the run warning should stay attached to the claim: the mean headline drop did not clear the preset 1.5-logit threshold.

L27H11 is included in the combined-ablation target because it was part of the final cluster test. It is not presented as one of the four recurring supporting targets or as a source-traced target.
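The ablation metrics above can be illustrated on a toy array. This sketch assumes each head slot adds a scalar contribution to the margin and that mean ablation replaces a slot's contribution with its mean over a reference set; the real runs intervene on hook activations inside the model, and the run cards may aggregate the relative drop per pair rather than as a ratio of means. All names and values are hypothetical.

```python
import numpy as np

def mean_ablate(contrib, reference, slots):
    """Replace the chosen slots' contributions with their mean over a reference set."""
    ablated = contrib.copy()
    ablated[:, slots] = reference[:, slots].mean(axis=0)
    return ablated

def margin_drops(contrib, ablated):
    """Return (absolute drop, relative drop) of the mean margin."""
    base = contrib.sum(axis=1).mean()
    after = ablated.sum(axis=1).mean()
    return base - after, (base - after) / base

# Rows are prompt pairs, columns are head slots (illustrative values only).
clean_contrib = np.array([[1.0, 0.5, 0.2],
                          [1.2, 0.3, 0.1]])
reference_contrib = np.array([[0.2, 0.1, 0.2],
                              [0.0, 0.1, 0.0]])

ablated = mean_ablate(clean_contrib, reference_contrib, slots=[0, 1])
abs_drop, rel_drop = margin_drops(clean_contrib, ablated)
print(f"absolute drop: {abs_drop:.3f}, relative drop: {rel_drop:.3f}")
```

Reporting both numbers matters here: the absolute drop is what the preset threshold checks, while the relative drop expresses the same effect as a share of the baseline margin.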

Source Tracing

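Runs: results/mech_source_20260607_173522, results/mech_source_20260607_173826, results/mech_source_20260607_174130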
Targets covered: L31H14, L23H0, L26H17
L23H0 has the clearest source-token recovery story among the traced targets.

The source runs support the hypothesis that useful factual signal is prepared earlier in the network and carried forward through the candidate target cluster.
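One way to picture a layer-by-position source sweep is activation patching on a toy residual stream. The sketch below is an assumption about the general technique, not a description of the repository's source probe: the toy model simply moves signal one position to the right per layer, and every name and value is invented.

```python
import numpy as np

N_LAYERS = 3

def forward(inputs, patch=None):
    """Toy residual stream. patch=(layer, pos, value) overwrites one activation
    at the input of that layer. Returns (metric, per-layer activation cache)."""
    x = np.asarray(inputs, dtype=float).copy()
    cache = []
    for layer in range(N_LAYERS):
        if patch is not None and patch[0] == layer:
            x[patch[1]] = patch[2]
        cache.append(x.copy())
        # Each position receives half of its left neighbour's activation.
        x = x + 0.5 * np.concatenate(([0.0], x[:-1]))
    return x[-1], cache  # metric is read at the final position

clean_inputs = [1.0, 0.0, 0.0]     # signal enters at position 0
corrupt_inputs = [0.0, 0.0, 0.0]

clean_metric, clean_cache = forward(clean_inputs)
corrupt_metric, _ = forward(corrupt_inputs)

# Fraction of the clean metric recovered by patching one clean activation
# into the corrupted run, for every (layer, position).
recovery = np.zeros((N_LAYERS, len(clean_inputs)))
for layer in range(N_LAYERS):
    for pos in range(len(clean_inputs)):
        patched, _ = forward(corrupt_inputs, patch=(layer, pos, clean_cache[layer][pos]))
        recovery[layer, pos] = (patched - corrupt_metric) / (clean_metric - corrupt_metric)

print(np.round(recovery, 3))
```

In the toy, recovery is concentrated at the source position in the earliest layer and shifts toward later positions in later layers, which is the kind of pattern a source-token recovery story refers to.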

SAE Checkpoint

Metrics: results/sae_checkpoints/14_mid/metrics.json
Published checkpoint: sae/sae_checkpoints/14_mid/sae_final.pt
Hook point: 14.mid
SAE width: 20480
TopK: 32
Final loss: 0.015479110181331635
Dead features: 4.7216796875% (967 of 20480 features)
Training steps: 12200

The SAE is used as an inspection tool for the representations at 14.mid. It is not presented as a substitute for the target-level causal evidence.
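For orientation, a TopK SAE forward pass can be sketched in a few lines. This is a generic formulation with random toy weights, not the published checkpoint's architecture: the pre-encoder bias subtraction, the ReLU, and all shapes are assumptions, and the toy uses width 64 and K = 4 rather than the 20480 and 32 reported above.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode, keep the k largest pre-activations per row, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc
    # Indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    acts = np.zeros_like(pre)
    np.put_along_axis(acts, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    acts = np.maximum(acts, 0.0)      # ASSUMPTION: negative survivors are clipped
    recon = acts @ W_dec + b_dec
    return acts, recon

rng = np.random.default_rng(0)
d_model, width, k = 8, 64, 4          # toy sizes, not the published ones
x = rng.normal(size=(3, d_model))     # three toy activation vectors
W_enc = rng.normal(size=(d_model, width)) / np.sqrt(d_model)
W_dec = rng.normal(size=(width, d_model)) / np.sqrt(width)
b_enc = np.zeros(width)
b_dec = np.zeros(d_model)

acts, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print((acts > 0).sum(axis=1))         # active features per input, at most k
```

The sparsity is structural: no more than K features can be active for any input, so each activation vector is explained by a short list of features that can be inspected one at a time.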

SAE Interpretation

Output directory: l31h14_plus_sae_interpretation/
Prompt pairs analyzed: 116 (109 headline, 7 diagnostic/control)
Causal feature knockdown: enabled

Main output files are stored in the output directory listed above.

The strongest positive feature-level causal result is feature 18641, with 53 active pairs and a mean clean logit-diff drop of approximately 0.044 under feature knockdown.

Broader features including 12692, 8455, and 15081 activate across many factual prompts. Their mean knockdown effects are weaker; they are better treated as representational clues than as standalone causal claims.
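A feature knockdown of the kind described above can be sketched as follows. The sketch assumes the edit is applied as a delta to the hooked activation, so the SAE's reconstruction error is left untouched, and it uses a linear stand-in (`margin_fn`) for the rest of the model. The function names and all values are hypothetical.

```python
import numpy as np

def knockdown_drop(x, acts, W_dec, feature_id, margin_fn):
    """Margin drop when one SAE feature is zeroed at the hooked position."""
    edited = acts.copy()
    edited[feature_id] = 0.0
    # Apply only the decoded change, leaving the reconstruction error in place.
    x_edit = x + (edited - acts) @ W_dec
    return margin_fn(x) - margin_fn(x_edit)

# Toy final-token activation, SAE feature activations, and decoder (illustrative).
x = np.array([1.0, 2.0])
acts = np.array([0.0, 0.5, 0.0, 1.5])
W_dec = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0],
                  [0.2, -0.1]])

# ASSUMPTION: a linear readout stands in for the downstream model.
readout = np.array([0.5, 1.0])
margin_fn = lambda v: float(v @ readout)

drop_active = knockdown_drop(x, acts, W_dec, feature_id=1, margin_fn=margin_fn)
drop_inactive = knockdown_drop(x, acts, W_dec, feature_id=0, margin_fn=margin_fn)
print(drop_active, drop_inactive)
```

A feature that is inactive on a prompt produces no change, which is why per-feature means are best read alongside the number of active pairs.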

Reading Guide

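Recommended starting points: L31H14_PLUS_FINDINGS.md

Detailed SAE tables: l31h14_plus_sae_interpretation/

Target-level evidence: results/mech_ablation_20260607_173407, results/mech_ablation_20260607_173456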

Reproducibility

All headline results are derived from the configurations, prompt sets, checkpoints, and analysis artifacts included in this repository.

The primary evidence supporting the central finding consists of:

Discovery outputs (results/mech_run_20260607_215326)
Ablation outputs (results/mech_ablation_20260607_173407, results/mech_ablation_20260607_173456)
Source-tracing outputs (results/mech_source_20260607_173522, results/mech_source_20260607_173826, results/mech_source_20260607_174130)
SAE interpretation outputs (l31h14_plus_sae_interpretation/)

Re-running the pipeline should reproduce the qualitative findings reported here, although small numerical differences may occur due to implementation, hardware, or nondeterministic execution.

Optional Local Inspection

The commands below are for inspecting artifacts or reproducing runs locally. They are not required to load a finetuned model from this repository.

Install the analysis package locally:

python -m pip install -e .

Inspect result artifacts:

python -m mech_workbench.cli inspect --results-root results

Package the active code and prompt/config assets:

python -m mech_workbench.cli package --output mech_workbench_core.zip

Run discovery:

# Rank candidate targets using logit lens, gradient, and attention attribution
python -m mech_workbench.cli run \
  --config configs/discovery_4b.yaml \
  --probe logit_lens,gradient,attention \
  --results-root results

Run targeted ablation:

# Test the anchor target and supporting targets individually and together
python -m mech_workbench.cli ablate \
  --config configs/discovery_4b.yaml \
  --target L31H14 \
  --also L23H0,L26H17,L19H11,L23H8 \
  --mode both \
  --results-root results

Run combined ablation:

# Ablate the full cluster simultaneously: the primary causal test
python -m mech_workbench.cli ablate \
  --config configs/discovery_4b.yaml \
  --target L31H14 \
  --also L23H0,L26H17,L19H11,L23H8 \
  --combined \
  --mode both \
  --results-root results

Run source tracing:

# Trace where upstream signal reaches L23H0 across layers 16-22
python -m mech_workbench.cli source \
  --config configs/discovery_4b.yaml \
  --target L23H0 \
  --source-layers 16-22 \
  --source-position sweep \
  --results-root results

Glossary

Brief definitions for readers new to the mechanistic interpretability methods used in this project.

Term	Definition
Logit-diff margin	The difference in the model's output score between the correct answer token and a specific wrong answer token. Higher is better; ablation shrinks it.
Logit lens	A technique that reads the model's implicit prediction at each intermediate layer to track where information about the answer crystallizes.
First-order Taylor attribution	An efficient approximation that estimates a patching effect as `gradient * activation_delta`, avoiding exhaustive per-component forward passes. Used here for gradient attribution and fast head-slot ranking.
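Ablation	Replacing the output of a named model component or hook target (with zeros, a mean activation, or a corrupted activation) to test whether it is causally necessary for a behavior. The headline result here uses mean ablation.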
Activation patching	Copying an internal activation from a clean run into a corrupted run, or vice versa, to isolate which component carries the useful signal.
Source tracing	Identifying which earlier tokens or layers contribute the signal that later reaches a target hook.
Sparse autoencoder (SAE)	An autoencoder, usually wider than the activation it reads, trained to reconstruct dense model activations from a small number of active features that are often easier to interpret. Used here at hook point `14.mid`.
TopK SAE	An SAE variant that enforces sparsity by keeping only the K largest feature activations for each input activation vector and zeroing the rest.
Feature knockdown	Setting a specific SAE feature's activation to zero at inference time to measure its causal contribution to the answer margin.

Keywords

Mechanistic interpretability, Qwen3.5-4B interpretability, sparse autoencoder, SAE, TopK SAE, factual recall circuit, head-slot ablation, source tracing, activation patching, logit lens, gradient attribution, first-order Taylor attribution, L31H14, L31H14+, Qwen factual recall.

Interpretation Caveats

The gradient and attention discovery probes use first-order Taylor attribution. They are fast candidate screens, not exact activation-patching measurements.

Causal evidence comes primarily from targeted ablations, with the final SAE feature knockdown tables as weaker feature-level support.

The headline 57.4% relative drop is measured on 109 headline pairs; the 7 diagnostic/control pairs are reported separately.

The combined ablation run is pass_with_warnings because the mean headline logit drop did not clear the preset 1.5-logit threshold.

Published run cards label the central claim pending_review.

LxxHyy labels are run hook/head-slot identifiers. The current public Qwen3.5-4B model card describes a 32-layer hybrid architecture.

SAE feature deltas are interpretive candidates until paired with knockdown results.

The result is scoped to the curated prompt set and should not be read as a complete map of factual recall in Qwen3.5-4B.

Some historical run-level logs may preserve execution metadata from the original run. The active checked-in layout uses the paths documented above.

License and Author

This software is distributed under the MIT License. See LICENSE for the full text.

Author: Muskula Rahul - @iamrahulreddy

Citation

If this repository, codebase, or training pipeline is useful in your work, please cite it and acknowledge the upstream Qwen3.5 models.

@misc{muskula2026l31h14plus,
  author       = {Muskula Rahul},
  title        = {{L31H14+}: A Mechanistic Interpretability Study of Factual Recall in {Qwen3.5-4B}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/iamrahulreddy/L31H14-plus}},
}

Conclusion

Within the prompt distribution studied here, evidence from discovery, intervention, source tracing, and sparse-autoencoder analysis supports the existence of a reproducible factual-recall circuit anchored around indexed run target L31H14.

Please treat this repo as a transparent evidence package rather than a complete account of factual recall in Qwen3.5-4B. Code, prompts, configurations, checkpoints, and analysis artifacts are published together so that the underlying findings can be independently inspected and evaluated.

The results should be interpreted as a focused case study of one factual-recall pathway and as a foundation for further mechanistic investigation.
