L31H14+


L31H14+ is a text-only mechanistic interpretability evidence package for Qwen3.5-4B. It studies factual-recall behavior around the indexed run target L31H14, with supporting targets tested through discovery, ablation, source tracing, and sparse autoencoder analysis.

The project investigates whether a reproducible factual-recall pathway can be identified, causally tested, and partially interpreted using hook-target interventions, source tracing, and sparse autoencoders.

In this repository, labels such as L31H14 refer to the layer/head-slot identifiers used by the run hooks and artifact tables. The public Qwen/Qwen3.5-4B model card describes a 32-layer hybrid architecture, so these labels should not be read as a claim that the model is a plain attention-only transformer.

The main result is that ablating a target cluster anchored at L31H14 reduces the model's clean-vs-corrupted answer margin by approximately 57.4% on 109 headline contrastive factual-recall pairs. Seven diagnostic/control pairs are reported separately and used for specificity checks.

This repository is a public evidence package rather than a finetuned or general-purpose model release. Code, configs, prompts, result artifacts, a trained sparse autoencoder checkpoint, and final interpretation tables are kept together so the result can be inspected without reconstructing context across separate run folders.

Evidence status: the combined ablation run is marked pass_with_warnings because its mean headline drop of 1.320 logits did not clear the preset 1.5-logit threshold. The relative effect remains large (0.574, about 57.4%) and the mean headline/control specificity ratio is about 49x, but the run cards still label the central claim pending_review.

At A Glance

| Item | Value |
| --- | --- |
| Base model | Qwen/Qwen3.5-4B |
| Project name | L31H14+ |
| Primary run anchor | L31H14 |
| Supporting targets | L23H0, L26H17, L19H11, L23H8 |
| Combined-only ablation target | L27H11 |
| Prompt pairs | 116 total: 109 headline + 7 diagnostic/control |
| Headline pairs | 109 |
| Diagnostic/control pairs | 7 |
| Discovery sweep | 896 measured target slots across 32 layers |
| Main causal result | 0.574 relative logit-diff drop on headline pairs (57.4%) |
| Mean headline drop | 1.320 logits under combined mean ablation |
| Specificity check | Mean headline/control drop ratio about 49x |
| Run-card claim status | pending_review |
| Architecture note | Public base-model card describes a 32-layer hybrid model; LxxHyy labels are run hook/head-slot identifiers, not a uniform 32 x 32 grid |
| SAE hook point | 14.mid |
| SAE width | 20480 |
| SAE TopK | 32 |
| SAE checkpoint | sae/sae_checkpoints/14_mid/sae_final.pt |
| Final interpretation | l31h14_plus_sae_interpretation/ |

Research Question

Can a reproducible factual-recall pathway in Qwen3.5-4B be identified, causally evaluated, and partially interpreted?

This repository studies that question using a curated set of contrastive factual-recall prompts and focuses on a circuit anchored around the indexed run target L31H14.

Primary Finding

For the prompt set in this repository, factual recall is not carried by one isolated model component. The strongest evidence points to an anchored head-slot cluster:

L31H14 + supporting targets

When the combined cluster is ablated, the clean-vs-corrupted answer margin drops by about 57.4% on the 109 headline pairs. That combined ablation is the main causal result of the project, with the threshold warning noted above.

The SAE interpretation adds a feature-level view of the representations feeding the pathway. It is useful for inspection and for forming sharper hypotheses, but the strongest causal finding comes from the target-level intervention.

Why the Name L31H14+?

L31H14+ names the most stable discovery anchor directly: layer 31, indexed head slot 14.

The plus sign is intentional. The final evidence does not support a single-target story. L31H14 is the anchor, while L23H0, L26H17, L19H11, and L23H8 appear as supporting targets across discovery and ablation outputs.

Source tracing covers a subset of the cluster: L31H14, L23H0, and L26H17. L27H11 appears in the combined-ablation target only and should not be read as a source-traced supporting target.

The name follows the same compact convention as L31H1, while making the evidence structure explicit: one lead target, plus the surrounding circuit components needed for the full interpretation.

Repository Layout

| Path | Contents |
| --- | --- |
| configs/ | Active Qwen3.5-4B discovery, ablation, source, and SAE configs |
| mech_workbench/ | CLI, model utilities, probes, interventions, logging, packaging, and SAE trainer |
| prompt curation/ | Curated Qwen3.5-4B prompt JSONL and prompt dataset metadata |
| results/ | Discovery, ablation, source-probing, and SAE metric artifacts |
| sae/sae_checkpoints/14_mid/ | Published 14.mid SAE checkpoint and metrics |
| l31h14_plus_sae_interpretation/ | Final SAE feature interpretation and causal feature knockdown outputs |
| scripts/ | Utility scripts for interpretation and artifact handling |
| tests/ | Focused tests for config, hooks, metrics, probes, logging, and CLI defaults |
| L31H14_PLUS_FINDINGS.md | Readable findings report for the public showcase |

Prompt Design

The prompt set uses contrastive factual-recall examples. Each example compares a clean prompt against a corrupted prompt with a related but different answer. Controls are included to verify that signals reflect the factual contrast rather than prompt formatting alone.

Example pattern:

Clean:     The chemical symbol for silver is ...
Corrupted: The chemical symbol for lead is ...
Control:   The chemical symbol for energy is ...

The main measured quantity is the clean-vs-corrupted logit-diff margin: how strongly the model prefers the correct answer over a plausible wrong one.
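As a minimal sketch of how this margin relates to the summary statistics quoted in this card, the snippet below computes a per-prompt logit difference, a relative drop, and a headline/control specificity ratio. The function names and toy numbers are illustrative assumptions, not the repository's API or data, and the relative drop is assumed to be a ratio of means (the run artifacts may aggregate differently).

```python
from statistics import mean

def logit_diff(logits, correct_id, wrong_id):
    """Margin between the correct-answer logit and one wrong-answer logit."""
    return logits[correct_id] - logits[wrong_id]

def relative_drop(baseline_margins, ablated_margins):
    """Fraction of the mean baseline margin removed by an intervention.

    Assumption: ratio of means, not a mean of per-pair ratios.
    """
    base = mean(baseline_margins)
    return (base - mean(ablated_margins)) / base

def specificity_ratio(headline_drops, control_drops):
    """Mean drop on headline pairs divided by mean drop on control pairs."""
    return mean(headline_drops) / mean(control_drops)

# Toy values for illustration only; they are not taken from the run artifacts.
toy_logits = [0.5, 3.0, 1.0]          # hypothetical 3-token vocabulary
toy_margin = logit_diff(toy_logits, correct_id=1, wrong_id=2)

baseline = [2.0, 3.0, 1.0]            # margins before the intervention
ablated = [1.0, 1.5, 0.5]             # margins after the intervention
toy_relative_drop = relative_drop(baseline, ablated)

headline_drops = [b - a for b, a in zip(baseline, ablated)]
control_drops = [0.25, 0.25]          # small drops on hypothetical control pairs
toy_specificity = specificity_ratio(headline_drops, control_drops)
```

With these toy inputs the relative drop is 0.5 and the specificity ratio is 4. The reported values (0.574 and about 49x) come from the run artifacts, not from this sketch.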

Evidence Stack

The project uses four evidence layers:

  1. Discovery ranks candidate head-slot targets using logit lens, layer-level gradient attribution, and fast target-level attribution.
  2. Targeted ablation tests whether those targets causally affect the answer margin.
  3. Source tracing checks where useful upstream information appears before it reaches the candidate targets.
  4. SAE interpretation inspects 14.mid features and tests selected features with final-token feature knockdown.

Ablation is treated as the primary causal evidence because each layer answers a different question: discovery finds candidates, ablation tests necessity, source tracing gives upstream context, and SAE interpretation makes the underlying representations legible.

The gradient and fast attention-attribution discovery probes use first-order Taylor approximations of the effect of replacing corrupted activations with clean activations. They are efficient candidate screens, not substitutes for the target-level ablation results.
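The difference between a first-order estimate and an exact patching effect can be illustrated on a toy metric. This is a sketch under stated assumptions: the metric functions and activation values are hypothetical, and a finite-difference gradient stands in for the single backward pass a real probe would use.

```python
def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient (stand-in for autograd)."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def taylor_patch_estimate(metric, a_corrupt, a_clean):
    """First-order estimate: grad(metric) at a_corrupt, dotted with the activation delta."""
    grad = numerical_grad(metric, a_corrupt)
    return sum(g * (c - k) for g, c, k in zip(grad, a_clean, a_corrupt))

def exact_patch_effect(metric, a_corrupt, a_clean):
    """Exact effect of replacing the corrupted activation with the clean one."""
    return metric(a_clean) - metric(a_corrupt)

# Hypothetical activations and metrics, for illustration only.
a_corrupt = [1.0, 0.0]
a_clean = [2.0, 1.0]

def linear_metric(a):
    return 2.0 * a[0] - a[1]

def curved_metric(a):
    return a[0] ** 2

linear_estimate = taylor_patch_estimate(linear_metric, a_corrupt, a_clean)
linear_exact = exact_patch_effect(linear_metric, a_corrupt, a_clean)
curved_estimate = taylor_patch_estimate(curved_metric, a_corrupt, a_clean)
curved_exact = exact_patch_effect(curved_metric, a_corrupt, a_clean)
```

For the linear metric the estimate matches the exact effect. For the curved metric it does not (an estimate near 2 against an exact effect of 3), which is why such probes are treated as screens and followed by direct intervention.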

Evidence

Discovery

  • Run: results/mech_run_20260607_215326
  • Status: pass_with_warnings
  • Top candidate targets: L31H14, L23H0, L26H17, L19H11, L23H8
  • Method: logit lens, layer-level gradient attribution, and first-order Taylor target attribution

Discovery should be read as a candidate-generation step. It identifies stable targets worth intervening on, but is not the final causal claim.
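For readers unfamiliar with the logit lens used in discovery, the idea can be sketched as projecting each layer's hidden state through the unembedding matrix and tracking the answer margin by depth. Everything here is a toy assumption: the two-dimensional hidden states, the identity unembedding, and the unscaled RMS normalization are illustrative and do not reflect the Qwen3.5-4B weights or the repository's probe code.

```python
import math

def rms_norm(h, eps=1e-6):
    """RMS normalization without a learned scale (a simplifying assumption)."""
    scale = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / scale for x in h]

def unembed(h, w_u):
    """Project a hidden state to vocabulary logits; w_u has one row per token."""
    return [sum(w * x for w, x in zip(row, h)) for row in w_u]

def logit_lens_margins(hidden_by_layer, w_u, correct_id, wrong_id):
    """Correct-vs-wrong logit margin read off each layer's hidden state."""
    margins = []
    for h in hidden_by_layer:
        logits = unembed(rms_norm(h), w_u)
        margins.append(logits[correct_id] - logits[wrong_id])
    return margins

# Toy setup: 2-token vocabulary, 3 hypothetical layers.
w_u = [[1.0, 0.0], [0.0, 1.0]]
hidden_by_layer = [[1.0, 1.0], [2.0, 1.0], [3.0, 0.0]]
margins = logit_lens_margins(hidden_by_layer, w_u, correct_id=0, wrong_id=1)
```

In this toy case the margin starts at zero and grows with depth, which is the kind of layer-by-layer pattern a logit-lens sweep looks for when ranking where answer information emerges.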

Ablation

The combined ablation is the main causal evidence. Disrupting the cluster substantially weakens the model's ability to keep the correct answer ahead of the corrupted answer, but the run warning should stay attached to the claim: the mean headline drop did not clear the preset 1.5-logit threshold.

L27H11 is included in the combined-ablation target because it was part of the final cluster test. It is not presented as one of the four recurring supporting targets or as a source-traced target.
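The mean ablation named in the headline numbers can be sketched on a toy additive model, where each component contributes a scalar to the answer margin and ablation replaces a component's output with its mean over a reference set. The component values, the "other" component, and the additive readout are assumptions made for illustration. The repository's intervention code operates on real hook activations and may differ in detail.

```python
from statistics import mean

def reference_means(dataset):
    """Per-component mean output across a reference set of prompts."""
    names = dataset[0].keys()
    return {name: mean(row[name] for row in dataset) for name in names}

def mean_ablate(outputs, targets, means):
    """Copy of outputs with each targeted component replaced by its reference mean."""
    return {
        name: (means[name] if name in targets else value)
        for name, value in outputs.items()
    }

def margin(outputs):
    """Toy readout: the answer margin is the sum of component contributions."""
    return sum(outputs.values())

# Hypothetical per-prompt component outputs (not taken from the run artifacts).
dataset = [
    {"L31H14": 2.0, "L23H0": 1.5, "other": 0.5},
    {"L31H14": 0.0, "L23H0": 0.5, "other": 0.5},
]
means = reference_means(dataset)
clean = dataset[0]

single = mean_ablate(clean, {"L31H14"}, means)
combined = mean_ablate(clean, {"L31H14", "L23H0"}, means)

single_drop = margin(clean) - margin(single)
combined_drop = margin(clean) - margin(combined)
```

Here the single-target drop is 1.0 and the combined drop is 1.5, a toy illustration of why a cluster-level ablation can reveal more than any one target tested alone.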

Source Tracing

The source runs support the hypothesis that useful factual signal is prepared earlier in the network and carried forward through the candidate target cluster.

SAE Checkpoint

The SAE is used as an inspection tool for the representations at 14.mid. It is not presented as a substitute for the target-level causal evidence.
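A TopK SAE forward pass can be sketched in a few lines. This is a toy illustration, not the published checkpoint's architecture: the 4-feature dictionary, the weights, and the function names are assumptions, whereas the released SAE uses width 20480 and TopK 32 and its exact parameterization should be read from the trainer code.

```python
def topk_sae_encode(x, w_enc, b_enc, k):
    """Encode x, keeping only the k largest pre-activations (clamped at zero)."""
    pre = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(w_enc, b_enc)
    ]
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    return [max(pre[i], 0.0) if i in keep else 0.0 for i in range(len(pre))]

def sae_decode(acts, w_dec, b_dec):
    """Reconstruct the activation as a sparse sum of decoder directions."""
    out = list(b_dec)
    for a, direction in zip(acts, w_dec):
        if a != 0.0:
            for j, dj in enumerate(direction):
                out[j] += a * dj
    return out

# Toy dictionary: 4 features over a 2-dimensional activation.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = w_enc                      # one decoder direction per feature
b_dec = [0.0, 0.0]

x = [3.0, -2.0]
acts = topk_sae_encode(x, w_enc, b_enc, k=2)
reconstruction = sae_decode(acts, w_dec, b_dec)
```

With k=2, only features 0 and 3 stay active and the toy input is reconstructed exactly. A real SAE reconstructs only approximately, so its features are a lens on the activation rather than a replacement for it.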

SAE Interpretation

Main output files:

The strongest positive feature-level causal result is feature 18641, with 53 active pairs and a mean clean logit-diff drop of approximately 0.044 under feature knockdown.

Broader features including 12692, 8455, and 15081 activate across many factual prompts. Their mean knockdown effects are weaker; they are better treated as representational clues than as standalone causal claims.
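Feature knockdown can be sketched as removing one feature's decoded contribution from the activation and re-measuring the margin. This is an assumption-laden toy: the linear readout stands in for the rest of the model's forward pass, the feature indices and values are hypothetical, and the repository's final-token knockdown procedure may be implemented differently.

```python
def knock_down(x, acts, w_dec, feature_id):
    """Subtract one feature's decoded contribution from activation x."""
    a = acts[feature_id]
    return [xi - a * dj for xi, dj in zip(x, w_dec[feature_id])]

def margin(x, readout):
    """Toy linear readout standing in for the downstream model computation."""
    return sum(r * xi for r, xi in zip(readout, x))

# Hypothetical activation, SAE feature activations, and decoder directions.
x = [3.0, -2.0]
acts = [3.0, 0.0, 0.0, 2.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
readout = [1.0, 0.5]

base_margin = margin(x, readout)
active_drop = base_margin - margin(knock_down(x, acts, w_dec, 0), readout)
inactive_drop = base_margin - margin(knock_down(x, acts, w_dec, 1), readout)
```

Knocking down an inactive feature leaves the margin unchanged, which is why knockdown effects are averaged only over the pairs where the feature is active.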

Reading Guide

Recommended starting points:

Detailed SAE tables:

Target-level evidence:

Reproducibility

All headline results are derived from the configurations, prompt sets, checkpoints, and analysis artifacts included in this repository.

The primary evidence supporting the central finding consists of:

Re-running the pipeline should reproduce the qualitative findings reported here, although small numerical differences may occur due to implementation, hardware, or nondeterministic execution.

Optional Local Inspection

The commands below are for inspecting artifacts or reproducing runs locally. They are not required to load a finetuned model from this repository.

Install the analysis package locally:

python -m pip install -e .

Inspect result artifacts:

python -m mech_workbench.cli inspect --results-root results

Package the active code and prompt/config assets:

python -m mech_workbench.cli package --output mech_workbench_core.zip

Run discovery:

# Rank candidate targets using logit lens, gradient, and attention attribution
python -m mech_workbench.cli run \
  --config configs/discovery_4b.yaml \
  --probe logit_lens,gradient,attention \
  --results-root results

Run targeted ablation:

# Test the anchor target and supporting targets individually and together
python -m mech_workbench.cli ablate \
  --config configs/discovery_4b.yaml \
  --target L31H14 \
  --also L23H0,L26H17,L19H11,L23H8 \
  --mode both \
  --results-root results

Run combined ablation:

# Ablate the full cluster simultaneously: the primary causal test
python -m mech_workbench.cli ablate \
  --config configs/discovery_4b.yaml \
  --target L31H14 \
  --also L23H0,L26H17,L19H11,L23H8 \
  --combined \
  --mode both \
  --results-root results

Run source tracing:

# Trace where upstream signal reaches L23H0 across layers 16-22
python -m mech_workbench.cli source \
  --config configs/discovery_4b.yaml \
  --target L23H0 \
  --source-layers 16-22 \
  --source-position sweep \
  --results-root results

Glossary

Brief definitions for readers new to the mechanistic interpretability methods used in this project.

| Term | Definition |
| --- | --- |
| Logit-diff margin | The difference in the model's output score between the correct answer token and a specific wrong answer token. Higher is better; ablating a relevant component shrinks it. |
| Logit lens | A technique that reads the model's implicit prediction at each intermediate layer to track where information about the answer crystallizes. |
| First-order Taylor attribution | An efficient approximation that estimates a patching effect as gradient * activation_delta, avoiding exhaustive per-component forward passes. Used here for gradient attribution and fast head-slot ranking. |
| Ablation | Zeroing out, mean-replacing, or otherwise corrupting a named model component or hook target to test whether it is causally necessary for a behavior. |
| Activation patching | Copying an internal activation from a clean run into a corrupted run, or vice versa, to isolate which component carries the useful signal. |
| Source tracing | Identifying which earlier tokens or layers contribute the signal that later reaches a target hook. |
| Sparse autoencoder (SAE) | An autoencoder trained to decompose dense model activations into a sparse set of more interpretable features. Used here at hook point 14.mid. |
| TopK SAE | An SAE variant that enforces sparsity by keeping only the K largest feature activations for each input activation vector and zeroing the rest (K = 32 here). |
| Feature knockdown | Setting a specific SAE feature's activation to zero at inference time to measure its causal contribution to the answer margin. |

Keywords

Mechanistic interpretability, Qwen3.5-4B interpretability, sparse autoencoder, SAE, TopK SAE, factual recall circuit, head-slot ablation, source tracing, activation patching, logit lens, gradient attribution, first-order Taylor attribution, L31H14, L31H14+, Qwen factual recall.

Interpretation Caveats

  • The gradient and attention discovery probes use first-order Taylor attribution. They are fast candidate screens, not exact activation-patching measurements.
  • Causal evidence comes primarily from targeted ablations and the final SAE feature knockdown tables.
  • The headline 57.4% relative drop is measured on 109 headline pairs; the 7 diagnostic/control pairs are reported separately.
  • The combined ablation run is pass_with_warnings because the mean headline logit drop did not clear the preset 1.5-logit threshold.
  • Published run cards label the central claim pending_review.
  • LxxHyy labels are run hook/head-slot identifiers. The current public Qwen3.5-4B model card describes a 32-layer hybrid architecture, so the labels should not be read as implying a uniform attention-only layout.
  • SAE feature deltas are interpretive candidates until paired with knockdown results.
  • The result is scoped to the curated prompt set and should not be read as a complete map of factual recall in Qwen3.5-4B.
  • Some historical run-level logs may preserve execution metadata from the original run. The active checked-in layout uses the paths documented above.

License and Author

This software is distributed under the MIT License. See LICENSE for the full text.

Author: Muskula Rahul - @iamrahulreddy

Citation

If this repository, codebase, or training pipeline is useful in your work, please cite it and acknowledge the upstream Qwen/Qwen3.5-4B base model.

@misc{muskula2026l31h14plus,
  author       = {Muskula Rahul},
  title        = {{L31H14+}: A Mechanistic Interpretability Study of Factual Recall in {Qwen3.5-4B}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/iamrahulreddy/L31H14-plus}},
}

Conclusion

Within the prompt distribution studied here, evidence from discovery, intervention, source tracing, and sparse-autoencoder analysis supports the existence of a reproducible factual-recall circuit anchored around indexed run target L31H14.

Please treat this repo as a transparent evidence package rather than a complete account of factual recall in Qwen3.5-4B. Code, prompts, configurations, checkpoints, and analysis artifacts are published together so that the underlying findings can be independently inspected and evaluated.

The results should be interpreted as a focused case study of one factual-recall pathway and as a foundation for further mechanistic investigation.
