Meta-transformers: Architectural Introspection for LLMs
This repository hosts the weights, data, and results of the architectural-introspection experiments ("meta-transformers"). The idea: instead of text-based reflection, give a model direct access to its own activations through a learnable feedback loop.
The base model stays frozen. Only a thin introspection pathway is trained (~188M params on Llama-3.1-8B, ~2.3% of the base): an activation encoder + 32 BottleneckCrossAttention modules. At inference the model runs two passes: it reads its own per-layer activations, encodes them into cognitive tokens, and injects them back via gated cross-attention. This enables calibrated refusal and, less reliably at 8B, self-correction (see Limitations).
Code, training and evaluation scripts: https://codeberg.org/imperius/meta-transformers-ENG.git
Repository layout
| Folder | Contents |
|---|---|
| checkpoints/ | Trained introspection weights (encoder + cross-attention). The base model is not included; download it separately from Hugging Face. |
| data/ | Pre-collected activations (training datasets for the introspection pathway). |
| results/ | Metrics, training logs and histories for every experiment (JSON / txt / log). |
⚠️ This is not a drop-in HF model. The weights load into a ReflexionModel* wrapper from the code repository and require two-pass generation. Without the code, the weights are not usable on their own.
Key results
| Experiment | Selective accuracy | Refusal precision | Checkpoint |
|---|---|---|---|
| Phase 5 Multi-Position (Variant B), record | 90.1% | 98.7% | checkpoints/phase5_multipos/ |
| Phase 2 Selective MMLU, calibration record | 89.1% | 99.84% | checkpoints/phase2_selective_best_model.pt |
| Cross-domain (MMLU→TriviaQA, zero-shot) | 91.1% | 100% | checkpoints/selective_mmlu_best_model.pt |
| Phase 4 Dynamic Gates (7/7 checks) | 88.9% | 99.0% | checkpoints/phase4_dynamic_gates/ |
| Phase 1 Selective (basis of the records) | 71.4% | 84.9% | checkpoints/selective/ |
Baseline (no introspection) on full MMLU: ~83% selective accuracy, 0% refusal.
Cross-domain is the strongest evidence of generalization: a checkpoint trained only on MMLU keeps 100% refusal precision zero-shot on TriviaQA. This suggests the encoder reads the model's own internal uncertainty signal rather than benchmark-specific patterns.
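This README does not define the two metrics in the table. The sketch below shows one common reading, under stated assumptions: selective accuracy is accuracy over the questions the model chose to answer, and refusal precision is the share of refusals that fell on questions the frozen base model alone would have answered incorrectly. The `Outcome` record, the `selective_metrics` function and the toy data are illustrative and are not taken from the evaluation scripts.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    refused: bool       # model declined to answer in pass 2
    correct: bool       # given answer was right (ignored when refused)
    base_correct: bool  # frozen base model alone would have been right


def selective_metrics(outcomes):
    """Compute coverage, selective accuracy and refusal precision.

    Assumed definitions (not confirmed by the source):
      coverage           = answered / total
      selective accuracy = correct answers / answered
      refusal precision  = refusals where the base model was wrong / refusals
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    answered = [o for o in outcomes if not o.refused]
    refused = [o for o in outcomes if o.refused]
    return {
        "coverage": len(answered) / len(outcomes),
        "selective_accuracy": (
            sum(o.correct for o in answered) / len(answered) if answered else 0.0
        ),
        "refusal_precision": (
            sum(not o.base_correct for o in refused) / len(refused) if refused else 0.0
        ),
    }


# Toy data: three answered questions (two right), two justified refusals.
toy = [
    Outcome(refused=False, correct=True, base_correct=True),
    Outcome(refused=False, correct=True, base_correct=True),
    Outcome(refused=False, correct=False, base_correct=False),
    Outcome(refused=True, correct=False, base_correct=False),
    Outcome(refused=True, correct=False, base_correct=False),
]
metrics = selective_metrics(toy)
print(metrics)
```

Under these definitions a model can reach very high refusal precision while answering only a fraction of the questions, so coverage is worth reporting next to the other two numbers.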
Architecture (brief)
Pass 1 (read): text → frozen LLM → hooks capture per-layer activations
→ SelectiveIntrospectionEncoder → cognitive tokens (one per layer)
Pass 2 (generate): text + cognitive tokens injected via
32× BottleneckCrossAttention (tanh gates) → answer
Five components: ActivationCollector (hooks) → Cognitive Encoder →
cognitive tokens → meta-attention (BottleneckCrossAttention) → gates.
Details in the code repository's docs/.
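The pass-2 injection can be illustrated with a toy gated cross-attention step in plain Python. This is a sketch under stated assumptions, not the repository's BottleneckCrossAttention: it uses a single head, no learned projections and no bottleneck, and the residual form `h + tanh(gate) * attn(h, tokens)` is an assumed reading of "tanh gates". The function names `cross_attend` and `gated_injection` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(hidden, cognitive_tokens):
    """Attend from one hidden vector (query) over the cognitive tokens.

    The tokens serve as both keys and values; a real module would apply
    learned projections first.
    """
    scale = math.sqrt(len(hidden))
    scores = [
        sum(h * t for h, t in zip(hidden, token)) / scale
        for token in cognitive_tokens
    ]
    weights = softmax(scores)
    return [
        sum(w * token[i] for w, token in zip(weights, cognitive_tokens))
        for i in range(len(hidden))
    ]


def gated_injection(hidden, cognitive_tokens, gate):
    """Residual update: hidden + tanh(gate) * cross-attention context."""
    context = cross_attend(hidden, cognitive_tokens)
    g = math.tanh(gate)
    return [h + g * c for h, c in zip(hidden, context)]


hidden = [1.0, 0.0, -1.0, 0.5]
cognitive_tokens = [
    [0.2, 0.1, 0.0, -0.3],
    [1.0, 1.0, 1.0, 1.0],
    [-0.5, 0.4, 0.3, 0.0],
]

closed = gated_injection(hidden, cognitive_tokens, gate=0.0)  # gate shut
opened = gated_injection(hidden, cognitive_tokens, gate=0.3)  # documented init
print(closed)
print(opened)
```

With `gate=0.0` the hidden state passes through unchanged, which recovers the frozen model's behaviour. At the documented init of 0.3, tanh(0.3) ≈ 0.29, so the injected context starts as a modest correction on top of the base activations.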
How to use
- Get access to the base model on Hugging Face (for Llama, accept the Llama 3.1/3.2 Community License).
- Download the checkpoint you want from checkpoints/.
- Load it into the matching wrapper from the code repository, e.g. the Phase 2 Selective record:
# from the meta-transformers repository
# see src/phase2_selective_llama8b/04_evaluate.py for the full loading + two-pass example
from src.phase2_selective_llama8b.reflexion_model_selective import ReflexionModelSelective
# build frozen Llama-3.1-8B-Instruct, wrap it, load the encoder + cross-attention weights
Checkpoint → code mapping:
| Checkpoint | Code module |
|---|---|
| phase5_multipos/ | src/phase5_multipos_llama8b/ |
| phase2_selective_best_model.pt | src/phase2_selective_llama8b/ |
| selective_mmlu_best_model.pt, selective/ | src/phase1_selective_llama8b/, src/cross_domain_llama8b/ |
| phase4_dynamic_gates/ | src/phase4_dynamic_gates_llama8b/ |
| allca/, allca_tg/ | src/phase1_allca_llama8b/, src/phase1_allca_tg_llama8b/ |
| allheads/ | src/phase1_allheads_llama8b/ |
| phase7_llama1b_* | src/phase7_recursive_introspection_llama1b/ |
| phase8_* | src/phase8_transformer_encoder_llama1b/ |
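The mapping above can also be kept as a small lookup so that scripts pick the right wrapper package for a checkpoint name. A minimal sketch: the rows mirror the table one to one (grouped entries stay grouped rather than being paired up by guesswork), while `code_modules_for` and the example file name `phase8_example.pt` are hypothetical and not part of the code repository.

```python
from fnmatch import fnmatchcase

# (checkpoint patterns, code modules), one tuple per table row.
CHECKPOINT_ROWS = [
    (("phase5_multipos/",), ("src/phase5_multipos_llama8b/",)),
    (("phase2_selective_best_model.pt",), ("src/phase2_selective_llama8b/",)),
    (
        ("selective_mmlu_best_model.pt", "selective/"),
        ("src/phase1_selective_llama8b/", "src/cross_domain_llama8b/"),
    ),
    (("phase4_dynamic_gates/",), ("src/phase4_dynamic_gates_llama8b/",)),
    (
        ("allca/", "allca_tg/"),
        ("src/phase1_allca_llama8b/", "src/phase1_allca_tg_llama8b/"),
    ),
    (("allheads/",), ("src/phase1_allheads_llama8b/",)),
    (("phase7_llama1b_*",), ("src/phase7_recursive_introspection_llama1b/",)),
    (("phase8_*",), ("src/phase8_transformer_encoder_llama1b/",)),
]


def code_modules_for(checkpoint_name):
    """Return the candidate code modules for a checkpoint name.

    Wildcard rows such as "phase8_*" are matched with shell-style globbing.
    """
    for patterns, modules in CHECKPOINT_ROWS:
        if any(fnmatchcase(checkpoint_name, p) for p in patterns):
            return list(modules)
    raise KeyError(f"no code module listed for checkpoint {checkpoint_name!r}")


print(code_modules_for("phase2_selective_best_model.pt"))
print(code_modules_for("phase8_example.pt"))  # hypothetical file name
```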
Training details
- Base: fully frozen (Llama-3.1-8B / Llama-3.2-1B / Gemma-2-2B depending on the phase).
- Trained: SelectiveIntrospectionEncoder (51.7M) + 32× BottleneckCrossAttention (136.5M).
- Objective: standard LM cross-entropy on confirm / correct / refuse targets (Phase 2+).
- Gate init: 0.3 (linear region of tanh), gate lr ×5.
- Data: full MMLU (12,042 questions, 57 subjects) for the records; TriviaQA / MMLU Hard / GSM8K in other phases.
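The ~188M and ~2.3% figures quoted in the introduction follow from the two component sizes listed above. A quick check, assuming roughly 8.03B parameters for Llama-3.1-8B (a commonly cited count that this README does not state) and, for the per-module figure, that the 136.5M is split evenly across the 32 cross-attention modules:

```python
ENCODER_PARAMS = 51.7e6      # SelectiveIntrospectionEncoder, from the list above
CROSS_ATTN_PARAMS = 136.5e6  # all 32 BottleneckCrossAttention modules combined
NUM_CROSS_ATTN = 32
BASE_PARAMS = 8.03e9         # assumption: approximate size of Llama-3.1-8B

trainable = ENCODER_PARAMS + CROSS_ATTN_PARAMS
fraction = trainable / BASE_PARAMS
per_module = CROSS_ATTN_PARAMS / NUM_CROSS_ATTN  # assumes equal-sized modules

print(f"trainable parameters: {trainable / 1e6:.1f}M")
print(f"share of the frozen base: {fraction:.1%}")
print(f"per cross-attention module: {per_module / 1e6:.2f}M")
```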
Limitations
- Over-refusal: the record checkpoints refuse often (~63% on MMLU). Refusal precision is near-perfect but the model is conservative: it prefers a calibrated refusal over a risky answer.
- Self-correction rarely fires at 8B (the model prefers refusing to correcting). Richer tokenization (Phase 5) improves accuracy but not correction.
- Requires the two-pass generation wrapper from the code repository; not a drop-in HF model.
Citation
@software{meta_transformers_core_2026,
title = {Meta-transformers: Architectural Introspection for Large Language Models},
year = {2026}
}
License
Apache 2.0. Base models are subject to their own licenses: Llama 3.1 / 3.2 Community License, Gemma Terms of Use. Obtain their weights from Hugging Face separately.