Meta-transformers — Architectural Introspection for LLMs

This repository hosts the weights, data, and results of the architectural-introspection experiments ("meta-transformers"). The idea: instead of text-based reflection, give a model direct access to its own activations through a learnable feedback loop.

The base model stays frozen. Only a thin introspection pathway is trained (~188M parameters on Llama-3.1-8B, about 2.3% of the base): an activation encoder plus 32 BottleneckCrossAttention modules. At inference the model runs two passes: in the first it reads its own per-layer activations and encodes them into cognitive tokens; in the second it injects those tokens back via gated cross-attention while generating. The goal is calibrated refusal and self-correction; in the 8B experiments, refusal is the behavior that reliably appears (see Limitations).

Code, training and evaluation scripts: https://codeberg.org/imperius/meta-transformers-ENG.git

Repository layout

Folder	Contents
`checkpoints/`	Trained introspection weights (encoder + cross-attention). The base model is not included — download it separately from Hugging Face.
`data/`	Pre-collected activations (training datasets for the introspection pathway).
`results/`	Metrics, training logs and histories for every experiment (JSON / txt / log).

⚠️ This is not a drop-in HF model. The weights load into a ReflexionModel* wrapper from the code repository and require two-pass generation. Without the code the weights are not usable on their own.

Key results

Experiment	Selective accuracy	Refusal precision	Checkpoint
Phase 5 Multi-Position (Variant B) — record	90.1%	98.7%	`checkpoints/phase5_multipos/`
Phase 2 Selective MMLU — calibration record	89.1%	99.84%	`checkpoints/phase2_selective_best_model.pt`
Cross-domain (MMLU→TriviaQA, zero-shot)	91.1%	100%	`checkpoints/selective_mmlu_best_model.pt`
Phase 4 Dynamic Gates (7/7 checks)	88.9%	99.0%	`checkpoints/phase4_dynamic_gates/`
Phase 1 Selective (basis of the records)	71.4%	84.9%	`checkpoints/selective/`

Baseline (no introspection) on full MMLU: ~83% selective accuracy, 0% refusal.
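
This card does not spell out how the two table metrics are computed. The sketch below is one plausible reading, not the repository's evaluation code. It assumes that selective accuracy is accuracy over the questions the model chose to answer, and that refusal precision is the share of refusals on which the frozen base model alone would have answered incorrectly. The `Record` layout and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    refused: bool         # the wrapped model declined to answer
    answer_correct: bool  # its answer was correct (ignored when refused)
    base_correct: bool    # the frozen base model alone was correct


def selective_accuracy(records):
    """Accuracy over answered questions only (assumed definition)."""
    answered = [r for r in records if not r.refused]
    if not answered:
        return 0.0
    return sum(r.answer_correct for r in answered) / len(answered)


def refusal_precision(records):
    """Share of refusals where the base model would have been wrong (assumed definition)."""
    refused = [r for r in records if r.refused]
    if not refused:
        return 0.0
    return sum(not r.base_correct for r in refused) / len(refused)


def refusal_rate(records):
    """Share of all questions that were refused."""
    return sum(r.refused for r in records) / len(records)


# Toy data, not taken from any experiment in this repository.
records = [
    Record(refused=False, answer_correct=True, base_correct=True),
    Record(refused=False, answer_correct=True, base_correct=True),
    Record(refused=False, answer_correct=False, base_correct=False),
    Record(refused=True, answer_correct=False, base_correct=False),
    Record(refused=True, answer_correct=False, base_correct=True),
]

print(selective_accuracy(records))  # 2 of 3 answered questions are correct
print(refusal_precision(records))   # 1 of 2 refusals was justified
print(refusal_rate(records))        # 2 of 5 questions refused
```

Under this reading, a model that never refuses has a selective accuracy equal to its plain accuracy, which is how the baseline line above would be interpreted.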

Cross-domain is the strongest evidence of generalization: a checkpoint trained only on MMLU keeps 100% refusal precision zero-shot on TriviaQA. This suggests the encoder reads the model's own internal uncertainty signal rather than benchmark-specific patterns.

Architecture (brief)

Pass 1 (read):     text → frozen LLM → hooks capture per-layer activations
                   → SelectiveIntrospectionEncoder → cognitive tokens (one per layer)
Pass 2 (generate): text + cognitive tokens injected via
                   32× BottleneckCrossAttention (tanh gates) → answer

Five components: ActivationCollector (hooks) → Cognitive Encoder → cognitive tokens → meta-attention (BottleneckCrossAttention) → gates. Details in the code repository's docs/.
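
To make the injection step concrete, here is a toy, dependency-free sketch of tanh-gated cross-attention over cognitive tokens. It is not the repository's BottleneckCrossAttention: it assumes a residual update of the form hidden + tanh(gate) × attention(hidden, tokens), and it omits the learned projections and the bottleneck entirely. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(hidden, cognitive_tokens):
    """Single-head dot-product attention: one hidden state attends over the tokens."""
    scale = math.sqrt(len(hidden))
    scores = [
        sum(h * c for h, c in zip(hidden, tok)) / scale
        for tok in cognitive_tokens
    ]
    weights = softmax(scores)
    return [
        sum(w * tok[i] for w, tok in zip(weights, cognitive_tokens))
        for i in range(len(hidden))
    ]


def gated_inject(hidden, cognitive_tokens, gate):
    """Assumed residual form: hidden + tanh(gate) * cross_attend(hidden, tokens)."""
    context = cross_attend(hidden, cognitive_tokens)
    g = math.tanh(gate)
    return [h + g * c for h, c in zip(hidden, context)]


hidden = [1.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for per-layer cognitive tokens

unchanged = gated_inject(hidden, tokens, gate=0.0)  # closed gate: base behavior
injected = gated_inject(hidden, tokens, gate=0.3)   # partially open gate
```

With the gate at 0 the hidden state passes through untouched, so the frozen model's behavior is recovered exactly. A nonzero gate mixes in a weighted average of the cognitive tokens.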

How to use

Get access to the base model on Hugging Face (for Llama, accept the Llama 3.1/3.2 Community License).
Download the checkpoint you want from checkpoints/.
Load it into the matching wrapper from the code repository, e.g. the Phase 2 Selective record:

# from the meta-transformers repository
# see src/phase2_selective_llama8b/04_evaluate.py for the full loading + two-pass example
from src.phase2_selective_llama8b.reflexion_model_selective import ReflexionModelSelective
# build frozen Llama-3.1-8B-Instruct, wrap it, load the encoder + cross-attention weights

Checkpoint → code mapping:

Checkpoint	Code module
`phase5_multipos/`	`src/phase5_multipos_llama8b/`
`phase2_selective_best_model.pt`	`src/phase2_selective_llama8b/`
`selective_mmlu_best_model.pt`, `selective/`	`src/phase1_selective_llama8b/`, `src/cross_domain_llama8b/`
`phase4_dynamic_gates/`	`src/phase4_dynamic_gates_llama8b/`
`allca/`, `allca_tg/`	`src/phase1_allca_llama8b/`, `src/phase1_allca_tg_llama8b/`
`allheads/`	`src/phase1_allheads_llama8b/`
`phase7_llama1b_*`	`src/phase7_recursive_introspection_llama1b/`
`phase8_*`	`src/phase8_transformer_encoder_llama1b/`
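
The mapping above can also be kept as a small lookup, which is convenient when scripting over several checkpoints. This is only a sketch: the patterns and module paths are copied from the table, while the helper name `modules_for` and the prefix handling are assumptions. Rows that list two modules are returned as two candidates, exactly as the table gives them.

```python
from fnmatch import fnmatch

# (checkpoint pattern, candidate code modules), copied from the table above.
CHECKPOINT_TO_MODULES = [
    ("phase5_multipos/", ["src/phase5_multipos_llama8b/"]),
    ("phase2_selective_best_model.pt", ["src/phase2_selective_llama8b/"]),
    ("selective_mmlu_best_model.pt",
     ["src/phase1_selective_llama8b/", "src/cross_domain_llama8b/"]),
    ("selective/",
     ["src/phase1_selective_llama8b/", "src/cross_domain_llama8b/"]),
    ("phase4_dynamic_gates/", ["src/phase4_dynamic_gates_llama8b/"]),
    ("allca/", ["src/phase1_allca_llama8b/", "src/phase1_allca_tg_llama8b/"]),
    ("allca_tg/", ["src/phase1_allca_llama8b/", "src/phase1_allca_tg_llama8b/"]),
    ("allheads/", ["src/phase1_allheads_llama8b/"]),
    ("phase7_llama1b_*", ["src/phase7_recursive_introspection_llama1b/"]),
    ("phase8_*", ["src/phase8_transformer_encoder_llama1b/"]),
]


def modules_for(checkpoint_name):
    """Return the candidate code modules for a checkpoint file or folder name."""
    name = checkpoint_name
    if name.startswith("checkpoints/"):
        name = name[len("checkpoints/"):]
    name = name.rstrip("/")
    for pattern, modules in CHECKPOINT_TO_MODULES:
        if fnmatch(name, pattern.rstrip("/")):
            return list(modules)
    return []


print(modules_for("checkpoints/phase2_selective_best_model.pt"))
# "phase7_llama1b_example" is a made-up name used only to exercise the wildcard row.
print(modules_for("phase7_llama1b_example"))
```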

Training details

Base: fully frozen (Llama-3.1-8B / Llama-3.2-1B / Gemma-2-2B depending on the phase).
Trained: SelectiveIntrospectionEncoder (~51.7M) + 32× BottleneckCrossAttention (~136.5M).
Objective: standard LM cross-entropy on confirm / correct / refuse targets (Phase 2+).
Gate init: 0.3 (linear region of tanh), gate lr ×5.
Data: full MMLU (12,042 questions, 57 subjects) for the records; TriviaQA / MMLU Hard / GSM8K in other phases.
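
The figures above can be sanity-checked with a few lines of arithmetic. The snippet sums the two trainable components, compares the total with the base model size, and evaluates the tanh gate at its initial value. The base size of about 8.03B parameters for Llama-3.1-8B is an assumption taken from the model's published size, not from this card.

```python
import math

ENCODER_PARAMS_M = 51.7      # SelectiveIntrospectionEncoder, millions (from this card)
CROSS_ATTN_PARAMS_M = 136.5  # 32x BottleneckCrossAttention, millions (from this card)
BASE_PARAMS_M = 8030.0       # assumption: Llama-3.1-8B has about 8.03B parameters

TRAINABLE_M = ENCODER_PARAMS_M + CROSS_ATTN_PARAMS_M
fraction = TRAINABLE_M / BASE_PARAMS_M

GATE_INIT = 0.3
gate_value = math.tanh(GATE_INIT)  # effective gate strength at initialization
gate_slope = 1.0 - gate_value**2   # derivative of tanh at the init point

print(f"trainable: {TRAINABLE_M:.1f}M ({fraction:.1%} of base)")
print(f"tanh(0.3) = {gate_value:.3f}, slope = {gate_slope:.3f}")
```

The total comes to about 188.2M, or roughly 2.3% of the base, matching the figure quoted in the introduction. At 0.3 the gate outputs about 0.291 with a slope of about 0.915, so tanh is still close to linear there and the gate parameter receives nearly undamped gradients.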

Limitations

Over-refusal: the record checkpoints refuse often (~63% on MMLU). Refusal precision is near-perfect but the model is conservative — it prefers a calibrated refusal over a risky answer.
Self-correction rarely fires at 8B (the model prefers refusing to correcting). Richer tokenization (Phase 5) improves accuracy but not correction.
Requires the two-pass generation wrapper from the code repository; not a drop-in HF model.

Citation

@software{meta_transformers_core_2026,
  title = {Meta-transformers: Architectural Introspection for Large Language Models},
  year  = {2026}
}

License

Apache 2.0. Base models are subject to their own licenses: Llama 3.1 / 3.2 Community License, Gemma Terms of Use — obtain their weights from Hugging Face separately.
