Meta-transformers: Architectural Introspection for LLMs

This repository hosts the weights, data, and results of the architectural-introspection experiments ("meta-transformers"). The idea: instead of text-based reflection, give a model direct access to its own activations through a learnable feedback loop.

The base model stays frozen. Only a thin introspection pathway is trained (~188M parameters on Llama-3.1-8B, ~2.3% of the base): an activation encoder plus 32 BottleneckCrossAttention modules. At inference the model runs two passes: it reads its own per-layer activations, encodes them into cognitive tokens, and injects them back via gated cross-attention. The pathway is trained to produce calibrated refusal and self-correction.
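
The parameter budget can be checked with a few lines of arithmetic. This is a minimal sketch: the component sizes are the ones listed under Training details, while the base-model total of about 8.03B parameters is an assumption (the commonly quoted size of Llama-3.1-8B), not a figure from this repository.

```python
# Component sizes as listed under "Training details" (millions of parameters).
ENCODER_M = 51.7            # SelectiveIntrospectionEncoder
CROSS_ATTENTION_M = 136.5   # 32 BottleneckCrossAttention modules combined

# Assumption: total parameter count of Llama-3.1-8B, in millions.
BASE_MODEL_M = 8030.0


def trainable_budget(encoder_m, cross_attention_m, base_m):
    """Return (trainable params in millions, percentage of the frozen base)."""
    trainable_m = encoder_m + cross_attention_m
    return trainable_m, 100.0 * trainable_m / base_m


trainable_m, pct = trainable_budget(ENCODER_M, CROSS_ATTENTION_M, BASE_MODEL_M)
print(f"trainable: {trainable_m:.1f}M ({pct:.1f}% of the base)")
# prints: trainable: 188.2M (2.3% of the base)
```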

Code, training and evaluation scripts: https://codeberg.org/imperius/meta-transformers-ENG.git


Repository layout

| Folder | Contents |
|---|---|
| checkpoints/ | Trained introspection weights (encoder + cross-attention). The base model is not included; download it separately from Hugging Face. |
| data/ | Pre-collected activations (training datasets for the introspection pathway). |
| results/ | Metrics, training logs and histories for every experiment (JSON / txt / log). |

⚠️ This is not a drop-in HF model. The weights load into a ReflexionModel* wrapper from the code repository and require two-pass generation. Without the code the weights are not usable on their own.


Key results

| Experiment | Selective accuracy | Refusal precision | Checkpoint |
|---|---|---|---|
| Phase 5 Multi-Position (Variant B), record | 90.1% | 98.7% | checkpoints/phase5_multipos/ |
| Phase 2 Selective MMLU, calibration record | 89.1% | 99.84% | checkpoints/phase2_selective_best_model.pt |
| Cross-domain (MMLU→TriviaQA, zero-shot) | 91.1% | 100% | checkpoints/selective_mmlu_best_model.pt |
| Phase 4 Dynamic Gates (7/7 checks) | 88.9% | 99.0% | checkpoints/phase4_dynamic_gates/ |
| Phase 1 Selective (basis of the records) | 71.4% | 84.9% | checkpoints/selective/ |

Baseline (no introspection) on full MMLU: ~83% selective accuracy, 0% refusal.

Cross-domain is the strongest evidence of generalization: a checkpoint trained only on MMLU keeps 100% refusal precision zero-shot on TriviaQA. This suggests that the encoder reads the model's own internal uncertainty signal rather than benchmark-specific patterns.
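
This README does not spell out how the two headline metrics are computed, so the sketch below is an assumption. It treats selective accuracy as accuracy over answered questions only, and refusal precision as the share of refusals that fall on questions the frozen base model would have answered incorrectly. The `Outcome` record and its fields are hypothetical; see the evaluation scripts in the code repository for the definitions actually used.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One evaluated question (hypothetical record layout, for illustration)."""
    refused: bool          # the introspective model declined to answer
    answer_correct: bool   # its answer was right (ignored when refused)
    base_correct: bool     # the frozen base model alone was right


def selective_accuracy(outcomes):
    """Accuracy over answered questions only (assumed definition)."""
    answered = [o for o in outcomes if not o.refused]
    if not answered:
        return float("nan")
    return sum(o.answer_correct for o in answered) / len(answered)


def refusal_precision(outcomes):
    """Share of refusals where the base model was wrong (assumed definition)."""
    refused = [o for o in outcomes if o.refused]
    if not refused:
        return float("nan")
    return sum(not o.base_correct for o in refused) / len(refused)


# Toy data, not results from this repository.
toy = [
    Outcome(refused=False, answer_correct=True, base_correct=True),
    Outcome(refused=False, answer_correct=False, base_correct=False),
    Outcome(refused=True, answer_correct=False, base_correct=False),
    Outcome(refused=True, answer_correct=False, base_correct=True),
]
print(selective_accuracy(toy), refusal_precision(toy))  # prints: 0.5 0.5
```

Under these definitions a model that never refuses has a selective accuracy equal to its plain accuracy, which is how the baseline row can be read.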


Architecture (brief)

Pass 1 (read):     text → frozen LLM → hooks capture per-layer activations
                   → SelectiveIntrospectionEncoder → cognitive tokens (one per layer)
Pass 2 (generate): text + cognitive tokens injected via
                   32× BottleneckCrossAttention (tanh gates) → answer

Five components: ActivationCollector (hooks) → Cognitive Encoder → cognitive tokens → meta-attention (BottleneckCrossAttention) → gates. Details in the code repository's docs/.
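
The injection step of pass 2 can be illustrated with a toy, dependency-free sketch. It assumes a residual form, `hidden + tanh(gate) * attention(hidden, cognitive_tokens)`, with a single head and no learned projections or bottleneck. The function names and shapes are hypothetical and this is not the repository's BottleneckCrossAttention implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(hidden, cognitive_tokens, gate):
    """Inject cognitive tokens into one hidden state.

    hidden:           list[float], the hidden state at one position
    cognitive_tokens: list[list[float]], one vector per layer
    gate:             float, learnable scalar passed through tanh

    Learned query/key/value projections and the bottleneck are omitted.
    """
    scale = 1.0 / math.sqrt(len(hidden))
    # Scaled dot-product score of the hidden state against each cognitive token.
    scores = [scale * sum(h * c for h, c in zip(hidden, token))
              for token in cognitive_tokens]
    weights = softmax(scores)
    # Attention-weighted mix of the cognitive tokens.
    context = [sum(w * token[i] for w, token in zip(weights, cognitive_tokens))
               for i in range(len(hidden))]
    g = math.tanh(gate)
    return [h + g * c for h, c in zip(hidden, context)]


hidden = [1.0, 0.0, -1.0, 0.5]
tokens = [[0.2, 0.1, 0.0, -0.3], [1.0, 1.0, 1.0, 1.0]]

# A zero gate leaves the frozen model's hidden state untouched.
assert gated_cross_attention(hidden, tokens, gate=0.0) == hidden
print(gated_cross_attention(hidden, tokens, gate=0.3))
```

With the gate at zero the wrapper reduces to the frozen base model, so the gate value controls how strongly the introspection signal can alter generation.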


How to use

  1. Get access to the base model on Hugging Face (for Llama, accept the Llama 3.1/3.2 Community License).
  2. Download the checkpoint you want from checkpoints/.
  3. Load it into the matching wrapper from the code repository, e.g. the Phase 2 Selective record:
# from the meta-transformers repository
# see src/phase2_selective_llama8b/04_evaluate.py for the full loading + two-pass example
from src.phase2_selective_llama8b.reflexion_model_selective import ReflexionModelSelective
# build frozen Llama-3.1-8B-Instruct, wrap it, load the encoder + cross-attention weights

Checkpoint → code mapping:

| Checkpoint | Code module |
|---|---|
| phase5_multipos/ | src/phase5_multipos_llama8b/ |
| phase2_selective_best_model.pt | src/phase2_selective_llama8b/ |
| selective_mmlu_best_model.pt, selective/ | src/phase1_selective_llama8b/, src/cross_domain_llama8b/ |
| phase4_dynamic_gates/ | src/phase4_dynamic_gates_llama8b/ |
| allca/, allca_tg/ | src/phase1_allca_llama8b/, src/phase1_allca_tg_llama8b/ |
| allheads/ | src/phase1_allheads_llama8b/ |
| phase7_llama1b_* | src/phase7_recursive_introspection_llama1b/ |
| phase8_* | src/phase8_transformer_encoder_llama1b/ |
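
For scripted use, the checkpoint-to-code mapping in this README can be kept as a small lookup. The helper names below are hypothetical. Rows that list several checkpoints and several modules are kept grouped exactly as written, because the README does not say whether they pair up one-to-one.

```python
from fnmatch import fnmatchcase

# Rows copied from the checkpoint-to-code mapping in this README:
# (checkpoint names or patterns, code modules listed for that row).
ROWS = [
    (["phase5_multipos/"], ["src/phase5_multipos_llama8b/"]),
    (["phase2_selective_best_model.pt"], ["src/phase2_selective_llama8b/"]),
    (["selective_mmlu_best_model.pt", "selective/"],
     ["src/phase1_selective_llama8b/", "src/cross_domain_llama8b/"]),
    (["phase4_dynamic_gates/"], ["src/phase4_dynamic_gates_llama8b/"]),
    (["allca/", "allca_tg/"],
     ["src/phase1_allca_llama8b/", "src/phase1_allca_tg_llama8b/"]),
    (["allheads/"], ["src/phase1_allheads_llama8b/"]),
    (["phase7_llama1b_*"], ["src/phase7_recursive_introspection_llama1b/"]),
    (["phase8_*"], ["src/phase8_transformer_encoder_llama1b/"]),
]


def _matches(name, pattern):
    """Directory entries match the folder or anything inside it; others use globbing."""
    if pattern.endswith("/"):
        return name == pattern[:-1] or name.startswith(pattern)
    return fnmatchcase(name, pattern)


def modules_for(checkpoint):
    """Return the code modules listed for a checkpoint path, or [] if it is not listed."""
    name = checkpoint
    if name.startswith("checkpoints/"):
        name = name[len("checkpoints/"):]
    for patterns, modules in ROWS:
        if any(_matches(name, p) for p in patterns):
            return modules
    return []


print(modules_for("checkpoints/phase2_selective_best_model.pt"))
print(modules_for("checkpoints/phase4_dynamic_gates/"))
```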

Training details

  • Base: fully frozen (Llama-3.1-8B / Llama-3.2-1B / Gemma-2-2B depending on the phase).
  • Trained: SelectiveIntrospectionEncoder (51.7M) + 32× BottleneckCrossAttention (136.5M).
  • Objective: standard LM cross-entropy on confirm / correct / refuse targets (Phase 2+).
  • Gate init: 0.3 (linear region of tanh), gate lr ×5.
  • Data: full MMLU (12,042 questions, 57 subjects) for the records; TriviaQA / MMLU Hard / GSM8K in other phases.
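
The remark about the linear region of tanh can be checked numerically. The short sketch below (helper name hypothetical) prints the gate output and the local slope, 1 - tanh², at the initial value of 0.3 and at two larger values for comparison.

```python
import math

GATE_INIT = 0.3  # initial gate value from the training details


def tanh_gate(raw):
    """Return (gate output, local slope d tanh / d raw) for a raw gate value."""
    out = math.tanh(raw)
    return out, 1.0 - out * out


for raw in (GATE_INIT, 1.0, 3.0):
    out, slope = tanh_gate(raw)
    print(f"raw={raw:.1f}  tanh={out:.3f}  slope={slope:.3f}")
# raw=0.3  tanh=0.291  slope=0.915
# raw=1.0  tanh=0.762  slope=0.420
# raw=3.0  tanh=0.995  slope=0.010
```

At 0.3 the gate output is almost equal to its raw value and the slope is close to 1, so gradients reach the gate nearly unattenuated. Near saturation the slope collapses and the gate becomes hard to train.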

Limitations

  • Over-refusal: the record checkpoints refuse often (~63% on MMLU). Refusal precision is near-perfect but the model is conservative: it prefers a calibrated refusal over a risky answer.
  • Self-correction rarely fires at 8B (the model prefers refusing to correcting). Richer tokenization (Phase 5) improves accuracy but not correction.
  • Requires the two-pass generation wrapper from the code repository; not a drop-in HF model.
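
The over-refusal trade-off can be made concrete with figures quoted in this README. The sketch assumes that selective accuracy is measured over answered questions only, and that the ~63% refusal rate applies to the Phase 2 Selective checkpoint (89.1%). Both are assumptions, and the outputs are back-of-envelope estimates, not reported results.

```python
def outcome_shares(refusal_rate, selective_accuracy):
    """Split all questions into refused / answered-correct / answered-wrong shares."""
    answered = 1.0 - refusal_rate
    return {
        "refused": refusal_rate,
        "answered_correct": answered * selective_accuracy,
        "answered_wrong": answered * (1.0 - selective_accuracy),
    }


# Baseline figures (~83%, 0% refusal) and Phase 2 figures as quoted in this README.
baseline = outcome_shares(refusal_rate=0.0, selective_accuracy=0.83)
introspective = outcome_shares(refusal_rate=0.63, selective_accuracy=0.891)

for name, shares in (("baseline", baseline), ("introspective", introspective)):
    print(name, {k: round(v, 3) for k, v in shares.items()})
```

Under these assumptions, wrong answers drop from about 17% to about 4% of all questions, while correct answers drop from about 83% to about 33%. The gain in reliability is paid for in coverage.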

Citation

@software{meta_transformers_core_2026,
  title = {Meta-transformers: Architectural Introspection for Large Language Models},
  year  = {2026}
}

License

Apache 2.0. Base models are subject to their own licenses (Llama 3.1 / 3.2 Community License, Gemma Terms of Use). Obtain their weights from Hugging Face separately.
