Recurrent-Gemma-2-2b

A depth-recurrent (Huginn / Raven) language model retrofitted from google/gemma-2-2b by model surgery followed by a recurrence-curriculum healing phase.

Instead of a fixed stack of layers, the model has a small recurrent core block that is looped a controllable number of times at inference, so it can spend more compute on harder inputs without adding parameters (the "think deeper" knob is num_steps).

Method: Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence, generalised here to Gemma-2. The paper does not cover Gemma-2, which needs 4-norm sandwich blocks, (1+w) fp32 RMSNorm, GeGLU, eager attention with attention and final-logit soft-capping, and √d embedding scaling; the converter handles all of these.
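The Gemma-2 numerics listed above can be illustrated with a minimal sketch in plain Python. It is not the converter's code: the function names are hypothetical, the eps value is an assumption, and Python floats stand in for the fp32 arithmetic the model uses.

```python
import math

ATTN_CAP, FINAL_CAP = 50.0, 30.0  # soft-capping values quoted in this card (50/30)

def soft_cap(logit: float, cap: float) -> float:
    """Squash a logit smoothly into [-cap, cap] via cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)

def rms_norm_1pw(x, w, eps=1e-6):
    """RMSNorm with a (1 + w) gain, so a zero weight means unit gain.
    eps=1e-6 is an assumed value, not taken from this card."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + g) for v, g in zip(x, w)]

def scale_embedding(e, d):
    """Multiply an embedding vector by sqrt(d), where d is the hidden size."""
    s = math.sqrt(d)
    return [v * s for v in e]

capped = soft_cap(1000.0, FINAL_CAP)           # very large logits saturate at the cap
small = soft_cap(0.1, ATTN_CAP)                # small logits pass through almost unchanged
normed = rms_norm_1pw([3.0, 4.0], [0.0, 0.0])  # zero weights: plain RMS normalisation
scaled = scale_embedding([1.0, -2.0], 4)       # sqrt(4) = 2
```

A surgery that misses any of these (for example, treating the norm gain as w instead of 1 + w) would be unlikely to reproduce the source model's logits.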

Architecture

input → embed (×√d) → prelude (4) → [ adapter + recurrent core (6) ] × R → coda (4) → norm → lm_head
                                    └─────── looped R times ───────┘
Base model: Gemma-2-2b (26 layers)
Split (prelude / recurrent core / coda): 4 / 6 / 4 (12 middle layers dropped)
Recurrence at inference: any num_steps; trained up to 16
Gemma-2 specifics preserved: 4-norm sandwich, (1+w) RMSNorm, GeGLU, attention and final-logit soft-capping (50/30), head_dim 256, √hidden embedding scale
Params: ~2.3B
model_type: huginn_raven
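As a rough illustration of the control flow in the diagram, the sketch below runs toy scalar "layers" through prelude, looped core, and coda, and counts layer applications. The function names, the zero initial state, and the adapter signature (merging the running state with the prelude output) are assumptions made for illustration, not the repo's actual modules.

```python
def count_layer_calls(num_steps, prelude=4, core=6, coda=4):
    """Layer applications per forward pass: the core is reused num_steps times."""
    return prelude + core * num_steps + coda

def forward(x, prelude, adapter, core, coda, num_steps, init_state=0.0):
    """Toy depth-recurrent forward pass over scalar stand-ins for hidden states."""
    calls = 0
    for layer in prelude:
        x = layer(x)
        calls += 1
    state = init_state  # assumed initial state, chosen only for this sketch
    for _ in range(num_steps):
        state = adapter(state, x)  # re-inject the prelude output on every pass
        for layer in core:
            state = layer(state)
            calls += 1
    for layer in coda:
        state = layer(state)
        calls += 1
    return state, calls

toy_layer = lambda h: 0.9 * h + 0.1      # stand-in for a transformer block
toy_adapter = lambda s, e: 0.5 * (s + e)  # stand-in for the adapter

out, calls = forward(1.0, [toy_layer] * 4, toy_adapter,
                     [toy_layer] * 6, [toy_layer] * 4, num_steps=16)
```

With the 4 / 6 / 4 split, num_steps=16 applies 4 + 6 × 16 + 4 = 104 layers (adapter calls not counted), while the parameter count stays that of 14 layers plus the adapter.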

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "irafm-llm/Recurrent-Gemma-2-2b"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda().eval()

ids = tok("The history of mathematics is", return_tensors="pt").input_ids.cuda()
out = model.generate(ids, max_new_tokens=40, do_sample=False,
                     num_steps=32,          # <-- recurrence depth
                     tokenizer=tok, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

trust_remote_code=True is required (the repo bundles its own raven_modeling_minimal.py). The model exposes the full Huginn-0125 step API (embed_inputs, initialize_state, iterate_one_step, predict_from_latents, …), so it works as a drop-in for Huginn-0125 code and for selective-recurrence control.
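To compare outputs at different recurrence depths, one option is a small helper that reruns the same prompt with several num_steps values. The helper and the stand-in generator below are hypothetical; in real use, generate_fn would wrap the model.generate call from the usage snippet.

```python
from typing import Callable, Dict, Iterable

def sweep_num_steps(generate_fn: Callable[[str, int], str],
                    prompt: str,
                    depths: Iterable[int] = (4, 8, 16, 32)) -> Dict[int, str]:
    """Run one prompt at several recurrence depths and collect the outputs."""
    return {r: generate_fn(prompt, r) for r in depths}

# Stand-in so the sketch runs without downloading the model.
# A real generate_fn would tokenize the prompt, call
# model.generate(ids, ..., num_steps=num_steps), and decode the new tokens.
def fake_generate(prompt: str, num_steps: int) -> str:
    return f"{prompt} [num_steps={num_steps}]"

results = sweep_num_steps(fake_generate, "The history of mathematics is")
for depth, text in results.items():
    print(depth, text)
```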

How it was made

  1. Surgery. Gemma-2-2b's layers are split into prelude (0–3), recurrent core (16–21), and coda (22–25); the 12 middle layers (4–15) are dropped. Attention, MLP, and all four norms are copied verbatim (fused QKV and gate-up), and the embeddings are untied. The full-cover conversion reproduces the source model's logits to numerical precision (logits MSE ~7e-11 in fp32), validating the Gemma-2 surgery.
  2. Healing. Trained on 65M tokens of FineWeb-Edu (sequence length 1024, AdamW, lr 5e-5, bf16) with a 1-sqrt mean-recurrence curriculum up to 16 and truncated BPTT through the last 8 passes. Eval loss at recurrence 16: **18 → ~2.9**.
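The index bookkeeping behind the two steps above can be sketched as follows. Both helpers are hypothetical: the rule that the core is taken from the layers directly before the coda is inferred from the indices listed here, and the no-grad / with-grad partition is one plausible reading of "truncated BPTT (last 8 passes)".

```python
def split_layers(num_layers=26, prelude=4, core=6, coda=4):
    """Return (prelude, core, coda, dropped) layer indices of the source model."""
    idx = list(range(num_layers))
    core_start = num_layers - coda - core
    prelude_idx = idx[:prelude]
    core_idx = idx[core_start:num_layers - coda]
    coda_idx = idx[num_layers - coda:]
    dropped = idx[prelude:core_start]  # middle layers that are not copied
    return prelude_idx, core_idx, coda_idx, dropped

def bptt_partition(num_steps, grad_window=8):
    """Split a recurrence count into (passes without grad, passes with grad)."""
    with_grad = min(num_steps, grad_window)
    return num_steps - with_grad, with_grad

prelude_idx, core_idx, coda_idx, dropped = split_layers()
no_grad, with_grad = bptt_partition(16)
```

For the default arguments this gives prelude 0–3, core 16–21, coda 22–25, and 12 dropped layers (4–15); at recurrence 16, the first 8 passes would run without gradients and the last 8 with them.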

Limitations

  • Healing is demonstration-scale (~65M tokens vs the paper's ~50B) and the split is aggressive (12 of 26 layers dropped), so output is fluent but can be repetitive. The model is not instruction-tuned.
  • Inherits Gemma-2's knowledge, biases and the Gemma Terms of Use.
  • Gemma-2 sliding-window attention is treated as full causal (identical for sequences ≤ 4096 tokens).
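The sliding-window point can be checked on a small scale: a causal mask and a sliding-window mask coincide whenever the sequence is no longer than the window. The sketch below uses a window of 8 as a stand-in for 4096 and assumes the convention that position i may attend to position j when 0 ≤ i − j < window.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, window):
    """Causal mask that also hides keys at distance window or more behind the query."""
    return [[j <= i and i - j < window for j in range(n)] for i in range(n)]

WINDOW = 8  # small stand-in for Gemma-2's 4096-token window

# Sequence fits inside the window: the two masks are identical.
same = causal_mask(WINDOW) == sliding_window_mask(WINDOW, WINDOW)

# One token longer: the last query can no longer see position 0.
differs = causal_mask(WINDOW + 1) != sliding_window_mask(WINDOW + 1, WINDOW)
```

Beyond the window length the two masks diverge, so outputs on longer sequences may differ from the original Gemma-2 behaviour.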

Citation

@article{mcleish2025teaching,
  title={Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence},
  author={McLeish, Sean and Li, Ang and Kirchenbauer, John and Kalra, Dayal Singh and Bartoldson, Brian R. and Kailkhura, Bhavya and Schwarzschild, Avi and Geiping, Jonas and Goldstein, Tom and Goldblum, Micah},
  journal={arXiv preprint arXiv:2511.07384}, year={2025}
}

Converted with huginn_surgery. Gemma-2 support is an original extension of the retrofit recipe.
