Recurrent-Gemma-2-2b

A depth-recurrent (Huginn / Raven) language model retrofitted from google/gemma-2-2b by model surgery followed by a recurrence-curriculum healing phase.

Instead of a fixed stack of layers, the model has a small recurrent core block that is looped a controllable number of times at inference, so it can spend more compute on harder inputs without adding parameters. The "think deeper" knob is `num_steps`.

Method: Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence, generalised here to Gemma-2. The paper does not cover Gemma-2, which needs 4-norm sandwich blocks, (1+w) fp32 RMSNorm, GeGLU, eager attention with attention and final logit soft-capping, and √d embedding scaling; the converter handles all of these.
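Two of the Gemma-2 details listed above are easy to state in a few lines. The sketch below is plain Python for illustration only and is not the converter's code: `soft_cap` and `rmsnorm_one_plus_w` are hypothetical names, and the `eps` default is an assumption. Soft-capping squashes a logit smoothly into (-cap, cap) with tanh, and the (1+w) RMSNorm scales the normalised vector by one plus the learned weight, so a zero weight leaves the normalised values unscaled.

```python
import math

def soft_cap(logit: float, cap: float) -> float:
    """Tanh soft-capping: smoothly bounds a logit to the interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

def rmsnorm_one_plus_w(x: list[float], w: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm with a (1 + w) scale. The eps default is an assumed value."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + wi) for v, wi in zip(x, w)]

# Small logits pass through almost unchanged; large ones saturate near the cap.
print(soft_cap(1.0, 50.0), soft_cap(1000.0, 30.0))
# With w = 0 the output has approximately unit root-mean-square.
print(rmsnorm_one_plus_w([3.0, 4.0], [0.0, 0.0]))
```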

Architecture

input → embed (×√d) → prelude (4) → [ adapter + recurrent core (6) ] × R → coda (4) → norm → lm_head
                                      └────────── looped R times ──────────┘


| Property | Value |
|---|---|
| Base model | Gemma-2-2b (26 layers) |
| Split (prelude / recurrent core / coda) | 4 / 6 / 4 (12 middle layers dropped) |
| Recurrence at inference | any `num_steps`; trained up to 16 |
| Gemma-2 specifics preserved | 4-norm sandwich, (1+w) RMSNorm, GeGLU, attention and final logit soft-capping (50 / 30), head_dim 256, √hidden embed scale |
| Params | ~2.3B |
| `model_type` | `huginn_raven` |
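As a rough illustration of the compute knob, the number of transformer-layer applications per forward pass grows linearly with the recurrence count R while the parameter count stays fixed. The helper below is a hypothetical sketch based only on the 4 / 6 / 4 split in the table; it ignores the adapter and any other per-step overhead.

```python
PRELUDE, CORE, CODA = 4, 6, 4  # layer counts from the split in the table

def layer_applications(num_steps: int) -> int:
    """Layers executed in one forward pass when the core is looped num_steps times."""
    return PRELUDE + CORE * num_steps + CODA

for r in (1, 3, 16, 32):
    print(r, layer_applications(r))
```

Under this count, R = 3 matches the 26 layer applications of the base model, and R = 16 runs 104.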

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "irafm-llm/Recurrent-Gemma-2-2b"
tok = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True loads the custom modeling code bundled with the repo
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

ids = tok("The history of mathematics is", return_tensors="pt").input_ids.cuda()
# Greedy decoding with an explicit recurrence depth
out = model.generate(ids, max_new_tokens=40, do_sample=False,
                     num_steps=32,          # <-- recurrence depth
                     tokenizer=tok, pad_token_id=tok.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

`trust_remote_code=True` is required because the repo bundles its own `raven_modeling_minimal.py`. The model exposes the full Huginn-0125 step API (`embed_inputs`, `initialize_state`, `iterate_one_step`, `predict_from_latents`, …), so it works as a drop-in for Huginn-0125 code and for selective-recurrence control.
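One way to see the effect of the knob is to decode the same prompt at several recurrence depths. The sketch below reuses only the `generate` call from the usage snippet; `compare_depths` is a hypothetical helper, the default depths are arbitrary, and it assumes `model` and `tok` were loaded as shown above.

```python
def compare_depths(model, tok, prompt, depths=(4, 8, 16, 32), max_new_tokens=40):
    """Greedy-decode one prompt at each recurrence depth and return {depth: text}."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    results = {}
    for n in depths:
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                             num_steps=n,  # recurrence depth for this run
                             tokenizer=tok, pad_token_id=tok.eos_token_id)
        # Keep only the continuation, not the prompt tokens.
        results[n] = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    return results
```

Calling `compare_depths(model, tok, "The history of mathematics is")` returns one continuation per depth, which makes it easy to inspect how the output changes as `num_steps` grows.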

How it was made

Surgery. Gemma-2-2b's layers are split into prelude (0–3), recurrent core (16–21), and coda (22–25); the 12 middle layers (4–15) are dropped. Attention, MLP, and all four norms are copied verbatim (with fused QKV and gate-up projections), and the embeddings are untied. The full-cover conversion reproduces the source model's logits to within floating-point error (logits MSE ~7e-11 in fp32), validating the Gemma-2 surgery.
Healing. 65M tokens of FineWeb-Edu, seq len 1024, AdamW lr 5e-5, bf16, with a 1-sqrt mean-recurrence curriculum up to 16 and truncated BPTT (last 8 passes). Eval loss (@rec 16): **18 → ~2.9**.
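The bookkeeping behind the two steps above can be checked in a few lines. This is an illustrative sketch, not the `huginn_surgery` code: the layer ranges come from the Surgery description, and `backprop_window` is a hypothetical helper that reads "truncated BPTT (last 8 passes)" as gradients flowing only through the final 8 recurrence passes.

```python
NUM_LAYERS = 26                  # Gemma-2-2b depth
prelude = list(range(0, 4))      # layers 0-3
core = list(range(16, 22))       # layers 16-21
coda = list(range(22, 26))       # layers 22-25

kept = prelude + core + coda
dropped = [i for i in range(NUM_LAYERS) if i not in kept]

def backprop_window(num_steps: int, last_k: int = 8) -> list[int]:
    """Indices of recurrence passes that receive gradients under truncated BPTT."""
    return list(range(max(0, num_steps - last_k), num_steps))

print(len(kept), len(dropped), dropped)
print(backprop_window(16))
```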

Limitations

Healing is demonstration-scale (~65M tokens vs the paper's ~50B) and the split is aggressive (12 of 26 layers dropped), so output is fluent but can be repetitive. The model is not instruction-tuned.
Inherits Gemma-2's knowledge, biases and the Gemma Terms of Use.
Gemma-2 sliding-window attention is treated as full causal (identical for sequences ≤ 4096).
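The sliding-window point can be illustrated with small boolean masks. The sketch below assumes a window convention in which query position i may attend to key position j when j <= i and i - j < window; the function names are hypothetical and the sizes are toy values rather than 4096. Under that convention the two masks coincide whenever the sequence fits inside the window.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Full causal mask: position i attends to every j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """Causal mask restricted to the most recent `window` positions (assumed convention)."""
    return [[j <= i and i - j < window for j in range(n)] for i in range(n)]

print(sliding_window_mask(4, 4) == causal_mask(4))  # sequence fits in the window
print(sliding_window_mask(6, 4) == causal_mask(6))  # sequence longer than the window
```

For sequences longer than the window the masks differ, which is where treating sliding-window layers as full causal departs from the base model.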

Citation

@article{mcleish2025teaching,
  title={Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence},
  author={McLeish, Sean and Li, Ang and Kirchenbauer, John and Kalra, Dayal Singh and Bartoldson, Brian R. and Kailkhura, Bhavya and Schwarzschild, Avi and Geiping, Jonas and Goldstein, Tom and Goldblum, Micah},
  journal={arXiv preprint arXiv:2511.07384}, year={2025}
}

Converted with huginn_surgery. Gemma-2 support is an original extension of the retrofit recipe.

Paper: Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence (arXiv 2511.07384, published Nov 10, 2025)