Recurrent-Gemma-2-2b
A depth-recurrent (Huginn / Raven) language model retrofitted from
google/gemma-2-2b by model surgery followed by a
recurrence-curriculum healing phase.
Instead of a fixed stack of layers, the model has a small recurrent core block looped a
controllable number of times at inference: spend more compute on harder inputs without adding
parameters (the "think deeper" knob is num_steps).
Method: Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence,
generalised here to Gemma-2, which the paper does not cover. Gemma-2 needs 4-norm sandwich blocks,
(1+w) fp32 RMSNorm, GeGLU, eager attention with attn/final logit soft-capping, and √d embedding
scaling; all are handled by the converter.
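Two of the Gemma-2 specifics listed above are easy to state in a few lines. The following is a minimal sketch, assuming the usual Gemma-2 formulations (soft-capping as `cap * tanh(x / cap)` and an RMSNorm whose learned weight is applied as `1 + w`). The function names and the `eps` default are illustrative and are not taken from the converter, and plain Python floats stand in for the fp32 tensors.

```python
import math

def soft_cap(logit, cap):
    """Squash a logit smoothly into [-cap, cap] (assumed form: cap * tanh(x / cap))."""
    return cap * math.tanh(logit / cap)

def rms_norm_one_plus_w(x, w, eps=1e-6):
    """RMSNorm with a (1 + w) scale: a zero-initialised weight leaves the scale at 1.

    x and w are equal-length lists of floats; eps is an assumed default.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + wi) for v, wi in zip(x, w)]

if __name__ == "__main__":
    # Attention logits are capped at 50 and final logits at 30 in this model card.
    print(soft_cap(5.0, 50.0), soft_cap(500.0, 30.0))
    print(rms_norm_one_plus_w([3.0, 4.0], [0.0, 0.0]))
```

Small logits pass through almost unchanged while large ones saturate at the cap, which is why an exact conversion has to keep the capping in place rather than approximate it away.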
Architecture
input → embed (×√d) → prelude (4) → [ adapter + recurrent core (6) ] × R → coda (4) → norm → lm_head
                                    └───────── looped R times ─────────┘
| Base model | Gemma-2-2b (26 layers) |
| Split (prelude / recurrent core / coda) | 4 / 6 / 4 (12 middle layers dropped) |
| Recurrence at inference | any num_steps; trained up to 16 |
| Gemma-2 specifics preserved | 4-norm sandwich, (1+w) RMSNorm, GeGLU, attn+final logit soft-capping (50/30), head_dim 256, √hidden embed scale |
| Params | ~2.3B |
| model_type | huginn_raven |
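The 4 / 6 / 4 split fixes how many transformer blocks run for a given recurrence count. The arithmetic can be sketched as below, using the source-layer indices given in the surgery description of this card. The constant and function names are illustrative, and the adapter is not counted as a block.

```python
# Source-layer indices of google/gemma-2-2b kept by the surgery (26 layers in total).
TOTAL_LAYERS = 26
PRELUDE = list(range(0, 4))    # layers 0-3
CORE = list(range(16, 22))     # layers 16-21, looped num_steps times
CODA = list(range(22, 26))     # layers 22-25

kept = set(PRELUDE) | set(CORE) | set(CODA)
dropped = [i for i in range(TOTAL_LAYERS) if i not in kept]

def effective_depth(num_steps):
    """Number of transformer blocks executed per forward pass."""
    return len(PRELUDE) + num_steps * len(CORE) + len(CODA)

if __name__ == "__main__":
    print("dropped layers:", dropped)
    for r in (1, 4, 16, 32):
        print(f"num_steps={r:>2} -> {effective_depth(r)} blocks")
```

At `num_steps=4` the model already executes more blocks (32) than the 26-layer source model, while storing only 14 of them.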
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "irafm-llm/Recurrent-Gemma-2-2b"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda().eval()
ids = tok("The history of mathematics is", return_tensors="pt").input_ids.cuda()
out = model.generate(ids, max_new_tokens=40, do_sample=False,
                     num_steps=32,  # <-- recurrence depth
                     tokenizer=tok, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
trust_remote_code=True is required (the repo bundles its own raven_modeling_minimal.py). The model
exposes the full Huginn-0125 step API (embed_inputs, initialize_state, iterate_one_step,
predict_from_latents, …), making it a drop-in for Huginn-0125 code and selective-recurrence control.
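To see what the num_steps knob does in practice, one option is to generate from the same prompt at several depths and compare the continuations. The helper below is a sketch built only on the `generate` call shown in the usage snippet; `sweep_num_steps` and its default depths are illustrative names and values, not part of the repo.

```python
def sweep_num_steps(model, tok, prompt, depths=(4, 8, 16, 32), max_new_tokens=40):
    """Greedy-decode the same prompt once per recurrence depth.

    model and tok are the objects loaded as in the usage snippet.
    Returns a dict mapping num_steps to the decoded continuation.
    """
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    results = {}
    for num_steps in depths:
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                             num_steps=num_steps,  # recurrence depth for this run
                             tokenizer=tok, pad_token_id=tok.eos_token_id)
        # Keep only the newly generated tokens, not the prompt.
        results[num_steps] = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    return results

# Example (requires the model to be loaded first):
# for depth, text in sweep_num_steps(model, tok, "The history of mathematics is").items():
#     print(depth, repr(text))
```

Depths above 16 go beyond what the healing phase trained on, so any change in quality there reflects extrapolation rather than trained behaviour.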
How it was made
- Surgery. Gemma-2-2b's layers are split into prelude (0–3), recurrent core (16–21), and coda (22–25); the 12 middle layers (4–15) are dropped. Attention, MLP, and all four norms are copied verbatim (fused QKV & gate-up), and the embeddings are untied. The full-cover conversion reproduces the source model's logits to within fp32 rounding error (logits MSE ~7e-11), validating the Gemma-2 surgery.
- Healing. 65M tokens of FineWeb-Edu, seq len 1024, AdamW lr 5e-5, bf16, with a 1-sqrt mean-recurrence curriculum up to 16 and truncated BPTT (last 8 passes). Eval loss (@rec 16): 18 → ~2.9.
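Truncated BPTT over the last 8 passes means that, for a sampled recurrence count, only the final passes through the core carry gradients and the earlier ones run without them. The bookkeeping can be sketched as follows; `split_passes` is an illustrative name, and the training script itself is not shown in this card.

```python
def split_passes(num_steps, grad_window=8):
    """Split a recurrence count into (no_grad_passes, grad_passes).

    Only the last grad_window passes are backpropagated through;
    shorter recurrences are backpropagated in full.
    """
    grad_passes = min(num_steps, grad_window)
    return num_steps - grad_passes, grad_passes

if __name__ == "__main__":
    for r in (1, 5, 8, 12, 16):
        no_grad, grad = split_passes(r)
        print(f"recurrence {r:>2}: {no_grad} passes without grad, {grad} with grad")
```

This caps the activation memory of the backward pass at 8 core passes, regardless of how deep the curriculum has ramped the recurrence.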
Limitations
- Demonstration-scale healing (~65M tokens vs the paper's ~50B) plus an aggressive split (12 of 26 layers dropped): output is fluent but can be repetitive, and the model is not instruction-tuned.
- Inherits Gemma-2's knowledge, biases and the Gemma Terms of Use.
- Gemma-2 sliding-window attention is treated as full causal (identical for sequences ≤ 4096).
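The last point holds because a sliding window only removes keys that are further back than the window size. A small sketch makes this concrete with a toy window of 8 in place of 4096. It assumes the convention that a query attends to itself and the previous `window - 1` positions, and the function names are illustrative.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when query i may attend to key j (full causal)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def sliding_window_mask(seq_len, window):
    """Causal mask that additionally hides keys more than window - 1 positions back."""
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

if __name__ == "__main__":
    window = 8  # toy stand-in for Gemma-2's 4096
    for seq_len in (4, 8, 9, 12):
        same = causal_mask(seq_len) == sliding_window_mask(seq_len, window)
        print(f"seq_len={seq_len:>2}: identical to full causal = {same}")
```

Beyond the window length the two masks diverge, so outputs on longer sequences can differ from those of the source model.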
Citation
@article{mcleish2025teaching,
title={Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence},
author={McLeish, Sean and Li, Ang and Kirchenbauer, John and Kalra, Dayal Singh and Bartoldson, Brian R. and Kailkhura, Bhavya and Schwarzschild, Avi and Geiping, Jonas and Goldstein, Tom and Goldblum, Micah},
journal={arXiv preprint arXiv:2511.07384}, year={2025}
}
Converted with huginn_surgery. Gemma-2 support is an original extension of the retrofit recipe.