Recurrent-Llama-3.2-1B
A depth-recurrent (Huginn / Raven) language model retrofitted from
meta-llama/Llama-3.2-1B by model surgery
followed by a recurrence-curriculum healing phase.
Instead of a fixed stack of layers, the model has a small recurrent core block that is looped
a controllable number of times at inference. This lets you spend more compute on harder inputs
without adding parameters; the "think deeper" knob is num_steps (a.k.a. recurrence depth).
Method: Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence (McLeish et al., 2025), built on the Huginn/Raven architecture of Scaling up Test-Time Compute with Latent Reasoning.
Architecture
input → embed → prelude (4 layers) → [ adapter + recurrent core (6 layers) ] × R → coda (4 layers) → norm → lm_head
                                     └────────── looped R times ───────────┘
| Base model | Llama-3.2-1B (16 layers) |
| Split (prelude / recurrent core / coda) | 4 / 6 / 4 (source layers 4–5 dropped) |
| Recurrence at inference | any num_steps; trained up to 16 |
| Block / norm / RoPE | Llama pre-norm, RMSNorm, native Llama-3 RoPE (θ=500000) |
| Params | ~1.39B |
| model_type | huginn_raven |
The adapter re-injects the prelude output at every recurrent step, so the latent state cannot drift away from the input regardless of depth.
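As a rough illustration of the loop in the diagram, the toy below wires four prelude layers, an adapter, six core layers and four coda layers out of small random residual maps in plain Python. Everything in it is an assumption made for illustration: DIM, make_layer, make_adapter, run_stack and forward are hypothetical names, the adapter is modelled as a linear map over the concatenated state and prelude output, and none of it is the code shipped in the repo.

```python
import math
import random

DIM = 8  # toy hidden size, unrelated to the real model

def make_layer(rng, calls, name):
    """Toy residual layer x -> x + tanh(W x); counts how often it runs."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
    def layer(x):
        calls[name] = calls.get(name, 0) + 1
        return [xi + math.tanh(sum(a * b for a, b in zip(row, x)))
                for xi, row in zip(x, w)]
    return layer

def make_adapter(rng):
    """Assumed adapter form: a linear map from concat(state, e) down to DIM."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(2 * DIM)] for _ in range(DIM)]
    def adapter(state, e):
        z = state + e  # list concatenation, length 2 * DIM
        return [sum(a * b for a, b in zip(row, z)) for row in w]
    return adapter

def run_stack(layers, x):
    """Apply a list of layers in order."""
    for layer in layers:
        x = layer(x)
    return x

def forward(x, model, num_steps):
    e = run_stack(model["prelude"], x)  # prelude runs once per input
    s = list(model["init_state"])
    for _ in range(num_steps):
        # the prelude output e is fed back in at every recurrent step
        s = run_stack(model["core"], model["adapter"](s, e))
    return run_stack(model["coda"], s)

rng = random.Random(0)
calls = {}
model = {
    "prelude": [make_layer(rng, calls, "prelude") for _ in range(4)],
    "adapter": make_adapter(rng),
    "core": [make_layer(rng, calls, "core") for _ in range(6)],
    "coda": [make_layer(rng, calls, "coda") for _ in range(4)],
    "init_state": [rng.gauss(0.0, 1.0) for _ in range(DIM)],
}
x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

for steps in (1, 4, 16):
    calls.clear()
    out = forward(x, model, num_steps=steps)
    # prelude and coda always cost 4 layer calls; the core costs 6 * steps
    print(steps, calls["prelude"], calls["core"], calls["coda"])
```

The loop prints `1 4 6 4`, `4 4 24 4` and `16 4 96 4`: the same weights are reused at every depth, and only the number of core-layer evaluations grows with num_steps.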
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "irafm-llm/Recurrent-Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda().eval()
ids = tok("The history of mathematics is", return_tensors="pt").input_ids.cuda()
out = model.generate(ids, max_new_tokens=40, do_sample=False,
                     num_steps=32,  # <-- recurrence depth: raise for more test-time compute
                     tokenizer=tok, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
trust_remote_code=True is required: the repo bundles its own raven_modeling_minimal.py. The
model exposes the full Huginn-0125 step API (embed_inputs, initialize_state, iterate_one_step,
predict_from_latents, forward_with_adaptive_compute) and is a drop-in for code written against
Huginn-0125, including per-sentence selective-recurrence control.
How it was made
- Surgery. Llama-3.2-1B's layers are split into prelude (0–3), recurrent core (6–11) and coda
(12–15); layers 4–5 are dropped. Attention/MLP/norm weights are copied verbatim (fused QKV and
gate-up), embeddings untied. The conversion reproduces the source model's logits exactly
(full-cover check: logits MSE ~1e-11) and is bit-identical to the official
smcleish/Recurrent-Llama-3.2-untrained on all non-adapter tensors.
- Healing. ~98M tokens of FineWeb-Edu, sequence length 1024, AdamW lr 5e-5, grad-clip 1.0, bf16.
The mean recurrence is ramped with a 1-sqrt curriculum up to 16; depth is sampled per-step
(log-normal-Poisson) and gradients are truncated to the last 8 recurrent passes (truncated BPTT).
Eval loss (@rec 16): 14.2 → 2.8.
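The index bookkeeping behind the two steps above can be sketched in plain Python. This is a minimal sketch, not the huginn_surgery or training code: split_layers, sample_poisson, sample_depth and bptt_split are hypothetical helper names. The rule that the core is the block of layers directly before the coda is an assumption that happens to reproduce the indices listed above. The log-normal-Poisson parameterization (sigma = 0.5, plus one so that depth is at least 1) is likewise an assumption borrowed from the Huginn description, since the card does not state it. The curriculum shape is left out; the current mean recurrence is simply passed in.

```python
import math
import random

def split_layers(n_layers=16, n_prelude=4, n_core=6, n_coda=4):
    """Pick source-layer indices; assumes the core sits directly before the coda."""
    prelude = list(range(n_prelude))
    coda = list(range(n_layers - n_coda, n_layers))
    core = list(range(n_layers - n_coda - n_core, n_layers - n_coda))
    dropped = [i for i in range(n_layers) if i not in prelude + core + coda]
    return prelude, core, coda, dropped

def sample_poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small rates used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_depth(mean_recurrence, rng, sigma=0.5):
    """Assumed form: tau ~ N(log(mean) - sigma^2 / 2, sigma), depth = Poisson(e^tau) + 1."""
    tau = rng.gauss(math.log(mean_recurrence) - 0.5 * sigma ** 2, sigma)
    return sample_poisson(math.exp(tau), rng) + 1

def bptt_split(depth, backprop_steps=8):
    """Return (passes run without gradients, passes kept for backprop)."""
    with_grad = min(depth, backprop_steps)
    return depth - with_grad, with_grad

print(split_layers())
# → ([0, 1, 2, 3], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15], [4, 5])

rng = random.Random(0)
for _ in range(5):
    depth = sample_depth(16, rng)  # 16 is the final mean recurrence from the card
    print(depth, bptt_split(depth))
```

With truncation to the last 8 passes, a sampled depth of 20 would run 12 passes without gradients and backpropagate through the remaining 8, while a depth of 3 backpropagates through all 3.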
Limitations
- Demonstration-scale healing. ~98M tokens vs the paper's ~50B; output is fluent but can be repetitive under greedy decoding. Not instruction-tuned.
- Inherits Llama-3.2's knowledge cutoff, biases and the Llama 3.2 Community License.
- Recurrence was trained up to depth 16; higher num_steps works but is extrapolation.
Citation
@article{mcleish2025teaching,
title={Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence},
author={McLeish, Sean and Li, Ang and Kirchenbauer, John and Kalra, Dayal Singh and Bartoldson, Brian R. and Kailkhura, Bhavya and Schwarzschild, Avi and Geiping, Jonas and Goldstein, Tom and Goldblum, Micah},
journal={arXiv preprint arXiv:2511.07384}, year={2025}
}
Built with Llama. Converted with huginn_surgery.