Instructions for using irafm-llm/Recurrent-Llama-3.2-1B with libraries, inference servers, and local apps.

Libraries

How to use irafm-llm/Recurrent-Llama-3.2-1B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="irafm-llm/Recurrent-Llama-3.2-1B", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("irafm-llm/Recurrent-Llama-3.2-1B", trust_remote_code=True, dtype="auto")

vLLM

How to use irafm-llm/Recurrent-Llama-3.2-1B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "irafm-llm/Recurrent-Llama-3.2-1B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "irafm-llm/Recurrent-Llama-3.2-1B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
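
The same request can be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_completion_request` and `complete` are illustrative, and the code assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style completions response.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="irafm-llm/Recurrent-Llama-3.2-1B",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumes an OpenAI-compatible response with a `choices` list.
    """
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,")` would return the generated continuation. For the SGLang server on port 30000, pass `base_url="http://localhost:30000"`.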

Use Docker

docker model run hf.co/irafm-llm/Recurrent-Llama-3.2-1B

SGLang

How to use irafm-llm/Recurrent-Llama-3.2-1B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "irafm-llm/Recurrent-Llama-3.2-1B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "irafm-llm/Recurrent-Llama-3.2-1B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "irafm-llm/Recurrent-Llama-3.2-1B" \
        --host 0.0.0.0 \
        --port 30000

Recurrent-Llama-3.2-1B

A depth-recurrent (Huginn / Raven) language model retrofitted from meta-llama/Llama-3.2-1B by model surgery followed by a recurrence-curriculum healing phase.

Instead of a fixed stack of layers, the model has a small recurrent core block that is looped a controllable number of times at inference. This lets you spend more compute on harder inputs without adding parameters — the "think deeper" knob is num_steps (a.k.a. recurrence depth).

Method: Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence (McLeish et al., 2025), built on the Huginn/Raven architecture of Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (arXiv:2502.05171).

Architecture

input → embed → prelude (4 layers) → [ adapter + recurrent core (6 layers) ] × R → coda (4 layers) → norm → lm_head
                                       └──────────── looped R times ───────────┘


| Property | Value |
|---|---|
| Base model | Llama-3.2-1B (16 layers) |
| Split (prelude / recurrent core / coda) | 4 / 6 / 4 (source layers 4–5 dropped) |
| Recurrence at inference | any `num_steps`; trained up to 16 |
| Block / norm / RoPE | Llama pre-norm, RMSNorm, native Llama-3 RoPE (θ=500000) |
| Params | ~1.39B |
| `model_type` | `huginn_raven` |
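
To make the split concrete, the sketch below computes how many transformer layers one forward pass executes for a given recurrence depth, and which source layers land in each block. It uses only the 4 / 6 / 4 split and the dropped layers 4–5 stated above. The function names are illustrative, and the small cost of the adapter is ignored.

```python
PRELUDE, CORE, CODA = 4, 6, 4  # layer counts of the three blocks


def effective_depth(num_steps):
    """Layers executed in one forward pass: prelude + core * R + coda."""
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    return PRELUDE + CORE * num_steps + CODA


def source_layer_split(n_layers=16, dropped=(4, 5)):
    """Assign the kept source layer indices to prelude, core, and coda in order."""
    kept = [i for i in range(n_layers) if i not in dropped]
    return {
        "prelude": kept[:PRELUDE],
        "core": kept[PRELUDE:PRELUDE + CORE],
        "coda": kept[PRELUDE + CORE:],
    }


for r in (1, 4, 16, 32):
    print(f"num_steps={r:>2} -> {effective_depth(r)} layers executed")
print(source_layer_split())
```

The parameter count stays fixed while the executed depth grows linearly in `num_steps`, which is the sense in which recurrence trades compute for depth.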

The adapter re-injects the prelude output at every recurrent step, so the latent state stays conditioned on the input at any depth.
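
The effect of re-injection can be illustrated with a toy scalar recurrence. This is only a sketch of the idea: the "adapter" and "core" here are hand-picked scalar functions, not the model's learned layers, and `run_recurrence` is a hypothetical name.

```python
import math


def run_recurrence(e, num_steps, reinject=True, s0=0.0):
    """Iterate a toy scalar core for num_steps.

    e        : stand-in for the prelude output (the encoded input)
    reinject : if True, e is mixed into the state before every core pass;
               if False, e is only seen on the first pass
    """
    s = s0
    for step in range(num_steps):
        inject = reinject or step == 0
        mixed = 0.5 * s + (0.5 * e if inject else 0.0)  # toy "adapter"
        s = math.tanh(mixed)                            # toy "core"
    return s


for e in (1.0, -1.0):
    with_inj = run_recurrence(e, 32, reinject=True)
    without = run_recurrence(e, 32, reinject=False)
    print(f"e={e:+.1f}  re-injected: {with_inj:+.4f}  injected once: {without:+.2e}")
```

In this toy, the state that sees the input only once decays toward the same value for every input, while the re-injected state settles at a value that still depends on the input.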

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "irafm-llm/Recurrent-Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True loads the modeling code bundled with the repo.
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

# Tokenize the prompt and move the token ids to the GPU.
ids = tok("The history of mathematics is", return_tensors="pt").input_ids.cuda()
# Greedy decoding. The model was trained up to depth 16, so num_steps=32 extrapolates.
out = model.generate(ids, max_new_tokens=40, do_sample=False,
                     num_steps=32,          # <-- recurrence depth: raise for more test-time compute
                     tokenizer=tok, pad_token_id=tok.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

trust_remote_code=True is required because the repo bundles its own raven_modeling_minimal.py. The model exposes the full Huginn-0125 step API (embed_inputs, initialize_state, iterate_one_step, predict_from_latents, forward_with_adaptive_compute) and is a drop-in replacement in code written against Huginn-0125, including per-sentence selective-recurrence control.
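
The idea behind adaptive compute can be sketched generically: keep iterating the recurrent step and stop once the state stops changing. The code below is an illustration under stated assumptions, not the repo's forward_with_adaptive_compute. Its exit criterion (Euclidean change below `tol`), the names `iterate_until_converged` and `toy_step`, and the toy update rule are all hypothetical, and this card does not document the real method's signature.

```python
import math


def iterate_until_converged(step_fn, state, max_steps=32, tol=1e-3):
    """Apply step_fn repeatedly; exit early once successive states differ by < tol.

    Returns the final state and the number of steps actually used.
    """
    used = 0
    for used in range(1, max_steps + 1):
        new_state = step_fn(state)
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_state, state)))
        state = new_state
        if delta < tol:
            break
    return state, used


def toy_step(state, target=(0.5, -0.25)):
    """Toy contraction that moves the state halfway toward a fixed target."""
    return [0.5 * s + 0.5 * t for s, t in zip(state, target)]


final, used = iterate_until_converged(toy_step, [0.0, 0.0])
print(f"stopped after {used} of 32 steps, state={final}")
```

An easy input, in this picture, is one whose state settles quickly and therefore uses fewer steps than the `max_steps` budget.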

How it was made

Surgery. Llama-3.2-1B's 16 layers are split into prelude (0–3), recurrent core (6–11), and coda (12–15); layers 4–5 are dropped. Attention, MLP, and norm weights are copied verbatim (fused QKV and gate-up), and the embeddings are untied. In the full-cover check, the conversion reproduces the source model's logits to within numerical precision (logits MSE ~1e-11), and it is bit-identical to the official smcleish/Recurrent-Llama-3.2-untrained on all non-adapter tensors.
Healing. ~98M tokens of FineWeb-Edu at sequence length 1024, trained with AdamW (lr 5e-5), gradient clipping at 1.0, and bf16. The mean recurrence is ramped up to 16 with a 1-sqrt curriculum; the depth is sampled at each training step (log-normal-Poisson), and gradients are truncated to the last 8 recurrent passes (truncated BPTT). Eval loss at recurrence 16 fell from 14.2 to 2.8.
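
The depth sampling and gradient truncation can be sketched as follows. This is a minimal sketch under assumptions, not the training code. It follows the log-normal Poisson form described in the Huginn paper (τ ~ N(log r̄ − σ²/2, σ), r = Poisson(e^τ) + 1) with σ = 0.5, which may differ from the settings used here. The 1-sqrt curriculum is not reproduced, so the mean is passed in directly, and all function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's algorithm (adequate for modest rates)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_depth(mean_rec, rng, sigma=0.5):
    """Log-normal Poisson draw of the recurrence depth (always >= 1).

    sigma=0.5 and the +1 offset are assumptions taken from the Huginn paper.
    """
    tau = rng.gauss(math.log(mean_rec) - 0.5 * sigma ** 2, sigma)
    return sample_poisson(math.exp(tau), rng) + 1


def split_for_truncated_bptt(depth, grad_window=8):
    """Return (no_grad_steps, grad_steps): only the last grad_window passes get gradients."""
    grad_steps = min(depth, grad_window)
    return depth - grad_steps, grad_steps


rng = random.Random(0)
for _ in range(5):
    depth = sample_depth(16, rng)
    print(depth, split_for_truncated_bptt(depth))
```

In a training loop, the first `no_grad_steps` core passes would run without gradient tracking and only the remaining passes would be backpropagated through, which bounds memory at deep recurrences.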

Limitations

Demonstration-scale healing. ~98M tokens vs the paper's ~50B; output is fluent but can be repetitive under greedy decoding. Not instruction-tuned.
Inherits Llama-3.2's knowledge cutoff, biases and the Llama 3.2 Community License.
Recurrence was trained up to depth 16; higher num_steps values run but extrapolate beyond the trained range.

Citation

@article{mcleish2025teaching,
  title={Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence},
  author={McLeish, Sean and Li, Ang and Kirchenbauer, John and Kalra, Dayal Singh and Bartoldson, Brian R. and Kailkhura, Bhavya and Schwarzschild, Avi and Geiping, Jonas and Goldstein, Tom and Goldblum, Micah},
  journal={arXiv preprint arXiv:2511.07384},
  year={2025}
}

Built with Llama. Converted with huginn_surgery.

Papers for irafm-llm/Recurrent-Llama-3.2-1B

Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence

arXiv:2511.07384 • Published Nov 10, 2025

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

arXiv:2502.05171 • Published Feb 7, 2025