HRM-Embed-0.6b

A compact text-embedding model built on the Hierarchical Reasoning Model (HRM), a depth-recurrent architecture from Sapient Intelligence. It applies a standard embedding recipe to that unusual backbone: ~0.6B parameters, fine-tuned end-to-end with a contrastive objective from the open Xiaoye08/HRM-Text-0.6B base checkpoint.

This is an embedding model, not a generator. It exposes 1280-dim sentence embeddings via a mean-pool of the recurrence state (see Usage). A plain from_pretrained gives a causal LM; you must apply the embedding recipe below.

Requirements

transformers (loads custom architecture via trust_remote_code=True)
torch (bfloat16; runs on CPU or GPU)
The model does not load via sentence-transformers.

Model details


Architecture	Hierarchical Reasoning Model (depth-recurrent), `HrmTextForCausalLM`
Parameters	~610.8M (dense; the untrained LM head is not shipped)
Embedding dim	1280 (L2-normalized)
Hidden size	1280
Layers	12 per stack × 2 stacks (H + L) = 24 blocks
Attention heads	10 (head_dim 128)
Recurrence (H, L cycles)	2, 3 (8 stack-passes per forward)
Context length	4096
Vocab	65,536 (GPT-2-style BPE)
Attention	Prefix-LM; bidirectional when `token_type_ids = attention_mask`
Dtype	bfloat16
License	Apache-2.0
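
The shape figures in the table are mutually consistent, which a few lines of arithmetic can confirm. The sketch below assumes the usual HRM schedule, in which each H cycle runs the L stack once per L cycle and then the H stack once. This card does not state that schedule; it is an assumption that happens to reproduce the 8 stack-passes quoted above. The effective-depth figure is derived from it and is not a number reported by the card.

```python
# Sanity-check the shape numbers from the model-details table.
heads, head_dim = 10, 128
layers_per_stack, stacks = 12, 2
h_cycles, l_cycles = 2, 3

hidden_size = heads * head_dim        # heads x head_dim should equal the hidden size
blocks = layers_per_stack * stacks    # H stack + L stack

# Assumed schedule: per H cycle, l_cycles passes of the L stack plus 1 pass of the H stack.
stack_passes = h_cycles * (l_cycles + 1)

# Derived under the same assumption: blocks traversed in one forward pass.
effective_depth = stack_passes * layers_per_stack

print(hidden_size, blocks, stack_passes, effective_depth)
```

Under that assumption, the 24 stored blocks are reused across passes, so the compute depth per forward is several times the parameter depth.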

Usage

Embeddings are the L2-normalized mean-pool of the final recurrence hidden state (z_h). Bidirectional encoding is obtained by passing token_type_ids = attention_mask (marks the whole input as one bidirectional prefix). Replace the LM head with Identity so nothing downstream of the recurrence state is used.

import torch, torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "viventhraa96/HRM-Embed-0.6b"
dev = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    name, trust_remote_code=True, torch_dtype=torch.bfloat16).to(dev).eval()
model.lm_head = torch.nn.Identity()          # embeddings come from z_h, not the LM head

@torch.no_grad()
def embed(texts, max_length=512):
    tok.padding_side = "right"
    e = tok(texts, truncation=True, max_length=max_length, padding=True, return_tensors="pt").to(dev)
    pos = torch.arange(e.input_ids.shape[1], device=dev).unsqueeze(0).expand(e.input_ids.shape[0], -1)
    z, _ = model.model(e.input_ids, position_ids=pos, use_cache=False, token_type_ids=e.attention_mask)
    m = e.attention_mask.unsqueeze(-1).to(z.dtype)               # mean-pool over real tokens
    return F.normalize(((z * m).sum(1) / m.sum(1).clamp_min(1)).float(), p=2, dim=-1)  # [N, 1280]

emb = embed(["How do I sort a list in Python?",
             "The mitochondria is the powerhouse of the cell."])
print(emb.shape)                              # torch.Size([2, 1280])
print(float(emb[0] @ emb[1]))                 # cosine similarity
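
Because the embeddings are L2-normalized, ranking documents for a query reduces to sorting dot products. The following is a minimal sketch of that step on toy unit vectors; `rank_by_cosine` is a hypothetical helper written for illustration, not part of the model repository, and it works on plain Python lists so it runs without downloading the model.

```python
def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending dot product.

    Assumes every vector is already L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy unit vectors standing in for real 1280-dim embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
print(rank_by_cosine(query, docs, top_k=2))   # [(2, 1.0), (1, 0.6)]
```

With real embeddings, the rows of `embed([...]).tolist()` can be passed in directly; for large corpora a single matrix product on the tensors would be the faster choice.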

Bidirectional encoding (the prefix mask)

HRM-Text is a Prefix-LM. Passing token_type_ids = attention_mask marks every real token as part of one bidirectional prefix, so tokens attend both ways (padding is excluded and dropped by the masked mean). This matches how the model was trained as an embedder.
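
To make the effect of the prefix flag concrete, here is a sketch of a generic Prefix-LM visibility rule for a single sequence. It is an assumption about how such masks are commonly built, not a copy of the HRM-Text remote code, and `prefix_lm_mask` is a hypothetical name.

```python
def prefix_lm_mask(token_type_ids, attention_mask):
    """Build allowed[i][j]: may query token i attend to key token j?

    Generic Prefix-LM rule (assumed, not taken from the HRM-Text code):
      - padding keys are never visible;
      - keys inside the prefix (token_type_id == 1) are visible to every query;
      - all other keys are visible causally (j <= i).
    """
    n = len(attention_mask)
    return [
        [
            bool(attention_mask[j]) and (bool(token_type_ids[j]) or j <= i)
            for j in range(n)
        ]
        for i in range(n)
    ]


attention_mask = [1, 1, 1, 0]          # three real tokens, one pad

# Embedding mode: the whole input is the prefix, so real tokens see each other both ways.
bidirectional = prefix_lm_mask(attention_mask, attention_mask)

# Plain causal mode for comparison: no prefix at all.
causal = prefix_lm_mask([0, 0, 0, 0], attention_mask)

print(bidirectional[0])                # first token sees all real tokens
print(causal[0])                       # first token sees only itself
```

In the bidirectional case the pad column is False for every row, which is why the masked mean in Usage only has to zero out the pad rows.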

Method (standard recipe)

Nothing here is a new technique; it combines standard ones on an unusual backbone.

Mean-pool the final hidden state, then L2-normalize: the Sentence-BERT convention, also used by E5 / GTE / BGE.
Bidirectional attention instead of causal: the conversion popularized by LLM2Vec for turning decoder LMs into encoders. Here it needs no mask monkey-patching, since HRM-Text is natively a Prefix-LM, so token_type_ids = attention_mask enables it the intended way.
Contrastive (InfoNCE) fine-tuning to produce the weights: the standard training objective for modern text embedders.

Because the model runs bidirectionally, mean-pooling (rather than last-token pooling, common for causal decoders) is the natural, coherent choice. The only unusual part is the backbone: applying this recipe to a depth-recurrent HRM and pooling the recurrence state z_h.
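
For readers unfamiliar with the contrastive objective named above, the following is a minimal sketch of in-batch InfoNCE in plain Python. The card does not state the temperature, batch construction, or negative-mining strategy used for this model, so the `temperature` value and the in-batch-negatives setup here are illustrative assumptions only.

```python
import math


def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """Mean in-batch InfoNCE loss; doc_vecs[i] is the positive for query_vecs[i].

    Every other document in the batch acts as a negative. Vectors are assumed
    L2-normalized. The default temperature is illustrative, not from the card.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        m = max(logits)                                    # stabilize log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])                   # -log softmax at the positive
    return sum(losses) / len(losses)


aligned = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(aligned, aligned))         # close to 0: each query matches its own positive
print(info_nce(aligned, aligned[::-1]))   # about 20 (= 1 / temperature): positives are swapped
```

The loss is low when each query is closest to its own positive and high when a negative scores better, which is the pressure that shapes the pooled embedding space.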

Results: BRIGHT (reasoning retrieval)

Mean nDCG@10 (× 100) over BRIGHT's 12 domains, for three query modes: raw (original query), rewrite (an LLM rewrites the query first, as most top BRIGHT systems do), and merged (raw + rewrite).

Query mode	Mean nDCG@10
raw (bare embedder)	18.1
+ query rewriting	34.3
merged (raw + rewrite)	33.7
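
As a reminder of what the metric measures, here is a small sketch of nDCG@k with linear gain. It is not the evaluation script behind the numbers above; some implementations use an exponential gain (2^rel - 1) instead, which coincides with this one for binary relevance. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k for one query.

    ranked_relevances: gold relevance of each retrieved document, in ranked order.
    all_relevances:    gold relevance of every relevant document for the query,
                       used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


gold = [1, 1]                                 # two relevant documents exist
print(ndcg_at_k([1, 1, 0, 0], gold))          # perfect ranking
print(ndcg_at_k([0, 1, 0, 1], gold))          # relevant docs at ranks 2 and 4
```

The table values correspond to this quantity averaged over queries and domains, multiplied by 100.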

Per-domain (nDCG@10 × 100):

Domain	raw	rewrite
theoremqa_theorems	29.4	50.4
pony	1.1	46.5
biology	20.4	45.6
theoremqa_questions	29.8	44.1
psychology	21.3	39.3
economics	17.5	35.7
sustainable_living	13.6	30.9
earth_science	20.2	30.9
aops	16.3	27.5
stackoverflow	12.5	27.3
robotics	12.3	20.2
leetcode	22.6	12.9
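
The summary means can be reproduced from the per-domain table. The short check below copies the table values, averages them, and sorts domains by how much rewriting helps; it only recomputes reported numbers and adds no new results.

```python
# Per-domain nDCG@10 (x100) copied from the table above: domain -> (raw, rewrite).
scores = {
    "theoremqa_theorems": (29.4, 50.4),
    "pony": (1.1, 46.5),
    "biology": (20.4, 45.6),
    "theoremqa_questions": (29.8, 44.1),
    "psychology": (21.3, 39.3),
    "economics": (17.5, 35.7),
    "sustainable_living": (13.6, 30.9),
    "earth_science": (20.2, 30.9),
    "aops": (16.3, 27.5),
    "stackoverflow": (12.5, 27.3),
    "robotics": (12.3, 20.2),
    "leetcode": (22.6, 12.9),
}

raw_mean = sum(raw for raw, _ in scores.values()) / len(scores)
rewrite_mean = sum(rw for _, rw in scores.values()) / len(scores)
print(round(raw_mean, 1), round(rewrite_mean, 1))   # 18.1 34.3

# Gain from rewriting per domain, largest first.
gains = sorted(((rw - raw, name) for name, (raw, rw) in scores.items()), reverse=True)
print(gains[0][1], gains[-1][1])                    # pony leetcode
```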

Where it's strong: theorem/definition/reference lookup (theoremqa) and vocabulary-aligned scientific QA (biology, psychology). Where it's weak: reasoning-transfer retrieval, where the match is by shared technique rather than shared words (e.g. aops), and community/procedural QA (robotics, stackoverflow). pony is the extreme case: near-chance without rewriting (raw 1.1) yet among the strongest with it (46.5), making it the most rewrite-dependent domain in the set.

Note on code retrieval: LeetCode is the one domain where query rewriting hurts (22.6 to 12.9): expanding a terse problem statement into prose moves the query off the corpus distribution. Use the merged variant for code.

Limitations

English only. Knowledge coverage is capped by the base checkpoint's pretraining data; this is not a broad-knowledge embedder.
Embedder, not a generator: the checkpoint ships without an LM head, so a plain load prints a warning that lm_head.weight is newly initialized (expected), and .generate() returns noise. Apply the Identity swap shown in Usage and pool z_h.
Best results use a query-rewriting front-end (an external LLM). The bare-embedder (raw) ceiling is lower; raw and rewrite numbers are both reported above so the embedder's own contribution is visible. The rewritten queries here come from INF-X-Retriever.
Modest absolute scores on BRIGHT: this is a small model on a deliberately adversarial benchmark.

Architecture & credits

The Hierarchical Reasoning Model (HRM) architecture is by Sapient Intelligence (github.com/sapientinc/HRM, github.com/sapientinc/HRM-Text, arXiv:2506.21734). All architectural credit is theirs. This model is a text-embedding fine-tune of the open Xiaoye08/HRM-Text-0.6B pretrained checkpoint (Apache-2.0); the HRM-Text pretraining pipeline is described in arXiv:2605.20613.

License

Apache-2.0. This is a derivative of Apache-2.0 licensed weights; attribution to Sapient Intelligence (HRM) and the HRM-Text project is preserved above.

Citation

@misc{hrm2025,
  title  = {Hierarchical Reasoning Model},
  author = {Wang, Guan and others},
  year   = {2025},
  eprint = {2506.21734}, archivePrefix = {arXiv}, primaryClass = {cs.AI},
  url    = {https://arxiv.org/abs/2506.21734}
}
@misc{hrmtext2026,
  title  = {HRM-Text: Efficient Pretraining Beyond Scaling},
  author = {Wang, Guan and Liu, Changling and Wang, Chenyu and Zhou, Cai and Sun, Yuhao and Wu, Yifei and Zhen, Shuai and Scimeca, Luca and Abbasi Yadkori, Yasin},
  year   = {2026},
  eprint = {2605.20613}, archivePrefix = {arXiv}, primaryClass = {cs.CL},
  url    = {https://arxiv.org/abs/2605.20613}
}
