HRM-Embed-0.6b

A compact text-embedding model built on the Hierarchical Reasoning Model (HRM), a depth-recurrent architecture from Sapient Intelligence. It applies a standard embedding recipe to that unusual backbone: ~0.6B parameters, fine-tuned end-to-end with a contrastive objective from the open Xiaoye08/HRM-Text-0.6B base checkpoint.

This is an embedding model, not a generator. It exposes 1280-dim sentence embeddings via a mean-pool of the recurrence state (see Usage). A plain from_pretrained gives a causal LM; you must apply the embedding recipe below.

Requirements

  • transformers (loads custom architecture via trust_remote_code=True)
  • torch (bfloat16; runs on CPU or GPU)
  • The model does not load via sentence-transformers.

Model details

Architecture               Hierarchical Reasoning Model (depth-recurrent), HrmTextForCausalLM
Parameters                 ~610.8M (dense; the untrained LM head is not shipped)
Embedding dim              1280 (L2-normalized)
Hidden size                1280
Layers                     12 per stack × 2 stacks (H + L) = 24 blocks
Attention heads            10 (head_dim 128)
Recurrence (H, L cycles)   2, 3 (8 stack passes per forward)
Context length             4096
Vocab                      65,536 (GPT-2-style BPE)
Attention                  Prefix-LM; bidirectional when token_type_ids = attention_mask
Dtype                      bfloat16
License                    Apache-2.0
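
The figures in the model details above can be cross-checked with a few lines of arithmetic. The pass count below assumes that each H cycle runs the L stack once per L cycle and then the H stack once; that reading is an assumption chosen because it reproduces the stated 8, not something this card spells out.

```python
# Values copied from the model details above.
HIDDEN_SIZE = 1280
NUM_HEADS = 10
HEAD_DIM = 128
LAYERS_PER_STACK = 12
NUM_STACKS = 2              # H stack + L stack
H_CYCLES, L_CYCLES = 2, 3

# The attention heads tile the hidden size exactly.
assert NUM_HEADS * HEAD_DIM == HIDDEN_SIZE

# Distinct transformer blocks across both stacks.
total_blocks = LAYERS_PER_STACK * NUM_STACKS

# ASSUMPTION: one H cycle = L_CYCLES passes of the L stack + 1 pass of the H stack.
stack_passes = H_CYCLES * (L_CYCLES + 1)

# Block evaluations per forward pass under that assumption.
block_evals = stack_passes * LAYERS_PER_STACK

print(total_blocks, stack_passes, block_evals)   # 24 8 96
```

If the assumption holds, a forward pass costs roughly as much as a 96-block model would, even though only 24 blocks of weights are stored.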

Usage

Embeddings are the L2-normalized mean-pool of the final recurrence hidden state (z_h). Bidirectional encoding is obtained by passing token_type_ids = attention_mask (marks the whole input as one bidirectional prefix). Replace the LM head with Identity so nothing downstream of the recurrence state is used.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "viventhraa96/HRM-Embed-0.6b"
dev = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    name, trust_remote_code=True, torch_dtype=torch.bfloat16).to(dev).eval()
model.lm_head = torch.nn.Identity()          # embeddings come from z_h, not the LM head

@torch.no_grad()
def embed(texts, max_length=512):
    tok.padding_side = "right"                                   # pads go after the real tokens
    # Tokenize and pad the batch to a common length.
    e = tok(texts, truncation=True, max_length=max_length, padding=True, return_tensors="pt").to(dev)
    # Explicit position ids 0..T-1 for every row in the batch.
    pos = torch.arange(e.input_ids.shape[1], device=dev).unsqueeze(0).expand(e.input_ids.shape[0], -1)
    # token_type_ids = attention_mask marks all real tokens as one bidirectional prefix.
    z, _ = model.model(e.input_ids, position_ids=pos, use_cache=False, token_type_ids=e.attention_mask)
    m = e.attention_mask.unsqueeze(-1).to(z.dtype)               # mean-pool over real tokens
    return F.normalize(((z * m).sum(1) / m.sum(1).clamp_min(1)).float(), p=2, dim=-1)  # [N, 1280]

emb = embed(["How do I sort a list in Python?",
             "The mitochondria is the powerhouse of the cell."])
print(emb.shape)                              # torch.Size([2, 1280])
print(float(emb[0] @ emb[1]))                 # cosine similarity
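
Because the embeddings are L2-normalized, a dot product is already a cosine similarity, so retrieval reduces to a matrix-vector product and a sort. The sketch below illustrates that with toy 3-dim vectors standing in for real 1280-dim model outputs; `l2_normalize` and `rank` are hypothetical helper names, not part of this model's code.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length (same idea as F.normalize with p=2)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def rank(query_vec, doc_vecs, top_k=3):
    """Return (doc index, cosine score) pairs, best match first."""
    scores = doc_vecs @ query_vec            # dot product == cosine for unit vectors
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-ins for embedding outputs.
docs = l2_normalize(np.array([[1.0, 0.0, 0.0],
                              [0.6, 0.8, 0.0],
                              [0.0, 0.0, 1.0]]))
query = l2_normalize(np.array([[0.9, 0.1, 0.0]]))[0]

hits = rank(query, docs, top_k=2)
print(hits)   # document 0 ranks first, document 1 second
```

With real data, `docs` would be the stacked embeddings of the corpus and `query` the embedding of a single query string.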

Bidirectional encoding (the prefix mask)

HRM-Text is a Prefix-LM. Passing token_type_ids = attention_mask marks every real token as part of one bidirectional prefix, so tokens attend both ways (padding is excluded and dropped by the masked mean). This matches how the model was trained as an embedder.
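
To make the effect of the prefix mask concrete, here is a minimal sketch of how a generic Prefix-LM attention mask can be derived from `token_type_ids`. It follows the common convention (causal attention, plus full attention among prefix positions) and is an illustration only; it is not read from the HRM-Text implementation, and `prefix_lm_mask` is a hypothetical name.

```python
import numpy as np

def prefix_lm_mask(token_type_ids):
    """allowed[i, j] is True when query position i may attend to key position j."""
    t = np.asarray(token_type_ids, dtype=bool)
    n = t.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))   # j <= i
    in_prefix = t[:, None] & t[None, :]             # both positions lie inside the prefix
    return causal | in_prefix

# Three real tokens followed by one right-pad; token_type_ids = attention_mask.
attention_mask = [1, 1, 1, 0]
mask = prefix_lm_mask(attention_mask)
print(mask.astype(int))
```

In this toy case the three real tokens attend to each other in both directions and never attend to the pad column. The pad row still has causal access, but its output is discarded by the masked mean.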

Method (standard recipe)

Nothing here is a new technique; the recipe combines standard ones and applies them to an unusual backbone.

  • Mean-pool the final hidden state, then L2-normalize: the Sentence-BERT convention, also used by E5 / GTE / BGE.
  • Bidirectional attention instead of causal: the conversion popularized by LLM2Vec for turning decoder LMs into encoders. Here it needs no mask monkey-patching, since HRM-Text is natively a Prefix-LM, so token_type_ids = attention_mask enables it the intended way.
  • Contrastive (InfoNCE) fine-tuning to produce the weights: the standard training objective for modern text embedders.
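
For readers unfamiliar with the objective, the following is a minimal sketch of InfoNCE with in-batch negatives. The temperature of 0.05 and the use of in-batch negatives are assumptions for illustration; this card does not state the training hyperparameters or the negative-sampling scheme actually used.

```python
import numpy as np

def info_nce(queries, positives, temperature=0.05):
    """In-batch InfoNCE loss.

    Row i of `positives` is the match for row i of `queries`; every other row
    in the batch acts as a negative. Inputs are assumed L2-normalized, [N, D].
    """
    logits = (queries @ positives.T) / temperature           # [N, N] cosine / tau
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))               # cross-entropy on the diagonal

q = np.eye(3)                                   # three orthogonal unit "queries"
aligned = info_nce(q, np.eye(3))                # each query paired with itself
shuffled = info_nce(q, np.roll(np.eye(3), 1, axis=0))   # each query paired with a wrong row
print(aligned < shuffled)   # True
```

The loss is near zero when each query is closest to its own positive and grows when the pairing is wrong, which is the pressure that shapes the embedding space during fine-tuning.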

Because the model runs bidirectionally, every token's state already reflects the whole input, so mean-pooling is the natural choice; last-token pooling is the usual option for causal decoders, where only the final position has seen everything. The only unusual part is the backbone: applying this recipe to a depth-recurrent HRM and pooling the recurrence state z_h.
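
The two pooling strategies can be compared side by side on a toy hidden-state tensor. This is a sketch with NumPy arrays standing in for model outputs; `mean_pool` and `last_token_pool` are hypothetical helper names, and right padding is assumed.

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Average hidden states over real tokens only. hidden: [N, T, D], mask: [N, T]."""
    m = attention_mask[..., None].astype(hidden.dtype)        # [N, T, 1]
    return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1, None)

def last_token_pool(hidden, attention_mask):
    """Take the state of the last real token (assumes right padding)."""
    last = attention_mask.sum(axis=1) - 1                     # index of last real token
    return hidden[np.arange(hidden.shape[0]), last]

# One sequence, three positions, two features; the third position is padding.
hidden = np.array([[[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])

print(mean_pool(hidden, mask))        # [[2. 1.]]
print(last_token_pool(hidden, mask))  # [[3. 2.]]
```

Both functions ignore the padded position; they differ only in whether all real tokens or just the final one contribute to the sentence vector.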

Results: BRIGHT (reasoning retrieval)

Mean nDCG@10 over BRIGHT's 12 domains, for three query modes: raw (original query), rewrite (an LLM rewrites the query first, as most top BRIGHT systems do), and merged (raw + rewrite).

Query mode               Mean nDCG@10
raw (bare embedder)      18.1
+ query rewriting        34.3
merged (raw + rewrite)   33.7
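
For reference, nDCG@10 can be sketched as follows. This is the textbook definition with linear gain; BRIGHT's official evaluation tooling may differ in details such as tie handling, so treat it as an illustration of the metric rather than the exact scoring code behind the numbers above.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """ranked_relevances: labels in the order the system returned documents.
    all_relevances: labels of every judged document for the query."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Two relevant documents exist; the system returned them at ranks 1 and 3.
score = ndcg_at_k([1, 0, 1, 0, 0], all_relevances=[1, 1, 0, 0, 0])
print(round(100 * score, 1))   # 92.0
```

Scores in this card are reported on the same 0 to 100 scale, averaged over queries within a domain and then over domains.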

Per-domain (nDCG@10 × 100):

Domain                raw    rewrite
theoremqa_theorems    29.4   50.4
pony                   1.1   46.5
biology               20.4   45.6
theoremqa_questions   29.8   44.1
psychology            21.3   39.3
economics             17.5   35.7
sustainable_living    13.6   30.9
earth_science         20.2   30.9
aops                  16.3   27.5
stackoverflow         12.5   27.3
robotics              12.3   20.2
leetcode              22.6   12.9
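
The per-domain figures above can be cross-checked against the reported means, and the gain from rewriting computed for each domain. The snippet only re-uses numbers already listed in this card.

```python
# (raw, rewrite) nDCG@10 x 100, copied from the per-domain figures above.
scores = {
    "theoremqa_theorems": (29.4, 50.4),
    "pony": (1.1, 46.5),
    "biology": (20.4, 45.6),
    "theoremqa_questions": (29.8, 44.1),
    "psychology": (21.3, 39.3),
    "economics": (17.5, 35.7),
    "sustainable_living": (13.6, 30.9),
    "earth_science": (20.2, 30.9),
    "aops": (16.3, 27.5),
    "stackoverflow": (12.5, 27.3),
    "robotics": (12.3, 20.2),
    "leetcode": (22.6, 12.9),
}

raw_mean = sum(raw for raw, _ in scores.values()) / len(scores)
rewrite_mean = sum(rw for _, rw in scores.values()) / len(scores)
print(f"raw {raw_mean:.1f}, rewrite {rewrite_mean:.1f}")   # raw 18.1, rewrite 34.3

# Gain from query rewriting, largest first.
gains = sorted(((rw - raw, name) for name, (raw, rw) in scores.items()), reverse=True)
for gain, name in gains:
    print(f"{name:22s} {gain:+.1f}")
```

The means reproduce the reported 18.1 and 34.3. pony shows the largest gain (+45.4) and leetcode is the only domain with a negative one (-9.7).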

Where it's strong: theorem, definition, and reference lookup (theoremqa_theorems, theoremqa_questions) and vocabulary-aligned scientific QA (biology, psychology). Where it's weak: reasoning-transfer retrieval, where the match is by shared technique rather than shared words (e.g. aops), and community/procedural QA (robotics, stackoverflow). pony is the extreme case: near zero without rewriting (raw 1.1) yet second-highest with it (46.5), which makes it the most rewrite-dependent domain in the set.

Note on code retrieval: LeetCode is the one domain where query rewriting hurts (22.6 raw vs. 12.9 rewritten), likely because expanding a terse problem statement into prose moves the query off the corpus distribution. For code, use the merged variant.

Limitations

  • English only. Knowledge coverage is capped by the base checkpoint's pretraining data, so this is not a broad-knowledge embedder.
  • Embedder, not a generator: the checkpoint ships without an LM head, so a plain load prints an (expected) warning that lm_head.weight is newly initialized, and .generate() returns noise. Apply the Identity swap shown in Usage and pool z_h.
  • Best results need a query-rewriting front-end (an external LLM). The bare-embedder (raw) ceiling is lower; both raw and rewrite numbers are reported above so the embedder's own contribution is visible. The rewritten queries used here come from INF-X-Retriever.
  • Modest absolute scores on BRIGHT: this is a small model on a deliberately adversarial benchmark.

Architecture & credits

The Hierarchical Reasoning Model (HRM) architecture is by Sapient Intelligence (github.com/sapientinc/HRM, github.com/sapientinc/HRM-Text, arXiv:2506.21734). All architectural credit is theirs. This model is a text-embedding fine-tune of the open Xiaoye08/HRM-Text-0.6B pretrained checkpoint (Apache-2.0); the HRM-Text pretraining pipeline is described in arXiv:2605.20613.

License

Apache-2.0. This is a derivative of Apache-2.0 licensed weights; attribution to Sapient Intelligence (HRM) and the HRM-Text project is preserved above.

Citation

@misc{hrm2025,
  title  = {Hierarchical Reasoning Model},
  author = {Wang, Guan and others},
  year   = {2025},
  eprint = {2506.21734}, archivePrefix = {arXiv}, primaryClass = {cs.AI},
  url    = {https://arxiv.org/abs/2506.21734}
}
@misc{hrmtext2026,
  title  = {HRM-Text: Efficient Pretraining Beyond Scaling},
  author = {Wang, Guan and Liu, Changling and Wang, Chenyu and Zhou, Cai and Sun, Yuhao and Wu, Yifei and Zhen, Shuai and Scimeca, Luca and Abbasi Yadkori, Yasin},
  year   = {2026},
  eprint = {2605.20613}, archivePrefix = {arXiv}, primaryClass = {cs.CL},
  url    = {https://arxiv.org/abs/2605.20613}
}