HRM-MoE-0.6B

HRM-MoE-0.6B is an active-matched sparse Mixture-of-Experts checkpoint built on the HRM-Text L backbone. It keeps per-token active compute essentially equal to the dense HRM-Text-0.6B baseline (696.6M vs. 694.7M active parameters) while expanding the total parameter pool to 3.01B with 64 routed SwiGLU experts.

This release is the epoch-4 pretrained L-MoE 64x8 checkpoint exported from the native FSDP2 training checkpoint into a single bf16 model.safetensors file. It is a pre-alignment base model, not a chat or instruction-following assistant.

Model Structure Comparison

Dense HRM-Text-0.6B

input tokens
  -> HRM H/L recurrent blocks
  -> dense SwiGLU FFN, intermediate width 3584
  -> all dense FFN parameters active for every token

HRM-MoE-0.6B

input tokens
  -> HRM H/L recurrent blocks
  -> router over 64 SwiGLU experts
  -> top-8 experts active per token, each expert width 448
  -> active FFN width = 8 x 448 = 3584

Model	Total params	Active params / token	Active ratio	FFN path
HRM-Text-0.6B dense	694.7M	694.7M	100%	dense SwiGLU, width 3584
HRM-MoE-0.6B 64x8	3.01B	696.6M	23.2%	64 routed SwiGLU experts, top-8, width 448 each

The active FFN width is exactly matched to the dense L baseline. The sparse model therefore tests whether a larger expert pool improves quality without increasing the active per-token parameter budget.
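The routing step in the comparison above can be sketched in plain Python. This is an illustration only: `route_token` is a hypothetical helper, the toy logits are made up, and the softmax over the selected logits is an assumption, since this card does not state how gate weights are normalized.

```python
import math

NUM_EXPERTS = 64    # from this model card
TOP_K = 8           # from this model card
EXPERT_WIDTH = 448  # from this model card


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token as (expert_index, gate_weight) pairs.

    Assumption: gate weights are a softmax over the selected logits only.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    # Indices of the top_k largest logits, highest first.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Toy logits for one token: expert i gets logit i / 10.
logits = [i / 10 for i in range(NUM_EXPERTS)]
selection = route_token(logits)
active_width = len(selection) * EXPERT_WIDTH

print([i for i, _ in selection])  # [63, 62, 61, 60, 59, 58, 57, 56]
print(active_width)               # 3584
```

Only the eight selected experts run for the token, so the FFN width it touches is 8 x 448 = 3584 regardless of how many experts sit in the pool.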

Results vs HRM-Text-0.6B

The table below compares the epoch-4 native checkpoint evaluations for the dense L baseline and the active-matched L-MoE 64x8 checkpoint. Values are percentages.

Benchmark	Metric	HRM-Text-0.6B dense e4	HRM-MoE-0.6B e4	Delta
GSM8k	acc	79.61	83.62	+4.01
MATH	acc	50.96	55.50	+4.54
DROP	em	74.21	77.86	+3.65
DROP	f1	77.94	81.47	+3.53
MMLU	acc	53.74	58.60	+4.86
ARC	acc	76.37	82.27	+5.90
HellaSwag	acc	51.48	65.58	+14.10
Winogrande	acc	66.93	70.32	+3.39
BoolQ	acc	84.56	86.02	+1.46
MMLU-Pro	acc	28.24	30.70	+2.46
AIME25	maj_pass@1	16.67	16.67	+0.00
AIME25	maj_pass@10	26.67	26.67	+0.00
AIME25	maj_pass@100	50.00	50.00	+0.00

Across the 10 standard and MMLU-Pro metrics above (every row except AIME25), HRM-MoE-0.6B improves over the active-matched dense baseline by +4.79 percentage points on average, with the largest gain on HellaSwag (+14.10). The AIME25 majority-voting scores are identical in this run.
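The average can be recomputed directly from the table. The short script below copies the ten non-AIME25 rows and is only a sanity check on the reported deltas, not part of any evaluation harness.

```python
# (dense e4, MoE e4) scores copied from the results table, in percent.
scores = {
    "GSM8k acc": (79.61, 83.62),
    "MATH acc": (50.96, 55.50),
    "DROP em": (74.21, 77.86),
    "DROP f1": (77.94, 81.47),
    "MMLU acc": (53.74, 58.60),
    "ARC acc": (76.37, 82.27),
    "HellaSwag acc": (51.48, 65.58),
    "Winogrande acc": (66.93, 70.32),
    "BoolQ acc": (84.56, 86.02),
    "MMLU-Pro acc": (28.24, 30.70),
}

# Per-metric improvement of the MoE checkpoint over the dense baseline.
deltas = {name: moe - dense for name, (dense, moe) in scores.items()}
mean_delta = sum(deltas.values()) / len(deltas)
largest = max(deltas, key=deltas.get)

print(f"{len(deltas)} metrics, mean delta = {mean_delta:+.2f} points")
print(f"largest gain: {largest} ({deltas[largest]:+.2f})")
```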

Model Details

Field	Value
Architecture	HRM-Text L backbone with sparse MoE FFN
Checkpoint	epoch 4 pretrained checkpoint
Format	bf16 `model.safetensors`
Total parameters	3,008,759,040
Active parameters / token	~696,648,960
Non-expert parameters	366,347,520
Expert parameters	2,642,411,520
Hidden size	1280
Layers per H / L stack	12
Total dense block count	24
Attention heads	10
H_cycles x L_cycles	2 x 3
Max sequence length	4096
Vocabulary	65,536
Position encoding	RoPE theta 10000
Normalization	Parameterless Pre-RMSNorm
Attention	Gated attention
MoE experts	64
MoE top-k	8
Expert intermediate size	448
Active FFN width	8 x 448 = 3584
Exported weights	EMA weights
dtype	bfloat16
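The parameter counts in the table are mutually consistent, which the arithmetic below checks. It rests on two assumptions not stated in this card: each SwiGLU expert has three bias-free projections (gate, up, down), and every one of the 24 blocks carries its own set of 64 experts.

```python
HIDDEN = 1280                    # from the table
EXPERT_WIDTH = 448               # from the table
NUM_EXPERTS = 64                 # from the table
TOP_K = 8                        # from the table
NON_EXPERT_PARAMS = 366_347_520  # from the table
MOE_LAYERS = 24                  # assumption: one MoE FFN in each of the 24 blocks

# Assumption: gate, up, and down projections, no biases.
params_per_expert = 3 * HIDDEN * EXPERT_WIDTH
expert_params = params_per_expert * NUM_EXPERTS * MOE_LAYERS

total_params = NON_EXPERT_PARAMS + expert_params
# Only TOP_K of NUM_EXPERTS experts run for any single token.
active_params = NON_EXPERT_PARAMS + expert_params * TOP_K // NUM_EXPERTS
active_ratio = active_params / total_params

print(f"expert params: {expert_params:,}")  # 2,642,411,520
print(f"total params:  {total_params:,}")   # 3,008,759,040
print(f"active params: {active_params:,}")  # 696,648,960
print(f"active ratio:  {active_ratio:.1%}") # 23.2%
```

Under these assumptions the computed values reproduce the table's expert, total, and active parameter counts exactly.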

Usage

This checkpoint uses the custom hrm_text_moe architecture and is loaded through Hugging Face Transformers with trust_remote_code=True.

pip install --upgrade transformers safetensors accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Xiaoye08/HRM-MoE-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

# synth,cot composite: reasoning / CoT style.
condition = "<|quad_end|><|object_ref_end|>"
prompt = f"<|im_start|>{condition}Explain why the sky is blue.<|im_end|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
inputs["token_type_ids"] = torch.ones_like(inputs["input_ids"])

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(out[0], skip_special_tokens=False))

This mirrors the HRM-Text usage pattern. The extra trust_remote_code=True flag is required because hrm_text_moe is not yet a native Transformers architecture.

Prompt Format

HRM-Text and HRM-MoE use condition prefix tokens. Prompts should be rendered as:

<|im_start|><condition tokens>prompt text<|im_end|>

Common conditions:

Mode	Tokens
`direct`	`<
`cot`	`<
`noisy`	`<
`synth`	`<

For reasoning-style prompting, synth,cot maps to <|quad_end|><|object_ref_end|>.
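A small helper keeps the prompt template in one place. `render_prompt` is a hypothetical name used for illustration; the only condition string it hard-codes is the synth,cot composite given above.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"
SYNTH_COT = "<|quad_end|><|object_ref_end|>"  # synth,cot composite from this card


def render_prompt(text, condition=SYNTH_COT):
    """Wrap prompt text as <|im_start|><condition tokens>prompt text<|im_end|>."""
    return f"{IM_START}{condition}{text}{IM_END}"


prompt = render_prompt("Explain why the sky is blue.")
print(prompt)
# <|im_start|><|quad_end|><|object_ref_end|>Explain why the sky is blue.<|im_end|>
```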

PrefixLM Mask

The checkpoint was pretrained with the HRM-Text PrefixLM objective. For generation from a prompt, pass:

inputs["token_type_ids"] = torch.ones_like(inputs["input_ids"])

This marks the prompt as one bidirectional prefix block before autoregressive decoding.
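To make the idea of a bidirectional prefix concrete, the sketch below builds the attention pattern that a generic PrefixLM mask implies. It assumes that a token type of 1 marks a prefix position; how the released modeling code actually consumes token_type_ids is not described in this card, so treat `prefix_lm_mask` as an illustration rather than the model's implementation.

```python
def prefix_lm_mask(token_type_ids):
    """Return mask[i][j], True when position i may attend to position j.

    Prefix positions (type 1) see each other in both directions;
    every position also sees itself and everything before it.
    """
    n = len(token_type_ids)
    return [
        [j <= i or (token_type_ids[i] == 1 and token_type_ids[j] == 1)
         for j in range(n)]
        for i in range(n)
    ]


# Three prompt tokens (prefix) followed by two generated tokens.
mask = prefix_lm_mask([1, 1, 1, 0, 0])
rows = ["".join("x" if ok else "." for ok in row) for row in mask]
print("\n".join(rows))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

The first three rows show the prompt attending to itself in full, while the last two rows stay causal.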

Training Snapshot

32 GPUs
4 pretraining epochs
global batch size 172,032 tokens
learning rate 2.5e-4
bfloat16 forward/backward
EMA decay 0.9999
grouped Triton MoE expert kernels during training
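For readers unfamiliar with weight averaging, the following is a minimal sketch of a standard exponential moving average update using the decay listed above. The update rule is the textbook form and `ema_update` is a hypothetical helper; the training code may differ in details such as warmup or update frequency.

```python
EMA_DECAY = 0.9999  # from the training snapshot


def ema_update(ema_weights, live_weights, decay=EMA_DECAY):
    """One EMA step: move each averaged weight a fraction (1 - decay) toward the live weight."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, live_weights)]


# Toy example: the live weight is held at 1.0 while the EMA starts at 0.0.
ema = [0.0]
live = [1.0]
steps = 10_000  # roughly 1 / (1 - decay) updates
for _ in range(steps):
    ema = ema_update(ema, live)

# After about 1 / (1 - decay) steps the EMA has covered roughly 63% of the gap.
print(round(ema[0], 2))  # 0.63
```

With a decay of 0.9999 the average responds on a scale of about 10,000 updates, so the exported EMA weights smooth over many optimizer steps.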

Limitations

Pre-alignment base checkpoint, not a chat model.
Not instruction-tuned, RLHF-trained, or safety-aligned.
English-focused pretraining mixture.
Requires trust_remote_code=True; CUDA is strongly recommended for practical inference.
Outputs may be inaccurate, biased, or unsafe.

License

Apache License 2.0.

Citation

This model is derived from HRM-Text. If you use HRM-Text or HRM-MoE, please cite:

@misc{wang2026hrmtextefficientpretrainingscaling,
      title={HRM-Text: Efficient Pretraining Beyond Scaling},
      author={Guan Wang and Changling Liu and Chenyu Wang and Cai Zhou and Yuhao Sun and Yifei Wu and Shuai Zhen and Luca Scimeca and Yasin Abbasi Yadkori},
      year={2026},
      eprint={2605.20613},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.20613},
}

Upstream

HRM-Text paper: https://arxiv.org/abs/2605.20613
Dense baseline: https://huggingface.co/Xiaoye08/HRM-Text-0.6B
HRM-MoE 1B-active release: https://huggingface.co/Xiaoye08/HRM-MoE
HRM-MoE code: https://github.com/XiaoYee/HRM-MoE
