HRM-MoE-0.6B

HRM-MoE-0.6B is an active-matched sparse Mixture-of-Experts checkpoint built on the HRM-Text L backbone. It keeps the per-token active parameter count within about 2M of the dense HRM-Text-0.6B baseline (696.6M vs. 694.7M), while expanding the total parameter pool to 3.01B through 64 routed SwiGLU experts.

This release is the epoch-4 pretrained L-MoE 64x8 checkpoint exported from the native FSDP2 training checkpoint into a single bf16 model.safetensors file. It is a pre-alignment base model, not a chat or instruction-following assistant.

Model Structure Comparison

Dense HRM-Text-0.6B

input tokens
  -> HRM H/L recurrent blocks
  -> dense SwiGLU FFN, intermediate width 3584
  -> all dense FFN parameters active for every token

HRM-MoE-0.6B

input tokens
  -> HRM H/L recurrent blocks
  -> router over 64 SwiGLU experts
  -> top-8 experts active per token, each expert width 448
  -> active FFN width = 8 x 448 = 3584

| Model | Total params | Active params / token | Active ratio | FFN path |
|---|---|---|---|---|
| HRM-Text-0.6B dense | 694.7M | 694.7M | 100% | dense SwiGLU, width 3584 |
| HRM-MoE-0.6B 64x8 | 3.01B | 696.6M | 23.2% | 64 routed SwiGLU experts, top-8, width 448 each |

The active FFN width is exactly matched to the dense L baseline. The sparse model therefore tests whether a larger expert pool improves quality without increasing the active per-token parameter budget.
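The routing step in the diagram can be illustrated with a minimal pure-Python sketch for a single token. The expert count, top-k, and expert width come from this card; the function name `route_top_k`, the toy logits, and the choice to apply a softmax over only the selected logits are assumptions made for illustration, and the actual router may normalize its gates differently.

```python
import math

NUM_EXPERTS = 64    # from the model card
TOP_K = 8           # from the model card
EXPERT_WIDTH = 448  # from the model card


def route_top_k(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts for one token.

    Assumption: gate weights are a softmax over the selected logits only.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


# Toy router logits for one token (hypothetical values, all distinct).
logits = [((i * 37) % NUM_EXPERTS) / 8.0 for i in range(NUM_EXPERTS)]
experts, gates = route_top_k(logits)

# Each selected expert contributes its own intermediate width.
active_width = len(experts) * EXPERT_WIDTH
print(experts, active_width)
```

Only the eight selected experts run for the token, so the active intermediate width is 8 x 448 = 3584 regardless of which experts the router picks.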

Results vs HRM-Text-0.6B

The table below compares the epoch-4 native checkpoint evaluations for the dense L baseline and the active-matched L-MoE 64x8 checkpoint. Values are percentages.

| Benchmark | Metric | HRM-Text-0.6B dense e4 | HRM-MoE-0.6B e4 | Delta |
|---|---|---|---|---|
| GSM8k | acc | 79.61 | 83.62 | +4.01 |
| MATH | acc | 50.96 | 55.50 | +4.54 |
| DROP | em | 74.21 | 77.86 | +3.65 |
| DROP | f1 | 77.94 | 81.47 | +3.53 |
| MMLU | acc | 53.74 | 58.60 | +4.86 |
| ARC | acc | 76.37 | 82.27 | +5.90 |
| HellaSwag | acc | 51.48 | 65.58 | +14.10 |
| Winogrande | acc | 66.93 | 70.32 | +3.39 |
| BoolQ | acc | 84.56 | 86.02 | +1.46 |
| MMLU-Pro | acc | 28.24 | 30.70 | +2.46 |
| AIME25 | maj_pass@1 | 16.67 | 16.67 | +0.00 |
| AIME25 | maj_pass@10 | 26.67 | 26.67 | +0.00 |
| AIME25 | maj_pass@100 | 50.00 | 50.00 | +0.00 |

Across the 10 standard and MMLU-Pro metrics above, HRM-MoE-0.6B improves over the dense active-matched baseline by +4.79 percentage points on average (the mean of the Delta column). The three AIME25 majority-voting metrics are unchanged in this run.
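The average can be checked directly from the table. The short script below copies the ten non-AIME rows and recomputes the per-metric deltas and their mean; the variable names are illustrative only.

```python
# (benchmark, metric, dense e4, MoE e4), copied from the results table.
results = [
    ("GSM8k", "acc", 79.61, 83.62),
    ("MATH", "acc", 50.96, 55.50),
    ("DROP", "em", 74.21, 77.86),
    ("DROP", "f1", 77.94, 81.47),
    ("MMLU", "acc", 53.74, 58.60),
    ("ARC", "acc", 76.37, 82.27),
    ("HellaSwag", "acc", 51.48, 65.58),
    ("Winogrande", "acc", 66.93, 70.32),
    ("BoolQ", "acc", 84.56, 86.02),
    ("MMLU-Pro", "acc", 28.24, 30.70),
]

# Round each delta to two decimals to match the table's precision.
deltas = [round(moe - dense, 2) for _, _, dense, moe in results]
mean_delta = sum(deltas) / len(deltas)

print(f"mean delta: {mean_delta:.2f}")  # prints "mean delta: 4.79"
```

Note that HellaSwag (+14.10) contributes a large share of the mean; the other nine metrics each gain between +1.46 and +5.90 points.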

Model Details

| Field | Value |
|---|---|
| Architecture | HRM-Text L backbone with sparse MoE FFN |
| Checkpoint | epoch 4 pretrained checkpoint |
| Format | bf16 model.safetensors |
| Total parameters | 3,008,759,040 |
| Active parameters / token | ~696,648,960 |
| Non-expert parameters | 366,347,520 |
| Expert parameters | 2,642,411,520 |
| Hidden size | 1280 |
| Layers per H / L stack | 12 |
| Total dense block count | 24 |
| Attention heads | 10 |
| H_cycles x L_cycles | 2 x 3 |
| Max sequence length | 4096 |
| Vocabulary | 65,536 |
| Position encoding | RoPE theta 10000 |
| Normalization | Parameterless Pre-RMSNorm |
| Attention | Gated attention |
| MoE experts | 64 |
| MoE top-k | 8 |
| Expert intermediate size | 448 |
| Active FFN width | 8 x 448 = 3584 |
| Exported weights | EMA weights |
| dtype | bfloat16 |
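The parameter counts above are mutually consistent, which the following sketch checks. It assumes that each of the 24 blocks holds one MoE FFN and that every SwiGLU expert consists of three bias-free matrices (gate, up, down); neither detail is stated explicitly in this card, but the reported totals are reproduced under these assumptions. The non-expert count is taken as given.

```python
hidden = 1280                    # hidden size
blocks = 24                      # 12 layers per stack, H and L stacks
num_experts, top_k = 64, 8
expert_width = 448               # expert intermediate size
non_expert_params = 366_347_520  # reported value, taken as given

# Assumption: SwiGLU expert = gate, up, and down projections, no biases.
params_per_expert = 3 * hidden * expert_width

expert_params = blocks * num_experts * params_per_expert
total_params = non_expert_params + expert_params
# Only top_k experts per block run for any one token.
active_params = non_expert_params + blocks * top_k * params_per_expert

print(f"{expert_params:,}")   # 2,642,411,520
print(f"{total_params:,}")    # 3,008,759,040
print(f"{active_params:,}")   # 696,648,960
print(f"{active_params / total_params:.1%}")  # 23.2%

# bf16 stores two bytes per parameter.
print(f"{total_params * 2 / 1e9:.2f} GB of weights")
```

The last line gives a rough lower bound on the memory needed to hold the weights; all experts must be resident even though only 8 of 64 run per token.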

Usage

This checkpoint uses the custom hrm_text_moe architecture, which Hugging Face Transformers loads only when trust_remote_code=True is set.

pip install --upgrade transformers safetensors accelerate

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Xiaoye08/HRM-MoE-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

# synth,cot composite: reasoning / CoT style.
condition = "<|quad_end|><|object_ref_end|>"
prompt = f"<|im_start|>{condition}Explain why the sky is blue.<|im_end|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
inputs["token_type_ids"] = torch.ones_like(inputs["input_ids"])

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(out[0], skip_special_tokens=False))

This mirrors the HRM-Text usage pattern. The extra trust_remote_code=True flag is required because hrm_text_moe is not yet a native Transformers architecture.

Prompt Format

HRM-Text and HRM-MoE use condition prefix tokens. Prompts should be rendered as:

<|im_start|><condition tokens>prompt text<|im_end|>

Common conditions:

| Mode | Tokens |
|---|---|
| direct | <… |
| cot | <… |
| noisy | <… |
| synth | <… |

For reasoning-style prompting, synth,cot maps to <|quad_end|><|object_ref_end|>.
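A small helper can keep the rendering consistent. This is a sketch: `render_prompt` and `SYNTH_COT` are hypothetical names, and only the synth,cot composite is included because it is the one condition string given in full here.

```python
# The synth,cot composite condition, as given in this card.
SYNTH_COT = "<|quad_end|><|object_ref_end|>"


def render_prompt(text, condition=SYNTH_COT):
    """Wrap prompt text as <|im_start|><condition tokens>text<|im_end|>."""
    return f"<|im_start|>{condition}{text}<|im_end|>"


prompt = render_prompt("Explain why the sky is blue.")
print(prompt)
```

The resulting string matches the prompt built by hand in the Usage section.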

PrefixLM Mask

The checkpoint was pretrained with the HRM-Text PrefixLM objective. For generation from a prompt, pass:

inputs["token_type_ids"] = torch.ones_like(inputs["input_ids"])

This marks the prompt as one bidirectional prefix block before autoregressive decoding.
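To make the effect of the prefix block concrete, here is a generic PrefixLM attention mask in plain Python. It is a sketch of the general technique, not the model's actual masking code: the function name `prefix_lm_mask` is hypothetical, and it assumes that a value of 1 in token_type_ids marks prefix positions and 0 marks generated positions.

```python
def prefix_lm_mask(token_type_ids):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: 1 marks prefix tokens, 0 marks autoregressive tokens.
    """
    n = len(token_type_ids)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i  # standard left-to-right attention
            both_prefix = token_type_ids[i] == 1 and token_type_ids[j] == 1
            row.append(causal or both_prefix)
        mask.append(row)
    return mask


# Three prompt tokens followed by two generated tokens.
mask = prefix_lm_mask([1, 1, 1, 0, 0])
for row in mask:
    print("".join("x" if ok else "." for ok in row))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

Prefix positions see the whole prompt in both directions, while generated positions see the prompt plus earlier generated tokens only.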

Training Snapshot

  • 32 GPUs
  • 4 pretraining epochs
  • global batch size 172,032 tokens
  • learning rate 2.5e-4
  • bfloat16 forward/backward
  • EMA decay 0.9999
  • grouped Triton MoE expert kernels during training
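For readers unfamiliar with EMA weights, the standard update rule with the decay listed above looks like the sketch below. The function name `ema_update` and the toy values are illustrative; how often the update was applied during training, and whether any warmup was used, is not stated in this card.

```python
EMA_DECAY = 0.9999  # from the training snapshot


def ema_update(ema_weights, weights, decay=EMA_DECAY):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy example: the averaged copy moves 0.01% of the way toward the live weights.
ema = [0.0, 1.0]
ema = ema_update(ema, [1.0, 1.0])

# Rough averaging horizon in update steps: 1 / (1 - decay).
horizon = 1.0 / (1.0 - EMA_DECAY)
print(ema, round(horizon))
```

With a decay of 0.9999 the averaged weights reflect roughly the last 10,000 updates, which is why the exported EMA weights can differ noticeably from the final raw training weights.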

Limitations

  • Pre-alignment base checkpoint, not a chat model.
  • Not instruction-tuned, RLHF-trained, or safety-aligned.
  • English-focused pretraining mixture.
  • Requires trust_remote_code=True; CUDA is strongly recommended for practical inference.
  • Outputs may be inaccurate, biased, or unsafe.

License

Apache License 2.0.

Citation

This model is derived from HRM-Text. If you use HRM-Text or HRM-MoE, please cite:

@misc{wang2026hrmtextefficientpretrainingscaling,
      title={HRM-Text: Efficient Pretraining Beyond Scaling},
      author={Guan Wang and Changling Liu and Chenyu Wang and Cai Zhou and Yuhao Sun and Yifei Wu and Shuai Zhen and Luca Scimeca and Yasin Abbasi Yadkori},
      year={2026},
      eprint={2605.20613},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.20613},
}
