HybridMoRMoE: Hybrid Mixture-of-Recursions & Mixture-of-Experts
A custom causal language model combining Mixture-of-Recursions (MoR) with Mixture-of-Experts (MoE) routing, built from scratch in PyTorch and trained via a three-stage pipeline (pre-training → SFT → GRPO).
Architecture
| Hyperparameter | Value |
|---|---|
| Model type | hybrid_mor_moe |
| Hidden dim (d_model) | 576 |
| Feed-forward dim (d_ff) | 1536 |
| Attention heads | 8 |
| Base layers | 6 |
| Shared recursive blocks | 6 |
| Unique last layers | 2 |
| Total transformer depth | 30 |
| Number of experts | 4 |
| Experts per token | 1 |
| Max recursions | 3 |
| Router percentile | 0.70 |
| Sequence length | 4096 |
| Vocabulary size | 151,665 |
| Tokenizer | Qwen2Tokenizer (Qwen2.5 compatible) |
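Two quantities follow directly from the table: the per-head attention width and the size of the token embedding matrix. The sketch below is only arithmetic on the listed values; the class and field names are illustrative and are not taken from the repository's `HybridMoRMoEConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchSketch:
    """Values copied from the hyperparameter table; names are illustrative."""
    d_model: int = 576
    d_ff: int = 1536
    n_heads: int = 8
    vocab_size: int = 151_665

    @property
    def head_dim(self) -> int:
        # d_model has to split evenly across the attention heads.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

    @property
    def embedding_params(self) -> int:
        # One d_model-sized vector per vocabulary entry.
        return self.vocab_size * self.d_model


cfg = ArchSketch()
print(cfg.head_dim)          # 72
print(cfg.embedding_params)  # 87359040
```

With a 151,665-entry vocabulary, the embedding matrix alone accounts for roughly 87 million parameters at this hidden size.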
Key design choices:
- Shared weight blocks are recursively applied based on a learned complexity score
- A per-token MoE router selects which expert processes each position
- Auxiliary routing loss (router_aux_loss_coef = 1e-4) encourages load balance
- Chat template follows the ChatML (<|im_start|>/<|im_end|>) format
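The routing ideas in this list can be sketched in plain Python. This is a minimal illustration, not the repository's implementation. The load-balancing term uses the Switch Transformer form (coefficient × number of experts × Σ fraction routed × mean router probability), and the recursion rule is assumed: tokens at or above the 70th-percentile score among those still active get one more pass, up to three. The function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(router_logits, aux_coef=1e-4):
    """Pick one expert per token and compute a Switch-style balance loss.

    router_logits: one list of per-expert logits for each token.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    choice = [max(range(n_experts), key=row.__getitem__) for row in probs]
    # f[e]: fraction of tokens sent to expert e.
    f = [choice.count(e) / n_tokens for e in range(n_experts)]
    # p[e]: mean router probability assigned to expert e.
    p = [sum(row[e] for row in probs) / n_tokens for e in range(n_experts)]
    aux = aux_coef * n_experts * sum(fe * pe for fe, pe in zip(f, p))
    return choice, aux


def recursion_depths(scores, percentile=0.70, max_recursions=3):
    """Assumed rule: every token gets one pass through the shared blocks;
    tokens at or above the percentile cut-off among the still-active
    tokens get one more pass, up to max_recursions."""
    depths = [1] * len(scores)
    active = list(range(len(scores)))
    for _ in range(max_recursions - 1):
        if not active:
            break
        ranked = sorted(scores[i] for i in active)
        cutoff = ranked[min(len(ranked) - 1, int(percentile * len(ranked)))]
        active = [i for i in active if scores[i] >= cutoff]
        for i in active:
            depths[i] += 1
    return depths


print(recursion_depths([0.1, 0.5, 0.9, 0.3, 0.7]))  # [1, 1, 3, 1, 2]
```

The balance loss is smallest when tokens spread evenly across the four experts and grows when the router collapses onto one of them, which is the behaviour the coefficient is meant to discourage.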
Training Pipeline
The model was trained in three sequential stages on a single NVIDIA P100 (16 GB HBM2):
| Stage | Method | Notes |
|---|---|---|
| 1 | Pre-training | Causal LM on open-domain text |
| 2 | SFT (Supervised Fine-Tuning) | Instruction following with packing |
| 3 | GRPO (Group Relative Policy Optimisation) | Reinforcement learning from a preference signal |
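The "group relative" part of stage 3 can be sketched as follows. GRPO samples several completions for the same prompt and scores each one against the group's own statistics rather than against a learned value function. This is a minimal sketch that assumes the population standard deviation and a small epsilon; the reward function and normalisation details used for this model are not stated here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, two of which were rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group mean receive positive advantages and the rest receive negative ones; a group with identical rewards contributes no learning signal.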
Training used FP16 precision throughout (P100 has no BF16 support).
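FP16 has a much narrower exponent range than BF16, so small gradients can underflow to zero. A common remedy is loss scaling, which PyTorch provides through its `GradScaler` utility. Whether this pipeline used it is not stated here, so the snippet below is only an illustration of the underflow problem. It uses the standard library's half-precision `struct` format, and the helper name is hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8        # a small gradient value
scale = 2.0 ** 16  # a typical loss-scale factor

print(to_fp16(grad))  # 0.0

# Scale up before the FP16 cast, then unscale in higher precision.
recovered = to_fp16(grad * scale) / scale
```

After scaling, the value survives the cast and `recovered` lands within FP16 rounding error of the original gradient.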
Usage
Because this model uses a custom architecture not registered in the Hugging Face Transformers library by default, you must load the modelling code alongside the weights.
Quick inference
import torch
from transformers import AutoTokenizer
# 1. Clone / download this repo
# 2. Make sure hybrid_mor_moe_training.py is on your Python path
# (it registers HybridMoRMoEForCausalLM & HybridMoRMoEConfig with AutoModel)
from hybrid_mor_moe_training import HybridMoRMoEConfig, HybridMoRMoEForCausalLM
model_path = "TorchLLM/HybridMoRMoE" # or local path
config = HybridMoRMoEConfig.from_pretrained(model_path)
model = HybridMoRMoEForCausalLM.from_pretrained(model_path, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
messages = [
    {"role": "user", "content": "Explain the difference between MoE and dense transformers."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.simple_generate(
        inputs["input_ids"],
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
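For readers unfamiliar with the `temperature` and `top_p` arguments, the sketch below shows the usual meaning of both. Logits are divided by the temperature before the softmax, and sampling is then restricted to the smallest set of tokens whose cumulative probability reaches `top_p`. The internals of `simple_generate` are not documented here, so this is the standard technique and not a copy of the model's code; `sample_top_p` is a hypothetical name.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token id: temperature scaling, then nucleus filtering."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for idx in order:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break

    # Draw from the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


token_id = sample_top_p([2.0, 1.0, 0.5, -1.0], rng=random.Random(0))
```

Lower temperatures sharpen the distribution, and lower `top_p` values cut off more of the low-probability tail.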
Environment setup
pip install torch transformers trl datasets accelerate
HF_TOKEN: If you need to access gated datasets during re-training, export your token:
export HF_TOKEN="your_token_here"
Never hard-code tokens in source files.
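On the Python side, the token can be read from the environment at runtime so that it never appears in the source. This is a minimal sketch; `get_hf_token` is a hypothetical helper and not part of this repository.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN"):
    """Return the token from the environment, or None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = get_hf_token()
if token is None:
    print("HF_TOKEN is not set; gated datasets will not be accessible.")
```

Checking for the token early makes a missing credential fail with a clear message instead of an authentication error partway through a training run.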
Repository Structure
TorchLLM/HybridMoRMoE/
├── config.json # Model architecture config
├── generation_config.json # Default generation settings
├── model.safetensors # Trained weights (SafeTensors format)
├── tokenizer.json # Tokenizer vocabulary & rules
├── tokenizer_config.json # Tokenizer metadata
├── chat_template.jinja # ChatML chat template
└── hybrid_mor_moe_training.py # Full training pipeline source
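When working from a local clone, a quick check that all the listed files are present can catch an incomplete download before loading fails. This sketch uses only the file names from the tree above; `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the repository listing.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "hybrid_mor_moe_training.py",
]


def missing_files(repo_dir):
    """Return the expected files that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, `missing_files("./HybridMoRMoE")` returns an empty list for a complete clone.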
Citation
If you use this model or training code in your research, please cite:
@misc{hybridmormoe2025,
title = {HybridMoRMoE: Combining Mixture-of-Recursions and Mixture-of-Experts for Efficient Causal LM},
author = {Abhishek Gandhi},
year = {2026},
url = {https://huggingface.co/TorchLLM/HybridMoRMoE}
}
License
Apache 2.0. See LICENSE for details.