🧠 HybridMoE Titan v1

Hybrid Mamba + Attention + Mixture-of-Experts · 450M Parameters · Built From Scratch

To the author's knowledge, one of only three published architectures combining Mamba SSM, Multi-Head Attention, and Mixture-of-Experts
in a single decoder-only language model. Trained on a single GPU for ~$50.

📄 Paper (PDF) · 📊 Dataset · 🚀 Titan v2 · 🌐 Project Page

✨ Highlights


🧬 Triple Hybrid	First sub-1B model combining Mamba SSM + Multi-Head Attention + Fine-Grained MoE
🧩 Fixed-Shape Dispatch	Novel deterministic expert routing — enables full gradient checkpointing (>24 GB → 5.84 GB VRAM)
⚖️ Loss-Free Balancing	EMA-based router bias — zero dead experts, no auxiliary loss term
💰 $50 Total Compute	Single NVIDIA L4 GPU (24 GB) on AWS g6.2xlarge
🧑‍💻 Solo Research	Designed, built, and trained entirely from scratch by one independent researcher

📄 Research Paper

HybridMoE Titan v1: A Decoder-Only Language Model Combining Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained Mixture-of-Experts at 450M Scale

Mateusz Piesiak — Project Inkblot, Independent ML Research

📥 Download Full Paper (PDF): 11 pages · 16 references · Full architecture & training details
📝 LaTeX Source: for review and reproduction

⚠️ arXiv Submission Pending — This paper is awaiting endorsement for cs.LG (Machine Learning). If you are an active arXiv author endorsed for cs.LG, please consider endorsing — only 1 endorsement is needed. See contact below.

🏗️ Architecture

HybridMoE Titan v1 unifies three sequence processing paradigms in a single decoder stack:

🌀 Mamba SSM

Linear-time selective state updates
12 layers · SwiGLU FFN
Local sequence modelling

🔭 Multi-Head Attention

16 heads + Rotary Position Embeddings
4 layers · Jamba-style interleaving
Global context capture

⚡ Mixture-of-Experts

32 routed + 2 shared experts
Top-2 routing · Loss-free balancing
Conditional computation

🔄 Jamba-Style 1:3 Interleaving Pattern

Layer  0: [Attention + MoE]  ← Global + Conditional
Layer  1: [Mamba + SwiGLU]   ← Local + Dense
Layer  2: [Mamba + SwiGLU]   ← Local + Dense
Layer  3: [Mamba + SwiGLU]   ← Local + Dense
Layer  4: [Attention + MoE]  ← Global + Conditional
  ...
Layer 15: [Mamba + SwiGLU]   ← Local + Dense
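
The pattern above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the attention layers sit at every fourth index starting from layer 0, as the listing shows. The function name `layer_pattern` and the block labels are illustrative and are not taken from the project's code.

```python
def layer_pattern(num_layers: int = 16, attention_every: int = 4) -> list[str]:
    """Return the block type for each layer index (0-based)."""
    return [
        "attention+moe" if i % attention_every == 0 else "mamba+swiglu"
        for i in range(num_layers)
    ]


pattern = layer_pattern()
# Count how many layers of each type the 16-layer stack contains.
num_attention = pattern.count("attention+moe")
num_mamba = pattern.count("mamba+swiglu")
print(num_attention, num_mamba)
```

With the defaults this gives 4 attention layers and 12 Mamba layers, matching the 1:3 ratio in the configuration table.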

📐 Model Configuration

Parameter	Value
Total Parameters	450.4M
d_model	1,024
Attention Heads	16 (head_dim = 64)
Layers	16 (12 Mamba + 4 Attention)
Routed Experts	32
Shared Experts	2
Top-k Routing	2
Max Sequence Length	4,096
Vocabulary	50,258 tokens (GPT-2 BPE)
Precision	bfloat16
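
As a quick consistency check on the table, the values can be placed in a small dataclass. This is a hypothetical sketch: `TitanV1ConfigSketch` is not the project's `HybridModelConfig`, and the field names are assumptions chosen for readability. The derived `active_experts_per_token` also assumes that both shared experts run for every token, which is how shared experts are usually described but is not stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TitanV1ConfigSketch:
    # Values copied from the Model Configuration table.
    d_model: int = 1024
    num_heads: int = 16
    num_layers: int = 16
    num_routed_experts: int = 32
    num_shared_experts: int = 2
    top_k: int = 2
    max_seq_len: int = 4096
    vocab_size: int = 50258

    @property
    def head_dim(self) -> int:
        # Each attention head gets an equal slice of the model width.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads

    @property
    def active_experts_per_token(self) -> int:
        # Assumption: shared experts are always active alongside the top-k routed ones.
        return self.top_k + self.num_shared_experts


cfg = TitanV1ConfigSketch()
print(cfg.head_dim, cfg.active_experts_per_token)
```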

🔬 Novel Contributions

1️⃣ Fixed-Shape Expert Dispatch (§3.3.2)

Standard MoE implementations typically use torch.nonzero() for dynamic expert assignment. Its output shape depends on the routing decisions, which creates variable-length tensors that break PyTorch gradient checkpointing. We replace it with deterministic fixed-shape tensor operations:

# ❌ Standard (breaks checkpointing):
expert_indices = torch.nonzero(routing_mask)  # variable shape!

# ✅ Ours (fixed shape, checkpointing-safe):
# topk returns (values, indices); both are always [batch, 2]
top_k_logits, top_k_experts = router_logits.topk(k=2, dim=-1)

Impact: Reduced VRAM from >24 GB to 5.84 GB — making training possible on a single L4 GPU.
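
One way to keep every tensor shape independent of the routing decision is dense masked dispatch: each expert processes every token, and a dense gate matrix zeroes out the experts that were not selected. The sketch below illustrates that idea only. It is not the paper's implementation (see §3.3.2 for the actual method), it trades extra FLOPs for static shapes, and the class name `DenseTopKMoE` and the small layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


class DenseTopKMoE(nn.Module):
    """Top-k MoE whose intermediate shapes never depend on the routing."""

    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Expert weights are stacked so all experts run as one batched matmul.
        self.w_in = nn.Parameter(torch.randn(num_experts, d_model, d_hidden) * 0.02)
        self.w_out = nn.Parameter(torch.randn(num_experts, d_hidden, d_model) * 0.02)

    def forward(self, x):  # x: [tokens, d_model]
        logits = self.router(x)                                # [tokens, experts]
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)    # both [tokens, top_k]
        gates = F.softmax(top_vals, dim=-1)                    # [tokens, top_k]
        # Scatter the top-k gates into a dense [tokens, experts] matrix (zeros elsewhere).
        combine = torch.zeros_like(logits).scatter(-1, top_idx, gates)
        hidden = F.silu(torch.einsum("td,edh->eth", x, self.w_in))   # [experts, tokens, d_hidden]
        out = torch.einsum("eth,ehd->etd", hidden, self.w_out)       # [experts, tokens, d_model]
        # Weighted sum over experts; unselected experts contribute zero.
        return torch.einsum("etd,te->td", out, combine)              # [tokens, d_model]


torch.manual_seed(0)
moe = DenseTopKMoE()
x = torch.randn(10, 64, requires_grad=True)
# The layer can be wrapped in gradient checkpointing because recomputation
# always produces tensors of the same shape.
y = checkpoint(moe, x, use_reentrant=False)
y.sum().backward()
```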

2️⃣ Loss-Free Load Balancing (§3.3.3)

Instead of adding an auxiliary balance loss (which competes with the LM objective), we use DeepSeek-style EMA bias updates on router logits:

Metric	Value
Expert usage std	< 0.032 (uniform per-expert share = 1/32 ≈ 0.031)
Max single expert share	14.7%
Dead experts (entire training)	0
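
The bias-update idea can be sketched as follows. A per-expert bias is added to the router logits only when choosing the top-k experts, and after each step the bias is nudged up for under-used experts and down for over-used ones. This is a minimal sketch under stated assumptions: the class `BiasBalancedRouter`, the EMA decay, and the bias step size are hypothetical, and the exact update rule used in Titan v1 is described in §3.3.3 of the paper, not here.

```python
import torch


class BiasBalancedRouter:
    def __init__(self, num_experts=32, top_k=2, ema_decay=0.99, bias_lr=1e-3):
        self.num_experts, self.top_k = num_experts, top_k
        self.ema_decay, self.bias_lr = ema_decay, bias_lr
        self.bias = torch.zeros(num_experts)
        # Start the running load estimate at the uniform share.
        self.ema_load = torch.full((num_experts,), 1.0 / num_experts)

    @torch.no_grad()
    def select(self, logits):  # logits: [tokens, experts]
        # The bias changes which experts are chosen, not the gate values.
        _, top_idx = (logits + self.bias).topk(self.top_k, dim=-1)
        return top_idx

    @torch.no_grad()
    def update(self, top_idx):
        counts = torch.bincount(top_idx.flatten(), minlength=self.num_experts).float()
        load = counts / counts.sum()
        self.ema_load.mul_(self.ema_decay).add_(load, alpha=1 - self.ema_decay)
        # Raise the bias of under-used experts, lower it for over-used ones.
        self.bias += self.bias_lr * torch.sign(1.0 / self.num_experts - self.ema_load)


torch.manual_seed(0)
router = BiasBalancedRouter(num_experts=8, top_k=2)
skew = torch.zeros(8)
skew[0] = 2.0  # synthetic logits that strongly favour expert 0
for _ in range(200):
    logits = torch.randn(256, 8) + skew
    router.update(router.select(logits))
print(router.bias)
```

In this toy run the bias of the over-used expert drifts negative while the others drift positive. Because no term is added to the training loss, the balancing does not compete with the language-modelling gradient.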

3️⃣ Triple Hybrid at Sub-1B Scale

Model	Params	Attention	SSM	MoE	Compute	Organisation
Jamba	52B	✅	✅ Mamba	✅ 16 experts	Enterprise	AI21 Labs
Zamba	7B	✅ Shared	✅ Mamba	❌	Institutional	Zyphra
Mixtral	46.7B	✅	❌	✅ 8 experts	Enterprise	Mistral AI
HybridMoE Titan	450M	✅ RoPE	✅ Mamba	✅ 32+2	~$50	Independent 🧑‍💻

📊 Training


GPU	NVIDIA L4 24 GB (AWS g6.2xlarge)
Optimizer	AdamW (β₁=0.9, β₂=0.95, wd=0.1)
LR Schedule	Cosine decay, peak 2×10⁻⁵, 5% warmup
Effective Batch	64 sequences (~65K tokens)
Throughput	~6,200 tokens/sec
VRAM Usage	5.84 GB (with gradient checkpointing)
Total Tokens	~4.2B
Total Cost	~$50
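
The learning-rate schedule in the table can be written as a small function. This is a sketch, assuming linear warmup and a decay floor of zero; the table gives only the peak value and the warmup fraction, so `min_lr`, the warmup shape, and the example `total_steps` are assumptions.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5,
          warmup_frac: float = 0.05, min_lr: float = 0.0) -> float:
    """Cosine decay with linear warmup over the first warmup_frac of training."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical 1,000-step run, just to show the shape of the curve.
schedule = [lr_at(s, total_steps=1000) for s in (0, 49, 50, 525, 1000)]
print(schedule)
```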

📚 Training Data

Bilingual English–Polish corpus (~9 GB, ~2.4B tokens):

Source	Language	Size	Description
FineWeb-Edu	EN	~3.5 GB	High-quality educational text
FineWeb	EN	~2.5 GB	General web text
Wikipedia PL	PL	~1.5 GB	Polish Wikipedia dumps
Wolne Lektury	PL	~323 MB	Polish public-domain literature (cleaned)

Full dataset: Mati83moni/HybridMoE-Training-Dataset-v1

📈 Validation Perplexity Progression

Step	Val Loss	Perplexity	Notes
12,000	2.38	10.81	50-shard validation
20,000	3.47	32.31	↗ Expanded to 200 shards
26,000	3.38	29.26	200 shards
31,000	3.32	27.58	200 shards
42,850	3.30	~27.5	200 shards (latest) ✅

💡 The PPL increase at step 20K is due to validation set expansion (50 → 200 shards, more Polish text), not model degradation.
📊 At ~4.2B tokens, Titan v1 is undertrained relative to the Chinchilla-optimal ~9B tokens for a 450M model (roughly 20 tokens per parameter). Continued training is expected to reduce PPL further. Titan v2 has reached NTP 4.48 at step 7,700 on a cleaner, larger corpus, although its custom 64K tokeniser means that figure is not directly comparable to v1's validation metrics.
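
For readers who want to check the numbers, perplexity is the exponential of the validation loss, and the Chinchilla estimate follows from the common rule of thumb of about 20 training tokens per parameter. The sketch below assumes the loss is a mean cross-entropy in nats; because the table shows losses rounded to two decimals, recomputed perplexities agree with the table only approximately.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss_nats)


def chinchilla_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb compute-optimal token budget."""
    return num_params * tokens_per_param


for loss in (2.38, 3.47, 3.38, 3.32, 3.30):
    print(f"loss {loss:.2f} -> ppl {perplexity(loss):.2f}")

budget = chinchilla_tokens(450e6)
print(f"Chinchilla budget: {budget / 1e9:.1f}B tokens")
```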

🧪 Expert Routing Health

All 32 routed experts maintain healthy utilisation across all 4 MoE layers throughout training:

Expert Utilisation (step 42,850):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  usage_std range:     0.019 – 0.031
  uniform share 1/32:  0.03125
  max single expert:   14.7%
  dead experts:        0 / 32
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
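
Metrics of this kind can be computed from per-expert token counts. This is a sketch only: the helper `routing_health` is hypothetical, and it assumes `usage_std` is the population standard deviation of per-expert shares within one MoE layer, which the summary above does not spell out.

```python
from statistics import pstdev


def routing_health(counts: list[int]) -> dict[str, float]:
    """Summarise how evenly tokens were spread across experts in one layer."""
    total = sum(counts)
    shares = [c / total for c in counts]
    return {
        "usage_std": pstdev(shares),          # 0.0 for perfectly uniform routing
        "max_share": max(shares),             # largest fraction taken by one expert
        "dead_experts": sum(1 for c in counts if c == 0),
        "uniform_share": 1 / len(counts),     # 0.03125 for 32 experts
    }


uniform = routing_health([100] * 32)
skewed = routing_health([0] + [10] * 31)
print(uniform)
print(skewed)
```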

⚠️ Known Limitations

Limitation	Detail
📉 Undertrained	~4.2B tokens vs Chinchilla-optimal ~9B for 450M
🔤 Tokeniser	GPT-2 BPE fragments Polish diacritics (ą→2 tokens, ż→3 tokens)
🚫 No SFT/RLHF	Pre-trained only — does not follow instructions
🚫 No Code/Math	Not in training data — cannot generate code or solve math
💻 Single GPU	Training throughput limited by L4 24 GB constraints
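
The tokeniser limitation follows from how byte-level BPE works: GPT-2's tokeniser starts from UTF-8 bytes, and every Polish diacritic occupies two bytes, so a letter costs more than one token unless the vocabulary happens to contain a merge for that byte pair. The snippet below only inspects byte lengths with the standard library. It does not load the GPT-2 vocabulary, so it says nothing about which merges exist.

```python
POLISH_DIACRITICS = "ąćęłńóśźż"


def utf8_len(ch: str) -> int:
    """Number of bytes a character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


byte_lengths = {ch: utf8_len(ch) for ch in POLISH_DIACRITICS}
print(byte_lengths)
print(utf8_len("a"))  # plain ASCII letter for comparison
```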

📐 Scaling History

The architecture's Single Source of Truth design allows scaling the entire model by changing a single configuration value (num_layers). Titan has gone through four scaling phases so far, with a fifth planned:

Version	Params	Layers	Hardware	Status
Titan Tiny (v0)	57.7M	8	Kaggle T4	✅ PPL ~59
Titan Mid (v0.5)	690M	24	Kaggle T4	⚠️ Compute blocked
Titan Little (v1)	450.4M	16	AWS L4	✅ PPL 27.5
Titan v2	464.5M	16	AWS L4	🔄 Training (NTP 4.48 @ step 7,700)
Titan Standard	~1B	TBD	AWS multi-L4	📋 Planned

💡 The reduction from 690M (24 layers) to 450.4M (16 layers) required changing a single configuration value — demonstrating the architecture's modular scalability. This is a key design principle: one codebase, one config class, any scale.

# Scale the model by changing ONE line:
config = HybridModelConfig(num_layers=8)   # → Titan Tiny  (57.7M)
config = HybridModelConfig(num_layers=16)  # → Titan v1    (450.4M) ← trained
config = HybridModelConfig(num_layers=24)  # → Titan Mid   (690M)

🚀 Titan v2 — In Active Development

Improvement	v1	v2
Tokeniser	GPT-2 BPE (50K)	Custom BPE (64K) — PL + EN + bio + code
Data mixing	Sequential	Stratified domain mixing
Optimizer	AdamW	MuonClip + QK-Clip stabilisation
Parallelism	Single GPU	DeepSpeed ZeRO Stage 2
Target domain	General	Polish biomedical reasoning

Follow progress: Mati83moni/HybridMoE-Titan-v2

📬 Contact & arXiv Endorsement

Author: Mateusz Piesiak — Project Inkblot · Independent ML Research · South Kirkby, UK

HuggingFace: Mati83moni

🙏 arXiv Endorsement Request

This paper is pending submission to arXiv cs.LG (Machine Learning). As a first-time independent submitter, the author needs endorsement from an established arXiv author.

If you are endorsed for cs.LG and find this work interesting, please consider endorsing this submission. Only 1 endorsement is needed. Contact the author via HuggingFace. This is a genuine independent research project — full model, data, and training pipeline are open under MIT License.

📜 Citation

@misc{piesiak2026hybridmoe,
  title   = {HybridMoE Titan v1: A Decoder-Only Language Model Combining
             Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained
             Mixture-of-Experts at 450M Scale},
  author  = {Piesiak, Mateusz},
  year    = {2026},
  note    = {Project Inkblot, Independent ML Research},
  url     = {https://huggingface.co/Mati83moni/HybridMoE-Titan-v1}
}