🧠 HybridMoE Titan v1

Hybrid Mamba + Attention + Mixture-of-Experts Β· 450M Parameters Β· Built From Scratch

One of only three published architectures combining Mamba SSM, Multi-Head Attention, and Mixture-of-Experts
in a single decoder-only language model. Trained on a single GPU for ~$50.

DOI · 📄 Paper (PDF) · 📊 Dataset · 🚀 Titan v2 · 🌐 Project Page


✨ Highlights

🧬 Triple Hybrid First sub-1B model combining Mamba SSM + Multi-Head Attention + Fine-Grained MoE
🧩 Fixed-Shape Dispatch Novel deterministic expert routing β€” enables full gradient checkpointing (>24 GB β†’ 5.84 GB VRAM)
βš–οΈ Loss-Free Balancing EMA-based router bias β€” zero dead experts, no auxiliary loss term
πŸ’° $50 Total Compute Single NVIDIA L4 GPU (24 GB) on AWS g6.2xlarge
πŸ§‘β€πŸ’» Solo Research Designed, built, and trained entirely from scratch by one independent researcher

📄 Research Paper

HybridMoE Titan v1: A Decoder-Only Language Model Combining Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained Mixture-of-Experts at 450M Scale

Mateusz Piesiak, Project Inkblot, Independent ML Research

📥 Download Full Paper (PDF): 11 pages · 16 references · Full architecture & training details
📝 LaTeX Source: for review and reproduction

⚠️ arXiv Submission Pending: This paper is awaiting endorsement for cs.LG (Machine Learning). If you are an active arXiv author endorsed for cs.LG, please consider endorsing; only 1 endorsement is needed. See contact below.


🏗️ Architecture

HybridMoE Titan v1 unifies three sequence processing paradigms in a single decoder stack:

🌀 Mamba SSM

Linear-time selective state updates
12 layers · SwiGLU FFN
Local sequence modelling

🔭 Multi-Head Attention

16 heads + Rotary Position Embeddings
4 layers · Jamba-style interleaving
Global context capture

⚡ Mixture-of-Experts

32 routed + 2 shared experts
Top-2 routing · Loss-free balancing
Conditional computation

🔄 Jamba-Style 1:3 Interleaving Pattern

Layer  0: [Attention + MoE]  ← Global + Conditional
Layer  1: [Mamba + SwiGLU]   ← Local + Dense
Layer  2: [Mamba + SwiGLU]   ← Local + Dense
Layer  3: [Mamba + SwiGLU]   ← Local + Dense
Layer  4: [Attention + MoE]  ← Global + Conditional
  ...
Layer 15: [Mamba + SwiGLU]   ← Local + Dense
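The 1:3 pattern above can be generated programmatically. This is a minimal sketch, assuming an Attention + MoE block at every fourth layer starting from layer 0 and a Mamba + SwiGLU block everywhere else. `layer_pattern` and its `period` parameter are hypothetical names, not taken from the Titan codebase.

```python
from collections import Counter


def layer_pattern(num_layers: int, period: int = 4) -> list[str]:
    """Return block types for a 1:(period-1) attention-to-Mamba interleaving.

    Assumption: layer i is attention + MoE when i % period == 0,
    otherwise Mamba + SwiGLU (matches the listing for layers 0-4 and 15).
    """
    return [
        "attention+moe" if i % period == 0 else "mamba+swiglu"
        for i in range(num_layers)
    ]


pattern = layer_pattern(16)
counts = Counter(pattern)
print(counts["attention+moe"], counts["mamba+swiglu"])  # 4 12
```

With 16 layers this yields the 4 attention and 12 Mamba layers quoted for Titan v1.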

📐 Model Configuration

| Parameter | Value |
|---|---|
| Total Parameters | 450.4M |
| d_model | 1,024 |
| Attention Heads | 16 (head_dim = 64) |
| Layers | 16 (12 Mamba + 4 Attention) |
| Routed Experts | 32 |
| Shared Experts | 2 |
| Top-k Routing | 2 |
| Max Sequence Length | 4,096 |
| Vocabulary | 50,258 tokens (GPT-2 BPE) |
| Precision | bfloat16 |
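Several of these values constrain one another; for example, head_dim follows from d_model and the head count. The sketch below collects the table values in a small config object and checks those relations. `TitanConfigSketch` and its field names are illustrative assumptions and are not the project's own `HybridModelConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TitanConfigSketch:
    """Illustrative container for the table values (hypothetical class)."""

    d_model: int = 1024
    num_heads: int = 16
    num_layers: int = 16
    attention_every: int = 4  # assumed from the 1:3 interleaving
    num_routed_experts: int = 32
    num_shared_experts: int = 2
    top_k: int = 2
    max_seq_len: int = 4096
    vocab_size: int = 50258

    def __post_init__(self) -> None:
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        if not 1 <= self.top_k <= self.num_routed_experts:
            raise ValueError("top_k must be between 1 and num_routed_experts")

    @property
    def head_dim(self) -> int:
        return self.d_model // self.num_heads

    @property
    def num_attention_layers(self) -> int:
        # Layers 0, attention_every, 2 * attention_every, ... use attention.
        return len(range(0, self.num_layers, self.attention_every))


cfg = TitanConfigSketch()
print(cfg.head_dim, cfg.num_attention_layers)  # 64 4
```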

🔬 Novel Contributions

1️⃣ Fixed-Shape Expert Dispatch (§3.3.2)

Standard MoE uses torch.nonzero() for dynamic expert assignment, which creates variable-length tensors that break PyTorch gradient checkpointing. We replace it with deterministic fixed-shape tensor operations:

# ❌ Standard (breaks checkpointing):
expert_indices = torch.nonzero(routing_mask)  # variable shape!

# ✅ Ours (fixed shape, checkpointing-safe):
# topk returns (values, indices); both are always [batch, 2]
top_k_logits, top_k_experts = router_logits.topk(k=2, dim=-1)

Impact: Reduced VRAM from >24 GB to 5.84 GB, making training possible on a single L4 GPU.
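To make the shape argument concrete, the plain-Python sketch below routes tokens with top-k selection and combines expert outputs through a dense per-expert gate, so every intermediate has a size fixed by (num_tokens, k, num_experts) regardless of which experts are chosen. It is an illustration under assumptions, not the paper's implementation: the function names, the softmax over the selected logits, and the toy scalar experts are all hypothetical. In PyTorch, the selection step corresponds to `torch.topk(router_logits, k, dim=-1)`.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick k experts per token; the result is always num_tokens x k."""
    indices, weights = [], []
    for logits in router_logits:
        top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
        exps = [math.exp(logits[e] - logits[top[0]]) for e in top]
        total = sum(exps)
        indices.append(top)
        weights.append([x / total for x in exps])  # softmax over selected logits (assumed)
    return indices, weights


def dense_combine(tokens, indices, weights, experts):
    """Combine expert outputs with a dense gate of fixed length num_tokens.

    Every expert sees a gate vector of the same length; tokens that did not
    select it get weight 0.0, so no data-dependent shapes appear.
    """
    out = [0.0] * len(tokens)
    for e, expert in enumerate(experts):
        gate = [
            sum(w for idx, w in zip(indices[t], weights[t]) if idx == e)
            for t in range(len(tokens))
        ]
        for t, x in enumerate(tokens):
            out[t] += gate[t] * expert(x)
    return out


# Toy setup: 3 tokens, 4 scalar "experts" (hypothetical).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [[0.1, 2.0, 0.3, 1.5], [3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
indices, weights = top_k_route(logits, k=2)
outputs = dense_combine([1.0, 1.0, 1.0], indices, weights, experts)
print([len(row) for row in indices])  # [2, 2, 2]
```

This dense variant evaluates every expert on every token, which is only one way to obtain fixed shapes and gives up the compute savings of sparse dispatch; the dispatch described in §3.3.2 may differ.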

2️⃣ Loss-Free Load Balancing (§3.3.3)

Instead of adding an auxiliary balance loss (which competes with the LM objective), we use DeepSeek-style EMA bias updates on router logits:

| Metric | Value |
|---|---|
| Expert usage std | < 0.032 (ideal uniform ≈ 0.031) |
| Max single expert share | 14.7% |
| Dead experts (entire training) | 0 |
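The idea of bias-based balancing can be sketched in a few lines. The exact update rule used by Titan is not given in this card, so the version below is an assumption: usage is tracked as an EMA of each expert's share of assignments, and a per-expert bias moves by a fixed step toward the uniform target. The bias is added to the logits for selection only. `BiasBalancedRouter`, `ema_decay`, and `bias_step` are hypothetical names and values.

```python
import random


class BiasBalancedRouter:
    """Top-k router with a per-expert bias nudged toward uniform load."""

    def __init__(self, num_experts, k=2, ema_decay=0.9, bias_step=0.01):
        self.num_experts = num_experts
        self.k = k
        self.ema_decay = ema_decay
        self.bias_step = bias_step
        self.bias = [0.0] * num_experts
        self.ema_share = [1.0 / num_experts] * num_experts

    def select(self, logits):
        # Bias influences which experts are picked, nothing else.
        scores = [l + b for l, b in zip(logits, self.bias)]
        order = sorted(range(self.num_experts), key=lambda e: scores[e], reverse=True)
        return order[: self.k]

    def update(self, batch_choices):
        counts = [0] * self.num_experts
        for chosen in batch_choices:
            for e in chosen:
                counts[e] += 1
        total = sum(counts)
        target = 1.0 / self.num_experts
        for e in range(self.num_experts):
            share = counts[e] / total
            self.ema_share[e] = (
                self.ema_decay * self.ema_share[e] + (1 - self.ema_decay) * share
            )
            if self.ema_share[e] < target:
                self.bias[e] += self.bias_step  # underused: make more attractive
            elif self.ema_share[e] > target:
                self.bias[e] -= self.bias_step  # overused: make less attractive


random.seed(0)
router = BiasBalancedRouter(num_experts=8)
skew = [0.5 * e for e in range(8)]  # raw logits favour high-index experts
for _ in range(300):
    batch = [[s + random.gauss(0, 1) for s in skew] for _ in range(64)]
    router.update([router.select(logits) for logits in batch])
print([round(b, 2) for b in router.bias])
```

In this toy run the experts that the raw logits disfavour end up with larger biases than the favoured ones, and no gradient-based balance term is involved.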

3️⃣ Triple Hybrid at Sub-1B Scale

| Model | Params | Attention | SSM | MoE | Compute | Organisation |
|---|---|---|---|---|---|---|
| Jamba | 52B | ✅ | ✅ Mamba | ✅ 16 experts | Enterprise | AI21 Labs |
| Zamba | 7B | ✅ Shared | ✅ Mamba | ❌ | Institutional | Zyphra |
| Mixtral | 46.7B | ✅ | ❌ | ✅ 8 experts | Enterprise | Mistral AI |
| HybridMoE Titan | 450M | ✅ RoPE | ✅ Mamba | ✅ 32+2 | ~$50 | Independent 🧑‍💻 |

📊 Training

GPU: NVIDIA L4 24 GB (AWS g6.2xlarge)
Optimizer: AdamW (β₁=0.9, β₂=0.95, wd=0.1)
LR Schedule: Cosine decay, peak 2×10⁻⁵, 5% warmup
Effective Batch: 64 sequences (~65K tokens)
Throughput: ~6,200 tokens/sec
VRAM Usage: 5.84 GB (with gradient checkpointing)
Total Tokens: ~4.2B
Total Cost: ~$50
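The learning-rate schedule listed above can be written as a small function. This sketch assumes linear warmup over the first 5% of steps followed by cosine decay to zero. The warmup shape, the final learning rate, and the total step count are not stated in this card, so `lr_at`, `min_lr`, and the 40,000-step example are hypothetical.

```python
import math

PEAK_LR = 2e-5          # from the training settings
WARMUP_FRACTION = 0.05  # from the training settings


def lr_at(step: int, total_steps: int, peak_lr: float = PEAK_LR,
          warmup_fraction: float = WARMUP_FRACTION, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay (assumed shapes)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine


total = 40_000  # hypothetical run length
for s in (0, 1_999, 2_000, 21_000, 39_999):
    print(s, f"{lr_at(s, total):.2e}")
```

The rate climbs to the peak at the end of warmup, passes half the peak midway through the decay phase, and approaches `min_lr` at the final step.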

📚 Training Data

Multilingual English–Polish corpus (~9 GB, ~2.4B tokens):

| Source | Language | Size | Description |
|---|---|---|---|
| FineWeb-Edu | EN | ~3.5 GB | High-quality educational text |
| FineWeb | EN | ~2.5 GB | General web text |
| Wikipedia PL | PL | ~1.5 GB | Polish Wikipedia dumps |
| Wolne Lektury | PL | ~323 MB | Polish public-domain literature (cleaned) |

Full dataset: Mati83moni/HybridMoE-Training-Dataset-v1

📈 Validation Perplexity Progression

| Step | Val Loss | Perplexity | Notes |
|---|---|---|---|
| 12,000 | 2.38 | 10.81 | 50-shard validation |
| 20,000 | 3.47 | 32.31 | ↗ Expanded to 200 shards |
| 26,000 | 3.38 | 29.26 | 200 shards |
| 31,000 | 3.32 | 27.58 | 200 shards |
| 42,850 | 3.30 | ~27.5 | 200 shards (latest) ✅ |

💡 The PPL increase at step 20K is due to validation set expansion (50 → 200 shards, more Polish text), not model degradation.

📊 At ~4.2B tokens, Titan v1 is undertrained relative to the Chinchilla-optimal ~9B tokens for a 450M model. Continued training is expected to reduce PPL further, as confirmed by Titan v2 reaching NTP 4.48 at step 7,700 on a cleaner, larger corpus.
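Two relations underlie the figures in this section: perplexity is the exponential of the mean cross-entropy loss in nats, and the Chinchilla rule of thumb suggests roughly 20 training tokens per parameter. The helper names below are hypothetical. Values recomputed from the two-decimal losses differ slightly from the reported perplexities, presumably because those were computed from unrounded losses.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity is exp of the mean cross-entropy in nats per token."""
    return math.exp(loss_nats)


def chinchilla_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb compute-optimal token budget (about 20 tokens per parameter)."""
    return num_params * tokens_per_param


for loss in (2.38, 3.47, 3.38, 3.32, 3.30):
    print(f"loss {loss:.2f} -> PPL {perplexity(loss):.2f}")

budget = chinchilla_tokens(450.4e6)
print(f"{budget / 1e9:.1f}B tokens; trained fraction {4.2e9 / budget:.0%}")
```

For 450.4M parameters the rule gives about 9.0B tokens, so ~4.2B tokens is a little under half of that budget.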


🧪 Expert Routing Health

All 32 routed experts maintain healthy utilisation across all 4 MoE layers throughout training:

Expert Utilisation (step 42,850):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  usage_std range:     0.019 – 0.031
  ideal uniform:       0.03125
  max single expert:   14.7%
  dead experts:        0 / 32
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️ Known Limitations

Limitation Detail
πŸ“‰ Undertrained ~4.2B tokens vs Chinchilla-optimal ~9B for 450M
πŸ”€ Tokeniser GPT-2 BPE fragments Polish diacritics (Δ…β†’2 tokens, ΕΌβ†’3 tokens)
🚫 No SFT/RLHF Pre-trained only β€” does not follow instructions
🚫 No Code/Math Not in training data β€” cannot generate code or solve math
πŸ’» Single GPU Training throughput limited by L4 24 GB constraints
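On the tokeniser limitation: GPT-2 uses a byte-level BPE, so a character that was not merged into a single token during vocabulary training falls back to pieces of its UTF-8 encoding. The sketch below only counts UTF-8 bytes, which is an upper bound on the number of tokens for an isolated character. It does not load the GPT-2 vocabulary, and actual token counts depend on the learned merges and on the Unicode normalisation of the input. `utf8_len` is a hypothetical helper.

```python
import unicodedata


def utf8_len(text: str, form: str = "NFC") -> int:
    """Number of UTF-8 bytes after applying a Unicode normalisation form."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))


# Bytes per Polish diacritic in composed (NFC) and decomposed (NFD) form.
for ch in "ąćęłńóśźż":
    print(ch, utf8_len(ch, "NFC"), utf8_len(ch, "NFD"))

sample = "zażółć gęślą jaźń"
print(len(sample), utf8_len(sample))  # characters vs UTF-8 bytes
```

Each of these letters takes two bytes in composed form, against one byte for an ASCII letter, which is why Polish text tends to cost more tokens per character under a vocabulary trained mostly on English.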

📏 Scaling History

The architecture's Single Source of Truth design allows scaling the entire model by changing a single configuration value (num_layers). Titan went through four scaling phases:

| Version | Params | Layers | Hardware | Status |
|---|---|---|---|---|
| Titan Tiny (v0) | 57.7M | 8 | Kaggle T4 | ✅ PPL ~59 |
| Titan Mid (v0.5) | 690M | 24 | Kaggle T4 | ⚠️ Compute blocked |
| Titan Little (v1) | 450.4M | 16 | AWS L4 | ✅ PPL 27.5 |
| Titan v2 | 464.5M | 16 | AWS L4 | 🔄 Training (NTP 4.48 @ step 7,700) |
| Titan Standard | ~1B | TBD | AWS multi-L4 | 📋 Planned |

💡 The reduction from 690M (24 layers) to 450.4M (16 layers) required changing a single configuration value, demonstrating the architecture's modular scalability. This is a key design principle: one codebase, one config class, any scale.

# Scale the model by changing ONE line:
config = HybridModelConfig(num_layers=8)   # → Titan Tiny  (57.7M)
config = HybridModelConfig(num_layers=16)  # → Titan v1    (450.4M) ← trained
config = HybridModelConfig(num_layers=24)  # → Titan Mid   (690M)

🚀 Titan v2: In Active Development

| Improvement | v1 | v2 |
|---|---|---|
| Tokeniser | GPT-2 BPE (50K) | Custom BPE (64K): PL + EN + bio + code |
| Data mixing | Sequential | Stratified domain mixing |
| Optimizer | AdamW | MuonClip + QK-Clip stabilisation |
| Parallelism | Single GPU | DeepSpeed ZeRO Stage 2 |
| Target domain | General | Polish biomedical reasoning |
Follow progress: Mati83moni/HybridMoE-Titan-v2


📬 Contact & arXiv Endorsement

Author: Mateusz Piesiak, Project Inkblot · Independent ML Research · South Kirkby, UK

🙏 arXiv Endorsement Request

This paper is pending submission to arXiv cs.LG (Machine Learning). As a first-time independent submitter, the author needs endorsement from an established arXiv author.

If you are endorsed for cs.LG and find this work interesting, please consider endorsing this submission. Only 1 endorsement is needed. Contact the author via Hugging Face. This is a genuine independent research project; the full model, data, and training pipeline are open under the MIT License.


📜 Citation

@article{piesiak2026hybridmoe,
  title   = {HybridMoE Titan v1: A Decoder-Only Language Model Combining
             Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained
             Mixture-of-Experts at 450M Scale},
  author  = {Piesiak, Mateusz},
  year    = {2026},
  note    = {Project Inkblot, Independent ML Research.
             Available: https://huggingface.co/Mati83moni/HybridMoE-Titan-v1}
}

📄 License

MIT License: full model, weights, code, and training pipeline.


Related architectures:
Jamba (52B) · Zamba (7B) · DeepSeek-MoE · Mamba

Built with ❤️ and a single L4 GPU
