HybridMoE Titan v1
Hybrid Mamba + Attention + Mixture-of-Experts · 450M Parameters · Built From Scratch
One of only three published architectures combining Mamba SSM, Multi-Head Attention, and Mixture-of-Experts
in a single decoder-only language model. Trained on a single GPU for ~$50.
Paper (PDF) · Dataset · Titan v2 · Project Page
Highlights
| Triple Hybrid | First sub-1B model combining Mamba SSM + Multi-Head Attention + Fine-Grained MoE |
| Fixed-Shape Dispatch | Novel deterministic expert routing that enables full gradient checkpointing (>24 GB → 5.84 GB VRAM) |
| Loss-Free Balancing | EMA-based router bias: zero dead experts, no auxiliary loss term |
| $50 Total Compute | Single NVIDIA L4 GPU (24 GB) on AWS g6.2xlarge |
| Solo Research | Designed, built, and trained entirely from scratch by one independent researcher |
Research Paper
HybridMoE Titan v1: A Decoder-Only Language Model Combining Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained Mixture-of-Experts at 450M Scale
Mateusz Piesiak · Project Inkblot, Independent ML Research
Download Full Paper (PDF): 11 pages · 16 references · Full architecture & training details
LaTeX Source: for review and reproduction
arXiv Submission Pending: This paper is awaiting endorsement for cs.LG (Machine Learning). If you are an active arXiv author endorsed for cs.LG, please consider endorsing; only 1 endorsement is needed. See contact below.
Architecture
HybridMoE Titan v1 unifies three sequence processing paradigms in a single decoder stack:
Jamba-Style 1:3 Interleaving Pattern
Layer 0:  [Attention + MoE]  ← Global + Conditional
Layer 1:  [Mamba + SwiGLU]   ← Local + Dense
Layer 2:  [Mamba + SwiGLU]   ← Local + Dense
Layer 3:  [Mamba + SwiGLU]   ← Local + Dense
Layer 4:  [Attention + MoE]  ← Global + Conditional
...
Layer 15: [Mamba + SwiGLU]   ← Local + Dense
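The interleaving listed above can be written as a rule on the layer index. The following is a minimal sketch, not code from the Titan repository: the function name `layer_kind` and the modulo rule are assumptions inferred from the listing (attention at layers 0, 4, ... and Mamba elsewhere).

```python
def layer_kind(layer_idx: int, period: int = 4) -> str:
    """Return the block type for a layer under a 1:3 attention-to-Mamba interleave."""
    # Assumed rule: every `period`-th layer (0, 4, 8, ...) is attention with an MoE
    # feed-forward; all other layers are Mamba with a dense SwiGLU feed-forward.
    return "Attention + MoE" if layer_idx % period == 0 else "Mamba + SwiGLU"


pattern = [layer_kind(i) for i in range(16)]
num_attention = sum(kind.startswith("Attention") for kind in pattern)
num_mamba = len(pattern) - num_attention
print(num_attention, num_mamba)  # 4 12
```

With 16 layers this rule yields 4 attention layers and 12 Mamba layers, which matches the layer split given in the model configuration.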
Model Configuration
| Parameter | Value |
|---|---|
| Total Parameters | 450.4M |
| d_model | 1,024 |
| Attention Heads | 16 (head_dim = 64) |
| Layers | 16 (12 Mamba + 4 Attention) |
| Routed Experts | 32 |
| Shared Experts | 2 |
| Top-k Routing | 2 |
| Max Sequence Length | 4,096 |
| Vocabulary | 50,258 tokens (GPT-2 BPE) |
| Precision | bfloat16 |
Novel Contributions
1. Fixed-Shape Expert Dispatch (§3.3.2)
Standard MoE implementations use torch.nonzero() for dynamic expert assignment. This produces variable-length tensors whose shape depends on the routing decisions, which breaks PyTorch gradient checkpointing. We replace it with deterministic fixed-shape tensor operations:
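import torch  # router_logits: [batch, num_experts]; routing_mask: boolean tensor of the same shape

# ❌ Standard (breaks checkpointing):
expert_indices = torch.nonzero(routing_mask)  # variable shape!
# ✅ Ours (fixed shape, checkpointing-safe):
top_k_scores, top_k_experts = router_logits.topk(k=2, dim=-1)  # each always [batch, 2]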
Impact: Reduced VRAM from >24 GB to 5.84 GB, making training possible on a single L4 GPU.
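To make the fixed-shape idea concrete, here is a small dependency-free sketch. It is an illustration under stated assumptions, not the implementation from §3.3.2: the helper names `route_top_k` and `dense_gate_matrix` are hypothetical, and plain Python lists stand in for tensors. In PyTorch the same two steps would roughly correspond to `torch.topk` followed by `Tensor.scatter_`.

```python
import math


def route_top_k(router_logits, k=2):
    """Pick k experts per token; every output row has exactly k entries."""
    indices, gates = [], []
    for row in router_logits:
        # Indices of the k largest logits for this token.
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        # Softmax over only the selected logits so the k gate weights sum to 1.
        peak = max(row[e] for e in top)
        exps = [math.exp(row[e] - peak) for e in top]
        total = sum(exps)
        indices.append(top)
        gates.append([x / total for x in exps])
    return indices, gates


def dense_gate_matrix(indices, gates, num_experts):
    """Scatter the k gate weights into a fixed [num_tokens, num_experts] matrix."""
    dense = [[0.0] * num_experts for _ in indices]
    for t, (experts, weights) in enumerate(zip(indices, gates)):
        for e, w in zip(experts, weights):
            dense[t][e] = w
    return dense


# Illustrative logits: 3 tokens, 4 experts.
logits = [[0.1, 2.0, -1.0, 0.5], [1.5, -0.3, 0.2, 1.4], [0.0, 0.3, 3.0, -2.0]]
idx, g = route_top_k(logits, k=2)
dense = dense_gate_matrix(idx, g, num_experts=4)
```

Both outputs have shapes that depend only on the number of tokens, `k`, and the number of experts, never on which experts were chosen. That property is what keeps a recomputed forward pass shape-identical to the original one.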
2. Loss-Free Load Balancing (§3.3.3)
Instead of adding an auxiliary balance loss (which competes with the LM objective), we use DeepSeek-style EMA bias updates on router logits:
| Metric | Value |
|---|---|
| Expert usage std | < 0.032 (ideal uniform ≈ 0.031) |
| Max single expert share | 14.7% |
| Dead experts (entire training) | 0 |
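The bias-update idea can be sketched as follows. This is a hedged illustration, not the paper's code: the function name `update_router_bias`, the sign-based step, and the values of `ema_decay` and `step_size` are all assumptions. In DeepSeek-style schemes the bias is typically added to the router scores only when selecting the top-k experts, while the gate weights still come from the unbiased scores.

```python
def update_router_bias(bias, ema_load, batch_counts, ema_decay=0.99, step_size=0.001):
    """One update: nudge under-used experts up and over-used experts down."""
    total = sum(batch_counts)
    target = 1.0 / len(bias)  # uniform share per expert
    new_bias, new_ema = [], []
    for b, ema, count in zip(bias, ema_load, batch_counts):
        share = count / total if total else target
        # Exponential moving average of this expert's share of routed tokens.
        ema = ema_decay * ema + (1.0 - ema_decay) * share
        # +1 if the expert is below the uniform share, -1 if above, 0 if equal.
        direction = (ema < target) - (ema > target)
        new_ema.append(ema)
        new_bias.append(b + step_size * direction)
    return new_bias, new_ema


# Illustrative batch: expert 0 receives most of the tokens.
bias = [0.0] * 4
ema = [0.25] * 4
bias, ema = update_router_bias(bias, ema, batch_counts=[70, 10, 10, 10])
```

After this update the overloaded expert's bias moves down and the others move up, so no gradient-based auxiliary term is needed to push the router towards balance.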
3. Triple Hybrid at Sub-1B Scale
| Model | Params | Attention | SSM | MoE | Compute | Organisation |
|---|---|---|---|---|---|---|
| Jamba | 52B | ✅ | ✅ Mamba | ✅ 16 experts | Enterprise | AI21 Labs |
| Zamba | 7B | ✅ Shared | ✅ Mamba | ❌ | Institutional | Zyphra |
| Mixtral | 46.7B | ✅ | ❌ | ✅ 8 experts | Enterprise | Mistral AI |
| HybridMoE Titan | 450M | ✅ RoPE | ✅ Mamba | ✅ 32+2 | ~$50 | Independent |
Training
| GPU | NVIDIA L4 24 GB (AWS g6.2xlarge) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| LR Schedule | Cosine decay, peak 2×10⁻⁵, 5% warmup |
| Effective Batch | 64 sequences (~65K tokens) |
| Throughput | ~6,200 tokens/sec |
| VRAM Usage | 5.84 GB (with gradient checkpointing) |
| Total Tokens | ~4.2B |
| Total Cost | ~$50 |
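The learning-rate schedule in the table can be sketched as below. Only the peak value (2×10⁻⁵), the 5% warmup fraction, and the cosine shape come from the table. The linear warmup, the decay to zero, the function name `learning_rate`, and the step count in the usage lines are assumptions made for illustration.

```python
import math


def learning_rate(step, total_steps, peak_lr=2e-5, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (both shapes assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000  # illustrative step count, not the actual training length
schedule = [learning_rate(s, total) for s in range(total)]
```

In this sketch the rate climbs for the first 50 steps, reaches the peak, and then falls smoothly towards `min_lr` by the final step.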
Training Data
Bilingual English-Polish corpus (~9 GB, ~2.4B tokens):
| Source | Language | Size | Description |
|---|---|---|---|
| FineWeb-Edu | EN | ~3.5 GB | High-quality educational text |
| FineWeb | EN | ~2.5 GB | General web text |
| Wikipedia PL | PL | ~1.5 GB | Polish Wikipedia dumps |
| Wolne Lektury | PL | ~323 MB | Polish public-domain literature (cleaned) |
Full dataset: Mati83moni/HybridMoE-Training-Dataset-v1
Validation Perplexity Progression
| Step | Val Loss | Perplexity | Notes |
|---|---|---|---|
| 12,000 | 2.38 | 10.81 | 50-shard validation |
| 20,000 | 3.47 | 32.31 | Validation set expanded to 200 shards |
| 26,000 | 3.38 | 29.26 | 200 shards |
| 31,000 | 3.32 | 27.58 | 200 shards |
| 42,850 | 3.30 | ~27.5 | 200 shards (latest) |
Note: The PPL increase at step 20K is due to validation set expansion (50 → 200 shards, more Polish text), not model degradation.
At ~4.2B tokens, Titan v1 is undertrained relative to the Chinchilla-optimal ~9B tokens for a 450M model. Continued training is expected to reduce PPL further, as confirmed by Titan v2 reaching NTP 4.48 at step 7,700 on a cleaner, larger corpus.
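For readers checking the table, perplexity is the exponential of the mean cross-entropy loss, assuming the loss is measured in nats. The short sketch below recomputes it from the reported losses; the function name `perplexity` is illustrative. Because the losses are rounded to two decimals, the recomputed values differ slightly from the table's perplexity column.

```python
import math


def perplexity(val_loss: float) -> float:
    """Perplexity as exp(mean cross-entropy loss), with the loss in nats."""
    return math.exp(val_loss)


# Validation losses as reported in the table (rounded to two decimals).
reported = {12_000: 2.38, 20_000: 3.47, 26_000: 3.38, 31_000: 3.32, 42_850: 3.30}
for step, loss in reported.items():
    print(f"step {step:>6}: loss {loss:.2f} -> PPL {perplexity(loss):.2f}")
```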
Expert Routing Health
All 32 routed experts maintain healthy utilisation across all 4 MoE layers throughout training:
Expert Utilisation (step 42,850):
────────────────────────────────
usage_std range: 0.019 - 0.031
ideal uniform: 0.03125
max single expert: 14.7%
dead experts: 0 / 32
────────────────────────────────
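Statistics of this kind can be derived from per-expert token counts. The sketch below shows one plausible way to do so. The exact definition of `usage_std` used for the figures above is not stated here, so taking the population standard deviation of usage shares is an assumption, as are the function name `routing_health` and the made-up counts.

```python
from statistics import pstdev


def routing_health(token_counts):
    """Summarise expert utilisation from per-expert routed-token counts."""
    total = sum(token_counts)
    shares = [c / total for c in token_counts]
    return {
        "usage_std": pstdev(shares),  # population std of the usage shares (assumed definition)
        "max_share": max(shares),
        "dead_experts": sum(c == 0 for c in token_counts),
        "uniform_share": 1.0 / len(token_counts),
    }


# Made-up counts for 32 experts: 30 balanced, one busier, one quieter.
counts = [100] * 30 + [150, 50]
stats = routing_health(counts)
```

For 32 experts the uniform share is 1/32 = 0.03125, which is the "ideal uniform" value shown in the block above.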
Known Limitations
| Limitation | Detail |
|---|---|
| Undertrained | ~4.2B tokens vs Chinchilla-optimal ~9B for 450M |
| Tokeniser | GPT-2 BPE fragments Polish diacritics (ą → 2 tokens, ż → 3 tokens) |
| No SFT/RLHF | Pre-trained only; does not follow instructions |
| No Code/Math | Not in training data; cannot generate code or solve math |
| Single GPU | Training throughput limited by L4 24 GB constraints |
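The "undertrained" figure follows from a common rule of thumb derived from the Chinchilla results: roughly 20 training tokens per parameter. The sketch below applies that heuristic to the numbers quoted above. The constant and the function name `chinchilla_optimal_tokens` are illustrative, and the rule is an approximation rather than an exact law.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio, an approximation of the Chinchilla findings


def chinchilla_optimal_tokens(num_params: float) -> float:
    """Approximate compute-optimal token budget for a model of the given size."""
    return TOKENS_PER_PARAM * num_params


params = 450.4e6        # Titan v1 parameter count
trained_tokens = 4.2e9  # tokens seen so far
optimal = chinchilla_optimal_tokens(params)
fraction = trained_tokens / optimal
print(f"optimal ~{optimal / 1e9:.1f}B tokens, trained fraction ~{fraction:.0%}")
```

Under this heuristic, Titan v1 has seen a little under half of its compute-optimal token budget.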
Scaling History
The architecture's Single Source of Truth design allows scaling the entire model by changing a single configuration value (num_layers). Titan went through four scaling phases:
| Version | Params | Layers | Hardware | Status |
|---|---|---|---|---|
| Titan Tiny (v0) | 57.7M | 8 | Kaggle T4 | ✅ PPL ~59 |
| Titan Mid (v0.5) | 690M | 24 | Kaggle T4 | ⚠️ Compute blocked |
| Titan Little (v1) | 450.4M | 16 | AWS L4 | ✅ PPL 27.5 |
| Titan v2 | 464.5M | 16 | AWS L4 | Training (NTP 4.48 @ step 7,700) |
| Titan Standard | ~1B | TBD | AWS multi-L4 | Planned |
The reduction from 690M (24 layers) to 450.4M (16 layers) required changing a single configuration value, demonstrating the architecture's modular scalability. This is a key design principle: one codebase, one config class, any scale.
# Scale the model by changing ONE line:
config = HybridModelConfig(num_layers=8)   # → Titan Tiny (57.7M)
config = HybridModelConfig(num_layers=16)  # → Titan v1 (450.4M), trained
config = HybridModelConfig(num_layers=24)  # → Titan Mid (690M)
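A single-source-of-truth config of this kind might look roughly like the sketch below. It is a hypothetical stand-in, not the project's `HybridModelConfig`. Only `num_layers` appears in the snippet above, so every other field name, and the `attention_period` field, is assumed here and filled with values from the model configuration table. Parameter counts are deliberately not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridModelConfigSketch:
    # Only `num_layers` is named in the original snippet; all other fields are assumed.
    num_layers: int = 16
    d_model: int = 1024
    num_heads: int = 16
    num_routed_experts: int = 32
    num_shared_experts: int = 2
    top_k: int = 2
    attention_period: int = 4  # assumed: one attention layer in every four layers

    @property
    def head_dim(self) -> int:
        return self.d_model // self.num_heads

    @property
    def num_attention_layers(self) -> int:
        # Layers 0, 4, 8, ... are treated as attention layers.
        return len(range(0, self.num_layers, self.attention_period))

    @property
    def num_mamba_layers(self) -> int:
        return self.num_layers - self.num_attention_layers


v1 = HybridModelConfigSketch(num_layers=16)
mid = HybridModelConfigSketch(num_layers=24)
```

Deriving quantities such as `head_dim` and the layer split from a few base fields is one way to keep a config class as the only place where scale is defined.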
Titan v2: In Active Development
| Improvement | v1 | v2 |
|---|---|---|
| Tokeniser | GPT-2 BPE (50K) | Custom BPE (64K): PL + EN + bio + code |
| Data mixing | Sequential | Stratified domain mixing |
| Optimizer | AdamW | MuonClip + QK-Clip stabilisation |
| Parallelism | Single GPU | DeepSpeed ZeRO Stage 2 |
| Target domain | General | Polish biomedical reasoning |
Follow progress: Mati83moni/HybridMoE-Titan-v2
Contact & arXiv Endorsement
Author: Mateusz Piesiak · Project Inkblot · Independent ML Research · South Kirkby, UK
- HuggingFace: Mati83moni
arXiv Endorsement Request
This paper is pending submission to arXiv cs.LG (Machine Learning). As a first-time independent submitter, the author needs endorsement from an established arXiv author.
If you are endorsed for cs.LG and find this work interesting, please consider endorsing this submission. Only 1 endorsement is needed. Contact the author via HuggingFace. This is a genuine independent research project: the full model, data, and training pipeline are open under the MIT License.
Citation
@article{piesiak2026hybridmoe,
  title  = {HybridMoE Titan v1: A Decoder-Only Language Model Combining
            Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained
            Mixture-of-Experts at 450M Scale},
  author = {Piesiak, Mateusz},
  year   = {2026},
  note   = {Project Inkblot, Independent ML Research.
            Available: https://huggingface.co/Mati83moni/HybridMoE-Titan-v1}
}
License
MIT License: full model, weights, code, and training pipeline.
Related architectures: Jamba (52B) · Zamba (7B) · DeepSeek-MoE · Mamba
Built with ❤️ and a single L4 GPU