HybridMoE Titan v2
Next-Generation Triple Hybrid · Custom Tokeniser · Domain-Mixed Training
Successor to Titan v1, currently in active pre-training on AWS EC2.
Titan v1 · Dataset · Paper · Project Page
What Changes vs v1
| Aspect | Titan v1 | Titan v2 |
|---|---|---|
| Tokeniser | GPT-2 BPE (50,258 tokens) | Custom BPE (64,000 tokens): PL + EN + biomedical + code |
| Polish Support | Fragmented diacritics (a single accented letter → 2 tokens) | Native single-token Polish characters |
| Data Pipeline | Sequential loading | Stratified domain mixing with configurable ratios |
| Corruption | Unknown data quality | Full corruption audit: 0 binary artifacts, 0 null bytes |
| Optimizer | AdamW | MuonClip with QK-Clip stabilisation |
| Parallelism | Single GPU | DeepSpeed ZeRO Stage 2 + CPU offload |
| Target | General pre-training | Polish biomedical reasoning specialisation |
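The diacritic fragmentation in the table has a simple root cause that can be checked directly. GPT-2's BPE works on UTF-8 bytes, and every Polish diacritic occupies two bytes, so such a letter stays split unless the learned merges happen to join those bytes. The snippet below is only an illustration of the byte counts; the constant name `POLISH_DIACRITICS` is made up for this example and is not part of the project's code.

```python
# Illustrative only: POLISH_DIACRITICS is a name chosen for this sketch.
POLISH_DIACRITICS = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"

# Number of UTF-8 bytes each letter occupies. A byte-level BPE tokeniser
# starts from these bytes before applying any merges.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in POLISH_DIACRITICS}

for ch, n in byte_lengths.items():
    print(f"{ch!r}: {n} bytes")
```

A tokeniser trained on a corpus with substantial Polish text can learn merges for these byte pairs, which is consistent with the single-token behaviour claimed for v2.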
Architecture
Same triple-hybrid architecture as Titan v1 with potential capacity scaling:
| Component | Description |
|---|---|
| Mamba SSM | Linear-time selective state updates for local context |
| Multi-Head Attention | RoPE-enhanced global attention layers |
| Fine-Grained MoE | 32 routed + 2 shared experts, top-2, loss-free balancing |
Core innovations from v1 are preserved: Fixed-Shape Dispatch for gradient checkpointing and Loss-Free EMA Load Balancing for zero dead experts.
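To make the idea of loss-free balancing concrete, here is a minimal pure-Python sketch of one common form of it: a per-expert bias is added to the router scores when picking the top-2 experts, and the bias is nudged after each batch according to an EMA of each expert's share of tokens. This is an assumption about the general technique, not the Titan implementation; the function names, `decay`, and `step` are hypothetical, and the two shared experts are left out because they would process every token regardless of routing.

```python
NUM_ROUTED = 32  # routed experts, from the table above
TOP_K = 2        # experts chosen per token, from the table above

def route(scores, bias, k=TOP_K):
    """Return the k expert indices with the highest biased router score."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e],
                    reverse=True)
    return ranked[:k]

def update_bias(bias, ema_load, counts, decay=0.9, step=0.01):
    """Update biases in place from one batch of per-expert token counts.

    ema_load tracks each expert's smoothed share of routed tokens.
    Experts above the uniform share get a lower bias, experts below it
    get a higher one, so no auxiliary loss term is needed.
    """
    total = sum(counts)
    if total == 0:
        return
    target = 1.0 / len(counts)  # uniform share per expert
    for e, count in enumerate(counts):
        ema_load[e] = decay * ema_load[e] + (1.0 - decay) * count / total
        if ema_load[e] > target:
            bias[e] -= step
        elif ema_load[e] < target:
            bias[e] += step

# Hypothetical initial state: zero bias, uniform load estimate.
bias = [0.0] * NUM_ROUTED
ema_load = [1.0 / NUM_ROUTED] * NUM_ROUTED
```

In this sketch the bias only influences which experts are selected. Whether it also enters the gate weights is a design choice that the source does not specify.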
Training Data
Cleaned and audited multilingual corpus from HybridMoE-Training-Dataset-v1:
- FineWeb-Edu: high-quality educational English text
- FineWeb: general web English text
- Wikipedia PL: Polish Wikipedia dumps
- Wolne Lektury: 7,473 Polish public-domain literary works (488K corrupt lines removed during cleaning)
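One way to mix these sources with configurable ratios is to give every batch a fixed quota per domain. The sketch below uses largest-remainder rounding so the quotas always sum to the batch size. The function name `domain_quotas` and the weights in `EXAMPLE_RATIOS` are hypothetical placeholders, not the ratios used for Titan v2.

```python
def domain_quotas(ratios, batch_size):
    """Split batch_size slots across domains in proportion to ratios.

    Uses largest-remainder rounding: every domain gets the floor of its
    exact share, then leftover slots go to the largest fractional parts.
    """
    total = sum(ratios.values())
    exact = {d: batch_size * r / total for d, r in ratios.items()}
    quotas = {d: int(x) for d, x in exact.items()}
    leftover = batch_size - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda d: exact[d] - quotas[d], reverse=True)
    for d in by_remainder[:leftover]:
        quotas[d] += 1
    return quotas

# Hypothetical weights, for illustration only.
EXAMPLE_RATIOS = {
    "fineweb_edu": 4,
    "fineweb": 3,
    "wikipedia_pl": 2,
    "wolne_lektury": 1,
}

print(domain_quotas(EXAMPLE_RATIOS, batch_size=8))
```

Fixed quotas keep every batch close to the configured mix, whereas purely random sampling only matches the ratios on average.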
Data Quality Improvements
- Removed 306 JPEG images embedded as binary in text files
- Eliminated 8.1M U+FFFD replacement characters
- Cleaned 190K null bytes from training corpus
- Result: no binary artifacts, replacement characters, or null bytes remain anywhere in the dataset
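A check for the three kinds of corruption listed above can be sketched in a few lines. The version below counts null bytes, UTF-8 encoded U+FFFD characters, and JPEG start-of-image signatures in a raw byte string. The helper names `audit_bytes` and `is_clean` are hypothetical, and the project's actual audit tooling may work differently.

```python
JPEG_MAGIC = b"\xff\xd8\xff"                 # JPEG start-of-image signature
REPLACEMENT_CHAR = "\ufffd".encode("utf-8")  # U+FFFD as bytes: EF BF BD

def audit_bytes(data: bytes) -> dict:
    """Count corruption markers in one raw chunk of a text file."""
    return {
        "null_bytes": data.count(b"\x00"),
        "replacement_chars": data.count(REPLACEMENT_CHAR),
        "jpeg_headers": data.count(JPEG_MAGIC),
    }

def is_clean(report: dict) -> bool:
    """True when every counter in the report is zero."""
    return all(count == 0 for count in report.values())

clean = "Zażółć gęślą jaźń".encode("utf-8")
dirty = clean + b"\x00" + REPLACEMENT_CHAR + JPEG_MAGIC + b"\xe0"

print(audit_bytes(clean))
print(audit_bytes(dirty))
```

The byte 0xFF never occurs in valid UTF-8, so the JPEG signature cannot match inside well-formed text.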
Training Objective
- Phase 1: Continue pre-training with custom 64K tokeniser on cleaned corpus
- Phase 2: Domain adaptation for Polish biomedical text (clinical reasoning, metabolic pathways, omics analysis)
- Phase 3: SFT on Polish biomedical instruction dataset
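For readers who want to reproduce the parallelism setup named in the comparison table (DeepSpeed ZeRO Stage 2 with CPU offload), a configuration along the following lines is a reasonable starting point. Only the ZeRO stage and the CPU offload target come from this card. The batch size, accumulation steps, and bf16 setting are assumed placeholder values, not the project's real settings.

```python
import json

# Sketch of a DeepSpeed config dict. Values marked "assumed" are
# placeholders and do not come from the Titan v2 training run.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # assumed
    "gradient_accumulation_steps": 8,      # assumed
    "bf16": {"enabled": True},             # assumed
    "zero_optimization": {
        "stage": 2,                        # partitions optimizer states and gradients
        "offload_optimizer": {             # keeps optimizer states in host memory
            "device": "cpu",
            "pin_memory": True,
        },
    },
}

print(json.dumps(ds_config, indent=2))
```

The printed JSON can be saved to a file and passed to DeepSpeed as its configuration.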
Scale & Progress
| Aspect | Titan v1 | Titan v2 |
|---|---|---|
| Parameters | 450.4M | 464.5M (larger embedding: 64K vocab) |
| Vocab size | 50,258 (GPT-2 BPE) | 64,001 (custom BPE) |
| Training tokens | ~4.2B | In progress |
| Best metric | PPL 27.5 | NTP 4.48 @ step 7,700 |
| Languages | EN + PL (fragmented) | EN + PL (native) + biomedical |
| Cost | ~$50 | TBD |
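The two "Best metric" entries use different units. Assuming NTP denotes the mean next-token cross-entropy in nats (the card does not define it), the values can be put on a common scale with the standard relation perplexity = exp(loss). The helper names below are hypothetical.

```python
import math

def loss_to_perplexity(loss_nats: float) -> float:
    """Perplexity for a mean cross-entropy given in nats per token."""
    return math.exp(loss_nats)

def perplexity_to_loss(perplexity: float) -> float:
    """Mean cross-entropy in nats per token for a given perplexity."""
    return math.log(perplexity)

v2_ppl = loss_to_perplexity(4.48)    # Titan v2 at step 7,700
v1_loss = perplexity_to_loss(27.5)   # Titan v1 best perplexity

print(f"Titan v2: loss 4.48 -> perplexity {v2_ppl:.1f}")
print(f"Titan v1: perplexity 27.5 -> loss {v1_loss:.2f}")
```

Even on a common scale the two numbers are not directly comparable: the models use different tokenisers, so a "token" means something different in each, and v2 is still mid-training.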
The ~14.1M parameter increase (450.4M → 464.5M) comes entirely from the larger embedding matrix (64,001 vs 50,258 tokens). The core architecture is identical.
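The parameter delta can be sanity-checked with a little arithmetic. If the input and output embeddings share one matrix (an assumption, since the card does not say), dividing the extra parameters by the extra vocabulary entries gives the implied embedding width. The reported counts are rounded to 0.1M, so the result is approximate.

```python
# Figures taken from the table above.
vocab_v1, vocab_v2 = 50_258, 64_001
params_v1, params_v2 = 450.4e6, 464.5e6

extra_tokens = vocab_v2 - vocab_v1    # added vocabulary rows
extra_params = params_v2 - params_v1  # added parameters

# Assumption: tied input/output embeddings, i.e. one row per token.
implied_width = extra_params / extra_tokens

print(f"extra tokens:  {extra_tokens:,}")
print(f"extra params:  {extra_params:,.0f}")
print(f"implied width: {implied_width:.0f}")
```

The result lands close to 1024, which would be consistent with a tied embedding of that width. With untied input and output embeddings the implied width would be about half as large.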
Author
Mateusz Piesiak, Project Inkblot · Independent ML Research · South Kirkby, UK
- HuggingFace: Mati83moni
Citation
@article{piesiak2026hybridmoe,
title = {HybridMoE Titan v1: A Decoder-Only Language Model Combining
Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained
Mixture-of-Experts at 450M Scale},
author = {Piesiak, Mateusz},
year = {2026},
note = {Project Inkblot, Independent ML Research.
Available: https://huggingface.co/Mati83moni/HybridMoE-Titan-v1}
}
License
MIT License
Actively training: check back for updates · Built with ❤️ by a single researcher