🚀 HybridMoE Titan v2

Next-Generation Triple Hybrid · Custom Tokeniser · Domain-Mixed Training

Successor to Titan v1 — currently in active pre-training on AWS EC2.

🧠 Titan v1 · 📊 Dataset · 📄 Paper · 🌐 Project Page

🔄 What Changes vs v1

Aspect	Titan v1	Titan v2
Tokeniser	GPT-2 BPE (50,258 tokens)	Custom BPE (64,001 tokens) — PL + EN + biomedical + code
Polish Support	Fragmented diacritics (ą→2 tok)	Native single-token Polish characters
Data Pipeline	Sequential loading	Stratified domain mixing with configurable ratios
Corruption	Unknown data quality	Full corruption audit — 0 binary artifacts, 0 null bytes
Optimizer	AdamW	MuonClip with QK-Clip stabilisation
Parallelism	Single GPU	DeepSpeed ZeRO Stage 2 + CPU offload
Target	General pre-training	Polish biomedical reasoning specialisation
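The diacritic fragmentation in the Polish Support row has a simple root cause: every Polish letter with a diacritic occupies two bytes in UTF-8. A byte-level BPE such as GPT-2's starts from bytes, so each of these letters costs two tokens unless the vocabulary learned a merge for that byte pair. The check below uses only the standard library; `utf8_byte_counts` is an illustrative helper, not part of the Titan codebase.

```python
import unicodedata

# The 18 Polish letters with diacritics, normalised to precomposed (NFC) form.
POLISH_DIACRITICS = unicodedata.normalize("NFC", "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")


def utf8_byte_counts(chars: str) -> dict:
    """Map each character to the number of bytes in its UTF-8 encoding."""
    return {ch: len(ch.encode("utf-8")) for ch in chars}


counts = utf8_byte_counts(POLISH_DIACRITICS)   # every value is 2
ascii_counts = utf8_byte_counts("acelnosz")    # every value is 1
```

A tokeniser trained on a corpus with a large Polish share has the chance to learn these byte pairs as single tokens, which is the behaviour the v2 column describes.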

🏗️ Architecture

Same triple-hybrid architecture as Titan v1. Capacity scaling is under consideration, but the core is currently unchanged apart from the larger embedding matrix:

Component	Description
🌀 Mamba SSM	Linear-time selective state updates for local context
🔭 Multi-Head Attention	RoPE-enhanced global attention layers
⚡ Fine-Grained MoE	32 routed + 2 shared experts, top-2, loss-free balancing

💡 Core innovations from v1 are preserved: Fixed-Shape Dispatch for gradient checkpointing and Loss-Free EMA Load Balancing for zero dead experts.
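One possible reading of top-2 routing with loss-free EMA balancing is the bias-based scheme: a per-expert bias is added to the router scores for expert selection only, and the bias is nudged according to an exponential moving average of each expert's load. The pure-Python sketch below illustrates that idea under assumed names and hyperparameters (`ema_decay`, `bias_lr`, the toy `simulate` driver). It is not taken from the Titan code, and it leaves out the 2 shared experts, which see every token without routing.

```python
import random

NUM_EXPERTS, TOP_K = 32, 2  # routed experts and top-k from the table above


def route(scores, bias):
    """Pick TOP_K experts by biased score; the bias only affects selection."""
    ranked = sorted(range(NUM_EXPERTS),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:TOP_K]


def update_bias(bias, ema_load, batch_choices, ema_decay=0.9, bias_lr=0.01):
    """Update the EMA of each expert's load share, then nudge its bias."""
    counts = [0] * NUM_EXPERTS
    for chosen in batch_choices:
        for e in chosen:
            counts[e] += 1
    total = max(sum(counts), 1)      # guard against an empty batch
    target = 1.0 / NUM_EXPERTS       # uniform load share
    for e in range(NUM_EXPERTS):
        share = counts[e] / total
        ema_load[e] = ema_decay * ema_load[e] + (1 - ema_decay) * share
        # Overloaded experts become less attractive, idle ones more attractive.
        if ema_load[e] > target:
            bias[e] -= bias_lr
        elif ema_load[e] < target:
            bias[e] += bias_lr


def simulate(steps=100, tokens_per_step=128, seed=0):
    """Toy router whose raw scores favour experts 0-3 (hypothetical setup)."""
    rng = random.Random(seed)
    skew = [0.5 if e < 4 else 0.0 for e in range(NUM_EXPERTS)]
    bias = [0.0] * NUM_EXPERTS
    ema_load = [1.0 / NUM_EXPERTS] * NUM_EXPERTS
    for _ in range(steps):
        batch = []
        for _ in range(tokens_per_step):
            scores = [rng.gauss(skew[e], 1.0) for e in range(NUM_EXPERTS)]
            batch.append(route(scores, bias))
        update_bias(bias, ema_load, batch)
    return bias, ema_load
```

Calling `simulate()` returns the learned biases and the smoothed load shares. In this toy setup the favoured experts end with lower biases than the rest, which is how the mechanism steers tokens toward under-used experts without an auxiliary loss term.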

📚 Training Data

Cleaned and audited multilingual corpus from HybridMoE-Training-Dataset-v1:

FineWeb-Edu — high-quality educational English text
FineWeb — general web English text
Wikipedia PL — Polish Wikipedia dumps
Wolne Lektury — 7,473 Polish public-domain literary works (488K corrupt lines removed during cleaning)
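Domain mixing with configurable ratios can be pictured as drawing the source of each training document from a weighted distribution over the corpora listed above. The sketch below is one simple interpretation: the values in `MIX_RATIOS` are made-up placeholders rather than the ratios used for Titan v2, and `mixed_stream` is a hypothetical helper.

```python
import random
from typing import Dict, Iterator, List, Tuple

# Placeholder ratios for illustration; the real Titan v2 mix is not stated here.
MIX_RATIOS: Dict[str, float] = {
    "fineweb_edu": 0.4,
    "fineweb": 0.3,
    "wikipedia_pl": 0.2,
    "wolne_lektury": 0.1,
}


def mixed_stream(sources: Dict[str, List[str]], ratios: Dict[str, float],
                 n: int, seed: int = 0) -> Iterator[Tuple[str, str]]:
    """Yield n (domain, document) pairs, choosing each domain by its ratio."""
    rng = random.Random(seed)
    names = list(ratios)
    weights = [ratios[name] for name in names]
    cursors = {name: 0 for name in names}
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        docs = sources[name]
        doc = docs[cursors[name] % len(docs)]  # cycle when a source runs out
        cursors[name] += 1
        yield name, doc
```

Because the ratios live in a plain dictionary, shifting weight toward Polish text for a later phase would only require changing the configuration, not the loader.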

🧹 Data Quality Improvements

Removed 306 JPEG images embedded as binary in text files
Eliminated 8.1M U+FFFD replacement characters
Cleaned 190K null bytes from training corpus
Result: no binary artifacts, null bytes, or U+FFFD replacement characters remain in the dataset
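The three kinds of corruption listed above can each be detected with a byte-level scan. The following is a minimal sketch of such an audit, not the project's actual tooling; `audit_bytes` and `is_clean` are hypothetical names. It relies on two encoding facts: JPEG files begin with the bytes FF D8 FF, and the byte FF never occurs in valid UTF-8 text.

```python
from typing import Dict

JPEG_SOI = b"\xff\xd8\xff"               # JPEG start-of-image plus next marker byte
REPLACEMENT = "\ufffd".encode("utf-8")   # U+FFFD encoded as EF BF BD


def audit_bytes(data: bytes) -> Dict[str, int]:
    """Count the corruption signatures present in a raw byte string."""
    return {
        "null_bytes": data.count(b"\x00"),
        "replacement_chars": data.count(REPLACEMENT),
        "jpeg_markers": data.count(JPEG_SOI),
    }


def is_clean(data: bytes) -> bool:
    """True when none of the corruption signatures occur."""
    return all(count == 0 for count in audit_bytes(data).values())
```

Running `audit_bytes` over each file's raw bytes and summing the counts would give per-corpus totals comparable to the figures above.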

🎯 Training Objective

Phase 1 — Continue pre-training with custom 64K tokeniser on cleaned corpus
Phase 2 — Domain adaptation for Polish biomedical text (clinical reasoning, metabolic pathways, omics analysis)
Phase 3 — SFT on Polish biomedical instruction dataset

📏 Scale & Progress

Metric	Titan v1	Titan v2
Parameters	450.4M	464.5M (larger embedding: 64K vocab)
Vocab size	50,258 (GPT-2 BPE)	64,001 (custom BPE)
Training tokens	~4.2B	🔄 In progress
Best metric	PPL 27.5	NTP 4.48 @ step 7,700
Languages	EN + PL (fragmented)	EN + PL (native) + biomedical
Cost	~$50	TBD
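The two "Best metric" entries are reported in different units. Assuming NTP denotes the mean next-token cross-entropy in nats (an assumption, since the card does not define it), perplexity is simply its exponential, so the two figures can be put on a common scale:

```python
import math


def loss_to_perplexity(loss_nats: float) -> float:
    """Perplexity is exp(mean cross-entropy) when the loss is in nats."""
    return math.exp(loss_nats)


def perplexity_to_loss(ppl: float) -> float:
    """Inverse conversion: mean cross-entropy in nats."""
    return math.log(ppl)


v2_ppl = loss_to_perplexity(4.48)    # roughly 88
v1_loss = perplexity_to_loss(27.5)   # roughly 3.31
```

Even on a common scale the values are not directly comparable: per-token loss depends on the tokeniser, and v1 and v2 split the same text into different numbers of tokens. A per-byte or per-character normalisation would be needed for a like-for-like comparison.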

💡 The 14M parameter increase (450.4M → 464.5M) comes entirely from the larger embedding matrix (64,001 vs 50,258 tokens). The core architecture is identical.
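A back-of-the-envelope check of this claim: the vocabulary grows by 13,743 tokens, so a 14.1M parameter increase implies roughly 1,026 parameters per added token. That is consistent with a hidden size of 1,024 and a single (tied) embedding matrix, but both of those are inferences from the rounded totals, not figures stated in this card.

```python
V1_VOCAB, V2_VOCAB = 50_258, 64_001
V1_PARAMS, V2_PARAMS = 450.4e6, 464.5e6   # rounded totals from the table

extra_tokens = V2_VOCAB - V1_VOCAB         # 13,743 new vocabulary entries
extra_params = V2_PARAMS - V1_PARAMS       # about 14.1M
implied_width = extra_params / extra_tokens

ASSUMED_D_MODEL = 1024                     # assumption, not stated in the card
predicted_increase = extra_tokens * ASSUMED_D_MODEL   # 14,072,832
```

With untied input and output embeddings the same increase would instead point to a width near 512, so the tied-embedding reading is only the simplest fit.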

📬 Author

Mateusz Piesiak — Project Inkblot · Independent ML Research · South Kirkby, UK

HuggingFace: Mati83moni

📜 Citation

@article{piesiak2026hybridmoe,
  title   = {HybridMoE Titan v1: A Decoder-Only Language Model Combining 
             Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained 
             Mixture-of-Experts at 450M Scale},
  author  = {Piesiak, Mateusz},
  year    = {2026},
  note    = {Project Inkblot — Independent ML Research. 
             Available: https://huggingface.co/Mati83moni/HybridMoE-Titan-v1}
}

📄 License

MIT License

🔬 Actively training — check back for updates · Built with ❤️ by a single researcher
