🚀 HybridMoE Titan v2

Next-Generation Triple Hybrid · Custom Tokeniser · Domain-Mixed Training

Successor to Titan v1, currently in active pre-training on AWS EC2.

🧠 Titan v1 · 📊 Dataset · 📄 Paper · 🌐 Project Page


🔄 What Changes vs v1

| Aspect | Titan v1 | Titan v2 |
|---|---|---|
| Tokeniser | GPT-2 BPE (50,258 tokens) | Custom BPE (64,001 tokens): PL + EN + biomedical + code |
| Polish Support | Fragmented diacritics (ą → 2 tokens) | Native single-token Polish characters |
| Data Pipeline | Sequential loading | Stratified domain mixing with configurable ratios |
| Corruption | Unknown data quality | Full corruption audit: 0 binary artifacts, 0 null bytes |
| Optimizer | AdamW | MuonClip with QK-Clip stabilisation |
| Parallelism | Single GPU | DeepSpeed ZeRO Stage 2 + CPU offload |
| Target | General pre-training | Polish biomedical reasoning specialisation |
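The "ą → 2 tokens" entry is consistent with how a byte-level BPE sees Polish text: every Polish letter with a diacritic occupies two bytes in UTF-8, so a vocabulary that never learned a merge for that byte pair has to spend two tokens on it. The sketch below only shows that byte-level starting point. It does not load either tokeniser, and the helper name `utf8_byte_lengths` is made up for illustration.

```python
from typing import Dict

# Lowercase Polish letters that carry diacritics.
POLISH_DIACRITICS = "ąćęłńóśźż"


def utf8_byte_lengths(chars: str) -> Dict[str, int]:
    """Map each character to the number of bytes in its UTF-8 encoding."""
    return {ch: len(ch.encode("utf-8")) for ch in chars}


if __name__ == "__main__":
    # Plain ASCII letters need one byte; the Polish diacritics need two.
    for ch, n in utf8_byte_lengths("a" + POLISH_DIACRITICS).items():
        print(f"{ch!r}: {n} byte(s)")
```

Whether a particular tokeniser ends up with one or two tokens per letter depends on the merges it learned, which is why a vocabulary trained on Polish text can cover these characters natively.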

🏗️ Architecture

Same triple-hybrid architecture as Titan v1, with capacity scaling under consideration:

| Component | Description |
|---|---|
| 🌀 Mamba SSM | Linear-time selective state updates for local context |
| 🔭 Multi-Head Attention | RoPE-enhanced global attention layers |
| ⚡ Fine-Grained MoE | 32 routed + 2 shared experts, top-2, loss-free balancing |

💡 Core innovations from v1 are preserved: Fixed-Shape Dispatch for gradient checkpointing and Loss-Free EMA Load Balancing for zero dead experts.
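The card does not spell out how the loss-free EMA balancing works, so the following is a minimal sketch of the general bias-based, auxiliary-loss-free idea rather than the Titan implementation. A per-expert bias is added to the router scores for top-2 selection only, and after each batch the bias is nudged down for experts whose smoothed load is above the uniform target and up for those below it. `EMA_DECAY`, `BIAS_STEP`, the synthetic router scores and all function names are assumptions made for illustration.

```python
import random

NUM_ROUTED = 32   # routed experts (the 2 shared experts see every token and are not routed)
TOP_K = 2         # experts selected per token
EMA_DECAY = 0.9   # hypothetical smoothing factor for the load estimate
BIAS_STEP = 0.01  # hypothetical size of each bias correction


def route(scores, bias, k=TOP_K):
    """Return the k experts with the highest biased score for one token.

    The bias only influences which experts are chosen; in this scheme the
    unbiased scores would still be used to weight the expert outputs.
    """
    order = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return order[:k]


def update_bias(bias, ema_load, batch_counts, total_slots):
    """Update the EMA of each expert's load share, then correct its bias in place."""
    target = 1.0 / len(bias)
    for e in range(len(bias)):
        share = batch_counts[e] / total_slots
        ema_load[e] = EMA_DECAY * ema_load[e] + (1 - EMA_DECAY) * share
        if ema_load[e] > target:
            bias[e] -= BIAS_STEP  # overloaded: make this expert less attractive
        elif ema_load[e] < target:
            bias[e] += BIAS_STEP  # underloaded: make this expert more attractive


def simulate(steps=200, tokens_per_step=256, seed=0):
    """Route synthetic tokens through a deliberately skewed router."""
    rng = random.Random(seed)
    skew = [rng.uniform(0.0, 1.0) for _ in range(NUM_ROUTED)]  # some experts are favoured
    bias = [0.0] * NUM_ROUTED
    ema_load = [1.0 / NUM_ROUTED] * NUM_ROUTED
    counts = [0] * NUM_ROUTED
    for _ in range(steps):
        counts = [0] * NUM_ROUTED
        for _ in range(tokens_per_step):
            scores = [skew[e] + rng.gauss(0.0, 0.5) for e in range(NUM_ROUTED)]
            for e in route(scores, bias):
                counts[e] += 1
        update_bias(bias, ema_load, counts, tokens_per_step * TOP_K)
    return counts, bias


if __name__ == "__main__":
    counts, bias = simulate()
    print("min/max tokens per expert in the last step:", min(counts), max(counts))
```

Because the correction happens outside the loss, no balancing term competes with the language-modelling gradient, which is the usual motivation for this family of methods.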


📚 Training Data

Cleaned and audited multilingual corpus from HybridMoE-Training-Dataset-v1:

  • FineWeb-Edu: high-quality educational English text
  • FineWeb: general web English text
  • Wikipedia PL: Polish Wikipedia dumps
  • Wolne Lektury: 7,473 Polish public-domain literary works (488K corrupt lines removed during cleaning)
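One way to read "stratified domain mixing with configurable ratios" is that every batch receives a fixed quota of documents per source instead of whatever order the files happen to load in. The sketch below illustrates that reading under stated assumptions: the ratios, the in-memory document lists and the names `batch_quota` and `mixed_batches` are hypothetical, and the actual pipeline may differ.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Sequence


def batch_quota(ratios: Dict[str, float], batch_size: int) -> Dict[str, int]:
    """Split batch_size across domains by ratio, using largest-remainder rounding."""
    total = sum(ratios.values())
    exact = {name: batch_size * r / total for name, r in ratios.items()}
    quota = {name: int(x) for name, x in exact.items()}
    leftover = batch_size - sum(quota.values())
    # Hand the remaining slots to the domains with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - quota[name], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1
    return quota


def mixed_batches(
    sources: Dict[str, Sequence[str]],
    ratios: Dict[str, float],
    batch_size: int,
) -> Iterator[List[str]]:
    """Yield batches forever, each holding a fixed number of documents per domain."""
    quota = batch_quota(ratios, batch_size)
    # cycle() restarts a small domain when it runs out, so it gets upsampled.
    streams = {name: cycle(docs) for name, docs in sources.items()}
    while True:
        batch: List[str] = []
        for name, n in quota.items():
            batch.extend(next(streams[name]) for _ in range(n))
        yield batch


if __name__ == "__main__":
    # Hypothetical ratios: 2 parts FineWeb-Edu, 1 part Wikipedia PL, 1 part Wolne Lektury.
    ratios = {"fineweb_edu": 2, "wikipedia_pl": 1, "wolne_lektury": 1}
    sources = {
        "fineweb_edu": ["en-0", "en-1", "en-2"],
        "wikipedia_pl": ["wiki-0", "wiki-1"],
        "wolne_lektury": ["lit-0"],
    }
    print(next(mixed_batches(sources, ratios, batch_size=8)))
```

Fixed quotas keep the domain balance identical in every batch, whereas weighted random sampling would only match the ratios on average.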

🧹 Data Quality Improvements

  • Removed 306 JPEG images embedded as binary in text files
  • Eliminated 8.1M U+FFFD replacement characters
  • Cleaned 190K null bytes from training corpus
  • Result: no remaining corruption detected across the entire dataset
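The card does not describe the audit tooling, so the following is only a sketch of how the three problems listed above could be counted in raw bytes: null bytes, UTF-8 encoded U+FFFD replacement characters, and the three-byte marker that opens a JPEG file. `AuditReport` and `audit_bytes` are illustrative names, not part of the project.

```python
from dataclasses import dataclass

JPEG_MAGIC = b"\xff\xd8\xff"                  # start-of-image marker that opens JPEG files
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")   # U+FFFD encoded as the bytes EF BF BD


@dataclass
class AuditReport:
    null_bytes: int
    replacement_chars: int
    jpeg_headers: int

    @property
    def clean(self) -> bool:
        """True when none of the three corruption signatures was found."""
        return (self.null_bytes, self.replacement_chars, self.jpeg_headers) == (0, 0, 0)


def audit_bytes(data: bytes) -> AuditReport:
    """Count three corruption signatures in a raw byte string."""
    return AuditReport(
        null_bytes=data.count(b"\x00"),
        replacement_chars=data.count(REPLACEMENT_UTF8),
        jpeg_headers=data.count(JPEG_MAGIC),
    )


if __name__ == "__main__":
    good = "Zażółć gęślą jaźń".encode("utf-8")
    bad = good + "\ufffd".encode("utf-8") + b"\x00\x00" + JPEG_MAGIC + b"\xe0"
    print(audit_bytes(good))
    print(audit_bytes(bad))
```

Working on bytes rather than decoded strings matters here, because embedded binary data is not valid UTF-8 and would otherwise be hidden or mangled by the decoder.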

🎯 Training Objective

  1. Phase 1: Continue pre-training with custom 64K tokeniser on cleaned corpus
  2. Phase 2: Domain adaptation for Polish biomedical text (clinical reasoning, metabolic pathways, omics analysis)
  3. Phase 3: SFT on Polish biomedical instruction dataset

Scale & Progress

|  | Titan v1 | Titan v2 |
|---|---|---|
| Parameters | 450.4M | 464.5M (larger embedding: 64K vocab) |
| Vocab size | 50,258 (GPT-2 BPE) | 64,001 (custom BPE) |
| Training tokens | ~4.2B | 🔄 In progress |
| Best metric | PPL 27.5 | NTP 4.48 @ step 7,700 |
| Languages | EN + PL (fragmented) | EN + PL (native) + biomedical |
| Cost | ~$50 | TBD |

💡 The 14M parameter increase (450.4M → 464.5M) comes entirely from the larger embedding matrix (64,001 vs 50,258 tokens). The core architecture is identical.
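The arithmetic behind that note can be checked from the rounded figures in the table. The sketch below divides the extra parameters by the extra vocabulary rows and compares the result with a single embedding matrix of width 1024. The hidden size of 1024 and the tied input/output embedding are inferences from the numbers, not values stated in the card.

```python
V1_VOCAB, V2_VOCAB = 50_258, 64_001
V1_PARAMS, V2_PARAMS = 450.4e6, 464.5e6  # rounded totals from the table

extra_tokens = V2_VOCAB - V1_VOCAB        # additional vocabulary rows
extra_params = V2_PARAMS - V1_PARAMS      # roughly 14.1M additional parameters
params_per_token = extra_params / extra_tokens


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 1024  # assumption: inferred from the ratio above, not given in the source
delta = embedding_params(V2_VOCAB, HIDDEN_SIZE) - embedding_params(V1_VOCAB, HIDDEN_SIZE)

if __name__ == "__main__":
    print(f"extra tokens:        {extra_tokens:,}")
    print(f"params per new token: {params_per_token:.1f}")
    print(f"delta at width {HIDDEN_SIZE}: {delta:,}")
```

If the input and output embeddings were untied, each new token would cost two rows, and the same totals would instead point to a width of about half that size.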


📬 Author

Mateusz Piesiak, Project Inkblot · Independent ML Research · South Kirkby, UK


📜 Citation

@article{piesiak2026hybridmoe,
  title   = {HybridMoE Titan v1: A Decoder-Only Language Model Combining 
             Mamba SSM, Multi-Head Attention with RoPE, and Fine-Grained 
             Mixture-of-Experts at 450M Scale},
  author  = {Piesiak, Mateusz},
  year    = {2026},
  note    = {Project Inkblot -- Independent ML Research. 
             Available: https://huggingface.co/Mati83moni/HybridMoE-Titan-v1}
}

📄 License

MIT License


🔬 Actively training, check back for updates · Built with ❤️ by a single researcher
