Morpheus-TR-50K
Morpheus is a neural morpheme-aware tokenizer for Turkish — an agglutinative language whose semantic content is densely packed into productive suffix chains. It combines unsupervised morphological supervision (Morfessor) with self-supervised objectives (SGNS, contrastive, MLM) to learn segmentations that are simultaneously morphologically aligned and language-modeling-friendly.
evlerimizdekiler → ev | leri | miz | deki | ler
root PL+POSS POSS LOC+REL PL
("the ones in our houses")
Where classical BPE/WordPiece fragment morphologically rich Turkish words into statistically convenient but linguistically opaque subwords, Morpheus produces interpretable morpheme-level segmentations while also yielding structured word embeddings with strong root-family clustering (274× intra/inter cosine separation).
Headline Results
Evaluated on a curated monolingual Turkish corpus (informal + academic + news), against five baseline tokenizers under a fixed 58 M-parameter (param-equalized) GPT autoregressive LM:
| Metric | Morpheus | Best baseline | Δ |
|---|---|---|---|
| BPC ↓ | 1.640 | 1.698 (WordPiece) | −3.5% |
| Per-token perplexity ↓ | 152 | 523 (WordPiece) | 3.4× lower |
| MAS OOV ↑ | 92.4% | 5.76% (WordPiece) | 16× higher |
| Boundary F1 OOV | 0.916 | — | (segmentation generalization) |
| GPU memory (B=32 gen) | 2,270 MB | 3,724 MB (classical) | −39% |
| Tokenizer artifact size | 0.67 MB | 2.65 MB (WordPiece) | −75% |
| End-to-end gen char/sec | 548 | 924 (WordPiece) | −41% (trade-off) |
Morpheus achieves the lowest BPC across all evaluated tokenizers, and notably outperforms its own teacher Morfessor (BPC 1.640 vs 1.705, −3.9%) — demonstrating that joint training with downstream LM signals refines morpheme detection beyond pure unsupervised MDL boundaries.
Model Details
- Developed by: Tolga Şakar
- Model type: Neural morpheme-aware tokenizer (CharEncoder → BoundaryDetector → SegmentEncoder)
- Language(s): Turkish (
tr) - License: MIT
- Trained from: Scratch (no base model)
- Vocabulary size: 35,092 morpheme-level tokens
- Word embedding dimension: 320
- Parameters (Morpheus core): ~7.4M
- Parameters (with MLM head, training only): ~15M
Repository
- Code: TurkishTokenizer-Alpha-v1
- Corpus collection tooling: CorpusCollector
- Paper: arXiv preprint forthcoming
Usage
Morpheus is a neural tokenizer — it requires both a vocab file and a trained PyTorch model checkpoint. Standard AutoTokenizer.from_pretrained() does not apply.
Install
git clone https://github.com/lonewolf-rd/TurkishTokenizer-Alpha-v1.git
cd TurkishTokenizer-Alpha-v1
pip install -e .
Download artifacts from this repo
from huggingface_hub import snapshot_download
local_dir = snapshot_download(repo_id="lonewolflab/Morpheus-TR-50K")
# Downloads:
# - morpheus_50k/vocab.json
# - morpheus_50k/tokenizer_config.json
# - turkish_morpheus_a100_v3_final.pt
Load and tokenize
import torch, sys
from pathlib import Path
from src.model_development.model.morpheus import Morpheus
from src.model_development.training.trainer import TrainingConfig
from src.model_development.tokenization.morpheus_tokenizer import MorpheusTokenizer
# TrainingConfig pickle workaround (checkpoint contains the config object)
sys.modules["__main__"].TrainingConfig = TrainingConfig
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load Morpheus neural model
ckpt = torch.load(f"{local_dir}/turkish_morpheus_a100_v3_final.pt",
map_location=device, weights_only=False)
cfg = ckpt["config"]
morpheus = Morpheus(
char_dim=cfg.char_dim, char_embed_dim=cfg.char_embed_dim,
case_embed_dim=cfg.case_embed_dim,
n_layers_encoder=cfg.n_layers_encoder,
n_layers_detector=cfg.n_layers_detector,
num_heads=cfg.num_heads, max_word_len=cfg.max_word_len,
max_segs=cfg.max_segs, dropout=cfg.dropout,
threshold=cfg.threshold, pos_weight=cfg.pos_weight,
count_loss_w=getattr(cfg, "count_loss_w", 0.3),
)
morpheus.load_state_dict(ckpt["model_state"])
morpheus.to(device).eval()
# Load tokenizer with the model
tokenizer = MorpheusTokenizer.load(
f"{local_dir}/morpheus_50k",
morpheus_model=morpheus,
device=device,
)
# Use
text = "evlerimizdekiler kitapçısı muvaffakiyetsizleştiriciler"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokens)
# ['ev', 'leri', 'miz', 'deki', 'ler', 'kitap', 'çı', 'sı',
# 'muvaffak', 'iyet', 'siz', 'leştir', 'ici', 'ler']
print(ids) # corresponding integer IDs
decoded = tokenizer.decode(ids)
print(decoded)
# "evlerimizdekiler kitapçısı muvaffakiyetsizleştiriciler"
Use as Hugging Face dataset preprocessing
from datasets import load_dataset
ds = load_dataset("oscar-corpus/OSCAR-2301", "tr", streaming=True, split="train")
def tokenize_fn(example):
return {"input_ids": tokenizer.encode(example["text"])}
tokenized = ds.map(tokenize_fn, batched=False)
Architecture
char_ids, case_flags (B=batch, L=32 max word chars)
│
▼
CharEncoder Multi-scale CNN + 3 × RoPE self-attention
│ Output: (B, L, 320) context-aware char vectors
▼
BoundaryDetector 4 × RoPE attention with adjacent-pair scoring
│ Deep-supervised aux loss vs Morfessor labels
│ Output: boundary_probs ∈ [0,1] of shape (B, L−1)
▼
SegmentEncoder Poisson-binomial DP → soft segment membership
│ Char-level attention pooling per segment
│ Mean over valid segments + FFN + LayerNorm
▼
word_embedding ∈ ℝ³²⁰
Soft segmentation via Poisson-binomial dynamic programming: given boundary probabilities p_i ∈ [0,1] between adjacent characters, the probability that character i belongs to segment k is computed via:
f[i, k] = f[i−1, k] · (1 − p_i) + f[i−1, k−1] · p_i
This gives a differentiable membership matrix that converges to one-hot segment assignments as p_i → {0,1}, recovering hard segmentation at inference without architectural switch.
Training Data
Trained on a Monolingual Turkish corpus (~17M words, 261K lines) covering three registers:
- Ekşisözlük — informal/colloquial Turkish (rich morphological constructs)
- Dergipark — academic/formal Turkish (diverse derivational morphology)
- Turkish news sites — standard journalistic register
Collected with the companion repository CorpusCollector which documents source URLs, scraping protocol, rate-limiting, and per-source preprocessing logic for full reproducibility.
Split: 95% train (248K lines, ~16M words), 5% test (13K lines).
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 22 |
| Effective batch size | 512 (256 × grad_accum 2) |
| Optimizer | AdamW, β₁=0.9, β₂=0.98 |
| Learning rate | 3e-4 with cosine decay to 0.2× |
| Warmup steps | 2,500 |
| Gradient clip | 0.5 |
| Char dim | 320 |
| Encoder/Detector layers | 3 / 4 |
| Attention heads | 5 (head_dim=64) |
| Max word len | 32 |
| Max sentence len | 32 |
| Max segments per word | 12 |
| Dropout | 0.1 |
| Mixed precision | TF32 (matmul + cuDNN) |
Loss
Total loss is a weighted sum:
L = w_aux · L_aux + w_sgns · L_sgns + w_ctr · L_contrastive + w_mlm · L_mlm
| Component | Role | Weight schedule |
|---|---|---|
L_aux |
Boundary BCE + count MSE vs Morfessor (deep-supervised) | Decays 0.50 → 0.08 over 22 epochs |
L_sgns |
Skip-gram with 16 negatives, ±6 window, 120K context vocab | Constant 0.7 |
L_ctr |
InfoNCE on root identity, temperature 0.10 | Constant 0.3 |
L_mlm |
Char autoregressive reconstruction of masked words (20% mask rate) | Constant 1.0 |
The aux schedule realizes a curriculum: early epochs anchor on Morfessor (boundary learning); as it decays, distributional signals (SGNS, MLM) take over to shape semantic geometry.
Hardware
- Single NVIDIA A100 80GB
- ~18 minutes/epoch × 22 epochs ≈ 6.5 hours total training
Evaluation
Intrinsic — morphological alignment
Stratified Turkish test set:
| Stratum | n | Boundary F1 | MAS % | Coherence Δ |
|---|---|---|---|---|
| seen | 800 | 1.000 | 100.0 | 0.982 |
| oov | 400 | 0.916 | 92.4 | 0.915 |
| curated_oov | 10 | 1.000 | 100.0 | — |
| nonce | 5 | 0.788 | 72.2 | — |
(seen results are tautological — model trained on Morfessor labels; OOV is the generalization test.)
Downstream language modeling
Param-equalized 58 M-parameter GPT, 2 epochs over Turkish corpus:
| Tokenizer | vocab | BPC ↓ | Token PPL ↓ | Fertility |
|---|---|---|---|---|
| Morpheus | 35K | 1.640 | 152 | 1.26 |
| WordPiece | 64K | 1.698 | 523 | 1.01 |
| Morfessor | 30K | 1.705 | 135 | 1.31 |
| Unigram | 64K | 1.726 | 344 | 1.05 |
| BPE | 64K | 1.737 | 397 | 1.03 |
| ByteBPE | 64K | 1.755 | 383 | 1.11 |
Inference
| Tokenizer | Encode kchar/s | Gen char/s (B=1) | Gen char/s (B=32) | GPU mem B=32 |
|---|---|---|---|---|
| Morpheus | 3,885 | 548 | 13,843 | 2,270 MB |
| WordPiece | 2,100 | 924 | 23,497 | 3,724 MB |
| BPE | 984 | 853 | 21,138 | 3,724 MB |
Pareto frontier: {Morpheus, WordPiece} on (BPC, throughput). Morpheus dominates on (BPC, memory) and tokenizer artifact size.
Intended Uses
Recommended
- Turkish NLU / classification tasks (morpheme-aligned tokens help attention head specialization)
- Morphologically-sensitive information retrieval (stemming, suffix-aware query expansion)
- Pretraining small-to-medium Turkish language models (≤1B parameters)
- Linguistic research / corpus annotation at scale
- Educational tools for Turkish morphology visualization
- Memory-constrained inference (39% lower GPU memory vs classical 64K-vocab tokenizers)
Not recommended
- Real-time generation where latency dominates (~40% slower char generation throughput than BPE)
- Multilingual models (Turkish-specific by design)
- Frontier-scale LLM pretraining (>10B parameters) where inductive-bias advantage of morphology saturates
Limitations
- Single language: trained for Turkish only. Cannot tokenize other languages meaningfully.
- Single domain mix: corpus is informal + academic + news. May underperform on highly specialized domains (medical, legal) without further adaptation.
- Generation throughput trade-off: ~40% slower end-to-end character generation than BPE in autoregressive sampling due to higher fertility (1.26 vs 1.01 tokens/char). Acceptable for batch/offline tasks; potential bottleneck for real-time chat.
- Verb morphology: SIGMORPHON evaluation shows Morpheus has slightly higher error rate on verb lemma identification than nominal lemmas — Turkish verb inflection (tense + aspect + person stacks) is the residual hard case.
- Tokenizer requires neural model: unlike BPE/SentencePiece, Morpheus needs a PyTorch model checkpoint loaded for inference. Adds 60 MB to deployment footprint.
Bias and Ethical Considerations
- The training corpus reflects the bias of its three source registers (Ekşisözlük, Dergipark, Turkish news). It is monolingual Turkish and does not include code-switched or dialectal variants (Cypriot Turkish, Azerbaijani Turkish, Karachay-Balkar etc.).
- Ekşisözlük content was scraped from public entries. CorpusCollector documents the scraping protocol and rate-limiting; users adapting this corpus for derivative works should review the source TOS for their region.
- No personally identifiable information (PII) filtering was applied beyond Wikipedia-style cleanup. Downstream users should add PII redaction if deploying in sensitive contexts.
Citation
@misc{sakar2026morpheus,
title = {Morpheus: A Morphology-Aware Tokenizer for Turkish},
author = {Şakar, Tolga},
year = {2026},
note = {Preprint forthcoming on arXiv}
}
Related prior work by the same author:
@article{sakar2025rag,
title = {Maximizing {RAG} efficiency: A comparative analysis of {RAG} methods},
author = {Şakar, Tolga and Emekci, Hakan},
journal = {Natural Language Processing},
volume = {31},
number = {1},
year = {2025},
publisher = {Cambridge University Press}
}
Acknowledgments
- Morfessor (Creutz & Lagus, 2002, 2007) — unsupervised morphological supervisor
- SentencePiece (Kudo & Richardson, 2018) and HuggingFace tokenizers — baseline tokenizer implementations
- SIGMORPHON 2022 Turkish task — inflection gold standard for morphological evaluation
- The Turkish NLP community for prior work (BERTurk, TURNA, Zemberek, TRMorph) that motivated this study
License
MIT License. See LICENSE.
Repository: TurkishTokenizer-Alpha-v1 · Author: Tolga Şakar
- Downloads last month
- 60
Space using lonewolflab/Morpheus-TR-50K 1
Evaluation results
- BPC (lower is better) on Turkish curated corpusself-reported1.640
- Boundary F1 (vs Morfessor) on Held-out Turkish OOV stratumself-reported0.916
- Morphological Alignment Score % (OOV) on Held-out Turkish OOV stratumself-reported92.400