Morpheus-TR-50K

Morpheus is a neural morpheme-aware tokenizer for Turkish — an agglutinative language whose semantic content is densely packed into productive suffix chains. It combines unsupervised morphological supervision (Morfessor) with self-supervised objectives (SGNS, contrastive, MLM) to learn segmentations that are simultaneously morphologically aligned and language-modeling-friendly.

evlerimizdekiler  →  ev | leri | miz | deki | ler
                     root  PL+POSS POSS  LOC+REL  PL
                    ("the ones in our houses")

Where classical BPE/WordPiece fragment morphologically rich Turkish words into statistically convenient but linguistically opaque subwords, Morpheus produces interpretable morpheme-level segmentations while also yielding structured word embeddings with strong root-family clustering (274× intra/inter cosine separation).

Headline Results

Evaluated on a curated monolingual Turkish corpus (informal + academic + news), against five baseline tokenizers under a fixed 58 M-parameter (param-equalized) GPT autoregressive LM:

Metric	Morpheus	Best baseline	Δ
BPC ↓	1.640	1.698 (WordPiece)	−3.5%
Per-token perplexity ↓	152	523 (WordPiece)	3.4× lower
MAS OOV ↑	92.4%	5.76% (WordPiece)	16× higher
Boundary F1 OOV	0.916	—	(segmentation generalization)
GPU memory (B=32 gen)	2,270 MB	3,724 MB (classical)	−39%
Tokenizer artifact size	0.67 MB	2.65 MB (WordPiece)	−75%
End-to-end gen char/sec	548	924 (WordPiece)	−41% (trade-off)

Morpheus achieves the lowest BPC across all evaluated tokenizers, and notably outperforms its own teacher Morfessor (BPC 1.640 vs 1.705, −3.9%) — demonstrating that joint training with downstream LM signals refines morpheme detection beyond pure unsupervised MDL boundaries.

Model Details

Developed by: Tolga Şakar
Model type: Neural morpheme-aware tokenizer (CharEncoder → BoundaryDetector → SegmentEncoder)
Language(s): Turkish (tr)
License: MIT
Trained from: Scratch (no base model)
Vocabulary size: 35,092 morpheme-level tokens
Word embedding dimension: 320
Parameters (Morpheus core): ~7.4M
Parameters (with MLM head, training only): ~15M

Repository

Code: TurkishTokenizer-Alpha-v1
Corpus collection tooling: CorpusCollector
Paper: arXiv preprint forthcoming

Usage

Morpheus is a neural tokenizer — it requires both a vocab file and a trained PyTorch model checkpoint. Standard AutoTokenizer.from_pretrained() does not apply.

Install

git clone https://github.com/lonewolf-rd/TurkishTokenizer-Alpha-v1.git
cd TurkishTokenizer-Alpha-v1
pip install -e .

Download artifacts from this repo

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lonewolflab/Morpheus-TR-50K")
# Downloads:
#   - morpheus_50k/vocab.json
#   - morpheus_50k/tokenizer_config.json
#   - turkish_morpheus_a100_v3_final.pt

Load and tokenize

import torch, sys
from pathlib import Path
from src.model_development.model.morpheus import Morpheus
from src.model_development.training.trainer import TrainingConfig
from src.model_development.tokenization.morpheus_tokenizer import MorpheusTokenizer

# TrainingConfig pickle workaround (checkpoint contains the config object)
sys.modules["__main__"].TrainingConfig = TrainingConfig

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load Morpheus neural model
ckpt = torch.load(f"{local_dir}/turkish_morpheus_a100_v3_final.pt",
                  map_location=device, weights_only=False)
cfg = ckpt["config"]
morpheus = Morpheus(
    char_dim=cfg.char_dim, char_embed_dim=cfg.char_embed_dim,
    case_embed_dim=cfg.case_embed_dim,
    n_layers_encoder=cfg.n_layers_encoder,
    n_layers_detector=cfg.n_layers_detector,
    num_heads=cfg.num_heads, max_word_len=cfg.max_word_len,
    max_segs=cfg.max_segs, dropout=cfg.dropout,
    threshold=cfg.threshold, pos_weight=cfg.pos_weight,
    count_loss_w=getattr(cfg, "count_loss_w", 0.3),
)
morpheus.load_state_dict(ckpt["model_state"])
morpheus.to(device).eval()

# Load tokenizer with the model
tokenizer = MorpheusTokenizer.load(
    f"{local_dir}/morpheus_50k",
    morpheus_model=morpheus,
    device=device,
)

# Use
text = "evlerimizdekiler kitapçısı muvaffakiyetsizleştiriciler"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokens)
# ['ev', 'leri', 'miz', 'deki', 'ler', 'kitap', 'çı', 'sı',
#  'muvaffak', 'iyet', 'siz', 'leştir', 'ici', 'ler']

print(ids)  # corresponding integer IDs

decoded = tokenizer.decode(ids)
print(decoded)
# "evlerimizdekiler kitapçısı muvaffakiyetsizleştiriciler"

Use as Hugging Face dataset preprocessing

from datasets import load_dataset

ds = load_dataset("oscar-corpus/OSCAR-2301", "tr", streaming=True, split="train")

def tokenize_fn(example):
    return {"input_ids": tokenizer.encode(example["text"])}

tokenized = ds.map(tokenize_fn, batched=False)

Architecture

char_ids, case_flags  (B=batch, L=32 max word chars)
        │
        ▼
   CharEncoder              Multi-scale CNN + 3 × RoPE self-attention
        │                   Output: (B, L, 320) context-aware char vectors
        ▼
  BoundaryDetector          4 × RoPE attention with adjacent-pair scoring
        │                   Deep-supervised aux loss vs Morfessor labels
        │                   Output: boundary_probs ∈ [0,1] of shape (B, L−1)
        ▼
  SegmentEncoder            Poisson-binomial DP → soft segment membership
        │                   Char-level attention pooling per segment
        │                   Mean over valid segments + FFN + LayerNorm
        ▼
   word_embedding ∈ ℝ³²⁰

Soft segmentation via Poisson-binomial dynamic programming: given boundary probabilities p_i ∈ [0,1] between adjacent characters, the probability that character i belongs to segment k is computed via:

f[i, k] = f[i−1, k] · (1 − p_i) + f[i−1, k−1] · p_i

This gives a differentiable membership matrix that converges to one-hot segment assignments as p_i → {0,1}, recovering hard segmentation at inference without architectural switch.

Training Data

Trained on a Monolingual Turkish corpus (~17M words, 261K lines) covering three registers:

Ekşisözlük — informal/colloquial Turkish (rich morphological constructs)
Dergipark — academic/formal Turkish (diverse derivational morphology)
Turkish news sites — standard journalistic register

Collected with the companion repository CorpusCollector which documents source URLs, scraping protocol, rate-limiting, and per-source preprocessing logic for full reproducibility.

Split: 95% train (~~248K lines, ~16M words), 5% test (~~13K lines).

Training Procedure

Hyperparameters

Parameter	Value
Epochs	22
Effective batch size	512 (256 × grad_accum 2)
Optimizer	AdamW, β₁=0.9, β₂=0.98
Learning rate	3e-4 with cosine decay to 0.2×
Warmup steps	2,500
Gradient clip	0.5
Char dim	320
Encoder/Detector layers	3 / 4
Attention heads	5 (head_dim=64)
Max word len	32
Max sentence len	32
Max segments per word	12
Dropout	0.1
Mixed precision	TF32 (matmul + cuDNN)

Loss

Total loss is a weighted sum:

L = w_aux · L_aux + w_sgns · L_sgns + w_ctr · L_contrastive + w_mlm · L_mlm

Component	Role	Weight schedule
`L_aux`	Boundary BCE + count MSE vs Morfessor (deep-supervised)	Decays 0.50 → 0.08 over 22 epochs
`L_sgns`	Skip-gram with 16 negatives, ±6 window, 120K context vocab	Constant 0.7
`L_ctr`	InfoNCE on root identity, temperature 0.10	Constant 0.3
`L_mlm`	Char autoregressive reconstruction of masked words (20% mask rate)	Constant 1.0

The aux schedule realizes a curriculum: early epochs anchor on Morfessor (boundary learning); as it decays, distributional signals (SGNS, MLM) take over to shape semantic geometry.

Hardware

Single NVIDIA A100 80GB
~18 minutes/epoch × 22 epochs ≈ 6.5 hours total training

Evaluation

Intrinsic — morphological alignment

Stratified Turkish test set:

Stratum	n	Boundary F1	MAS %	Coherence Δ
seen	800	1.000	100.0	0.982
oov	400	0.916	92.4	0.915
curated_oov	10	1.000	100.0	—
nonce	5	0.788	72.2	—

(seen results are tautological — model trained on Morfessor labels; OOV is the generalization test.)

Downstream language modeling

Param-equalized 58 M-parameter GPT, 2 epochs over Turkish corpus:

Tokenizer	vocab	BPC ↓	Token PPL ↓	Fertility
Morpheus	35K	1.640	152	1.26
WordPiece	64K	1.698	523	1.01
Morfessor	30K	1.705	135	1.31
Unigram	64K	1.726	344	1.05
BPE	64K	1.737	397	1.03
ByteBPE	64K	1.755	383	1.11

Inference

Tokenizer	Encode kchar/s	Gen char/s (B=1)	Gen char/s (B=32)	GPU mem B=32
Morpheus	3,885	548	13,843	2,270 MB
WordPiece	2,100	924	23,497	3,724 MB
BPE	984	853	21,138	3,724 MB

Pareto frontier: {Morpheus, WordPiece} on (BPC, throughput). Morpheus dominates on (BPC, memory) and tokenizer artifact size.

Intended Uses

Turkish NLU / classification tasks (morpheme-aligned tokens help attention head specialization)
Morphologically-sensitive information retrieval (stemming, suffix-aware query expansion)
Pretraining small-to-medium Turkish language models (≤1B parameters)
Linguistic research / corpus annotation at scale
Educational tools for Turkish morphology visualization
Memory-constrained inference (39% lower GPU memory vs classical 64K-vocab tokenizers)

Not recommended

Real-time generation where latency dominates (~40% slower char generation throughput than BPE)
Multilingual models (Turkish-specific by design)
Frontier-scale LLM pretraining (>10B parameters) where inductive-bias advantage of morphology saturates

Limitations

Single language: trained for Turkish only. Cannot tokenize other languages meaningfully.
Single domain mix: corpus is informal + academic + news. May underperform on highly specialized domains (medical, legal) without further adaptation.
Generation throughput trade-off: ~40% slower end-to-end character generation than BPE in autoregressive sampling due to higher fertility (1.26 vs 1.01 tokens/char). Acceptable for batch/offline tasks; potential bottleneck for real-time chat.
Verb morphology: SIGMORPHON evaluation shows Morpheus has slightly higher error rate on verb lemma identification than nominal lemmas — Turkish verb inflection (tense + aspect + person stacks) is the residual hard case.
Tokenizer requires neural model: unlike BPE/SentencePiece, Morpheus needs a PyTorch model checkpoint loaded for inference. Adds 60 MB to deployment footprint.

Bias and Ethical Considerations

The training corpus reflects the bias of its three source registers (Ekşisözlük, Dergipark, Turkish news). It is monolingual Turkish and does not include code-switched or dialectal variants (Cypriot Turkish, Azerbaijani Turkish, Karachay-Balkar etc.).
Ekşisözlük content was scraped from public entries. CorpusCollector documents the scraping protocol and rate-limiting; users adapting this corpus for derivative works should review the source TOS for their region.
No personally identifiable information (PII) filtering was applied beyond Wikipedia-style cleanup. Downstream users should add PII redaction if deploying in sensitive contexts.

Citation

@misc{sakar2026morpheus,
  title  = {Morpheus: A Morphology-Aware Tokenizer for Turkish},
  author = {Şakar, Tolga},
  year   = {2026},
  note   = {Preprint forthcoming on arXiv}
}

Related prior work by the same author:

@article{sakar2025rag,
  title     = {Maximizing {RAG} efficiency: A comparative analysis of {RAG} methods},
  author    = {Şakar, Tolga and Emekci, Hakan},
  journal   = {Natural Language Processing},
  volume    = {31},
  number    = {1},
  year      = {2025},
  publisher = {Cambridge University Press}
}

Acknowledgments

Morfessor (Creutz & Lagus, 2002, 2007) — unsupervised morphological supervisor
SentencePiece (Kudo & Richardson, 2018) and HuggingFace tokenizers — baseline tokenizer implementations
SIGMORPHON 2022 Turkish task — inflection gold standard for morphological evaluation
The Turkish NLP community for prior work (BERTurk, TURNA, Zemberek, TRMorph) that motivated this study

License

MIT License. See LICENSE.

Repository: TurkishTokenizer-Alpha-v1 · Author: Tolga Şakar

Downloads last month: 60

Space using lonewolflab/Morpheus-TR-50K 1

Evaluation results

BPC (lower is better) on Turkish curated corpus
self-reported

1.640
Boundary F1 (vs Morfessor) on Held-out Turkish OOV stratum
self-reported

0.916
Morphological Alignment Score % (OOV) on Held-out Turkish OOV stratum
self-reported

92.400

lonewolflab
/

Morpheus-TR-50K