Morpheus-TR-50K

Morpheus is a neural morpheme-aware tokenizer for Turkish — an agglutinative language whose semantic content is densely packed into productive suffix chains. It combines unsupervised morphological supervision (Morfessor) with self-supervised objectives (SGNS, contrastive, MLM) to learn segmentations that are simultaneously morphologically aligned and language-modeling-friendly.

evlerimizdekiler  →  ev | leri | miz | deki | ler
                     root  PL+POSS POSS  LOC+REL  PL
                    ("the ones in our houses")

Where classical BPE/WordPiece fragment morphologically rich Turkish words into statistically convenient but linguistically opaque subwords, Morpheus produces interpretable morpheme-level segmentations while also yielding structured word embeddings with strong root-family clustering (274× intra/inter cosine separation).

Headline Results

Evaluated on a curated monolingual Turkish corpus (informal + academic + news), against five baseline tokenizers under a fixed 58 M-parameter (param-equalized) GPT autoregressive LM:

Metric Morpheus Best baseline Δ
BPC ↓ 1.640 1.698 (WordPiece) −3.5%
Per-token perplexity ↓ 152 523 (WordPiece) 3.4× lower
MAS OOV ↑ 92.4% 5.76% (WordPiece) 16× higher
Boundary F1 OOV 0.916 (segmentation generalization)
GPU memory (B=32 gen) 2,270 MB 3,724 MB (classical) −39%
Tokenizer artifact size 0.67 MB 2.65 MB (WordPiece) −75%
End-to-end gen char/sec 548 924 (WordPiece) −41% (trade-off)

Morpheus achieves the lowest BPC across all evaluated tokenizers, and notably outperforms its own teacher Morfessor (BPC 1.640 vs 1.705, −3.9%) — demonstrating that joint training with downstream LM signals refines morpheme detection beyond pure unsupervised MDL boundaries.

Model Details

  • Developed by: Tolga Şakar
  • Model type: Neural morpheme-aware tokenizer (CharEncoder → BoundaryDetector → SegmentEncoder)
  • Language(s): Turkish (tr)
  • License: MIT
  • Trained from: Scratch (no base model)
  • Vocabulary size: 35,092 morpheme-level tokens
  • Word embedding dimension: 320
  • Parameters (Morpheus core): ~7.4M
  • Parameters (with MLM head, training only): ~15M

Repository

Usage

Morpheus is a neural tokenizer — it requires both a vocab file and a trained PyTorch model checkpoint. Standard AutoTokenizer.from_pretrained() does not apply.

Install

git clone https://github.com/lonewolf-rd/TurkishTokenizer-Alpha-v1.git
cd TurkishTokenizer-Alpha-v1
pip install -e .

Download artifacts from this repo

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lonewolflab/Morpheus-TR-50K")
# Downloads:
#   - morpheus_50k/vocab.json
#   - morpheus_50k/tokenizer_config.json
#   - turkish_morpheus_a100_v3_final.pt

Load and tokenize

import torch, sys
from pathlib import Path
from src.model_development.model.morpheus import Morpheus
from src.model_development.training.trainer import TrainingConfig
from src.model_development.tokenization.morpheus_tokenizer import MorpheusTokenizer

# TrainingConfig pickle workaround (checkpoint contains the config object)
sys.modules["__main__"].TrainingConfig = TrainingConfig

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load Morpheus neural model
ckpt = torch.load(f"{local_dir}/turkish_morpheus_a100_v3_final.pt",
                  map_location=device, weights_only=False)
cfg = ckpt["config"]
morpheus = Morpheus(
    char_dim=cfg.char_dim, char_embed_dim=cfg.char_embed_dim,
    case_embed_dim=cfg.case_embed_dim,
    n_layers_encoder=cfg.n_layers_encoder,
    n_layers_detector=cfg.n_layers_detector,
    num_heads=cfg.num_heads, max_word_len=cfg.max_word_len,
    max_segs=cfg.max_segs, dropout=cfg.dropout,
    threshold=cfg.threshold, pos_weight=cfg.pos_weight,
    count_loss_w=getattr(cfg, "count_loss_w", 0.3),
)
morpheus.load_state_dict(ckpt["model_state"])
morpheus.to(device).eval()

# Load tokenizer with the model
tokenizer = MorpheusTokenizer.load(
    f"{local_dir}/morpheus_50k",
    morpheus_model=morpheus,
    device=device,
)

# Use
text = "evlerimizdekiler kitapçısı muvaffakiyetsizleştiriciler"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokens)
# ['ev', 'leri', 'miz', 'deki', 'ler', 'kitap', 'çı', 'sı',
#  'muvaffak', 'iyet', 'siz', 'leştir', 'ici', 'ler']

print(ids)  # corresponding integer IDs

decoded = tokenizer.decode(ids)
print(decoded)
# "evlerimizdekiler kitapçısı muvaffakiyetsizleştiriciler"

Use as Hugging Face dataset preprocessing

from datasets import load_dataset

ds = load_dataset("oscar-corpus/OSCAR-2301", "tr", streaming=True, split="train")

def tokenize_fn(example):
    return {"input_ids": tokenizer.encode(example["text"])}

tokenized = ds.map(tokenize_fn, batched=False)

Architecture

char_ids, case_flags  (B=batch, L=32 max word chars)
        │
        ▼
   CharEncoder              Multi-scale CNN + 3 × RoPE self-attention
        │                   Output: (B, L, 320) context-aware char vectors
        ▼
  BoundaryDetector          4 × RoPE attention with adjacent-pair scoring
        │                   Deep-supervised aux loss vs Morfessor labels
        │                   Output: boundary_probs ∈ [0,1] of shape (B, L−1)
        ▼
  SegmentEncoder            Poisson-binomial DP → soft segment membership
        │                   Char-level attention pooling per segment
        │                   Mean over valid segments + FFN + LayerNorm
        ▼
   word_embedding ∈ ℝ³²⁰

Soft segmentation via Poisson-binomial dynamic programming: given boundary probabilities p_i ∈ [0,1] between adjacent characters, the probability that character i belongs to segment k is computed via:

f[i, k] = f[i−1, k] · (1 − p_i) + f[i−1, k−1] · p_i

This gives a differentiable membership matrix that converges to one-hot segment assignments as p_i → {0,1}, recovering hard segmentation at inference without architectural switch.

Training Data

Trained on a Monolingual Turkish corpus (~17M words, 261K lines) covering three registers:

  • Ekşisözlük — informal/colloquial Turkish (rich morphological constructs)
  • Dergipark — academic/formal Turkish (diverse derivational morphology)
  • Turkish news sites — standard journalistic register

Collected with the companion repository CorpusCollector which documents source URLs, scraping protocol, rate-limiting, and per-source preprocessing logic for full reproducibility.

Split: 95% train (248K lines, ~16M words), 5% test (13K lines).

Training Procedure

Hyperparameters

Parameter Value
Epochs 22
Effective batch size 512 (256 × grad_accum 2)
Optimizer AdamW, β₁=0.9, β₂=0.98
Learning rate 3e-4 with cosine decay to 0.2×
Warmup steps 2,500
Gradient clip 0.5
Char dim 320
Encoder/Detector layers 3 / 4
Attention heads 5 (head_dim=64)
Max word len 32
Max sentence len 32
Max segments per word 12
Dropout 0.1
Mixed precision TF32 (matmul + cuDNN)

Loss

Total loss is a weighted sum:

L = w_aux · L_aux + w_sgns · L_sgns + w_ctr · L_contrastive + w_mlm · L_mlm
Component Role Weight schedule
L_aux Boundary BCE + count MSE vs Morfessor (deep-supervised) Decays 0.50 → 0.08 over 22 epochs
L_sgns Skip-gram with 16 negatives, ±6 window, 120K context vocab Constant 0.7
L_ctr InfoNCE on root identity, temperature 0.10 Constant 0.3
L_mlm Char autoregressive reconstruction of masked words (20% mask rate) Constant 1.0

The aux schedule realizes a curriculum: early epochs anchor on Morfessor (boundary learning); as it decays, distributional signals (SGNS, MLM) take over to shape semantic geometry.

Hardware

  • Single NVIDIA A100 80GB
  • ~18 minutes/epoch × 22 epochs ≈ 6.5 hours total training

Evaluation

Intrinsic — morphological alignment

Stratified Turkish test set:

Stratum n Boundary F1 MAS % Coherence Δ
seen 800 1.000 100.0 0.982
oov 400 0.916 92.4 0.915
curated_oov 10 1.000 100.0
nonce 5 0.788 72.2

(seen results are tautological — model trained on Morfessor labels; OOV is the generalization test.)

Downstream language modeling

Param-equalized 58 M-parameter GPT, 2 epochs over Turkish corpus:

Tokenizer vocab BPC ↓ Token PPL ↓ Fertility
Morpheus 35K 1.640 152 1.26
WordPiece 64K 1.698 523 1.01
Morfessor 30K 1.705 135 1.31
Unigram 64K 1.726 344 1.05
BPE 64K 1.737 397 1.03
ByteBPE 64K 1.755 383 1.11

Inference

Tokenizer Encode kchar/s Gen char/s (B=1) Gen char/s (B=32) GPU mem B=32
Morpheus 3,885 548 13,843 2,270 MB
WordPiece 2,100 924 23,497 3,724 MB
BPE 984 853 21,138 3,724 MB

Pareto frontier: {Morpheus, WordPiece} on (BPC, throughput). Morpheus dominates on (BPC, memory) and tokenizer artifact size.

Intended Uses

Recommended

  • Turkish NLU / classification tasks (morpheme-aligned tokens help attention head specialization)
  • Morphologically-sensitive information retrieval (stemming, suffix-aware query expansion)
  • Pretraining small-to-medium Turkish language models (≤1B parameters)
  • Linguistic research / corpus annotation at scale
  • Educational tools for Turkish morphology visualization
  • Memory-constrained inference (39% lower GPU memory vs classical 64K-vocab tokenizers)

Not recommended

  • Real-time generation where latency dominates (~40% slower char generation throughput than BPE)
  • Multilingual models (Turkish-specific by design)
  • Frontier-scale LLM pretraining (>10B parameters) where inductive-bias advantage of morphology saturates

Limitations

  • Single language: trained for Turkish only. Cannot tokenize other languages meaningfully.
  • Single domain mix: corpus is informal + academic + news. May underperform on highly specialized domains (medical, legal) without further adaptation.
  • Generation throughput trade-off: ~40% slower end-to-end character generation than BPE in autoregressive sampling due to higher fertility (1.26 vs 1.01 tokens/char). Acceptable for batch/offline tasks; potential bottleneck for real-time chat.
  • Verb morphology: SIGMORPHON evaluation shows Morpheus has slightly higher error rate on verb lemma identification than nominal lemmas — Turkish verb inflection (tense + aspect + person stacks) is the residual hard case.
  • Tokenizer requires neural model: unlike BPE/SentencePiece, Morpheus needs a PyTorch model checkpoint loaded for inference. Adds 60 MB to deployment footprint.

Bias and Ethical Considerations

  • The training corpus reflects the bias of its three source registers (Ekşisözlük, Dergipark, Turkish news). It is monolingual Turkish and does not include code-switched or dialectal variants (Cypriot Turkish, Azerbaijani Turkish, Karachay-Balkar etc.).
  • Ekşisözlük content was scraped from public entries. CorpusCollector documents the scraping protocol and rate-limiting; users adapting this corpus for derivative works should review the source TOS for their region.
  • No personally identifiable information (PII) filtering was applied beyond Wikipedia-style cleanup. Downstream users should add PII redaction if deploying in sensitive contexts.

Citation

@misc{sakar2026morpheus,
  title  = {Morpheus: A Morphology-Aware Tokenizer for Turkish},
  author = {Şakar, Tolga},
  year   = {2026},
  note   = {Preprint forthcoming on arXiv}
}

Related prior work by the same author:

@article{sakar2025rag,
  title     = {Maximizing {RAG} efficiency: A comparative analysis of {RAG} methods},
  author    = {Şakar, Tolga and Emekci, Hakan},
  journal   = {Natural Language Processing},
  volume    = {31},
  number    = {1},
  year      = {2025},
  publisher = {Cambridge University Press}
}

Acknowledgments

  • Morfessor (Creutz & Lagus, 2002, 2007) — unsupervised morphological supervisor
  • SentencePiece (Kudo & Richardson, 2018) and HuggingFace tokenizers — baseline tokenizer implementations
  • SIGMORPHON 2022 Turkish task — inflection gold standard for morphological evaluation
  • The Turkish NLP community for prior work (BERTurk, TURNA, Zemberek, TRMorph) that motivated this study

License

MIT License. See LICENSE.


Repository: TurkishTokenizer-Alpha-v1 · Author: Tolga Şakar

Downloads last month
60
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Space using lonewolflab/Morpheus-TR-50K 1

Evaluation results

  • BPC (lower is better) on Turkish curated corpus
    self-reported
    1.640
  • Boundary F1 (vs Morfessor) on Held-out Turkish OOV stratum
    self-reported
    0.916
  • Morphological Alignment Score % (OOV) on Held-out Turkish OOV stratum
    self-reported
    92.400