Kabyle Tokenizer for T5

SentencePiece tokenizer trained on 787,648 Kabyle sentences from Tatoeba, designed for T5-style models.

Vocabulary

  • Size: 32,000 tokens
  • Type: BPE (Byte Pair Encoding)
  • Character coverage: 99.99%
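
As a rough illustration, the settings listed above map onto SentencePiece trainer options along the following lines. This is a minimal sketch, not the actual training script: the corpus path and model prefix are hypothetical placeholders, and special-token options are left out because their IDs are not stated here.

```python
def build_trainer_options(corpus_path: str, model_prefix: str) -> dict:
    """Collect the vocabulary settings from this card as SentencePiece trainer options."""
    return {
        "input": corpus_path,          # one sentence per line (hypothetical file)
        "model_prefix": model_prefix,  # output file prefix (hypothetical name)
        "vocab_size": 32000,
        "model_type": "bpe",
        "character_coverage": 0.9999,
    }


options = build_trainer_options("kabyle_sentences.txt", "kabyle_spm")
print(options)

# With the sentencepiece package installed, the options would be passed as:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(**options)
```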

Special Tokens

  • <unk>: Unknown token
  • <pad>: Padding token
  • </s>: End of sequence
  • <s>: Beginning of sequence
  • <mask>: Mask token (for T5)

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-tokenizer-T5")
# Split a Kabyle sentence into subword pieces
tokens = tokenizer.tokenize("Aqcic-nni yeɣra adlis.")
print(tokens)  # ['▁Aqcic', '-', 'nni', '▁yeɣra', '▁adlis', '.']

Training Data

  • Source: Tatoeba Kabyle corpus
  • Sentences: 787,648
  • Cleaning: Greek and Cyrillic look-alike characters replaced with their Latin equivalents (ε→ɛ, γ→ɣ, Σ→Ɛ, Γ→Ɣ, Ԑ→Ɛ, ԑ→ɛ)
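
The character substitutions in the cleaning step can be sketched with a translation table. This is an illustrative reconstruction under the assumption that cleaning was a plain one-to-one character replacement. The names `LOOKALIKE_TO_LATIN` and `normalize_kabyle` are hypothetical, and code points are written as escapes so the look-alikes cannot be confused.

```python
# Mapping taken from the cleaning rule above (source -> Latin replacement).
LOOKALIKE_TO_LATIN = {
    "\u03b5": "\u025b",  # Greek small epsilon         -> Latin small open e
    "\u03b3": "\u0263",  # Greek small gamma           -> Latin small gamma
    "\u03a3": "\u0190",  # Greek capital sigma         -> Latin capital open E
    "\u0393": "\u0194",  # Greek capital gamma         -> Latin capital gamma
    "\u0510": "\u0190",  # Cyrillic capital reversed ze -> Latin capital open E
    "\u0511": "\u025b",  # Cyrillic small reversed ze   -> Latin small open e
}

_TABLE = str.maketrans(LOOKALIKE_TO_LATIN)


def normalize_kabyle(text: str) -> str:
    """Replace Greek/Cyrillic look-alikes with the Latin letters used in Kabyle."""
    return text.translate(_TABLE)


# The word below is spelled with a Greek gamma; the result uses the Latin gamma.
print(normalize_kabyle("ye\u03b3ra"))
```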

Comparison with T5-original

| Phrase | T5-original tokens | Kabyle-SPM tokens |
|---|---|---|
| Aqcic-nni yeɣra adlis. | 19 | 6 |
| Tettmeslayeḍ taqbaylit? | 18 | 4 |
| Ur zmireɣ ara ad qqimeɣ argaz-a. | ~20 | 10 |
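
To make the comparison concrete, the token counts above can be turned into reduction ratios. The sketch below only redoes the arithmetic on the figures from the table. The third phrase uses 20 for the T5-original count, which the table gives as approximate, so its ratio and the overall figure are approximate as well.

```python
# (T5-original tokens, Kabyle-SPM tokens) per phrase, copied from the table.
counts = {
    "Aqcic-nni yeɣra adlis.": (19, 6),
    "Tettmeslayeḍ taqbaylit?": (18, 4),
    "Ur zmireɣ ara ad qqimeɣ argaz-a.": (20, 10),  # table gives "~20"
}


def reduction_ratio(t5_tokens: int, kabyle_tokens: int) -> float:
    """How many T5-original tokens correspond to one Kabyle-SPM token."""
    return t5_tokens / kabyle_tokens


for phrase, (t5_n, kab_n) in counts.items():
    print(f"{phrase}  ratio: {reduction_ratio(t5_n, kab_n):.2f}")

# Aggregate ratio over all three phrases: 57 / 20.
total_t5 = sum(t5_n for t5_n, _ in counts.values())
total_kab = sum(kab_n for _, kab_n in counts.values())
overall = reduction_ratio(total_t5, total_kab)
print(f"overall ratio: {overall:.2f}")
```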

Kabyle Characters Preserved

  • ɛ / Ɛ (open e)
  • ɣ / Ɣ (Latin gamma)
  • č / Č (c with caron)
  • ǧ / Ǧ (g with caron)
  • ḍ / Ḍ (d with dot below)
  • ḥ / Ḥ (h with dot below)
  • ṛ / Ṛ (r with dot below)
  • ṣ / Ṣ (s with dot below)
  • ṭ / Ṭ (t with dot below)
  • ẓ / Ẓ (z with dot below)
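
One way to check this list against a loaded tokenizer is to encode each character on its own and look for the unknown-token ID. The helper below is a hedged sketch: `chars_mapped_to_unk` is a hypothetical name, and it assumes a Hugging Face-style tokenizer exposing `encode(text, add_special_tokens=False)` and `unk_token_id`.

```python
import unicodedata

# The letters listed above; NFC keeps each one as a single precomposed code point.
KABYLE_CHARS = unicodedata.normalize("NFC", "ɛƐɣƔčČǧǦḍḌḥḤṛṚṣṢṭṬẓẒ")


def chars_mapped_to_unk(tokenizer, chars: str = KABYLE_CHARS) -> list:
    """Return the characters whose encoding contains the unknown-token ID."""
    unk_id = tokenizer.unk_token_id
    return [
        ch
        for ch in chars
        if unk_id in tokenizer.encode(ch, add_special_tokens=False)
    ]
```

An empty list returned for the tokenizer loaded in the Usage section would be consistent with all of these characters being preserved.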

Limitations

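  • Optimized for short sentences (Tatoeba style); longer or domain-specific text may tokenize less efficiently
  • May split rare compound words (e.g., "tebirt" → "teb" + "irt")
  • Requires a T5 model whose token embeddings are resized to this tokenizer's vocabulary for full compatibility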
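
The embedding resize mentioned in the last point could look roughly like this. It is a minimal sketch, assuming a Hugging Face `transformers` model, which provides `get_input_embeddings()` and `resize_token_embeddings()`. The helper name `align_embeddings` and the `t5-small` checkpoint in the comment are illustrative assumptions, not part of this repository.

```python
def align_embeddings(model, tokenizer) -> int:
    """Resize the model's token embeddings to the tokenizer's vocabulary if they differ."""
    target = len(tokenizer)
    current = model.get_input_embeddings().num_embeddings
    if current != target:
        model.resize_token_embeddings(target)
    return target


# Hypothetical usage (requires transformers and torch; "t5-small" is an assumed checkpoint):
#   from transformers import AutoTokenizer, T5ForConditionalGeneration
#   tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-tokenizer-T5")
#   model = T5ForConditionalGeneration.from_pretrained("t5-small")
#   align_embeddings(model, tokenizer)
```

Resizing only changes the shape of the embedding matrix. Rows carried over from a pretrained checkpoint were learned for a different vocabulary, so the model would presumably still need training or fine-tuning with this tokenizer.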