---
license: mit
language:
- tt
tags:
- tokenizer
- tatar-language
- wordpiece
- unigram
- bpe
- bbpe
- huggingface
metrics:
- unknown_rate
- compression_ratio
- word_coverage
- tokens_per_second
---

# TatarTokenizer: Tokenizers for the Tatar Language

This repository contains a comprehensive collection of pre-trained tokenizers for the Tatar language. We provide **four tokenization algorithms** (WordPiece, Unigram, BPE, and BBPE) with **multiple vocabulary sizes** (25k and 50k), trained on a large Tatar corpus. All tokenizers except `bpe_fixed_50k` achieve a **0% unknown rate** on test data, and all are ready to use with the `tokenizers` library or Hugging Face Transformers.

## 📦 Available Tokenizers

The following tokenizers are included:

| Tokenizer | Type | Vocab Size | Compression Ratio | Speed (tokens/sec) | Notes |
|-----------|------|------------|-------------------|--------------------|-------|
| `wp_50k` | WordPiece | 50,000 | 4.67 | 378,751 | Best overall balance |
| `wp_25k` | WordPiece | 25,000 | 4.36 | **496,273** | Fastest tokenizer |
| `uni_50k` | Unigram | 50,000 | 4.59 | 189,623 | Probabilistic model |
| `uni_25k` | Unigram | 25,000 | 4.30 | 260,403 | Good for smaller vocab |
| `bpe_50k` | BPE | 50,000 | 4.60 | 247,421 | Standard BPE |
| `bpe_50k_freq5` | BPE | 50,000 | 4.60 | 226,591 | Higher frequency threshold |
| `bbpe_50k` | BBPE | 50,000 | 4.60 | 227,322 | Byte-level BPE |
| `bbpe_25k` | BBPE | 25,000 | 4.28 | 257,104 | Compact byte-level |
| `bbpe_fixed_50k` | BBPE\* | 50,000 | **5.17** | 315,922 | Best compression ratio |
| `bpe_fixed_50k` | BPE\* | 50,000 | 4.75 | 337,247 | Fast BPE variant |

\* *Fixed versions with improved Unicode handling*

**Key observations:**
- All tokenizers except `bpe_fixed_50k` achieve a **0% unknown rate** on test data
- `bbpe_fixed_50k` offers the **best compression** (5.17 chars/token)
- `wp_25k` is the **fastest** (nearly 500k tokens/second)
- WordPiece models produce the most **human-readable tokens**

## 📁 Repository Structure

The files are organized into subdirectories by tokenizer type and vocabulary size:

```
TatarTokenizer/
├── tokenizers/
│   ├── wordpiece/
│   │   ├── 50k/          # wp_50k.json
│   │   └── 25k/          # wp_25k.json
│   ├── unigram/
│   │   ├── 50k/          # uni_50k.json
│   │   └── 25k/          # uni_25k.json
│   ├── bpe/
│   │   ├── 50k/          # bpe_50k.json
│   │   └── 50k_freq5/    # bpe_50k_freq5.json
│   ├── bbpe/
│   │   ├── 50k/          # bbpe_50k.json
│   │   └── 25k/          # bbpe_25k.json
│   ├── bpe_fixed/
│   │   └── 50k/          # bpe_fixed_50k.json
│   └── bbpe_fixed/
│       └── 50k/          # bbpe_fixed_50k.json
└── test_results/         # Evaluation reports and visualizations
    ├── tokenizer_test_report.csv
    ├── test_summary_*.txt
    ├── comparison_*.png
    ├── token_length_dist_*.png
    ├── correlation_*.png
    └── top10_score_*.png
```

Each tokenizer is saved as a single `.json` file compatible with the Hugging Face `tokenizers` library.

## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub tokenizers
```

### Load a Tokenizer

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download and load the WordPiece 50k tokenizer
tokenizer_file = hf_hub_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    filename="tokenizers/wordpiece/50k/wp_50k.json"
)

tokenizer = Tokenizer.from_file(tokenizer_file)

# Test it ("Kazan is the capital of Tatarstan")
text = "Казан - Татарстанның башкаласы"
encoding = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"Token IDs: {encoding.ids}")
print(f"Decoded: {tokenizer.decode(encoding.ids)}")
```

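For bulk tokenization, the `tokenizers` library can also encode many texts in one call with `encode_batch`, which parallelizes the work in the Rust backend. A minimal sketch, using a toy WordPiece vocabulary so it runs without downloading anything (with the released files you would load the tokenizer as shown above):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary standing in for a real wp_50k.json (illustration only).
vocab = {"[UNK]": 0, "kazan": 1, "tatar": 2, "##stan": 3}
tokenizer = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Encode several texts at once; returns one Encoding per input text.
encodings = tokenizer.encode_batch(["kazan tatarstan", "tatar kazan"])
for enc in encodings:
    print(enc.tokens)
```

`encode_batch` is the call to reach for when measuring throughput figures like the tokens/sec numbers above, since per-call overhead is amortized over the whole batch.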
### Using with Hugging Face Transformers

You can easily convert any tokenizer to the Hugging Face Transformers format:

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)

# Now you can use it with any transformer model
```

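Once wrapped, the tokenizer supports the usual Transformers calling convention (batching, padding, truncation, `return_tensors`). A runnable sketch with a toy vocabulary standing in for a loaded `wp_50k.json` (only `[UNK]` and `[PAD]` are set here for brevity; token ids are specific to this toy vocab):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy tokenizer in place of a downloaded one (illustration only).
vocab = {"[UNK]": 0, "[PAD]": 1, "kazan": 2, "tatar": 3}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

# Batched call with padding, as a model's data collator would use it.
batch = hf_tokenizer(["kazan", "kazan tatar"], padding=True)
print(batch["input_ids"])       # shorter text is padded with the [PAD] id
print(batch["attention_mask"])  # padding positions are masked out
```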
### Download All Files for a Specific Tokenizer

```python
from huggingface_hub import snapshot_download

# Download all files for WordPiece 50k
model_path = snapshot_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    allow_patterns="tokenizers/wordpiece/50k/*",
    local_dir="./tatar_tokenizer_wp50k"
)
```

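After the snapshot finishes, the tokenizer can be loaded straight from the local copy with `Tokenizer.from_file`. Sketched here with a toy tokenizer saved to a temporary directory so the snippet runs offline; with a real download the path would point into `local_dir`:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Stand-in file: save a toy tokenizer where a snapshot would have put it.
local_dir = tempfile.mkdtemp()
path = os.path.join(local_dir, "wp_50k.json")
Tokenizer(WordPiece({"[UNK]": 0, "kazan": 1}, unk_token="[UNK]")).save(path)

# With a real snapshot this would be, e.g.:
# "./tatar_tokenizer_wp50k/tokenizers/wordpiece/50k/wp_50k.json"
tokenizer = Tokenizer.from_file(path)
print(tokenizer.encode("kazan").tokens)
```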
## 📊 Evaluation Results

We conducted extensive testing on a held-out corpus of 10,000 documents (19.5 million characters). Here are the key findings:

### Best Tokenizers by Category

| Category | Winner | Value |
|----------|--------|-------|
| **Best Compression** | `bbpe_fixed_50k` | 5.17 chars/token |
| **Fastest** | `wp_25k` | 496,273 tokens/sec |
| **Best Overall** | `wp_50k` | Balanced performance |
| **Most Readable** | WordPiece family | Human-readable tokens |

### Performance Summary

All tokenizers (except `bpe_fixed_50k`) achieve:
- **0% unknown rate** on test data
- **100% word coverage** for common vocabulary
- Compression ratios between 4.28 and 5.17 chars/token

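These metrics are straightforward to recompute yourself: compression ratio is characters per token, and unknown rate is the share of `[UNK]` ids among all produced tokens. A toy sketch of the computation (the exact character-counting convention of the original evaluation script, e.g. whether whitespace is counted, is an assumption here):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Toy stand-in; the reported figures come from a 19.5M-character corpus.
UNK_ID = 0
tokenizer = Tokenizer(WordPiece({"[UNK]": UNK_ID, "ka": 1, "##zan": 2},
                                unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

corpus = ["kazan kazan"]
n_chars = sum(len(text) for text in corpus)          # whitespace included
encs = tokenizer.encode_batch(corpus)
n_tokens = sum(len(enc.ids) for enc in encs)
n_unk = sum(enc.ids.count(UNK_ID) for enc in encs)

print(f"compression ratio: {n_chars / n_tokens:.2f} chars/token")
print(f"unknown rate: {n_unk / n_tokens:.2%}")
```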
### Visualizations

The repository includes comprehensive evaluation visualizations in the `test_results/` folder:
- **Comparison plots** showing unknown rate, compression ratio, and speed by tokenizer type
- **Token length distributions** for each best-in-class tokenizer
- **Correlation matrices** between different metrics
- **Top-10 rankings** by composite score

Both Russian and English versions of all plots are available.

## 🧪 Test Results Summary

| Model | Type | Unknown Rate | Compression (chars/token) | Word Coverage | Speed (tokens/sec) |
|-------|------|--------------|---------------------------|---------------|--------------------|
| wp_50k | WordPiece | 0.0000 | 4.67 | 1.0000 | 378,751 |
| wp_25k | WordPiece | 0.0000 | 4.36 | 1.0000 | **496,273** |
| uni_50k | Unigram | 0.0000 | 4.59 | 1.0000 | 189,623 |
| uni_25k | Unigram | 0.0000 | 4.30 | 1.0000 | 260,403 |
| bpe_50k | BPE | 0.0000 | 4.60 | 1.0000 | 247,421 |
| bbpe_fixed_50k | BBPE (fixed) | 0.0000 | **5.17** | 1.0000 | 315,922 |

## 🎯 Recommendations

Based on our evaluation, we recommend:

1. **For BERT-like models**: use `wp_50k` (WordPiece), the best balance of readability and performance
2. **For maximum speed**: use `wp_25k`, the fastest tokenizer, ideal for high-throughput applications
3. **For maximum compression**: use `bbpe_fixed_50k`, the most efficient tokenization
4. **For GPT-like models**: use `bpe_50k` or `bbpe_50k`, compatible with modern LLM architectures
5. **For research**: all tokenizers are provided for comparative studies

## 📝 License

All tokenizers are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 🤝 Citation

If you use these tokenizers in your research, please cite:

```bibtex
@software{tatartokenizer_2026,
  title     = {TatarTokenizer: A Comprehensive Collection of Tokenizers for the Tatar Language},
  author    = {Arabov, Mullosharaf Kurbonvoich},
  year      = {2026},
  publisher = {Kazan Federal University},
  url       = {https://huggingface.co/TatarNLPWorld/TatarTokenizer}
}
```

## 🌍 Language

All tokenizers are trained on Tatar text and are intended for use with the Tatar language (language code `tt`). They fully support Tatar-specific characters (`ә`, `Ә`, `ү`, `Ү`, `җ`, `Җ`, `ң`, `Ң`, `һ`, `Һ`, `ө`, `Ө`), which never fall back to the unknown token in our tests.

## 🙌 Acknowledgements

These tokenizers were trained and evaluated by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.

Special thanks to the Hugging Face team for the `tokenizers` library and the Hugging Face Hub platform.