---
license: mit
language:
- tt
tags:
- tokenizer
- tatar-language
- wordpiece
- unigram
- bpe
- bbpe
- huggingface
metrics:
- unknown_rate
- compression_ratio
- word_coverage
- tokens_per_second
---

# TatarTokenizer: Tokenizers for the Tatar Language

This repository contains a comprehensive collection of pre-trained tokenizers for the Tatar language. We provide **four tokenization algorithms** (WordPiece, Unigram, BPE, and BBPE) with **multiple vocabulary sizes** (25k and 50k), trained on a large Tatar corpus. All tokenizers except `bpe_fixed_50k` achieve a **0% unknown rate** on test data, and all are ready to use with the `tokenizers` library or Hugging Face Transformers.

## 📦 Available Tokenizers

The following tokenizers are included:

| Tokenizer | Type | Vocab Size | Compression Ratio | Speed (tokens/sec) | Notes |
|-----------|------|------------|-------------------|--------------------|-------|
| `wp_50k` | WordPiece | 50,000 | 4.67 | 378,751 | Best overall balance |
| `wp_25k` | WordPiece | 25,000 | 4.36 | **496,273** | Fastest tokenizer |
| `uni_50k` | Unigram | 50,000 | 4.59 | 189,623 | Probabilistic model |
| `uni_25k` | Unigram | 25,000 | 4.30 | 260,403 | Good for smaller vocab |
| `bpe_50k` | BPE | 50,000 | 4.60 | 247,421 | Standard BPE |
| `bpe_50k_freq5` | BPE | 50,000 | 4.60 | 226,591 | Higher frequency threshold |
| `bbpe_50k` | BBPE | 50,000 | 4.60 | 227,322 | Byte-level BPE |
| `bbpe_25k` | BBPE | 25,000 | 4.28 | 257,104 | Compact byte-level |
| `bbpe_fixed_50k` | BBPE\* | 50,000 | **5.17** | 315,922 | Best compression ratio |
| `bpe_fixed_50k` | BPE\* | 50,000 | 4.75 | 337,247 | Fast BPE variant |

\* *Fixed versions with improved Unicode handling*

**Key observations:**
- All tokenizers except `bpe_fixed_50k` achieve a **0% unknown rate** on test data
- `bbpe_fixed_50k` offers the **best compression** (5.17 chars/token)
- `wp_25k` is the **fastest** (nearly 500k tokens/second)
- WordPiece models produce the most **human-readable tokens**

## 📁 Repository Structure

The files are organized into subdirectories by tokenizer type and vocabulary size:

```
TatarTokenizer/
├── tokenizers/
│   ├── wordpiece/
│   │   ├── 50k/          # wp_50k.json
│   │   └── 25k/          # wp_25k.json
│   ├── unigram/
│   │   ├── 50k/          # uni_50k.json
│   │   └── 25k/          # uni_25k.json
│   ├── bpe/
│   │   ├── 50k/          # bpe_50k.json
│   │   └── 50k_freq5/    # bpe_50k_freq5.json
│   ├── bbpe/
│   │   ├── 50k/          # bbpe_50k.json
│   │   └── 25k/          # bbpe_25k.json
│   ├── bpe_fixed/
│   │   └── 50k/          # bpe_fixed_50k.json
│   └── bbpe_fixed/
│       └── 50k/          # bbpe_fixed_50k.json
└── test_results/         # Evaluation reports and visualizations
    ├── tokenizer_test_report.csv
    ├── test_summary_*.txt
    ├── comparison_*.png
    ├── token_length_dist_*.png
    ├── correlation_*.png
    └── top10_score_*.png
```

Each tokenizer is saved as a single `.json` file compatible with the Hugging Face `tokenizers` library.

## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub tokenizers
```

### Load a Tokenizer

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download and load the WordPiece 50k tokenizer
tokenizer_file = hf_hub_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    filename="tokenizers/wordpiece/50k/wp_50k.json"
)

tokenizer = Tokenizer.from_file(tokenizer_file)

# Test it ("Kazan is the capital of Tatarstan")
text = "Казан - Татарстанның башкаласы"
encoding = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"Token IDs: {encoding.ids}")
print(f"Decoded: {tokenizer.decode(encoding.ids)}")
```

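For bulk tokenization, the `tokenizers` library can also encode many texts in one call with `encode_batch`, which parallelizes the work in the Rust backend. A minimal sketch, using a toy WordPiece vocabulary so it runs without downloading anything (with the released files you would load the tokenizer as shown above):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary standing in for a real wp_50k.json (illustration only).
vocab = {"[UNK]": 0, "kazan": 1, "tatar": 2, "##stan": 3}
tokenizer = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Encode several texts at once; returns one Encoding per input text.
encodings = tokenizer.encode_batch(["kazan tatarstan", "tatar kazan"])
for enc in encodings:
    print(enc.tokens)
```

`encode_batch` is the call to reach for when measuring throughput figures like the tokens/sec numbers above, since per-call overhead is amortized over the whole batch.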
### Using with Hugging Face Transformers

You can easily convert any tokenizer to the Hugging Face Transformers format:

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)

# Now you can use it with any transformer model
```

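Once wrapped, the tokenizer supports the usual Transformers calling convention (batching, padding, truncation, `return_tensors`). A runnable sketch with a toy vocabulary standing in for a loaded `wp_50k.json` (only `[UNK]` and `[PAD]` are set here for brevity; token ids are specific to this toy vocab):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy tokenizer in place of a downloaded one (illustration only).
vocab = {"[UNK]": 0, "[PAD]": 1, "kazan": 2, "tatar": 3}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

# Batched call with padding, as a model's data collator would use it.
batch = hf_tokenizer(["kazan", "kazan tatar"], padding=True)
print(batch["input_ids"])       # shorter text is padded with the [PAD] id
print(batch["attention_mask"])  # padding positions are masked out
```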
### Download All Files for a Specific Tokenizer

```python
from huggingface_hub import snapshot_download

# Download all files for WordPiece 50k
model_path = snapshot_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    allow_patterns="tokenizers/wordpiece/50k/*",
    local_dir="./tatar_tokenizer_wp50k"
)
```

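After the snapshot finishes, the tokenizer can be loaded straight from the local copy with `Tokenizer.from_file`. Sketched here with a toy tokenizer saved to a temporary directory so the snippet runs offline; with a real download the path would point into `local_dir`:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Stand-in file: save a toy tokenizer where a snapshot would have put it.
local_dir = tempfile.mkdtemp()
path = os.path.join(local_dir, "wp_50k.json")
Tokenizer(WordPiece({"[UNK]": 0, "kazan": 1}, unk_token="[UNK]")).save(path)

# With a real snapshot this would be, e.g.:
# "./tatar_tokenizer_wp50k/tokenizers/wordpiece/50k/wp_50k.json"
tokenizer = Tokenizer.from_file(path)
print(tokenizer.encode("kazan").tokens)
```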
## 📊 Evaluation Results

We conducted extensive testing on a held-out corpus of 10,000 documents (19.5 million characters). Here are the key findings:

### Best Tokenizers by Category

| Category | Winner | Value |
|----------|--------|-------|
| **Best Compression** | `bbpe_fixed_50k` | 5.17 chars/token |
| **Fastest** | `wp_25k` | 496,273 tokens/sec |
| **Best Overall** | `wp_50k` | Balanced performance |
| **Most Readable** | WordPiece family | Human-readable tokens |

### Performance Summary

All tokenizers (except `bpe_fixed_50k`) achieve:
- **0% unknown rate** on test data
- **100% word coverage** for common vocabulary
- Compression ratios between 4.28 and 5.17 chars/token

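These metrics are straightforward to recompute yourself: compression ratio is characters per token, and unknown rate is the share of `[UNK]` ids among all produced tokens. A toy sketch of the computation (the exact character-counting convention of the original evaluation script, e.g. whether whitespace is counted, is an assumption here):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Toy stand-in; the reported figures come from a 19.5M-character corpus.
UNK_ID = 0
tokenizer = Tokenizer(WordPiece({"[UNK]": UNK_ID, "ka": 1, "##zan": 2},
                                unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

corpus = ["kazan kazan"]
n_chars = sum(len(text) for text in corpus)          # whitespace included
encs = tokenizer.encode_batch(corpus)
n_tokens = sum(len(enc.ids) for enc in encs)
n_unk = sum(enc.ids.count(UNK_ID) for enc in encs)

print(f"compression ratio: {n_chars / n_tokens:.2f} chars/token")
print(f"unknown rate: {n_unk / n_tokens:.2%}")
```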
### Visualizations

The repository includes comprehensive evaluation visualizations in the `test_results/` folder:
- **Comparison plots** showing unknown rate, compression ratio, and speed by tokenizer type
- **Token length distributions** for each best-in-class tokenizer
- **Correlation matrices** between different metrics
- **Top-10 rankings** by composite score

Both Russian and English versions of all plots are available.

## 🧪 Test Results Summary

| Model | Type | Unknown Rate | Compression (chars/token) | Word Coverage | Speed (tokens/sec) |
|-------|------|--------------|---------------------------|---------------|--------------------|
| wp_50k | WordPiece | 0.0000 | 4.67 | 1.0000 | 378,751 |
| wp_25k | WordPiece | 0.0000 | 4.36 | 1.0000 | **496,273** |
| uni_50k | Unigram | 0.0000 | 4.59 | 1.0000 | 189,623 |
| uni_25k | Unigram | 0.0000 | 4.30 | 1.0000 | 260,403 |
| bpe_50k | BPE | 0.0000 | 4.60 | 1.0000 | 247,421 |
| bbpe_fixed_50k | BBPE (fixed) | 0.0000 | **5.17** | 1.0000 | 315,922 |

## 🎯 Recommendations

Based on our evaluation, we recommend:

1. **For BERT-like models**: use `wp_50k` (WordPiece), the best balance of readability and performance
2. **For maximum speed**: use `wp_25k`, the fastest tokenizer, ideal for high-throughput applications
3. **For maximum compression**: use `bbpe_fixed_50k`, the most efficient tokenization
4. **For GPT-like models**: use `bpe_50k` or `bbpe_50k`, compatible with modern LLM architectures
5. **For research**: all tokenizers are provided for comparative studies

## 📝 License

All tokenizers are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 🤝 Citation

If you use these tokenizers in your research, please cite:

```bibtex
@software{tatartokenizer_2026,
  title     = {TatarTokenizer: A Comprehensive Collection of Tokenizers for the Tatar Language},
  author    = {Arabov, Mullosharaf Kurbonvoich},
  year      = {2026},
  publisher = {Kazan Federal University},
  url       = {https://huggingface.co/TatarNLPWorld/TatarTokenizer}
}
```

## 🌍 Language

All tokenizers are trained on Tatar text and are intended for use with the Tatar language (language code `tt`). They fully support Tatar-specific characters (`ә`, `Ә`, `ү`, `Ү`, `җ`, `Җ`, `ң`, `Ң`, `һ`, `Һ`, `ө`, `Ө`), which never fall back to the unknown token in our tests.

## 🙌 Acknowledgements

These tokenizers were trained and evaluated by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.

Special thanks to the Hugging Face team for the `tokenizers` library and the Hugging Face Hub platform.