Enjoy! We are extending the JiRack Models Ecosystem! 🚀
JiRack Pro Tokenizer 128K
A high-performance, production-grade Byte-Level BPE tokenizer developed as part of the JiRack Ternary Models ecosystem.
This is the Pro version, designed for maximum quality, compression, and precision in complex real-world applications.
- JiRackTernary_1b model https://huggingface.co/kgrabko/JiRackTernary_1b
Open Robot platform
- Tiangong: https://english.www.gov.cn/english.www.gov.cn/news/202411/13/content_WS673406e2c6d0868f4e8ece33.html
- Unitree g1: https://a.co/d/0e4A8YVc
- LimX Oli: https://www.limxdynamics.com/en/products/oli?channel=option_google_advertising__c-
- ubtrobot: https://www.ubtrobot.com/en/
- x-humanoid: https://www.x-humanoid.com/detail/hskw.html
Key Features
- Algorithm: Byte-Level BPE
- Vocabulary Size: 128,000 tokens, chosen to balance precision and efficiency
- Multilingual & Technical Strength: Optimized for English, Russian, code, scientific literature, and technical documentation
- Domain Specialization: Strong performance on programming languages, engineering, robotics, and scientific texts
Special Tokens Support
- Full ChatML dialogue format (`<|im_start|>`, `<|im_end|>`)
- FIM (Fill-in-the-Middle) support for code generation
- Rich set of domain routing tokens (`__CODING__`, `__PYTHON__`, `__ROBOTICS__`, `__SCIENCE__`, etc.)
- Extended robotics and control tokens
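The ChatML layout used in the benchmark further down this page can be sketched in plain Python. This is a minimal sketch, not the tokenizer's bundled chat template: the helper name `to_chatml` and the message dictionary shape are assumptions, and the newline placement is inferred from the decoded benchmark output (`<|im_start|>` plus role, newline, content, `<|im_end|>`, newline).

```python
# Hypothetical helper: renders messages in the ChatML layout seen in the
# benchmark output. It is not taken from the tokenizer's own chat template.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages):
    """Render a list of {"role": ..., "content": ...} dicts as ChatML text."""
    parts = []
    for message in messages:
        # One block per message: start marker + role, newline, content, end marker, newline.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a precise router model."},
    {"role": "user", "content": "__CODING__ __PYTHON__ Write a merge sort function in Python."},
]
chatml_text = to_chatml(messages)
print(chatml_text)
```

If the tokenizer repository ships its own chat template, `tokenizer.apply_chat_template(messages, tokenize=False)` would be the usual route in `transformers`; whether a template is bundled is not stated here.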
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Pro-Tokenizer-128K")
text = "__CODING__ __PYTHON__ Write a merge sort function in Python."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)
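To see how a string is split, each ID can be decoded on its own, which is the view the benchmark output below gives. A minimal sketch: `decode_each` is a hypothetical helper, and `StandInTokenizer` is a tiny stand-in that only knows a few ID-to-text pairs copied from the benchmark log, so the snippet runs without downloading anything. With the real tokenizer, pass the object returned by `AutoTokenizer.from_pretrained` instead.

```python
def decode_each(tokenizer, token_ids):
    """Return (id, text) pairs by decoding every token ID separately."""
    return [
        # clean_up_tokenization_spaces=False leaves the decoded spacing untouched.
        (tid, tokenizer.decode([tid], clean_up_tokenization_spaces=False))
        for tid in token_ids
    ]


class StandInTokenizer:
    """Stand-in for illustration only; the pairs are copied from the benchmark log."""

    PIECES = {
        72: "__CODING__",
        348: " ",
        87: "__PYTHON__",
        30667: " Write",
        395: " a",
        58912: " merge",
        6643: " sort",
    }

    def decode(self, ids, clean_up_tokenization_spaces=False):
        # Concatenate the known text pieces for the requested IDs.
        return "".join(self.PIECES[i] for i in ids)


sample_ids = [72, 348, 87, 30667, 395, 58912, 6643]
for tid, piece in decode_each(StandInTokenizer(), sample_ids):
    # repr() keeps whitespace tokens visible, e.g. a newline prints as '\n'.
    print(f"{tid} -> {piece!r}")
```

Passing `clean_up_tokenization_spaces=False` follows the suggestion in the `transformers` warning that appears in the benchmark log.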
### Token quality benchmark
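One way to read the log below is to compare how many tokens each request costs once the shared system prompt and routing tokens are set aside. The sketch below does that arithmetic for three of the ID lists, copied verbatim from the log; the helper name `common_prefix_len` and the `RUNS` dictionary are illustrative assumptions, not part of the tokenizer.

```python
# Token ID lists copied from the benchmark log in this section.
RUNS = {
    "english": [5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326,
                72, 348, 87, 30667, 395, 58912, 6643, 2299, 462, 8646, 141, 4, 326],
    "german": [5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326,
               72, 348, 87, 1818, 420, 17119, 4086, 5977, 1039, 140, 178, 800, 140,
               165, 11028, 472, 462, 8646, 141, 4, 326],
    "french": [5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326,
               72, 348, 87, 112537, 3439, 2834, 27517, 444, 3276, 33659, 561, 8646,
               141, 4, 326],
}


def common_prefix_len(sequences):
    """Length of the longest prefix shared by every sequence."""
    length = 0
    for column in zip(*sequences):
        # Stop at the first position where the sequences disagree.
        if len(set(column)) != 1:
            break
        length += 1
    return length


prefix = common_prefix_len(list(RUNS.values()))
for name, ids in RUNS.items():
    print(f"{name}: {len(ids)} tokens total, {len(ids) - prefix} after the shared prefix")
```

The shared prefix covers the system message and the `__CODING__ __PYTHON__` routing tokens, so the remainder is the part that varies with the language of the request (including the closing `<|im_end|>` and newline).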
```bash
=== Text after ChatML Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>
=== Tokens (IDs) ===
[5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326, 72, 348, 87, 30667, 395, 58912, 6643, 2299, 462, 8646, 141, 4, 326]
=== Decoding Token by Token ===
5 -> '<|im_start|>system'
326 -> '\n'
5208 -> 'You'
965 -> ' are'
395 -> ' a'
24704 -> ' precise'
...
72 -> '__CODING__'
87 -> '__PYTHON__'
30667 -> ' Write'
58912 -> ' merge'
6643 -> ' sort'
=== GERMAN =====
=== Text nach ChatML-Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Schreibe eine Merge-Sort-Funktion in Python.<|im_end|>
=== Token (IDs) ===
[5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326, 72, 348, 87, 1818, 420, 17119, 4086, 5977, 1039, 140, 178, 800, 140, 165, 11028, 472, 462, 8646, 141, 4, 326]
=== Dekodierung Token für Token ===
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
5 -> '<|im_start|>system'
326 -> '\n'
5208 -> 'You'
965 -> ' are'
395 -> ' a'
24704 -> ' precise'
1014 -> ' ro'
7861 -> 'uter'
6124 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
72 -> '__CODING__'
348 -> ' '
87 -> '__PYTHON__'
1818 -> ' Sch'
420 -> 're'
17119 -> 'ibe'
4086 -> ' eine'
5977 -> ' Mer'
1039 -> 'ge'
140 -> '-'
178 -> 'S'
800 -> 'ort'
140 -> '-'
165 -> 'F'
11028 -> 'unkt'
472 -> 'ion'
462 -> ' in'
8646 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
=== SPANISH =====
=== Texto después de la plantilla ChatML ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Escribe una función de ordenamiento por mezcla (merge sort) en Python.<|im_end|>
=== Tokens (IDs) ===
[5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326, 72, 348, 87, 4230, 12807, 1569, 38999, 444, 16400, 13573, 1441, 15031, 71301, 450, 113333, 6643, 136, 561, 8646, 141, 4, 326]
=== Decodificación token por token ===
5 -> '<|im_start|>system'
326 -> '\n'
5208 -> 'You'
965 -> ' are'
395 -> ' a'
24704 -> ' precise'
1014 -> ' ro'
7861 -> 'uter'
6124 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
72 -> '__CODING__'
348 -> ' '
87 -> '__PYTHON__'
4230 -> ' Es'
12807 -> 'cribe'
1569 -> ' una'
38999 -> ' función'
444 -> ' de'
16400 -> ' orden'
13573 -> 'amiento'
1441 -> ' por'
15031 -> ' mez'
71301 -> 'cla'
450 -> ' ('
113333 -> 'merge'
6643 -> ' sort'
136 -> ')'
561 -> ' en'
8646 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
=== RUSSIAN =====
=== Текст после ChatML шаблона ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Напиши функцию сортировки слиянием на python.<|im_end|>
=== Токены (ID) ===
[5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326, 72, 348, 87, 24549, 57864, 16351, 56848, 101013, 111998, 3263, 945, 1657, 1081, 822, 75733, 141, 4, 326]
=== Декодирование по токенам ===
5 -> '<|im_start|>system'
326 -> '\n'
5208 -> 'You'
965 -> ' are'
395 -> ' a'
24704 -> ' precise'
1014 -> ' ro'
7861 -> 'uter'
6124 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
72 -> '__CODING__'
348 -> ' '
87 -> '__PYTHON__'
24549 -> ' Нап'
57864 -> 'иши'
16351 -> ' функ'
56848 -> 'цию'
101013 -> ' сорт'
111998 -> 'ировки'
3263 -> ' сл'
945 -> 'ия'
1657 -> 'ни'
1081 -> 'ем'
822 -> ' на'
75733 -> ' python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
=== FRENCH =====
=== Texte après le modèle ChatML ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Écris une fonction de tri fusion en Python.<|im_end|>
=== Tokens (IDs) ===
[5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326, 72, 348, 87, 112537, 3439, 2834, 27517, 444, 3276, 33659, 561, 8646, 141, 4, 326]
=== Décodage token par token ===
5 -> '<|im_start|>system'
326 -> '\n'
5208 -> 'You'
965 -> ' are'
395 -> ' a'
24704 -> ' precise'
1014 -> ' ro'
7861 -> 'uter'
6124 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
72 -> '__CODING__'
348 -> ' '
87 -> '__PYTHON__'
112537 -> ' Éc'
3439 -> 'ris'
2834 -> ' une'
27517 -> ' fonction'
444 -> ' de'
3276 -> ' tri'
33659 -> ' fusion'
561 -> ' en'
8646 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
=== CHINESE =====
=== ChatML 模板处理后的文本 ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ 用 Python 写一个归并排序函数。<|im_end|>
=== Token (ID) ===
[5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326, 72, 348, 87, 348, 2879, 8646, 348, 22739, 19808, 72775, 15454, 20847, 29714, 115881, 760, 4, 326]
=== 逐个 Token 解码 ===
5 -> '<|im_start|>system'
326 -> '\n'
5208 -> 'You'
965 -> ' are'
395 -> ' a'
24704 -> ' precise'
1014 -> ' ro'
7861 -> 'uter'
6124 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
72 -> '__CODING__'
348 -> ' '
87 -> '__PYTHON__'
348 -> ' '
2879 -> '用'
8646 -> ' Python'
348 -> ' '
22739 -> '写'
19808 -> '一个'
72775 -> '归'
15454 -> '并'
20847 -> '排'
29714 -> '序'
115881 -> '函数'
760 -> '。'
4 -> '<|im_end|>'
326 -> '\n'
=== JAPANESE =====
=== ChatMLテンプレート適用後のテキスト ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Pythonでマージソートの関数を書いてください。<|im_end|>
=== トークン (ID) ===
[5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326, 72, 348, 87, 8646, 1183, 3911, 11941, 7947, 7691, 720, 6055, 108920, 6689, 7351, 3176, 5222, 99686, 760, 4, 326]
=== トークンごとのデコード ===
5 -> '<|im_start|>system'
326 -> '\n'
5208 -> 'You'
965 -> ' are'
395 -> ' a'
24704 -> ' precise'
1014 -> ' ro'
7861 -> 'uter'
6124 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
72 -> '__CODING__'
348 -> ' '
87 -> '__PYTHON__'
8646 -> ' Python'
1183 -> 'で'
3911 -> 'マ'
11941 -> 'ージ'
7947 -> 'ソ'
7691 -> 'ート'
720 -> 'の'
6055 -> '関'
108920 -> '数を'
6689 -> '書'
7351 -> 'いて'
3176 -> 'く'
5222 -> 'だ'
99686 -> 'さい'
760 -> '。'
4 -> '<|im_end|>'
326 -> '\n'
=== ARABIC =====
=== النص بعد تطبيق قالب ChatML ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ اكتب دالة فرز بالدمج (merge sort) بلغة بايثون.<|im_end|>
=== الرموز (IDs) ===
[5, 326, 5208, 965, 395, 24704, 1014, 7861, 6124, 141, 4, 326, 6, 326, 72, 348, 87, 45789, 8459, 770, 24238, 6361, 1142, 5451, 5669, 1062, 450, 113333, 6643, 136, 7954, 27188, 88636, 2181, 1343, 141, 4, 326]
=== فك الترميز رمزا برمز ===
5 -> '<|im_start|>system'
326 -> '\n'
5208 -> 'You'
965 -> ' are'
395 -> ' a'
24704 -> ' precise'
1014 -> ' ro'
7861 -> 'uter'
6124 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
72 -> '__CODING__'
348 -> ' '
87 -> '__PYTHON__'
45789 -> ' اك'
8459 -> 'تب'
770 -> ' د'
24238 -> 'الة'
6361 -> ' فر'
1142 -> 'ز'
5451 -> ' بال'
5669 -> 'دم'
1062 -> 'ج'
450 -> ' ('
113333 -> 'merge'
6643 -> ' sort'
136 -> ')'
7954 -> ' بل'
27188 -> 'غة'
88636 -> ' باي'
2181 -> 'ث'
1343 -> 'ون'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'