💎 Qwen Robotics & Routing Tokenizer

Qwen Pro Tokenizer. Fully compatible with Qwen3 models.
Size: vocab size 151778, pad_token_id 151643, eos_token_id 151645
Compatible with Qwen 3 • Optimized for code. After resizing the model's embeddings it can be used without further adaptation; see the examples below.
Fully adapting the routing features for a RAG routing model takes 100k examples, so check the Qwen rules for modifying a tokenizer.
A Qwen 3-based tokenizer enhanced with FIM markers from Microsoft datasets (<|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>)
Added Robotics & Embodiment tags ("<|action_start|>", "<|action_end|>", "<|trajectory_start|>", "<|trajectory_end|>", "<|joint_start|>", "<|joint_end|>", "<|sensor_start|>", "<|sensor_end|>", "<|command_start|>", "<|command_end|>", "<|state_start|>", "<|state_end|>", "<|pose|>", "<|velocity|>", "<|force|>", "<|torque|>", "<|gripper|>", "<|navigation|>", "<|obstacle|>", "<|task_start|>", "<|task_end|>", "<|plan_start|>", "<|plan_end|>", "<|behavior_start|>", "<|behavior_end|>", "<|skill_start|>", "<|skill_end|>", "<|motor|>", "<|servo|>", "<|imu|>", "<|lidar|>", "<|camera|>", "<|depth|>", "<|waypoint|>", "<|path|>", "<|collision|>", "<|grasp|>", "<|release|>", "<|homing|>", "<|emergency_stop|>", "<|calibration|>", "<|manipulation|>", "<|locomotion|>", "<|feedback|>", "<|control_loop|>")
Added multimodal support ("<|image|>", "<|video|>", "<|sound|>", "<|voice|>", "<|listening|>", "<|vision|>")
Added Human mood tags ("<|mood_happy|>", "<|mood_sad|>", "<|mood_angry|>", "<|mood_neutral|>")
Added RAG routing tags for RAG MoE Systems ( "SCIENCE", "CODING", "STOCK_EXCHANGE", "MEDICINE", "GOVERNMENT", "NEWS", "GENERAL", "MATERIAL_SCIENCE", "ELECTRONICS", "MICROELECTRONICS", "ENGINEERING", "ROBOTICS", "ENERGY", "AUTOMOTIVE", "AVIATION", "MATH", "PYTHON", "C", "CPP", "C_SHARP", "JAVA", "JAVASCRIPT", "TYPESCRIPT", "RUST", "GO", "RUBY", "PHP", "SWIFT", "KOTLIN", "BASH", "SQL", "ASSEMBLY", "PHILOSOPHY", "LITERATURE", "SOCIOLOGY", "PSYCHOLOGY", "POLITICAL_SCIENCE", "CULTURAL_STUDIES", "ETHNOGRAPHY", "HUMAN_RIGHTS", "COMPLIANCE", "MILITARY", "BANKING", "OIL_INDUSTRY", "LIGHT_INDUSTRY", "NATURE", "OCEAN", "SPORT", "CULINARY", "TRAVEL", "HOBBY" )
Fully compatible with BigCode datasets such as The Stack and StarCoder, and with Microsoft NextCoder.
Enables efficient training on large-scale coding data for superior code generation and understanding.
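As an illustration of how the FIM markers listed above are typically used, here is a minimal sketch that builds a fill-in-the-middle prompt as a plain string. It assumes the prefix-suffix-middle ordering used by Qwen coder models; the helper name `build_fim_prompt` is hypothetical and not part of this tokenizer.

```python
# FIM markers as listed in this model card.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange the code before and after a gap in prefix-suffix-middle order.

    Assumption: the model is expected to generate the missing middle
    after the final <|fim_middle|> marker.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Ask for the body of a return statement between two known code fragments.
prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")
print(prompt)
```

The resulting string can be passed to the tokenizer like any other text; each marker should then map to a single token ID.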

Inventor: Konstantin Vladimirovich Grabko
Organization: CMS Manhattan JiRack Technology
Official Site: www.cmsmanhattan.com
Designed for Banking and Fintech Institutions

Banks and Fintech JiRack Architecture: Build Sovereign Financial Models from Scratch

Leveraging the JiRack Tokenizer and our Open Dataset, we enable financial institutions to develop secure, internal AI models from the ground up. This approach ensures maximum data privacy and model sovereignty for high-stakes banking operations.
A fixed price is available for FinTech customers.
I recommend initializing the model with a 4K context window for initial stability, followed by scaling to 8K context using specialized JiRack 8K datasets. This two-stage approach ensures robust positional encoding before extending the model's handling of long-range dependencies.
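The two-stage schedule can be pictured with a small data-preparation sketch. It only shows how one token stream might be cut into fixed-length blocks for each stage; the stage names, the `pack_blocks` helper, and the padding of the last block are assumptions made for illustration, not the JiRack training pipeline.

```python
from typing import List, Sequence

PAD_TOKEN_ID = 151643  # pad_token_id reported for this tokenizer

# Hypothetical stage table: (name, context length in tokens).
STAGES = [("stage-1", 4096), ("stage-2", 8192)]


def pack_blocks(token_ids: Sequence[int], context_length: int,
                pad_id: int = PAD_TOKEN_ID) -> List[List[int]]:
    """Split a flat token stream into blocks of exactly context_length IDs.

    The final block is right-padded with pad_id when the stream does not
    divide evenly.
    """
    blocks = []
    for start in range(0, len(token_ids), context_length):
        block = list(token_ids[start:start + context_length])
        block += [pad_id] * (context_length - len(block))
        blocks.append(block)
    return blocks


# Stand-in for a tokenized corpus of 10,000 tokens.
stream = list(range(10000))
for name, context_length in STAGES:
    blocks = pack_blocks(stream, context_length)
    print(name, context_length, len(blocks))
```

The same corpus yields fewer, longer blocks in the second stage, which is the only difference between the stages in this sketch.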

JiRack Corp Tokenizer solution

Use JiRack models with trusted, high-quality coding datasets while maintaining full control over your code and data privacy.
Excellent fit for Banks, Fintech companies, and any organization that requires strict data confidentiality and security.
Update your JiRack model for private corporate coding workloads.

JiRack Tokenizer Subscription

All subscribed members will receive regular tokenizer updates optimized for the latest high-quality coding datasets.

Open Robot platform

Tiangong: https://english.www.gov.cn/english.www.gov.cn/news/202411/13/content_WS673406e2c6d0868f4e8ece33.html
Unitree G1: https://a.co/d/0e4A8YVc
LimX Oli: https://www.limxdynamics.com/en/products/oli?channel=option_google_advertising__c-
ubtrobot: https://www.ubtrobot.com/en/
x-humanoid: https://www.x-humanoid.com/detail/hskw.html

Key Features

Algorithm: Byte-Level BPE
Vocabulary Size: 151,778 tokens (the size reported by the tokenizer), a good balance between precision and efficiency
Multilingual & Technical Strength: Optimized for English, Russian, code, scientific literature, and technical documentation
Domain Specialization: Strong performance on programming languages, engineering, robotics, and scientific texts

Special Tokens Support

Full Qwen3-compatible dialogue format
FIM (Fill-in-the-Middle) support for code generation
Rich set of domain routing tokens (__CODING__, __PYTHON__, __ROBOTICS__, __SCIENCE__, etc.)
Extended robotics and control tokens
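To make the routing and robotics tokens concrete, the following sketch formats and parses them as plain strings. It assumes the `__NAME__` spelling shown above for routing tokens and assumes the routing token is placed at the start of the text; the helpers `add_route`, `split_route`, and `wrap_span` are hypothetical names introduced for this example.

```python
import re
from typing import Optional, Tuple

# Matches a leading routing token such as __ROBOTICS__ or __C_SHARP__.
ROUTE_RE = re.compile(r"^__([A-Z]+(?:_[A-Z]+)*)__")


def add_route(tag: str, text: str) -> str:
    """Prefix text with a routing token (assumed to come first)."""
    return f"__{tag}__ {text}"


def split_route(text: str) -> Tuple[Optional[str], str]:
    """Return (routing tag, remaining text); the tag is None if absent."""
    match = ROUTE_RE.match(text)
    if match is None:
        return None, text
    return match.group(1), text[match.end():].lstrip()


def wrap_span(name: str, payload: str) -> str:
    """Wrap payload in a paired <|name_start|> ... <|name_end|> span."""
    return f"<|{name}_start|>{payload}<|{name}_end|>"


# A gripper command routed to the robotics domain.
routed = add_route("ROBOTICS", wrap_span("command", "<|gripper|> open"))
print(routed)
print(split_route(routed))
```

A router in a RAG MoE system could use the parsed tag to pick an expert or index; how the tag is actually consumed depends on the downstream model.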

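Install for Llama-compatible models in your chat script

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./QwenRoboticsTokenizer")
model = AutoModelForCausalLM.from_pretrained("your-model")  # replace with your model path or ID
Required step: resize the embeddings
# Match the embedding matrix to the tokenizer size.
# If your script wraps the tokenizer in an object, use len(tokenizer.tokenizer) instead.
model.resize_token_embeddings(len(tokenizer))
print("New embedding size for your chat script:", model.get_input_embeddings().weight.shape[0])
Test tokenizer size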

(venv_ji) root@jirack2:# python -c '
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("./QwenRoboticsTokenizer")
print("Vocab size:", len(tok))
print("pad_token_id:", tok.pad_token_id)
print("eos_token_id:", tok.eos_token_id)
'

Vocab size: 151778
pad_token_id: 151643
eos_token_id: 151645
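Beyond checking the sizes, it can help to confirm that each added tag is encoded as a single ID rather than being split into pieces. The sketch below is one possible check; `find_split_tokens` is a hypothetical helper, and `toy_encode` is a stand-in encoder so the example runs without the tokenizer files.

```python
from typing import Callable, Dict, Iterable, List


def find_split_tokens(encode: Callable[[str], List[int]],
                      tokens: Iterable[str]) -> Dict[str, List[int]]:
    """Return the tokens that do not encode to exactly one ID."""
    split = {}
    for token in tokens:
        ids = encode(token)
        if len(ids) != 1:
            split[token] = ids
    return split


def toy_encode(text: str) -> List[int]:
    """Stand-in encoder: known tags get one made-up ID, anything else one ID per character."""
    known = {"<|fim_prefix|>": [1], "<|pose|>": [2]}
    return known.get(text, [ord(ch) for ch in text])


# With the real tokenizer, the encoder could be built like this:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("./QwenRoboticsTokenizer")
#   encode = lambda s: tok.encode(s, add_special_tokens=False)
problems = find_split_tokens(toy_encode, ["<|fim_prefix|>", "<|pose|>", "<|not_a_token|>"])
print(sorted(problems))
```

An empty result for the full tag list would indicate that every tag is registered as a single token.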

📧 Contact & Licensing

For joint ventures, hardware integration, or licensing inquiries:

Email: grabko@cmsmanhattan.com
Phone: +1 (516) 777-0945
Location: New York, USA


📧 Copyright 2026 CMS Manhattan. All rights reserved.