Qwen Robotics & Routing Tokenizer
Qwen Pro Tokenizer. Fully compatible with Qwen3 models.
Size: vocab size 151778, pad_token_id 151643, eos_token_id 151645
Compatible with Qwen 3 • Optimized for code. Use the resize function without adaptation; see the examples below.
About 100k examples are needed to fully adapt the routing features for a RAG routing model, so check the Qwen rules for modifying the tokenizer.
A Qwen 3-based tokenizer enhanced with FIM markers from Microsoft datasets ( <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|> )
Added Robotics & Embodiment tags ( "<|action_start|>", "<|action_end|>", "<|trajectory_start|>", "<|trajectory_end|>", "<|joint_start|>", "<|joint_end|>", "<|sensor_start|>", "<|sensor_end|>", "<|command_start|>", "<|command_end|>", "<|state_start|>", "<|state_end|>", "<|pose|>", "<|velocity|>", "<|force|>", "<|torque|>", "<|gripper|>", "<|navigation|>", "<|obstacle|>", "<|task_start|>", "<|task_end|>", "<|plan_start|>", "<|plan_end|>", "<|behavior_start|>", "<|behavior_end|>", "<|skill_start|>", "<|skill_end|>", "<|motor|>", "<|servo|>", "<|imu|>", "<|lidar|>", "<|camera|>", "<|depth|>", "<|waypoint|>", "<|path|>", "<|collision|>", "<|grasp|>", "<|release|>", "<|homing|>", "<|emergency_stop|>", "<|calibration|>", "<|manipulation|>", "<|locomotion|>", "<|feedback|>", "<|control_loop|>" )
Added multimodal support tags ( "<|image|>", "<|video|>", "<|sound|>", "<|voice|>", "<|listening|>", "<|vision|>" )
Added human mood tags ( "<|mood_happy|>", "<|mood_sad|>", "<|mood_angry|>", "<|mood_neutral|>" )
Added RAG routing tags for RAG MoE Systems ( "SCIENCE", "CODING", "STOCK_EXCHANGE", "MEDICINE", "GOVERNMENT", "NEWS", "GENERAL", "MATERIAL_SCIENCE", "ELECTRONICS", "MICROELECTRONICS", "ENGINEERING", "ROBOTICS", "ENERGY", "AUTOMOTIVE", "AVIATION", "MATH", "PYTHON", "C", "CPP", "C_SHARP", "JAVA", "JAVASCRIPT", "TYPESCRIPT", "RUST", "GO", "RUBY", "PHP", "SWIFT", "KOTLIN", "BASH", "SQL", "ASSEMBLY", "PHILOSOPHY", "LITERATURE", "SOCIOLOGY", "PSYCHOLOGY", "POLITICAL_SCIENCE", "CULTURAL_STUDIES", "ETHNOGRAPHY", "HUMAN_RIGHTS", "COMPLIANCE", "MILITARY", "BANKING", "OIL_INDUSTRY", "LIGHT_INDUSTRY", "NATURE", "OCEAN", "SPORT", "CULINARY", "TRAVEL", "HOBBY" )
Fully compatible with BigCode datasets, including The Stack and StarCoder, and with Microsoft's NextCoder.
Supports efficient training on large-scale coding data for code generation and understanding.
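As an illustration of how the FIM markers are typically used with code data, the sketch below assembles one fill-in-the-middle training string. The prefix-suffix-middle ordering follows the convention used by Qwen coder models and is an assumption here; `build_fim_example` is a hypothetical helper, not part of this tokenizer.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_example(code: str, start: int, end: int) -> str:
    """Split `code` at [start, end) and arrange it in prefix-suffix-middle order."""
    if not 0 <= start <= end <= len(code):
        raise ValueError("start/end must satisfy 0 <= start <= end <= len(code)")
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    # The model sees prefix and suffix, and learns to produce the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


snippet = "def add(a, b):\n    return a + b\n"
start = snippet.index("return")
end = snippet.index("\n", start)
print(build_fim_example(snippet, start, end))
```

At inference time the same layout would be used without the trailing middle segment, so the model generates it.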
Inventor: Konstantin Vladimirovich Grabko
Organization: CMS Manhattan JiRack Technology
Official Site: www.cmsmanhattan.com
Designed for Banking and Fintech Institutions
Banks and Fintech JiRack Architecture: Build Sovereign Financial Models from Scratch
- Leveraging the JiRack Tokenizer and our Open Dataset, we enable financial institutions to develop secure, internal AI models from the ground up. This approach ensures maximum data privacy and model sovereignty for high-stakes banking operations.
- A fixed price is available for FinTech customers.
- I recommend initializing the model with a 4K context window for initial stability, followed by scaling to an 8K context using specialized JiRack 8K datasets. This two-stage approach ensures robust positional encoding before extending the model's handling of long-range dependencies.
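The two-stage schedule can be sketched as a data-preparation step: the same token stream is cut into shorter blocks for the first stage and longer blocks for the second. This is a minimal sketch that assumes 4K and 8K mean 4,096 and 8,192 tokens; `STAGES` and `pack_into_blocks` are hypothetical names, and the JiRack 8K datasets themselves are not shown.

```python
from typing import Iterable, List

# Hypothetical stage definitions; the context lengths are assumed values.
STAGES = [
    {"name": "stage_1_stability", "context_length": 4096},
    {"name": "stage_2_extension", "context_length": 8192},
]


def pack_into_blocks(token_ids: Iterable[int], context_length: int) -> List[List[int]]:
    """Cut a flat token stream into full blocks and drop the incomplete tail."""
    ids = list(token_ids)
    n_full = len(ids) // context_length
    return [ids[i * context_length:(i + 1) * context_length] for i in range(n_full)]


stream = list(range(20_000))  # stand-in for real token ids
for stage in STAGES:
    blocks = pack_into_blocks(stream, stage["context_length"])
    print(stage["name"], stage["context_length"], len(blocks))
```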
JiRack Corp Tokenizer solution
- Use JiRack models with trusted, high-quality coding datasets while maintaining full control over your code and data privacy.
- Excellent fit for Banks, Fintech companies, and any organization that requires strict data confidentiality and security.
- Update the JiRack model for private corporate coding.
JiRack Tokenizer Subscription
- All subscribed members will receive regular tokenizer updates optimized for the latest high-quality coding datasets.
Open Robot Platforms
- Tiangong: https://english.www.gov.cn/english.www.gov.cn/news/202411/13/content_WS673406e2c6d0868f4e8ece33.html
- Unitree G1: https://a.co/d/0e4A8YVc
- LimX Oli: https://www.limxdynamics.com/en/products/oli?channel=option_google_advertising__c-
- UBTech (ubtrobot): https://www.ubtrobot.com/en/
- X-Humanoid: https://www.x-humanoid.com/detail/hskw.html
Key Features
- Algorithm: Byte-Level BPE
- Vocabulary Size: 151,778 tokens (as reported by the tokenizer size test), balancing precision and efficiency
- Multilingual & Technical Strength: Optimized for English, Russian, code, scientific literature, and technical documentation
- Domain Specialization: Strong performance on programming languages, engineering, robotics, and scientific texts
Special Tokens Support
- Full Qwen3-compatible dialogue format
- FIM (Fill-in-the-Middle) support for code generation
- Rich set of domain routing tokens (__CODING__, __PYTHON__, __ROBOTICS__, __SCIENCE__, etc.)
- Extended robotics and control tokens
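A minimal sketch of how a routing token could steer a query to a domain expert in a RAG setup. It assumes the router model emits one routing token at the start of its output; the `EXPERTS` mapping, the index names, and the `route` helper are hypothetical, and only the four routing tokens named in this section are used.

```python
from typing import Tuple

# Hypothetical mapping from routing token to a retrieval index or expert name.
EXPERTS = {
    "__CODING__": "code-index",
    "__PYTHON__": "python-index",
    "__ROBOTICS__": "robotics-index",
    "__SCIENCE__": "science-index",
}
DEFAULT_EXPERT = "general-index"


def route(router_output: str) -> Tuple[str, str]:
    """Return (expert, remaining_text) based on a leading routing token."""
    text = router_output.lstrip()
    for token, expert in EXPERTS.items():
        if text.startswith(token):
            return expert, text[len(token):].lstrip()
    # No known routing token found: fall back to the default expert.
    return DEFAULT_EXPERT, text


print(route("__PYTHON__ How do I reverse a list?"))
print(route("What is the weather today?"))
```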
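Install for Llama-compatible models in your chat script
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the tokenizer from its local directory
tokenizer = AutoTokenizer.from_pretrained("./QwenRoboticsTokenizer")
# Replace the placeholder with your own model name or path
model = AutoModelForCausalLM.from_pretrained("your-model")
Required step: resize the embeddings to the tokenizer size.
model.resize_token_embeddings(len(tokenizer))  # or len(tokenizer.tokenizer) if your script wraps the tokenizer in another object
print("New embedding size for your chat script:", model.get_input_embeddings().weight.shape[0])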
Test the tokenizer size
(venv_ji) root@jirack2:# python -c '
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("./QwenRoboticsTokenizer")
print("Vocab size:", len(tok))
print("pad_token_id:", tok.pad_token_id)
print("eos_token_id:", tok.eos_token_id)
'
Vocab size: 151778
pad_token_id: 151643
eos_token_id: 151645
Contact & Licensing
For joint ventures, hardware integration, or licensing inquiries:
- Email: grabko@cmsmanhattan.com
- Phone: +1 (516) 777-0945
- Location: New York, USA