Tokenizer VLA Adaptive
Extended GPT-NeoX-20b tokenizer for the FineVideo-VLA dataset.
What is this?
This tokenizer extends the EleutherAI/gpt-neox-20b tokenizer with 93,938 new tokens for multimodal Vision-Language-Action (VLA) pretraining.
| Category | Token format | Count |
|---|---|---|
| Seed2 visual tokens | <seed2_N> (N=0-8191) | 8,192 |
| Cosmos spatial tokens | <cosmos_N> (N=0-63999) | 64,000 |
| AVC-LM H.264 BPE tokens | <avclm_N> (N=0-8191) | 8,192 |
| Agent legacy tokens | <agent_N> (N=0-255) | 256 |
| FPS prefix | <fps_N> (N=1-60) | 60 |
| Joint position tokens | <{joint}_x_N>, _y_N, _z_N (N=0-255) | 13,056 |
| Joint time tokens | <{joint}_t_N> (N=0-7) | 136 |
| Wrapper tags | <seed2>, </seed2>, <agent>, </agent>, etc. | 46 |
Total vocab size: 144,215 (50,277 base + 93,938 new)
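The total can be cross-checked by summing the per-category counts from the table. A quick sanity check, where the dictionary keys are labels chosen here for illustration and are not identifiers from the tokenizer:

```python
# Per-category counts copied from the table above.
# The keys are illustrative labels only.
new_token_counts = {
    "seed2": 8_192,
    "cosmos": 64_000,
    "avclm": 8_192,
    "agent": 256,
    "fps": 60,
    "joint_position": 17 * 3 * 256,  # 17 joints x 3 axes x 256 bins
    "joint_time": 17 * 8,            # 17 joints x 8 time bins
    "wrapper_tags": 46,
}

BASE_VOCAB_SIZE = 50_277  # base size as stated in this card

total_new = sum(new_token_counts.values())
total_vocab = BASE_VOCAB_SIZE + total_new

print(total_new)    # 93938
print(total_vocab)  # 144215
```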
17 Named Joints
pelvis, r_hip, r_knee, r_ankle, l_hip, l_knee, l_ankle, spine, thorax, nose, head_top, l_shoulder, l_elbow, l_wrist, r_shoulder, r_elbow, r_wrist
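As a rough sketch, the joint token strings can be enumerated from this list. It assumes the y and z tokens follow the same `<{joint}_{axis}_{N}>` pattern as the x tokens, and the enumeration order below is arbitrary: it is not claimed to match the ID order in the published tokenizer.

```python
from itertools import product

JOINTS = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow",
    "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]

# Assumption: y and z use the same naming pattern as x, e.g. <pelvis_y_128>.
position_tokens = [
    f"<{joint}_{axis}_{n}>"
    for joint, axis, n in product(JOINTS, "xyz", range(256))
]

# Time tokens use 8 bins per joint (N=0-7).
time_tokens = [f"<{joint}_t_{n}>" for joint in JOINTS for n in range(8)]

print(len(position_tokens))  # 13056
print(len(time_tokens))      # 136
```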
Usage
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")
# All VLA tokens are atomic — never split by BPE
tok.encode("<seed2_1137>") # -> [59908]
tok.encode("<pelvis_x_128>") # -> [131151]
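When preparing text for the tokenizer, integer codes need to be rendered as token strings first. The helper below is a hypothetical sketch for Seed2 codes: the function name is made up here, and it assumes tokens are concatenated without separators inside the `<seed2>` ... `</seed2>` wrapper, which this card does not specify.

```python
def seed2_span(codes):
    """Render integer Seed2 codes as a wrapped run of token strings.

    Hypothetical helper; the no-separator layout is an assumption.
    """
    for c in codes:
        if not 0 <= c <= 8191:  # valid range from the table: N=0-8191
            raise ValueError(f"Seed2 code out of range: {c}")
    body = "".join(f"<seed2_{c}>" for c in codes)
    return f"<seed2>{body}</seed2>"


print(seed2_span([1137, 0, 8191]))
# <seed2><seed2_1137><seed2_0><seed2_8191></seed2>
```

The resulting string can be passed to `tok.encode` like any other text.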
How it was created
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# new_vla_tokens: list of the 93,938 token strings summarized in the table above
tok.add_tokens(new_vla_tokens, special_tokens=True)
tok.save_pretrained("tokenizer-vla-adaptive")
All tokens are registered via add_tokens(special_tokens=True), so the tokenizer matches each one as a whole before BPE runs and never splits it into sub-word pieces.
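One way to confirm this after loading the tokenizer is to encode each token string and check that exactly one ID comes back. A minimal sketch follows. The function name is hypothetical, and it works with any object that exposes an `encode(text, add_special_tokens=False)` method, as Hugging Face tokenizers do.

```python
def find_split_tokens(tok, tokens):
    """Return the token strings that do not encode to exactly one ID."""
    return [
        t for t in tokens
        if len(tok.encode(t, add_special_tokens=False)) != 1
    ]


# Example with the published tokenizer (downloads files, so it is left commented out):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")
# assert find_split_tokens(tok, ["<seed2_1137>", "<pelvis_x_128>"]) == []
```

An empty result means every token in the list is atomic.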