Tokenizer VLA Adaptive

Extended GPT-NeoX-20b tokenizer for the FineVideo-VLA dataset.

What is this?

This tokenizer extends the EleutherAI/gpt-neox-20b tokenizer with 93,938 new tokens for multimodal Vision-Language-Action (VLA) pretraining.

| Category | Token format | Count |
|---|---|---|
| Seed2 visual tokens | `<seed2_N>` (N=0-8191) | 8,192 |
| Cosmos spatial tokens | `<cosmos_N>` (N=0-63999) | 64,000 |
| AVC-LM H.264 BPE tokens | `<avclm_N>` (N=0-8191) | 8,192 |
| Agent legacy tokens | `<agent_N>` (N=0-255) | 256 |
| FPS prefix | `<fps_N>` (N=1-60) | 60 |
| Joint position tokens | `<{joint}_x_N>`, `_y_N`, `_z_N` (N=0-255) | 13,056 |
| Joint time tokens | `<{joint}_t_N>` (N=0-7) | 136 |
| Wrapper tags | `<seed2>`, `</seed2>`, `<agent>`, `</agent>`, etc. | 46 |

Total vocab size: 144,215 (50,277 base + 93,938 new)
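
As a rough illustration of what the split between base and new tokens implies, the sketch below tells the two ID ranges apart. It assumes the new tokens were appended after the base vocabulary, so that they occupy IDs 50,277 through 144,214; this card does not state that layout explicitly, and `is_vla_token_id` is a hypothetical helper, not part of any library.

```python
BASE_VOCAB = 50_277   # size of the EleutherAI/gpt-neox-20b tokenizer, per this card
NEW_TOKENS = 93_938   # VLA tokens added on top of it
TOTAL_VOCAB = BASE_VOCAB + NEW_TOKENS


def is_vla_token_id(token_id: int) -> bool:
    """Return True if the ID lies in the assumed range of added VLA tokens."""
    return BASE_VOCAB <= token_id < TOTAL_VOCAB


assert TOTAL_VOCAB == 144_215
print(is_vla_token_id(59_908))  # the ID this card quotes for <seed2_1137>
print(is_vla_token_id(1_000))   # an ID inside the base vocabulary
```

A model that uses this tokenizer would need an embedding table with at least 144,215 rows.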

17 Named Joints

pelvis, r_hip, r_knee, r_ankle, l_hip, l_knee, l_ankle, spine, thorax, nose, head_top, l_shoulder, l_elbow, l_wrist, r_shoulder, r_elbow, r_wrist

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")

# All VLA tokens are atomic — never split by BPE
tok.encode("<seed2_1137>")    # -> [59908]
tok.encode("<pelvis_x_128>")  # -> [131151]
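
When building input text, it may help to generate token strings from integers rather than typing them by hand. The sketch below is one way to do that, with the index ranges copied from the table above. `vla_token` and `joint_token` are hypothetical helpers, not part of the tokenizer, and the sketch assumes the y and z position tokens follow the same `<{joint}_y_N>` pattern as x.

```python
JOINTS = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow",
    "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]

# Inclusive index ranges for the numbered token families, from the table.
RANGES = {
    "seed2": (0, 8191),
    "cosmos": (0, 63999),
    "avclm": (0, 8191),
    "agent": (0, 255),
    "fps": (1, 60),
}


def vla_token(prefix: str, n: int) -> str:
    """Format a numbered token such as <seed2_1137>, checking the range."""
    lo, hi = RANGES[prefix]
    if not lo <= n <= hi:
        raise ValueError(f"{prefix} index {n} outside {lo}-{hi}")
    return f"<{prefix}_{n}>"


def joint_token(joint: str, field: str, n: int) -> str:
    """Format a joint token such as <pelvis_x_128>; field is x, y, z, or t."""
    if joint not in JOINTS:
        raise ValueError(f"unknown joint: {joint}")
    if field not in ("x", "y", "z", "t"):
        raise ValueError(f"unknown field: {field}")
    hi = 7 if field == "t" else 255  # time tokens use 0-7, positions 0-255
    if not 0 <= n <= hi:
        raise ValueError(f"{joint}_{field} index {n} outside 0-{hi}")
    return f"<{joint}_{field}_{n}>"


print(vla_token("seed2", 1137))
print(joint_token("pelvis", "x", 128))
```

The resulting strings can be passed to `tok.encode` as in the usage example above.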

How it was created

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# new_vla_tokens: list of the 93,938 token strings described in the table above
tok.add_tokens(new_vla_tokens, special_tokens=True)
tok.save_pretrained("tokenizer-vla-adaptive")

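All tokens are registered via add_tokens(special_tokens=True), so the tokenizer matches each one as a whole before BPE runs and the merge rules never split it. Because they are registered as special tokens, decoding with skip_special_tokens=True drops them from the output.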
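
The snippet above does not show how `new_vla_tokens` was assembled. The following is a minimal sketch of how the numbered families could be enumerated from the table. The ordering is an assumption and may not match the real token IDs, and the 46 wrapper tags are left out because this card does not list them all. `build_numbered_vla_tokens` is a hypothetical name.

```python
JOINTS = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow",
    "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]


def build_numbered_vla_tokens() -> list[str]:
    """Enumerate every numbered VLA token string (wrapper tags excluded)."""
    tokens = []
    tokens += [f"<seed2_{n}>" for n in range(8192)]
    tokens += [f"<cosmos_{n}>" for n in range(64000)]
    tokens += [f"<avclm_{n}>" for n in range(8192)]
    tokens += [f"<agent_{n}>" for n in range(256)]
    tokens += [f"<fps_{n}>" for n in range(1, 61)]  # fps runs 1-60, not 0-59
    for joint in JOINTS:
        for axis in "xyz":
            tokens += [f"<{joint}_{axis}_{n}>" for n in range(256)]
    for joint in JOINTS:
        tokens += [f"<{joint}_t_{n}>" for n in range(8)]
    return tokens


numbered = build_numbered_vla_tokens()
WRAPPER_TAG_COUNT = 46  # from the table; the tags themselves are not all listed

assert len(numbered) == len(set(numbered)) == 93_892
assert len(numbered) + WRAPPER_TAG_COUNT == 93_938
```

The two assertions confirm that the per-category counts in the table add up to the stated 93,938 new tokens.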
