Tokenizer VLA Adaptive

Extended GPT-NeoX-20b tokenizer for the FineVideo-VLA dataset.

What is this?

This tokenizer extends the EleutherAI/gpt-neox-20b tokenizer with 93,938 new tokens for multimodal Vision-Language-Action (VLA) pretraining.

| Category | Token format | Count |
|---|---|---|
| Seed2 visual tokens | `<seed2_N>` (N=0-8191) | 8,192 |
| Cosmos spatial tokens | `<cosmos_N>` (N=0-63999) | 64,000 |
| AVC-LM H.264 BPE tokens | `<avclm_N>` (N=0-8191) | 8,192 |
| Agent legacy tokens | `<agent_N>` (N=0-255) | 256 |
| FPS prefix | `<fps_N>` (N=1-60) | 60 |
| Joint position tokens | `<{joint}_x_N>`, `<{joint}_y_N>`, `<{joint}_z_N>` (N=0-255; 17 joints × 3 axes × 256 bins) | 13,056 |
| Joint time tokens | `<{joint}_t_N>` (N=0-7; 17 joints × 8 bins) | 136 |
| Wrapper tags | `<seed2>`, `</seed2>`, `<agent>`, `</agent>`, etc. | 46 |

Total vocab size: 144,215 (50,277 base + 93,938 new)

17 Named Joints

pelvis, r_hip, r_knee, r_ankle, l_hip, l_knee, l_ankle, spine, thorax, nose, head_top, l_shoulder, l_elbow, l_wrist, r_shoulder, r_elbow, r_wrist
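
The counts in the table can be cross-checked by enumerating the numbered token families from their formats. The following is a minimal sketch, not the script used to build this tokenizer: the function name `build_numbered_tokens` and the enumeration order are assumptions, so the resulting list order will not necessarily match the token IDs in the published vocabulary. The 46 wrapper tags are not fully listed here, so they are only counted, not generated.

```python
# Hypothetical reconstruction of the numbered VLA token strings.
# The ordering is an assumption and does not reproduce the real token IDs.
JOINTS = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow",
    "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]


def build_numbered_tokens():
    """Enumerate every numbered token family listed in the table."""
    tokens = []
    tokens += [f"<seed2_{n}>" for n in range(8192)]
    tokens += [f"<cosmos_{n}>" for n in range(64000)]
    tokens += [f"<avclm_{n}>" for n in range(8192)]
    tokens += [f"<agent_{n}>" for n in range(256)]
    tokens += [f"<fps_{n}>" for n in range(1, 61)]  # N runs from 1 to 60
    for joint in JOINTS:
        for axis in ("x", "y", "z"):
            tokens += [f"<{joint}_{axis}_{n}>" for n in range(256)]
    for joint in JOINTS:
        tokens += [f"<{joint}_t_{n}>" for n in range(8)]
    return tokens


tokens = build_numbered_tokens()
WRAPPER_TAG_COUNT = 46   # taken from the table; the tags themselves are not enumerated
BASE_VOCAB_SIZE = 50_277  # base GPT-NeoX-20b size stated in this card

print(len(tokens))                                        # 93892
print(len(tokens) + WRAPPER_TAG_COUNT)                    # 93938
print(BASE_VOCAB_SIZE + len(tokens) + WRAPPER_TAG_COUNT)  # 144215
```

The three printed values match the per-category sum, the stated number of new tokens, and the stated total vocabulary size.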

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")

# All VLA tokens are atomic — never split by BPE
tok.encode("<seed2_1137>")    # -> [59908]
tok.encode("<pelvis_x_128>")  # -> [131151]
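
When building input text, it may help to generate token strings through small helpers that validate the ranges from the table, since a string such as `<pelvis_x_256>` is not in the vocabulary and would fall back to ordinary BPE pieces. The sketch below is illustrative only: `joint_position_token` and `seed2_span` are hypothetical helper names, and placing numbered `<seed2_N>` tokens between `<seed2>` and `</seed2>` is an assumption about how the wrapper tags are meant to be used.

```python
# Hypothetical helpers for formatting VLA token strings; not part of this repo.
JOINTS = (
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow",
    "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
)
AXES = ("x", "y", "z")


def joint_position_token(joint, axis, bin_index):
    """Format one joint position token, rejecting out-of-vocabulary values."""
    if joint not in JOINTS:
        raise ValueError(f"unknown joint: {joint!r}")
    if axis not in AXES:
        raise ValueError(f"unknown axis: {axis!r}")
    if not 0 <= bin_index <= 255:
        raise ValueError(f"bin index out of range 0-255: {bin_index}")
    return f"<{joint}_{axis}_{bin_index}>"


def seed2_span(codes):
    """Wrap Seed2 codes (0-8191) in <seed2> ... </seed2> (assumed layout)."""
    for code in codes:
        if not 0 <= code <= 8191:
            raise ValueError(f"Seed2 code out of range 0-8191: {code}")
    return "<seed2>" + "".join(f"<seed2_{c}>" for c in codes) + "</seed2>"


print(joint_position_token("pelvis", "x", 128))  # <pelvis_x_128>
print(seed2_span([1137, 5]))                     # <seed2><seed2_1137><seed2_5></seed2>
```

The resulting strings can be passed to `tok.encode(...)` like any other text.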

How it was created

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# new_vla_tokens: list of the 93,938 new token strings described in the table above
tok.add_tokens(new_vla_tokens, special_tokens=True)
tok.save_pretrained("tokenizer-vla-adaptive")

All tokens are registered via add_tokens(special_tokens=True), so the tokenizer matches each one as a whole before BPE is applied and never splits it into sub-word pieces.
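
To illustrate why registered tokens stay atomic, here is a toy sketch of the general idea: the input is first cut at every occurrence of an added token, and only the remaining plain-text pieces would go on to BPE. This is a simplified model written for this card, not the actual implementation in the `tokenizers` library; `ADDED_TOKENS` and `split_on_added_tokens` are hypothetical names, and the token set is a tiny sample.

```python
import re

# Tiny sample of added tokens, for illustration only.
ADDED_TOKENS = {"<seed2>", "</seed2>", "<seed2_1137>", "<pelvis_x_128>"}

# Longest tokens first so a longer token is preferred over any shorter prefix.
_pattern = re.compile(
    "(" + "|".join(
        re.escape(t) for t in sorted(ADDED_TOKENS, key=len, reverse=True)
    ) + ")"
)


def split_on_added_tokens(text):
    """Return (piece, is_added_token) pairs in reading order.

    Pieces flagged True map directly to a single ID; pieces flagged
    False are the only ones that would be handed to BPE.
    """
    pieces = []
    for piece in _pattern.split(text):
        if piece:  # re.split yields empty strings between adjacent matches
            pieces.append((piece, piece in ADDED_TOKENS))
    return pieces


for piece, is_added in split_on_added_tokens("walk <seed2><seed2_1137></seed2> now"):
    print(repr(piece), is_added)
```

In this sketch, `<seed2_1137>` is never seen by the sub-word stage, which is the behaviour the registration step is meant to provide.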
