VLA 1.7B - PAB-Spline Adaptive
A 1.7B parameter Vision-Language-Action model trained on the FineVideo-VLA dataset. This model generates interleaved video tokens (Seed2, Cosmos, AVC-LM) and adaptive PCHIP 3D human pose tokens from activity descriptions.
Key facts
| Property | Value |
|---|---|
| Architecture | OpenSci-Ref 1.7B (Llama-like, RMSNorm, SwiGLU, RoPE, QK-LayerNorm) |
| Parameters | 1.91B (including embeddings for the 144K vocab) |
| Vocab size | 144,256 (50,277 base GPT-NeoX-20b + 93,938 VLA tokens = 144,215, padded to a multiple of 128) |
| Tokenizer | EmpathicRobotics/tokenizer-vla-adaptive |
| Training data | 2.84B tokens from ~40K FineVideo YouTube videos |
| Training | 2,032 iters (~3 epochs), 64 nodes × 4 GH200 GPUs, WSD schedule |
| Final loss | Train: 1.476, Val: 1.501 (PPL 4.49), Test: 1.494 (PPL 4.45) |
| Precision | bf16 |
| Context length | 4,096 tokens |
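Several of these figures can be cross-checked with simple arithmetic. The sketch below is a sanity check only, not part of the training code. It assumes the vocabulary is rounded up to the next multiple of 128, that perplexity is exp(loss) with the loss in nats, and it takes the global batch size (1024) and sequence length (4096) from the Config section of this card.

```python
import math

# Vocabulary: base tokens plus VLA tokens, rounded up to a multiple of 128
# (assumed padding rule; it reproduces the 144,256 quoted in the table).
base_vocab, vla_tokens, multiple = 50_277, 93_938, 128
raw_vocab = base_vocab + vla_tokens
padded_vocab = math.ceil(raw_vocab / multiple) * multiple

# Perplexity is the exponential of the cross-entropy loss (in nats).
val_ppl = math.exp(1.501)
test_ppl = math.exp(1.494)

# Tokens seen in training: iterations x global batch size x sequence length.
iters, global_batch, seq_len = 2_032, 1_024, 4_096
tokens_seen = iters * global_batch * seq_len
epochs = tokens_seen / 2.84e9  # dataset size quoted in the table

print(f"padded vocab: {padded_vocab:,}")
print(f"val PPL: {val_ppl:.2f}, test PPL: {test_ppl:.2f}")
print(f"tokens seen: {tokens_seen / 1e9:.2f}B (~{epochs:.1f} epochs)")
```

By this count the run covers roughly three passes over the data, so the Epoch column in the loss-curve table, which ends near 0.99, appears to track the fraction of the run completed rather than dataset epochs.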
What this model does
Given an activity description, the model generates a multimodal token sequence:
### Context: Person chops vegetables on a cutting board.
<seed2_6750> <seed2_680> ... # 1 FPS semantic keyframes (Seed2, vocab 8192)
<cosmos_58567> <cosmos_56071> ... # 8-frame spatial tokens (Cosmos, vocab 64000)
<avclm_100> <avclm_200> ... # 8-frame H.264 BPE tokens (AVC-LM, vocab 8192)
<fps_30> <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128>
<pelvis_t_7> <pelvis_x_128> ... </pelvis>
<r_hip> <r_hip_t_0> <r_hip_x_115> ... </r_hip>
... (17 joints total)
Agent token format (Adaptive PCHIP)
Each 8-frame pose window uses variable-length self-describing tokens:
- <fps_30>: frame rate
- <joint> ... </joint>: 17 H36M joints, each with 2, 4, or 8 control points based on motion curvature
- <joint_t_N>: frame index (0-7) within the window
- <joint_x_N>, <joint_y_N>, <joint_z_N>: quantized coordinates (uint8)

Dequantization: coord_metres = N / 255.0 * 4.0 - 2.0 (range [-2, 2] m, precision ~15.7 mm)
Reconstruction: parse control points per joint, then apply PCHIP interpolation to obtain an (8, 17, 3) trajectory in metres.
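The format above can be illustrated with a minimal sketch. This is not the repository's decoder: the regex, the function names, and the assumption that each control point is emitted as t, x, y, z in that order are all illustrative. It handles a single 8-frame window, and uses SciPy's `PchipInterpolator` for the interpolation step.

```python
import re

# Matches control-point tokens such as <r_hip_t_0> or <pelvis_x_128>.
# Video tokens (<seed2_...>, <cosmos_...>) and <fps_30> do not match.
CP_TOKEN = re.compile(r"<([a-z_]+)_([txyz])_(\d+)>")


def dequantize(n: int) -> float:
    """Map a uint8 code (0-255) to metres in [-2, 2]."""
    return n / 255.0 * 4.0 - 2.0


def parse_control_points(text: str) -> dict:
    """Collect control points per joint from one 8-frame pose window.

    Assumes every control point is emitted as t, x, y, z in that order.
    """
    points = {}
    for joint, axis, value in CP_TOKEN.findall(text):
        if axis == "t":
            points.setdefault(joint, []).append({"t": int(value)})
        elif joint in points:
            points[joint][-1][axis] = dequantize(int(value))
    return points


def reconstruct_joint(control_points: list, n_frames: int = 8):
    """PCHIP-interpolate one joint's control points to an (n_frames, 3) array."""
    # Imported here so parsing still works without NumPy/SciPy installed.
    import numpy as np
    from scipy.interpolate import PchipInterpolator

    t = np.array([cp["t"] for cp in control_points], dtype=float)
    xyz = np.array([[cp["x"], cp["y"], cp["z"]] for cp in control_points])
    return PchipInterpolator(t, xyz, axis=0)(np.arange(n_frames))


def tokens_per_window(cps_per_joint: list) -> int:
    """Token count for one window: <fps_30>, plus per joint an opening tag,
    a closing tag, and four tokens (t, x, y, z) per control point."""
    return 1 + sum(2 + 4 * k for k in cps_per_joint)


window = (
    "<fps_30> <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128> "
    "<pelvis_t_7> <pelvis_x_140> <pelvis_y_128> <pelvis_z_120> </pelvis>"
)
pelvis = parse_control_points(window)["pelvis"]
print(pelvis[0]["t"], pelvis[1]["t"], round(pelvis[0]["x"], 4))
print(tokens_per_window([2] * 17), tokens_per_window([8] * 17))
```

Stacking `reconstruct_joint` outputs for all 17 joints in H36M order along a new middle axis would give the (8, 17, 3) trajectory. Under the assumed layout, 2 control points per joint give 171 tokens and 8 per joint give 579, which matches the adaptive range quoted in the comparison table later in this card.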
17 joints (H36M order)
| Index | Joint | Index | Joint | Index | Joint |
|---|---|---|---|---|---|
| 0 | pelvis | 7 | spine | 14 | r_shoulder |
| 1 | r_hip | 8 | thorax | 15 | r_elbow |
| 2 | r_knee | 9 | nose | 16 | r_wrist |
| 3 | r_ankle | 10 | head_top | | |
| 4 | l_hip | 11 | l_shoulder | | |
| 5 | l_knee | 12 | l_elbow | | |
| 6 | l_ankle | 13 | l_wrist | | |
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# device_map="auto" needs the accelerate package installed.
model = AutoModelForCausalLM.from_pretrained(
    "EmpathicRobotics/vla-1.7b-pab-spline-adaptive",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")

# Prompt: activity description, a few video tokens, then the start of a pose window.
prompt = (
    "### Context: Person raises both arms above head.\n"
    "<seed2_3758> <seed2_2157> <cosmos_58567> "
    "<fps_30> <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128>"
)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; the model continues the token sequence from the prompt.
output = model.generate(input_ids, max_new_tokens=500, do_sample=False)
print(tokenizer.decode(output[0]))
Decoding agent tokens to 3D poses
# pip install scipy
from decode_agent_tokens import decode # from the 3d-human-pose repo
generated_text = tokenizer.decode(output[0])
trajectories = decode(generated_text) # list of (8, 17, 3) ndarrays
Training details
Loss curve
| Iter | Loss | LR | Epoch |
|---|---|---|---|
| 50 | 6.158 | 1.0e-3 | 0.02 |
| 100 | 3.927 | 2.0e-3 | 0.05 |
| 200 | 2.982 | 4.0e-3 | 0.10 |
| 500 | 2.070 | 4.0e-3 | 0.25 |
| 1000 | 1.672 | 4.0e-3 | 0.49 |
| 1500 | 1.555 | 4.0e-3 | 0.74 |
| 2000 | 1.476 | 3.2e-4 | 0.99 |
| 2032 (val) | 1.501 | - | - |
| 2032 (test) | 1.494 | - | - |
Config
- Optimizer: AdamW (β1=0.9, β2=0.95, wd=0.05, ε=1e-8, clip=1.0)
- Schedule: WSD (200 warmup, 400 linear decay at end, peak LR 4e-3)
- Batch: GBS 1024, MBS 4, seq_len 4096
- Infrastructure: 64 nodes × 4 GH200 GPUs (256 total), ~287 TFLOP/s/GPU
- Wall time: ~35 minutes
- Framework: Megatron-LM via oellm-autoexp
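The WSD (warmup-stable-decay) schedule can be sketched as a small function. This is an illustrative reconstruction, not the Megatron-LM implementation: the function name and parameters are hypothetical, and the decay-to-zero floor is an assumption inferred from the LR column of the loss-curve table (3.2e-4 at iteration 2000).

```python
def wsd_lr(step: int, total_steps: int = 2032, warmup: int = 200,
           decay: int = 400, peak: float = 4e-3, floor: float = 0.0) -> float:
    """Learning rate at `step` for a warmup-stable-decay schedule.

    Linear warmup to `peak`, a constant plateau, then a linear decay to
    `floor` over the last `decay` steps. `floor=0.0` is an assumption.
    """
    if step < warmup:
        return peak * step / warmup
    decay_start = total_steps - decay
    if step <= decay_start:
        return peak
    remaining = max(total_steps - step, 0) / decay
    return floor + (peak - floor) * remaining


for step in (50, 100, 200, 1500, 2000):
    print(step, f"{wsd_lr(step):.1e}")
```

With these defaults the function reproduces the LR values listed in the loss-curve table at iterations 50, 100, 200, 1500, and 2000.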
Data pipeline
FineVideo (~40K YouTube videos) → Seed2/Cosmos/AVC-LM tokenization → HRNet 2D pose → MotionBERT 3D lift → kinematics → YOLO cleaning → adaptive PCHIP tokenization → merge → flatten → Megatron tokenization
See EmpathicRobotics/FineVideo-Phase7-Flattened for the training data.
Differences from first model
The previous model (EmpathicRobotics/vla-1.7b-pab-spline-25b-test) had a broken tokenizer: VLA tokens like <seed2_1137> were split into 7 sub-pieces by BPE. This model fixes that:
| | Previous (25b-test) | This model (adaptive) |
|---|---|---|
| Tokenizer | Broken (BPE splits VLA tokens) | Fixed (add_tokens(special_tokens=True)) |
| Agent format | Fixed 256 tokens per window | Adaptive 171-579 tokens (PCHIP, variable CPs) |
| Agent encoding | Scale + anchor + motion integers | Self-describing <joint_t_N> <joint_x_N> |
| Token atomicity | <seed2_1137> → 7 sub-pieces | <seed2_1137> → 1 token |
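Token atomicity can be checked directly on any tokenizer. The helpers below are a hypothetical sketch, not code from this repository: they accept any object exposing the Hugging Face `encode` and `add_tokens` methods, and the function names are illustrative.

```python
def count_pieces(tokenizer, token: str) -> int:
    """Number of ids the tokenizer emits for `token`; 1 means it is atomic."""
    return len(tokenizer.encode(token, add_special_tokens=False))


def add_atomic_tokens(tokenizer, tokens) -> int:
    """Register `tokens` so BPE never splits them; returns how many were new."""
    return tokenizer.add_tokens(list(tokens), special_tokens=True)


# Example with a real tokenizer (commented out because it downloads files):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")
# print(count_pieces(tok, "<seed2_1137>"))
```

When new tokens are added to a base tokenizer this way, the model's embedding matrix also has to be resized to the new vocabulary size before training.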
Limitations
- Small dataset (2.84B tokens, ~3 epochs): the model memorizes well but generalises poorly to novel prompts
- No vision encoder: generates tokens from text descriptions only, not from actual video frames
- Validation run: proves the pipeline works end-to-end; not intended as a final model
- Next steps: Rich augmentation pipeline (4x data multiplier), additional datasets (SenseNova-SI-8M, stera-10m), Qwen3 architecture migration
Citation
@misc{empathicrobotics2025vla,
title={PAB-Spline VLA: Adaptive PCHIP Tokenization for Vision-Language-Action Models},
author={EmpathicRobotics},
year={2025},
url={https://huggingface.co/EmpathicRobotics/vla-1.7b-pab-spline-adaptive}
}