VLA 1.7B — PAB-Spline Adaptive

A 1.7B-parameter Vision-Language-Action (VLA) model trained on the FineVideo-VLA dataset. Given an activity description, it generates interleaved video tokens (Seed2, Cosmos, AVC-LM) and adaptive PCHIP 3D human pose tokens.

Key facts


Architecture	OpenSci-Ref 1.7B (Llama-like, RMSNorm, SwiGLU, RoPE, QK-LayerNorm)
Parameters	1.91B (including embeddings for 144K vocab)
Vocab size	144,256 (50,277 base GPT-NeoX-20b + 93,938 VLA tokens = 144,215, padded up to a multiple of 128)
Tokenizer	EmpathicRobotics/tokenizer-vla-adaptive
Training data	2.84B tokens from ~40K FineVideo YouTube videos
Training	2,032 iters (~3 epochs), 64 nodes × 4 GH200 GPUs, WSD schedule
Final loss	Train: 1.476, Val: 1.501 (PPL 4.49), Test: 1.494 (PPL 4.45)
Precision	bf16
Context length	4,096 tokens
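
The vocabulary size and perplexity figures in the table can be checked with a few lines of arithmetic. This is a minimal sketch: it assumes "padded to 128" means rounding up to the next multiple of 128 and that the losses are mean cross-entropy in nats. The helper names are illustrative and do not come from the training code.

```python
import math

BASE_VOCAB = 50_277   # GPT-NeoX-20b base vocabulary (from the table)
VLA_TOKENS = 93_938   # added VLA tokens (from the table)
PAD_MULTIPLE = 128    # assumption: vocab is rounded up to a multiple of 128


def padded_vocab_size(n_tokens: int, multiple: int = PAD_MULTIPLE) -> int:
    """Round n_tokens up to the next multiple of `multiple`."""
    return -(-n_tokens // multiple) * multiple


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)


raw_vocab = BASE_VOCAB + VLA_TOKENS
vocab_size = padded_vocab_size(raw_vocab)
val_ppl = round(perplexity(1.501), 2)
test_ppl = round(perplexity(1.494), 2)

print(raw_vocab, vocab_size)   # 144215 144256
print(val_ppl, test_ppl)       # 4.49 4.45
```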

What this model does

Given an activity description, the model generates a multimodal token sequence:

### Context: Person chops vegetables on a cutting board.
<seed2_6750> <seed2_680> ...          # 1 FPS semantic keyframes (Seed2, vocab 8192)
<cosmos_58567> <cosmos_56071> ...     # 8-frame spatial tokens (Cosmos, vocab 64000)
<avclm_100> <avclm_200> ...           # 8-frame H.264 BPE tokens (AVC-LM, vocab 8192)
<fps_30> <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128>
         <pelvis_t_7> <pelvis_x_128> ... </pelvis>
<r_hip> <r_hip_t_0> <r_hip_x_115> ... </r_hip>
... (17 joints total)
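
To inspect a generated sequence, the video tokens can be grouped by family with a regular expression. The sketch below assumes the `<family_index>` spelling shown in the example above and uses the vocabulary sizes quoted in its comments. `split_video_streams` is a hypothetical helper, not part of any released package.

```python
import re
from collections import defaultdict

# Vocabulary sizes as quoted in the example above.
VIDEO_VOCAB = {"seed2": 8192, "cosmos": 64000, "avclm": 8192}
VIDEO_TOKEN = re.compile(r"<(seed2|cosmos|avclm)_(\d+)>")


def split_video_streams(text: str) -> dict:
    """Group video token indices by family, preserving their order."""
    streams = defaultdict(list)
    for family, index in VIDEO_TOKEN.findall(text):
        value = int(index)
        if value >= VIDEO_VOCAB[family]:
            raise ValueError(f"{family} index {value} is out of range")
        streams[family].append(value)
    return dict(streams)


sample = (
    "### Context: Person chops vegetables on a cutting board.\n"
    "<seed2_6750> <seed2_680> <cosmos_58567> <cosmos_56071> "
    "<avclm_100> <avclm_200> <fps_30> <pelvis> <pelvis_t_0>"
)
streams = split_video_streams(sample)
print(streams)
```

Pose tokens such as `<pelvis_t_0>` are ignored here because they do not match any of the three video families.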

Agent token format (Adaptive PCHIP)

Each 8-frame pose window uses variable-length self-describing tokens:

<fps_30> — frame rate
<joint> ... </joint> — 17 H36M joints, each with 2, 4, or 8 control points based on motion curvature
<joint_t_N> — frame index (0-7) within the window
<joint_x_N>, <joint_y_N>, <joint_z_N> — quantized coordinates (uint8)
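
The format above implies a simple token budget per pose window. As a sketch, assume each window holds one `<fps_*>` token and, per joint, an opening tag, a closing tag, and four tokens (t, x, y, z) per control point. Under that assumption the count reproduces the 171-579 range quoted in the comparison table further down. `window_token_count` is an illustrative name.

```python
N_JOINTS = 17
TOKENS_PER_CONTROL_POINT = 4   # <joint_t_N>, <joint_x_N>, <joint_y_N>, <joint_z_N>


def window_token_count(control_points_per_joint) -> int:
    """Tokens in one pose window: one fps token plus tags and control points."""
    per_joint = (2 + TOKENS_PER_CONTROL_POINT * n for n in control_points_per_joint)
    return 1 + sum(per_joint)


shortest = window_token_count([2] * N_JOINTS)   # every joint uses 2 control points
longest = window_token_count([8] * N_JOINTS)    # every joint uses 8 control points
print(shortest, longest)   # 171 579
```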

Dequantization: coord_metres = N / 255.0 * 4.0 - 2.0 (range [-2, 2] m, quantization step 4/255 m ≈ 15.7 mm)
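
The dequantization formula translates directly into code. The `quantize` function below is an assumed inverse (clamp, then round to the nearest level) added only for illustration. The card does not state how the encoder rounds.

```python
def dequantize(n: int) -> float:
    """Map a uint8 level N to metres: N / 255 * 4 - 2."""
    if not 0 <= n <= 255:
        raise ValueError("level must be in [0, 255]")
    return n / 255.0 * 4.0 - 2.0


def quantize(metres: float) -> int:
    """Hypothetical inverse: clamp to [-2, 2] m and round to the nearest level."""
    clamped = min(max(metres, -2.0), 2.0)
    return round((clamped + 2.0) / 4.0 * 255.0)


step_mm = (dequantize(1) - dequantize(0)) * 1000.0
print(dequantize(0), dequantize(255))   # -2.0 2.0
print(round(step_mm, 1))                # 15.7
```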

Reconstruction: parse control points per joint, apply PCHIP interpolation → (8, 17, 3) trajectory in metres.
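
The reconstruction step can be sketched with SciPy's `PchipInterpolator`. This is not the `decode` function from the project repository. It is a minimal illustration that assumes each control point is spelled as consecutive t, x, y, z tokens (as in the example above) and that every joint block holds at least two control points. `parse_window`, `reconstruct_window`, and `demo_window` are hypothetical names.

```python
import re

import numpy as np
from scipy.interpolate import PchipInterpolator

JOINTS = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow",
    "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]
N_FRAMES = 8


def dequantize(levels):
    """uint8 levels to metres; works on scalars and NumPy arrays."""
    return levels / 255.0 * 4.0 - 2.0


def parse_window(text):
    """Return {joint: [(t, x, y, z), ...]} for one pose window."""
    points = {}
    for joint in JOINTS:
        name = re.escape(joint)
        block = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.S)
        if block is None:
            raise ValueError(f"missing joint block: {joint}")
        found = re.findall(
            rf"<{name}_t_(\d+)>\s*<{name}_x_(\d+)>\s*"
            rf"<{name}_y_(\d+)>\s*<{name}_z_(\d+)>",
            block.group(1),
        )
        points[joint] = [tuple(int(v) for v in cp) for cp in found]
    return points


def reconstruct_window(text):
    """Interpolate control points to an (8, 17, 3) trajectory in metres."""
    control_points = parse_window(text)
    frames = np.arange(N_FRAMES)
    trajectory = np.empty((N_FRAMES, len(JOINTS), 3))
    for j, joint in enumerate(JOINTS):
        cps = np.array(sorted(control_points[joint]), dtype=float)
        xyz = dequantize(cps[:, 1:])
        trajectory[:, j, :] = PchipInterpolator(cps[:, 0], xyz, axis=0)(frames)
    return trajectory


def demo_window():
    """Synthetic window: every joint moves along x from -2 m to 2 m."""
    parts = ["<fps_30>"]
    for joint in JOINTS:
        parts.append(
            f"<{joint}> <{joint}_t_0> <{joint}_x_0> <{joint}_y_128> <{joint}_z_128> "
            f"<{joint}_t_7> <{joint}_x_255> <{joint}_y_128> <{joint}_z_128> </{joint}>"
        )
    return " ".join(parts)


trajectory = reconstruct_window(demo_window())
print(trajectory.shape)   # (8, 17, 3)
```

PCHIP is shape-preserving, so the interpolated trajectory does not overshoot between control points, which a plain cubic spline could do.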

17 joints (H36M order)

Index	Joint	Index	Joint	Index	Joint
0	pelvis	7	spine	14	r_shoulder
1	r_hip	8	thorax	15	r_elbow
2	r_knee	9	nose	16	r_wrist
3	r_ankle	10	head_top
4	l_hip	11	l_shoulder
5	l_knee	12	l_elbow
6	l_ankle	13	l_wrist

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "EmpathicRobotics/vla-1.7b-pab-spline-adaptive",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# The tokenizer is published in its own repository.
tokenizer = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")

prompt = (
    "### Context: Person raises both arms above head.\n"
    "<seed2_3758> <seed2_2157> <cosmos_58567> "
    "<fps_30> <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128>"
)
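input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
# Greedy decoding. VLA tokens are registered as special tokens, so decode with
# skip_special_tokens=False (the default) to keep them in the output text.
output = model.generate(input_ids, max_new_tokens=500, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=False))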

Decoding agent tokens to 3D poses

# pip install scipy
from decode_agent_tokens import decode  # from the 3d-human-pose repo

generated_text = tokenizer.decode(output[0])
trajectories = decode(generated_text)  # list of (8, 17, 3) ndarrays

Training details

Loss curve

Iter	Loss	LR	Epoch
50	6.158	1.0e-3	0.02
100	3.927	2.0e-3	0.05
200	2.982	4.0e-3	0.10
500	2.070	4.0e-3	0.25
1000	1.672	4.0e-3	0.49
1500	1.555	4.0e-3	0.74
2000	1.476	3.2e-4	0.99
2032 (val)	1.501	—	—
2032 (test)	1.494	—	—
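
The LR column is consistent with the warmup-stable-decay (WSD) schedule listed under Config. The sketch below assumes linear warmup from zero, a linear decay to zero over the final 400 iterations, and 2,032 total iterations. The actual Megatron-LM settings (for example a non-zero minimum LR) may differ, and `wsd_lr` is an illustrative name.

```python
import math


def wsd_lr(iteration: int, peak: float = 4e-3, warmup: int = 200,
           decay: int = 400, total: int = 2032) -> float:
    """Learning rate at `iteration` under an assumed linear WSD schedule."""
    decay_start = total - decay
    if iteration <= warmup:
        return peak * iteration / warmup          # linear warmup from zero
    if iteration <= decay_start:
        return peak                               # stable phase at peak LR
    return peak * (total - iteration) / decay     # linear decay to zero


for it in (50, 100, 200, 500, 1000, 1500, 2000):
    print(it, f"{wsd_lr(it):.1e}")
```

Under these assumptions the function returns 1.0e-3, 2.0e-3, and 4.0e-3 at iterations 50, 100, and 200, stays at 4.0e-3 through iteration 1,500, and gives 3.2e-4 at iteration 2,000, matching the table.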

Config

Optimizer: AdamW (β1=0.9, β2=0.95, wd=0.05, ε=1e-8, clip=1.0)
Schedule: WSD (200 warmup, 400 linear decay at end, peak LR 4e-3)
Batch: GBS 1024, MBS 4, seq_len 4096
Infrastructure: 64 nodes × 4 GH200 GPUs (256 total), ~287 TFLOP/s/GPU
Wall time: ~35 minutes
Framework: Megatron-LM via oellm-autoexp
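
The batch settings above fix the token budget of the run. The following arithmetic sketch uses only figures quoted in this card. The reading of the per-GPU batch as pure data parallelism is an assumption, not something the card states.

```python
GBS = 1024                 # global batch size, in sequences
MBS = 4                    # micro batch size per GPU
SEQ_LEN = 4096             # tokens per sequence
ITERS = 2032               # training iterations
GPUS = 64 * 4              # 64 nodes x 4 GH200 GPUs
DATASET_TOKENS = 2.84e9    # training tokens

tokens_per_iter = GBS * SEQ_LEN
total_tokens = tokens_per_iter * ITERS
epochs = total_tokens / DATASET_TOKENS
sequences_per_gpu = GBS // GPUS

print(tokens_per_iter)        # 4194304
print(total_tokens)           # 8522825728
print(round(epochs, 2))       # 3.0
print(sequences_per_gpu)      # 4
```

About 8.52B tokens over a 2.84B-token dataset gives the ~3 epochs quoted in the key facts. By the same count, iteration 2,000 is roughly 2.95 passes over the data, so the last column of the loss-curve table appears to track iteration / 2,032 (fraction of the run) rather than dataset passes. Four sequences per GPU per iteration equals the micro batch size, which would be consistent with data parallelism across all 256 GPUs and no gradient accumulation.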

Data pipeline

FineVideo (~40K YouTube videos) → Seed2/Cosmos/AVC-LM tokenization → HRNet 2D pose → MotionBERT 3D lift → kinematics → YOLO cleaning → adaptive PCHIP tokenization → merge → flatten → Megatron tokenization

See EmpathicRobotics/FineVideo-Phase7-Flattened for the training data.

Differences from first model

The previous model (EmpathicRobotics/vla-1.7b-pab-spline-25b-test) had a broken tokenizer — VLA tokens like <seed2_1137> were split into 7 sub-pieces by BPE. This model fixes that:

	Previous (25b-test)	This model (adaptive)
Tokenizer	Broken (BPE splits VLA tokens)	Fixed (`add_tokens(special_tokens=True)`)
Agent format	Fixed 256 tokens per window	Adaptive 171-579 tokens (PCHIP, variable CPs)
Agent encoding	Scale + anchor + motion integers	Self-describing `<joint_t_N> <joint_x_N>`
Token atomicity	❌ `<seed2_1137>` → 7 sub-pieces	✅ `<seed2_1137>` → 1 token

Limitations

Small dataset (2.84B tokens, ~3 epochs): the model memorizes well but generalizes poorly to novel prompts
No vision encoder: the model generates tokens from text descriptions only, not from actual video frames
Validation run: this release shows that the pipeline works end-to-end and is not intended as a final model
Next steps: a rich augmentation pipeline (4x data multiplier), additional datasets (SenseNova-SI-8M, stera-10m), and migration to the Qwen3 architecture

Citation

@misc{empathicrobotics2025vla,
  title={PAB-Spline VLA: Adaptive PCHIP Tokenization for Vision-Language-Action Models},
  author={EmpathicRobotics},
  year={2025},
  url={https://huggingface.co/EmpathicRobotics/vla-1.7b-pab-spline-adaptive}
}
