PulseVLA-LIBERO (0.5B) — from-scratch action expert
A Vision-Language-Action policy for the LIBERO
robot-manipulation benchmark. It uses a SmolVLA-style architecture: a frozen, pretrained
SmolVLM2-500M-Instruct
backbone + a flow-matching action expert trained from scratch on the full
LIBERO training set. The released checkpoint is self-contained (backbone +
expert in one file, 557.6M params) and runs with minimal, pure-PyTorch code — no
transformers dependency.
This repo ships only the weights + the minimal code to run and evaluate them. It does not contain the training pipeline.
Results
Success rate (pc_success, % of episodes solved), 10 tasks ×
10 episodes per suite (100 rollouts per suite, 400 in total), n_action_steps=10, --seed 1:
| Suite | pc_success |
|---|---|
| libero_object | 97.0 |
| libero_goal | 88.0 |
| libero_spatial | 82.0 |
| libero_10 (long-horizon) | 60.0 |
| Average | 81.8 |
These numbers are reproducible with the bundled eval_libero.py --seed 1, to
within run-to-run noise (see Caveat 1 — fixed seed is approximately, not
bit-exactly, reproducible on GPU). Please read
Caveats before comparing to any other paper or
checkpoint — the eval protocol and settings matter a lot here.
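As a quick sanity check on the table, the Average row can be recomputed from the four per-suite rates. The sketch below only uses the numbers printed above; because every suite contributes the same 100 episodes, the unweighted mean over suites equals the pooled success rate.

```python
# Per-suite success rates (%) copied from the results table above.
pc_success = {
    "libero_object": 97.0,
    "libero_goal": 88.0,
    "libero_spatial": 82.0,
    "libero_10": 60.0,
}

episodes_per_suite = 10 * 10  # 10 tasks x 10 episodes
total_rollouts = episodes_per_suite * len(pc_success)

# Unweighted mean over suites; equal episode counts make this the pooled rate too.
average = sum(pc_success.values()) / len(pc_success)

print(f"total rollouts: {total_rollouts}")
print(f"average pc_success: {average:.2f}")
```

The mean comes out at 81.75, which the table reports rounded to 81.8.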
Files in this repo
| File | What it is |
|---|---|
| model.safetensors | the weights (557.6M; frozen SmolVLM2 backbone + trained expert/projections) |
| config.json | architecture config (rebuilds the model exactly) |
| norm_stats.safetensors | state/action normalization stats the model was trained with (load-bearing for replication) |
| tokenizer.json | the SmolVLM2-500M-Instruct tokenizer (bundled; Apache-2.0) |
| smolvla/ | minimal, transformers-free inference + eval code (pure PyTorch) |
| predict_example.py | minimal load + single forward pass (no simulator) |
| eval_libero.py | standalone eval that reproduces the table above (needs the LIBERO sim) |
| requirements.txt | pinned dependency versions |
Quickstart — inference (minimal, no simulator)
```bash
pip install -r requirements.txt  # torch, numpy, safetensors, tokenizers
python predict_example.py
```

```python
import torch
from safetensors.torch import load_file

from smolvla import SmolVLA, SmolVLAProcessor, Tokenizer, load_lerobot_norm_stats, load_smolvla_config
from smolvla.types import Obs

# Rebuild the architecture from config.json and load the self-contained checkpoint.
cfg = load_smolvla_config("config.json")
model = SmolVLA(cfg).float().eval()
model.load_state_dict(load_file("model.safetensors"))

# The processor takes the tokenizer and the bundled normalization stats.
stats = load_lerobot_norm_stats("norm_stats.safetensors")
proc = SmolVLAProcessor(cfg, Tokenizer("tokenizer.json", max_length=cfg.tokenizer_max_length), stats, device="cpu")

# Dummy observation: two 512x512 camera views, an 8-D state, one instruction.
obs = Obs(
    images={"image": torch.rand(1, 3, 512, 512), "image2": torch.rand(1, 3, 512, 512)},
    state=torch.zeros(1, 8),
    task=["pick up the black bowl and place it on the plate"],
)
chunk = model.predict_action_chunk(proc.to_model_input(obs))  # (1, 50, 32) normalized
actions = proc.postprocess_action(chunk)                      # (1, 50, 7) raw
```
predict_action_chunk returns a chunk of chunk_size=50 actions; at eval you
execute the first n_action_steps (we use 10) before re-planning.
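The chunk-then-re-plan loop described above can be sketched as follows. This is a minimal illustration, not the logic in eval_libero.py: `rollout`, `toy_plan_chunk`, and `ToyEnv` are hypothetical names, and the toy planner and environment stand in for `predict_action_chunk` and the LIBERO simulator so the sketch runs without either.

```python
from typing import Callable, List, Tuple

Action = List[float]


def rollout(
    plan_chunk: Callable[[object], List[Action]],
    step_env: Callable[[Action], Tuple[object, bool]],
    first_obs: object,
    n_action_steps: int = 10,
    max_steps: int = 280,
) -> Tuple[int, int, bool]:
    """Execute the first n_action_steps of each chunk, then re-plan.

    Returns (env_steps, replans, done).
    """
    obs, steps, replans, done = first_obs, 0, 0, False
    while steps < max_steps and not done:
        chunk = plan_chunk(obs)  # e.g. 50 actions from the policy
        replans += 1
        for action in chunk[:n_action_steps]:  # the rest of the chunk is discarded
            obs, done = step_env(action)
            steps += 1
            if done or steps >= max_steps:
                break
    return steps, replans, done


# Toy stand-ins (hypothetical, not repo code).
def toy_plan_chunk(obs: object, chunk_size: int = 50, action_dim: int = 7) -> List[Action]:
    return [[0.0] * action_dim for _ in range(chunk_size)]


class ToyEnv:
    """Reports success once `solve_at` steps have been taken."""

    def __init__(self, solve_at: int) -> None:
        self.t = 0
        self.solve_at = solve_at

    def step(self, action: Action) -> Tuple[int, bool]:
        self.t += 1
        return self.t, self.t >= self.solve_at


env = ToyEnv(solve_at=25)
steps, replans, done = rollout(toy_plan_chunk, env.step, first_obs=0)
print(steps, replans, done)
```

With the toy environment solving at step 25, the loop plans three chunks and stops after 25 environment steps; 40 of every 50 predicted actions are never executed, which is the price of re-planning often.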
How to verify (reproduce the table)
Reproducing the numbers needs the public LIBERO simulator (the environment) — this is the only heavy dependency, and it is not our code:
```bash
# 1. get the HF CLI, then download this repo (it contains requirements.txt + the code)
pip install -U "huggingface_hub[cli]"
hf download verapulse/pulsevla-libero-0.5b --local-dir ./pulsevla-libero
cd ./pulsevla-libero

# 2. deps: inference (requirements.txt) + the public LIBERO sim. NOTE: lerobot 0.5.2
#    is a main-branch dev version NOT on PyPI (latest PyPI is 0.5.1); install the exact
#    commit we used (pulls robosuite + mujoco + bddl):
pip install -r requirements.txt
pip install "lerobot[libero] @ git+https://github.com/huggingface/lerobot.git@d1b1c5c8cff5e1f637495e1667a1d6c7c5258f3b"

# 3. run the canonical eval, once per suite (headless rendering => MUJOCO_GL=egl)
MUJOCO_GL=egl python eval_libero.py --task libero_object --seed 1
MUJOCO_GL=egl python eval_libero.py --task libero_goal --seed 1
MUJOCO_GL=egl python eval_libero.py --task libero_spatial --seed 1
MUJOCO_GL=egl python eval_libero.py --task libero_10 --seed 1
```
Each run prints per-task and overall pc_success. With --seed 1 and the defaults
(--n_action_steps 10 --n_envs 1) you should get the table above, to within the run-to-run noise described in Caveat 1.
The exact protocol behind the numbers (all baked into eval_libero.py's
defaults):
- n_action_steps = 10, n_envs = 1, 10 tasks × 10 episodes per suite;
- normalization = the bundled norm_stats.safetensors (these are the HuggingFaceVLA/libero dataset-metadata stats the model trained with; no dataset download needed for eval);
- per-suite max episode length: spatial/object 280, goal 300, libero_10 520;
- --seed 1 reseeds the policy's flow-matching noise per episode (see Caveats below).
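For reference, the protocol values listed above can be collected in one place. The sketch below is illustrative only: the constant names and the `max_replans` helper are hypothetical and are not read from eval_libero.py. It also derives how many policy calls an episode can need at most when it runs to the step limit and re-plans every n_action_steps.

```python
import math

# Values copied from the protocol list above; the layout itself is illustrative.
N_ACTION_STEPS = 10
N_ENVS = 1
TASKS_PER_SUITE = 10
EPISODES_PER_TASK = 10
MAX_EPISODE_STEPS = {
    "libero_spatial": 280,
    "libero_object": 280,
    "libero_goal": 300,
    "libero_10": 520,
}


def max_replans(suite: str) -> int:
    """Upper bound on policy calls in one episode that runs to the step limit."""
    return math.ceil(MAX_EPISODE_STEPS[suite] / N_ACTION_STEPS)


for suite, limit in MAX_EPISODE_STEPS.items():
    print(f"{suite}: max {limit} steps, at most {max_replans(suite)} re-plans")
```

A failed libero_10 episode therefore costs up to 52 policy calls, against 28 for spatial/object and 30 for goal, which is worth knowing when budgeting eval time.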
Caveats (read before comparing)
These are not fine print — they determine whether your number will match ours.
1. The policy is stochastic, and --seed 1 is only approximately reproducible on GPU. It samples flow-matching noise from the global RNG on every re-plan. Unseeded, pc_success varies ±5–10 pts per suite at 100 episodes. --seed 1 fixes that noise and gets you the table within a few points, but it is not bit-exact on GPU: residual CUDA / MuJoCo nondeterminism can flip a few near-boundary episodes from run to run. In our own clean-room checks, object, spatial and libero_10 reproduced exactly across runs, while goal moved between 84% and 88% (one task flipping 4 episodes). So expect to land within a couple of points of each number, sometimes exactly. (Full bit-determinism would additionally need torch.use_deterministic_algorithms(True), CUBLAS_WORKSPACE_CONFIG=:4096:8, and a fixed-seed simulator; we do not enforce these.) A different seed, or unseeded, will likewise land a few points off. All of this is expected, not a bug.
2. Numbers are only comparable at --n_envs 1. Batched rollouts (--n_envs > 1, or --vec async for speed) draw the noise in a different shape, so they produce a different but equally valid sample, not the seed-1 table. Use --n_envs 1 to match us; use --vec async only when you want a faster estimate and don't need exact reproduction.
3. lerobot version matters: install the exact commit. LIBERO/lerobot APIs move. We used lerobot at commit d1b1c5c8 (its dev version string is 0.5.2, which is not on PyPI; the latest PyPI release is 0.5.1). Install the pinned commit shown in step 2 of the verification instructions; other versions may change env construction or eval batching and shift numbers.
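Why batching changes the sample even at a fixed seed can be shown with a toy model. The sketch below is not how eval_libero.py or PyTorch draws noise: it uses Python's `random.Random` as a stand-in for the global RNG, and all function names are hypothetical. It only illustrates the general principle that one seeded stream, consumed in differently sized draws, hands a given environment different numbers.

```python
import random


def draw(rng: random.Random, n: int) -> list:
    """Stand-in for sampling one noise tensor with n elements."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def single_env_noise(seed: int, replans: int, n: int) -> list:
    """Noise seen by the only environment when n_envs == 1."""
    rng = random.Random(seed)  # reseeded at the start of the episode
    return [draw(rng, n) for _ in range(replans)]


def batched_env0_noise(seed: int, replans: int, n: int, n_envs: int) -> list:
    """Noise seen by environment 0 when one draw serves n_envs environments."""
    rng = random.Random(seed)
    out = []
    for _ in range(replans):
        batch = draw(rng, n * n_envs)  # one larger draw per re-plan
        out.append(batch[:n])          # environment 0 takes the first slice
    return out


a = single_env_noise(seed=1, replans=2, n=4)
b = single_env_noise(seed=1, replans=2, n=4)
c = batched_env0_noise(seed=1, replans=2, n=4, n_envs=2)

print("same seed, same grouping:", a == b)
print("same seed, batched, second re-plan:", a[1] == c[1])
```

The first comparison is true and the second is false: from the second re-plan onward, environment 0 receives noise that the single-environment run would have used later. Both runs are valid samples from the same policy, which is why batched results are comparable in distribution but not episode by episode.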
Model details
- Backbone: SmolVLM2-500M-Instruct (SigLIP vision tower + pixel-shuffle connector + SmolLM2 decoder), frozen, all 32 layers.
- Action expert: flow-matching head (10 integration steps), interleaved with the VLM via cross-attention; trained from scratch. ~97.5M trainable params of 557.6M.
- I/O: 2 camera views (agentview + wrist, 512×512), 8-D proprio state ([eef_pos(3), quat→axis-angle(3), gripper_qpos(2)]), language instruction → chunk of 7-DoF actions. State/action normalized with the bundled stats; images are flipped 180° and mapped to SigLIP range (handled inside smolvla/).
- Training: full LIBERO training set (all 4 suites), ~30k steps, AdamW, cosine LR, bf16, frozen backbone. The released checkpoint is the final step.
- Cheap to train on a home GPU. Because only the 97.5M-param expert is trained (the backbone is frozen and stays out of the backward graph), the whole run fits in about 16 GB of VRAM at batch size 32 (bf16; about 22 GB at batch size 48) — a single consumer GPU. The from-scratch run above took about 7.5 h on one RTX 5090, i.e. a few kWh of electricity — well under $10 at typical home rates. No multi-GPU or cloud cluster required.
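To make the "10 integration steps" of the flow-matching head concrete, here is a minimal Euler-integration sketch. Several things in it are assumptions rather than facts from this card: the time convention (noise at t=1, action at t=0), the straight-line path between them, and the oracle velocity field, which replaces the learned action expert so the example runs on its own. The function names and the target action are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector], noise: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=1 (noise) down to t=0 (action)."""
    x = list(noise)
    dt = -1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 + i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(noise: Vector, target: Vector) -> Callable[[Vector, float], Vector]:
    """Exact velocity of the straight path x_t = t * noise + (1 - t) * target.

    A trained expert would predict this from the observation; here it is given.
    """
    v = [n - a for n, a in zip(noise, target)]
    return lambda x, t: v


target = [0.3, -0.1, 0.0, 0.5, -0.4, 0.2, 1.0]  # hypothetical 7-DoF action
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]

action = euler_sample(make_oracle_velocity(noise, target), noise, num_steps=10)
print(max(abs(a - t) for a, t in zip(action, target)))
```

With an exact, constant velocity the integration lands on the target up to floating-point error. A learned velocity is only approximate, so the number of steps trades accuracy against one expert forward pass per step.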
Intended use & limitations
Research artifact for simulated LIBERO manipulation. Not validated for real robots. Evaluation is stochastic (see caveats). Performance outside LIBERO's task distribution / the 2-camera + 8-D-state observation format is unknown.
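To check whether an observation matches the 8-D state format this checkpoint expects, the layout given under Model details can be sketched as below. This is an illustration under assumptions, not the code in smolvla/: the function names are hypothetical, and the quaternion is assumed to be in (x, y, z, w) order, which this card does not specify.

```python
import math
from typing import List, Sequence


def quat_to_axis_angle(quat_xyzw: Sequence[float]) -> List[float]:
    """Convert a unit quaternion (x, y, z, w; ordering is an assumption) to axis-angle."""
    x, y, z, w = quat_xyzw
    w = max(-1.0, min(1.0, w))  # guard acos against rounding error
    den = math.sqrt(1.0 - w * w)
    if math.isclose(den, 0.0):
        return [0.0, 0.0, 0.0]  # no rotation
    angle = 2.0 * math.acos(w)
    return [x * angle / den, y * angle / den, z * angle / den]


def build_state(eef_pos: Sequence[float], eef_quat_xyzw: Sequence[float], gripper_qpos: Sequence[float]) -> List[float]:
    """Assemble [eef_pos(3), axis-angle(3), gripper_qpos(2)] into one 8-D state."""
    state = list(eef_pos) + quat_to_axis_angle(eef_quat_xyzw) + list(gripper_qpos)
    if len(state) != 8:
        raise ValueError(f"expected an 8-D state, got {len(state)} values")
    return state


# Example: end effector rotated 90 degrees about the z axis.
half = math.pi / 4
state = build_state(
    eef_pos=[0.1, 0.0, 0.9],
    eef_quat_xyzw=[0.0, 0.0, math.sin(half), math.cos(half)],
    gripper_qpos=[0.02, -0.02],
)
print(state)
```

The three middle entries carry the rotation as axis times angle, so the example yields roughly (0, 0, π/2) there. The state still has to be normalized with the bundled stats before it reaches the model.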
License & attribution
- This checkpoint and code: Apache-2.0.
- Bundled SmolVLM2-500M-Instruct backbone weights and tokenizer.json: Apache-2.0, © Hugging Face; see the base model card.
- LIBERO benchmark and the LeRobot library are the authors' / Hugging Face's; cite them if you use this. The architecture follows the SmolVLA design.