PulseVLA-LIBERO (0.5B) — from-scratch action expert

A Vision-Language-Action policy for the LIBERO robot-manipulation benchmark. It uses a SmolVLA-style architecture: a frozen, pretrained SmolVLM2-500M-Instruct backbone + a flow-matching action expert trained from scratch on the full LIBERO training set. The released checkpoint is self-contained (backbone + expert in one file, 557.6M params) and runs with minimal, pure-PyTorch code — no transformers dependency.

This repo ships only the weights + the minimal code to run and evaluate them. It does not contain the training pipeline.


Results

Success rate (pc_success, % of episodes solved), 10 tasks × 10 episodes per suite (100 rollouts per suite, 400 in total), n_action_steps=10, --seed 1:

Suite                       pc_success (%)
libero_object               97.0
libero_goal                 88.0
libero_spatial              82.0
libero_10 (long-horizon)    60.0
Average                     81.8
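
As a quick sanity check on the table, the average and the total rollout count can be recomputed from the per-suite numbers. This is plain arithmetic on the reported values; the variable names are illustrative and do not come from the repo.

```python
# Per-suite success rates copied from the results table (percent).
pc_success = {
    "libero_object": 97.0,
    "libero_goal": 88.0,
    "libero_spatial": 82.0,
    "libero_10": 60.0,
}

tasks_per_suite = 10
episodes_per_task = 10

total_rollouts = len(pc_success) * tasks_per_suite * episodes_per_task
average = sum(pc_success.values()) / len(pc_success)

print(total_rollouts)  # 400
print(average)         # 81.75, reported as 81.8 in the table
```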

These numbers are reproducible with the bundled eval_libero.py --seed 1, to within run-to-run noise (see Caveat 1 — fixed seed is approximately, not bit-exactly, reproducible on GPU). Please read Caveats before comparing to any other paper or checkpoint — the eval protocol and settings matter a lot here.


Files in this repo

File                     What it is
model.safetensors        the weights (557.6M; frozen SmolVLM2 backbone + trained expert/projections)
config.json              architecture config (rebuilds the model exactly)
norm_stats.safetensors   state/action normalization stats the model was trained with; load-bearing for replication
tokenizer.json           the SmolVLM2-500M-Instruct tokenizer (bundled; Apache-2.0)
smolvla/                 minimal, transformers-free inference + eval code (pure PyTorch)
predict_example.py       minimal load + single forward pass (no simulator)
eval_libero.py           standalone eval that reproduces the table above (needs the LIBERO sim)
requirements.txt         pinned dependency versions

Quickstart — inference (minimal, no simulator)

pip install -r requirements.txt          # torch, numpy, safetensors, tokenizers
python predict_example.py

import torch
from safetensors.torch import load_file
from smolvla import SmolVLA, SmolVLAProcessor, Tokenizer, load_lerobot_norm_stats, load_smolvla_config
from smolvla.types import Obs

cfg   = load_smolvla_config("config.json")
model = SmolVLA(cfg).float().eval()
model.load_state_dict(load_file("model.safetensors"))
stats = load_lerobot_norm_stats("norm_stats.safetensors")
proc  = SmolVLAProcessor(cfg, Tokenizer("tokenizer.json", max_length=cfg.tokenizer_max_length), stats, device="cpu")

obs = Obs(images={"image": torch.rand(1,3,512,512), "image2": torch.rand(1,3,512,512)},
          state=torch.zeros(1,8), task=["pick up the black bowl and place it on the plate"])
chunk   = model.predict_action_chunk(proc.to_model_input(obs))   # (1, 50, 32) normalized
actions = proc.postprocess_action(chunk)                          # (1, 50, 7) raw

predict_action_chunk returns a chunk of chunk_size=50 actions; at eval you execute the first n_action_steps (we use 10) before re-planning.
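
The execute-then-re-plan pattern described above can be sketched as a small loop. This is a simplified illustration, not the code in eval_libero.py: `plan` and `dummy_plan` are hypothetical stand-ins for predict_action_chunk plus postprocess_action, actions are plain integers instead of 7-D vectors, and early termination on task success is ignored.

```python
from typing import Callable, List, Tuple

CHUNK_SIZE = 50        # actions per predicted chunk (from the text above)
N_ACTION_STEPS = 10    # actions executed before re-planning (from the text above)


def run_receding_horizon(
    plan: Callable[[int], List[int]],
    max_steps: int,
    n_action_steps: int = N_ACTION_STEPS,
) -> Tuple[List[int], int]:
    """Execute the first n_action_steps of each chunk, then re-plan.

    `plan(t)` is a hypothetical planner returning CHUNK_SIZE actions for
    time step t. Returns (executed_actions, number_of_replans).
    """
    executed: List[int] = []
    replans = 0
    while len(executed) < max_steps:
        chunk = plan(len(executed))
        replans += 1
        # Keep only the head of the chunk; the remaining actions are discarded.
        remaining = max_steps - len(executed)
        executed.extend(chunk[: min(n_action_steps, remaining)])
    return executed, replans


def dummy_plan(t: int) -> List[int]:
    # Toy planner: "action" i of the chunk is just its absolute time index.
    return [t + i for i in range(CHUNK_SIZE)]


actions, replans = run_receding_horizon(dummy_plan, max_steps=280)
print(len(actions), replans)  # 280 28
```

With chunk_size=50 and n_action_steps=10, four fifths of every predicted chunk is thrown away; the trade-off is more frequent re-planning from fresh observations.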


How to verify (reproduce the table)

Reproducing the numbers needs the public LIBERO simulator (the environment) — this is the only heavy dependency, and it is not our code:

# 1. get the HF CLI, then download this repo (it contains requirements.txt + the code)
pip install -U "huggingface_hub[cli]"
hf download verapulse/pulsevla-libero-0.5b --local-dir ./pulsevla-libero
cd ./pulsevla-libero

# 2. deps: inference (requirements.txt) + the public LIBERO sim. NOTE: lerobot 0.5.2
# is a main-branch dev version NOT on PyPI (latest PyPI is 0.5.1); install the exact
# commit we used (pulls robosuite + mujoco + bddl):
pip install -r requirements.txt
pip install "lerobot[libero] @ git+https://github.com/huggingface/lerobot.git@d1b1c5c8cff5e1f637495e1667a1d6c7c5258f3b"

# 3. run the canonical eval, once per suite (headless rendering => MUJOCO_GL=egl)
MUJOCO_GL=egl python eval_libero.py --task libero_object  --seed 1
MUJOCO_GL=egl python eval_libero.py --task libero_goal    --seed 1
MUJOCO_GL=egl python eval_libero.py --task libero_spatial --seed 1
MUJOCO_GL=egl python eval_libero.py --task libero_10      --seed 1

Each run prints per-task and overall pc_success. With --seed 1 and the defaults (--n_action_steps 10 --n_envs 1) you should get the table above, to within run-to-run noise (see Caveat 1).
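
If you prefer to launch the four runs from one script, a thin Python wrapper is enough. The sketch below uses only the flags shown in step 3 (--task, --seed) and the MUJOCO_GL=egl environment variable; the helper names `build_eval_commands` and `run_all` are made up for this example and are not part of the repo.

```python
import os
import subprocess
import sys
from typing import Dict, List

SUITES = ["libero_object", "libero_goal", "libero_spatial", "libero_10"]


def build_eval_commands(seed: int = 1) -> List[List[str]]:
    """One eval_libero.py invocation per suite, using only flags shown above."""
    return [
        [sys.executable, "eval_libero.py", "--task", suite, "--seed", str(seed)]
        for suite in SUITES
    ]


def run_all(seed: int = 1) -> Dict[str, int]:
    """Run the suites one after another; returns each suite's exit code."""
    env = dict(os.environ, MUJOCO_GL="egl")  # headless rendering, as in step 3
    return {
        cmd[3]: subprocess.run(cmd, env=env).returncode
        for cmd in build_eval_commands(seed)
    }


# Print the commands without running them.
for cmd in build_eval_commands():
    print(" ".join(cmd[1:]))
```

Call `run_all()` from inside the downloaded repo directory to actually start the evaluations; the runs are sequential, which keeps the --n_envs 1 protocol unchanged.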

The exact protocol behind the numbers (all baked into eval_libero.py's defaults):

  • n_action_steps = 10, n_envs = 1, 10 tasks × 10 episodes per suite;
  • normalization = the bundled norm_stats.safetensors (these are the HuggingFaceVLA/libero dataset-metadata stats the model trained with — no dataset download needed for eval);
  • per-suite max episode length: spatial/object 280, goal 300, libero_10 520;
  • --seed 1 reseeds the policy's flow-matching noise per episode (see below).
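
For planning compute, the protocol above gives a worst-case budget. The sketch assumes one simulator step per executed action and one policy forward pass per re-plan, and that every episode runs to its step limit; real runs stop early on success, so actual counts will be lower. The constant names are illustrative.

```python
import math

N_ACTION_STEPS = 10
EPISODES_PER_SUITE = 10 * 10  # 10 tasks x 10 episodes

# Per-suite max episode length, from the protocol list above.
MAX_EPISODE_STEPS = {
    "libero_spatial": 280,
    "libero_object": 280,
    "libero_goal": 300,
    "libero_10": 520,
}

# Worst case: every episode runs to its step limit without succeeding.
max_replans_per_episode = {
    suite: math.ceil(steps / N_ACTION_STEPS)
    for suite, steps in MAX_EPISODE_STEPS.items()
}
max_env_steps_total = EPISODES_PER_SUITE * sum(MAX_EPISODE_STEPS.values())
max_policy_calls_total = EPISODES_PER_SUITE * sum(max_replans_per_episode.values())

print(max_replans_per_episode)
print(max_env_steps_total, max_policy_calls_total)  # 138000 13800
```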

Caveats (read before comparing)

These are not fine print — they determine whether your number will match ours.

  1. The policy is stochastic, and --seed 1 is only approximately reproducible on GPU. It samples flow-matching noise from the global RNG on every re-plan. Unseeded, pc_success varies ±5–10 pts per suite at 100 episodes. --seed 1 fixes that noise and gets you the table within a few points — but it is not bit-exact on GPU: residual CUDA / MuJoCo nondeterminism can flip a few near-boundary episodes from run to run. In our own clean-room checks, object, spatial and libero_10 reproduced exactly across runs, while goal moved between 84% and 88% (one task flipping 4 episodes). So expect to land within a couple of points of each number, sometimes exactly. (Full bit-determinism would additionally need torch.use_deterministic_algorithms(True), CUBLAS_WORKSPACE_CONFIG=:4096:8, and a fixed-seed simulator — we do not enforce these.) A different seed, or unseeded, will likewise land a few points off — all expected, not a bug.

  2. Numbers are only comparable at --n_envs 1. Batched rollouts (--n_envs > 1, or --vec async for speed) draw the noise in a different shape, so they produce a different but equally valid sample — not the seed-1 table. Use --n_envs 1 to match us; use --vec async only when you want a faster estimate and don't need exact reproduction.

  3. lerobot version matters: install the exact commit. LIBERO/lerobot APIs move. We used lerobot at commit d1b1c5c8 (its dev version string is 0.5.2, which is not on PyPI; the latest PyPI release is 0.5.1). Install the pinned commit shown in step 2 of "How to verify"; other versions may change env construction or eval batching and shift numbers.
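
To get a feel for how much spread 100 episodes per suite can produce on their own, a binomial standard error is a useful back-of-the-envelope check. This is not the authors' analysis: it assumes episodes are independent trials with one shared success probability, which ignores per-task differences, and the normal approximation is rough for rates near 100%. The function name is made up for this example.

```python
import math


def binomial_std_error_pts(success_rate_pct: float, n_episodes: int = 100) -> float:
    """Standard error of a success rate, in percentage points.

    Assumes episodes are independent Bernoulli trials with a common
    success probability, which is a simplification.
    """
    p = success_rate_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_episodes)


for suite, rate in [("libero_object", 97.0), ("libero_goal", 88.0),
                    ("libero_spatial", 82.0), ("libero_10", 60.0)]:
    se = binomial_std_error_pts(rate)
    # 1.96 standard errors is the usual normal-approximation 95% half-width.
    print(f"{suite}: +/- {1.96 * se:.1f} pts")
```

Under these assumptions the half-widths for the lower-scoring suites come out at roughly 6 to 10 points, the same order as the unseeded spread quoted in Caveat 1, so differences of a few points between runs or checkpoints should not be over-interpreted.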


Model details

  • Backbone: SmolVLM2-500M-Instruct (SigLIP vision tower + pixel-shuffle connector + SmolLM2 decoder), frozen, all 32 layers.
  • Action expert: flow-matching head (10 integration steps), interleaved with the VLM via cross-attention; trained from scratch. ~97.5M trainable params of 557.6M.
  • I/O: 2 camera views (agentview + wrist, 512×512), 8-D proprio state ([eef_pos(3), quat→axis-angle(3), gripper_qpos(2)]), language instruction → chunk of 7-DoF actions. State/action normalized with the bundled stats; images are flipped 180° and mapped to SigLIP range (handled inside smolvla/).
  • Training: full LIBERO training set (all 4 suites), ~30k steps, AdamW, cosine LR, bf16, frozen backbone. The released checkpoint is the final step.
  • Cheap to train on a home GPU. Because only the 97.5M-param expert is trained (the backbone is frozen and stays out of the backward graph), the whole run fits in about 16 GB of VRAM at batch size 32 (bf16; about 22 GB at batch size 48) — a single consumer GPU. The from-scratch run above took about 7.5 h on one RTX 5090, i.e. a few kWh of electricity — well under $10 at typical home rates. No multi-GPU or cloud cluster required.
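
The "10 integration steps" of the flow-matching head can be illustrated with a fixed-step Euler solver. This is a toy sketch in pure Python, not the code in smolvla/: the time direction (t=1 for noise down to t=0 for the action), the straight-line path, and the names `euler_integrate` and `toy_velocity` are assumptions made for the example, and the learned network is replaced by an analytic velocity field.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_integrate(
    velocity: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=1 (noise) down to t=0 (action)."""
    x = list(noise)
    dt = -1.0 / num_steps
    t = 1.0
    for _ in range(num_steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


# Toy check: on the straight path x_t = t * noise + (1 - t) * target the exact
# velocity is the constant (noise - target), so Euler recovers the target.
rng = random.Random(1)
target = [0.5, -0.25, 0.0, 1.0, 0.0, 0.0, -1.0]   # a made-up 7-D action
noise = [rng.gauss(0.0, 1.0) for _ in target]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [n - a for n, a in zip(noise, target)]


action = euler_integrate(toy_velocity, noise)
print(max(abs(a - b) for a, b in zip(action, target)))  # ~0 up to float rounding
```

In the real model the velocity comes from the action expert conditioned on the VLM features, so each integration step costs one expert forward pass; that is why the starting noise, and hence the seed, affects the resulting actions.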

Intended use & limitations

Research artifact for simulated LIBERO manipulation. Not validated for real robots. Evaluation is stochastic (see caveats). Performance outside LIBERO's task distribution / the 2-camera + 8-D-state observation format is unknown.

License & attribution

  • This checkpoint and code: Apache-2.0.
  • Bundled SmolVLM2-500M-Instruct backbone weights and tokenizer.json: Apache-2.0, © Hugging Face — see the base model card.
  • LIBERO benchmark and the LeRobot library are the authors' / Hugging Face's; cite them if you use this. The architecture follows the SmolVLA design.