MolmoAct2-LIBERO + Grid Sampler (fine-tuned)
MolmoAct2 with Grid Sampler (GridS, ICML 2026) visual token pruning, fine-tuned on the full LIBERO training set (allenai/MolmoAct2-LIBERO-Dataset, 1693 episodes / 273k frames).
Each camera image contributes 16 visual tokens instead of 196 (pruned by an
ActiveTokenSampler in the vision backbone), shrinking the LIBERO 2-camera prompt from 483 to
123 tokens. The sampler and the rest of the network were trained jointly.
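The token counts above are consistent with each other, which a few lines of arithmetic can confirm. This is only a sanity-check sketch: the number of non-image tokens is inferred from the stated figures rather than given in the card, and the constant names are illustrative.

```python
# Figures quoted in the model card text.
TOKENS_PER_IMAGE_DENSE = 196   # visual tokens per camera image without pruning
TOKENS_PER_IMAGE_PRUNED = 16   # visual tokens per camera image with the sampler
NUM_CAMERAS = 2
PROMPT_DENSE = 483             # LIBERO 2-camera prompt length without pruning

# Tokens that are not image tokens (inferred here, not stated in the card).
non_visual = PROMPT_DENSE - NUM_CAMERAS * TOKENS_PER_IMAGE_DENSE

# Prompt length once each camera contributes only the pruned token count.
prompt_pruned = non_visual + NUM_CAMERAS * TOKENS_PER_IMAGE_PRUNED

print(non_visual)                               # 91
print(prompt_pruned)                            # 123
print(round(PROMPT_DENSE / prompt_pruned, 2))   # 3.93
```

The prompt is roughly 3.9× shorter overall, while the visual part alone shrinks from 392 to 32 tokens.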
- Starting weights: allenai/MolmoAct2-LIBERO (sampler randomly initialized — see xpuenabler/molmoact2-libero_grid_sampler_random_init)
- Recipe: 10,000 steps, batch 32, AdamW lr 1e-5 (ViT 5e-6, connector 5e-6, action expert 5e-5),
cosine decay with 200 warmup steps, gradient checkpointing, frozen embeddings, image
augmentations; mirrors the allenai/MolmoAct2-LIBERO-LeRobot
train_config.json (included here as train_config.json)
- Final training loss: ~0.50 (from 7.5 at random-init sampler)
- Hardware: 1× NVIDIA B200, bfloat16
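The learning-rate recipe above can be sketched in plain Python. This is a minimal illustration, not the training code: it assumes linear warmup from zero and cosine decay to zero, neither of which the card specifies, and the group names in `BASE_LRS` are hypothetical labels for the four rates listed. The included train_config.json is the authoritative source.

```python
import math

# Peak learning rates from the recipe; the keys are illustrative labels only.
BASE_LRS = {"default": 1e-5, "vit": 5e-6, "connector": 5e-6, "action_expert": 5e-5}
TOTAL_STEPS = 10_000
WARMUP_STEPS = 200


def lr_scale(step: int) -> float:
    """Multiplier applied to every group's peak learning rate at a given step."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup starting from zero.
        return step / WARMUP_STEPS
    # Assumption: cosine decay from 1 down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def lrs_at(step: int) -> dict:
    """Per-group learning rates at a given step."""
    scale = lr_scale(step)
    return {name: lr * scale for name, lr in BASE_LRS.items()}


for step in (0, 100, 200, 5_100, 10_000):
    print(step, lrs_at(step))
```

Under these assumptions every parameter group follows the same schedule shape and differs only in its peak value.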
Reproduce latency / FLOPs on other hardware (e.g. NVIDIA Thor)
See benchmark/ for a self-contained latency and FLOPs benchmark: environment
setup for NVIDIA Thor / Jetson AGX Thor, a one-command runner, and the
reference B200 numbers in benchmark/reference_results_b200.json.
Quick start: export HF_TOKEN=hf_... then bash benchmark/run.sh. Full setup and
the B200 reference table are in benchmark/README.md.
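For a quick latency check outside the provided runner, a generic timing helper along the following lines may be enough. This is a sketch and not the method used in benchmark/: `time_callable` and its parameters are hypothetical, and on a GPU the `sync` argument would need to be a device synchronization call (for example `torch.cuda.synchronize`) so that asynchronous kernels are included in the measurement.

```python
import statistics
import time


def time_callable(fn, warmup=3, iters=10, sync=None):
    """Return the median wall-clock latency of fn() in milliseconds."""
    # Warmup runs are executed but not timed.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for any asynchronous work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# Stand-in workload; replace with a call into the policy to time inference.
latency_ms = time_callable(lambda: sum(range(10_000)))
print(f"median latency: {latency_ms:.3f} ms")
```

Numbers obtained this way are not directly comparable to the B200 reference table unless the same inputs and settings as benchmark/run.sh are used.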
Usage
Requires the feat/grid-sampler-molmoact2 branch of
nota-github/xpu-lerobot:
import torch

from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.factory import make_pre_post_processors
from lerobot.policies.molmoact2.modeling_molmoact2 import MolmoAct2Policy

path = "xpuenabler/molmoact2-libero_grid_sampler_fine_tuned"

# Load the policy config and point it at the checkpoint.
cfg = PreTrainedConfig.from_pretrained(path)
cfg.pretrained_path = path
cfg.device = "cuda"
cfg.inference_action_mode = "continuous"

# Build the policy and its matching pre/post-processors.
policy = MolmoAct2Policy.from_pretrained(path, config=cfg)
preprocessor, postprocessor = make_pre_post_processors(
    policy_cfg=cfg,
    pretrained_path=path,
    preprocessor_overrides={"device_processor": {"device": "cuda"}},
)

# Dummy observation: two camera images, an 8-dim state, and a task string.
batch = preprocessor({
    "observation.images.image": torch.rand(3, 256, 256),
    "observation.images.wrist_image": torch.rand(3, 256, 256),
    "observation.state": torch.zeros(8),
    "task": "pick up the black bowl",
})
action = postprocessor(policy.select_action(batch))
LIBERO simulation evaluation: scripts/eval_molmoact2_libero.sh in the same branch with
POLICY_PATH=xpuenabler/molmoact2-libero_grid_sampler_fine_tuned.
Citation
@inproceedings{feng2026gridsampler,
title = {See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model},
author = {Feng, Yixu and Zhao, Zinan and Ma, Yanxiang and Xia, Chenghao and Du, Chengbin and Wang, Yunke and Xu, Chang},
booktitle = {Forty-Third International Conference on Machine Learning (ICML)},
year = {2026}
}