MolmoAct2-LIBERO + Grid Sampler (fine-tuned)

MolmoAct2 with Grid Sampler (GridS, ICML 2026) visual token pruning, fine-tuned on the full LIBERO training set (allenai/MolmoAct2-LIBERO-Dataset, 1693 episodes / 273k frames).

Each camera image contributes 16 visual tokens instead of 196 (pruned by an ActiveTokenSampler in the vision backbone), shrinking the LIBERO 2-camera prompt from 483 to 123 tokens. The sampler and the rest of the network were trained jointly.
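The prompt-length figures above can be checked with a little arithmetic. The sketch below assumes that everything other than the per-camera visual tokens (text, state, and special tokens) is unchanged by pruning; that remainder is inferred from the numbers quoted above and is not stated separately in the model card. The names are illustrative only.

```python
# Token budget for the LIBERO 2-camera prompt, using the figures quoted above.
NUM_CAMERAS = 2
TOKENS_PER_IMAGE_BASELINE = 196   # visual tokens per camera without pruning
TOKENS_PER_IMAGE_PRUNED = 16      # visual tokens per camera with Grid Sampler
PROMPT_LEN_BASELINE = 483         # full prompt length without pruning

# Assumption: all remaining tokens are untouched by the sampler.
other_tokens = PROMPT_LEN_BASELINE - NUM_CAMERAS * TOKENS_PER_IMAGE_BASELINE  # 91


def prompt_length(tokens_per_image: int) -> int:
    """Prompt length for a given per-camera visual token count."""
    return other_tokens + NUM_CAMERAS * tokens_per_image


pruned_len = prompt_length(TOKENS_PER_IMAGE_PRUNED)  # 91 + 2 * 16 = 123
reduction = 1.0 - pruned_len / PROMPT_LEN_BASELINE

print(f"other tokens: {other_tokens}")
print(f"pruned prompt length: {pruned_len}")
print(f"prompt length reduction: {reduction:.1%}")
```

Under this assumption the pruned prompt comes out at 123 tokens, matching the figure stated above.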

  • Starting weights: allenai/MolmoAct2-LIBERO, with the sampler randomly initialized (see xpuenabler/molmoact2-libero_grid_sampler_random_init)
  • Recipe: 10,000 steps, batch size 32, AdamW with lr 1e-5 (ViT 5e-6, connector 5e-6, action expert 5e-5), cosine decay with 200 warmup steps, gradient checkpointing, frozen embeddings, and image augmentations. This mirrors the allenai/MolmoAct2-LIBERO-LeRobot train_config.json (included here as train_config.json).
  • Final training loss: ~0.50 (down from 7.5 with the randomly initialized sampler)
  • Hardware: 1× NVIDIA B200, bfloat16
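
To make the learning-rate recipe above concrete, here is a minimal sketch of a warmup-plus-cosine schedule applied to the four listed learning rates. It assumes linear warmup from zero and cosine decay to zero at step 10,000; neither detail is stated above, so treat train_config.json as the authority. The group names are hypothetical labels, not identifiers from the training code.

```python
import math

TOTAL_STEPS = 10_000
WARMUP_STEPS = 200

# Peak learning rates from the recipe; the keys are hypothetical group labels.
PEAK_LRS = {
    "default": 1e-5,
    "vit": 5e-6,
    "connector": 5e-6,
    "action_expert": 5e-5,
}


def lr_scale(step: int) -> float:
    """Multiplier in [0, 1] applied to every group's peak learning rate."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from 0 to 1 over the first 200 steps.
        return step / WARMUP_STEPS
    # Assumed cosine decay from 1 down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def lrs_at(step: int) -> dict:
    """Per-group learning rates at a given optimizer step."""
    scale = lr_scale(step)
    return {name: peak * scale for name, peak in PEAK_LRS.items()}


for step in (0, 100, 200, 5_100, 10_000):
    print(step, lrs_at(step))
```

Because every group shares the same multiplier, the ratios between the ViT, connector, and action-expert rates stay fixed throughout training in this sketch.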

Reproduce latency / FLOPs on other hardware (e.g. NVIDIA Thor)

The benchmark/ directory contains a self-contained latency and FLOPs reproduction benchmark: environment setup for NVIDIA Thor / Jetson AGX Thor, a one-command runner, and the reference B200 numbers in benchmark/reference_results_b200.json. Quick start: export HF_TOKEN=hf_... and then run bash benchmark/run.sh. Full setup instructions and the B200 reference table are in benchmark/README.md.
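
For a quick sanity check outside the benchmark, a generic timing harness along the following lines can be useful. This is not the code in benchmark/; it is a minimal sketch using only the Python standard library, and the function name and parameters are hypothetical. On a GPU, pass a synchronization callable (for example torch.cuda.synchronize) as sync so that asynchronous kernels are included in each measurement.

```python
import statistics
import time
from typing import Callable, Dict, Optional


def measure_latency_ms(
    fn: Callable[[], object],
    warmup: int = 10,
    iters: int = 50,
    sync: Optional[Callable[[], None]] = None,
) -> Dict[str, float]:
    """Time repeated calls to fn and return summary statistics in milliseconds."""
    # Warmup runs are discarded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Stand-in CPU workload; replace with a callable that runs one forward pass.
stats = measure_latency_ms(lambda: sum(range(10_000)), warmup=3, iters=20)
print(stats)
```

Reporting the median rather than the mean keeps occasional scheduling spikes from dominating the result.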

Usage

Requires the feat/grid-sampler-molmoact2 branch of nota-github/xpu-lerobot:

import torch
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.factory import make_pre_post_processors
from lerobot.policies.molmoact2.modeling_molmoact2 import MolmoAct2Policy

path = "xpuenabler/molmoact2-libero_grid_sampler_fine_tuned"
cfg = PreTrainedConfig.from_pretrained(path)
cfg.pretrained_path = path
cfg.device = "cuda"
cfg.inference_action_mode = "continuous"

policy = MolmoAct2Policy.from_pretrained(path, config=cfg)
preprocessor, postprocessor = make_pre_post_processors(
    policy_cfg=cfg, pretrained_path=path,
    preprocessor_overrides={"device_processor": {"device": "cuda"}},
)

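# Dummy observation: two 256x256 RGB views (main and wrist camera),
# an 8-dimensional state vector, and a language instruction.
batch = preprocessor({
    "observation.images.image": torch.rand(3, 256, 256),
    "observation.images.wrist_image": torch.rand(3, 256, 256),
    "observation.state": torch.zeros(8),
    "task": "pick up the black bowl",
})
# Run the policy, then pass its output through the postprocessor.
action = postprocessor(policy.select_action(batch))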

For LIBERO simulation evaluation, run scripts/eval_molmoact2_libero.sh from the same branch with POLICY_PATH=xpuenabler/molmoact2-libero_grid_sampler_fine_tuned.

Citation

@inproceedings{feng2026gridsampler,
  title     = {See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model},
  author    = {Feng, Yixu and Zhao, Zinan and Ma, Yanxiang and Xia, Chenghao and Du, Chengbin and Wang, Yunke and Xu, Chang},
  booktitle = {Forty-Third International Conference on Machine Learning (ICML)},
  year      = {2026}
}