MolmoAct2-Origami

MolmoAct2 is an open vision-language-action model for robot control. This checkpoint is fine-tuned for the bimanual Origami folding setup with absolute joint-pose control.

The checkpoint uses continuous action inference. Dataset normalization metadata is stored in norm_stats.json; pass norm_tag="origami" at inference time.

Install

pip install torch transformers pillow numpy huggingface_hub

Sample Input

This sample comes from season_POC22032_2026_05_14_19_21_01_train, episode 0, frame 1000. The task annotation is north ces task.

The camera order for this checkpoint is head-left, head-right, wrist-left, wrist-right, tactile-deform.

Head Left Head Right Wrist Left
Sample head-left RGB Sample head-right RGB Sample wrist-left RGB
Wrist Right Tactile Deform
Sample wrist-right RGB Sample tactile-deform RGB

Origami training concatenates observation.state and observation.tactile before passing the state into the model:

from huggingface_hub import hf_hub_download
from PIL import Image
import numpy as np

repo_id = "hqfang/molmoact2-origami"

head_left = Image.open(
    hf_hub_download(repo_id, "assets/sample_head_left.png")
).convert("RGB")
head_right = Image.open(
    hf_hub_download(repo_id, "assets/sample_head_right.png")
).convert("RGB")
wrist_left = Image.open(
    hf_hub_download(repo_id, "assets/sample_wrist_left.png")
).convert("RGB")
wrist_right = Image.open(
    hf_hub_download(repo_id, "assets/sample_wrist_right.png")
).convert("RGB")
tactile_deform = Image.open(
    hf_hub_download(repo_id, "assets/sample_tactile_deform.png")
).convert("RGB")

task = "north ces task"

observation_state = np.array(
    [
        1.0490574, -0.5202958, -0.081635103, 1.1234368, -0.16848825,
        -0.17774895, 0.32093349, 1.1820049, -0.015503892, -0.22338045,
        -0.054003395, 0.19747166, 1.1392729, 0.0037120704, 0.20939526,
        0.18640123, 0.80856121, 0.054220561, 0.86714411, 0.20072246,
        0.52739459, 0.12671313, 1.1206218, 0.2093568, 0.054587767,
        0.39636838, 0.13089132, 0.73561513, 0.62315375, 1.0669012,
        -0.50566864, -0.10724012, 1.0205883, -0.38752872, -0.059374165,
        0.40863451, 1.2579446, 0.13169624, -0.004245867, -0.11287238,
        0.34268361, 1.2139554, -0.014323864, 0.22620225, 0.19668068,
        1.1637404, -0.026463741, 0.48477855, 0.29940978, 0.73284894,
        0.10251313, 0.79543447, 0.3646262, 0.013743274, 1.4660013,
        0.25514615, 1.5681823, 1.3074577, 0.56268334, -1.106863,
        0.72058749, 0.035089809, 0.054360446, -0.055702679, -0.62337142,
    ],
    dtype=np.float32,
)
observation_tactile = np.array(
    [
        0.07043457, 0.44433594, -0.20898438, 0.0043830872, 0.0062179565,
        -0.00070285797, 0.39355469, -0.0039825439, 0.11804199, -0.0015954971,
        0.0025863647, 0.0075416565, -0.0035018921, 0.0027160645, -0.0034484863,
        6.6518784e-05, 0.00020933151, -0.00014781952, -0.0017547607, -0.005027771,
        -0.010017395, 0.00012350082, -0.00011891127, -4.1007996e-05, 0.014160156,
        -0.0014038086, -0.0091247559, 0.00027275085, 7.8439713e-05, 0.00023460388,
        -3.5722656, 6.28125, 9.078125, -0.13024902, 0.01159668,
        -0.057006836, -9.796875, 0.63378906, 7.921875, -0.13171387,
        0.013977051, -0.1862793, -0.00080108643, -0.0090332031, -0.00062561035,
        -2.5749207e-05, -0.00018501282, 3.4332275e-05, 0.00028610229, 0.0017700195,
        0.0065460205, -0.00017929077, -9.5367432e-06, 2.2411346e-05, -0.0016269684,
        0.00019073486, -0.0073242188, 0.00037384033, 3.3140182e-05, -0.00012779236,
    ],
    dtype=np.float32,
)
robot_state = np.concatenate([observation_state, observation_tactile], axis=-1).astype(np.float32)

observation_state has 65 dimensions and observation_tactile has 60 dimensions, so robot_state has 125 dimensions. Do not pass observation.state.joint_torque, observation.state.tcp, or observation.images.tactile_raw unless you fine-tune a separate checkpoint that was trained with those inputs.

Continuous Actions

import numpy as np
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

repo_id = "hqfang/molmoact2-origami"

head_left = Image.open(
    hf_hub_download(repo_id, "assets/sample_head_left.png")
).convert("RGB")
head_right = Image.open(
    hf_hub_download(repo_id, "assets/sample_head_right.png")
).convert("RGB")
wrist_left = Image.open(
    hf_hub_download(repo_id, "assets/sample_wrist_left.png")
).convert("RGB")
wrist_right = Image.open(
    hf_hub_download(repo_id, "assets/sample_wrist_right.png")
).convert("RGB")
tactile_deform = Image.open(
    hf_hub_download(repo_id, "assets/sample_tactile_deform.png")
).convert("RGB")

task = "north ces task"
observation_state = np.array(
    [
        1.0490574, -0.5202958, -0.081635103, 1.1234368, -0.16848825,
        -0.17774895, 0.32093349, 1.1820049, -0.015503892, -0.22338045,
        -0.054003395, 0.19747166, 1.1392729, 0.0037120704, 0.20939526,
        0.18640123, 0.80856121, 0.054220561, 0.86714411, 0.20072246,
        0.52739459, 0.12671313, 1.1206218, 0.2093568, 0.054587767,
        0.39636838, 0.13089132, 0.73561513, 0.62315375, 1.0669012,
        -0.50566864, -0.10724012, 1.0205883, -0.38752872, -0.059374165,
        0.40863451, 1.2579446, 0.13169624, -0.004245867, -0.11287238,
        0.34268361, 1.2139554, -0.014323864, 0.22620225, 0.19668068,
        1.1637404, -0.026463741, 0.48477855, 0.29940978, 0.73284894,
        0.10251313, 0.79543447, 0.3646262, 0.013743274, 1.4660013,
        0.25514615, 1.5681823, 1.3074577, 0.56268334, -1.106863,
        0.72058749, 0.035089809, 0.054360446, -0.055702679, -0.62337142,
    ],
    dtype=np.float32,
)
observation_tactile = np.array(
    [
        0.07043457, 0.44433594, -0.20898438, 0.0043830872, 0.0062179565,
        -0.00070285797, 0.39355469, -0.0039825439, 0.11804199, -0.0015954971,
        0.0025863647, 0.0075416565, -0.0035018921, 0.0027160645, -0.0034484863,
        6.6518784e-05, 0.00020933151, -0.00014781952, -0.0017547607, -0.005027771,
        -0.010017395, 0.00012350082, -0.00011891127, -4.1007996e-05, 0.014160156,
        -0.0014038086, -0.0091247559, 0.00027275085, 7.8439713e-05, 0.00023460388,
        -3.5722656, 6.28125, 9.078125, -0.13024902, 0.01159668,
        -0.057006836, -9.796875, 0.63378906, 7.921875, -0.13171387,
        0.013977051, -0.1862793, -0.00080108643, -0.0090332031, -0.00062561035,
        -2.5749207e-05, -0.00018501282, 3.4332275e-05, 0.00028610229, 0.0017700195,
        0.0065460205, -0.00017929077, -9.5367432e-06, 2.2411346e-05, -0.0016269684,
        0.00019073486, -0.0073242188, 0.00037384033, 3.3140182e-05, -0.00012779236,
    ],
    dtype=np.float32,
)
robot_state = np.concatenate([observation_state, observation_tactile], axis=-1).astype(np.float32)

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    repo_id,
    trust_remote_code=True,
    dtype=torch.float32,
).to("cuda").eval()

out = model.predict_action(
    processor=processor,
    images=[
        head_left,
        head_right,
        wrist_left,
        wrist_right,
        tactile_deform,
    ],
    task=task,
    state=robot_state,
    norm_tag="origami",
    inference_action_mode="continuous",
    enable_depth_reasoning=False,
    num_steps=10,
    normalize_language=True,
    enable_cuda_graph=True,
)

actions = out.actions  # shape: (1, 30, 65)

MolmoAct2 was trained with mixed precision. For our reported experiments, we typically run inference in float32. If GPU memory is limited, use bfloat16:

model = AutoModelForImageTextToText.from_pretrained(
    repo_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    out = model.predict_action(...)

normalize_language=True is the default. It lowercases the task string and removes trailing sentence punctuation to match training preprocessing. enable_cuda_graph=True is also the default; the first few calls can be slow while CUDA graphs are warmed up and captured. num_steps controls the continuous flow solver.

Depth reasoning is disabled for this checkpoint. Calling enable_depth_reasoning=True will raise an error.

Model and Hardware Safety

MolmoAct2 generates robot actions from visual observations and language instructions, but behavior may vary across embodiments, environments, and hardware configurations. Validate model outputs before deployment, especially on physical robots. The action space should be bounded by the training data, robot controller limits, and task-specific safety constraints, including limits on speed, workspace, torque, and contact force.

Citation

@misc{fang2026molmoact2actionreasoningmodels,
      title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
      author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
      year={2026},
      eprint={2605.02881},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.02881},
}
Downloads last month
-
Safetensors
Model size
5B params
Tensor type
F32
·
Video Preview
loading

Paper for hqfang/molmoact2-origami