Mem-0 Execution Module: m1_mix (RMBench / RoboTwin 2.0)

A single Mem-0 low-level execution-module checkpoint trained jointly on all five RMBench M1 tasks (the m1_mix dataset) and evaluated on each task in turn. M1 tasks require only the execution module; no high-level planner / vLLM is needed for inference.

  • Backbone: Qwen3-VL-2B-Instruct (vision-language), with weights fine-tuned and bundled in the checkpoint
  • Action head: DiT-B flow-matching policy (action chunk of 30, 16-D action)
  • Memory: MemoryBank (instant + anchor memory fusion across the episode)
  • Aux head: subtask-end classifier (used for Mn multi-stage tasks; inert for M1)
  • Total parameters: ≈ 2.67 B

Results

task_config = demo_clean, instruction_type = unseen, 100 episodes per task, action_horizon = 30. The same checkpoint and same m1_mix normalization stats are used for every task.

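| Task               | Success Rate | Reward |
|--------------------|--------------|--------|
| put_back_block     | 1.00         | 1.00   |
| rearrange_blocks   | 0.86         | 0.86   |
| swap_blocks        | 0.81         | 0.81   |
| swap_T             | 0.13         | 0.13   |
| observe_and_pickup | 0.03         | 0.00   |
| Average            | 0.566        | -      |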
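
As a quick sanity check, the reported average can be reproduced from the per-task success rates. This is only arithmetic on the numbers in the table, not output from the evaluation scripts:

```python
# Per-task success rates copied from the results table.
success_rate = {
    "put_back_block": 1.00,
    "rearrange_blocks": 0.86,
    "swap_blocks": 0.81,
    "swap_T": 0.13,
    "observe_and_pickup": 0.03,
}

# Unweighted mean over tasks. Every task was run for 100 episodes,
# so this equals the pooled success rate over all 500 episodes.
average = sum(success_rate.values()) / len(success_rate)
print(f"average success rate: {average:.3f}")
```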

Per-episode logs and rollout videos for all five tasks are under eval_results/. See eval_results/summary.md for details and task_instructions.json for the exact per-task language instruction used.


Contents of this bundle

m1_mix_submit/
├── README.md                                   # this file
├── task_instructions.json                      # verbatim --global_task per task + scores
├── checkpoint/
│   ├── m1_mix_final_step50000.pt.part00 … part08   # 15.3 GB full training ckpt, split into 9 parts (2×4 GB + 7×≤1 GB)
│   ├── m1_mix_final_step50000.pt.sha256        # SHA-256 of the reassembled checkpoint
│   └── README_REASSEMBLE.md                    # how to cat the parts back together + verify
├── norm_stats/
│   └── norm_stats.json                         # min-max state/action stats → [-1, 1]
├── configs/
│   ├── execution_module_train_m1_mix.yaml      # training config (reproducibility)
│   └── deploy_policy.yml                       # inference / deployment config
├── qwen_base_config/                           # Qwen3-VL-2B-Instruct config/processor ONLY
│   ├── config.json, generation_config.json
│   ├── tokenizer*.json, vocab.json, merges.txt
│   ├── preprocessor_config.json, video_preprocessor_config.json, chat_template.json
│   └── README_Qwen3-VL-2B-Instruct.md          # upstream model card (Apache-2.0)
└── eval_results/
    ├── summary.md
    └── <task>/                                 # _result.txt, eval_log.txt, episode*.mp4 (×100)

About the checkpoint

Reassemble first. The 15.3 GB checkpoint is uploaded as 9 byte-split parts (m1_mix_final_step50000.pt.part00…08) because the upload path capped the size of a single file and throttled the number of bytes accepted per time window. Concatenating the parts in order reproduces the original file bit-for-bit:

cat m1_mix_final_step50000.pt.part?? > m1_mix_final_step50000.pt
sha256sum -c m1_mix_final_step50000.pt.sha256   # -> m1_mix_final_step50000.pt: OK

See checkpoint/README_REASSEMBLE.md for details.
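
Where `cat` and `sha256sum` are not available (for example on Windows), the same two steps can be sketched in Python. The helper names `reassemble`, `sha256_of`, and `verify` are illustrative and not part of the bundle, and the parser assumes the `.sha256` file uses the usual `sha256sum` layout of `<hex digest>  <filename>`:

```python
import hashlib
from pathlib import Path

CHUNK = 1 << 24  # read 16 MiB at a time so the 15.3 GB file never sits in RAM


def reassemble(parts_dir: Path, name: str) -> Path:
    """Concatenate <name>.part00, <name>.part01, ... into <name>, like `cat`."""
    # Two-digit suffixes sort lexically in the same order as numerically.
    parts = sorted(parts_dir.glob(f"{name}.part??"))
    if not parts:
        raise FileNotFoundError(f"no files matching {name}.part?? in {parts_dir}")
    out_path = parts_dir / name
    with out_path.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                while chunk := src.read(CHUNK):
                    out.write(chunk)
    return out_path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(CHUNK):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, sha_file: Path) -> bool:
    """Compare against the first whitespace-separated token of the .sha256 file."""
    # Assumption: sha256sum text format, "<hex digest>  <filename>".
    expected = sha_file.read_text().split()[0].lower()
    return sha256_of(path) == expected
```

A typical call would be `ckpt = reassemble(Path("checkpoint"), "m1_mix_final_step50000.pt")` followed by `verify(ckpt, Path("checkpoint/m1_mix_final_step50000.pt.sha256"))`.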

Once reassembled, m1_mix_final_step50000.pt is the full training checkpoint at step 50000:

| key                  | content |
|----------------------|---------|
| model_state_dict     | 910 tensors, ≈ 2.67 B params (qwen_model ≈ 2.44 B, action_model ≈ 160 M, memory_bank ≈ 39 M, classifier ≈ 32 M); bf16 + fp32 |
| optimizer_state_dict | AdamW moments (for resume/fine-tune only) |
| scheduler_state_dict | cosine LR scheduler state |
| global_step          | 50000 |

The model_state_dict is self-contained: it already includes the fine-tuned Qwen3-VL-2B backbone weights. The bundled qwen_base_config/ provides only the architecture/tokenizer/processor config; the base model weights (model.safetensors, ~4 GB) are not redistributed here and must be downloaded from the official repo (see below).

Inference-only slimming (15.3 GB → ≈ 6 GB) if you don't need to resume training:

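import torch

# Load the full training checkpoint on CPU.
ck = torch.load("checkpoint/m1_mix_final_step50000.pt", map_location="cpu", weights_only=False)
# Keep only the model weights and the step counter; drop optimizer and scheduler state.
torch.save({"model_state_dict": ck["model_state_dict"], "global_step": ck["global_step"]},
           "m1_mix_inference.pt")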

The deploy loader reads payload["model_state_dict"] and calls load_state_dict(..., strict=False), so either the full or the slimmed file works unchanged.
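
To confirm that a reassembled or slimmed file contains what the key table describes, one option is to sum parameter counts per top-level module. The sketch below assumes the state-dict keys start with the module names listed above (`qwen_model.`, `action_model.`, `memory_bank.`, `classifier.`); the helper names are hypothetical and not part of the Mem-0 code:

```python
from collections import Counter


def params_by_module(state_dict):
    """Sum element counts per top-level key prefix (text before the first '.')."""
    counts = Counter()
    for name, tensor in state_dict.items():
        counts[name.split(".", 1)[0]] += tensor.numel()
    return dict(counts)


def summarize(path):
    """Print a per-module parameter summary for a checkpoint file."""
    import torch  # imported lazily so params_by_module stays dependency-free

    payload = torch.load(path, map_location="cpu", weights_only=False)
    counts = params_by_module(payload["model_state_dict"])
    for module, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{module:>14}: {n / 1e6:10.1f} M")
    print(f"{'total':>14}: {sum(counts.values()) / 1e9:10.2f} B")
```

Calling `summarize("m1_mix_inference.pt")` should list totals close to the figures in the table if the prefix assumption holds.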


Dependencies

  1. Code: the RMBench / Mem-0 repository (this checkpoint targets its policy/Mem-0 execution module and script/eval_policy.py). Follow the repo README for the RoboTwin 2.0 simulator environment setup.

  2. Base VLM: Qwen/Qwen3-VL-2B-Instruct (Apache-2.0). Required at model instantiation for the architecture + image/text processor. Its weights are overwritten by this checkpoint at load time (strict=False), but the directory must exist and contain model.safetensors:

    huggingface-cli download Qwen/Qwen3-VL-2B-Instruct \
        --local-dir policy/Mem-0/checkpoints/Qwen3-VL-2B-Instruct
    

    The small config/processor files in qwen_base_config/ are exactly the ones used for training and evaluation; you may overlay them onto the downloaded directory if the upstream revision differs.


How to run evaluation

Point the deploy config at the checkpoint and the m1_mix stats, then run one task at a time. This mirrors exactly how the numbers above were produced:

python script/eval_policy.py --config policy/Mem-0/deploy_policy.yml --overrides \
    --task_name        swap_blocks \
    --execution_ckpt   /path/to/m1_mix_final_step50000.pt \
    --state_stats_path /path/to/norm_stats/norm_stats.json \
    --ckpt_setting     m1mix \
    --global_task      "There are three traies on the table, and two blocks are placed in two different traies. You may move only one block at a time, and each tray can hold at most one block. Swap the positions of the two blocks. Finally press the button." \
    --action_horizon   30
  • Replace --task_name and --global_task with each of the five tasks (strings in task_instructions.json). The checkpoint and --state_stats_path stay the same.
  • --ckpt_setting m1mix only labels the output directory (eval_result/<task>/Mem-0/demo_clean/m1mix/<timestamp>/).
  • --vllm_url is accepted but unused for M1 tasks (the global instruction is set directly; the planner client is constructed but never queried).
  • Ensure execution_module.qwen_vl.model_path in deploy_policy.yml points to your downloaded Qwen3-VL-2B-Instruct directory.
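
Running all five tasks amounts to repeating the command above with a different --task_name and --global_task. A minimal sketch of a driver follows; `build_eval_command` is a hypothetical helper, the flags are copied from the command above, and the instruction strings must be taken from task_instructions.json, whose exact schema is not shown here:

```python
import shlex

# The five M1 tasks from the results table.
M1_TASKS = [
    "put_back_block",
    "rearrange_blocks",
    "swap_blocks",
    "swap_T",
    "observe_and_pickup",
]


def build_eval_command(task_name, global_task, ckpt_path, stats_path):
    """Return the argv list for one evaluation run (same flags as the command above)."""
    return [
        "python", "script/eval_policy.py",
        "--config", "policy/Mem-0/deploy_policy.yml",
        "--overrides",
        "--task_name", task_name,
        "--execution_ckpt", ckpt_path,
        "--state_stats_path", stats_path,
        "--ckpt_setting", "m1mix",
        "--global_task", global_task,
        "--action_horizon", "30",
    ]


# Placeholder: fill this dict with the verbatim strings from task_instructions.json.
instructions = {task: f"<instruction for {task}>" for task in M1_TASKS}

for task in M1_TASKS:
    cmd = build_eval_command(
        task,
        instructions[task],
        "/path/to/m1_mix_final_step50000.pt",
        "/path/to/norm_stats/norm_stats.json",
    )
    # Print a shell-safe command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building an argv list rather than a single string avoids quoting problems with the long natural-language instructions.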

Model architecture (from configs/)

  • VLM backbone: Qwen3-VL-2B-Instruct, 224×224 head-camera image + language instruction, last-layer hidden states (hidden size 2048).
  • MemoryBank: window_size 30, initial_anchor_size 1, num_heads 8, memory_accumulation 8, dropout 0.1; fuses an instant-memory and an anchor-memory token; concatenated with the text feature → a 3-token summary (B, 3, 2048).
  • DiT-B action head (FlowmatchingActionHead): num_layers 16, cross_attention_dim 2048, action_dim 16, state_dim 16, action_horizon 30, num_inference_timesteps 8; flow-matching regression of a 30-step action chunk.
  • Subtask-end classifier: MLP hidden_sizes [6144, 2048, 512], pos_weight 10.0, focal_gamma 1.0, threshold 0.5. Drives stage transitions in Mn tasks; for M1 the episode is a single stage so it does not affect rollout.
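
To illustrate what num_inference_timesteps 8 means for a flow-matching head, the sketch below integrates a velocity field from noise to a 30×16 action chunk with 8 Euler steps. This is a generic illustration and not the FlowmatchingActionHead code: the Euler integrator, the time convention (noise at t = 0, actions at t = 1), and the toy velocity function that replaces the learned, memory-conditioned DiT are all assumptions.

```python
import random

ACTION_HORIZON = 30  # action chunk length (from the config above)
ACTION_DIM = 16      # action dimensionality (from the config above)
NUM_STEPS = 8        # num_inference_timesteps (from the config above)


def euler_sample(velocity_fn, noise, num_steps=NUM_STEPS):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with fixed-size Euler steps."""
    x = [row[:] for row in noise]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for the learned network: velocity of a straight path toward `target`."""
    def velocity(x, t):
        # On the linear path x_t = (1 - t) * x_0 + t * x_1, the velocity is (x_1 - x_t) / (1 - t).
        return [[(ti - xi) / (1.0 - t) for xi, ti in zip(x_row, t_row)]
                for x_row, t_row in zip(x, target)]
    return velocity


rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(ACTION_HORIZON)]
target = [[0.5] * ACTION_DIM for _ in range(ACTION_HORIZON)]
chunk = euler_sample(make_toy_velocity(target), noise)
print(len(chunk), len(chunk[0]))
```

In the real model the velocity would come from the DiT conditioned on the 3-token summary and the robot state; with the toy field, the 8 steps land on `target` up to floating-point error.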

Training (from configs/execution_module_train_m1_mix.yaml)

  • Data: m1_mix (the five M1 tasks merged into one LeRobot dataset with globally unique episode_ids). Features: head-camera image, state, action, subtask, subtask_end, episode_id.
  • Schedule: train_steps 50000, batch_size 56, cosine scheduler, warmup_ratio 0.05, grad_clip_norm 2.5, weight_decay 0.005, seed 42.
  • Learning rates: base 1e-5, qwen_model 1e-5, action_model 1e-4, classifier 1e-4 (min LRs 1e-6 / 1e-6 / 5e-6 / 5e-6).
  • Loss: lambda_action 1.0, lambda_classifier 0.2.
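
For intuition about the schedule, the sketch below computes a learning rate per step from the listed values (train_steps 50000, warmup_ratio 0.05, base and minimum LRs). The linear warmup shape and decay down to the minimum LR are assumptions; the training code may implement warmup or the cosine floor differently.

```python
import math

TRAIN_STEPS = 50_000
WARMUP_STEPS = round(0.05 * TRAIN_STEPS)  # warmup_ratio 0.05 -> 2500 steps


def lr_at(step, base_lr, min_lr, total=TRAIN_STEPS, warmup=WARMUP_STEPS):
    """Linear warmup to base_lr (assumed), then cosine decay to min_lr."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Per-group (base LR, min LR) pairs from the list above.
groups = {
    "qwen_model": (1e-5, 1e-6),
    "action_model": (1e-4, 5e-6),
    "classifier": (1e-4, 5e-6),
}

for step in (0, WARMUP_STEPS, TRAIN_STEPS // 2, TRAIN_STEPS):
    row = ", ".join(f"{name}={lr_at(step, base, low):.2e}"
                    for name, (base, low) in groups.items())
    print(f"step {step:>6}: {row}")
```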

Normalization

State and action are min-max normalized to [-1, 1] over the 14 arm dimensions using norm_stats/norm_stats.json (NORM_WAY = "minmax" in deploy_policy.py). Use the same stats file at inference; predicted actions are denormalized with it before being sent to the environment. Action chunks from overlapping predictions are averaged (mean smoothing) before execution.
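
The two operations described above can be sketched in plain Python. The function names are illustrative, the layout of norm_stats.json is not reproduced here (the helpers take per-dimension min and max lists directly), and the averaging rule is one straightforward reading of "mean smoothing" rather than the code in deploy_policy.py:

```python
def normalize(x, lo, hi, eps=1e-8):
    """Map each dimension from [lo, hi] to [-1, 1]."""
    return [2.0 * (xi - l) / max(h - l, eps) - 1.0 for xi, l, h in zip(x, lo, hi)]


def denormalize(y, lo, hi):
    """Inverse of normalize: map [-1, 1] back to [lo, hi]."""
    return [(yi + 1.0) / 2.0 * (h - l) + l for yi, l, h in zip(y, lo, hi)]


def average_overlapping(chunks):
    """Average every prediction that covers the same environment step.

    chunks: list of (start_step, chunk), where chunk is a list of action vectors.
    Returns a dict mapping step -> mean action vector.
    """
    sums, counts = {}, {}
    for start, chunk in chunks:
        for offset, action in enumerate(chunk):
            step = start + offset
            if step not in sums:
                sums[step] = [0.0] * len(action)
                counts[step] = 0
            sums[step] = [s + a for s, a in zip(sums[step], action)]
            counts[step] += 1
    return {step: [s / counts[step] for s in total] for step, total in sums.items()}


lo, hi = [0.0, -2.0], [10.0, 2.0]
state = [5.0, 1.0]
norm = normalize(state, lo, hi)
print(norm, denormalize(norm, lo, hi))
```

Using the same min and max values for both directions is what makes the round trip exact, which is why the same stats file must be used at training and inference time.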

Limitations

  • swap_T (0.13) and observe_and_pickup (0.03) are weak: the former needs precise T-block position and orientation alignment; the latter needs cross-view target re-identification after a visual occlusion followed by a pickup. The joint m1_mix model does not solve these reliably.
  • Numbers are on RoboTwin 2.0 demo_clean with unseen instruction phrasings; other task configs / domain randomization will differ.

License & attribution

  • Base VLM Qwen3-VL-2B-Instruct is © the Qwen team, licensed Apache-2.0 (see qwen_base_config/README_Qwen3-VL-2B-Instruct.md). Because the checkpoint embeds fine-tuned Qwen weights, that license applies to the corresponding components.
  • RMBench / RoboTwin and the Mem-0 policy code are governed by their respective upstream licenses; refer to the source repository.