Hy-Embodied-0.5-VLA

From Vision-Language-Action Models to a Real-World Robot Learning Stack

Tencent Robotics X × Tencent Hy Team


📖 Abstract

We introduce Hy-Embodied-0.5-VLA (Hy-VLA), an end-to-end Vision-Language-Action system that spans the full robot learning stack: data collection, model design, pre-training, supervised fine-tuning, RL post-training, and real-world deployment. Built on the Hy-Embodied-0.5 MoT backbone, Hy-VLA integrates a flow-matching action expert, a compact memory encoder for multi-frame history, and a delta-chunk action representation decoupled from embodiment-specific kinematics.

Powered by 10,000+ hours of high-fidelity UMI demonstrations collected via a custom fingertip interface with optical motion-capture, Hy-VLA achieves state-of-the-art results on the RoboTwin 2.0 benchmark (90.9% / 90.1% on Clean / Randomized) and demonstrates robust cross-embodiment transfer across four real-world robot platforms. Paired with FlowPRO preference optimization and an asynchronous inference framework, Hy-VLA establishes a scalable paradigm for continuous dexterous manipulation.

Overview

Hy-VLA-UMI is the pre-trained checkpoint of Hy-Embodied-0.5-VLA (Hy-VLA), an end-to-end Vision-Language-Action system built on the Hy-Embodied-0.5 MoT backbone. Powered by 10,000+ hours of high-fidelity UMI demonstrations collected via a custom fingertip interface with optical motion-capture, this checkpoint serves as a generalist starting point for downstream fine-tuning on target embodiments.

Architecture

  • VLM Backbone: Hy-Embodied-0.5 MoT
  • Action Expert: 370M-parameter dual-tower flow-matching transformer (hidden=1024, intermediate=2048)
  • Video Encoder: Single-frame mode (K=1) during pre-training; memory encoder is activated during SFT
  • Action Representation: Relative-to-first-frame delta EEF chunk (10-dim per arm: xyz + rot6d + gripper)
  • Action Horizon: H=50 at 10 Hz
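
The action representation above can be illustrated with a small sketch. It assumes that "relative to first frame" means each end-effector pose in the chunk is expressed in the frame of the chunk's first pose, that rot6d is the first two columns of the rotation matrix, and that the gripper value is passed through unchanged. None of these details are confirmed here, and `rot_z`, `rot6d`, and `delta_chunk` are hypothetical helper names, not part of `hy_vla`.

```python
import math

def rot_z(theta):
    """Rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def rot6d(m):
    """Assumed rot6d layout: first column, then second column."""
    return [m[0][0], m[1][0], m[2][0], m[0][1], m[1][1], m[2][1]]

def delta_chunk(poses):
    """poses: list of (position, rotation, gripper) -> list of 10-dim rows."""
    p0, r0, _ = poses[0]
    r0_t = transpose(r0)
    rows = []
    for p, r, g in poses:
        rel_p = matvec(r0_t, [p[i] - p0[i] for i in range(3)])  # R0^T (p - p0)
        rel_r = matmul(r0_t, r)                                  # R0^T R
        rows.append(rel_p + rot6d(rel_r) + [g])
    return rows

H, HZ = 50, 10  # action horizon and control rate from the list above
poses = [([0.5 + 0.01 * t, 0.0, 0.3], rot_z(0.3 + 0.02 * t), 1.0) for t in range(H)]
chunk = delta_chunk(poses)
print(len(chunk), len(chunk[0]), H / HZ)  # rows, dims per arm, seconds covered
```

Under these assumptions the first row of every chunk is the identity pose (zero translation, identity rotation), and a chunk of H=50 steps at 10 Hz covers 5 seconds of motion per arm.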

Training

| Property | Value |
| --- | --- |
| Data | Full 10K-hour UMI corpus (~1M episodes, 70+ tasks) |
| Initialization | VLM: tencent/HY-Embodied-0.5; Action Expert: random |
| Objective | Conditional flow matching (no co-training) |
| Steps | 200K |
| Global batch size | 1,024 |
| Learning rate | 5e-5 (linear warmup 1K → decay to 5e-6 over 160K → constant 40K) |
| Optimizer | AdamW, bfloat16 mixed precision |
| Hardware | 64 GPUs (8 nodes × 8) |
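
The learning-rate schedule listed above can be sketched as a function of the optimizer step. The shape of the decay is not stated, so the cosine curve below is an assumption, as is treating the 1K warmup steps as a separate phase before the 160K decay window. `lr_at` is a hypothetical helper, not part of the training code.

```python
import math

PEAK_LR, FINAL_LR = 5e-5, 5e-6
WARMUP_STEPS, DECAY_STEPS = 1_000, 160_000

def lr_at(step):
    """Learning rate at a given optimizer step (decay shape is an assumption)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak rate
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay window completed, clamped to 1 for the constant tail
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    # Assumed cosine decay from the peak rate down to the final rate
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1_000, 81_000, 161_000, 200_000):
    print(step, f"{lr_at(step):.2e}")
```

For scale, 200K steps at a global batch size of 1,024 corresponds to 204.8M training samples, or 16 samples per GPU per step across 64 GPUs if no gradient accumulation is used.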

Contents

The checkpoint ships with all necessary files for loading and inference:

tencent/Hy-Embodied-0.5-VLA-UMI/
├── model.safetensors         # Model weights
├── config.json               # HyVLA configuration
├── tokenizer.json            # Tokenizer for the VLM backbone
├── tokenizer_config.json
├── special_tokens_map.json
├── chat_template.jinja       # Chat template for instruction formatting
├── preprocessor_config.json  # Image preprocessing config
├── norm_stats.pkl            # Pre-computed normalization statistics
└── LICENSE
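
A quick way to catch an incomplete download is to compare a local snapshot directory against the file list above. The sketch below uses only the standard library; `missing_files` is a hypothetical helper and the directory name in the usage line is a placeholder.

```python
from pathlib import Path

# File names taken from the checkpoint listing above
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "chat_template.jinja",
    "preprocessor_config.json",
    "norm_stats.pkl",
    "LICENSE",
]

def missing_files(ckpt_dir):
    """Return the expected file names that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# Placeholder path: point this at the directory returned by snapshot_download
print(missing_files("./Hy-Embodied-0.5-VLA-UMI"))
```

An empty list means every file in the listing was found.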

Usage

Basic Loading

import torch
from huggingface_hub import snapshot_download
from hy_vla import HyVLA, HyVLAConfig

ckpt = snapshot_download("tencent/Hy-Embodied-0.5-VLA-UMI")

config = HyVLAConfig.from_pretrained(ckpt)
policy = HyVLA.from_pretrained(ckpt, config=config)
policy.enable_video_encoder_if_needed()  # K=1 in pretrain; call this before fine-tuning with K>1
policy = policy.to(device="cuda", dtype=torch.bfloat16).eval()

# (B, K, C, H, W); K=1 history slot (pre-trained mode)
img = torch.zeros(1, 1, 3, 224, 224, device="cuda", dtype=torch.bfloat16)
# Normalized dual-arm EEF: [xyz(3) + rot6d(6) + gripper(1)] * 2
state = torch.zeros((1, config.max_state_dim), device="cuda", dtype=torch.bfloat16)
batch = {
    "observation.images.top_head":   img,
    "observation.images.hand_left":  img,
    "observation.images.hand_right": img,
    "observation.state": state,
    "task": ["pick up the bottle"],
}

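# Inference only: disable gradient tracking
with torch.no_grad():
    actions = policy.forward_evaluate(batch)["pred"]
    # Keep only the first config.action_feature.shape[0] entries of the last axis
    actions = actions[..., : config.action_feature.shape[0]]
print(actions.shape)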
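
Because the model predicts delta EEF chunks, each predicted step has to be composed with the current end-effector pose before it can be sent to a robot. The sketch below shows one way to do that for a single 10-dim step. It assumes the prediction has already been denormalized, that each arm's layout is [xyz, rot6d, gripper] with the left arm first, and that rot6d holds the first two columns of the relative rotation. These are assumptions about the output format, and `rot6d_to_matrix` and `apply_delta` are hypothetical helpers.

```python
import math

def _unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rot6d_to_matrix(d6):
    """Gram-Schmidt: rebuild a rotation matrix from its first two columns."""
    a1, a2 = d6[:3], d6[3:6]
    b1 = _unit(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _unit([y - dot * x for x, y in zip(b1, a2)])
    b3 = [b1[1] * b2[2] - b1[2] * b2[1],
          b1[2] * b2[0] - b1[0] * b2[2],
          b1[0] * b2[1] - b1[1] * b2[0]]  # b1 x b2
    return [[b1[i], b2[i], b3[i]] for i in range(3)]  # columns b1, b2, b3

def apply_delta(p0, r0, step):
    """Compose a relative 10-dim step with the reference pose (p0, r0)."""
    rel_p, rel_r, grip = step[:3], rot6d_to_matrix(step[3:9]), step[9]
    pos = [p0[i] + sum(r0[i][k] * rel_p[k] for k in range(3)) for i in range(3)]
    rot = [[sum(r0[i][k] * rel_r[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
    return pos, rot, grip

# Toy 20-dim dual-arm step: 0.1 m along the local x axis, no rotation
row = [0.1, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.8] * 2
left, right = row[:10], row[10:]  # assumed arm order
p0 = [0.5, 0.0, 0.3]
r0 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degrees about z
pos, rot, grip = apply_delta(p0, r0, left)
print(pos, grip)
```

In this toy case the reference frame is rotated 90 degrees about z, so a step along the local x axis moves the end effector along the world y axis.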

Fine-Tuning

This model is designed to be fine-tuned. See the main README for the SFT recipe:

# Fine-tune on RoboTwin 2.0
export CHIEF_IP=<chief-ip> INDEX=0
bash scripts/train_robotwin_umi.sh

Normalization Statistics

The checkpoint includes a pre-computed norm_stats.pkl file derived from the full UMI pre-training corpus. If you are fine-tuning on a new dataset with substantially different statistics, you can regenerate it:

python scripts/compute_norm_lance.py \
    --lance-source /path/to/your/data \
    --output norm_stats.pkl
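
Neither the statistics the script computes nor the layout of norm_stats.pkl is described here. As a rough illustration of the idea only, the sketch below assumes per-dimension mean and standard deviation; `compute_norm_stats` and `normalize_row` are hypothetical helpers and do not reproduce `compute_norm_lance.py`.

```python
import statistics

def compute_norm_stats(rows):
    """Per-dimension mean and population std over a list of equal-length rows."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(col) for col in columns],
        "std": [statistics.pstdev(col) for col in columns],
    }

def normalize_row(row, stats, eps=1e-6):
    """Standardize one row; eps guards against zero-variance dimensions."""
    return [(x - m) / (s + eps)
            for x, m, s in zip(row, stats["mean"], stats["std"])]

rows = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]  # toy data, second dim is constant
stats = compute_norm_stats(rows)
print(normalize_row([2.0, 10.0], stats))  # [0.0, 0.0]
```

If a new dataset's ranges differ a lot from the pre-training corpus, stale statistics would map its states and actions far from the range the model was trained on, which is why regenerating them is suggested.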

📚 Citation

If you find Hy-VLA useful for your research, please cite:

@article{tencent2026hyembodied05vla,
  title={Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack},
  author={Tencent Robotics X and Tencent Hy Team},
  journal={arXiv preprint arXiv:2606.14409},
  year={2026}
}

License

This model is released under Apache-2.0.

Model size: 5B params (Safetensors), tensor type BF16.