EgoSteer-3B-RealMan

EgoSteer-3B-RealMan is the generalist policy from EgoSteer: A Full-Stack System Towards Steerable Dexterous Manipulation from Egocentric Videos. It is a world-model-enhanced Vision-Language-Action (VLA) policy built on a Qwen3-VL backbone with a flow-matching action expert, post-trained from EgoSteer-3B-Base on real-world demonstrations collected on the RealMan robot.

This is a robot-ready dual-camera (head + chest) checkpoint. For the base model to fine-tune on your own data, see EgoSteer-3B-Base.

🌐 Project page: https://egosteer.github.io/
📄 Paper: https://github.com/egosteer/egosteer
💻 Code: https://github.com/egosteer/egosteer

Model Description

Our full-stack system integrates EgoSmith (data pipeline), Robot Stack (deployment), and EgoSteer (policy). It learns from 9.6k hours of egocentric human videos and supports data-efficient real-robot post-training, enabling steerable dexterous manipulation across more than 40 tasks and few-shot adaptation to complex, long-horizon tasks. EgoSteer-3B-RealMan runs on the RealMan embodiment out of the box and can be further post-trained for other embodiments.

Component	Description
Backbone	Qwen3-VL-2B-Instruct
Action expert	Flow-matching (DiT / AdaLN) expert reusing the backbone KV prefix
World model expert	Regresses future-frame features from a frozen DINOv3 ViT-L/16 teacher; used only during training
Action space	Unified human-to-robot space based on wrist poses and fingertip keypoints
Cameras	Dual-camera: head + chest
Total parameters	~3B
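To illustrate how a flow-matching expert can turn noise into an action chunk, here is a minimal sketch of forward-Euler sampling. It is not the EgoSteer implementation: `sample_action_chunk`, `make_toy_velocity`, the step count, and the time convention (t=0 is noise, t=1 is actions) are all assumptions made for illustration, and the toy velocity field stands in for the learned DiT expert. Only the chunk shape (32 steps × 48 dimensions) comes from this card.

```python
import numpy as np

CHUNK_LEN = 32   # action chunk length stated in this card
ACTION_DIM = 48  # action dimensionality stated in this card


def sample_action_chunk(velocity_fn, num_steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (actions).

    Assumption: the real expert also conditions on vision, language, and
    state; that conditioning is omitted here.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((CHUNK_LEN, ACTION_DIM))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # forward Euler step
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the learned expert: straight-line flow to `target`."""
    def velocity_fn(x, t):
        # On the linear path x_t = (1 - t) * x0 + t * target, the constant
        # velocity (target - x0) equals (target - x_t) / (1 - t).
        return (target - x) / (1.0 - t)
    return velocity_fn


target = np.zeros((CHUNK_LEN, ACTION_DIM))
chunk = sample_action_chunk(make_toy_velocity(target))
print(chunk.shape)
```

With the toy field, the sampler lands on `target` after the last step, which makes the loop easy to sanity-check before swapping in a real network.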

Inputs & Outputs

The policy maps a language instruction plus multi-view RGB and proprioception to a chunk of future actions. Keep these consistent with the bundled config.yaml and normalizer.pkl.

Field	Specification
Instruction	A natural-language task description
Cameras	Dual RGB — head + chest, `480 × 640`, 6-frame history (stride 30 over a 30 fps base)
Intrinsics	Per-camera [fx, fy, cx, cy] (head + chest) — required. By default rendered into the VLM prompt as text (`camera_intrinsic_mode: text`); rescaled to match the resized image
Proprioception (state)	48-D with a 6-frame history: bimanual wrist poses in the camera frame (2 × [3 translation + 6D rotation] = 18) plus fingertip keypoints in the wrist frame (2 hands × 5 fingertips × 3D = 30)
Action output	48-D relative action in the same layout as the state, expressed relative to the current state and predicted as a 32-step action chunk
Normalization	State/action normalized with the bundled `normalizer.pkl` (relative action space) — required for inference and fine-tuning
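The dimensions in the table above can be checked with a short sketch. The helper names (`split_state`, `rescale_intrinsics`, `history_indices`) are hypothetical and not part of the EgoSteer codebase. The ordering inside the 48-D vector (both wrists first, then both hands' fingertips), the plain-resize intrinsics scaling without cropping, and the reading of "stride 30" as one frame every 30 base frames are assumptions; consult the bundled config.yaml and the codebase for the actual conventions.

```python
import numpy as np

STATE_DIM = 48
WRIST_DIM = 9  # 3 translation + 6D rotation per wrist


def split_state(state):
    """Split a (..., 48) vector into wrists (..., 2, 9) and fingertips (..., 2, 5, 3).

    Assumed ordering: 18 wrist values first, then 30 fingertip values.
    """
    state = np.asarray(state)
    if state.shape[-1] != STATE_DIM:
        raise ValueError(f"expected last dim {STATE_DIM}, got {state.shape[-1]}")
    lead = state.shape[:-1]
    wrists = state[..., : 2 * WRIST_DIM].reshape(*lead, 2, WRIST_DIM)
    fingertips = state[..., 2 * WRIST_DIM:].reshape(*lead, 2, 5, 3)
    return wrists, fingertips


def rescale_intrinsics(intrinsics, src_hw, dst_hw):
    """Scale [fx, fy, cx, cy] for a plain resize from src_hw to dst_hw (height, width)."""
    fx, fy, cx, cy = intrinsics
    scale_y = dst_hw[0] / src_hw[0]
    scale_x = dst_hw[1] / src_hw[1]
    return [fx * scale_x, fy * scale_y, cx * scale_x, cy * scale_y]


def history_indices(current, num_frames=6, stride=30):
    """Frame indices for the history window, oldest first, ending at `current`.

    Assumes at least (num_frames - 1) * stride earlier frames exist.
    """
    return [current - stride * i for i in range(num_frames - 1, -1, -1)]


wrists, fingertips = split_state(np.arange(STATE_DIM, dtype=np.float32))
print(wrists.shape, fingertips.shape)
```

Under the stride assumption, a 6-frame history at a 30 fps base spans five seconds, with one frame per second.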

Model Variants

Model	Parameters	Description
EgoSteer-3B-Base	3B	Base EgoSteer model trained on 9.6k hours of egocentric human videos, ready for fine-tuning
EgoSteer-3B-RealMan (this repo)	3B	Generalist post-trained on real-world data collected on the RealMan robot

Repository Contents

File	Description
`model_bf16.pt`	bf16 model weights
`config.yaml`	RealMan post-training config, used to rebuild the network for evaluation and inference. For further post-training, supply your own config and reuse only the weights
`normalizer-relative-10k-pretrain/normalizer.pkl`	State/action normalizer — required for fine-tuning and inference (shared with EgoSteer-3B-Base; relative action space)
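Before loading the checkpoint, it can help to confirm that a local copy of the repository contains every file listed above. The following is a small sketch using only the standard library; `missing_checkpoint_files` is a hypothetical helper, and the directory you pass in is whatever path you downloaded the repository to.

```python
from pathlib import Path

# Relative paths taken from the repository contents table above.
EXPECTED_FILES = (
    "model_bf16.pt",
    "config.yaml",
    "normalizer-relative-10k-pretrain/normalizer.pkl",
)


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected files that are absent under `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list means all three files are present; anything else names the files to download again.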

⚠️ How to Use

This is a custom policy, loaded and run with the EgoSteer codebase.

This is a dual-camera policy, so a few config keys differ from the single-camera base. When serving or evaluating, keep these consistent with the bundled config.yaml:

data.target_image_size: [480, 640]
dataset.vla_dataset.load_chest: true (head + chest)
data.max_vlm_tokens: 2176 — dual-camera uses roughly twice the vision tokens of single-camera; a smaller value truncates the input and fails with a video-token-count mismatch.
policy.rtc_config.enabled: true — real-time chunking (RTC) for smooth asynchronous closed-loop control; tolerates up to max_delay: 6 steps of inference latency. Disable it for synchronous, blocking action-chunk execution.
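The keys above can be checked programmatically once config.yaml has been parsed into nested dictionaries. This sketch assumes the dotted names map directly onto nested keys, which is an assumption about the config layout; `lookup` and `config_mismatches` are hypothetical helpers, not functions from the EgoSteer codebase. The `max_delay` setting is left out because its full key path is not stated here.

```python
# Values taken from the list above; the dotted-path nesting is an assumption.
EXPECTED = {
    "data.target_image_size": [480, 640],
    "dataset.vla_dataset.load_chest": True,
    "data.max_vlm_tokens": 2176,
    "policy.rtc_config.enabled": True,
}

_MISSING = object()  # sentinel distinguishing "absent" from a stored None


def lookup(config, dotted_key):
    """Walk a nested dict along a dotted key; return _MISSING if any part is absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def config_mismatches(config, expected=EXPECTED):
    """Map each mismatched key to the value found (None if the key is absent)."""
    problems = {}
    for key, want in expected.items():
        got = lookup(config, key)
        if got is _MISSING or got != want:
            problems[key] = None if got is _MISSING else got
    return problems
```

An empty result means the parsed config agrees with the dual-camera settings; a non-empty result lists each key that differs along with the value found.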

Pretrained Backbones

EgoSteer depends on two pretrained backbones, Qwen3-VL (`Qwen3-VL-2B-Instruct`) and DINOv3 (ViT-L/16); download them ahead of time.

Training Data

EgoSteer-3B-RealMan is post-trained from EgoSteer-3B-Base (pretrained on 9.6k hours of egocentric human videos in a unified human-to-robot action space) on real-world dual-camera demonstrations collected on the RealMan robot. See the paper and project page for details.

Intended Use & Limitations

Intended use: research on vision-language-action models, world action models, and dexterous manipulation; a robot-ready RealMan generalist and a starting point for post-training.
Limitations: the policy is tuned to the RealMan dual-camera observation/action format; deploying on a different embodiment requires matching the observation format (camera setup, image size, normalizer) or further post-training. Outputs should be validated for safety before execution on hardware.

Citation

@article{egosteer2026,
  title   = {EgoSteer: A Full-Stack System Towards Steerable Dexterous Manipulation from Egocentric Videos},
  author  = {EgoSteer Team},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}

License

Released under the Apache 2.0 license.

Acknowledgements

Built on Qwen3-VL and DINOv3.
