EgoSteer-3B-Base
EgoSteer-3B-Base is the base world-model-enhanced Vision-Language-Action (VLA) policy from EgoSteer: A Full-Stack System Towards Steerable Dexterous Manipulation from Egocentric Videos. It is built on a Qwen3-VL backbone with a flow-matching action expert and a DINOv3 latent future-prediction objective used only during training, and learns a unified human-to-robot action space from large-scale egocentric human videos.
This is the pretrained base checkpoint. For a robot-ready generalist, see EgoSteer-3B-RealMan.
- 🌐 Project page: https://egosteer.github.io/
- 📄 Paper: https://github.com/egosteer/egosteer
- 💻 Code: https://github.com/egosteer/egosteer
Model Description
Our full-stack system integrates EgoSmith (data pipeline), Robot Stack (deployment), and EgoSteer (policy). It learns from 9.6k hours of egocentric human videos and supports data-efficient real-robot post-training, enabling steerable dexterous manipulation across more than 40 tasks as well as few-shot adaptation to complex, long-horizon tasks. EgoSteer-3B-Base is the pretrained base and a starting point for data-efficient post-training on real robots.
| Component | Description |
|---|---|
| Backbone | Qwen3-VL-2B-Instruct |
| Action expert | Flow-matching (DiT / AdaLN) expert reusing the backbone KV prefix |
| World model expert | Regresses future-frame features from a frozen DINOv3 ViT-L/16 teacher; used during training only |
| Action space | Unified human-to-robot space based on wrist poses and fingertip keypoints |
| Total parameters | ~3B |
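As a rough illustration of how a flow-matching expert turns noise into an action chunk, the sketch below Euler-integrates a velocity field from t = 0 to t = 1. It is not the EgoSteer implementation: `toy_velocity` stands in for the learned expert, and the step count, the time convention (noise at t = 0), and all function names are assumptions. Only the 32 × 48 chunk shape comes from this card.

```python
import random

CHUNK_LEN = 32   # action chunk length stated in this card
ACTION_DIM = 48  # action dimension stated in this card


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    On the straight-line path x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t to the target at t = 1 is (target - x_t) / (1 - t).
    """
    return [[(tg - xv) / (1.0 - t) for xv, tg in zip(x_row, t_row)]
            for x_row, t_row in zip(x, target)]


def sample_action_chunk(velocity_fn, num_steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (actions)."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the shape of one action chunk.
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
         for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # the last evaluation is at t = 1 - dt, so 1 - t is never 0
        v = velocity_fn(x, t)
        x = [[xv + dt * vv for xv, vv in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


# Toy target chunk; in a real policy the velocity would be conditioned on
# the observation rather than on a known target.
target = [[0.01 * j for j in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]
chunk = sample_action_chunk(lambda x, t: toy_velocity(x, t, target))
```

With this toy field the integration lands on the target chunk; a trained expert would instead predict the velocity from the backbone features and the noisy actions.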
Inputs & Outputs
The policy maps a language instruction plus RGB and proprioception to a chunk of future actions.
Keep the input and output formats consistent with the bundled config.yaml and normalizer.pkl.
| Field | Specification |
|---|---|
| Instruction | A natural-language task description |
| Cameras | Single RGB camera (head), 384 × 384, 6-frame history (stride 30 over a 30 fps base, i.e. one frame per second) |
| Intrinsics | Per-camera [fx, fy, cx, cy] (head), required. Rendered into the VLM prompt as text by default (camera_intrinsic_mode: text); values are rescaled to match the resized image |
| Proprioception (state) | 48-D with a 6-frame history: bimanual wrist poses in the camera frame (2 × [3 translation + 6D rotation] = 18) plus fingertip keypoints in the wrist frame (2 hands × 5 fingertips × 3D = 30) |
| Action output | 48-D action in the same layout as the state, expressed relative to the current state and predicted as a 32-step action chunk |
| Normalization | State and action are normalized with the bundled normalizer.pkl (relative action space); required for inference and fine-tuning |
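The table above can be read as a small input contract. The sketch below shows one way to build the frame history, rescale intrinsics, and check the 48-D state layout. The helper names are hypothetical, and several details are assumptions not confirmed by this card: that the history window includes the current frame and clamps at frame 0, that the image is resized without cropping, and that wrist values precede fingertip values in the flattened state.

```python
HISTORY_LEN = 6    # frames of history (from this card)
STRIDE = 30        # frame stride over the 30 fps stream (from this card)
IMAGE_SIZE = 384   # square input resolution (from this card)


def history_indices(t, history_len=HISTORY_LEN, stride=STRIDE):
    """Frame indices for the history ending at frame t, oldest first.

    Assumption: the window includes frame t and clamps to frame 0 early on.
    """
    return [max(t - k * stride, 0) for k in reversed(range(history_len))]


def rescale_intrinsics(intrinsics, src_wh, dst_wh=(IMAGE_SIZE, IMAGE_SIZE)):
    """Scale [fx, fy, cx, cy] for a plain resize from src_wh to dst_wh.

    Assumption: no cropping or padding is applied before the resize.
    """
    fx, fy, cx, cy = intrinsics
    sx = dst_wh[0] / src_wh[0]
    sy = dst_wh[1] / src_wh[1]
    return [fx * sx, fy * sy, cx * sx, cy * sy]


def pack_state(wrists, fingertips):
    """Flatten 2 wrist poses (9 values each) and 2 x 5 x 3 fingertips to 48-D.

    Assumption: wrists come first, each as 3 translation + 6D rotation values.
    """
    if len(wrists) != 2 or any(len(w) != 9 for w in wrists):
        raise ValueError("expected 2 wrist poses with 9 values each")
    if len(fingertips) != 2 or any(
        len(hand) != 5 or any(len(tip) != 3 for tip in hand)
        for hand in fingertips
    ):
        raise ValueError("expected 2 hands x 5 fingertips x 3 coordinates")
    state = [v for w in wrists for v in w]
    state += [v for hand in fingertips for tip in hand for v in tip]
    return state


indices = history_indices(150)
scaled = rescale_intrinsics([600.0, 600.0, 320.0, 240.0], src_wh=(640, 480))
state = pack_state([[0.0] * 9, [0.0] * 9], [[[0.0] * 3] * 5, [[0.0] * 3] * 5])
```

The bundled config.yaml remains the reference for the actual ordering and preprocessing; treat this only as a shape check.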
Model Variants
| Model | Parameters | Description |
|---|---|---|
| EgoSteer-3B-Base (this repo) | 3B | Base EgoSteer model trained on 9.6k hours of egocentric human videos, ready for fine-tuning |
| EgoSteer-3B-RealMan | 3B | Generalist post-trained on real-world data collected on the RealMan robot |
Repository Contents
| File | Description |
|---|---|
| model_bf16.pt | bf16 model weights |
| config.yaml | Pretraining config, used to rebuild the network for evaluation and inference. Fine-tuning uses your own config and loads only the weights |
| normalizer-relative-10k-pretrain/normalizer.pkl | State/action normalizer computed on 9.6k hours of egocentric human videos in the relative action space |
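Before loading the checkpoint, it can help to confirm that a local copy of this repository contains the files listed above. The helper below is a hypothetical convenience, not part of the EgoSteer codebase; it only checks that the three paths from the table exist.

```python
from pathlib import Path

# Relative paths taken from the Repository Contents table.
EXPECTED_FILES = (
    "model_bf16.pt",
    "config.yaml",
    "normalizer-relative-10k-pretrain/normalizer.pkl",
)


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list means all three files are present; any returned path points to a file that still needs to be downloaded.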
⚠️ How to Use
This is a custom policy, loaded and run with the EgoSteer codebase using the bundled config.yaml to rebuild the network.
Pretrained Backbones
EgoSteer depends on two pretrained backbones, Qwen3-VL-2B-Instruct and DINOv3 ViT-L/16 (see Model Description); download them ahead of time.
Training Data
EgoSteer-3B-Base is pretrained on 9.6k hours of egocentric human videos, framed in a unified human-to-robot action space (wrist poses + fingertip keypoints). The EgoSteer-3B-RealMan variant is additionally post-trained on real-world demonstrations collected on the RealMan robot. See the paper and project page for details.
Intended Use & Limitations
- Intended use: research on vision-language-action models, world action models, and dexterous manipulation; a starting point for fine-tuning on your own embodiment.
- Limitations: the base checkpoint is not tuned to any single robot's control loop; real-robot deployment requires post-training and matching the observation/action format (camera setup, normalizer) the policy was trained with. Outputs should be validated for safety before execution on hardware.
Citation
@article{egosteer2026,
title = {EgoSteer: A Full-Stack System Towards Steerable Dexterous Manipulation from Egocentric Videos},
author = {EgoSteer Team},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026}
}
License
Released under the Apache 2.0 license.
Acknowledgements