{"id":"2512.13030","authors":[{"_id":"6940d53b65f1e24a11780552","name":"Hongzhe Bi","hidden":false},{"_id":"6940d53b65f1e24a11780553","name":"Hengkai Tan","hidden":false},{"_id":"6940d53b65f1e24a11780554","name":"Shenghao Xie","hidden":false},{"_id":"6940d53b65f1e24a11780555","name":"Zeyuan Wang","hidden":false},{"_id":"6940d53b65f1e24a11780556","name":"Shuhe Huang","hidden":false},{"_id":"6940d53b65f1e24a11780557","name":"Haitian Liu","hidden":false},{"_id":"6940d53b65f1e24a11780558","user":{"_id":"6522e4fbd89bc7773ddc4b58","avatarUrl":"/avatars/3e9b158af52c5f738a3eae72dcbb3824.svg","isPro":false,"fullname":"Ruowen Zhao","user":"zzzrw","type":"user","name":"zzzrw"},"name":"Ruowen Zhao","status":"claimed_verified","statusLastChangedAt":"2026-04-10T07:45:41.766Z","hidden":false},{"_id":"6940d53b65f1e24a11780559","name":"Yao Feng","hidden":false},{"_id":"6940d53b65f1e24a1178055a","name":"Chendong Xiang","hidden":false},{"_id":"6940d53b65f1e24a1178055b","name":"Yinze Rong","hidden":false},{"_id":"6940d53b65f1e24a1178055c","name":"Hongyan Zhao","hidden":false},{"_id":"6940d53b65f1e24a1178055d","name":"Hanyu Liu","hidden":false},{"_id":"6940d53b65f1e24a1178055e","name":"Zhizhong Su","hidden":false},{"_id":"6940d53b65f1e24a1178055f","name":"Lei Ma","hidden":false},{"_id":"6940d53b65f1e24a11780560","name":"Hang Su","hidden":false},{"_id":"6940d53b65f1e24a11780561","name":"Jun Zhu","hidden":false}],"publishedAt":"2025-12-15T06:58:40.000Z","title":"Motus: A Unified Latent Action World Model","summary":"While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. 
Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level \"delta action\" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (an improvement of +11% to +48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.","upvotes":0,"discussionId":"6940d53b65f1e24a11780562","ai_summary":"Motus, a unified latent action world model using a Mixture-of-Transformer architecture and UniDiffuser-style scheduler, achieves superior performance in robotic tasks by integrating understanding, video generation, and action capabilities.","ai_keywords":["Mixture-of-Transformer (MoT)","UniDiffuser-style scheduler","latent action","optical flow","three-phase training pipeline","six-layer data pyramid","delta action","large-scale action pretraining"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","linkedModels":[{"author":"motus-robotics","authorData":{"_id":"692ff4d7640854ac15116e66","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/692ff4d7640854ac15116e66/Vmf0oy3nr6UW8__W2fGIb.jpeg","fullname":"Motus 
Team","name":"motus-robotics","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"downloads":168,"gated":false,"id":"motus-robotics/Motus","availableInferenceProviders":[],"lastModified":"2025-12-16T03:43:54.000Z","likes":4,"pipeline_tag":"robotics","private":false,"repoType":"model","isLikedByUser":false},{"author":"motus-robotics","authorData":{"_id":"692ff4d7640854ac15116e66","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/692ff4d7640854ac15116e66/Vmf0oy3nr6UW8__W2fGIb.jpeg","fullname":"Motus Team","name":"motus-robotics","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"downloads":0,"gated":false,"id":"motus-robotics/Motus_Wan2_2_5B_pretrain","availableInferenceProviders":[],"lastModified":"2025-12-16T03:45:32.000Z","likes":2,"pipeline_tag":"text-to-video","private":false,"repoType":"model","isLikedByUser":false,"numParameters":4999787712},{"author":"motus-robotics","authorData":{"_id":"692ff4d7640854ac15116e66","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/692ff4d7640854ac15116e66/Vmf0oy3nr6UW8__W2fGIb.jpeg","fullname":"Motus Team","name":"motus-robotics","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"downloads":108,"gated":false,"id":"motus-robotics/Motus_robotwin2","availableInferenceProviders":[],"lastModified":"2025-12-19T18:00:16.000Z","likes":2,"pipeline_tag":"robotics","private":false,"repoType":"model","isLikedByUser":false}],"numTotalModels":3,"linkedDatasets":[],"numTotalDatasets":0,"linkedSpaces":[],"numTotalSpaces":0}