MoDE_CALVIN_ABCD / README.md
mbreuss's picture
Update README.md
334c524 verified
---
library_name: custom
tags:
- robotics
- diffusion
- mixture-of-experts
- multi-modal
license: mit
datasets:
- CALVIN
languages:
- en
pipeline_tag: robotics
base_model:
- mbreuss/MoDE_Pretrained
---
# MoDE (Mixture of Denoising Experts) Diffusion Policy
## Model Description
<div style="text-align: center">
<img src="MoDE_Figure_1.png" width="800px"/>
</div>
- [Github Link](https://github.com/intuitive-robots/MoDE_Diffusion_Policy)
- [Project Page](https://mbreuss.github.io/MoDE_Diffusion_Policy/)
This model implements a Mixture of Diffusion Experts architecture for robotic manipulation, combining transformer-based backbone with noise-only expert routing. For faster inference, we can precache the chosen expert for each timestep to reduce computation time.
The model has been pretrained on a subset of OXE for 300k steps and finetuned for downstream tasks on the CALVIN/LIBERO dataset.
## Model Details
### Architecture
- **Base Architecture**: MoDE with custom Mixture of Experts Transformer
- **Vision Encoder**: ResNet-50 with FiLM conditioning finetuned from ImageNet
- **EMA**: Enabled
- **Action Window Size**: 10
- **Sampling Steps**: 5 (optimal for performance)
- **Sampler Type**: DDIM
### Input/Output Specifications
#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings
#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions
## Usage
Check out our full model implementation on Github [MoDE_Diffusion_Policy](https://github.com/intuitive-robots/MoDE_Diffusion_Policy) and follow the instructions in the readme to test the model on one of the environments.
```python
obs = {
"rgb_obs": {
"rgb_static": static_image,
"rgb_gripper": gripper_image
}
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)
```
## Training Details
### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 0.0001
- **Weight Decay**: 0.05
## Citation
If you found the code usefull, please cite our work:
```bibtex
@misc{reuss2024efficient,
title={Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning},
author={Moritz Reuss and Jyothish Pari and Pulkit Agrawal and Rudolf Lioutikov},
year={2024},
eprint={2412.12953},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
## License
This model is released under the MIT license.