---
library_name: custom
tags:
- robotics
- diffusion
- mixture-of-experts
- multi-modal
license: mit
datasets:
- CALVIN
language:
- en
pipeline_tag: robotics
---
# MoDE (Mixture of Denoising Experts) Diffusion Policy

## Model Description

This model implements a Mixture of Denoising Experts (MoDE) architecture for robotic manipulation, combining a transformer backbone with noise-only expert routing. Because routing depends only on the noise level, the chosen expert for each denoising timestep can be precached to speed up inference (illustrated below).

The model has been pretrained on a subset of OXE for 300k steps and then finetuned for downstream tasks on the CALVIN/LIBERO benchmarks.
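
The snippet below is a minimal, hypothetical sketch of the noise-only routing idea (the class, layer sizes, and caching scheme are illustrative assumptions, not the actual MoDE implementation): because the router looks only at the noise-level embedding, the expert choice for each of the few sampling timesteps can be computed once and reused at inference time.

```python
import torch
import torch.nn as nn

class NoiseOnlyRouter(nn.Module):
    """Toy router: picks one expert based solely on the noise-level embedding."""

    def __init__(self, num_experts: int, noise_emb_dim: int = 64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(noise_emb_dim, 128), nn.GELU(), nn.Linear(128, num_experts)
        )

    def forward(self, noise_emb: torch.Tensor) -> torch.Tensor:
        # noise_emb: (B, noise_emb_dim) embedding of the diffusion noise level.
        return self.gate(noise_emb).argmax(dim=-1)  # (B,) expert indices

# Since routing ignores token content, the expert index for each of the few
# sampling timesteps can be cached once and reused for every inference call.
router = NoiseOnlyRouter(num_experts=4)
timestep_embs = torch.randn(5, 64)  # one noise-level embedding per sampling step (assumed)
expert_cache = {t: int(router(emb[None])) for t, emb in enumerate(timestep_embs)}
print(expert_cache)                 # e.g. {0: 2, 1: 0, ...}
```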

## Model Details

### Architecture
- **Base Architecture**: MoDE with custom Mixture of Experts Transformer
- **Vision Encoder**: ResNet-50 with FiLM conditioning, finetuned from ImageNet weights
- **EMA**: Enabled
- **Action Window Size**: 10
- **Sampling Steps**: 5 (optimal for performance)
- **Sampler Type**: DDIM (a 5-step sampling sketch follows this list)
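
For intuition, here is a minimal sketch of a 5-step deterministic DDIM-style sampling loop over an action chunk; `model.predict_noise` and the noise schedule are placeholders, not the actual MoDE sampler interface.

```python
import torch

@torch.no_grad()
def ddim_sample(model, obs, goal, steps=5, horizon=10, action_dim=7):
    """Sketch of deterministic DDIM sampling for an action-chunk diffusion policy.

    `model.predict_noise(x, t, obs, goal)` stands in for the network's noise
    prediction; it is not the actual MoDE interface.
    """
    # Toy alpha_bar schedule over `steps` coarse timesteps (assumed values).
    alpha_bar = torch.linspace(0.999, 0.01, steps)

    x = torch.randn(1, horizon, action_dim)          # start from Gaussian noise
    for i in reversed(range(steps)):
        eps = model.predict_noise(x, i, obs, goal)   # predicted noise
        x0 = (x - (1 - alpha_bar[i]).sqrt() * eps) / alpha_bar[i].sqrt()
        if i > 0:                                    # deterministic DDIM update
            x = alpha_bar[i - 1].sqrt() * x0 + (1 - alpha_bar[i - 1]).sqrt() * eps
        else:
            x = x0                                   # final denoised action chunk
    return x                                         # (1, horizon, action_dim)
```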

### Input/Output Specifications

#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings

#### Outputs
- Action Space: `(B, T, 7)` tensor of delta end-effector (EEF) actions (a common component split is sketched below)
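
For orientation only: in CALVIN-style setups the 7 action dimensions are commonly interpreted as a position delta, a rotation delta, and a gripper command. The split below reflects that convention, not something specified by this card.

```python
import torch

def split_delta_eef(action: torch.Tensor):
    """Split a (B, T, 7) action chunk into common CALVIN-style components (assumed layout)."""
    delta_xyz = action[..., 0:3]   # end-effector position delta
    delta_rot = action[..., 3:6]   # end-effector rotation delta (Euler angles)
    gripper = action[..., 6:7]     # gripper open/close command
    return delta_xyz, delta_rot, gripper
```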

## Usage

```python
import torch

# Image tensors follow the (B, T, 3, H, W) layout described above
# (here B=1, T=1 and 224x224 resolution purely as an illustration).
static_image = torch.zeros(1, 1, 3, 224, 224)
gripper_image = torch.zeros(1, 1, 3, 224, 224)

obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
goal = {"lang_text": "pick up the blue cube"}

# `model` is an instantiated MoDE policy; `step` returns a (B, T, 7)
# tensor of delta end-effector actions.
action = model.step(obs, goal)
```
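
Since the policy predicts a chunk of actions per call (action window size 10), a typical closed-loop rollout executes the chunk before querying the model again. The helper below is a hypothetical sketch; `env.get_obs` and `env.step` stand in for whatever simulator API you use.

```python
def rollout(model, env, instruction: str, max_chunks: int = 20):
    """Hypothetical closed-loop rollout with action chunking (env API is a placeholder)."""
    goal = {"lang_text": instruction}
    obs = env.get_obs()
    for _ in range(max_chunks):              # re-plan after each executed chunk
        actions = model.step(obs, goal)      # (1, 10, 7) chunk of delta EEF actions
        for t in range(actions.shape[1]):    # execute the chunk open-loop
            obs = env.step(actions[:, t])
```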

## Training Details

### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: `optimizer.learning_rate` (see the training config)
- **Weight Decay**: `optimizer.transformer_weight_decay` for transformer parameters (a parameter-group sketch follows this list)
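
The config keys above suggest a dedicated weight-decay value for the transformer parameters. The sketch below shows one way such an AdamW setup could look; the learning rate, decay value, and name-based parameter split are assumptions, not the values used for this checkpoint.

```python
import torch

def build_optimizer(model, lr=1e-4, transformer_weight_decay=0.05):
    """Sketch of AdamW with a dedicated weight-decay group for transformer weights.

    The hyperparameter values and the name-based split are illustrative only.
    """
    transformer_params, other_params = [], []
    for name, p in model.named_parameters():
        (transformer_params if "transformer" in name else other_params).append(p)
    return torch.optim.AdamW(
        [
            {"params": transformer_params, "weight_decay": transformer_weight_decay},
            {"params": other_params, "weight_decay": 0.0},
        ],
        lr=lr,
    )
```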

## License
This model is released under the MIT license.