---
license: mit
language:
- en
---

# Model Card for 3D Diffuser Actor


A robot manipulation policy that marries diffusion modeling with 3D scene representations.
3D Diffuser Actor is trained and evaluated in simulation on [RLBench](https://github.com/stepjam/RLBench) and [CALVIN](https://github.com/mees/calvin).
We release all code, checkpoints, and details involved in training these models.

## Model Details

The models released are the following:

| Benchmark | Embedding dimension | Diffusion timesteps |
|------|------|------|
| [RLBench (PerAct)](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_peract.pth) | 120 | 100 |
| [RLBench (GNFactor)](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_gnfactor.pth) | 120 | 100 |
| [CALVIN](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_calvin.pth) | 192 | 25 |
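
Each checkpoint can be inspected before use. A minimal sketch, assuming only that the files are standard PyTorch checkpoints (the internal key layout is repo-specific and should be verified):

```
import torch

# Load a released checkpoint on CPU; the filename matches the table above.
# NOTE: whether this is a bare state dict or a wrapping dict is an assumption
# here -- inspect the keys before calling load_state_dict().
ckpt = torch.load("diffuser_actor_peract.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:5])
```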

### Model Description



- **Developed by:** Katerina Group at CMU
- **Model type:** a diffusion model conditioned on 3D scene representations
- **License:** the code and models are released under the MIT License
- **Contact:** ngkanats@andrew.cmu.edu


### Model Sources


- **Project Page:** https://3d-diffuser-actor.github.io
- **Repository:** https://github.com/nickgkan/3d_diffuser_actor.git
- **Paper:** [Link]()

## Uses


### Input format
3D Diffuser Actor takes the following inputs (a dummy-tensor shape sketch follows the list):

1. `RGB observations`: a tensor of shape (batch_size, num_cameras, 3, H, W). The pixel values are in the range [0, 1].
2. `Point cloud observations`: a tensor of shape (batch_size, num_cameras, 3, H, W).
3. `Instruction encodings`: a tensor of shape (batch_size, max_instruction_length, C). In this code base, the embedding dimension `C` is set to 512.
4. `curr_gripper`: a tensor of shape (batch_size, history_length, 7), where the last dimension concatenates the xyz position (3D) and the quaternion rotation (4D).
5. `trajectory_mask`: a tensor of shape (batch_size, trajectory_length), used only to indicate the length of each trajectory. To predict keyposes, set its shape to (batch_size, 1).
6. `gt_trajectory`: a tensor of shape (batch_size, trajectory_length, 7), where the last dimension concatenates the xyz position (3D) and the quaternion rotation (4D). This input is used only during training.
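
A minimal shape sketch with random dummy tensors. The concrete sizes below (`num_cameras`, `H`, `W`, the history, trajectory, and instruction lengths) are illustrative assumptions, not requirements of the code base:

```
import torch

batch_size, num_cameras, H, W = 1, 4, 256, 256      # illustrative sizes
history_length, trajectory_length = 3, 20           # illustrative sizes
max_instruction_length, C = 16, 512                 # C = 512 in this code base

rgb_obs = torch.rand(batch_size, num_cameras, 3, H, W)          # RGB in [0, 1]
pcd_obs = torch.randn(batch_size, num_cameras, 3, H, W)         # per-pixel xyz
instruction = torch.randn(batch_size, max_instruction_length, C)
curr_gripper = torch.randn(batch_size, history_length, 7)       # xyz + quaternion
trajectory_mask = torch.full((batch_size, trajectory_length), False)
gt_trajectory = torch.randn(batch_size, trajectory_length, 7)   # training only
```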

### Output format
The model returns the diffusion loss when `run_inference=False`; when `run_inference=True`, it instead returns a pose trajectory of shape (batch_size, trajectory_length, 8).

### Usage 
For training, forward 3D Diffuser Actor with `run_inference=False`:
```
loss = model.forward(gt_trajectory,
                     trajectory_mask,
                     rgb_obs,
                     pcd_obs,
                     instruction,
                     curr_gripper,
                     run_inference=False)
```
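
A minimal training step built around this call; the optimizer choice and learning rate are illustrative assumptions, not the repo's training recipe:

```
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative

model.train()
optimizer.zero_grad()
loss = model.forward(gt_trajectory, trajectory_mask,
                     rgb_obs, pcd_obs, instruction, curr_gripper,
                     run_inference=False)
loss.backward()   # backpropagate the diffusion loss
optimizer.step()
```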

For evaluation, forward 3D Diffuser Actor with `run_inference=True`:
```
# The ground-truth trajectory is ignored at inference time; pass a float
# placeholder of the right shape (torch.full with an int fill would create
# an integer tensor, so use torch.zeros instead).
fake_gt_trajectory = torch.zeros(1, trajectory_length, 7, device=device)
trajectory_mask = torch.full((1, trajectory_length), False, device=device)
trajectory = model.forward(fake_gt_trajectory,
                           trajectory_mask,
                           rgb_obs,
                           pcd_obs,
                           instruction,
                           curr_gripper,
                           run_inference=True)
```

Alternatively, you can call the model's `compute_trajectory` function directly:
```
trajectory_mask = torch.full((1, trajectory_length), False, device=device)
trajectory = model.compute_trajectory(trajectory_mask,
                                      rgb_obs,
                                      pcd_obs,
                                      instruction,
                                      curr_gripper)
```
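
The predicted trajectory has 8 channels per step versus 7 in the inputs. A hedged unpacking sketch; the layout below (xyz + quaternion + one gripper channel) is an assumption to verify against the repository:

```
# trajectory: (batch_size, trajectory_length, 8)
positions = trajectory[..., :3]     # xyz position
quaternions = trajectory[..., 3:7]  # rotation as a quaternion
gripper = trajectory[..., 7:]       # ASSUMED gripper open/close channel
```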


## Evaluation

Our model is trained and evaluated on RLBench with the PerAct setup (success rate, %):

| RLBench (PerAct) | 3D Diffuser Actor | [RVT](https://github.com/NVlabs/RVT) |
| --------------------------------- | -------- | -------- |
| average | 81.3 | 62.9 |
| open drawer | 89.6 | 71.2 |
| slide block | 97.6 | 81.6 |
| sweep to dustpan | 84.0 | 72.0 |
| meat off grill | 96.8 | 88.0 |
| turn tap | 99.2 | 93.6 |
| put in drawer | 96.0 | 88.0 |
| close jar | 96.0 | 52.0 |
| drag stick | 100.0 | 99.2 |
| stack blocks | 68.3 | 28.8 |
| screw bulbs | 82.4 | 48.0 |
| put in safe | 97.6 | 91.2 |
| place wine | 93.6 | 91.0 |
| put in cupboard | 85.6 | 49.6 |
| sort shape | 44.0 | 36.0 |
| push buttons | 98.4 | 100.0 |
| insert peg | 65.6 | 11.2 |
| stack cups | 47.2 | 26.4 |
| place cups | 24.0 | 4.0 |


Our model is trained and evaluated on RLBench with the GNFactor setup (success rate, %):

| RLBench (GNFactor) | 3D Diffuser Actor | [GNFactor](https://github.com/YanjieZe/GNFactor) |
| --------------------------------- | -------- | -------- |
| average | 78.4 | 31.7 |
| open drawer | 89.3 | 76.0 |
| sweep to dustpan | 94.7 | 25.0 |
| close jar | 82.7 | 25.3 |
| meat off grill | 88.0 | 57.3 |
| turn tap | 80.0 | 50.7 |
| slide block | 92.0 | 20.0 |
| put in drawer | 77.3 | 0.0 |
| drag stick | 98.7 | 37.3 |
| push buttons | 69.3 | 18.7 |
| stack blocks | 12.0 | 4.0 |

Our model is trained and evaluated on CALVIN (trained on environments A, B, and C; tested on D). Each entry is the success rate (%) of completing the first k tasks in an instruction chain:

| CALVIN (ABC→D) | 3D Diffuser Actor | [GR-1](https://gr1-manipulation.github.io/) | [SuSIE](https://rail-berkeley.github.io/susie/) |
| --------------------------------- | -------- | -------- | -------- |
| task 1 | 92.2 | 85.4 | 87.0 |
| task 2 | 78.7 | 71.2 | 69.0 |
| task 3 | 63.9 | 59.6 | 49.0 |
| task 4 | 51.2 | 49.7 | 38.0 |
| task 5 | 41.2 | 40.1 | 26.0 |



## Citation


**BibTeX:**

```
@article{ke2024_3d_diffuser_actor,
  title={Action Diffusion with 3D Scene Representations},
  author={Ke, Tsung-Wei and Gkanatsios, Nikolaos and Fragkiadaki, Katerina},
  journal={Preprint},
  year={2024}
}
```



## Model Card Contact

For errors in this model card, contact Nikos or Tsung-Wei, {ngkanats, tsungwek} at andrew dot cmu dot edu.