---
title: Stable Text To Motion Framework
emoji: πŸŒ–
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
license: mit
---

# SATO: Stable Text-to-Motion Framework

[Wenshuo Chen*](https://github.com/shurdy123), [Hongru Xiao*](https://github.com/Hongru0306), [Erhang Zhang*](https://github.com/zhangerhang), [Lijie Hu](https://sites.google.com/view/lijiehu/homepage), [Lei Wang](https://leiwangr.github.io/), [Mengyuan Liu](https://www.semanticscholar.org/author/Mengyuan-Liu/47842072), [Chen Chen](https://www.crcv.ucf.edu/chenchen/)

[![Website shields.io](https://img.shields.io/website?url=http%3A//poco.is.tue.mpg.de)](https://sato-team.github.io/Stable-Text-to-Motion-Framework/) [![YouTube Badge](https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube)](https://youtu.be/qqGhV3Flmus)  [![arXiv](https://img.shields.io/badge/arXiv-2308.12965-00ff00.svg)]()  
## Existing Challenges
A fundamental challenge in text-to-motion tasks stems from the variability of textual inputs. Even when conveying the same meaning and intention, texts can vary considerably in vocabulary and structure due to individual user preferences or linguistic nuances. Despite the considerable advancements in existing text-to-motion models, we find a notable weakness: all of them produce unstable predictions when encountering minor textual perturbations, such as synonym substitutions. In the following demonstration, we showcase the instability of predictions generated by previous methods when presented with different user inputs conveying identical semantic meaning.
<!-- <div style="display:flex;">
    <img src="assets/run_lola.gif" width="45%" style="margin-right: 1%;">
    <img src="assets/yt_solo.gif" width="45%">
</div> -->

<p align="center">
<table align="center">
  <tr>
    <th colspan="4">Original text: A man kicks something or someone with his left leg.</th>
  </tr>
  <tr>
    <th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
    <th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
    <th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
  </tr>

  <tr>
    <td width="250" align="center"><img src="images/example/kick/gpt.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/example/kick/mdm.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/example/kick/momask.gif" width="150px" height="150px" alt="gif"></td>
  </tr>

  <tr>
   <th colspan="4">Perturbed text: A human boots something or someone with his left leg.</th>

  </tr>
  <tr>
    <th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
    <th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
    <th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
  </tr>

  <tr>
    <td width="250" align="center"><img src="images/example/boot/gpt.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/example/boot/mdm.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/example/boot/momask.gif" width="150px" height="150px" alt="gif"></td>
  </tr>
</table>
</p>

## Motivation
![motivation](images/motivation.png)
The model's inconsistent outputs are accompanied by unstable attention patterns. We further elucidate the experimental findings above: when perturbed text is given as input, the model exhibits unstable attention, often neglecting text elements critical for accurate motion prediction. This instability makes it harder to encode the text into consistent embeddings, leading to a cascade of errors in temporal motion generation.
## Our Approach
<p align="center">
  <img src="images/framework.png" alt="Approach Image">
</p>

**Attention Stability**. For the original text input, we can directly observe the model's attention vector over the text. This attention vector reflects the model's ranking of the words, indicating the importance of each word to the text encoder's prediction. A stable attention vector should maintain a consistent ranking even after perturbation.
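
As a concrete illustration of what we mean by ranking consistency, the short sketch below compares the attention rankings for the original and perturbed text with a Spearman rank correlation. The attention values and the use of `scipy.stats.spearmanr` are illustrative assumptions, not part of the SATO codebase.

```python
# Illustrative sketch only (not SATO code): measuring whether the text
# encoder's attention ranking stays consistent after a perturbation.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-token attention weights for the same sentence before
# and after a synonym substitution (values are made up).
attn_original  = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
attn_perturbed = np.array([0.06, 0.38, 0.12, 0.29, 0.15])

# Spearman correlation compares the rankings of the two vectors:
# a value close to 1.0 means the relative importance of words is preserved.
rank_consistency, _ = spearmanr(attn_original, attn_perturbed)
print(f"Attention rank consistency: {rank_consistency:.3f}")
```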

**Prediction Robustness**. Even with stable attention, we still cannot guarantee stable results, because perturbations also change the text embeddings. This requires us to impose further restrictions on the model's predictions: in the face of perturbations, the model's output distribution should remain consistent with the original one, i.e., the output should be robust to perturbations.
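
As a sketch of how such a constraint could be written, assuming the generator produces logits over a discrete motion codebook (the function name, codebook size, and the choice of a KL term are illustrative assumptions, not the exact SATO loss):

```python
# Illustrative sketch only (not the official SATO loss).
import torch
import torch.nn.functional as F

def robustness_loss(logits_original: torch.Tensor,
                    logits_perturbed: torch.Tensor) -> torch.Tensor:
    """KL(original || perturbed) between predicted motion-token distributions."""
    p_log = F.log_softmax(logits_original, dim=-1)   # distribution for the clean text
    q_log = F.log_softmax(logits_perturbed, dim=-1)  # distribution for the perturbed text
    # F.kl_div takes log-probabilities as input; with log_target=True the
    # target is also given in log space.
    return F.kl_div(q_log, p_log, reduction="batchmean", log_target=True)

# Toy usage with random logits standing in for a hypothetical 512-way codebook.
logits_a = torch.randn(4, 512)
logits_b = logits_a + 0.1 * torch.randn(4, 512)
print(robustness_loss(logits_a, logits_b))
```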

**Balancing the Accuracy and Robustness Trade-off**. Accuracy and robustness are naturally in tension. Our objective is to bolster stability while minimizing the decline in accuracy, thereby mitigating catastrophic errors arising from input perturbations. Consequently, we require a mechanism that upholds the model's performance on the original input.
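
One simple way to express this balance is to keep the task loss on the original input and add weighted stability terms. The weights and names below are purely illustrative, not the paper's actual objective:

```python
# Illustrative sketch only: combining accuracy and stability terms.
import torch

def total_loss(task_loss: torch.Tensor,
               attention_stability_loss: torch.Tensor,
               prediction_robustness_loss: torch.Tensor,
               lambda_attn: float = 0.1,
               lambda_pred: float = 0.1) -> torch.Tensor:
    # The task loss on the original input anchors accuracy; the two stability
    # terms are weighted down so robustness is gained without a large accuracy drop.
    return (task_loss
            + lambda_attn * attention_stability_loss
            + lambda_pred * prediction_robustness_loss)
```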
## Quantitative Evaluation on HumanML3D and KIT-ML
![eval](images/table.png)
## Visualization
<p align="center">
<table align="center">
  <tr>
    <th colspan="4">Original text: person is walking normally in a circle.</th>
  </tr>
  <tr>
    <th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
    <th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
    <th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
    <th align="center"><nobr>SATO</nobr> </th>
  </tr>

  <tr>
    <td width="250" align="center"><img src="images/visualization/circle/gpt.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/circle/mdm.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/circle/momask.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/circle/sato.gif" width="150px" height="150px" alt="gif"></td>
  </tr>

  <tr>
    <th colspan="4">Perturbed text: <span style="color: red;">human</span> is walking <span style="color: red;">usually</span> in a <span style="color: red;">loop</span>.</th>
  </tr>
  <tr>
    <th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
    <th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
    <th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
    <th align="center"><nobr>SATO</nobr> </th>
  </tr>

  <tr>
    <td width="250" align="center"><img src="images/visualization/loop/gpt.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/loop/mdm.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/loop/momask.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/loop/sato.gif" width="150px" height="150px" alt="gif"></td>
  </tr>
</table>
</p>
<center>
  <h3>
    <p style="color: blue;">Explanation: T2M-GPT, MDM, and MoMask all fail to walk in a loop.</p>
  </h3>
</center>

<p align="center">
<table align="center">
  <tr>
    <th colspan="4">Original text: a person uses his right arm to help himself to stand up.</th>
  </tr>
  <tr>
    <th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
    <th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
    <th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
    <th align="center"><nobr>SATO</nobr> </th>
  </tr>

  <tr>
    <td width="250" align="center"><img src="images/visualization/use/gpt.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/use/mdm.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/use/momask.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/use/sato.gif" width="150px" height="150px" alt="gif"></td>
  </tr>

  <tr>
    <th colspan="4" >Perturbed text: A human <span style="color: red;">utilizes</span> his right arm to help himself to stand up.</th>
  </tr>
  <tr>
    <th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
    <th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
    <th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
    <th align="center"><nobr>SATO</nobr> </th>
  </tr>

  <tr>
    <td width="250" align="center"><img src="images/visualization/utilize/gpt.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/utilize/mdm.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/utilize/momask.gif" width="150px" height="150px" alt="gif"></td>
    <td width="250" align="center"><img src="images/visualization/utilize/sato.gif" width="150px" height="150px" alt="gif"></td>
  </tr>
</table>
</p>
<center>
  <h3>
    <p style="color: blue;">Explanation: T2M-GPT, MDM, and MoMask all lack the transition from squatting to standing up, resulting in a catastrophic error.</p>
  </h3>
</center>

    
## How to Use the Code

* [1. Setup and Installation](#setup)

* [2. Dependencies](#Dependencies)

* [3. Quick Start](#quickstart)

* [4. Datasets](#datasets)

* [5. Train](#train)

* [6. Evaluation](#eval)

* [7. Acknowledgements](#acknowledgements)

    
## Setup and Installation <a name="setup"></a>

Clone the repository: 

```shell
git clone https://github.com/sato-team/Stable-Text-to-motion-Framework.git
```

Create a fresh conda environment and install all the dependencies:

```shell
conda env create -f environment.yml
conda activate SATO
```

The code was tested on Python 3.8 and PyTorch 1.8.1.

## Dependencies<a name="Dependencies"></a>

```shell
bash dataset/prepare/download_extractor.sh
bash dataset/prepare/download_glove.sh
```

## **Quick Start**<a name="quickstart"></a>

A quick reference guide for using our code is provided in `quickstart.ipynb`.

## Datasets<a name="datasets"></a>

We use two 3D human motion-language datasets: HumanML3D and KIT-ML. Details and download instructions for both can be found [here](https://github.com/EricGuo5513/HumanML3D).
We perturbed the input texts of both datasets; you can download the perturbed text dataset from this [link](https://drive.google.com/file/d/1XLvu2jfG1YKyujdANhYHV_NfFTyOJPvP/view?usp=sharing) (a toy sketch of this kind of substitution follows the directory layout below).
Taking HumanML3D as an example, the dataset structure should look like this:
```
./dataset/HumanML3D/
β”œβ”€β”€ new_joint_vecs/
β”œβ”€β”€ texts/ # You need to replace the 'texts' folder in the original dataset with the 'texts' folder from our dataset.
β”œβ”€β”€ Mean.npy 
β”œβ”€β”€ Std.npy 
β”œβ”€β”€ train.txt
β”œβ”€β”€ val.txt
β”œβ”€β”€ test.txt
β”œβ”€β”€ train_val.txt
└── all.txt
```
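
For intuition, the perturbations are synonym-level edits like those in the examples above (e.g., person → human, circle → loop). The snippet below is only a toy illustration of that idea; it is not the script used to build the released perturbed dataset.

```python
# Toy illustration only: perturbing a prompt via synonym substitution.
import random

SYNONYMS = {
    "person": ["human", "man"],
    "walking": ["strolling", "pacing"],
    "circle": ["loop", "ring"],
}

def perturb(text: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = []
    for word in text.split():
        # Replace a word only if our small synonym dictionary covers it.
        words.append(rng.choice(SYNONYMS[word]) if word in SYNONYMS else word)
    return " ".join(words)

print(perturb("person is walking normally in a circle"))
# prints a synonym-substituted variant, e.g. "human is strolling normally in a loop"
```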
## **Train**<a name="train"></a>

We will release the training code soon.

## **Evaluation**<a name="eval"></a>

You can download the pretrained models in this [link](https://drive.google.com/drive/folders/1rs8QPJ3UPzLW4H3vWAAX9hJn4ln7m_ts?usp=sharing). 

```shell
python eval_t2m.py --resume-pth pretrained/net_best_fid.pth --clip_path pretrained/clip_best_fid.pth
```

## Acknowledgements<a name="acknowledgements"></a>

We appreciate help from:

- Open-source code: [T2M-GPT](https://github.com/Mael-zys/T2M-GPT), [MoMask](https://github.com/EricGuo5513/momask-codes), [MDM](https://guytevet.github.io/mdm-page/), etc.
- [Hongru Xiao](https://github.com/Hongru0306), [Erhang Zhang](https://github.com/zhangerhang), [Lijie Hu](https://sites.google.com/view/lijiehu/homepage), [Lei Wang](https://leiwangr.github.io/), [Mengyuan Liu](), [Chen Chen](https://www.crcv.ucf.edu/chenchen/) for discussions and guidance throughout the project, which have been instrumental to our work.
- [Zhen Zhao](https://github.com/Zanebla) for project website.
- If you find our work helpful, we would appreciate it if you could give our project a star!
## Citing<a name="citing"></a>

If you find this code useful for your research, please consider citing the following paper:

```bibtex

```