chencws committed on
Commit 654e8b1
1 Parent(s): 70ce8fa

Update README.md

Files changed (1)
  1. README.md +14 -3
README.md CHANGED
@@ -1,8 +1,8 @@
 # SATO: Stable Text-to-Motion Framework
 
-[Wenshuo chen*](https://github.com/shurdy123), [Hongru Xiao*](https://github.com/Hongru0306), [Erhang Zhang*](https://github.com/zhangerhang), [Lijie Hu](https://sites.google.com/view/lijiehu/homepage), [Lei Wang](https://leiwangr.github.io/), [Mengyuan Liu](), [Chen Chen](https://www.crcv.ucf.edu/chenchen/)
+[Wenshuo chen*](https://github.com/shurdy123), [Hongru Xiao*](https://github.com/Hongru0306), [Erhang Zhang*](https://github.com/zhangerhang), [Lijie Hu](https://sites.google.com/view/lijiehu/homepage), [Lei Wang](https://leiwangr.github.io/), [Mengyuan Liu](https://www.semanticscholar.org/author/Mengyuan-Liu/47842072), [Chen Chen](https://www.crcv.ucf.edu/chenchen/)
 
-[![Website shields.io](https://img.shields.io/website?url=http%3A//poco.is.tue.mpg.de)](https://sato-team.github.io/Stable-Text-to-Motion-Framework/) [![YouTube Badge](https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube)]() [![arXiv](https://img.shields.io/badge/arXiv-2308.12965-00ff00.svg)]()
+[![Website shields.io](https://img.shields.io/website?url=http%3A//poco.is.tue.mpg.de)](https://sato-team.github.io/Stable-Text-to-Motion-Framework/) [![YouTube Badge](https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube)](https://youtu.be/qqGhV3Flmus) [![arXiv](https://img.shields.io/badge/arXiv-2308.12965-00ff00.svg)]()
 ## Existing Challenges
 A fundamental challenge inherent in text-to-motion tasks stems from the variability of textual inputs. Even when conveying similar or the same meanings and intentions, texts can exhibit considerable variations in vocabulary and structure due to individual user preferences or linguistic nuances. Despite the considerable advancements made in these models, we find a notable weakness: all of them demonstrate instability in prediction when encountering minor textual perturbations, such as synonym substitutions. In the following demonstration, we showcase the instability of predictions generated by the previous method when presented with different user inputs conveying identical semantic meaning.
 <!-- <div style="display:flex;">
@@ -28,7 +28,8 @@ A fundamental challenge inherent in text-to-motion tasks stems from the variabil
 </tr>
 
 <tr>
-<th colspan="4" >Perturbed text: A human boots something or someone with his left leg.</th>
+<th colspan="4">Perturbed text: A human boots something or someone with his left leg.</th>
+
 </tr>
 <tr>
 <th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
@@ -47,8 +48,18 @@ A fundamental challenge inherent in text-to-motion tasks stems from the variabil
 ## Motivation
 ![motivation](images/motivation.png)
 The model's inconsistent outputs are accompanied by unstable attention patterns. We further elucidate the aforementioned experimental findings: When perturbed text is inputted, the model exhibits unstable attention, often neglecting critical text elements necessary for accurate motion prediction. This instability further complicates the encoding of text into consistent embeddings, leading to a cascade of consecutive temporal motion generation errors.
+## Our Approach
+<p align="center">
+<img src="images/framework.png" alt="Approach Image">
+</p>
+
+**Attention Stability**. For the original text input, we can easily observe the model's attention vector for the text. This attention vector reflects the model's attentional ranking of the text, indicating the importance of each word to the text encoder's prediction. We hope a stable attention vector maintains a consistent ranking even after perturbations.
 
+**Prediction Robustness**. Even with stable attention, we still cannot achieve stable results due to the change in text embeddings when facing perturbations, even with similar attention vectors. This requires us to impose further restrictions on the model's predictions. Specifically, in the face of perturbations, the model's prediction should remain consistent with the original distribution, meaning the model's output should be robust to perturbations.
 
+**Balancing Accuracy and Robustness Trade-off**. Accuracy and robustness are naturally in a trade-off relationship. Our objective is to bolster stability while minimizing the decline in model accuracy, thereby mitigating catastrophic errors arising from input perturbations. Consequently, we require a mechanism to uphold the model's performance concerning the original input.
+## Quantitative evaluation on the HumanML3D and KIT-ML.
+![eval](images/table.png)
 ## Visualization
 <p align="center">
 <table align="center">
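The "Attention Stability" paragraph added by this commit asks that the attention vector keep a consistent word-importance ranking after perturbation. A minimal NumPy sketch of such a ranking-consistency check (the function name and example weights are illustrative, not code from the SATO repository):

```python
import numpy as np

def attention_rank_agreement(attn_orig, attn_pert):
    """Fraction of word pairs whose importance ordering is preserved
    after a textual perturbation (1.0 means the attention ranking is
    fully stable)."""
    a = np.asarray(attn_orig, dtype=float)
    b = np.asarray(attn_pert, dtype=float)
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum(1 for i, j in pairs
                if np.sign(a[i] - a[j]) == np.sign(b[i] - b[j]))
    return agree / len(pairs)

# Stable: weights shift slightly but the ranking is unchanged.
print(attention_rank_agreement([0.5, 0.3, 0.2], [0.45, 0.35, 0.2]))  # 1.0
# Unstable: the most-attended word changes after perturbation.
print(attention_rank_agreement([0.5, 0.3, 0.2], [0.2, 0.3, 0.5]))  # 0.0
```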
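"Prediction Robustness" asks the prediction on perturbed input to stay consistent with the original output distribution. One standard way to measure that closeness is a KL-divergence penalty; this is a hedged sketch of the idea, not the paper's exact loss:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions (eps avoids log 0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def robustness_penalty(pred_orig, pred_pert):
    """Penalty that is small when the perturbed-input prediction stays
    close to the original-input prediction distribution."""
    return kl_divergence(pred_orig, pred_pert)

# Matching distributions incur a near-zero penalty; diverging ones are punished.
print(robustness_penalty([0.5, 0.5], [0.5, 0.5]))
print(robustness_penalty([0.9, 0.1], [0.1, 0.9]))
```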
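The trade-off paragraph implies combining the accuracy objective with the two stability terms under tunable weights. A minimal sketch of such a weighted objective (`w_attn` and `w_robust` are hypothetical placeholders, not values from the paper):

```python
def combined_objective(task_loss, attn_loss, robust_loss,
                       w_attn=0.1, w_robust=0.1):
    """Weighted sum trading off accuracy (task_loss) against the two
    stability terms; larger weights favor robustness over accuracy."""
    return task_loss + w_attn * attn_loss + w_robust * robust_loss

# Raising the stability weights tightens robustness at some cost to accuracy.
print(combined_objective(1.0, 0.5, 0.5))
```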