zheyangqin committed
Commit 9347c75 • Parent(s): 4bb176b

add-readme
Browse files
- .gitattributes +10 -0
- assets/vader_method.png +3 -0
- assets/videos/1.gif +3 -0
- assets/videos/10.gif +3 -0
- assets/videos/11.gif +3 -0
- assets/videos/3.gif +3 -0
- assets/videos/4.gif +3 -0
- assets/videos/5.gif +3 -0
- assets/videos/7.gif +3 -0
- assets/videos/8.gif +3 -0
- assets/videos/9.gif +3 -0
- readme.md +35 -0
.gitattributes CHANGED
@@ -33,3 +33,13 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33    *.zip filter=lfs diff=lfs merge=lfs -text
34    *.zst filter=lfs diff=lfs merge=lfs -text
35    *tfevents* filter=lfs diff=lfs merge=lfs -text
36  + assets/vader_method.png filter=lfs diff=lfs merge=lfs -text
37  + assets/videos/1.gif filter=lfs diff=lfs merge=lfs -text
38  + assets/videos/10.gif filter=lfs diff=lfs merge=lfs -text
39  + assets/videos/11.gif filter=lfs diff=lfs merge=lfs -text
40  + assets/videos/3.gif filter=lfs diff=lfs merge=lfs -text
41  + assets/videos/4.gif filter=lfs diff=lfs merge=lfs -text
42  + assets/videos/5.gif filter=lfs diff=lfs merge=lfs -text
43  + assets/videos/7.gif filter=lfs diff=lfs merge=lfs -text
44  + assets/videos/8.gif filter=lfs diff=lfs merge=lfs -text
45  + assets/videos/9.gif filter=lfs diff=lfs merge=lfs -text
assets/vader_method.png ADDED (Git LFS)
assets/videos/1.gif ADDED (Git LFS)
assets/videos/10.gif ADDED (Git LFS)
assets/videos/11.gif ADDED (Git LFS)
assets/videos/3.gif ADDED (Git LFS)
assets/videos/4.gif ADDED (Git LFS)
assets/videos/5.gif ADDED (Git LFS)
assets/videos/7.gif ADDED (Git LFS)
assets/videos/8.gif ADDED (Git LFS)
assets/videos/9.gif ADDED (Git LFS)
readme.md ADDED
@@ -0,0 +1,35 @@
1  + <div align="center">
2  +
3  + <!-- TITLE -->
4  + # **Video Diffusion Alignment via Reward Gradient**
5  + ![VADER](assets/vader_method.png)
6  +
7  + [![arXiv](https://img.shields.io/badge/cs.LG-)]()
8  + [![Website](https://img.shields.io/badge/🌎-Website-blue.svg)](http://vader-vid.github.io)
9  + [![GitHub](https://img.shields.io/github/stars/mihirp1998/VADER?style=social)](https://github.com/mihirp1998/VADER)
10 + </div>
11 +
12 + This is the official implementation of our paper [Video Diffusion Alignment via Reward Gradient](https://vader-vid.github.io/) by
13 +
14 + Mihir Prabhudesai*, Russell Mendonca*, Zheyang Qin*, Katerina Fragkiadaki, Deepak Pathak.
15 +
16 +
17 + <!-- DESCRIPTION -->
18 + ## Abstract
19 + We have made significant progress towards building foundational video diffusion models. As these models are trained using large-scale unsupervised data, it has become crucial to adapt these models to specific downstream tasks, such as video-text alignment or ethical video generation. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we instead utilize pre-trained reward models that are learned via preferences on top of powerful discriminative models. These models contain dense gradient information with respect to generated RGB pixels, which is critical for learning efficiently in complex search spaces such as videos. We show that our approach enables alignment of video diffusion for aesthetic generation, similarity between text context and video, as well as long-horizon video generation that is 3X longer than the training sequence length. We show that our approach learns much more efficiently, in terms of reward queries and compute, than previous gradient-free approaches for video generation.
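[Editor's note] The mechanism the abstract describes, a reward that is differentiable with respect to the generated pixels so that its gradient can flow back into the generator's weights, can be illustrated with a toy, self-contained sketch. This is an illustration under stated assumptions, not the authors' implementation: `TinyVideoGenerator` and `toy_reward` are hypothetical stand-ins for the video diffusion sampler and the preference-trained reward model.

```python
# Toy sketch of reward-gradient alignment (editor's illustration, not VADER's code).
# A small network stands in for the video diffusion sampler, and mean frame
# brightness stands in for a learned reward model. The point: the reward is
# differentiable w.r.t. generated RGB pixels, so gradient ascent on the reward
# updates the generator directly, with no gradient-free search over samples.
import torch
import torch.nn as nn

class TinyVideoGenerator(nn.Module):
    """Stand-in for a video diffusion sampler: latent -> (T, C, H, W) video in [0, 1]."""
    def __init__(self, frames=8, size=16, latent_dim=32):
        super().__init__()
        self.shape = (frames, 3, size, size)
        self.net = nn.Sequential(
            nn.Linear(latent_dim, frames * 3 * size * size),
            nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z).view(z.shape[0], *self.shape)

def toy_reward(video):
    """Stand-in for a preference-trained reward model: mean brightness per video."""
    return video.mean(dim=(1, 2, 3, 4))

gen = TinyVideoGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

for step in range(200):
    z = torch.randn(4, 32)
    video = gen(z)                    # sampling kept on the autograd graph
    loss = -toy_reward(video).mean()  # minimize -reward = gradient ascent on reward
    opt.zero_grad()
    loss.backward()                   # dense pixel gradients flow into gen's weights
    opt.step()
```

In the paper's actual setting, the same gradient path runs backward through the denoising chain of a pretrained video diffusion model and a reward model such as an aesthetic or video-text score, rather than these stand-ins.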
20 +
21 + ## Demo
22 + |  |  |  |
23 + | --- | --- | --- |
24 + | <img src="assets/videos/8.gif"> | <img src="assets/videos/5.gif"> | <img src="assets/videos/7.gif"> |
25 + | <img src="assets/videos/10.gif"> | <img src="assets/videos/3.gif"> | <img src="assets/videos/4.gif"> |
26 + | <img src="assets/videos/9.gif"> | <img src="assets/videos/1.gif"> | <img src="assets/videos/11.gif"> |
+
## Citation
|
30 |
+
|
31 |
+
If you find this work useful in your research, please cite:
|
32 |
+
|
33 |
+
```bibtex
|
34 |
+
|
35 |
+
```
|