hkunzhe committed on
Commit
a5ca6af
1 Parent(s): 5546b5b

add README.md

Files changed (1)
  1. README.md +214 -3
README.md CHANGED
@@ -1,3 +1,214 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ # CogVideoX-Fun-V1.1-Reward-LoRAs
2
+ ## Introduction
3
+ We explore the Reward Backpropagation technique <sup>[1](#ref1) [2](#ref2)</sup> to optimize the videos generated by [CogVideoX-Fun-V1.1](https://github.com/aigc-apps/CogVideoX-Fun) for better alignment with human preferences.
4
+ We provide the following pre-trained models (i.e. LoRAs) along with [the training script](https://github.com/aigc-apps/CogVideoX-Fun/blob/main/scripts/train_reward_lora.py). You can use these LoRAs to enhance the corresponding base model as a plug-in or train your own reward LoRA.
5
+
6
+ For more details, please refer to our [GitHub repo](https://github.com/aigc-apps/CogVideoX-Fun).
7
+
8
+ | Name | Base Model | Reward Model | Hugging Face | Description |
9
+ |--|--|--|--|--|
10
+ | CogVideoX-Fun-V1.1-5b-InP-HPS2.1.safetensors | [CogVideoX-Fun-V1.1-5b](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP) | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-Reward-LoRAs/resolve/main/CogVideoX-Fun-V1.1-5b-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for CogVideoX-Fun-V1.1-5b-InP. It is trained with a batch size of 8 for 1,500 steps.|
11
+ | CogVideoX-Fun-V1.1-2b-InP-HPS2.1.safetensors | [CogVideoX-Fun-V1.1-2b](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-InP) | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-Reward-LoRAs/resolve/main/CogVideoX-Fun-V1.1-2b-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for CogVideoX-Fun-V1.1-2b-InP. It is trained with a batch size of 8 for 3,000 steps.|
12
+ | CogVideoX-Fun-V1.1-5b-InP-MPS.safetensors | [CogVideoX-Fun-V1.1-5b](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP) | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-Reward-LoRAs/resolve/main/CogVideoX-Fun-V1.1-5b-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for CogVideoX-Fun-V1.1-5b-InP. It is trained with a batch size of 8 for 5,500 steps.|
13
+ | CogVideoX-Fun-V1.1-2b-InP-MPS.safetensors | [CogVideoX-Fun-V1.1-2b](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-InP) | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-Reward-LoRAs/resolve/main/CogVideoX-Fun-V1.1-2b-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for CogVideoX-Fun-V1.1-2b-InP. It is trained with a batch size of 8 for 16,000 steps.|
14
+
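The Hugging Face links in the table above all follow one pattern: the repository's `resolve/main/` endpoint followed by the file name from the `Name` column. The following is a minimal sketch, using only the Python standard library, that rebuilds those URLs and optionally fetches a file. The helper names `lora_url` and `download_lora` and the `models/loras` directory are hypothetical and are not part of CogVideoX-Fun.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Repository that hosts the reward LoRAs (taken from the links in the table).
REPO_URL = "https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-Reward-LoRAs"


def lora_url(name: str) -> str:
    """Return the direct download URL for a LoRA file listed in the table."""
    return f"{REPO_URL}/resolve/main/{name}"


def download_lora(name: str, dest_dir: str = "models/loras") -> Path:
    """Download one LoRA file into dest_dir (a hypothetical location)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / name
    urlretrieve(lora_url(name), str(target))  # performs a network request
    return target
```

Calling `download_lora("CogVideoX-Fun-V1.1-5b-InP-HPS2.1.safetensors")` would then place the file where a local `lora_path` can point to it.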
15
+ ## Demo
16
+ ### CogVideoX-Fun-V1.1-5B
17
+
18
+ <table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
19
+ <thead>
20
+ <tr>
21
+ <th style="text-align: center;" width="10%">Prompt</th>
22
+ <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.1-5B</th>
23
+ <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.1-5B <br> HPSv2.1 Reward LoRA</th>
24
+ <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.1-5B <br> MPS Reward LoRA</th>
25
+ </tr>
26
+ </thead>
27
+ <tr>
28
+ <td>
29
+ Pig with wings flying above a diamond mountain
30
+ </td>
31
+ <td>
32
+ <video src="https://github.com/user-attachments/assets/6682f507-4ca2-45e9-9d76-86e2d709efb3" width="100%" controls autoplay loop></video>
33
+ </td>
34
+ <td>
35
+ <video src="https://github.com/user-attachments/assets/ec9219a2-96b3-44dd-b918-8176b2beb3b0" width="100%" controls autoplay loop></video>
36
+ </td>
37
+ <td>
38
+ <video src="https://github.com/user-attachments/assets/a75c6a6a-0b69-4448-afc0-fda3c7955ba0" width="100%" controls autoplay loop></video>
39
+ </td>
40
+ </tr>
41
+ <tr>
42
+ <td>
43
+ A dog runs through a field while a cat climbs a tree
44
+ </td>
45
+ <td>
46
+ <video src="https://github.com/user-attachments/assets/0392d632-2ec3-46b4-8867-0da1db577b6d" width="100%" controls autoplay loop></video>
47
+ </td>
48
+ <td>
49
+ <video src="https://github.com/user-attachments/assets/7d8c729d-6afb-408e-b812-67c40c3aaa96" width="100%" controls autoplay loop></video>
50
+ </td>
51
+ <td>
52
+ <video src="https://github.com/user-attachments/assets/dcd1343c-7435-4558-b602-9c0fa08cbd59" width="100%" controls autoplay loop></video>
53
+ </td>
54
+ </tr>
55
+ <tr>
56
+ <td>
57
+ Crystal cake shimmering beside a metal apple
58
+ </td>
59
+ <td>
60
+ <video src="https://github.com/user-attachments/assets/af0df8e0-1edb-4e2c-9a87-70df2b564aef" width="100%" controls autoplay loop></video>
61
+ </td>
62
+ <td>
63
+ <video src="https://github.com/user-attachments/assets/59b840f7-d33c-4972-8024-11a097f1c419" width="100%" controls autoplay loop></video>
64
+ </td>
65
+ <td>
66
+ <video src="https://github.com/user-attachments/assets/4a1d0af0-54e3-455c-9930-0789e2346fa0" width="100%" controls autoplay loop></video>
67
+ </td>
68
+ </tr>
69
+ <tr>
70
+ <td>
71
+ Elderly artist with a white beard painting on a white canvas
72
+ </td>
73
+ <td>
74
+ <video src="https://github.com/user-attachments/assets/99e44f9d-c770-48ce-8cc5-69fe36d757bc" width="100%" controls autoplay loop></video>
75
+ </td>
76
+ <td>
77
+ <video src="https://github.com/user-attachments/assets/9c106677-e4cb-4970-a1a2-a013fa6ce903" width="100%" controls autoplay loop></video>
78
+ </td>
79
+ <td>
80
+ <video src="https://github.com/user-attachments/assets/0a7b57ab-36a8-4fb6-bcfa-75e3878c55b7" width="100%" controls autoplay loop></video>
81
+ </td>
82
+ </tr>
83
+ </table>
84
+
85
+ ### CogVideoX-Fun-V1.1-2B
86
+
87
+ <table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
88
+ <thead>
89
+ <tr>
90
+ <th style="text-align: center;" width="10%">Prompt</th>
91
+ <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.1-2B</th>
92
+ <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.1-2B <br> HPSv2.1 Reward LoRA</th>
93
+ <th style="text-align: center;" width="30%">CogVideoX-Fun-V1.1-2B <br> MPS Reward LoRA</th>
94
+ </tr>
95
+ </thead>
96
+ <tr>
97
+ <td>
98
+ A blue car drives past a white picket fence on a sunny day
99
+ </td>
100
+ <td>
101
+ <video src="https://github.com/user-attachments/assets/274b0873-4fbd-4afa-94c0-22b23168f0a1" width="100%" controls autoplay loop></video>
102
+ </td>
103
+ <td>
104
+ <video src="https://github.com/user-attachments/assets/730f2ba3-4c54-44ce-ad5b-4eeca7ae844e" width="100%" controls autoplay loop></video>
105
+ </td>
106
+ <td>
107
+ <video src="https://github.com/user-attachments/assets/1b8eb777-0f17-46ef-9e7e-c8be7636e157" width="100%" controls autoplay loop></video>
108
+ </td>
109
+ </tr>
110
+ <tr>
111
+ <td>
112
+ Blue jay swooping near a red maple tree
113
+ </td>
114
+ <td>
115
+ <video src="https://github.com/user-attachments/assets/a14778d2-38ea-42c3-89a2-18164c48f3cf" width="100%" controls autoplay loop></video>
116
+ </td>
117
+ <td>
118
+ <video src="https://github.com/user-attachments/assets/90af433f-ab01-4341-9977-c675041d76d0" width="100%" controls autoplay loop></video>
119
+ </td>
120
+ <td>
121
+ <video src="https://github.com/user-attachments/assets/dafe8bf6-77ac-4934-8c9c-61c25088f80b" width="100%" controls autoplay loop></video>
122
+ </td>
123
+ </tr>
124
+ <tr>
125
+ <td>
126
+ Yellow curtains swaying near a blue sofa
127
+ </td>
128
+ <td>
129
+ <video src="https://github.com/user-attachments/assets/e8a445a4-781b-4b3f-899b-2cc24201f247" width="100%" controls autoplay loop></video>
130
+ </td>
131
+ <td>
132
+ <video src="https://github.com/user-attachments/assets/318cfb00-8bd1-407f-aaee-8d4220573b82" width="100%" controls autoplay loop></video>
133
+ </td>
134
+ <td>
135
+ <video src="https://github.com/user-attachments/assets/6b90e8a4-1754-42f4-b454-73510ed0701d" width="100%" controls autoplay loop></video>
136
+ </td>
137
+ </tr>
138
+ <tr>
139
+ <td>
140
+ White tractor plowing near a green farmhouse
141
+ </td>
142
+ <td>
143
+ <video src="https://github.com/user-attachments/assets/42d35282-e964-4c8b-aae9-a1592178493a" width="100%" controls autoplay loop></video>
144
+ </td>
145
+ <td>
146
+ <video src="https://github.com/user-attachments/assets/c9704bd4-d88d-41a1-8e5b-b7980df57a4a" width="100%" controls autoplay loop></video>
147
+ </td>
148
+ <td>
149
+ <video src="https://github.com/user-attachments/assets/7a785b34-4a5d-4491-9e03-c40cf953a1dc" width="100%" controls autoplay loop></video>
150
+ </td>
151
+ </tr>
152
+ </table>
153
+
154
+ > [!NOTE]
155
+ > The above test prompts are from <a href="https://github.com/Vchitect/VBench/tree/master/prompts">VBench</a>. All videos are generated with a LoRA weight of 0.7.
156
+
157
+ ## Quick Start
158
+ We provide simple inference code to run CogVideoX-Fun-V1.1-5b-InP with its HPS v2.1 reward LoRA.
159
+
160
+ ```python
161
+ import torch
162
+ from diffusers import CogVideoXDDIMScheduler
163
+
164
+ from cogvideox.models.transformer3d import CogVideoXTransformer3DModel
165
+ from cogvideox.pipeline.pipeline_cogvideox_inpaint import CogVideoX_Fun_Pipeline_Inpaint
166
+ from cogvideox.utils.lora_utils import merge_lora
167
+ from cogvideox.utils.utils import get_image_to_video_latent, save_videos_grid
168
+
169
+ model_path = "alibaba-pai/CogVideoX-Fun-V1.1-5b-InP"
170
+ lora_path = "alibaba-pai/CogVideoX-Fun-V1.1-Reward-LoRAs/CogVideoX-Fun-V1.1-5b-InP-HPS2.1.safetensors"
171
+ lora_weight = 0.7
172
+
173
+ prompt = "Pig with wings flying above a diamond mountain"
174
+ sample_size = [512, 512]
175
+ video_length = 49
176
+
177
+ transformer = CogVideoXTransformer3DModel.from_pretrained_2d(model_path, subfolder="transformer").to(torch.bfloat16)
178
+ scheduler = CogVideoXDDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
179
+ pipeline = CogVideoX_Fun_Pipeline_Inpaint.from_pretrained(
180
+ model_path, transformer=transformer, scheduler=scheduler, torch_dtype=torch.bfloat16
181
+ )
182
+ pipeline.enable_model_cpu_offload()
183
+ pipeline = merge_lora(pipeline, lora_path, lora_weight)
184
+
185
+ generator = torch.Generator(device="cuda").manual_seed(42)
186
+ input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
187
+ sample = pipeline(
188
+ prompt,
189
+ num_frames = video_length,
190
+ negative_prompt = "bad detailed",
191
+ height = sample_size[0],
192
+ width = sample_size[1],
193
+ generator = generator,
194
+ guidance_scale = 7.0,
195
+ num_inference_steps = 50,
196
+ video = input_video,
197
+ mask_video = input_video_mask,
198
+ ).videos
199
+
200
+ save_videos_grid(sample, "samples/output.mp4", fps=8)
201
+ ```
202
+
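To give a rough sense of what merging a LoRA with `lora_weight = 0.7` means numerically, here is a sketch of the common LoRA merge convention, in which the low-rank update is scaled by `network_alpha / rank` and then by the user-chosen weight. This is an assumption about the general technique; the `merge_lora` function in the repository may differ in its details. The function names below are hypothetical, and plain nested lists stand in for weight tensors.

```python
def effective_scale(rank: int, network_alpha: float, lora_weight: float) -> float:
    """Scale applied to the low-rank update under the common alpha/rank convention."""
    return lora_weight * network_alpha / rank


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_into_weight(weight, lora_up, lora_down, rank, network_alpha, lora_weight):
    """Return weight + scale * (lora_up @ lora_down) without modifying the inputs."""
    scale = effective_scale(rank, network_alpha, lora_weight)
    delta = matmul(lora_up, lora_down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# With the values quoted in this README (rank=128, network_alpha=64, weight 0.7),
# the low-rank update would be scaled by 0.7 * 64 / 128 under this convention.
scale = effective_scale(rank=128, network_alpha=64, lora_weight=0.7)
```

Under this convention, lowering `lora_weight` shrinks the update linearly, which is one way to trade reward alignment against fidelity to the base model.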
203
+ ## Limitations
204
+ 1. We observe that beyond a certain point in training, the reward continues to increase, but the quality of the generated videos does not improve further.
205
+ The model learns shortcuts (adding artifacts in the background, i.e., adversarial patches) to increase the reward.
206
+ 2. Currently, there is still a lack of suitable preference models for video generation. Directly using image preference models cannot
207
+ evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models leads to a decrease
208
+ in the dynamism of generated videos. Although this can be mitigated by computing the reward using only the first frame of the decoded video, the impact still persists.
209
+
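The first-frame mitigation mentioned in the second limitation can be illustrated with a small sketch. It contrasts scoring only the first decoded frame with averaging an image reward over all frames. Here `reward_fn` is a hypothetical stand-in for an image preference model such as HPS v2.1 or MPS; in actual training it would be a differentiable model operating on tensors, and the training script may implement this differently.

```python
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")


def first_frame_reward(frames: Sequence[Frame], reward_fn: Callable[[Frame], float]) -> float:
    """Score only the first decoded frame with an image reward function."""
    if not frames:
        raise ValueError("frames must contain at least one frame")
    return reward_fn(frames[0])


def mean_frame_reward(frames: Sequence[Frame], reward_fn: Callable[[Frame], float]) -> float:
    """Score every decoded frame and average the results."""
    if not frames:
        raise ValueError("frames must contain at least one frame")
    return sum(reward_fn(frame) for frame in frames) / len(frames)
```

Because the first-frame variant never scores later frames directly, the image reward puts no direct pressure on them, which is consistent with the observation that it mitigates, but does not remove, the loss of dynamism.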
210
+ ## Reference
211
+ <ol>
212
+ <li id="ref1">Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." In ICLR 2024.</li>
213
+ <li id="ref2">Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).</li>
214
+ </ol>