AricGamma committed
Commit 1bc351a · verified · 1 Parent(s): 00bd927

docs: update README

Files changed (1)
  1. README.md +255 -2
README.md CHANGED
@@ -1,6 +1,259 @@
 ---
 license: mit
 ---
-# Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
-
-## Comming Soon...
+ <h1 align='center'>Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks</h1>
+
+ <div align='center'>
+ <a href='https://github.com/cuijh26' target='_blank'>Jiahao Cui</a><sup>1</sup>&emsp;
+ <a href='https://github.com/crystallee-ai' target='_blank'>Hui Li</a><sup>1</sup>&emsp;
+ <a href='https://github.com/subazinga' target='_blank'>Yun Zhan</a><sup>1</sup>&emsp;
+ <a href='https://github.com/NinoNeumann' target='_blank'>Hanlin Shang</a><sup>1</sup>&emsp;
+ <a href='https://github.com/Kaihui-Cheng' target='_blank'>Kaihui Cheng</a><sup>1</sup>&emsp;
+ <a href='https://github.com/mayuqi7777' target='_blank'>Yuqi Ma</a><sup>1</sup>&emsp;
+ <a href='https://github.com/AricGamma' target='_blank'>Shan Mu</a><sup>1</sup>&emsp;
+ </div>
+ <div align='center'>
+ <a href='https://hangz-nju-cuhk.github.io/' target='_blank'>Hang Zhou</a><sup>2</sup>&emsp;
+ <a href='https://jingdongwang2017.github.io/' target='_blank'>Jingdong Wang</a><sup>2</sup>&emsp;
+ <a href='https://sites.google.com/site/zhusiyucs/home' target='_blank'>Siyu Zhu</a><sup>1✉️</sup>&emsp;
+ </div>
+
+ <div align='center'>
+ <sup>1</sup>Fudan University&emsp; <sup>2</sup>Baidu Inc&emsp;
+ </div>
+
+ <br>
+ <div align='center'>
+ <a href='https://github.com/fudan-generative-vision/hallo3'><img src='https://img.shields.io/github/stars/fudan-generative-vision/hallo3?style=social'></a>
+ </div>
+ <br>
+
+ ## 📸 Showcase
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td>
+ <video src="https://github.com/user-attachments/assets/3fc44086-bdbf-4a54-bfe3-62cfd9dfb191" width="100%" controls autoplay loop></video>
+ </td>
+ <td>
+ <video src="https://github.com/user-attachments/assets/ad5a87cf-b50e-48d6-af35-774e3b1713e7" width="100%" controls autoplay loop></video>
+ </td>
+ <td>
+ <video src="https://github.com/user-attachments/assets/78c7acc3-4fa2-447e-b77d-3462d411c81c" width="100%" controls autoplay loop></video>
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <video src="https://github.com/user-attachments/assets/f62f2b6d-9846-40be-a976-56cc7d5a8a5b" width="100%" controls autoplay loop></video>
+ </td>
+ <td>
+ <video src="https://github.com/user-attachments/assets/42b6968e-c68a-4473-b773-406ccf5d90b1" width="100%" controls autoplay loop></video>
+ </td>
+ <td>
+ <video src="https://github.com/user-attachments/assets/015f1d6d-31a8-4454-b51a-5431d3c953c2" width="100%" controls autoplay loop></video>
+ </td>
+ </tr>
+ </table>
+
+ Visit our [project page](https://fudan-generative-vision.github.io/hallo3/#/) to view more cases.
+
+ ## ⚙️ Installation
+
+ - System requirements: Ubuntu 20.04 / Ubuntu 22.04, CUDA 12.1
+ - Tested GPUs: H100 (a quick environment check is sketched below)
+
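+ Before installing anything, you can confirm that the GPU driver and CUDA toolkit are visible. This is only a sanity-check sketch; it assumes `nvidia-smi` and `nvcc` are already on your `PATH`:
+
+ ```bash
+ # Confirm the GPU and the CUDA toolkit version (the tested setup uses CUDA 12.1)
+ nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
+ nvcc --version | grep release
+ ```
+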
+ Download the code:
+
+ ```bash
+ git clone https://github.com/fudan-generative-vision/hallo3
+ cd hallo3
+ ```
+
+ Create a conda environment:
+
+ ```bash
+ conda create -n hallo python=3.10
+ conda activate hallo
+ ```
+
+ Install the Python packages with `pip`:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
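+ After the requirements are installed, a quick check (this sketch assumes PyTorch is pulled in by `requirements.txt`) confirms that the GPU is visible from Python:
+
+ ```bash
+ # Should print True followed by the detected GPU name
+ python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
+ ```
+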
+ In addition, `ffmpeg` is required:
+
+ ```bash
+ apt-get install ffmpeg
+ ```
+
+ ### 📥 Download Pretrained Models
+
+ You can easily get all pretrained models required for inference from our [Hugging Face repo](https://huggingface.co/fudan-generative-ai/hallo3).
+
+ Use `huggingface-cli` to download the models:
+
+ ```shell
+ cd $ProjectRootDir
+ pip install "huggingface_hub[cli]"
+ huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models
+ ```
+
+ Or you can download them separately from their source repos (a per-component download sketch follows this list):
+
+ - [hallo3](https://huggingface.co/fudan-generative-ai/hallo3/tree/main/hallo3): Our checkpoints.
+ - [CogVideoX](https://github.com/THUDM/CogVideo): the CogVideoX-5B I2V pretrained model, consisting of the transformer and the 3D VAE.
+ - [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl): text encoder; you can download it from [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder) and [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer).
+ - [audio_separator](https://huggingface.co/huangjackson/Kim_Vocal_2): Kim_Vocal_2 MDX-Net vocal removal model.
+ - [wav2vec](https://huggingface.co/facebook/wav2vec2-base-960h): WAV audio to vector model from [Facebook](https://huggingface.co/facebook/wav2vec2-base-960h).
+ - [insightface](https://github.com/deepinsight/insightface/tree/master/python-package#model-zoo): 2D and 3D face analysis models, placed into `pretrained_models/face_analysis/models/`. (_Thanks to deepinsight_)
+ - [face landmarker](https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task): face detection & mesh model from [mediapipe](https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker#models), placed into `pretrained_models/face_analysis/models`.
+
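+ If you prefer fetching components one at a time, a sketch like the following may help. The repository IDs come from the links above, but the target paths (and any folder flattening needed to match the tree below) are assumptions you may need to adjust:
+
+ ```bash
+ # Sketch only: adjust the target paths so they match the layout shown below.
+ pip install -U "huggingface_hub[cli]"
+
+ # wav2vec2 audio encoder
+ huggingface-cli download facebook/wav2vec2-base-960h \
+   --local-dir ./pretrained_models/wav2vec/wav2vec2-base-960h
+
+ # T5 text encoder and tokenizer (taken from the CogVideoX-2b repo); move the
+ # downloaded files up one level if you want the flat t5-v1_1-xxl/ layout below.
+ huggingface-cli download THUDM/CogVideoX-2b \
+   --include "text_encoder/*" "tokenizer/*" \
+   --local-dir ./pretrained_models/t5-v1_1-xxl
+ ```
+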
+ Finally, these pretrained models should be organized as follows:
+
+ ```text
+ ./pretrained_models/
+ |-- audio_separator/
+ |   |-- download_checks.json
+ |   |-- mdx_model_data.json
+ |   |-- vr_model_data.json
+ |   `-- Kim_Vocal_2.onnx
+ |-- cogvideox-5b-i2v-sat/
+ |   |-- transformer/
+ |   |   |-- 1/
+ |   |   |   `-- mp_rank_00_model_states.pt
+ |   |   `-- latest
+ |   `-- vae/
+ |       `-- 3d-vae.pt
+ |-- face_analysis/
+ |   `-- models/
+ |       |-- face_landmarker_v2_with_blendshapes.task # face landmarker model from mediapipe
+ |       |-- 1k3d68.onnx
+ |       |-- 2d106det.onnx
+ |       |-- genderage.onnx
+ |       |-- glintr100.onnx
+ |       `-- scrfd_10g_bnkps.onnx
+ |-- hallo3/
+ |   |-- 1/
+ |   |   `-- mp_rank_00_model_states.pt
+ |   `-- latest
+ |-- t5-v1_1-xxl/
+ |   |-- added_tokens.json
+ |   |-- config.json
+ |   |-- model-00001-of-00002.safetensors
+ |   |-- model-00002-of-00002.safetensors
+ |   |-- model.safetensors.index.json
+ |   |-- special_tokens_map.json
+ |   |-- spiece.model
+ |   `-- tokenizer_config.json
+ `-- wav2vec/
+     `-- wav2vec2-base-960h/
+         |-- config.json
+         |-- feature_extractor_config.json
+         |-- model.safetensors
+         |-- preprocessor_config.json
+         |-- special_tokens_map.json
+         |-- tokenizer_config.json
+         `-- vocab.json
+ ```
+
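+ To catch a misplaced checkpoint early, you can spot-check a few key files from the layout above (the paths below are copied from the tree; adjust them if your copies live elsewhere):
+
+ ```bash
+ # Spot-check a few critical files from the expected layout
+ for f in \
+   pretrained_models/hallo3/1/mp_rank_00_model_states.pt \
+   pretrained_models/cogvideox-5b-i2v-sat/vae/3d-vae.pt \
+   pretrained_models/t5-v1_1-xxl/model.safetensors.index.json \
+   pretrained_models/wav2vec/wav2vec2-base-960h/model.safetensors \
+   pretrained_models/audio_separator/Kim_Vocal_2.onnx; do
+   [ -f "$f" ] && echo "OK       $f" || echo "MISSING  $f"
+ done
+ ```
+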
+ ### 🛠️ Prepare Inference Data
+
+ Hallo3 has a few simple requirements for the inference input data (a conversion sketch follows this list):
+
+ 1. The reference image must have a 1:1 or 3:2 aspect ratio.
+ 2. The driving audio must be in WAV format.
+ 3. The audio must be in English, since our training datasets only contain this language.
+ 4. Ensure the audio's vocals are clear; background music is acceptable.
+
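+ If your driving audio is not already a WAV file, `ffmpeg` can convert it. The mono 16 kHz settings below are an assumption that matches common speech front-ends, not a documented requirement:
+
+ ```bash
+ # Convert any input audio to WAV (mono, 16 kHz assumed; adjust if needed)
+ ffmpeg -i input.mp3 -ac 1 -ar 16000 driving_audio.wav
+ ```
+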
+ ### 🎮 Run Inference
+
+ Simply run `scripts/inference_long_batch.sh`:
+
+ ```bash
+ bash scripts/inference_long_batch.sh ./examples/inference/input.txt ./output
+ ```
+
+ Animation results will be saved to `./output`. You can find more inference examples in the [examples folder](https://github.com/fudan-generative-vision/hallo3/tree/main/examples).
+
+ ## Training
+
+ #### Prepare data for training
+
+ Organize your raw videos into the following directory structure:
+
+ ```text
+ dataset_name/
+ |-- videos/
+ |   |-- 0001.mp4
+ |   |-- 0002.mp4
+ |   `-- 0003.mp4
+ |-- caption/
+ |   |-- 0001.txt
+ |   |-- 0002.txt
+ |   `-- 0003.txt
+ ```
+
+ You can use any `dataset_name`, but make sure the `videos` and `caption` directories are named as shown above.
+
+ Next, process the videos with the following command; a concrete invocation is sketched below:
+
+ ```bash
+ bash scripts/data_preprocess.sh {dataset_name} {parallelism} {rank} {output_name}
+ ```
+
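+ The four arguments are positional. The values in the following invocation are purely illustrative (a hypothetical dataset processed with 4-way parallelism, this machine handling rank 0), so adapt them to your setup:
+
+ ```bash
+ # Illustrative values only: dataset "my_dataset", parallelism 4, rank 0,
+ # writing the processed meta file under the name "my_dataset_meta".
+ bash scripts/data_preprocess.sh my_dataset 4 0 my_dataset_meta
+ ```
+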
+ #### Training
+
+ Update the data meta path settings in the configuration YAML files, `configs/sft_s1.yaml` and `configs/sft_s2.yaml`:
+
+ ```yaml
+ #sft_s1.yaml
+ train_data: [
+     "./data/output_name.json"
+ ]
+
+ #sft_s2.yaml
+ train_data: [
+     "./data/output_name.json"
+ ]
+ ```
+
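+ Before launching a run, it is worth confirming that the meta file referenced in both YAML files actually exists. The name below is just the `output_name` placeholder from the preprocessing step; its internal format (produced by `scripts/data_preprocess.sh`) is not assumed here:
+
+ ```bash
+ # The path must match the train_data entries above
+ ls -lh ./data/output_name.json
+ ```
+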
+ Start training with the following commands:
+
+ ```bash
+ # stage 1
+ bash scripts/finetune_multi_gpus_s1.sh
+
+ # stage 2
+ bash scripts/finetune_multi_gpus_s2.sh
+ ```
+
+ ## 📝 Citation
+
+ If you find our work useful for your research, please consider citing the paper:
+
+ ```
+ @misc{cui2024hallo3,
+   title={Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks},
+   author={Jiahao Cui and Hui Li and Yun Zhan and Hanlin Shang and Kaihui Cheng and Yuqi Ma and Shan Mu and Hang Zhou and Jingdong Wang and Siyu Zhu},
+   year={2024},
+   eprint={2412.00733},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV}
+ }
+ ```
+
+ ## ⚠️ Social Risks and Mitigations
+
+ The development of portrait image animation technologies driven by audio inputs poses social risks, such as the ethical implications of creating realistic portraits that could be misused for deepfakes. To mitigate these risks, it is crucial to establish ethical guidelines and responsible use practices. Privacy and consent concerns also arise from using individuals' images and voices. Addressing these involves transparent data usage policies, informed consent, and safeguarding privacy rights. By addressing these risks and implementing mitigations, the research aims to ensure the responsible and ethical development of this technology.
+
+ ## 🤗 Acknowledgements
+
+ This model is a fine-tuned derivative of the **CogVideoX-5B I2V** model. CogVideoX-5B is an open-source video generation model developed by the CogVideoX team. Its original code and model parameters are governed by the [CogVideoX-5B LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
+
+ As a derivative work of CogVideoX-5B, the use, distribution, and modification of this model must comply with the license terms of CogVideoX-5B.
+
+ ## 👏 Community Contributors
+
+ Thank you to all the contributors who have helped to make this project better!
+
+ <a href="https://github.com/fudan-generative-vision/hallo3/graphs/contributors">
+   <img src="https://contrib.rocks/image?repo=fudan-generative-vision/hallo3" />
+ </a>