benjamin-paine committed
Commit 051da37
1 Parent(s): 5f4d0fd

Update README.md

Files changed (1): README.md (+280 -0)
---
license: apache-2.0
---

This repository contains a pruned and partially reorganized version of [AniPortrait](https://fudan-generative-vision.github.io/champ/#/).

```bibtex
@misc{wei2024aniportrait,
    title={AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations},
    author={Huawei Wei and Zejun Yang and Zhisheng Wang},
    year={2024},
    eprint={2403.17694},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```

# Usage

## Installation

First, install the AniPortrait package into your Python environment. If you're creating a new environment for AniPortrait, be sure to also specify the version of torch you want with CUDA support; otherwise, the pipeline will try to run on CPU only.

```sh
pip install git+https://github.com/painebenjamin/aniportrait.git
```

Now, you can create the pipeline, automatically pulling the weights from this repository, either as individual models:

```py
import torch

from aniportrait import AniPortraitPipeline

pipeline = AniPortraitPipeline.from_pretrained(
    "benjamin-paine/aniportrait",
    torch_dtype=torch.float16,
    variant="fp16",
    device="cuda"
).to("cuda", dtype=torch.float16)
```

Or, as a single file:

```py
import torch

from aniportrait import AniPortraitPipeline

pipeline = AniPortraitPipeline.from_single_file(
    "benjamin-paine/aniportrait",
    torch_dtype=torch.float16,
    variant="fp16",
    device="cuda"
).to("cuda", dtype=torch.float16)
```

The `AniPortraitPipeline` is a mega-pipeline, capable of instantiating and executing other pipelines. It provides the following functions:

## Workflows

### img2img

```py
pipeline.img2img(
    reference_image: PIL.Image.Image,
    pose_reference_image: PIL.Image.Image,
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```

Using a reference image (for structure) and a pose reference image (for pose), render an image of the former in the pose of the latter.
- The pose reference image here is an unprocessed image, from which the face pose will be extracted.
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
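
As a rough usage sketch (with `pipeline` created as in the Installation section; the file names and sampling values below are placeholders, not values prescribed by this repository):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")            # image providing identity/structure
pose_reference = Image.open("pose_reference.png").convert("RGB")  # unprocessed image providing the pose

result = pipeline.img2img(
    reference_image=reference,
    pose_reference_image=pose_reference,
    num_inference_steps=25,
    guidance_scale=3.5,
)
# `result` is a Pose2VideoPipelineOutput carrying the rendered output.
```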

### vid2vid

```py
pipeline.vid2vid(
    reference_image: PIL.Image.Image,
    pose_reference_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    use_long_video: bool=True,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```

Using a reference image (for structure) and a sequence of pose reference images (for pose), render a video of the former in the poses of the latter, using context windowing for long-video generation when the poses are longer than 16 frames.
- Optionally pass `use_long_video=False` to disable the long-video pipeline.
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose reference images.
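
For example, driving a still image with frames extracted from a video (the frame directory and sampling values are placeholders):

```py
from pathlib import Path
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
# Ordered list of driving-video frames, extracted beforehand to a directory of PNGs.
pose_reference_frames = [
    Image.open(path).convert("RGB")
    for path in sorted(Path("driving_frames").glob("*.png"))
]

result = pipeline.vid2vid(
    reference_image=reference,
    pose_reference_images=pose_reference_frames,
    num_inference_steps=25,
    guidance_scale=3.5,
    context_frames=16,   # window size used for long-video generation
    context_overlap=4,   # overlap between consecutive windows
)
# `result` is a Pose2VideoPipelineOutput carrying the rendered frames.
```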

### audio2vid

```py
pipeline.audio2vid(
    audio: str,
    reference_image: PIL.Image.Image,
    num_inference_steps: int,
    guidance_scale: float,
    fps: int=30,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    pose_reference_images: Optional[List[PIL.Image.Image]]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    use_long_video: bool=True,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```

Using an audio file, draw `fps` face pose images per second for the duration of the audio. Then, using those face pose images, render a video.
- Optionally include a list of images to extract the poses from prior to merging with the audio-generated poses (in essence, pass a video here to control non-speech motion). The default is a moderately active loop of head movement.
- Optionally pass width/height to modify the size. Defaults to the reference image size.
- Optionally pass `use_long_video=False` to disable the long-video pipeline.
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose reference images.
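
A minimal sketch of generating a talking-head clip from an audio file (paths and sampling values are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")

result = pipeline.audio2vid(
    audio="speech.wav",        # path to the driving audio file
    reference_image=reference,
    num_inference_steps=25,
    guidance_scale=3.5,
    fps=30,                    # pose frames drawn per second of audio
)
# `result` is a Pose2VideoPipelineOutput carrying the rendered frames;
# mux them with the original audio track in a separate step to produce the final video.
```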

## Internals/Helpers

### img2pose

```py
pipeline.img2pose(
    reference_image: PIL.Image.Image,
    width: Optional[int]=None,
    height: Optional[int]=None
) -> PIL.Image.Image
```

Detects face landmarks in an image and draws a face pose image.
- Optionally modify the original width and height.
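
For example (the file names are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_image = pipeline.img2pose(reference)  # returns a PIL.Image.Image with the drawn face pose
pose_image.save("reference_pose.png")
```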

### vid2pose

```py
pipeline.vid2pose(
    reference_image: PIL.Image.Image,
    retarget_image: Optional[PIL.Image.Image],
    width: Optional[int]=None,
    height: Optional[int]=None
) -> List[PIL.Image.Image]
```

Detects face landmarks in a series of images and draws pose images.
- Optionally modify the original width and height.
- Optionally retarget to a different face position, useful for video-to-video tasks.

### audio2pose

```py
pipeline.audio2pose(
    audio_path: str,
    fps: int=30,
    reference_image: Optional[PIL.Image.Image]=None,
    pose_reference_images: Optional[List[PIL.Image.Image]]=None,
    width: Optional[int]=None,
    height: Optional[int]=None
) -> List[PIL.Image.Image]
```

Using an audio file, draw `fps` face pose images per second for the duration of the audio.
- Optionally include a reference image to extract the face shape and initial position from. The default is a generic, androgynous face shape.
- Optionally include a list of images to extract the poses from prior to merging with the audio-generated poses (in essence, pass a video here to control non-speech motion). The default is a moderately active loop of head movement.
- Optionally pass width/height to modify the size. Defaults to the reference image size, then the pose image sizes, then 256.
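
A minimal sketch of drawing pose frames from audio alone (the audio path is a placeholder):

```py
pose_images = pipeline.audio2pose(
    audio_path="speech.wav",
    fps=30,
)
# One PIL pose image per frame; these can be fed to pose2vid/pose2vid_long below.
print(len(pose_images))
```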
200
+
201
+ ### pose2img
202
+
203
+ ```py
204
+ pipeline.pose2img(
205
+ reference_image: PIL.Image.Image,
206
+ pose_image: PIL.Image.Image,
207
+ num_inference_steps: int,
208
+ guidance_scale: float,
209
+ eta: float=0.0,
210
+ reference_pose_image: Optional[Image.Image]=None,
211
+ generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
212
+ output_type: Optional[str]="pil",
213
+ return_dict: bool=True,
214
+ callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
215
+ callback_steps: Optional[int]=None,
216
+ width: Optional[int]=None,
217
+ height: Optional[int]=None,
218
+ **kwargs: Any
219
+ ) -> Pose2VideoPipelineOutput
220
+ ```

Using a reference image (for structure) and a pose image (for pose), render an image of the former in the pose of the latter.
- The pose image here is a processed face pose. To pass a non-processed face pose, see `img2img`.
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
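
For example, rendering from an already-drawn pose image (file names and sampling values are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_image = Image.open("reference_pose.png")  # an already-drawn face pose, e.g. from img2pose

result = pipeline.pose2img(
    reference_image=reference,
    pose_image=pose_image,
    num_inference_steps=25,
    guidance_scale=3.5,
)
# `result` is a Pose2VideoPipelineOutput carrying the rendered image.
```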

### pose2vid

```py
pipeline.pose2vid(
    reference_image: PIL.Image.Image,
    pose_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```

Using a reference image (for structure) and pose images (for pose), render a video of the former in the poses of the latter.
- The pose images here are processed face poses. To pass non-processed face poses, see `vid2vid`.
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose images.
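
For example, combining `audio2pose` with `pose2vid` (paths and sampling values are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_images = pipeline.audio2pose(audio_path="speech.wav", fps=30)  # pre-drawn face poses

result = pipeline.pose2vid(
    reference_image=reference,
    pose_images=pose_images,
    num_inference_steps=25,
    guidance_scale=3.5,
)
# `result` is a Pose2VideoPipelineOutput carrying the rendered frames.
```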

### pose2vid_long

```py
pipeline.pose2vid_long(
    reference_image: PIL.Image.Image,
    pose_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```

Using a reference image (for structure) and pose images (for pose), render a video of the former in the poses of the latter, using context windowing for long-video generation.
- The pose images here are processed face poses. To pass non-processed face poses, see `vid2vid`.
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected.
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose images.
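
For long pose sequences, a sketch of rendering with explicit context-window settings (the pose-frame directory and sampling values are placeholders):

```py
from pathlib import Path
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
# A long, ordered sequence of pre-drawn face pose images (e.g. saved from audio2pose or vid2pose).
pose_images = [Image.open(path) for path in sorted(Path("pose_frames").glob("*.png"))]

result = pipeline.pose2vid_long(
    reference_image=reference,
    pose_images=pose_images,
    num_inference_steps=25,
    guidance_scale=3.5,
    context_schedule="uniform",
    context_frames=16,    # frames per context window
    context_overlap=4,    # overlap between consecutive windows
    context_batch_size=1,
)
# `result` is a Pose2VideoPipelineOutput carrying the rendered frames.
```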