|
---
license: apache-2.0
---
|
|
|
This repository contains a pruned and partially reorganized version of [AniPortrait](https://github.com/Zejun-Yang/AniPortrait).
|
|
|
```bibtex
@misc{wei2024aniportrait,
  title={AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations},
  author={Huawei Wei and Zejun Yang and Zhisheng Wang},
  year={2024},
  eprint={2403.17694},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
|
|
|
# Usage |
|
|
|
## Installation |
|
|
|
First, install the AniPortrait package into your Python environment. If you're creating a new environment for AniPortrait, be sure to also install a CUDA-enabled build of torch, specifying the version you want; otherwise the pipeline will run on CPU only.
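
For example (assuming a CUDA 12.1 system; adjust the index URL to match your CUDA version), a CUDA-enabled torch build can be installed first with:

```sh
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

Then install the package itself: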
|
|
|
```sh
pip install git+https://github.com/painebenjamin/aniportrait.git
```
|
|
|
Now, you can create the pipeline, automatically pulling the weights from this repository, either as individual models: |
|
|
|
```py
import torch
from aniportrait import AniPortraitPipeline

pipeline = AniPortraitPipeline.from_pretrained(
    "benjamin-paine/aniportrait",
    torch_dtype=torch.float16,
    variant="fp16",
    device="cuda"
).to("cuda", dtype=torch.float16)
```
|
|
|
Or, as a single file: |
|
|
|
```py
import torch
from aniportrait import AniPortraitPipeline

pipeline = AniPortraitPipeline.from_single_file(
    "benjamin-paine/aniportrait",
    torch_dtype=torch.float16,
    variant="fp16",
    device="cuda"
).to("cuda", dtype=torch.float16)
```
|
|
|
The `AniPortraitPipeline` is a mega pipeline, capable of instantiating and executing other pipelines. It provides the following functions: |
|
|
|
## Workflows |
|
|
|
### img2img |
|
|
|
```py
pipeline.img2img(
    reference_image: PIL.Image.Image,
    pose_reference_image: PIL.Image.Image,
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and a pose reference image (for pose), render an image of the former in the pose of the latter. |
|
- The pose reference image here is an unprocessed image, from which the face pose will be extracted. |
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
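
As a minimal usage sketch, assuming the `pipeline` object created above (file names, step count, and guidance scale are placeholder values):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_reference = Image.open("pose_reference.png").convert("RGB")

result = pipeline.img2img(
    reference_image=reference,
    pose_reference_image=pose_reference,
    num_inference_steps=25,
    guidance_scale=3.5,
)
# `result` is a Pose2VideoPipelineOutput; with the default output_type="pil",
# the rendered frame is assumed to be accessible through its `videos` field.
```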
|
|
|
### vid2vid |
|
|
|
```py
pipeline.vid2vid(
    reference_image: PIL.Image.Image,
    pose_reference_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    use_long_video: bool=True,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and a sequence of pose reference images (for pose), render a video of the former in the poses of the latter, using context windowing for long-video generation when the poses are longer than 16 frames. |
|
- Optionally pass `use_long_video=False` to disable using the long video pipeline.
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
|
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose reference images. |
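
A usage sketch, assuming the `pipeline` object created above and a directory of extracted driving-video frames (paths and sampler settings are placeholders):

```py
from pathlib import Path
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_reference_frames = [
    Image.open(path).convert("RGB")
    for path in sorted(Path("driving_frames").glob("*.png"))
]

result = pipeline.vid2vid(
    reference_image=reference,
    pose_reference_images=pose_reference_frames,
    num_inference_steps=25,
    guidance_scale=3.5,
    context_frames=16,   # frames per context window when the long-video pipeline is used
    context_overlap=4,   # overlap between adjacent windows
)
```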
|
|
|
### audio2vid |
|
|
|
```py
pipeline.audio2vid(
    audio: str,
    reference_image: PIL.Image.Image,
    num_inference_steps: int,
    guidance_scale: float,
    fps: int=30,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    pose_reference_images: Optional[List[PIL.Image.Image]]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    use_long_video: bool=True,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using an audio file, draw `fps` face pose images per second for the duration of the audio. Then, using those face pose images, render a video. |
|
- Optionally include a list of images from which poses are extracted before being merged with the audio-generated poses (in essence, pass video frames here to control non-speech motion). The default is a moderately active loop of head movement.
|
- Optionally pass width/height to modify the size. Defaults to reference image size. |
|
- Optionally pass `use_long_video=False` to disable using the long video pipeline.
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
|
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose reference images. |
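
A usage sketch, assuming the `pipeline` object created above (the audio and image paths, step count, and guidance scale are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")

result = pipeline.audio2vid(
    audio="speech.wav",
    reference_image=reference,
    num_inference_steps=25,
    guidance_scale=3.5,
    fps=30,       # pose frames drawn per second of audio
    width=512,    # optional; defaults to the reference image size
    height=512,
)
```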
|
|
|
## Internals/Helpers |
|
|
|
### img2pose |
|
|
|
```py
pipeline.img2pose(
    reference_image: PIL.Image.Image,
    width: Optional[int]=None,
    height: Optional[int]=None
) -> PIL.Image.Image
```
|
|
|
Detects face landmarks in an image and draws a face pose image. |
|
- Optionally modify the original width and height. |
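
For example (the image path is a placeholder), assuming the `pipeline` object created above:

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")

# Returns a drawn face pose image at the original size.
pose_image = pipeline.img2pose(reference)
pose_image.save("reference_pose.png")
```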
|
|
|
### vid2pose |
|
|
|
```py
pipeline.vid2pose(
    reference_image: PIL.Image.Image,
    retarget_image: Optional[PIL.Image.Image],
    width: Optional[int]=None,
    height: Optional[int]=None
) -> List[PIL.Image.Image]
```
|
|
|
Detects face landmarks in a series of images and draws pose images. |
|
- Optionally modify the original width and height. |
|
- Optionally retarget to a different face position, useful for video-to-video tasks. |
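
A sketch of the retargeting use case (paths are placeholders; per the description above, the first argument is assumed to accept the sequence of frames to process):

```py
from pathlib import Path
from PIL import Image

retarget = Image.open("retarget_face.png").convert("RGB")
frames = [
    Image.open(path).convert("RGB")
    for path in sorted(Path("driving_frames").glob("*.png"))
]

# One drawn pose image per input frame, retargeted to the face in `retarget`.
pose_images = pipeline.vid2pose(frames, retarget_image=retarget)
```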
|
|
|
### audio2pose |
|
|
|
```py
pipeline.audio2pose(
    audio_path: str,
    fps: int=30,
    reference_image: Optional[PIL.Image.Image]=None,
    pose_reference_images: Optional[List[PIL.Image.Image]]=None,
    width: Optional[int]=None,
    height: Optional[int]=None
) -> List[PIL.Image.Image]
```
|
|
|
Using an audio file, draw `fps` face pose images per second for the duration of the audio. |
|
- Optionally include a reference image to extract the face shape and initial position from. Default has a generic androgynous face shape. |
|
- Optionally include a list of images from which poses are extracted before being merged with the audio-generated poses (in essence, pass video frames here to control non-speech motion). The default is a moderately active loop of head movement.
|
- Optionally pass width/height to modify the size. Defaults to reference image size, then pose image sizes, then 256. |
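
A sketch that generates pose frames from audio and then renders them with `pose2vid_long` (paths and settings are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")

# 30 drawn face pose images per second for the duration of the audio.
pose_images = pipeline.audio2pose("speech.wav", fps=30, reference_image=reference)

result = pipeline.pose2vid_long(
    reference_image=reference,
    pose_images=pose_images,
    num_inference_steps=25,
    guidance_scale=3.5,
)
```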
|
|
|
### pose2img |
|
|
|
```py
pipeline.pose2img(
    reference_image: PIL.Image.Image,
    pose_image: PIL.Image.Image,
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and a pose image (for pose), render an image of the former in the pose of the latter. |
|
- The pose image here is a processed face pose. To pass a non-processed face pose, see `img2img`. |
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
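
A sketch chaining `img2pose` and `pose2img` (file names and settings are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
driving = Image.open("driving.png").convert("RGB")

# Pre-process the driving image into a drawn face pose, then render.
pose_image = pipeline.img2pose(driving)
result = pipeline.pose2img(
    reference_image=reference,
    pose_image=pose_image,
    num_inference_steps=25,
    guidance_scale=3.5,
)
```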
|
|
|
### pose2vid |
|
|
|
```py
pipeline.pose2vid(
    reference_image: PIL.Image.Image,
    pose_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and pose images (for pose), render a video of the former in the poses of the latter. |
|
- The pose images here are processed face poses. To pass non-processed face poses, see `vid2vid`.
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
|
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose images. |
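
A sketch using already-processed pose frames, for example the output of `audio2pose` or `vid2pose` (paths and settings are placeholders):

```py
from pathlib import Path
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_images = [
    Image.open(path).convert("RGB")
    for path in sorted(Path("pose_frames").glob("*.png"))
]

result = pipeline.pose2vid(
    reference_image=reference,
    pose_images=pose_images,
    num_inference_steps=25,
    guidance_scale=3.5,
    video_length=len(pose_images),  # optional; defaults to the number of pose images
)
```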
|
|
|
### pose2vid_long |
|
|
|
```py
pipeline.pose2vid_long(
    reference_image: PIL.Image.Image,
    pose_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and pose images (for pose), render a video of the former in the poses of the latter, using context windowing for long-video generation. |
|
- The pose images here are processed face poses. To pass non-processed face poses, see `vid2vid`.
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
|
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose images. |
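
A sketch highlighting the context-windowing parameters, reusing `reference` and `pose_images` from the examples above (values are placeholders):

```py
result = pipeline.pose2vid_long(
    reference_image=reference,
    pose_images=pose_images,
    num_inference_steps=25,
    guidance_scale=3.5,
    context_schedule="uniform",  # how context windows are scheduled across the video
    context_frames=16,           # frames denoised per window
    context_overlap=4,           # overlap between adjacent windows
    context_batch_size=1,
    interpolation_factor=1,      # >1 is assumed to interpolate extra frames between outputs
)
```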
|
|