|
---
license: apache-2.0
---
|
|
|
This repository contains a pruned and partially reorganized version of [AniPortrait](https://github.com/Zejun-Yang/AniPortrait).
|
|
|
```bibtex
@misc{wei2024aniportrait,
  title={AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations},
  author={Huawei Wei and Zejun Yang and Zhisheng Wang},
  year={2024},
  eprint={2403.17694},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
|
|
|
# Usage |
|
|
|
## Installation |
|
|
|
First, install the AniPortrait package into your Python environment. If you're creating a new environment for AniPortrait, be sure to also install a CUDA-enabled build of torch, specifying the version you want; otherwise the pipeline will run on CPU only.
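
For example (assuming a CUDA 12.1 system; adjust the index URL to match your CUDA version), a CUDA-enabled torch build can be installed first with:

```sh
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

Then install the package itself: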
|
|
|
```sh
pip install git+https://github.com/painebenjamin/aniportrait.git
```
|
|
|
Now, you can create the pipeline, automatically pulling the weights from this repository, either as individual models: |
|
|
|
```py
import torch
from aniportrait import AniPortraitPipeline

pipeline = AniPortraitPipeline.from_pretrained(
    "benjamin-paine/aniportrait",
    torch_dtype=torch.float16,
    variant="fp16",
    device="cuda"
).to("cuda", dtype=torch.float16)
```
|
|
|
Or, as a single file: |
|
|
|
```py
import torch
from aniportrait import AniPortraitPipeline

pipeline = AniPortraitPipeline.from_single_file(
    "benjamin-paine/aniportrait",
    torch_dtype=torch.float16,
    variant="fp16",
    device="cuda"
).to("cuda", dtype=torch.float16)
```
|
|
|
The `AniPortraitPipeline` is a mega pipeline, capable of instantiating and executing other pipelines. It provides the following functions: |
|
|
|
## Workflows |
|
|
|
### img2img |
|
|
|
```py
pipeline.img2img(
    reference_image: PIL.Image.Image,
    pose_reference_image: PIL.Image.Image,
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and a pose reference image (for pose), render an image of the former in the pose of the latter. |
|
- The pose reference image here is an unprocessed image, from which the face pose will be extracted. |
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
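
As a minimal usage sketch, assuming the `pipeline` object created above (file names, step count, and guidance scale are placeholder values):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_reference = Image.open("pose_reference.png").convert("RGB")

result = pipeline.img2img(
    reference_image=reference,
    pose_reference_image=pose_reference,
    num_inference_steps=25,
    guidance_scale=3.5,
)
# `result` is a Pose2VideoPipelineOutput; with the default output_type="pil",
# the rendered frame is assumed to be accessible through its `videos` field.
```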
|
|
|
### vid2vid |
|
|
|
```py
pipeline.vid2vid(
    reference_image: PIL.Image.Image,
    pose_reference_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    use_long_video: bool=True,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and a sequence of pose reference images (for pose), render a video of the former in the poses of the latter, using context windowing for long-video generation when the poses are longer than 16 frames. |
|
- Optionally pass `use_long_video=False` to disable using the long video pipeline.
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
|
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose reference images. |
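
A usage sketch, assuming the `pipeline` object created above and a directory of extracted driving-video frames (paths and sampler settings are placeholders):

```py
from pathlib import Path
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_reference_frames = [
    Image.open(path).convert("RGB")
    for path in sorted(Path("driving_frames").glob("*.png"))
]

result = pipeline.vid2vid(
    reference_image=reference,
    pose_reference_images=pose_reference_frames,
    num_inference_steps=25,
    guidance_scale=3.5,
    context_frames=16,   # frames per context window when the long-video pipeline is used
    context_overlap=4,   # overlap between adjacent windows
)
```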
|
|
|
### audio2vid |
|
|
|
```py
pipeline.audio2vid(
    audio: str,
    reference_image: PIL.Image.Image,
    num_inference_steps: int,
    guidance_scale: float,
    fps: int=30,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    pose_reference_images: Optional[List[PIL.Image.Image]]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    use_long_video: bool=True,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using an audio file, draw `fps` face pose images per second for the duration of the audio. Then, using those face pose images, render a video. |
|
- Optionally include a list of images from which poses are extracted before being merged with the audio-generated poses (in essence, pass video frames here to control non-speech motion). The default is a moderately active loop of head movement.
|
- Optionally pass width/height to modify the size. Defaults to reference image size. |
|
- Optionally pass `use_long_video=False` to disable using the long video pipeline.
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
|
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose reference images. |
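
A usage sketch, assuming the `pipeline` object created above (the audio and image paths, step count, and guidance scale are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")

result = pipeline.audio2vid(
    audio="speech.wav",
    reference_image=reference,
    num_inference_steps=25,
    guidance_scale=3.5,
    fps=30,       # pose frames drawn per second of audio
    width=512,    # optional; defaults to the reference image size
    height=512,
)
```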
|
|
|
## Internals/Helpers |
|
|
|
### img2pose |
|
|
|
```py
pipeline.img2pose(
    reference_image: PIL.Image.Image,
    width: Optional[int]=None,
    height: Optional[int]=None
) -> PIL.Image.Image
```
|
|
|
Detects face landmarks in an image and draws a face pose image. |
|
- Optionally modify the original width and height. |
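
For example (the image path is a placeholder), assuming the `pipeline` object created above:

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")

# Returns a drawn face pose image at the original size.
pose_image = pipeline.img2pose(reference)
pose_image.save("reference_pose.png")
```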
|
|
|
### vid2pose |
|
|
|
```py
pipeline.vid2pose(
    reference_image: PIL.Image.Image,
    retarget_image: Optional[PIL.Image.Image],
    width: Optional[int]=None,
    height: Optional[int]=None
) -> List[PIL.Image.Image]
```
|
|
|
Detects face landmarks in a series of images and draws pose images. |
|
- Optionally modify the original width and height. |
|
- Optionally retarget to a different face position, useful for video-to-video tasks. |
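
A sketch of the retargeting use case (paths are placeholders; per the description above, the first argument is assumed to accept the sequence of frames to process):

```py
from pathlib import Path
from PIL import Image

retarget = Image.open("retarget_face.png").convert("RGB")
frames = [
    Image.open(path).convert("RGB")
    for path in sorted(Path("driving_frames").glob("*.png"))
]

# One drawn pose image per input frame, retargeted to the face in `retarget`.
pose_images = pipeline.vid2pose(frames, retarget_image=retarget)
```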
|
|
|
### audio2pose |
|
|
|
```py
pipeline.audio2pose(
    audio_path: str,
    fps: int=30,
    reference_image: Optional[PIL.Image.Image]=None,
    pose_reference_images: Optional[List[PIL.Image.Image]]=None,
    width: Optional[int]=None,
    height: Optional[int]=None
) -> List[PIL.Image.Image]
```
|
|
|
Using an audio file, draw `fps` face pose images per second for the duration of the audio. |
|
- Optionally include a reference image to extract the face shape and initial position from. Default has a generic androgynous face shape. |
|
- Optionally include a list of images from which poses are extracted before being merged with the audio-generated poses (in essence, pass video frames here to control non-speech motion). The default is a moderately active loop of head movement.
|
- Optionally pass width/height to modify the size. Defaults to reference image size, then pose image sizes, then 256. |
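
A sketch that generates pose frames from audio and then renders them with `pose2vid_long` (paths and settings are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")

# 30 drawn face pose images per second for the duration of the audio.
pose_images = pipeline.audio2pose("speech.wav", fps=30, reference_image=reference)

result = pipeline.pose2vid_long(
    reference_image=reference,
    pose_images=pose_images,
    num_inference_steps=25,
    guidance_scale=3.5,
)
```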
|
|
|
### pose2img |
|
|
|
```py
pipeline.pose2img(
    reference_image: PIL.Image.Image,
    pose_image: PIL.Image.Image,
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and a pose image (for pose), render an image of the former in the pose of the latter. |
|
- The pose image here is a processed face pose. To pass a non-processed face pose, see `img2img`. |
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
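
A sketch chaining `img2pose` and `pose2img` (file names and settings are placeholders):

```py
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
driving = Image.open("driving.png").convert("RGB")

# Pre-process the driving image into a drawn face pose, then render.
pose_image = pipeline.img2pose(driving)
result = pipeline.pose2img(
    reference_image=reference,
    pose_image=pose_image,
    num_inference_steps=25,
    guidance_scale=3.5,
)
```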
|
|
|
### pose2vid |
|
|
|
```py
pipeline.pose2vid(
    reference_image: PIL.Image.Image,
    pose_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and pose images (for pose), render a video of the former in the poses of the latter. |
|
- The pose images here are processed face poses. To pass non-processed face poses, see `vid2vid`.
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
|
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose images. |
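
A sketch using already-processed pose frames, for example the output of `audio2pose` or `vid2pose` (paths and settings are placeholders):

```py
from pathlib import Path
from PIL import Image

reference = Image.open("reference.png").convert("RGB")
pose_images = [
    Image.open(path).convert("RGB")
    for path in sorted(Path("pose_frames").glob("*.png"))
]

result = pipeline.pose2vid(
    reference_image=reference,
    pose_images=pose_images,
    num_inference_steps=25,
    guidance_scale=3.5,
    video_length=len(pose_images),  # optional; defaults to the number of pose images
)
```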
|
|
|
### pose2vid_long |
|
|
|
```py
pipeline.pose2vid_long(
    reference_image: PIL.Image.Image,
    pose_images: List[PIL.Image.Image],
    num_inference_steps: int,
    guidance_scale: float,
    eta: float=0.0,
    reference_pose_image: Optional[Image.Image]=None,
    generation: Optional[Union[torch.Generator, List[torch.Generator]]]=None,
    output_type: Optional[str]="pil",
    return_dict: bool=True,
    callback: Optional[Callable[[int, int, torch.FloatTensor], None]]=None,
    callback_steps: Optional[int]=None,
    width: Optional[int]=None,
    height: Optional[int]=None,
    video_length: Optional[int]=None,
    context_schedule: str="uniform",
    context_frames: int=16,
    context_overlap: int=4,
    context_batch_size: int=1,
    interpolation_factor: int=1,
    **kwargs: Any
) -> Pose2VideoPipelineOutput
```
|
|
|
Using a reference image (for structure) and pose images (for pose), render a video of the former in the poses of the latter, using context windowing for long-video generation. |
|
- The pose images here are processed face poses. To pass non-processed face poses, see `vid2vid`.
|
- Optionally pass `reference_pose_image` to designate the pose of `reference_image`. When not passed, the pose of `reference_image` is automatically detected. |
|
- Optionally pass `video_length` to use this many frames. Default is the same as the length of the pose images. |
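
A sketch highlighting the context-windowing parameters, reusing `reference` and `pose_images` from the examples above (values are placeholders):

```py
result = pipeline.pose2vid_long(
    reference_image=reference,
    pose_images=pose_images,
    num_inference_steps=25,
    guidance_scale=3.5,
    context_schedule="uniform",  # how context windows are scheduled across the video
    context_frames=16,           # frames denoised per window
    context_overlap=4,           # overlap between adjacent windows
    context_batch_size=1,
    interpolation_factor=1,      # >1 is assumed to interpolate extra frames between outputs
)
```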
|
|