# Tune-A-Video
This repository is the official implementation of [Tune-A-Video](https://arxiv.org/abs/2212.11565).
**[Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation](https://arxiv.org/abs/2212.11565)**
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Yixiao Ge](https://geyixiao.com/),
[Xintao Wang](https://xinntao.github.io/),
Stan Weixian Lei,
[Yuchao Gu](https://ycgu.site/),
[Wynne Hsu](https://www.comp.nus.edu.sg/~whsu/),
[Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en),
[Xiaohu Qie](https://scholar.google.com/citations?user=mk-F69UAAAAJ&hl=en),
[Mike Zheng Shou](https://sites.google.com/view/showlab)
[Project Page](https://tuneavideo.github.io/) | [arXiv](https://arxiv.org/abs/2212.11565)
## Setup
### Requirements
```shell
pip install -r requirements.txt
```
Installing [xformers](https://github.com/facebookresearch/xformers) is highly recommended for better speed and memory efficiency on GPUs.
To enable it, set `enable_xformers_memory_efficient_attention=True` (the default).
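
Before relying on the flag above, it can help to confirm that xformers is importable in the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `xformers_available` is illustrative and not part of this repository.

```python
import importlib.util


def xformers_available() -> bool:
    """Return True if the xformers package can be found in this environment."""
    # find_spec looks the package up without importing it.
    return importlib.util.find_spec("xformers") is not None


if xformers_available():
    print("xformers found: memory-efficient attention can be enabled.")
else:
    print("xformers not found: install it before enabling memory-efficient attention.")
```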
### Weights
You can download the pre-trained [Stable Diffusion](https://arxiv.org/abs/2112.10752) models
(e.g., [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)):
```shell
git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
```
Alternatively, you can use a personalized [DreamBooth](https://arxiv.org/abs/2208.12242) model (e.g., [mr-potato-head](https://huggingface.co/sd-dreambooth-library/mr-potato-head)):
```shell
git lfs install
git clone https://huggingface.co/sd-dreambooth-library/mr-potato-head
```
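
A cloned checkpoint can be incomplete if `git lfs` did not fetch every file. As a rough sanity check, the sketch below lists expected entries that are absent from a model directory. It assumes the usual diffusers-style layout (`unet`, `vae`, `text_encoder`, `tokenizer`, `scheduler`, and `model_index.json`); the helper name and the list of entries are assumptions, so adjust them if your checkpoint differs.

```python
from pathlib import Path
from typing import List

# Assumed diffusers-style layout; adjust if your checkpoint is organized differently.
EXPECTED_SUBFOLDERS = ("unet", "vae", "text_encoder", "tokenizer", "scheduler")


def missing_components(model_dir: str) -> List[str]:
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
    if not (root / "model_index.json").is_file():
        missing.append("model_index.json")
    return missing


# An empty list means every expected entry was found.
print(missing_components("./stable-diffusion-v1-4"))
```

Note that this only checks that the entries exist, not that the weight files inside them were fully downloaded.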
## Training
To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:
```shell
accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"
```
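
To train on several videos in a row, the same command can be driven from Python. The following is a sketch only: the helpers `build_train_command` and `train_all` are hypothetical, and it assumes every `*.yaml` file in the given directory is a valid training config. With `dry_run=True` (the default here) nothing is executed and the commands are only returned.

```python
import subprocess
from pathlib import Path
from typing import List


def build_train_command(config_path: str) -> List[str]:
    """Build the argument list for one training run with the given config file."""
    return ["accelerate", "launch", "train_tuneavideo.py", f"--config={config_path}"]


def train_all(config_dir: str = "configs", dry_run: bool = True) -> List[List[str]]:
    """Build (and, unless dry_run is set, run) one training command per YAML file."""
    config_paths = sorted(Path(config_dir).glob("*.yaml"))
    commands = [build_train_command(path.as_posix()) for path in config_paths]
    for command in commands:
        if not dry_run:
            # check=True stops at the first run that fails.
            subprocess.run(command, check=True)
    return commands


# Print the commands that would be run, without launching anything.
for command in train_all("configs"):
    print(" ".join(command))
```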
## Inference
Once the training is done, run inference:
```python
import torch

from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.util import save_videos_grid

# Path to the model saved by the training step.
model_id = "path-to-your-trained-model"

# Load the fine-tuned UNet and plug it into the Stable Diffusion pipeline.
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

# Generate an 8-frame 512x512 video from the prompt and save it as a GIF.
prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos
save_videos_grid(video, f"{prompt}.gif")
```
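
The example above uses the raw prompt as the file name, which can be awkward for prompts containing commas, periods, or slashes. One possible workaround, sketched here with a hypothetical helper `prompt_to_filename` that is not part of this repository, is to derive a simpler name from the prompt first:

```python
import re


def prompt_to_filename(prompt: str, ext: str = "gif") -> str:
    """Turn a free-form prompt into a lowercase, filesystem-friendly file name."""
    # Replace every run of characters other than ASCII letters and digits with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    # Fall back to a fixed name if the prompt has no usable characters.
    return f"{slug or 'video'}.{ext}"


print(prompt_to_filename("a panda is surfing"))  # a-panda-is-surfing.gif
```

The result could then be passed to `save_videos_grid` in place of `f"{prompt}.gif"`.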
## Results
### Fine-tuning on Stable Diffusion
Prompts for the result videos:

- [Training] a man is surfing.
- a panda is surfing.
- Iron Man is surfing in the desert.
- a raccoon is surfing, cartoon style.
### Fine-tuning on DreamBooth
Prompts for the result videos:

- sks mr potato head.
- sks mr potato head, wearing a pink hat, is surfing.
- sks mr potato head, wearing sunglasses, is surfing.
- sks mr potato head is surfing in the forest.
## BibTeX
```bibtex
@article{wu2022tuneavideo,
title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2212.11565},
year={2022}
}
```