---
inference: false
tags:
- text-to-video
- text-to-image
- jax-diffusers-event
- art
pipeline_tag: text-to-video
datasets:
- TempoFunk/tempofunk-sdance
- TempoFunk/small
license: agpl-3.0
language: en
library_name: diffusers
---


# Make-A-Video SD JAX Model Card

**A latent diffusion model for text-to-video synthesis.**

**[Try it with an interactive demo on HuggingFace spaces.](https://huggingface.co/spaces/TempoFunk/makeavid-sd-jax)**

The training code and the PyTorch and FLAX implementations are available here: <https://github.com/lopho/makeavid-sd-tpu>

This model extends an inpainting latent-diffusion image generation model ([Stable Diffusion v1.5 Inpaint](https://huggingface.co/runwayml/stable-diffusion-inpainting))
with temporal convolution and temporal self-attention layers ported from [Make-A-Video PyTorch](https://github.com/lucidrains/make-a-video-pytorch).
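
The added temporal layers follow the factorised ("pseudo-3D") pattern from Make-A-Video: a 2D convolution applied to each frame, followed by a 1D convolution applied across frames (temporal self-attention is factorised analogously). The Flax sketch below only illustrates that idea; the module name, shape convention and kernel sizes are illustrative assumptions, not the exact code from this repository.

```python
import jax.numpy as jnp
import flax.linen as nn

class PseudoConv3d(nn.Module):
    """Factorised ("pseudo-3D") convolution: a 2D spatial convolution applied
    per frame, followed by a 1D temporal convolution applied per pixel position.
    Input/output shape is channels-last: (batch, frames, height, width, channels).
    Illustrative sketch only; see the linked training code for the real modules."""
    features: int

    @nn.compact
    def __call__(self, x):
        b, t, h, w, c = x.shape
        # spatial convolution over each frame independently
        x = x.reshape(b * t, h, w, c)
        x = nn.Conv(self.features, kernel_size=(3, 3), padding='SAME')(x)
        x = x.reshape(b, t, h, w, self.features)
        # temporal convolution over each spatial position independently
        x = x.transpose(0, 2, 3, 1, 4).reshape(b * h * w, t, self.features)
        x = nn.Conv(self.features, kernel_size=(3,), padding='SAME')(x)
        x = x.reshape(b, h, w, t, self.features).transpose(0, 3, 1, 2, 4)
        return x
```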

It was then fine-tuned for ~150k steps on a [dataset](https://huggingface.co/datasets/TempoFunk/tempofunk-sdance) of 10,000 videos themed around dance,
followed by an additional ~50k steps with [extra data](https://huggingface.co/datasets/TempoFunk/small) of generic videos mixed into the original set.

This model used weights pretrained by [lxj616](https://huggingface.co/lxj616/make-a-stable-diffusion-video-timelapse) on 286 timelapse video clips for initialization.

![](https://huggingface.co/spaces/TempoFunk/makeavid-sd-jax/resolve/main/example.gif)

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Limitations](#limitations)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
  - [Hyperparameters](#hyperparameters)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)


## Model Details

* **Developed by:** [Lopho](https://huggingface.co/lopho), [Chavinlo](https://huggingface.co/chavinlo)
* **Model type:** Diffusion-based text-to-video generation model
* **Language(s):** English
* **License:** (pending) GNU Affero General Public License 3.0
* **Further resources:** [Model implementation & training code](https://github.com/lopho/makeavid-sd-tpu), [Weights & Biases training statistics](https://wandb.ai/tempofunk/makeavid-sd-tpu)

## Uses

* Understanding limitations and biases of generative video models
* Development of educational or creative tools
* Artistic usage
* Whatever you want

## Limitations

* Limited understanding of temporal concepts not represented in the training data (see the linked datasets)
* Generated videos may contain flashing lights, most likely because the dance training videos include many scenes with bright, neon and flashing lights
* The model has only been trained with English captions and will not perform as well in other languages

## Training

### Training Data

* [S(mall)dance](https://huggingface.co/datasets/TempoFunk/tempofunk-sdance): 10,000 video-caption pairs of dancing videos (as encoded image latents, text embeddings and metadata).
* [small](https://huggingface.co/datasets/TempoFunk/small): 7,000 video-caption pairs of general videos (as encoded image latents, text embeddings and metadata).

### Training Procedure

* From each video, a random range of 24 frames is selected
* The selected frames are encoded into a video latent of shape 4 x 24 x H/8 x W/8
* The latent of the first frame of each video is repeated along the frame dimension as additional guidance (referred to as the hint image)
* The hint latent and the video latent are stacked along the channel dimension, producing a shape of 8 x 24 x H/8 x W/8
* The last input channel is reserved for the inpainting mask (not used during training, set to zero)
* Text prompts are encoded by the CLIP text encoder
* The noised video latents and the CLIP-encoded text prompts are fed into the UNet to predict the added noise
* The loss is the mean squared error (MSE/L2) between the added noise and the predicted noise, as sketched below
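
A minimal sketch of how the UNet input described above might be assembled and the loss computed, assuming the channel ordering implied by the list (noisy video latents, then hint latents, with the mask channel last); the function names, the single fixed noise level and the helper structure are illustrative assumptions rather than the repository's actual training code:

```python
import jax
import jax.numpy as jnp

def make_unet_input(video_latents, key):
    """Assemble the UNet input described above.
    video_latents: (batch, 4, 24, H/8, W/8). Illustrative sketch only."""
    b, c, t, h, w = video_latents.shape
    # hint image: the first-frame latent repeated along the frame dimension
    hint = jnp.repeat(video_latents[:, :, :1], t, axis=2)
    # add noise to the video latents (a single noise level for brevity;
    # real diffusion training samples timesteps from a noise schedule)
    noise = jax.random.normal(key, video_latents.shape)
    noisy = video_latents + noise
    # mask channel: unused during training, set to zero
    mask = jnp.zeros((b, 1, t, h, w), dtype=video_latents.dtype)
    # 4 noisy video channels + 4 hint channels + 1 mask channel
    unet_input = jnp.concatenate([noisy, hint, mask], axis=1)
    return unet_input, noise

def noise_loss(predicted_noise, noise):
    # reconstruction objective: MSE between added and predicted noise
    return jnp.mean((predicted_noise - noise) ** 2)
```

In the actual training loop, the UNet additionally receives the sampled diffusion timestep and the CLIP text embeddings as conditioning.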

### Hyperparameters

* **Batch size:** 1 x 4
* **Image size:** 512 x 512
* **Frame count:** 24
* **Optimizer:** AdamW (beta_1 = 0.9, beta_2 = 0.999, weight decay = 0.02)
* **Schedule:**
  * 2 x 10 epochs: LR warmup for 1 epoch, then held constant at 5e-5 (10,000 samples per epoch)
  * 2 x 20 epochs: LR warmup for 1 epoch, then held constant at 5e-5 (10,000 samples per epoch)
  * 1 x 9 epochs: LR warmup for 1 epoch to 5e-5, then cosine annealing to 1e-8
  * Additional data mixed in, see [Training Data](#training-data)
  * 1 x 5 epochs: LR warmup for 0.5 epochs to 2.5e-5, then held constant (17,000 samples per epoch)
  * 1 x 5 epochs: LR warmup for 0.5 epochs to 5e-6, then cosine annealing to 2.5e-6 (17,000 samples per epoch)
  * Some restarts were required due to NaNs appearing in the gradients (see the training logs)
* **Total update steps:** ~200,000
* **Hardware:** TPUv4-8 (provided by Google Cloud for the [HuggingFace JAX/Diffusers Sprint Event](https://github.com/huggingface/community-events/tree/main/jax-controlnet-sprint))

Training statistics are available on [Weights & Biases](https://wandb.ai/tempofunk/makeavid-sd-tpu).
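
As an illustration, the optimizer and the final schedule phase listed above could be expressed with [optax](https://github.com/google-deepmind/optax) roughly as follows; the specific schedule helper and step counts are assumptions derived from the numbers above, not the repository's exact configuration.

```python
import optax

# Rough reconstruction of the last schedule phase:
# 0.5 epoch warmup to 5e-6, then cosine annealing to 2.5e-6,
# at an effective batch size of 4 (1 x 4) over 5 epochs of 17,000 samples.
samples_per_epoch = 17_000
batch_size = 4
steps_per_epoch = samples_per_epoch // batch_size

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=5e-6,
    warmup_steps=steps_per_epoch // 2,  # 0.5 epoch of warmup
    decay_steps=5 * steps_per_epoch,    # total schedule length (5 epochs), incl. warmup
    end_value=2.5e-6,
)

optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.999,
    weight_decay=0.02,
)
```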

## Acknowledgements

* [CompVis](https://github.com/CompVis/) for [Latent Diffusion Models](https://arxiv.org/abs/2112.10752) + [Stable Diffusion](https://github.com/CompVis/stable-diffusion)
* [Meta AI's Make-A-Video](https://arxiv.org/abs/2209.14792) for the research on applying pseudo-3D convolution and attention to existing image models
* [Phil Wang](https://github.com/lucidrains) for the torch implementation of [Make-A-Video Pseudo3D convolution and attention](https://github.com/lucidrains/make-a-video-pytorch/)
* [lxj616](https://huggingface.co/lxj616) for the initial proof of feasibility of LDM + Make-A-Video

## Citation

```bibtex
@misc{TempoFunk2023,
      author = {Lopho, Carlos Chavez},
      title = {TempoFunk: Extending latent diffusion image models to Video},
      url = {https://github.com/lopho/makeavid-sd-tpu},
      month = {5},
      year = {2023}
}
```

---

*This model card was written by: [Lopho](https://huggingface.co/lopho), [Chavinlo](https://huggingface.co/chavinlo), [Julian Herrera](https://huggingface.co/puffy310) and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*