Files changed (1)
  1. README.md +117 -2
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- license: apache-2.0
3
  base_model:
4
  - stabilityai/stable-diffusion-3-medium
5
  ---
@@ -21,4 +21,119 @@ This is the official repository for Pyramid Flow, a training-efficient **Autoreg
21
  <td><video src="https://pyramid-flow.github.io/static/videos/t2v/trailer.mp4" autoplay muted loop playsinline></video></td>
22
  <td><video src="https://pyramid-flow.github.io/static/videos/i2v/sunday.mp4" autoplay muted loop playsinline></video></td>
23
  </tr>
24
- </table>
1
  ---
2
+ license: mit
3
  base_model:
4
  - stabilityai/stable-diffusion-3-medium
5
  ---
 
21
  <td><video src="https://pyramid-flow.github.io/static/videos/t2v/trailer.mp4" autoplay muted loop playsinline></video></td>
22
  <td><video src="https://pyramid-flow.github.io/static/videos/i2v/sunday.mp4" autoplay muted loop playsinline></video></td>
23
  </tr>
24
+ </table>
25
+
26
+ ## News
27
+
28
+ * `COMING SOON` ⚡️⚡️⚡️ Training code and new model checkpoints trained from scratch.
29
+ * `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.
30
+
31
+ ## Usage
32
+
33
+ You can download the model directly from [Hugging Face](https://huggingface.co/rain1011/pyramid-flow-sd3). We provide model checkpoints for both 768p and 384p video generation. The 384p checkpoint supports 5-second video generation at 24 FPS, while the 768p checkpoint supports up to 10-second video generation at 24 FPS.
34
+
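As a minimal sketch of the download step, the checkpoint repository can be fetched with `snapshot_download` from the `huggingface_hub` package. The helper name `download_checkpoint` and the target directory are illustrative assumptions, not part of the Pyramid Flow code base.

```python
from pathlib import Path

REPO_ID = "rain1011/pyramid-flow-sd3"  # repository named in this README


def download_checkpoint(local_dir: str) -> str:
    """Download the whole checkpoint repository into local_dir and return its path.

    Hypothetical helper for illustration; it only wraps huggingface_hub.
    """
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=REPO_ID, repo_type="model", local_dir=str(target))


# Example call (a large download; the directory below is a placeholder):
# checkpoint_dir = download_checkpoint("~/models/pyramid-flow")
```

The returned directory is what you would pass as the checkpoint path when loading the model.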
35
+ To use our model, please follow the inference code in `video_generation_demo.ipynb` at [this link](https://github.com/jy0205/Pyramid-Flow/blob/main/video_generation_demo.ipynb). We further simplify it into the following two-step procedure. First, load the downloaded model:
36
+
37
+ ```python
38
+ import torch
39
+ from PIL import Image
40
+ from pyramid_dit import PyramidDiTForVideoGeneration
41
+ from diffusers.utils import load_image, export_to_video
42
+
43
+ torch.cuda.set_device(0)
44
+ model_dtype, torch_dtype = 'bf16', torch.bfloat16 # Use bf16, fp16 or fp32
45
+
46
+ model = PyramidDiTForVideoGeneration(
47
+     '/home/jinyang06/models/pyramid-flow', # The downloaded checkpoint dir
48
+     model_dtype,
49
+     model_variant='diffusion_transformer_768p', # 'diffusion_transformer_384p'
50
+ )
51
+
52
+ model.vae.to("cuda")
53
+ model.dit.to("cuda")
54
+ model.text_encoder.to("cuda")
55
+ model.vae.enable_tiling()
56
+ ```
57
+
58
+ Then, you can try text-to-video generation on your own prompts:
59
+
60
+ ```python
61
+ prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"
62
+
63
+ with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
64
+     frames = model.generate(
65
+         prompt=prompt,
66
+         num_inference_steps=[20, 20, 20],
67
+         video_num_inference_steps=[10, 10, 10],
68
+         height=768,
69
+         width=1280,
70
+         temp=16, # temp=16: 5s, temp=31: 10s
71
+         guidance_scale=9.0, # The guidance for the first frame
72
+         video_guidance_scale=5.0, # The guidance for the other video latent
73
+         output_type="pil",
74
+     )
75
+
76
+ export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
77
+ ```
78
+
79
+ Because it is autoregressive, our model also supports (text-conditioned) image-to-video generation:
80
+
81
+ ```python
82
+ image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((1280, 768))
83
+ prompt = "FPV flying over the Great Wall"
84
+
85
+ with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
86
+     frames = model.generate_i2v(
87
+         prompt=prompt,
88
+         input_image=image,
89
+         num_inference_steps=[10, 10, 10],
90
+         temp=16,
91
+         video_guidance_scale=4.0,
92
+         output_type="pil",
93
+     )
94
+
95
+ export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
96
+ ```
97
+
98
+ Usage tips:
99
+
100
+ * The `guidance_scale` parameter controls the visual quality. We suggest a guidance scale within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.
101
+ * The `video_guidance_scale` parameter controls the motion. A larger value increases the dynamic degree and mitigates the degradation of autoregressive generation, while a smaller value stabilizes the video.
102
+ * For 10-second video generation, we recommend using a guidance scale of 7 and a video guidance scale of 5.
103
+
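The tips above can be collected into a small lookup, sketched here under stated assumptions. `GenerationSettings` and `suggest_settings` are hypothetical names, not part of `pyramid_dit`. The 5-second 768p values mirror the text-to-video example, and the 5-second `video_guidance_scale` for 384p is an assumption because the README gives no 384p-specific value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    temp: int
    guidance_scale: float
    video_guidance_scale: float


# From the README: temp=16 gives a 5-second video, temp=31 a 10-second video.
TEMP_FOR_SECONDS = {5: 16, 10: 31}


def suggest_settings(variant: str, seconds: int = 5) -> GenerationSettings:
    """Return text-to-video settings suggested by the usage tips (hypothetical helper)."""
    if variant not in ("768p", "384p"):
        raise ValueError(f"unknown variant: {variant!r}")
    if seconds not in TEMP_FOR_SECONDS:
        raise ValueError("only 5- and 10-second settings are documented")
    if variant == "384p" and seconds != 5:
        raise ValueError("the 384p checkpoint supports 5-second generation only")

    if seconds == 10:
        # Recommended pair for 10-second generation.
        guidance, video_guidance = 7.0, 5.0
    elif variant == "768p":
        # Upper end of the suggested [7, 9] range, as in the text-to-video example.
        guidance, video_guidance = 9.0, 5.0
    else:
        # 384p: guidance 7 is suggested; video guidance 5.0 is an assumption.
        guidance, video_guidance = 7.0, 5.0

    return GenerationSettings(TEMP_FOR_SECONDS[seconds], guidance, video_guidance)
```

The fields map directly onto the `temp`, `guidance_scale` and `video_guidance_scale` arguments of `model.generate`; treat them as starting points to tune per prompt.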
104
+ ## Gallery
105
+
106
+ The following video examples are generated at 5s, 768p, 24fps. For more results, please visit our [project page](https://pyramid-flow.github.io).
107
+
108
+ <table class="center" border="0" style="width: 100%; text-align: left;">
109
+ <tr>
110
+ <td><video src="https://pyramid-flow.github.io/static/videos/t2v/tokyo.mp4" autoplay muted loop playsinline></video></td>
111
+ <td><video src="https://pyramid-flow.github.io/static/videos/t2v/eiffel.mp4" autoplay muted loop playsinline></video></td>
112
+ </tr>
113
+ <tr>
114
+ <td><video src="https://pyramid-flow.github.io/static/videos/t2v/waves.mp4" autoplay muted loop playsinline></video></td>
115
+ <td><video src="https://pyramid-flow.github.io/static/videos/t2v/rail.mp4" autoplay muted loop playsinline></video></td>
116
+ </tr>
117
+ </table>
118
+
119
+ ## Acknowledgement
120
+
121
+ We are grateful to the following awesome projects, which we drew on when implementing Pyramid Flow:
122
+
123
+ * [SD3 Medium](https://huggingface.co/stabilityai/stable-diffusion-3-medium) and [Flux 1.0](https://huggingface.co/black-forest-labs/FLUX.1-dev): State-of-the-art image generation models based on flow matching.
124
+ * [Diffusion Forcing](https://boyuan.space/diffusion-forcing) and [GameNGen](https://gamengen.github.io): Next-token prediction meets full-sequence diffusion.
125
+ * [WebVid-10M](https://github.com/m-bain/webvid), [OpenVid-1M](https://github.com/NJU-PCALab/OpenVid-1M) and [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan): Large-scale datasets for text-to-video generation.
126
+ * [CogVideoX](https://github.com/THUDM/CogVideo): An open-source text-to-video generation model that shares many training details.
127
+ * [Video-LLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2): An open-source video LLM for our video recaptioning.
128
+
129
+ ## Citation
130
+
131
+ Consider giving this repository a star and citing Pyramid Flow in your publications if it helps your research.
132
+ ```
133
+ @article{jin2024pyramidal,
134
+ title={Pyramidal Flow Matching for Efficient Video Generative Modeling},
135
+ author={Jin, Yang and Sun, Zhicheng and Li, Ningyuan and Xu, Kun and Xu, Kun and Jiang, Hao and Zhuang, Nan and Huang, Quzhe and Song, Yang and Mu, Yadong and Lin, Zhouchen},
136
+ journal={arXiv preprint arXiv:2410.XXXXX},
137
+ year={2024}
138
+ }
139
+ ```