Image-to-Video
nielsr HF Staff committed on
Commit ffa80c7 · verified · 1 Parent(s): e51dd67

Update model card: Add license and pipeline_tag, improve paper links


This PR enhances the model card with critical metadata and improves external links for better clarity and discoverability on the Hugging Face Hub.

* Adds `license: apache-2.0` to the YAML metadata, explicitly stating the model's license.
* Adds `pipeline_tag: image-to-video` to the YAML metadata, accurately reflecting the model's core functionality (Image-to-Video generation, I2V) for improved search and filtering. The `library_name` tag is omitted, as the model's roadmap indicates that `diffusers` integration is still pending.

Content updates include:
* Correcting the BibTeX `journal` field from `arXiv preprint` to `arXiv preprint arXiv:2510.17519` for full accuracy.
* Updating the `[technical report](#)` link in the 'Latest News' section to point to the correct arXiv paper: `https://arxiv.org/abs/2510.17519`.
* The existing arXiv badge link at the top has been preserved, as per the guidelines.

These changes ensure the model card is more informative and easier to navigate for users.
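As a quick illustration of why the `pipeline_tag` metadata matters for discoverability, the sketch below filters Hub models by that tag. It assumes a reasonably recent `huggingface_hub` release in which `list_models` accepts `pipeline_tag` and `author` arguments; adjust to your installed version.

```python
# Sketch: discoverability enabled by the pipeline_tag metadata.
# Assumes a recent huggingface_hub release where list_models()
# accepts pipeline_tag/author filter arguments.
from huggingface_hub import HfApi

api = HfApi()

# List image-to-video models published under the MUG-V organization.
for model in api.list_models(pipeline_tag="image-to-video", author="MUG-V"):
    print(model.id)
```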

Files changed (1)
  1. README.md +70 -64
README.md CHANGED
@@ -1,3 +1,12 @@
 # MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
 
 <div align="center">
@@ -31,7 +40,7 @@ To our knowledge, this is the first publicly available large-scale video-generat
 
 ## 🔥 Latest News
 
- * Oct. 21, 2025: 👋 We are excited to announce the release of the **MUG-V 10B** [technical report](#). We welcome feedback and discussions.
 * Oct. 21, 2025: 👋 We've released Megatron-LM–based [training framework](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) addressing the key challenges of training billion-parameter video generators.
 * Oct. 21, 2025: 👋 We've released **MUG-V video enhancement** [inference code](https://github.com/Shopee-MUG/MUG-V/tree/main/mug_enhancer) and [weights](https://huggingface.co/MUG-V/MUG-V-inference) (based on WAN-2.1 1.3B).
 * Oct. 21, 2025: 👋 We've released **MUG-V 10B** ([e-commerce edition](https://github.com/Shopee-MUG/MUG-V)) inference code and weights.
@@ -233,64 +242,64 @@ MUGDiT adopts the latent diffusion transformer paradigm with rectified flow matc
 
 #### Core Components
 
- 1. **VideoVAE**: 8×8×8 spatiotemporal compression
- - Encoder: 3D convolutions + temporal attention
- - Decoder: 3D transposed convolutions + temporal upsampling
- - KL regularization for stable latent space
-
- 2. **3D Patch Embedding**: Converts video latents to tokens
- - Patch size: 2×2×2 (non-overlapping)
- - Final compression: ~2048× vs. pixel space
-
- 3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
- - Extends 2D RoPE to handle temporal dimension
- - Frequency-based encoding for spatiotemporal modeling
-
- 4. **Conditioning Modules**:
- - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
- - **Timestep Embedder**: Embeds diffusion timestep via sinusoidal encoding
- - **Size Embedder**: Handles variable resolution inputs
-
- 5. **MUGDiT Transformer Block**:
-
- ```mermaid
- graph LR
- A[Input] --> B[AdaLN]
- B --> C[Self-Attn<br/>QK-Norm]
- C --> D[Gate]
- D --> E1[+]
- A --> E1
-
- E1 --> F[LayerNorm]
- F --> G[Cross-Attn<br/>QK-Norm]
- G --> E2[+]
- E1 --> E2
-
- E2 --> I[AdaLN]
- I --> J[MLP]
- J --> K[Gate]
- K --> E3[+]
- E2 --> E3
-
- E3 --> L[Output]
-
- M[Timestep<br/>Size Info] -.-> B
- M -.-> I
-
- N[Text] -.-> G
-
- style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
- style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
- style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
- style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
- style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
- style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
- ```
-
- 6. **Rectified Flow Scheduler**:
- - More stable training than DDPM
- - Logit-normal timestep sampling
- - Linear interpolation between noise and data
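For readers who prefer code to diagrams, the following is a minimal PyTorch-style sketch of the block wiring in the mermaid diagram above: AdaLN modulation from the timestep/size conditioning, QK-normalized self- and cross-attention, and gated residual branches. Module names, default dimensions, and the attention implementation are illustrative assumptions, not the released MUG-V code.

```python
# Illustrative sketch of the block wiring diagrammed above (not the MUG-V source).
# Assumed shapes: x is (batch, tokens, dim) video latent tokens, ctx is
# (batch, text_tokens, dim) projected caption embeddings, cond is (batch, dim)
# pooled timestep + size conditioning.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    """Multi-head attention with normalization applied to Q and K (QK-Norm)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q = self.to_q(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.to_kv(context).chunk(2, dim=-1)
        k = k.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)          # QK-Norm for attention stability
        out = F.scaled_dot_product_attention(q, k, v)  # (b, heads, n, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))


class DiTBlockSketch(nn.Module):
    """AdaLN -> self-attn -> gate -> residual; LN -> cross-attn -> residual; AdaLN -> MLP -> gate -> residual."""

    def __init__(self, dim: int = 1152, num_heads: int = 16, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = QKNormAttention(dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = QKNormAttention(dim, num_heads)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        # Shift/scale/gate for the two AdaLN-modulated branches (6 * dim total).
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, ctx, cond):
        s1, sc1, g1, s2, sc2, g2 = self.ada_ln(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.self_attn(h, h)   # gated self-attention residual
        x = x + self.cross_attn(self.norm2(x), ctx)      # text cross-attention residual
        h = self.norm3(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)            # gated MLP residual
        return x


if __name__ == "__main__":
    block = DiTBlockSketch()
    x = torch.randn(2, 256, 1152)     # video latent tokens
    ctx = torch.randn(2, 77, 1152)    # projected caption embeddings
    cond = torch.randn(2, 1152)       # pooled timestep + size conditioning
    print(block(x, ctx, cond).shape)  # torch.Size([2, 256, 1152])
```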
 
 ## Citation
 If you find our work helpful, please cite us.
@@ -299,7 +308,7 @@ If you find our work helpful, please cite us.
 @article{mug-v2025,
 title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
 author={Yongshun Zhang and Zhongyi Fan and Yonghang Zhang and Zhangzikang Li and Weifeng Chen and Zhongwei Feng and Chaoyue Wang and Peng Hou and Anxiang Zeng},
- journal = {arXiv preprint},
 year={2025}
 }
 ```
@@ -313,7 +322,4 @@ This project is licensed under the Apache License 2.0 - see the [LICENSE](https:
 
 ## Acknowledgements
 
- We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
-
-
-
 
+ ---
+ license: apache-2.0
+ pipeline_tag: image-to-video
+ ---
+
+ ---
+ license: apache-2.0
+ pipeline_tag: image-to-video
+ ---
 # MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
 
 <div align="center">
 
 
 ## 🔥 Latest News
 
+ * Oct. 21, 2025: 👋 We are excited to announce the release of the **MUG-V 10B** [technical report](https://arxiv.org/abs/2510.17519). We welcome feedback and discussions.
 * Oct. 21, 2025: 👋 We've released Megatron-LM–based [training framework](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) addressing the key challenges of training billion-parameter video generators.
 * Oct. 21, 2025: 👋 We've released **MUG-V video enhancement** [inference code](https://github.com/Shopee-MUG/MUG-V/tree/main/mug_enhancer) and [weights](https://huggingface.co/MUG-V/MUG-V-inference) (based on WAN-2.1 1.3B).
 * Oct. 21, 2025: 👋 We've released **MUG-V 10B** ([e-commerce edition](https://github.com/Shopee-MUG/MUG-V)) inference code and weights.
 
 
 #### Core Components
 
+ 1. **VideoVAE**: 8×8×8 spatiotemporal compression
+ - Encoder: 3D convolutions + temporal attention
+ - Decoder: 3D transposed convolutions + temporal upsampling
+ - KL regularization for stable latent space
+
+ 2. **3D Patch Embedding**: Converts video latents to tokens
+ - Patch size: 2×2×2 (non-overlapping)
+ - Final compression: ~2048× vs. pixel space
+
+ 3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
+ - Extends 2D RoPE to handle temporal dimension
+ - Frequency-based encoding for spatiotemporal modeling
+
+ 4. **Conditioning Modules**:
+ - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
+ - **Timestep Embedder**: Embeds diffusion timestep via sinusoidal encoding
+ - **Size Embedder**: Handles variable resolution inputs
+
+ 5. **MUGDiT Transformer Block**:
+
+ ```mermaid
+ graph LR
+ A[Input] --> B[AdaLN]
+ B --> C[Self-Attn<br/>QK-Norm]
+ C --> D[Gate]
+ D --> E1[+]
+ A --> E1
+
+ E1 --> F[LayerNorm]
+ F --> G[Cross-Attn<br/>QK-Norm]
+ G --> E2[+]
+ E1 --> E2
+
+ E2 --> I[AdaLN]
+ I --> J[MLP]
+ J --> K[Gate]
+ K --> E3[+]
+ E2 --> E3
+
+ E3 --> L[Output]
+
+ M[Timestep<br/>Size Info] -.-> B
+ M -.-> I
+
+ N[Text] -.-> G
+
+ style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
+ style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
+ style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
+ style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+ style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+ style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+ ```
+
+ 6. **Rectified Flow Scheduler**:
+ - More stable training than DDPM
+ - Logit-normal timestep sampling
+ - Linear interpolation between noise and data
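The rectified-flow details listed above (logit-normal timestep sampling, linear interpolation between noise and data) fit in a few lines of code. A minimal sketch, assuming the common velocity-prediction convention x_t = (1 − t)·x_0 + t·ε with target ε − x_0; the released MUG-V scheduler may use a different sign, scaling, or timestep-shift convention.

```python
# Sketch of a rectified-flow training step under assumed conventions
# (x_t = (1 - t) * x0 + t * eps, velocity target eps - x0); the actual
# MUG-V scheduler may differ in sign/scale or timestep shift.
import torch
import torch.nn.functional as F


def sample_logit_normal_t(batch: int, mean: float = 0.0, std: float = 1.0) -> torch.Tensor:
    """Logit-normal timestep sampling: t = sigmoid(n), n ~ N(mean, std)."""
    return torch.sigmoid(torch.randn(batch) * std + mean)


def rectified_flow_loss(model, x0: torch.Tensor, cond: dict) -> torch.Tensor:
    """One training step: interpolate noise and data linearly, regress the velocity."""
    b = x0.shape[0]
    t = sample_logit_normal_t(b).to(x0.device)       # (b,)
    t_ = t.view(b, *([1] * (x0.ndim - 1)))           # broadcast over latent dims
    eps = torch.randn_like(x0)
    x_t = (1.0 - t_) * x0 + t_ * eps                 # linear interpolation between data and noise
    target = eps - x0                                # constant velocity along the straight path
    pred = model(x_t, t, **cond)
    return F.mse_loss(pred, target)
```

At inference time the same straight path is integrated in reverse with an ODE solver (for example, plain Euler steps), starting from pure noise at t = 1.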
 
 ## Citation
 If you find our work helpful, please cite us.
 
 @article{mug-v2025,
 title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
 author={Yongshun Zhang and Zhongyi Fan and Yonghang Zhang and Zhangzikang Li and Weifeng Chen and Zhongwei Feng and Chaoyue Wang and Peng Hou and Anxiang Zeng},
+ journal = {arXiv preprint arXiv:2510.17519},
 year={2025}
 }
 ```
 
 
 ## Acknowledgements
 
+ We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.