Update model card: Add license and pipeline_tag, improve paper links
This PR enhances the model card with critical metadata and improves external links for better clarity and discoverability on the Hugging Face Hub.
* Adds `license: apache-2.0` to the YAML metadata, explicitly stating the model's license.
* Adds `pipeline_tag: image-to-video` to the YAML metadata, accurately reflecting the model's core functionality (image-to-video generation, I2V) for improved search and filtering (see the sketch below). The `library_name` tag is omitted, as the model's roadmap indicates that `diffusers` integration is still pending.
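To illustrate the effect of the new `pipeline_tag`, here is a minimal sketch of a task-filtered Hub query that would now surface this model. It assumes a recent `huggingface_hub` release in which `list_models` accepts `pipeline_tag` and `search`; it is an illustration, not part of the PR itself.

```python
# Minimal sketch (assumes a recent huggingface_hub release): models whose card
# metadata declares `pipeline_tag: image-to-video` appear in task-filtered listings.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(pipeline_tag="image-to-video", search="MUG-V", limit=5):
    print(model.id, model.pipeline_tag)
```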
Content updates include:
* Correcting the BibTeX `journal` field from `arXiv preprint` to `arXiv preprint arXiv:2510.17519` for full accuracy.
* Updating the `[technical report](#)` link in the 'Latest News' section to point to the correct arXiv paper: `https://arxiv.org/abs/2510.17519`.
* The existing arXiv badge link at the top has been preserved, as per the guidelines.
These changes ensure the model card is more informative and easier to navigate for users.
````diff
@@ -1,3 +1,12 @@
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
+
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
 # MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
 
 <div align="center">
@@ -31,7 +40,7 @@ To our knowledge, this is the first publicly available large-scale video-generat
 
 ## 🔥 Latest News
 
-* Oct. 21, 2025: 🎉 We are excited to announce the release of the **MUG-V 10B** [technical report](#). We welcome feedback and discussions.
+* Oct. 21, 2025: 🎉 We are excited to announce the release of the **MUG-V 10B** [technical report](https://arxiv.org/abs/2510.17519). We welcome feedback and discussions.
 * Oct. 21, 2025: 🎉 We've released Megatron-LM–based [training framework](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) addressing the key challenges of training billion-parameter video generators.
 * Oct. 21, 2025: 🎉 We've released **MUG-V video enhancement** [inference code](https://github.com/Shopee-MUG/MUG-V/tree/main/mug_enhancer) and [weights](https://huggingface.co/MUG-V/MUG-V-inference) (based on WAN-2.1 1.3B).
 * Oct. 21, 2025: 🎉 We've released **MUG-V 10B** ([e-commerce edition](https://github.com/Shopee-MUG/MUG-V)) inference code and weights.
@@ -233,64 +242,64 @@ MUGDiT adopts the latent diffusion transformer paradigm with rectified flow matc
 
 #### Core Components
 
-[previous six-item "Core Components" list, old lines 236-293]
+1. **VideoVAE**: 8×8×8 spatiotemporal compression
+   - Encoder: 3D convolutions + temporal attention
+   - Decoder: 3D transposed convolutions + temporal upsampling
+   - KL regularization for stable latent space
+
+2. **3D Patch Embedding**: Converts video latents to tokens
+   - Patch size: 2×2×2 (non-overlapping)
+   - Final compression: ~2048× vs. pixel space
+
+3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
+   - Extends 2D RoPE to handle temporal dimension
+   - Frequency-based encoding for spatiotemporal modeling
+
+4. **Conditioning Modules**:
+   - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
+   - **Timestep Embedder**: Embeds diffusion timestep via sinusoidal encoding
+   - **Size Embedder**: Handles variable resolution inputs
+
+5. **MUGDiT Transformer Block**:
+
+   ```mermaid
+   graph LR
+   A[Input] --> B[AdaLN]
+   B --> C[Self-Attn<br/>QK-Norm]
+   C --> D[Gate]
+   D --> E1[+]
+   A --> E1
+
+   E1 --> F[LayerNorm]
+   F --> G[Cross-Attn<br/>QK-Norm]
+   G --> E2[+]
+   E1 --> E2
+
+   E2 --> I[AdaLN]
+   I --> J[MLP]
+   J --> K[Gate]
+   K --> E3[+]
+   E2 --> E3
+
+   E3 --> L[Output]
+
+   M[Timestep<br/>Size Info] -.-> B
+   M -.-> I
+
+   N[Text] -.-> G
+
+   style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
+   style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
+   style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
+   style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+   style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+   style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+   ```
+
+6. **Rectified Flow Scheduler**:
+   - More stable training than DDPM
+   - Logit-normal timestep sampling
+   - Linear interpolation between noise and data
 
 ## Citation
 If you find our work helpful, please cite us.
@@ -299,7 +308,7 @@ If you find our work helpful, please cite us.
 @article{mug-v2025,
   title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
   author={Yongshun Zhang and Zhongyi Fan and Yonghang Zhang and Zhangzikang Li and Weifeng Chen and Zhongwei Feng and Chaoyue Wang and Peng Hou and Anxiang Zeng},
-  journal = {arXiv preprint},
+  journal = {arXiv preprint arXiv:2510.17519},
   year={2025}
 }
 ```
@@ -313,7 +322,4 @@ This project is licensed under the Apache License 2.0 - see the [LICENSE](https:
 
 ## Acknowledgements
 
-We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
-
-
-
+We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
````
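The new 'Core Components' section in the diff above describes the Rectified Flow Scheduler only at the bullet level (logit-normal timestep sampling, linear interpolation between noise and data). As a rough, generic sketch of that standard formulation, not the repository's implementation (the `rectified_flow_loss` helper and its `model` argument are hypothetical):

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Generic rectified-flow training objective: sample a logit-normal timestep,
    linearly interpolate between clean latents and Gaussian noise, and regress
    the constant velocity of that straight path."""
    b = x0.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(n), n ~ N(0, 1)
    t = torch.sigmoid(torch.randn(b, device=x0.device)).view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    # Linear interpolation between data and noise
    x_t = (1.0 - t) * x0 + t * noise
    # Velocity target for the straight noise-to-data path
    target = noise - x0
    pred = model(x_t, t.flatten())
    return F.mse_loss(pred, target)
```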