Update model card: Add license and pipeline_tag, improve paper links
This PR enhances the model card with critical metadata and improves external links for better clarity and discoverability on the Hugging Face Hub.
* Adds `license: apache-2.0` to the YAML metadata, explicitly stating the model's license.
* Adds `pipeline_tag: image-to-video` to the YAML metadata, accurately reflecting the model's core functionality (image-to-video generation, I2V) for improved search and filtering (see the sketch below). The `library_name` tag is omitted, as the model's roadmap indicates that `diffusers` integration is still pending.
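To illustrate the effect of the new `pipeline_tag`, here is a minimal sketch of a task-filtered Hub query that would now surface this model. It assumes a recent `huggingface_hub` release in which `list_models` accepts `pipeline_tag` and `search`; it is an illustration, not part of the PR itself.

```python
# Minimal sketch (assumes a recent huggingface_hub release): models whose card
# metadata declares `pipeline_tag: image-to-video` appear in task-filtered listings.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(pipeline_tag="image-to-video", search="MUG-V", limit=5):
    print(model.id, model.pipeline_tag)
```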
Content updates include:
* Correcting the BibTeX `journal` field from `arXiv preprint` to `arXiv preprint arXiv:2510.17519` for full accuracy.
* Updating the `[technical report](#)` link in the 'Latest News' section to point to the correct arXiv paper: `https://arxiv.org/abs/2510.17519`.
* The existing arXiv badge link at the top has been preserved, as per the guidelines.
These changes ensure the model card is more informative and easier to navigate for users.
````diff
@@ -1,3 +1,12 @@
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
+
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
 # MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
 
 <div align="center">
@@ -31,7 +40,7 @@ To our knowledge, this is the first publicly available large-scale video-generat
 
 ## 🔥 Latest News
 
-* Oct. 21, 2025: 🎉 We are excited to announce the release of the **MUG-V 10B** [technical report](#). We welcome feedback and discussions.
+* Oct. 21, 2025: 🎉 We are excited to announce the release of the **MUG-V 10B** [technical report](https://arxiv.org/abs/2510.17519). We welcome feedback and discussions.
 * Oct. 21, 2025: 🎉 We've released Megatron-LM–based [training framework](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) addressing the key challenges of training billion-parameter video generators.
 * Oct. 21, 2025: 🎉 We've released **MUG-V video enhancement** [inference code](https://github.com/Shopee-MUG/MUG-V/tree/main/mug_enhancer) and [weights](https://huggingface.co/MUG-V/MUG-V-inference) (based on WAN-2.1 1.3B).
 * Oct. 21, 2025: 🎉 We've released **MUG-V 10B** ([e-commerce edition](https://github.com/Shopee-MUG/MUG-V)) inference code and weights.
@@ -233,64 +242,64 @@ MUGDiT adopts the latent diffusion transformer paradigm with rectified flow matc
 
 #### Core Components
 
-[previous six-item "Core Components" list, old lines 236-293]
+1. **VideoVAE**: 8×8×8 spatiotemporal compression
+   - Encoder: 3D convolutions + temporal attention
+   - Decoder: 3D transposed convolutions + temporal upsampling
+   - KL regularization for stable latent space
+
+2. **3D Patch Embedding**: Converts video latents to tokens
+   - Patch size: 2×2×2 (non-overlapping)
+   - Final compression: ~2048× vs. pixel space
+
+3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
+   - Extends 2D RoPE to handle temporal dimension
+   - Frequency-based encoding for spatiotemporal modeling
+
+4. **Conditioning Modules**:
+   - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
+   - **Timestep Embedder**: Embeds diffusion timestep via sinusoidal encoding
+   - **Size Embedder**: Handles variable resolution inputs
+
+5. **MUGDiT Transformer Block**:
+
+   ```mermaid
+   graph LR
+   A[Input] --> B[AdaLN]
+   B --> C[Self-Attn<br/>QK-Norm]
+   C --> D[Gate]
+   D --> E1[+]
+   A --> E1
+
+   E1 --> F[LayerNorm]
+   F --> G[Cross-Attn<br/>QK-Norm]
+   G --> E2[+]
+   E1 --> E2
+
+   E2 --> I[AdaLN]
+   I --> J[MLP]
+   J --> K[Gate]
+   K --> E3[+]
+   E2 --> E3
+
+   E3 --> L[Output]
+
+   M[Timestep<br/>Size Info] -.-> B
+   M -.-> I
+
+   N[Text] -.-> G
+
+   style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
+   style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
+   style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
+   style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+   style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+   style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+   ```
+
+6. **Rectified Flow Scheduler**:
+   - More stable training than DDPM
+   - Logit-normal timestep sampling
+   - Linear interpolation between noise and data
 
 ## Citation
 If you find our work helpful, please cite us.
@@ -299,7 +308,7 @@ If you find our work helpful, please cite us.
 @article{mug-v2025,
   title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
   author={Yongshun Zhang and Zhongyi Fan and Yonghang Zhang and Zhangzikang Li and Weifeng Chen and Zhongwei Feng and Chaoyue Wang and Peng Hou and Anxiang Zeng},
-  journal = {arXiv preprint},
+  journal = {arXiv preprint arXiv:2510.17519},
   year={2025}
 }
 ```
@@ -313,7 +322,4 @@ This project is licensed under the Apache License 2.0 - see the [LICENSE](https:
 
 ## Acknowledgements
 
-We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
-
-
-
+We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
````
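The new 'Core Components' section in the diff above describes the Rectified Flow Scheduler only at the bullet level (logit-normal timestep sampling, linear interpolation between noise and data). As a rough, generic sketch of that standard formulation, not the repository's implementation (the `rectified_flow_loss` helper and its `model` argument are hypothetical):

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Generic rectified-flow training objective: sample a logit-normal timestep,
    linearly interpolate between clean latents and Gaussian noise, and regress
    the constant velocity of that straight path."""
    b = x0.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(n), n ~ N(0, 1)
    t = torch.sigmoid(torch.randn(b, device=x0.device)).view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    # Linear interpolation between data and noise
    x_t = (1.0 - t) * x0 + t * noise
    # Velocity target for the straight noise-to-data path
    target = noise - x0
    pred = model(x_t, t.flatten())
    return F.mse_loss(pred, target)
```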