Add comprehensive model card for Many-for-Many unified generation model
This PR adds a comprehensive model card for the Many-for-Many model.
It links the model to its paper: [Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks](https://huggingface.co/papers/2506.01758).
It also adds essential metadata, including:
* `pipeline_tag: any-to-any`, reflecting its capability across various image and video generation and manipulation tasks.
* `library_name: diffusers`, as the model is built upon the Diffusers framework.
* `license: apache-2.0`.
Additionally, the PR provides links to the project page and the GitHub repository, along with a basic Python usage example to help users get started.
README.md (added):
---
pipeline_tag: any-to-any
library_name: diffusers
license: apache-2.0
---

# Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

<div align="center">
<img src="https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/MfM_logo.jpeg" alt="MfM-logo" width="50%">
</div>

**Many-for-Many (MfM)** is a unified framework introduced in the paper [Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks](https://huggingface.co/papers/2506.01758). The framework leverages the available training data of many different visual generation and manipulation tasks to train a single model for all of those tasks.

MfM utilizes a lightweight adapter to unify the diverse conditions of the different tasks and employs a joint image-video learning strategy for progressive training from scratch. This yields a unified visual generation and manipulation model with improved video generation performance. The model also integrates depth maps as a condition to enhance its perception of 3D space in visual generation.
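
As a rough illustration of the "lightweight adapter" idea described above (this is not the actual MfM implementation; the class name, dimensions, and task list below are made up for the sketch), such an adapter can be thought of as a small per-task projection that maps condition features into the token space consumed by the shared diffusion backbone:

```python
# Minimal, illustrative sketch of a lightweight condition adapter.
# Names, shapes, and the task list are assumptions, not the MfM architecture.
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    def __init__(self, cond_dim: int, hidden_dim: int, tasks=("t2v", "i2v", "v2v")):
        super().__init__()
        # One small projection per task maps that task's condition features
        # (e.g. text, reference-image, or depth-map embeddings) into the
        # hidden size expected by the shared backbone.
        self.proj = nn.ModuleDict({t: nn.Linear(cond_dim, hidden_dim) for t in tasks})

    def forward(self, cond_tokens: torch.Tensor, task: str) -> torch.Tensor:
        # cond_tokens: (batch, num_tokens, cond_dim) condition embeddings for `task`
        return self.proj[task](cond_tokens)

# Example: project 77 text-condition tokens for the text-to-video task.
adapter = ConditionAdapter(cond_dim=1024, hidden_dim=3072)
tokens = adapter(torch.randn(1, 77, 1024), task="t2v")
print(tokens.shape)  # torch.Size([1, 77, 3072])
```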

Two versions of the model (8B and 2B parameters) are available, each capable of performing more than 10 different tasks, including text-to-video (T2V), image-to-video (I2V), video-to-video (V2V), and various image and video manipulation tasks. The 8B model demonstrates highly competitive performance in video generation.

* **Paper:** [Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks](https://huggingface.co/papers/2506.01758)
* **Project Page:** [https://leeruibin.github.io/MfMPage/](https://leeruibin.github.io/MfMPage/)
* **Code:** [https://github.com/SandAI-org/MAGI-1](https://github.com/SandAI-org/MAGI-1)

## Visual Results

<img src='https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/visual_result.png'>

## Demo Video

<div align="center">
<video src="https://github.com/user-attachments/assets/f1ddd1fd-1c2b-44e7-94dc-9f62963ab147" width="70%" controls> </video>
</div>

## Architecture

<img src='https://huggingface.co/LetsThink/MfM-Pipeline-8B/resolve/main/assets/arch.png'>

## Usage

You can load the model with the `diffusers` library and perform various generation tasks.

First, ensure the necessary requirements are installed:

```bash
pip install -r requirements.txt
```

Then, download the pipeline from the Hugging Face Hub and use it for inference:

```python
from huggingface_hub import snapshot_download
from diffusers import DiffusionPipeline
import torch
import os

# Define a local directory for the model download
local_dir = "./MfM-Pipeline-8B"

# Download the pipeline from the Hugging Face Hub
# (use "LetsThink/MfM-Pipeline-2B" for the 2B version)
snapshot_download(repo_id="LetsThink/MfM-Pipeline-8B", local_dir=local_dir)

# Load the pipeline. MfMPipeline is a custom class, so trust_remote_code=True is required.
pipe = DiffusionPipeline.from_pretrained(local_dir, torch_dtype=torch.float16, trust_remote_code=True)
pipe.to("cuda")  # or your preferred device, e.g. "cpu"

# Example: text-to-video generation (task="t2v")
prompt = "A majestic eagle flying over snow-capped mountains."
output_dir = "outputs"
task = "t2v"  # the model supports multiple tasks such as "t2v", "i2v", "i2i", etc.

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Run inference.
# Parameters such as num_frames, num_inference_steps, guidance_scale, and motion_score
# are crucial and may vary per task; refer to the official GitHub repository for
# recommended values and detailed usage of the different tasks.
# Note: the reference script infer_mfm_pipeline.py reads t2v prompts from a file
# (t2v_inputs); here the prompt is passed directly, so the call may need to be
# adapted for full functionality.
video_frames = pipe(
    prompt=prompt,
    task=task,
    crop_type="keep_res",
    num_inference_steps=30,
    guidance_scale=9,
    motion_score=5,
    num_samples=1,
    upscale=4,
    noise_aug_strength=0.0,
).images[0]  # the pipeline returns a list of generated results; take the first one

# You can save the video frames as a GIF or MP4 using libraries such as imageio or moviepy.
# Example using imageio (install with: pip install imageio imageio-ffmpeg):
# import imageio
# output_video_path = os.path.join(output_dir, "generated_video.mp4")
# imageio.mimsave(output_video_path, video_frames, fps=8)
# print(f"Generated video saved to {output_video_path}")
```
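
The same pipeline is intended to cover the other tasks (image-to-video, video-to-video, and various manipulation tasks). A rough sketch of an image-to-video call is shown below; the `image` argument name and input handling are assumptions for illustration only, not the verified MfM API, so check the inference scripts in the official repository for the exact per-task inputs.

```python
# Hypothetical image-to-video ("i2v") call.
# The `image` argument name is an assumption for illustration; consult the
# official inference scripts for the actual per-task input arguments.
from PIL import Image

ref_image = Image.open("reference.png")  # your conditioning image
i2v_frames = pipe(
    prompt="The scene slowly comes to life with gentle camera motion.",
    task="i2v",
    image=ref_image,  # assumed name of the image-condition argument
    crop_type="keep_res",
    num_inference_steps=30,
    guidance_scale=9,
    motion_score=5,
    num_samples=1,
).images[0]
```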

## Citation

If you find our code or model useful in your research, please cite:

```bibtex
@article{yang2025MfM,
  title={Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks},
  author={Tao Yang and Ruibin Li and Yangming Shi and Yuqi Zhang and Qide Dong and Haoran Cheng and Weiguo Feng and Shilei Wen and Bingyue Peng and Lei Zhang},
  journal={arXiv preprint arXiv:2506.01758},
  year={2025},
}
```