Commit db56120 (verified), committed by ajayj · 1 parent: 83359d2

Update Github repository link (#24)

- Update Github repository link (b02b888bcb96c36a9695184af62939f74bf60dde)

Files changed (1)
  1. README.md +21 -22
README.md CHANGED
@@ -25,8 +25,8 @@ Clone the repository and install it in editable mode:
  Install using [uv](https://github.com/astral-sh/uv):
 
  ```bash
- git clone https://github.com/genmoai/models
- cd models
  pip install uv
  uv venv .venv
  source .venv/bin/activate
@@ -53,6 +53,25 @@ python3 -m mochi_preview.infer --prompt "A hand with delicate fingers picks up a
 
  Replace `<path_to_model_directory>` with the path to your model directory.
 
  ## Running with Diffusers
 
  Install the latest version of Diffusers
@@ -105,26 +124,6 @@ export_to_video(frames, "mochi.mp4", fps=30)
 
  To learn more check out the [Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi) documentation
 
- ## Model Architecture
-
- Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture.
-
- Alongside Mochi, we are open-sourcing our video VAE. Our VAE causally compresses videos to a 96x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.
-
- An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.
- Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.
-
- ## Hardware Requirements
-
- Mochi 1 supports a variety of hardware platforms depending on quantization level, ranging from a single 3090 GPU up to multiple H100 GPUs.
-
- ## Safety
- Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and careful consideration before deploying these model weights in any commercial services or products.
-
- ## Limitations
- Under the research preview, Mochi 1 is a living and evolving checkpoint. There are a few known limitations. The initial release generates videos at 480p today. In some edge cases with extreme motion, minor warping and distortions can also occur. Mochi 1 is also optimized for photorealistic styles so does not perform well with animated content. We also anticipate that the community will fine-tune the model to suit various aesthetic preferences.
-
-
  ## BibTeX
  ```
  @misc{genmo2024mochi,
 
  Install using [uv](https://github.com/astral-sh/uv):
 
  ```bash
+ git clone https://github.com/genmoai/mochi
+ cd mochi
  pip install uv
  uv venv .venv
  source .venv/bin/activate
 
 
  Replace `<path_to_model_directory>` with the path to your model directory.
 
+ ## Model Architecture
+
+ Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture.
+
+ Alongside Mochi, we are open-sourcing our video VAE. Our VAE causally compresses videos to a 96x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.
+
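The 96x figure in the VAE paragraph can be checked with a little arithmetic. The sketch below is only an illustration, not code from the repository: it assumes the input is 3-channel RGB video (the README does not say so explicitly) and counts how many input values are folded into each 12-channel latent position.

```python
# Factors taken from the README text; INPUT_CHANNELS is an assumption (RGB video).
SPATIAL_FACTOR = 8      # downsampling applied along height and along width
TEMPORAL_FACTOR = 6     # downsampling applied along time
INPUT_CHANNELS = 3      # assumed RGB input, not stated in the README
LATENT_CHANNELS = 12    # channels per latent position


def compression_ratio(spatial=SPATIAL_FACTOR, temporal=TEMPORAL_FACTOR,
                      in_channels=INPUT_CHANNELS, latent_channels=LATENT_CHANNELS):
    """Ratio of input values to latent values for one spatio-temporal block."""
    # One latent position covers an (8 x 8) pixel patch across 6 frames.
    input_values = spatial * spatial * temporal * in_channels
    return input_values / latent_channels


print(compression_ratio())  # 8 * 8 * 6 * 3 / 12 = 96.0
```

Under the RGB assumption, the result matches the 96x size reduction quoted in the README.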
+ An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.
+ Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.
+
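To make the "non-square QKV and output projection layers" idea concrete, here is a minimal shape-tracking sketch. It is not the Mochi implementation: the function names, the token counts, and the widths in the usage example are hypothetical, chosen only so that the text stream is narrower than the visual stream. It shows how two streams of different widths can be projected to a shared attention width, attended over jointly, and projected back to their own widths.

```python
def matmul_shape(a, b):
    """Shape of a @ b for two 2-D shapes; raises if the inner dimensions differ."""
    if a[1] != b[0]:
        raise ValueError(f"shape mismatch: {a} @ {b}")
    return (a[0], b[1])


def joint_attention_shapes(n_text, n_visual, d_text, d_visual, d_attn):
    """Track tensor shapes through one joint self-attention layer (illustrative)."""
    # Non-square QKV projections: each stream maps its own width to 3 * d_attn.
    qkv_text = matmul_shape((n_text, d_text), (d_text, 3 * d_attn))
    qkv_visual = matmul_shape((n_visual, d_visual), (d_visual, 3 * d_attn))

    # Both streams now share the attention width, so their tokens can be
    # concatenated into one sequence. Each of Q, K and V has this shape, and
    # so does the attention output.
    joint = (n_text + n_visual, d_attn)

    # Non-square output projections: split the sequence back into the two
    # streams and return each one to its own width.
    out_text = matmul_shape((n_text, d_attn), (d_attn, d_text))
    out_visual = matmul_shape((n_visual, d_attn), (d_attn, d_visual))

    return {
        "qkv_text": qkv_text,
        "qkv_visual": qkv_visual,
        "joint": joint,
        "out_text": out_text,
        "out_visual": out_visual,
    }


# Hypothetical sizes: 256 text tokens, 1024 visual tokens, text width 1536,
# visual width 3072, and attention running at the visual width.
shapes = joint_attention_shapes(256, 1024, 1536, 3072, 3072)
for name, shape in shapes.items():
    print(name, shape)
```

In this sketch the text-side projection matrices are the non-square ones (1536 by 9216 going in, 3072 by 1536 coming out), which is what allows the text stream to stay narrow while still taking part in the same attention operation as the visual tokens.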
+ ## Hardware Requirements
+
+ Mochi 1 supports a variety of hardware platforms depending on quantization level, ranging from a single 3090 GPU up to multiple H100 GPUs.
+
+ ## Safety
+ Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and careful consideration before deploying these model weights in any commercial services or products.
+
+ ## Limitations
+ Under the research preview, Mochi 1 is a living and evolving checkpoint. There are a few known limitations. The initial release generates videos at 480p today. In some edge cases with extreme motion, minor warping and distortions can also occur. Mochi 1 is also optimized for photorealistic styles so does not perform well with animated content. We also anticipate that the community will fine-tune the model to suit various aesthetic preferences.
+
  ## Running with Diffusers
 
  Install the latest version of Diffusers
 
 
  To learn more check out the [Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi) documentation
 
  ## BibTeX
  ```
  @misc{genmo2024mochi,