README.md CHANGED
@@ -1,8 +1,7 @@
  ---
- pipeline_tag: image-to-video
- license: other
- license_name: stable-video-diffusion-nc-community
- license_link: LICENSE
  ---

  # Stable Video Diffusion Image-to-Video Model Card
@@ -11,8 +10,6 @@ license_link: LICENSE
  ![row01](output_tile.gif)
  Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.

- Please note: For commercial use, please refer to https://stability.ai/membership.
-
  ## Model Details

  ### Model Description
@@ -47,7 +44,7 @@ SVD-Image-to-Video is preferred by human voters in terms of video quality. For d

  ### Direct Use

- The model is intended for both non-commercial and commercial usage. You can use this model for non-commercial or research purposes under this [license](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/blob/main/LICENSE). Possible research areas and tasks include

  - Research on generative models.
  - Safe deployment of models which have the potential to generate harmful content.
@@ -55,8 +52,6 @@ The model is intended for both non-commercial and commercial usage. You can use
  - Generation of artworks and use in design and other artistic processes.
  - Applications in educational or creative tools.

- For commercial use, please refer to https://stability.ai/membership.
-
  Excluded uses are described below.

  ### Out-of-Scope Use
@@ -78,22 +73,11 @@ The model should not be used in any way that violates Stability AI's [Acceptable

  ### Recommendations

- The model is intended for both non-commercial and commercial usage.

  ## How to Get Started with the Model

  Check out https://github.com/Stability-AI/generative-models

- # Appendix:
-
- All considered potential data sources were included for final training, with none held out as the proposed data filtering methods described in the SVD paper handle the quality control/filtering of the dataset. With regards to safety/NSFW filtering, sources considered were either deemed safe or filtered with the in-house NSFW filters.
- No explicit human labor is involved in training data preparation. However, human evaluation for model outputs and quality was extensively used to evaluate model quality and performance. The evaluations were performed with third-party contractor platforms (Amazon Sagemaker, Amazon Mechanical Turk, Prolific) with fluent English-speaking contractors from various countries, primarily from the USA, UK, and Canada. Each worker was paid $12/hr for the time invested in the evaluation.
- No other third party was involved in the development of this model; the model was fully developed in-house at Stability AI.
- Training the SVD checkpoints required a total of approximately 200,000 A100 80GB hours. The majority of the training occurred on 48 * 8 A100s, while some stages took more/less than that. The resulting CO2 emission is ~19,000kg CO2 eq., and energy consumed is ~64000 kWh.
- The released checkpoints (SVD/SVD-XT) are image-to-video models that generate short videos/animations closely following the given input image. Since the model relies on an existing supplied image, the potential risks of disclosing specific material or novel unsafe content are minimal. This was also evaluated by third-party independent red-teaming services, which agree with our conclusion to a high degree of confidence (>90% in various areas of safety red-teaming). The external evaluations were also performed for trustworthiness, leading to >95% confidence in real, trustworthy videos.
- With the default settings at the time of release, SVD takes ~100s for generation, and SVD-XT takes ~180s on an A100 80GB card. Several optimizations to trade off quality / memory / speed can be done to perform faster inference or inference on lower VRAM cards.
- The information related to the model and its development process and usage protocols can be found in the GitHub repo, associated research paper, and HuggingFace model page/cards.
- The released model inference & demo code has image-level watermarking enabled by default, which can be used to detect the outputs. This is done via the imWatermark Python library.
- The model can be used to generate videos from static initial images. However, we prohibit unlawful, obscene, or misleading uses of the model consistent with the terms of our license and Acceptable Use Policy. For the open-weights release, our training data filtering mitigations alleviate this risk to some extent. These restrictions are explicitly enforced on user-facing interfaces at stablevideo.com, where a warning is issued. We do not take any responsibility for third-party interfaces. Submitting initial images that bypass input filters to tease out offensive or inappropriate content listed above is also prohibited. Safety filtering checks at stablevideo.com run on model inputs and outputs independently. More details on our user-facing interfaces can be found here: https://www.stablevideo.com/faq. Beyond the Acceptable Use Policy and other mitigations and conditions described here, the model is not subject to additional model behavior interventions of the type described in the Foundation Model Transparency Index.
- For stablevideo.com, we store preference data in the form of upvotes/downvotes on user-generated videos, and we have a pairwise ranker that runs while a user generates videos. This usage data is solely used for improving Stability AI’s future image/video models and services. No other third-party entities are given access to the usage data beyond Stability AI and maintainers of stablevideo.com.
- For usage statistics of SVD, we refer interested users to HuggingFace model download/usage statistics as a primary indicator. Third-party applications also have reported model usage statistics. We might also consider releasing aggregate usage statistics of stablevideo.com on reaching some milestones.
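The appendix being removed above notes that the released inference and demo code enables image-level watermarking by default via the imWatermark Python library. As a hypothetical sketch only (the payload and embedding method used by the official code are not specified in this card), embedding and reading back an invisible watermark with that library could look like this:

```python
import cv2
from imwatermark import WatermarkDecoder, WatermarkEncoder

payload = b"SVD"  # hypothetical payload; the official demo code may use something else

frame_bgr = cv2.imread("frame_0000.png")  # one generated video frame, BGR layout

# Embed an imperceptible DWT-DCT watermark into the frame.
encoder = WatermarkEncoder()
encoder.set_watermark("bytes", payload)
frame_marked = encoder.encode(frame_bgr, "dwtDct")
cv2.imwrite("frame_0000_wm.png", frame_marked)

# Later, test whether a frame carries the watermark.
decoder = WatermarkDecoder("bytes", len(payload) * 8)  # length is given in bits
recovered = decoder.decode(cv2.imread("frame_0000_wm.png"), "dwtDct")
print(recovered == payload)
```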
 
  ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ {}
  ---

  # Stable Video Diffusion Image-to-Video Model Card

  ![row01](output_tile.gif)
  Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.

  ## Model Details

  ### Model Description

  ### Direct Use

+ The model is intended for research purposes only. Possible research areas and tasks include

  - Research on generative models.
  - Safe deployment of models which have the potential to generate harmful content.

  - Generation of artworks and use in design and other artistic processes.
  - Applications in educational or creative tools.

  Excluded uses are described below.

  ### Out-of-Scope Use

  ### Recommendations

+ The model is intended for research purposes only.

  ## How to Get Started with the Model

  Check out https://github.com/Stability-AI/generative-models
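Beyond the generative-models repository linked above, the checkpoint can also be run from Python with the diffusers `StableVideoDiffusionPipeline` (the pipeline class declared in the `model_index.json` shown further down). The snippet below is a sketch, not official usage: it assumes diffusers >= 0.24, that the diffusers-format weights for `stabilityai/stable-video-diffusion-img2vid-xt` are available to download, and that `input.png` is a placeholder for your own conditioning image.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

# Load the image-to-video pipeline in fp16.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # trade speed for lower peak GPU memory

# Conditioning frame, resized to the training resolution.
image = load_image("input.png").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

`decode_chunk_size` and CPU offloading are examples of the quality/memory/speed trade-offs mentioned in the appendix for running on lower-VRAM cards.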
feature_extractor/preprocessor_config.json DELETED
@@ -1,28 +0,0 @@
- {
-   "crop_size": {
-     "height": 224,
-     "width": 224
-   },
-   "do_center_crop": true,
-   "do_convert_rgb": true,
-   "do_normalize": true,
-   "do_rescale": true,
-   "do_resize": true,
-   "feature_extractor_type": "CLIPFeatureExtractor",
-   "image_mean": [
-     0.48145466,
-     0.4578275,
-     0.40821073
-   ],
-   "image_processor_type": "CLIPImageProcessor",
-   "image_std": [
-     0.26862954,
-     0.26130258,
-     0.27577711
-   ],
-   "resample": 3,
-   "rescale_factor": 0.00392156862745098,
-   "size": {
-     "shortest_edge": 224
-   }
- }
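The deleted preprocessor config above is the standard CLIP image preprocessing recipe: resize the shortest edge to 224, center-crop to 224x224, rescale to [0, 1], and normalize with the CLIP statistics. As a sketch, the same processor can be rebuilt directly from these values with transformers; the dummy image below is only a placeholder.

```python
from PIL import Image
from transformers import CLIPImageProcessor

# Values copied from the deleted preprocessor_config.json above.
processor = CLIPImageProcessor(
    do_resize=True,
    size={"shortest_edge": 224},
    resample=3,                      # PIL bicubic
    do_center_crop=True,
    crop_size={"height": 224, "width": 224},
    do_rescale=True,
    rescale_factor=1 / 255,
    do_normalize=True,
    image_mean=[0.48145466, 0.4578275, 0.40821073],
    image_std=[0.26862954, 0.26130258, 0.27577711],
    do_convert_rgb=True,
)

# The conditioning frame is reduced to CLIP's 224x224 input for the image encoder.
dummy = Image.new("RGB", (1024, 576))
pixel_values = processor(images=dummy, return_tensors="pt").pixel_values
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])
```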
image_encoder/config.json DELETED
@@ -1,23 +0,0 @@
- {
-   "_name_or_path": "/home/suraj_huggingface_co/.cache/huggingface/hub/models--diffusers--svd-xt/snapshots/9703ded20c957c340781ee710b75660826deb487/image_encoder",
-   "architectures": [
-     "CLIPVisionModelWithProjection"
-   ],
-   "attention_dropout": 0.0,
-   "dropout": 0.0,
-   "hidden_act": "gelu",
-   "hidden_size": 1280,
-   "image_size": 224,
-   "initializer_factor": 1.0,
-   "initializer_range": 0.02,
-   "intermediate_size": 5120,
-   "layer_norm_eps": 1e-05,
-   "model_type": "clip_vision_model",
-   "num_attention_heads": 16,
-   "num_channels": 3,
-   "num_hidden_layers": 32,
-   "patch_size": 14,
-   "projection_dim": 1024,
-   "torch_dtype": "float16",
-   "transformers_version": "4.34.0.dev0"
- }
image_encoder/model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:ae616c24393dd1854372b0639e5541666f7521cbe219669255e865cb7f89466a
- size 1264217240
image_encoder/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:ed1e5af7b4042ca30ec29999a4a5cfcac90b7fb610fd05ace834f2dcbb763eab
- size 2528371296
model_index.json DELETED
@@ -1,25 +0,0 @@
- {
-   "_class_name": "StableVideoDiffusionPipeline",
-   "_diffusers_version": "0.24.0.dev0",
-   "_name_or_path": "diffusers/svd-xt",
-   "feature_extractor": [
-     "transformers",
-     "CLIPImageProcessor"
-   ],
-   "image_encoder": [
-     "transformers",
-     "CLIPVisionModelWithProjection"
-   ],
-   "scheduler": [
-     "diffusers",
-     "EulerDiscreteScheduler"
-   ],
-   "unet": [
-     "diffusers",
-     "UNetSpatioTemporalConditionModel"
-   ],
-   "vae": [
-     "diffusers",
-     "AutoencoderKLTemporalDecoder"
-   ]
- }
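`model_index.json` is how `DiffusionPipeline.from_pretrained` maps each subfolder of a diffusers-format repo to a `(library, class)` pair. A hypothetical sketch of assembling the same pipeline from individually loaded components, assuming a repo or local directory that still ships these subfolders:

```python
import torch
from diffusers import (
    AutoencoderKLTemporalDecoder,
    EulerDiscreteScheduler,
    StableVideoDiffusionPipeline,
    UNetSpatioTemporalConditionModel,
)
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

repo = "stabilityai/stable-video-diffusion-img2vid-xt"  # assumes the diffusers layout is available

# Each subfolder named in model_index.json is loaded with the class listed for it.
unet = UNetSpatioTemporalConditionModel.from_pretrained(repo, subfolder="unet", torch_dtype=torch.float16, variant="fp16")
vae = AutoencoderKLTemporalDecoder.from_pretrained(repo, subfolder="vae", torch_dtype=torch.float16, variant="fp16")
image_encoder = CLIPVisionModelWithProjection.from_pretrained(repo, subfolder="image_encoder", torch_dtype=torch.float16, variant="fp16")
scheduler = EulerDiscreteScheduler.from_pretrained(repo, subfolder="scheduler")
feature_extractor = CLIPImageProcessor.from_pretrained(repo, subfolder="feature_extractor")

pipe = StableVideoDiffusionPipeline(
    vae=vae,
    image_encoder=image_encoder,
    unet=unet,
    scheduler=scheduler,
    feature_extractor=feature_extractor,
)
```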
scheduler/scheduler_config.json DELETED
@@ -1,20 +0,0 @@
- {
-   "_class_name": "EulerDiscreteScheduler",
-   "_diffusers_version": "0.24.0.dev0",
-   "beta_end": 0.012,
-   "beta_schedule": "scaled_linear",
-   "beta_start": 0.00085,
-   "clip_sample": false,
-   "interpolation_type": "linear",
-   "num_train_timesteps": 1000,
-   "prediction_type": "v_prediction",
-   "set_alpha_to_one": false,
-   "sigma_max": 700.0,
-   "sigma_min": 0.002,
-   "skip_prk_steps": true,
-   "steps_offset": 1,
-   "timestep_spacing": "leading",
-   "timestep_type": "continuous",
-   "trained_betas": null,
-   "use_karras_sigmas": true
- }
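To see what these sampler settings do at inference time, here is a small sketch, assuming diffusers >= 0.24 (which introduced the continuous-timestep and sigma_min/sigma_max options used here), that rebuilds the scheduler from this config and prints its Karras-spaced noise levels:

```python
from diffusers import EulerDiscreteScheduler

# A subset of the deleted scheduler_config.json above; from_config ignores keys it does not know.
config = {
    "num_train_timesteps": 1000,
    "beta_start": 0.00085,
    "beta_end": 0.012,
    "beta_schedule": "scaled_linear",
    "prediction_type": "v_prediction",
    "interpolation_type": "linear",
    "use_karras_sigmas": True,
    "sigma_min": 0.002,
    "sigma_max": 700.0,
    "timestep_spacing": "leading",
    "timestep_type": "continuous",
    "steps_offset": 1,
}
scheduler = EulerDiscreteScheduler.from_config(config)

# e.g. 25 denoising steps; sigmas descend from sigma_max to sigma_min on a Karras-spaced grid.
scheduler.set_timesteps(num_inference_steps=25)
print(scheduler.sigmas)
```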
unet/config.json DELETED
@@ -1,38 +0,0 @@
- {
-   "_class_name": "UNetSpatioTemporalConditionModel",
-   "_diffusers_version": "0.24.0.dev0",
-   "_name_or_path": "/home/suraj_huggingface_co/.cache/huggingface/hub/models--diffusers--svd-xt/snapshots/9703ded20c957c340781ee710b75660826deb487/unet",
-   "addition_time_embed_dim": 256,
-   "block_out_channels": [
-     320,
-     640,
-     1280,
-     1280
-   ],
-   "cross_attention_dim": 1024,
-   "down_block_types": [
-     "CrossAttnDownBlockSpatioTemporal",
-     "CrossAttnDownBlockSpatioTemporal",
-     "CrossAttnDownBlockSpatioTemporal",
-     "DownBlockSpatioTemporal"
-   ],
-   "in_channels": 8,
-   "layers_per_block": 2,
-   "num_attention_heads": [
-     5,
-     10,
-     20,
-     20
-   ],
-   "num_frames": 25,
-   "out_channels": 4,
-   "projection_class_embeddings_input_dim": 768,
-   "sample_size": 96,
-   "transformer_layers_per_block": 1,
-   "up_block_types": [
-     "UpBlockSpatioTemporal",
-     "CrossAttnUpBlockSpatioTemporal",
-     "CrossAttnUpBlockSpatioTemporal",
-     "CrossAttnUpBlockSpatioTemporal"
-   ]
- }
unet/diffusion_pytorch_model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:9fbc02e90f37d422f5e3a4aeaee95f6629dc8c45ca211b951626e930daf2bddf
- size 3049435868
unet/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:7783d82729af04f26ded4641a5952617fe331fc46add332fb9e47674fecc6ad7
- size 6098682464
vae/config.json DELETED
@@ -1,24 +0,0 @@
- {
-   "_class_name": "AutoencoderKLTemporalDecoder",
-   "_diffusers_version": "0.24.0.dev0",
-   "_name_or_path": "/home/suraj_huggingface_co/.cache/huggingface/hub/models--diffusers--svd-xt/snapshots/9703ded20c957c340781ee710b75660826deb487/vae",
-   "block_out_channels": [
-     128,
-     256,
-     512,
-     512
-   ],
-   "down_block_types": [
-     "DownEncoderBlock2D",
-     "DownEncoderBlock2D",
-     "DownEncoderBlock2D",
-     "DownEncoderBlock2D"
-   ],
-   "force_upcast": true,
-   "in_channels": 3,
-   "latent_channels": 4,
-   "layers_per_block": 2,
-   "out_channels": 3,
-   "sample_size": 768,
-   "scaling_factor": 0.18215
- }
vae/diffusion_pytorch_model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:af602cd0eb4ad6086ec94fbf1438dfb1be5ec9ac03fd0215640854e90d6463a3
- size 195531910
vae/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:5d92aa595a53d9da9faf594f09910ee869d5d567c8bb0362d5095673c69997d6
- size 391017740