
[Image: example outputs (courtesy of dotsimulate)]

zeroscope_v2 XL

A watermark-free Modelscope-based video model capable of generating high-quality video at 1024x576. This model was trained from the original weights with offset noise using 9,923 clips and 29,769 tagged frames at 24 frames, 1024x576 resolution.
zeroscope_v2_XL is specifically designed for upscaling content made with zeroscope_v2_576w using vid2vid in the 1111 text2video extension by kabachuha. Using this model as an upscaler gives better overall compositions at higher resolutions: you can explore ideas quickly at 576x320 (or 448x256) and then switch to a high-resolution render.

zeroscope_v2_XL uses 15.3 GB of VRAM when rendering 30 frames at 1024x576.
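As a rough pre-flight check, the figure above can be compared with the free memory on your GPU before starting a render. The sketch below is only an illustration: `fits_in_vram` and its 1 GB headroom are hypothetical, and "GB" is assumed to mean 10^9 bytes. With PyTorch installed, the free byte count could come from `torch.cuda.mem_get_info()[0]`.

```python
# Usage reported in this card for 30 frames at 1024x576.
REPORTED_VRAM_GB = 15.3


def fits_in_vram(free_bytes: int,
                 required_gb: float = REPORTED_VRAM_GB,
                 headroom_gb: float = 1.0) -> bool:
    """Return True if free GPU memory covers the reported usage plus headroom.

    headroom_gb is an illustrative safety margin, not a measured value.
    """
    free_gb = free_bytes / 1e9  # assumption: GB means 10^9 bytes
    return free_gb >= required_gb + headroom_gb


# Hard-coded byte counts, so no GPU is needed to run this example.
print(fits_in_vram(24 * 10**9))  # a 24 GB card with all memory free
print(fits_in_vram(12 * 10**9))  # a 12 GB card
```

If the check fails, memory-saving options such as the forward chunking call shown in the Diffusers section may help, at the cost of speed.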

Using it with the 1111 text2video extension

  1. Download files in the zs2_XL folder.
  2. Replace the respective files in the 'stable-diffusion-webui\models\ModelScope\t2v' directory.
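The two steps above can also be scripted. This is a minimal sketch: `install_model_files` is a hypothetical helper, and the paths in the usage comment are placeholders for wherever you downloaded the zs2_XL folder and installed the web UI. It overwrites files with the same name, so back up the originals first if you want to switch back later.

```python
import shutil
from pathlib import Path


def install_model_files(src_dir, dst_dir):
    """Copy every file from src_dir into dst_dir, replacing same-named files.

    Returns the sorted list of file names that were copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"source folder not found: {src}")
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file():
            shutil.copy2(path, dst / path.name)  # overwrites an existing file
            copied.append(path.name)
    return copied


# Example call (placeholder paths, adjust to your setup):
# install_model_files("zs2_XL", r"stable-diffusion-webui\models\ModelScope\t2v")
```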

Upscaling recommendations

For upscaling, it's recommended to use the 1111 extension. It works best at 1024x576 with a denoise strength between 0.66 and 0.85. Remember to use the same prompt that was used to generate the original clip.
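To build intuition for the denoise strength, here is a small sketch of how diffusers-style vid2vid and img2img pipelines typically turn `strength` into the number of denoising steps actually run. Whether the 1111 extension uses exactly the same rule is an assumption; `effective_steps` and `in_recommended_range` are hypothetical helper names.

```python
# Recommended range from this card.
RECOMMENDED_MIN, RECOMMENDED_MAX = 0.66, 0.85


def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps run for a given strength.

    Mirrors the usual img2img rule: the input is noised part of the way,
    then only the last `strength` fraction of the schedule is denoised.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


def in_recommended_range(strength: float) -> bool:
    """True if strength falls inside the range recommended above."""
    return RECOMMENDED_MIN <= strength <= RECOMMENDED_MAX


for s in (0.66, 0.85):
    print(s, effective_steps(40, s), in_recommended_range(s))
```

A lower strength keeps more of the original clip and runs fewer steps; a higher strength lets the XL model repaint more of each frame.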

Usage in 🧨 Diffusers

Let's first install the libraries required:

$ pip install git+https://github.com/huggingface/diffusers.git
$ pip install transformers accelerate torch

Now, let's generate a low-resolution video using cerspense/zeroscope_v2_576w.

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()  # run on the GPU, moving submodules over as needed (requires accelerate)
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)  # saves memory but slows generation significantly; remove if you have enough VRAM

prompt = "Darth Vader is surfing on waves"
# Depending on the diffusers version, .frames may be batched; if so, use .frames[0]
video_frames = pipe(prompt, num_inference_steps=40, height=320, width=576, num_frames=36).frames
video_path = export_to_video(video_frames)

Next, we can upscale it using cerspense/zeroscope_v2_XL.

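from PIL import Image

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()  # run on the GPU, moving submodules over as needed (requires accelerate)

# Resize each low-resolution frame to 1024x576 before upscaling (Image.fromarray expects uint8 frames)
video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

video_frames = pipe(prompt, video=video, strength=0.6).frames
video_path = export_to_video(video_frames, output_video_path="/home/patrick/videos/video_1024_darth_vader_36.mp4")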

Here are some results:

[Video: Darth Vader is surfing on waves.]
[Video: Darth Vader surfing in waves.]

Known issues

Rendering at lower resolutions or fewer than 24 frames could lead to suboptimal outputs.
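A quick guard against these settings might look like the sketch below. `check_render_settings` is a hypothetical helper, and reading "lower resolutions" as anything below 1024x576 is an assumption based on the resolution this model was trained at.

```python
# Thresholds taken from this card: native 1024x576, at least 24 frames.
MIN_WIDTH, MIN_HEIGHT, MIN_FRAMES = 1024, 576, 24


def check_render_settings(width: int, height: int, num_frames: int) -> list:
    """Return a list of warnings for settings the card flags as risky."""
    warnings = []
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        warnings.append(
            f"{width}x{height} is below {MIN_WIDTH}x{MIN_HEIGHT}; "
            "output may be suboptimal"
        )
    if num_frames < MIN_FRAMES:
        warnings.append(
            f"{num_frames} frames is fewer than {MIN_FRAMES}; "
            "output may be suboptimal"
        )
    return warnings


for message in check_render_settings(576, 320, 16):
    print("warning:", message)
```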

Thanks to camenduru, kabachuha, ExponentialML, dotsimulate, VANYA, polyware, tin2tin
