stabilityai/stable-video-diffusion-img2vid-xt · How to generate a video in 4 seconds?

To generate a 4-second video (which is what I'm guessing you mean), change the frame rate parameter (fps) in the call to "export_to_video".
The model generates 25 frames by default, which is also what it was fine-tuned to do. If you pass fps=25 to the model call and also pass fps=25 to "export_to_video", the resulting video will be 1 second long (25 frames / 25 fps).

Alternatively, if you keep fps=25 for the model call but export with an fps of 6.25 (i.e. 25/4), the resulting video will be 4 seconds long. However, the output will almost certainly look choppy because of the large perceptual jumps between frames (as a rough rule, humans need ~20 fps to perceive motion as smooth).
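The arithmetic above is just duration = frames / export fps. A minimal sketch of it follows; the helper names clip_duration and export_fps_for_duration are made up for illustration and are not part of diffusers:

```python
def clip_duration(num_frames: int, export_fps: float) -> float:
    """Playback length in seconds for a clip exported at export_fps."""
    if export_fps <= 0:
        raise ValueError("export_fps must be positive")
    return num_frames / export_fps


def export_fps_for_duration(num_frames: int, target_seconds: float) -> float:
    """Export fps needed to stretch num_frames over target_seconds."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return num_frames / target_seconds


# 25 generated frames (the model's default) played back over 4 seconds
fps_for_4s = export_fps_for_duration(25, 4)
print(fps_for_4s)                     # 6.25
print(clip_duration(25, 25))          # 1.0
print(clip_duration(25, fps_for_4s))  # 4.0
```

The result would then be passed as fps to "export_to_video". Whether a fractional value like 6.25 is accepted may depend on your diffusers version and video backend, so rounding to fps=6 (about 4.17 seconds) is a possible fallback.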

Possible solutions:

One solution is to chain together multiple 1s videos, using the final frame of each video as the starting frame of the next. However, since the model has no insight into the motion of the preceding videos, the output may not be fully coherent across clip boundaries.
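A rough sketch of the chaining idea is below. It is written against a generic generate_clip callable (a hypothetical stand-in, not a diffusers function) so the control flow is clear without loading the model:

```python
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")


def chain_clips(
    start_frame: Frame,
    generate_clip: Callable[[Frame], List[Frame]],
    num_clips: int,
) -> List[Frame]:
    """Generate num_clips clips back to back.

    The last frame of each clip becomes the conditioning frame for the
    next one. generate_clip is a placeholder for whatever produces one
    short clip from a single image.
    """
    frames: List[Frame] = []
    conditioning = start_frame
    for _ in range(num_clips):
        clip = generate_clip(conditioning)
        if not clip:
            raise ValueError("generate_clip returned no frames")
        frames.extend(clip)
        conditioning = clip[-1]  # seed the next clip with the final frame
    return frames
```

With the actual pipeline, generate_clip could be a small wrapper such as lambda img: pipe(img).frames[0], assuming pipe is a loaded StableVideoDiffusionPipeline. Four chained 25-frame clips exported at fps=25 would give roughly 4 seconds.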

Alternatively, this model could be fine-tuned to produce videos from a richer context than a single start frame, similar to the two-frame input discussed in https://arxiv.org/abs/2304.08818.

Using this pretrained model without fine-tuning, you might try passing additional end frames, encoding them into latent space, and preventing these 'seed frames' from being altered during diffusion. This extends the duration of the video while improving the consistency of motion between frame groups. (I've tried this, and it can produce reasonable 4-5 second clips, but not much longer than that because of compounding error.)
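One possible reading of "prevent alteration of seed frames" is to overwrite the seed-frame latents after every denoising step, in the style of inpainting. The sketch below is only that reading, not necessarily what I ran. denoise_step is a hypothetical placeholder for one scheduler/UNet update, and latents are modelled as a plain per-frame list:

```python
from typing import Callable, Dict, List, TypeVar

Latent = TypeVar("Latent")


def denoise_with_pinned_frames(
    latents: List[Latent],
    seed_latents: Dict[int, Latent],
    denoise_step: Callable[[List[Latent], int], List[Latent]],
    num_steps: int,
) -> List[Latent]:
    """Run num_steps denoising updates while keeping seed frames fixed.

    latents      -- one latent per frame
    seed_latents -- {frame_index: encoded latent of a known frame}
    denoise_step -- hypothetical stand-in for one diffusion update
    """
    current = list(latents)
    for step in range(num_steps):
        current = list(denoise_step(current, step))
        # Restore the known frames so the update cannot alter them.
        for index, seed in seed_latents.items():
            current[index] = seed
    return current
```

A real implementation would probably also need to add noise to the seed latents to match the noise level of each step. That is left out here to keep the sketch short.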