Text-to-video models can be used in any application that requires generating a consistent sequence of images from text.


Example: a prompt such as "Darth Vader is surfing on the waves." is passed to a text-to-video model, which generates a matching video clip.

About Text-to-Video

Use Cases

Script-based Video Generation

Text-to-video models can be used to create short-form video content from a provided text script, such as engaging and informative marketing videos. For example, a company could use a text-to-video model to create a video that explains how its product works.

Content Format Conversion

Text-to-video models can generate videos from long-form text, including blog posts, articles, and text files. This makes it possible to create educational videos that are more engaging and interactive, for example a video that explains a complex concept from an article.

Voice-overs and Speech

Text-to-video models can be used to create an AI newscaster that delivers daily news, or to help a filmmaker create a short film or a music video.

Task Variants

Text-to-video models have different variants based on their inputs and outputs.

Text-to-video Editing

One text-to-video task is text-guided video editing: changing a video's overall style or local attributes based on a text instruction. Text-to-video editing models can make it easier to perform tasks like cropping, stabilization, color correction, resizing, and audio editing consistently.

Text-to-video Search

Text-to-video search is the task of retrieving videos that are relevant to a given text query. This is challenging because videos are a complex medium that can contain many kinds of information. Relevant videos can be found by combining semantic analysis, which extracts the meaning of the text query; visual analysis, which extracts features such as the objects and actions present in each video; and temporal analysis, which models the relationships between those objects and actions over time.
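The retrieval step above can be sketched as a nearest-neighbor search over a joint text-video embedding space. This is a minimal illustration, assuming embeddings have already been produced by text and video encoders (a CLIP-style model, for instance); the function names and toy vectors below are illustrative, not a specific library API:

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank videos by cosine similarity to a text query embedding.

    query_emb: (dim,) embedding of the text query.
    video_embs: (n_videos, dim) embeddings of the candidate videos.
    Returns video indices, most relevant first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q                    # cosine similarity per video
    return np.argsort(scores)[::-1]  # highest similarity first

# Toy example: 3 videos in a 4-dimensional embedding space.
videos = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.5, 0.5, 0.0, 0.1],
])
query = np.array([1.0, 0.0, 0.0, 0.0])
ranking = rank_videos(query, videos)
print(ranking)  # video 0 is most similar to the query
```

In practice the video embeddings would be precomputed and indexed, so a query only requires embedding the text and one similarity scan.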

Text-driven Video Prediction

Text-driven video prediction is the task of generating a video sequence from a text description, which can be anything from a single sentence to a detailed story. The goal is to generate a video that is both visually realistic and semantically consistent with the description.

Video Translation

Text-to-video translation models can translate videos from one language to another, or allow querying a multilingual text-video model with non-English sentences. This can be useful for people who want to watch videos in a language they don't understand, especially when multilingual captions are available for training.



Useful Resources

In this area, you can insert useful resources about how to train or use a model for this task.

Compatible libraries


Models for Text-to-Video
Browse Models (102)

Note A strong model for video generation.

Note A text-to-video generation model with high quality and smooth outputs.

Datasets for Text-to-Video
Browse Datasets (35)

Note Microsoft Research Video to Text is a large-scale dataset for open-domain video captioning.

Note UCF101 Human Actions dataset consists of 13,320 video clips from YouTube, with 101 classes.

Note A high-quality dataset for human action recognition in YouTube videos.

Note A dataset of video clips of humans performing pre-defined basic actions with everyday objects.

Note This dataset consists of text-video pairs and contains noisy samples with irrelevant video descriptions.

Note A dataset of short Flickr videos for the temporal localization of events with descriptions.

Spaces using Text-to-Video

Note An application that generates video from text.

Note An application that generates video from image and text.

Metrics for Text-to-Video
Inception Score uses an image classification model that predicts class labels and evaluates how distinct and diverse the images are. A higher score indicates better video generation.
Frechet Inception Distance uses an image classification model to obtain image embeddings. The metric compares the mean and covariance of the embeddings of real and generated images. A smaller score indicates better video generation.
Frechet Video Distance uses a model that captures coherence for changes in frames and the quality of each frame. A smaller score indicates better video generation.
CLIPSIM measures similarity between video frames and text using an image-text similarity model. A higher score indicates better video generation.