arxiv:2407.02371

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Published on Jul 2
· Submitted by yingtai on Jul 3
#2 Paper of the day

Abstract

Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) the lack of a precise, open-sourced, high-quality dataset. Previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either of low quality or too large for most research institutions. It is therefore challenging but crucial to collect precise, high-quality text-video pairs for T2V generation. 2) Failure to fully utilize textual information. Recent T2V methods have focused on vision transformers, using a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise, high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structural information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.

Community

Paper author Paper submitter

Paper, code, dataset and models are all released.

paper: http://export.arxiv.org/pdf/2407.02371
project website: https://nju-pcalab.github.io/projects/openvid
code: https://github.com/NJU-PCALab/OpenVid-1M
dataset: https://huggingface.co/datasets/nkp37/OpenVid-1M
models: https://huggingface.co/datasets/nkp37/OpenVid-1M/tree/main/model_weights
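For convenience, here is a minimal sketch of pulling the dataset metadata from the Hub with huggingface_hub. The *.csv filter is only an assumption about the repo layout; adjust the patterns to the files actually present in the dataset repo.

```python
# Minimal sketch: download OpenVid-1M caption/metadata files from the Hub.
# The "*.csv" pattern is an assumption about the repo layout, not confirmed;
# drop allow_patterns to fetch everything (the video archives are much larger).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nkp37/OpenVid-1M",
    repo_type="dataset",
    allow_patterns=["*.csv"],  # metadata only; omit to also pull the videos
)
print("Downloaded metadata to:", local_dir)
```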



Congrats!! Really cool to have a new dataset for text-to-video on the Hub 🔥


Thanks!

Hi @yingtai, congrats on this work!

Great to see you're making the dataset and models available on HF.

Would you be able to link the dataset to this paper? Here's how to do that: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.

I also saw that the models are currently part of the dataset repo; would you be able to create model repositories for them instead (so that they appear as models citing this paper)? Here's how to do that: https://huggingface.co/docs/hub/models-uploading. They can be linked to the paper as explained here.
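For reference, moving the weights into a dedicated model repo could look roughly like the sketch below. The repo id and local folder are placeholders, not the actual names; the models-uploading docs linked above describe the canonical steps.

```python
# Rough sketch: create a standalone model repo and upload the released weights
# so they appear under "Models citing this paper".
# The repo id and local path below are placeholders (hypothetical), not the real names.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "nkp37/OpenVid-MVDiT"  # hypothetical model repo name

# Create the model repo (no-op if it already exists).
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

# Upload the local weight files; mentioning arxiv.org/abs/2407.02371 in the
# repo's README.md links the model to this paper page.
api.upload_folder(
    folder_path="./model_weights",  # placeholder local path
    repo_id=repo_id,
    repo_type="model",
)
```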


Thanks for your suggestions!

We will link the dataset and create model repos in the coming days!


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.02371 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2407.02371 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.02371 in a Space README.md to link it from this page.

Collections including this paper 3