arxiv:2407.02371

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Published on Jul 2
· Submitted by yingtai on Jul 3
#2 Paper of the day

Abstract

Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) the lack of a precise, open-sourced, high-quality dataset. Previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either of low quality or too large for most research institutions. It is therefore challenging but crucial to collect precise, high-quality text-video pairs for T2V generation. 2) Failure to fully utilize textual information. Recent T2V methods have focused on vision transformers, using a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise, high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structural information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.

Community

Paper author Paper submitter

Paper, code, dataset and models are all released.

paper: http://export.arxiv.org/pdf/2407.02371
project website: https://nju-pcalab.github.io/projects/openvid
code: https://github.com/NJU-PCALab/OpenVid-1M
dataset: https://huggingface.co/datasets/nkp37/OpenVid-1M
models: https://huggingface.co/datasets/nkp37/OpenVid-1M/tree/main/model_weights
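For convenience, here is a minimal sketch of pulling the dataset metadata from the Hub with huggingface_hub. The *.csv filter is only an assumption about the repo layout; adjust the patterns to the files actually present in the dataset repo.

```python
# Minimal sketch: download OpenVid-1M caption/metadata files from the Hub.
# The "*.csv" pattern is an assumption about the repo layout, not confirmed;
# drop allow_patterns to fetch everything (the video archives are much larger).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nkp37/OpenVid-1M",
    repo_type="dataset",
    allow_patterns=["*.csv"],  # metadata only; omit to also pull the videos
)
print("Downloaded metadata to:", local_dir)
```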



Congrats!! Really cool to have a new dataset for text-to-video on the Hub 🔥


Thanks!

Hi @yingtai, congrats on this work!

Great to see you're making the dataset and models available on HF.

Would you be able to link the dataset to this paper? Here's how to do that: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.

I also saw that the models are currently part of the dataset repo; would you be able to create model repositories for them instead (so that they appear as models citing this paper)? Here's how to do that: https://huggingface.co/docs/hub/models-uploading. They can be linked to the paper as explained here.
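For reference, moving the weights into a dedicated model repo could look roughly like the sketch below. The repo id and local folder are placeholders, not the actual names; the models-uploading docs linked above describe the canonical steps.

```python
# Rough sketch: create a standalone model repo and upload the released weights
# so they appear under "Models citing this paper".
# The repo id and local path below are placeholders (hypothetical), not the real names.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "nkp37/OpenVid-MVDiT"  # hypothetical model repo name

# Create the model repo (no-op if it already exists).
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

# Upload the local weight files; mentioning arxiv.org/abs/2407.02371 in the
# repo's README.md links the model to this paper page.
api.upload_folder(
    folder_path="./model_weights",  # placeholder local path
    repo_id=repo_id,
    repo_type="model",
)
```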


Thanks for your suggestions!

We will link the dataset and create model repos in the coming days!


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.02371 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2407.02371 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.02371 in a Space README.md to link it from this page.

Collections including this paper 3