arxiv:2402.19479

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Published on Feb 29

Submitted by akhaliq on Mar 1


Authors:

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Byung Eun Jeon, Hsin-Ying Lee, Jian Ren

Abstract

The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.
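The caption-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each clip and each candidate caption has already been embedded into a shared vector space by some retrieval model, and it uses plain cosine similarity as the score. The names `cosine` and `select_best_caption`, and the toy embeddings, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(
    video_emb: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> str:
    """Return the candidate caption whose embedding is closest to the video's.

    `candidates` holds one (caption, text_embedding) pair per teacher model.
    The embeddings are assumed to come from a retrieval model; here they are
    just lists of floats.
    """
    if not candidates:
        raise ValueError("need at least one candidate caption")
    best_caption, _ = max(candidates, key=lambda c: cosine(video_emb, c[1]))
    return best_caption


# Toy example with made-up 2-D embeddings for three teacher captions.
video = [1.0, 0.0]
teacher_outputs = [
    ("a dog runs across a field", [0.9, 0.1]),
    ("a city skyline at night", [0.0, 1.0]),
    ("someone talks about pets", [0.5, 0.5]),
]
print(select_best_caption(video, teacher_outputs))
```

In the paper's pipeline the scoring model is finetuned on a small, manually labeled subset before being applied to the whole dataset; the sketch only shows the final "pick the highest-scoring candidate" step.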


Community


mikelabs

Mar 4 (edited Mar 4)

Here's my summary!

Training AI to understand and describe video content requires datasets that are expensive for humans to annotate manually. Researchers from Snap, UC Merced, and the University of Trento have now put together a dataset called Panda-70M that aims to help.

The dataset has 70 million high-res YouTube clips paired with descriptive captions. The key is an automated pipeline in which multiple cross-modal "teacher" AI models generate captions from different inputs, such as video, subtitles, and images.

Some highlights:

- 70M YouTube clips at 720p, about 8 seconds long, with captions of about 13 words
- Teacher models include video QA, image captioning, and text summarization
- The ensemble of teachers can accurately describe 84% of clips, vs. 31% for any single model
- Pretraining on this dataset improved video AI models' performance substantially:
  - 18% boost in captioning accuracy after finetuning on a small 2.5M subset
  - 7% better at text-video retrieval
  - 77% reduction in video generation errors
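As a rough sanity check on the scale these figures imply, here is a back-of-the-envelope calculation. It takes the rounded numbers from the list above (70M clips, about 8 seconds each, about 13 words per caption) as given, so the results are approximations rather than official dataset statistics. The function names are hypothetical.

```python
def total_hours(num_clips: int, avg_seconds: float) -> float:
    """Total footage in hours for num_clips clips of avg_seconds each."""
    return num_clips * avg_seconds / 3600


def total_caption_words(num_clips: int, avg_words: float) -> float:
    """Total caption words for num_clips captions of avg_words each."""
    return num_clips * avg_words


# Rounded figures from the summary above (assumed, not exact).
clips = 70_000_000
print(f"{total_hours(clips, 8):,.0f} hours of video")
print(f"{total_caption_words(clips, 13):,.0f} caption words")
```

With these rounded inputs the footage comes to roughly 156 thousand hours and the captions to roughly 910 million words.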

Limitations remain around content diversity, caption density, and automated quality. But I think this is a big step forward for assembling large-scale video-text training data to advance multimodal AI.

Efficient pipelines like this could unlock video understanding capabilities approaching human-level comprehension. It will be exciting to see models trained on Panda-70M as they become available.
