arxiv:2309.13952

VidChapters-7M: Video Chapters at Scale

Published on Sep 25, 2023 · Featured in Daily Papers on Sep 26, 2023
Abstract

Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset. Our dataset, code, and models are publicly available at https://antoyang.github.io/vidchapters.html.
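
As an informal illustration of the data structure behind these three tasks (not taken from the paper; the class and function names below are hypothetical), each user-chaptered video boils down to an ordered list of timestamped, titled segments, and each task hides a different part of that structure from the model:

```python
# Informal sketch of a chaptered video and the three tasks defined on it.
# All class and function names here are hypothetical, not the paper's API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chapter:
    start: float  # segment start time (seconds)
    end: float    # segment end time (seconds)
    title: str    # user-written chapter title

@dataclass
class ChapteredVideo:
    video_id: str
    chapters: List[Chapter]  # ordered, non-overlapping segments

def chapter_generation_target(video: ChapteredVideo) -> List[Tuple[float, float, str]]:
    """Full task: predict both the temporal segments and a title for each."""
    return [(c.start, c.end, c.title) for c in video.chapters]

def title_given_boundaries_target(video: ChapteredVideo, start: float, end: float) -> str:
    """Variant 1: the ground-truth segment is given; only its title is predicted."""
    return next(c.title for c in video.chapters if c.start == start and c.end == end)

def grounding_target(video: ChapteredVideo, title: str) -> Tuple[float, float]:
    """Variant 2: the annotated title is given; its segment must be localized."""
    c = next(c for c in video.chapters if c.title == title)
    return (c.start, c.end)
```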

Community

Here is an ML-generated summary

Objective
The paper presents VidChapters-7M, a large-scale dataset of user-annotated video chapters, and defines video chapter generation, video chapter generation with ground-truth boundaries, and video chapter grounding tasks based on this data.

Insights

  • Vid2Seq achieves the best video chapter generation performance, especially when using both speech and visual modalities. Pretraining on narrated videos further improves results.
  • For chapter title generation given ground-truth boundaries, Vid2Seq again outperforms baselines.
  • For chapter grounding, Moment-DETR trained on VidChapters-7M outperforms zero-shot baselines.
  • Vid2Seq pretrained on VidChapters-7M transfers well to dense video captioning, significantly improving over prior methods and showing promising scaling behavior.

Implementation

  • The authors collect a dataset of 817K YouTube videos with 7M user-annotated chapters by scraping chapter annotations from video descriptions (a parsing sketch follows this list).
  • They extract ASR transcripts with the Whisper model and CLIP visual features for each video (a feature-extraction sketch also follows this list).
  • For video chapter generation, they train and evaluate text tiling, shot detection, PDVC, and Vid2Seq models on VidChapters-7M.
  • For video chapter generation with ground-truth boundaries, they train and evaluate LLaMA, BLIP-2, and Vid2Seq on VidChapters-7M.
  • For video chapter grounding, they train and evaluate random, BERT, CLIP, and Moment-DETR models on VidChapters-7M.
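
For the chapter-collection step, here is a minimal parsing sketch, assuming the common YouTube convention of one `timestamp title` line per chapter in the description; the regex, the ordering check, and the function names are illustrative assumptions rather than the authors' exact scraping heuristics.

```python
import re
from typing import List, Tuple

# Assumes chapters appear in the description as lines like "0:00 Intro" or
# "1:02:35 - Results"; this is a sketch, not the paper's actual scraper.
TIMESTAMP_LINE = re.compile(r"^\s*\(?((?:\d{1,2}:)?\d{1,2}:\d{2})\)?\s*[-:]?\s*(.*\S)\s*$")

def to_seconds(timestamp: str) -> float:
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return float(seconds)

def parse_chapters(description: str) -> List[Tuple[float, str]]:
    """Return (start_time_in_seconds, title) pairs scraped from a description."""
    chapters = []
    for line in description.splitlines():
        match = TIMESTAMP_LINE.match(line)
        if match:
            chapters.append((to_seconds(match.group(1)), match.group(2)))
    # Plausible sanity check: the list starts at 0:00 and timestamps increase.
    starts = [t for t, _ in chapters]
    if len(chapters) >= 2 and starts[0] == 0.0 and starts == sorted(set(starts)):
        return chapters
    return []

# parse_chapters("0:00 Intro\n1:25 Dataset\n10:03 Results")
# -> [(0.0, 'Intro'), (85.0, 'Dataset'), (603.0, 'Results')]
```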
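
And for the feature-extraction step, a minimal sketch using the openai-whisper and CLIP packages; the model sizes (large-v2, ViT-L/14), the 1 FPS sampling rate, and the helper names are assumptions for illustration and may not match the paper's exact pipeline.

```python
# Sketch of per-video feature extraction: Whisper for ASR, CLIP for visual features.
import cv2                      # frame decoding
import torch
import whisper                  # https://github.com/openai/whisper
import clip                     # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
asr_model = whisper.load_model("large-v2", device=device)
clip_model, clip_preprocess = clip.load("ViT-L/14", device=device)

def extract_asr(video_path: str):
    """Transcribe speech; returns a list of (start, end, text) segments."""
    result = asr_model.transcribe(video_path)
    return [(s["start"], s["end"], s["text"]) for s in result["segments"]]

@torch.no_grad()
def extract_clip_features(video_path: str, fps: float = 1.0) -> torch.Tensor:
    """Encode frames sampled at `fps` with CLIP; returns a (num_frames, dim) tensor."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / fps)), 1)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            pixels = clip_preprocess(image).unsqueeze(0).to(device)
            feats.append(clip_model.encode_image(pixels).cpu())
        idx += 1
    cap.release()
    return torch.cat(feats) if feats else torch.empty(0)
```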

Results
The paper demonstrates the value of VidChapters-7M for video chapter generation and grounding tasks, and shows its effectiveness for pretraining video-language models that transfer well to dense video captioning.

Will this be a huge stepping stone for training video diffusion models, which employ large-scale video-caption datasets?

Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 3