arxiv:2309.13952

VidChapters-7M: Video Chapters at Scale

Published on Sep 25, 2023 · Featured in Daily Papers on Sep 26, 2023
Abstract

Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset. Our dataset, code, and models are publicly available at https://antoyang.github.io/vidchapters.html.
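
As an informal illustration of the data structure behind these three tasks (not taken from the paper; the class and function names below are hypothetical), each user-chaptered video boils down to an ordered list of timestamped, titled segments, and each task hides a different part of that structure from the model:

```python
# Informal sketch of a chaptered video and the three tasks defined on it.
# All class and function names here are hypothetical, not the paper's API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chapter:
    start: float  # segment start time (seconds)
    end: float    # segment end time (seconds)
    title: str    # user-written chapter title

@dataclass
class ChapteredVideo:
    video_id: str
    chapters: List[Chapter]  # ordered, non-overlapping segments

def chapter_generation_target(video: ChapteredVideo) -> List[Tuple[float, float, str]]:
    """Full task: predict both the temporal segments and a title for each."""
    return [(c.start, c.end, c.title) for c in video.chapters]

def title_given_boundaries_target(video: ChapteredVideo, start: float, end: float) -> str:
    """Variant 1: the ground-truth segment is given; only its title is predicted."""
    return next(c.title for c in video.chapters if c.start == start and c.end == end)

def grounding_target(video: ChapteredVideo, title: str) -> Tuple[float, float]:
    """Variant 2: the annotated title is given; its segment must be localized."""
    c = next(c for c in video.chapters if c.title == title)
    return (c.start, c.end)
```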

Community

Here is an ML-generated summary

Objective
The paper presents VidChapters-7M, a large-scale dataset of user-annotated video chapters, and defines video chapter generation, video chapter generation with ground-truth boundaries, and video chapter grounding tasks based on this data.

Insights

  • Vid2Seq achieves the best video chapter generation performance, especially when using both speech and visual modalities. Pretraining on narrated videos further improves results.
  • For chapter title generation given ground-truth boundaries, Vid2Seq again outperforms baselines.
  • For chapter grounding, Moment-DETR trained on VidChapters-7M outperforms zero-shot baselines.
  • Vid2Seq pretrained on VidChapters-7M transfers well to dense video captioning, significantly improving over prior methods and showing promising scaling behavior.

Implementation

  • The authors collect a dataset of 817K YouTube videos with 7M user-annotated chapters by scraping chapter annotations from video descriptions (a parsing sketch follows this list).
  • They extract ASR transcripts with the Whisper model and CLIP visual features for each video (a feature-extraction sketch also follows this list).
  • For video chapter generation, they train and evaluate text tiling, shot detection, PDVC, and Vid2Seq models on VidChapters-7M.
  • For video chapter generation with ground-truth boundaries, they train and evaluate LLaMA, BLIP-2, and Vid2Seq on VidChapters-7M.
  • For video chapter grounding, they train and evaluate random, BERT, CLIP, and Moment-DETR models on VidChapters-7M.
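
For the chapter-collection step, here is a minimal parsing sketch, assuming the common YouTube convention of one `timestamp title` line per chapter in the description; the regex, the ordering check, and the function names are illustrative assumptions rather than the authors' exact scraping heuristics.

```python
import re
from typing import List, Tuple

# Assumes chapters appear in the description as lines like "0:00 Intro" or
# "1:02:35 - Results"; this is a sketch, not the paper's actual scraper.
TIMESTAMP_LINE = re.compile(r"^\s*\(?((?:\d{1,2}:)?\d{1,2}:\d{2})\)?\s*[-:]?\s*(.*\S)\s*$")

def to_seconds(timestamp: str) -> float:
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return float(seconds)

def parse_chapters(description: str) -> List[Tuple[float, str]]:
    """Return (start_time_in_seconds, title) pairs scraped from a description."""
    chapters = []
    for line in description.splitlines():
        match = TIMESTAMP_LINE.match(line)
        if match:
            chapters.append((to_seconds(match.group(1)), match.group(2)))
    # Plausible sanity check: the list starts at 0:00 and timestamps increase.
    starts = [t for t, _ in chapters]
    if len(chapters) >= 2 and starts[0] == 0.0 and starts == sorted(set(starts)):
        return chapters
    return []

# parse_chapters("0:00 Intro\n1:25 Dataset\n10:03 Results")
# -> [(0.0, 'Intro'), (85.0, 'Dataset'), (603.0, 'Results')]
```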
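
And for the feature-extraction step, a minimal sketch using the openai-whisper and CLIP packages; the model sizes (large-v2, ViT-L/14), the 1 FPS sampling rate, and the helper names are assumptions for illustration and may not match the paper's exact pipeline.

```python
# Sketch of per-video feature extraction: Whisper for ASR, CLIP for visual features.
import cv2                      # frame decoding
import torch
import whisper                  # https://github.com/openai/whisper
import clip                     # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
asr_model = whisper.load_model("large-v2", device=device)
clip_model, clip_preprocess = clip.load("ViT-L/14", device=device)

def extract_asr(video_path: str):
    """Transcribe speech; returns a list of (start, end, text) segments."""
    result = asr_model.transcribe(video_path)
    return [(s["start"], s["end"], s["text"]) for s in result["segments"]]

@torch.no_grad()
def extract_clip_features(video_path: str, fps: float = 1.0) -> torch.Tensor:
    """Encode frames sampled at `fps` with CLIP; returns a (num_frames, dim) tensor."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / fps)), 1)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            pixels = clip_preprocess(image).unsqueeze(0).to(device)
            feats.append(clip_model.encode_image(pixels).cpu())
        idx += 1
    cap.release()
    return torch.cat(feats) if feats else torch.empty(0)
```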

Results
The paper demonstrates the value of VidChapters-7M for video chapter generation and grounding tasks, and shows its effectiveness for pretraining video-language models that transfer well to dense video captioning.

Will this be a huge stepping stone for training video diffusion models, which employ large-scale video-caption datasets?

Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 3