
Dataset Card for LongVideoBench

Large multimodal models (LMMs) are handling increasingly long and complex inputs. However, few public benchmarks are available to assess these advancements. To address this, we introduce LongVideoBench, a question-answering benchmark with video-language interleaved inputs up to an hour long. It comprises 3,763 web-collected videos with subtitles across diverse themes, designed to evaluate LMMs on long-term multimodal understanding.

The main challenge that LongVideoBench targets is to accurately retrieve and reason over detailed information from lengthy inputs. We present a novel task called referring reasoning, where questions contain a referring query that references related video contexts, requiring the model to reason over these details.

LongVideoBench includes 6,678 human-annotated multiple-choice questions across 17 categories, making it one of the most comprehensive benchmarks for long-form video understanding. Evaluations show significant challenges even for advanced proprietary models (e.g., GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), with open-source models performing worse. Performance improves only when models process more frames, establishing LongVideoBench as a valuable benchmark for future long-context LMMs.

Dataset Details

Dataset Description

  • Curated by: LongVideoBench Team
  • Language(s) (NLP): English
  • License: CC-BY-NC-SA 4.0

Dataset Sources

  • Repository: https://github.com/LongVideoBench/LongVideoBench
  • Paper: https://arxiv.org/abs/2407.15754

Leaderboard (until Oct. 14, 2024)

We rank models by Test Total Performance.

| Model | Test Total (5341) | Test 8s-15s | Test 15s-60s | Test 180s-600s | Test 900s-3600s | Val Total (1337) |
|---|---|---|---|---|---|---|
| GPT-4o (0513) (256) | 66.7 | 71.6 | 76.8 | 66.7 | 61.6 | 66.7 |
| Aria (256) | 65.0 | 69.4 | 76.6 | 64.6 | 60.1 | 64.2 |
| LLaVA-Video-72B-Qwen2 (128) | 64.9 | 72.4 | 77.4 | 63.9 | 59.3 | 63.9 |
| Gemini-1.5-Pro (0514) (256) | 64.4 | 70.2 | 75.3 | 65.0 | 59.1 | 64.0 |
| LLaVA-OneVision-QWen2-72B-OV (32) | 63.2 | 74.3 | 77.4 | 61.6 | 56.5 | 61.3 |
| LLaVA-Video-7B-Qwen2 (128) | 62.7 | 69.7 | 76.5 | 62.1 | 56.6 | 61.1 |
| Gemini-1.5-Flash (0514) (256) | 62.4 | 66.1 | 73.1 | 63.1 | 57.3 | 61.6 |
| GPT-4-Turbo (0409) (256) | 60.7 | 66.4 | 71.1 | 61.7 | 54.5 | 59.1 |
| InternVL2-40B (16) | 60.6 | 71.4 | 76.6 | 57.5 | 54.4 | 59.3 |
| GPT-4o-mini (250) | 58.8 | 66.6 | 73.4 | 56.9 | 53.4 | 56.5 |
| MiniCPM-V-2.6 (64) | 57.7 | 62.5 | 69.1 | 54.9 | 49.8 | 54.9 |
| Qwen2-VL-7B (256) | 56.8 | 60.1 | 67.6 | 56.7 | 52.5 | 55.6 |
| Kangaroo (64) | 54.8 | 65.6 | 65.7 | 52.7 | 49.1 | 54.2 |
| PLLaVA-34B (32) | 53.5 | 60.1 | 66.8 | 50.8 | 49.1 | 53.2 |
| InternVL-Chat-V1-5-26B (16) | 51.7 | 61.3 | 62.7 | 49.5 | 46.6 | 51.2 |
| LLaVA-Next-Video-34B (32) | 50.5 | 57.6 | 61.6 | 48.7 | 45.9 | 50.5 |
| Phi-3-Vision-Instruct (16) | 49.9 | 58.3 | 59.6 | 48.4 | 45.1 | 49.6 |
| Idefics2 (16) | 49.4 | 57.4 | 60.4 | 47.3 | 44.7 | 49.7 |
| Mantis-Idefics2 (16) | 47.6 | 56.1 | 61.4 | 44.6 | 42.5 | 47.0 |
| LLaVA-Next-Mistral-7B (8) | 47.1 | 53.4 | 57.2 | 46.9 | 42.1 | 49.1 |
| PLLaVA-13B (32) | 45.1 | 52.9 | 54.3 | 42.9 | 41.2 | 45.6 |
| InstructBLIP-T5-XXL (8) | 43.8 | 48.1 | 50.1 | 44.5 | 40.0 | 43.3 |
| Mantis-BakLLaVA (16) | 43.7 | 51.3 | 52.7 | 41.1 | 40.1 | 43.7 |
| BLIP-2-T5-XXL (8) | 43.5 | 46.7 | 47.4 | 44.2 | 40.9 | 42.7 |
| LLaVA-Next-Video-M7B (32) | 43.5 | 50.9 | 53.1 | 42.6 | 38.9 | 43.5 |
| LLaVA-1.5-13B (8) | 43.1 | 49.0 | 51.1 | 41.8 | 39.6 | 43.4 |
| ShareGPT4Video (16) | 41.8 | 46.9 | 50.1 | 40.0 | 38.7 | 39.7 |
| VideoChat2 (Mistral-7B) (16) | 41.2 | 49.3 | 49.3 | 39.0 | 37.5 | 39.3 |
| LLaVA-1.5-7B (8) | 40.4 | 45.0 | 47.4 | 40.1 | 37.0 | 40.3 |
| mPLUG-Owl2 (8) | 39.4 | 49.4 | 47.3 | 38.7 | 34.3 | 39.1 |
| PLLaVA-7B (32) | 39.2 | 45.3 | 47.3 | 38.5 | 35.2 | 40.2 |
| VideoLLaVA (8) | 37.6 | 43.1 | 44.6 | 36.4 | 34.4 | 39.1 |
| VideoChat2 (Vicuna 7B) (16) | 35.1 | 38.1 | 40.5 | 33.5 | 33.6 | 36.0 |

Uses

  1. Download the dataset with the Hugging Face CLI:
huggingface-cli download longvideobench/LongVideoBench --repo-type dataset --local-dir LongVideoBench --local-dir-use-symlinks False
  2. Reassemble the split video archive, then extract both .tar files:
cat videos.tar.part.* > videos.tar
tar -xvf videos.tar
tar -xvf subtitles.tar
  3. Use the LongVideoBench dataloader to load the data from raw MP4 files and subtitles:
  • (a) Install the dataloader:
git clone https://github.com/LongVideoBench/LongVideoBench.git
cd LongVideoBench
pip install -e .
  • (b) Load the dataset in a Python script:
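from longvideobench import LongVideoBenchDataset

# YOUR_DATA_PATH is a placeholder: set it to your local dataset directory,
# e.g. "LongVideoBench" if you used the download command above.

# validation
dataset = LongVideoBenchDataset(YOUR_DATA_PATH, "lvb_val.json", max_num_frames=64)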

# test
dataset = LongVideoBenchDataset(YOUR_DATA_PATH, "lvb_test_wo_gt.json", max_num_frames=64)

print(dataset[0]["inputs"]) # A list consisting of PIL.Image and strings.

The "inputs" are interleaved video frames and text subtitles, followed by questions and option prompts. You can then convert them to the format that your LMMs can accept.
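
As one illustration of that conversion step, the sketch below turns an interleaved list into a list of typed content parts, a shape many chat-style LMM interfaces expect. The function name `to_chat_content` and the `{"type": ...}` dictionary layout are assumptions made for this example, not part of the LongVideoBench dataloader; adapt them to your model's actual input format.

```python
from typing import Any, Dict, List


def to_chat_content(inputs: List[Any]) -> List[Dict[str, Any]]:
    """Convert an interleaved list of frames and strings into content parts.

    Strings become {"type": "text"} parts (adjacent strings are merged).
    Every other object is assumed to be a video frame and is passed through
    unchanged as an {"type": "image"} part.
    """
    parts: List[Dict[str, Any]] = []
    for item in inputs:
        if isinstance(item, str):
            if parts and parts[-1]["type"] == "text":
                # Merge runs of adjacent strings into a single text part.
                parts[-1]["text"] += "\n" + item
            else:
                parts.append({"type": "text", "text": item})
        else:
            parts.append({"type": "image", "image": item})
    return parts


if __name__ == "__main__":
    # Stand-in objects replace PIL images so the sketch runs without the dataset.
    frame_a, frame_b = object(), object()
    demo = [frame_a, "subtitle one", frame_b, "Question: ...", "A. ...\nB. ..."]
    for part in to_chat_content(demo):
        print(part["type"])
```

With the real dataset, `to_chat_content(dataset[0]["inputs"])` would be the corresponding call; frame order is preserved, so subtitles stay next to the frames they accompany.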

Direct Use

This dataset is meant to evaluate LMMs on video understanding and long-context understanding abilities.
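
A minimal scoring sketch for such an evaluation is shown below. It computes overall accuracy and accuracy per duration group, mirroring the columns of the leaderboard. The field names `id`, `correct_choice`, and `duration_group` are assumptions about the validation annotations and should be checked against `lvb_val.json`; the function itself is hypothetical and not part of the official tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_group(
    records: Iterable[Mapping],
    predictions: Mapping[str, int],
    id_key: str = "id",                  # assumed annotation field name
    answer_key: str = "correct_choice",  # assumed annotation field name
    group_key: str = "duration_group",   # assumed annotation field name
) -> Dict[str, float]:
    """Return accuracy in percent, overall ("total") and per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        # A question with no prediction counts as incorrect.
        hit = predictions.get(rec[id_key]) == rec[answer_key]
        for bucket in ("total", str(rec[group_key])):
            total[bucket] += 1
            correct[bucket] += int(hit)
    return {bucket: 100.0 * correct[bucket] / total[bucket] for bucket in total}


if __name__ == "__main__":
    # Synthetic records for illustration only; these are not real annotations.
    records = [
        {"id": "q1", "correct_choice": 0, "duration_group": 15},
        {"id": "q2", "correct_choice": 2, "duration_group": 3600},
        {"id": "q3", "correct_choice": 1, "duration_group": 3600},
    ]
    predictions = {"q1": 0, "q2": 2, "q3": 3}
    print(accuracy_by_group(records, predictions))
```

This only applies to the validation split, since the test annotations ship without the correct choice.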

Out-of-Scope Use

We advise against using this dataset for training.

Dataset Structure

  • lvb_val.json: Validation set annotations.

  • lvb_test_wo_gt.json: Test set annotations. Correct choice is not provided.

  • videos.tar.part.*: Split archive of the videos. Concatenate the parts before extracting.

  • subtitles.tar: Archive of the subtitles.
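
On systems without `cat` and `tar`, the split video archive can be reassembled and unpacked with the Python standard library. The sketch below assumes the parts are named `videos.tar.part.aa`, `videos.tar.part.ab`, and so on, so that sorting by name gives the correct order; the helper names `join_parts` and `extract` are made up for this example.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(
    directory: Path,
    prefix: str = "videos.tar.part.",
    output_name: str = "videos.tar",
) -> Path:
    """Concatenate split archive parts, in name order, into one .tar file."""
    parts = sorted(directory.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    target = directory / output_name
    with target.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream each part so memory use stays low for multi-GB files.
                shutil.copyfileobj(src, out)
    return target


def extract(archive: Path, destination: Path) -> None:
    """Unpack a (trusted) tar archive into the destination directory."""
    with tarfile.open(archive) as tar:
        tar.extractall(destination)


if __name__ == "__main__":
    data_dir = Path("LongVideoBench")  # assumed local download directory
    if data_dir.is_dir():
        extract(join_parts(data_dir), data_dir)
        extract(data_dir / "subtitles.tar", data_dir)
```

The joined archive needs as much free disk space as all parts combined, on top of the space taken by the extracted files.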

Dataset Card Contact

haoning001@e.ntu.edu.sg

@misc{wu2024longvideobenchbenchmarklongcontextinterleaved,
      title={LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding}, 
      author={Haoning Wu and Dongxu Li and Bei Chen and Junnan Li},
      year={2024},
      eprint={2407.15754},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.15754}, 
}