For training data, we combine subsets of two datasets: LLaVA-Video-178K and Valley.
| Stage | Source | #Samples |
| --- | --- | --- |
| Pretrain | LLaVA-Video-178K + Valley | 397k |
| Finetune | LLaVA-Video-178K | 491k |
## Pretrain Data
We use four subsets of LLaVA-Video-178K: `0_30_s_academic_v0_1`, `30_60_s_academic_v0_1`, `0_30_s_youtube_v0_1`, and `30_60_s_youtube_v0_1`, supplemented with the filtered Valley data from Video-LLaVA.
We provide the cleaned annotation files; the video data can be downloaded from LLaVA-Video-178K and Video-LLaVA.
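Assuming each cleaned annotation file is a JSON array of sample records (an assumption; check the released files for the exact schema), the caption and open-ended QA annotations can be merged into a single sample list like this:

```python
import json
from pathlib import Path

# File names match the text_files/ layout described below.
ANNOTATION_FILES = (
    "cleaned_video_caption.json",
    "cleaned_video_openqa.json",
)

def load_annotations(text_dir):
    """Merge the cleaned annotation files found in text_dir into one list.

    Assumes each file is a JSON array of sample dicts; files that are
    absent are simply skipped.
    """
    samples = []
    for name in ANNOTATION_FILES:
        path = Path(text_dir) / name
        if path.exists():
            with open(path, "r", encoding="utf-8") as f:
                samples.extend(json.load(f))
    return samples
```

This is only a sketch for inspecting the data; the actual training code may consume the files differently.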
## Finetune Data
We use four subsets of LLaVA-Video-178K: `0_30_s_academic_v0_1`, `30_60_s_academic_v0_1`, `0_30_s_youtube_v0_1`, and `30_60_s_youtube_v0_1`.
We provide the cleaned annotation files; the video data can be downloaded from LLaVA-Video-178K.
## Organize Data
Organize the video files and annotation files as follows in `path/to/your/dataset`:
```
dataset
├── academic_source
├── liwei_youtube_videos
├── valley
└── text_files
    ├── cleaned_video_caption.json
    └── cleaned_video_openqa.json
```
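As a quick sanity check before training, a minimal sketch (not part of the repo) that verifies the layout above is in place:

```python
from pathlib import Path

# Top-level video directories and annotation files from the layout above.
EXPECTED_DIRS = ("academic_source", "liwei_youtube_videos", "valley")
EXPECTED_FILES = (
    "text_files/cleaned_video_caption.json",
    "text_files/cleaned_video_openqa.json",
)

def check_dataset_layout(root):
    """Return the list of expected paths missing under root (empty if OK)."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Running `check_dataset_layout("path/to/your/dataset")` and confirming it returns an empty list helps catch download or extraction mistakes early.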
Note: If there is any infringement, please contact us for removal.