Dataset Download and Management

Dataset Format

The training data should be provided in a CSV file with the following format:

/absolute/path/to/image1.jpg, caption1, num_of_frames
/absolute/path/to/image2.jpg, caption2, num_of_frames

HD-VG-130M

This dataset comprises 130M text-video pairs. You can download the dataset and prepare it for training according to the dataset repository's instructions. There is a README.md file in the Google Drive link that provides instructions on how to download and cut the videos. For this version, we directly use the dataset provided by the authors.

Demo Dataset

You can use ImageNet and UCF101 for a quick demo. After downloading the datasets, you can use the following command to prepare the csv file for the dataset:

# ImageNet
python -m tools.datasets.convert_dataset imagenet IMAGENET_FOLDER --split train
# UCF101
python -m tools.datasets.convert_dataset ucf101 UCF101_FOLDER --split videos

Manage datasets

We provide csvutils.py to manage the CSV files. You can use the following commands to process the CSV files:

# generate DATA_fmin_128_fmax_256.csv with frames between 128 and 256
python -m tools.datasets.csvutil DATA.csv --fmin 128 --fmax 256
# generate DATA_root.csv with absolute path
python -m tools.datasets.csvutil DATA.csv --root /absolute/path/to/dataset
# remove videos with no captions
python -m tools.datasets.csvutil DATA.csv --remove-empty-caption
# compute the number of frames for each video
python -m tools.datasets.csvutil DATA.csv --relength
# remove caption prefix
python -m tools.datasets.csvutil DATA.csv --remove-caption-prefix

To merge multiple CSV files, you can use the following command:

cat *csv > combined.csv