# Evaluation Preprocessing

## MSR-VTT

Download the bounding box annotations for MSR-VTT from [here](https://drive.google.com/file/d/1OQvoR5zkohz5GpZxT0-fN1CPY9LjKT6y/view?usp=sharing). This is a pickle file containing a dictionary; each element holds the video id, the caption, the subject of the caption, and a sequence of bounding boxes. The annotations were generated with `get_fg_obj.py`.

The MSR-VTT videos themselves can be downloaded from [this link](https://cove.thecvf.com/datasets/839). We use the [StyleGAN-v repo's conversion script](https://github.com/universome/stylegan-v/blob/master/src/scripts/convert_videos_to_frames.py) to pre-process the dataset and split the videos into frames.

### Pre-processing

Our pre-processing pipeline works as follows (code sketches for the individual steps are collected at the end of this file):

1. Extract the subject of the caption with spaCy.
2. Feed the subject into OWL-ViT to obtain bounding boxes.
3. If OWL-ViT returns no bounding boxes for the subject, fall back to the next caption in the dataset. If it returns at least one bounding box, linearly interpolate bounding boxes for the frames without detections.

## ssv2-ST

The same pre-processing is applied to this dataset, with two differences: a larger OWL-ViT model is used (which slows pre-processing down significantly), and the first noun chunk of the caption is extracted instead of the subject. Downloading the dataset is somewhat involved; follow the instructions [here](https://github.com/MikeWangWZHL/Paxion#dataset-setup). Once the dataset is downloaded, run `generate_ssv2_st.py`.

## Interactive Motion Control (IMC)

We generate bounding boxes for this dataset with `generate_imc.py`. The prompts are in `custom_prompts.csv` and `filtered_prompts.csv`.
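
## Code sketches

The snippets below illustrate the individual pre-processing steps described above. They are minimal sketches rather than the actual code in `get_fg_obj.py` or `generate_ssv2_st.py`; file names, dictionary keys, and model checkpoints not mentioned above are assumptions.

First, loading the MSR-VTT annotation pickle. The file contains a dictionary per the description above, but the file name and the exact keys (`video_id`, `caption`, `subject`, `bboxes`) are hypothetical, so inspect the loaded object:

```python
import pickle

# File name and key names are assumptions -- inspect the loaded object
# to confirm the actual structure of the annotation file.
with open("msrvtt_bbox_annotations.pkl", "rb") as f:
    annotations = pickle.load(f)  # a dict, per the description above

# Grab one element and look at its fields.
sample = annotations[next(iter(annotations))]
print(sample["video_id"], sample["caption"], sample["subject"])
print(len(sample["bboxes"]), "bounding boxes")
```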
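
Next, the text side of the pipeline: a sketch of subject extraction with spaCy (the MSR-VTT variant) and first-noun-chunk extraction (the ssv2-ST variant). The dependency labels and pipeline size are our choices, not necessarily what `get_fg_obj.py` uses:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a parser works

def caption_subject(caption: str):
    """MSR-VTT variant: return the noun phrase acting as the caption's subject."""
    doc = nlp(caption)
    for token in doc:
        if token.dep_ in ("nsubj", "nsubjpass"):
            # Expand to the full subtree so "a small brown dog" stays intact.
            return " ".join(t.text for t in token.subtree)
    return None

def first_noun_chunk(caption: str):
    """ssv2-ST variant: return the first noun chunk instead of the subject."""
    doc = nlp(caption)
    return next((chunk.text for chunk in doc.noun_chunks), None)

print(caption_subject("a man is playing a guitar on stage"))   # "a man"
print(first_noun_chunk("a man is playing a guitar on stage"))  # "a man"
```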
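
The detection step. This sketch uses the Hugging Face `transformers` implementation of OWL-ViT, which is one way to reproduce it; the sections above do not specify which framework or which "larger" checkpoint the ssv2-ST pipeline uses:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Small public checkpoint; "google/owlvit-large-patch14" is the larger
# (and much slower) public alternative.
CKPT = "google/owlvit-base-patch32"
processor = OwlViTProcessor.from_pretrained(CKPT)
model = OwlViTForObjectDetection.from_pretrained(CKPT)

def detect(frame: Image.Image, query: str, threshold: float = 0.1):
    """Return (boxes, scores) for one text query on one frame.

    Boxes are [xmin, ymin, xmax, ymax] in pixel coordinates.
    """
    inputs = processor(text=[[query]], images=frame, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([frame.size[::-1]])  # (height, width)
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return result["boxes"], result["scores"]
```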
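
Finally, the interpolation step. When at least one frame has a detection, boxes for the remaining frames are filled in by linear interpolation; a self-contained sketch with NumPy:

```python
import numpy as np

def interpolate_boxes(detected_frames, detected_boxes, num_frames):
    """Fill in boxes for frames without detections by linear interpolation.

    detected_frames: sorted frame indices that have a detection, e.g. [0, 4]
    detected_boxes:  array of shape (len(detected_frames), 4) holding
                     [xmin, ymin, xmax, ymax] per detected frame
    num_frames:      total number of frames in the clip

    Returns an array of shape (num_frames, 4). Frames before the first or
    after the last detection are clamped to the nearest detected box.
    """
    detected_boxes = np.asarray(detected_boxes, dtype=np.float64)
    all_frames = np.arange(num_frames)
    # Interpolate each of the four coordinates independently.
    return np.stack(
        [np.interp(all_frames, detected_frames, detected_boxes[:, c]) for c in range(4)],
        axis=1,
    )

boxes = interpolate_boxes([0, 4], [[10, 10, 50, 50], [30, 10, 70, 50]], 5)
print(boxes[2])  # midway between the two detections: [20. 10. 60. 50.]
```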