# LaViLa Pretraining

In this doc, we provide a step-by-step guide (with commands) to train LaViLa. Note that we recommend running the following jobs with four 8x V100 (32GB) nodes (or eight nodes for the larger backbone) using [submitit](https://github.com/facebookincubator/submitit). See how to install submitit [here](./MODEL_ZOO.md#multi-node-training).

## Pre-training Dual-Encoder Baseline

We first pre-train a dual-encoder baseline with human annotations on Ego4D clips. The goal is (1) to establish a comparable baseline for LaViLa, and (2) to provide a video encoder for the narrator (see below). We use a default batch size of 32 per GPU, so that the total batch size for the InfoNCE loss is `32*8*4=1024`.
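For reference, the dual-encoder is trained with a symmetric (video-to-text and text-to-video) InfoNCE objective over this global batch. Below is a minimal sketch of such a loss, assuming already-gathered, L2-normalized embeddings and an illustrative fixed temperature; it is not the exact loss implementation behind the `--contrastive-use-vissl` flag.

```python
# Minimal sketch of a symmetric InfoNCE (CLIP-style) loss; assumes the video
# and text embeddings of the global batch are already gathered and
# L2-normalized, and uses an illustrative fixed temperature.
import torch
import torch.nn.functional as F


def infonce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    logits = video_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)        # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)    # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)


# Toy example with random unit-norm embeddings (B would be 1024 in the setup above).
video_emb = F.normalize(torch.randn(1024, 256), dim=-1)
text_emb = F.normalize(torch.randn(1024, 256), dim=-1)
print(infonce_loss(video_emb, text_emb).item())
```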
Train a baseline dual-encoder (with TSF-B)

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_BASE \
    --norm-embed --freeze-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 4 --use_volta32
```
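This command (and the ones below) is dispatched to the cluster through submitit. The sketch below only illustrates the general `AutoExecutor` pattern; the entry function, log folder, and SLURM partition are placeholders, not the repository's actual launcher.

```python
# Minimal sketch of a multi-node submitit launch; the entry function, log
# folder, and SLURM partition are placeholders, not the repository's setup.
import submitit


def train_entry():
    # In the real script this would call the distributed training main() on every task.
    print("hello from one task")


executor = submitit.AutoExecutor(folder="submitit_logs/%j")
executor.update_parameters(
    nodes=4,              # four 8x V100 nodes, as recommended above
    gpus_per_node=8,
    tasks_per_node=8,     # one process per GPU
    cpus_per_task=10,
    timeout_min=72 * 60,
    slurm_partition="your_partition",  # placeholder: adjust to your cluster
)
job = executor.submit(train_entry)
print(job.job_id)
```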
To fit a high-resolution TimeSformer-Large with a sufficient batch size, we use [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert), a memory-efficient text encoder, instead of the original text encoder in CLIP. Additionally, we apply [gradient checkpointing](https://pytorch.org/docs/stable/checkpoint.html) and the [Zero Redundancy Optimizer (ZeRO)](https://arxiv.org/abs/1910.02054).
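For illustration, here is what these two memory savers look like in plain PyTorch, independent of the `--use-checkpoint` / `--use-zero` flags; the module and optimizer choices are placeholders.

```python
# Illustrative sketch of activation (gradient) checkpointing and ZeRO in plain
# PyTorch; module and optimizer choices are placeholders, not LaViLa's config.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.distributed.optim import ZeroRedundancyOptimizer


class CheckpointedBlock(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Intermediate activations of `self.mlp` are recomputed during the
        # backward pass instead of being stored, trading compute for memory.
        return checkpoint(self.mlp, x, use_reentrant=False)


x = torch.randn(2, 16, 1024, requires_grad=True)
print(CheckpointedBlock()(x).shape)  # torch.Size([2, 16, 1024])

# ZeRO stage 1 shards the optimizer states across DDP ranks. It needs an
# initialized process group, so this part is shown commented out:
# model = nn.parallel.DistributedDataParallel(CheckpointedBlock().cuda())
# optimizer = ZeroRedundancyOptimizer(
#     model.parameters(), optimizer_class=torch.optim.AdamW, lr=3e-5
# )
```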
Train a baseline dual-encoder (with TSF-L@HR)

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_LARGE_336PX_DISTILBERT_BASE \
    --batch-size 8 \
    --use-checkpoint --use-zero \
    --norm-embed --freeze-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 8 --use_volta32
```
## Training and Evaluating the Narrator

The narrator is a *visually conditioned* large language model (VCLM), which comprises a pre-trained video encoder (obtained above), a text decoder (GPT-2 family), and a few gated cross-attention modules that attend to visual information while captioning. Both the video encoder and the text decoder are kept frozen, while the cross-attention modules are learnable (a schematic sketch of such a gated block follows the training commands below).

Note that we turn off PyTorch's automatic mixed precision (AMP) when training the narrator; we observe that training is unstable with AMP on. Also note that `$PATH` can be found in the `Vis. Encoder` column of [MODEL_ZOO.md#Narrator](./MODEL_ZOO.md#narrator). If you are using your own checkpoint (e.g. one pre-trained in the previous step), please make sure that the following keys have been dropped from the checkpoint: `epoch`, `optimizer`, and `scaler`.
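If you need to strip those keys yourself, a minimal sketch is shown below (the checkpoint path is illustrative):

```python
# Drop training-state keys from a dual-encoder checkpoint so it can be passed
# to the narrator via --resume (the path is illustrative, not a provided file).
import torch

ckpt_path = "modelzoo/my_dual_encoder.pth"            # your own checkpoint
ckpt = torch.load(ckpt_path, map_location="cpu")
for key in ("epoch", "optimizer", "scaler"):
    ckpt.pop(key, None)                               # remove the key if present
torch.save(ckpt, ckpt_path.replace(".pth", ".baseline.pth"))
```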
Train a baseline narrator (TSF-B as visual encoder, GPT-2 base as textual decoder)

```bash
python run_with_submitit_pretrain.py \
    --model VCLM_OPENAI_TIMESFORMER_BASE_GPT2 \
    --gated-xattn --freeze-lm-vclm --freeze-visual-vclm --freeze-visual-vclm-temporal \
    --fix-lr --batch-size 8 --clip-grad-value 1.0 --eval-freq 1 --disable-amp \
    --nodes 4 --use_volta32 --resume $PATH  # e.g. $PATH can be "modelzoo/clip_openai_timesformer_base.baseline.ep_0003.pth"
```
Train a strong narrator (TSF-L@HR as visual encoder, GPT-2 XL as textual decoder)

```bash
python run_with_submitit_pretrain.py \
    --model VCLM_OPENAI_TIMESFORMER_LARGE_336PX_GPT2_XL \
    --gated-xattn --freeze-lm-vclm --freeze-visual-vclm --freeze-visual-vclm-temporal --use-checkpoint \
    --fix-lr --batch-size 8 --clip-grad-value 1.0 --eval-freq 1 --disable-amp \
    --nodes 4 --use_volta32 --resume $PATH  # e.g. $PATH can be "modelzoo/clip_openai_timesformer_large_336px_distilbert_base.baseline.ep_0003.pth"
```
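For reference, the gated cross-attention modules enabled by `--gated-xattn` roughly follow a tanh-gated design in the spirit of Flamingo. The sketch below is schematic only; layer sizes and names are illustrative, not the repository's implementation.

```python
# Schematic sketch of a tanh-gated cross-attention block; sizes and names are
# illustrative, not LaViLa's actual modules.
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # The gate starts at zero, so training begins from the frozen LM's behavior.
        self.attn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        q = self.norm(text_tokens)
        attn_out, _ = self.attn(q, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.attn_gate) * attn_out


# Toy shapes: 8 text tokens attending over 16 visual tokens.
text = torch.randn(2, 8, 768)
visual = torch.randn(2, 16, 768)
print(GatedCrossAttentionBlock()(text, visual).shape)  # torch.Size([2, 8, 768])
```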
Evaluate the narrator on the Ego4D val split

```bash
torchrun --nproc_per_node=1 eval_narrator.py \
    --caption-top-p 0.95 --caption-temperature 0.7 \
    --eval-freq 10000 \
    --resume $VCLM_CHECKPOINT
```

Here `--eval-freq 10000` evaluates on a 1/10000 subset of the Ego4D val split for fast evaluation. This will output some common NLG metrics, such as BLEU-x, METEOR, ROUGE_L, and CIDEr (using the human narrations as ground truth).
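If you want to reproduce such metrics outside of `eval_narrator.py`, one option is the `pycocoevalcap` package; this is an assumption about tooling, not necessarily the exact scorer the script uses.

```python
# Minimal sketch of computing caption metrics with pycocoevalcap
# (illustrative; not necessarily the scorer used by eval_narrator.py).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# One list of ground-truth narrations and one generated caption per clip id (toy data).
gts = {"clip_0": ["#C C opens the drawer."], "clip_1": ["#C C picks up a knife."]}
res = {"clip_0": ["#C C opens a drawer."], "clip_1": ["#C C picks a knife."]}

bleu, _ = Bleu(4).compute_score(gts, res)      # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
rouge_l, _ = Rouge().compute_score(gts, res)
cider, _ = Cider().compute_score(gts, res)
print(bleu, rouge_l, cider)
# METEOR (pycocoevalcap.meteor) additionally requires a Java runtime.
```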
## Narrating video clips using LaViLa-Narrator
Infer the narrator

```bash
python run_with_submitit_infer_narrator.py \
    --metadata datasets/Ego4D/ego4d_train.pkl \
    --batch-size 64 \
    --resume $PATH --use-half \
    --nodes 4 --use_volta32
```
It will generate a pickle file (`$output_dir/total.pkl`), which is a list of quintuples: `(video_uid: str, start_time: float, end_time: float, narration_list: List[str], NLL_list: List[float])`.

For narrator-generated narrations on Ego4D ground-truth clips, we also provide a [replica](https://dl.fbaipublicfiles.com/lavila/metadata/ego4d/ego4d_train.narrator_63690737.return_10.pkl). Note that the narrator used here is our best-performing one.

## Rephrasing human narrations using LaViLa-Rephraser

The Rephraser is a standard LLM that can paraphrase narrations in existing clips. Specifically, we use an off-the-shelf T5-based paraphraser that is publicly available on [Hugging Face's model hub](https://huggingface.co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality). For more details, please refer to the [model card](https://huggingface.co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality).

For rephrased human narrations on Ego4D ground-truth clips, we provide a [replica](https://dl.fbaipublicfiles.com/lavila/metadata/ego4d/ego4d_train.rephraser.no_punkt_top3.pkl).

## Pre-training the LaViLa Dual-Encoder

Now we are ready to pre-train LaViLa's dual-encoder by combining the human annotations (augmented by the Rephraser) and the Narrator-generated narrations.
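Before launching, it can help to sanity-check that the narrator-generated metadata (passed below via `--metadata-aux`) loads and matches the quintuple layout described earlier; a minimal sketch:

```python
# Quick sanity check of the narrator-generated metadata; the unpacking assumes
# the quintuple layout described above.
import pickle

with open("datasets/Ego4D/ego4d_train.narrator_63690737.return_10.pkl", "rb") as f:
    narrator_meta = pickle.load(f)

video_uid, start_time, end_time, narration_list, nll_list = narrator_meta[0]
print(video_uid, start_time, end_time)
print(len(narration_list), "candidate narrations; first:", narration_list[0])
```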
Training a LaViLa dual-encoder

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_BASE \
    --metadata datasets/Ego4D/ego4d_train.rephraser.no_punkt_top3.pkl \
    --metadata-aux datasets/Ego4D/ego4d_train.narrator_63690737.return_10.pkl \
    --norm-embed --freeze-temperature \
    --freeze-pseudo-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 4 --use_volta32
```
## Down-stream Evaluation

With the pre-trained dual-encoder at hand, we can now run zero-shot or fine-tuned evaluations on down-stream benchmarks. Please refer to [MODEL_ZOO.md](./MODEL_ZOO.md#zero-shot) for more details.