# LaViLa Pretraining

In this doc, we provide a step-by-step guide (with commands) to train LaViLa. Note that we recommend running the following jobs with four 8x V100 (32GB) nodes (or eight nodes for the larger backbone) using [submitit](https://github.com/facebookincubator/submitit). See how to install submitit [here](./MODEL_ZOO.md#multi-node-training).

## Pre-training Dual-Encoder Baseline

We first pre-train a dual-encoder baseline with human annotations on Ego4D clips. The goal is (1) to establish a comparable baseline for LaViLa, and (2) to provide a video encoder for the narrator (see below). We use a default batch size of 32 per GPU, so that the total batch size for the InfoNCE loss is `32*8*4=1024`.
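For reference, the dual-encoder is trained with a symmetric (video-to-text and text-to-video) InfoNCE objective over this global batch. Below is a minimal sketch of such a loss, assuming already-gathered, L2-normalized embeddings and an illustrative fixed temperature; it is not the exact loss implementation behind the `--contrastive-use-vissl` flag.

```python
# Minimal sketch of a symmetric InfoNCE (CLIP-style) loss; assumes the video
# and text embeddings of the global batch are already gathered and
# L2-normalized, and uses an illustrative fixed temperature.
import torch
import torch.nn.functional as F


def infonce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    logits = video_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)        # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)    # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)


# Toy example with random unit-norm embeddings (B would be 1024 in the setup above).
video_emb = F.normalize(torch.randn(1024, 256), dim=-1)
text_emb = F.normalize(torch.randn(1024, 256), dim=-1)
print(infonce_loss(video_emb, text_emb).item())
```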
Train a baseline dual-encoder (with TSF-B)

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_BASE \
    --norm-embed --freeze-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 4 --use_volta32
```
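This command (and the ones below) is dispatched to the cluster through submitit. The sketch below only illustrates the general `AutoExecutor` pattern; the entry function, log folder, and SLURM partition are placeholders, not the repository's actual launcher.

```python
# Minimal sketch of a multi-node submitit launch; the entry function, log
# folder, and SLURM partition are placeholders, not the repository's setup.
import submitit


def train_entry():
    # In the real script this would call the distributed training main() on every task.
    print("hello from one task")


executor = submitit.AutoExecutor(folder="submitit_logs/%j")
executor.update_parameters(
    nodes=4,              # four 8x V100 nodes, as recommended above
    gpus_per_node=8,
    tasks_per_node=8,     # one process per GPU
    cpus_per_task=10,
    timeout_min=72 * 60,
    slurm_partition="your_partition",  # placeholder: adjust to your cluster
)
job = executor.submit(train_entry)
print(job.job_id)
```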
To fit a high-resolution TimeSformer-Large with a sufficient batch size, we use [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert), a memory-efficient text encoder, instead of the original text encoder in CLIP. Additionally, we apply [gradient checkpointing](https://pytorch.org/docs/stable/checkpoint.html) and the [Zero Redundancy Optimizer (ZeRO)](https://arxiv.org/abs/1910.02054).
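For illustration, here is what these two memory savers look like in plain PyTorch, independent of the `--use-checkpoint` / `--use-zero` flags; the module and optimizer choices are placeholders.

```python
# Illustrative sketch of activation (gradient) checkpointing and ZeRO in plain
# PyTorch; module and optimizer choices are placeholders, not LaViLa's config.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.distributed.optim import ZeroRedundancyOptimizer


class CheckpointedBlock(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Intermediate activations of `self.mlp` are recomputed during the
        # backward pass instead of being stored, trading compute for memory.
        return checkpoint(self.mlp, x, use_reentrant=False)


x = torch.randn(2, 16, 1024, requires_grad=True)
print(CheckpointedBlock()(x).shape)  # torch.Size([2, 16, 1024])

# ZeRO stage 1 shards the optimizer states across DDP ranks. It needs an
# initialized process group, so this part is shown commented out:
# model = nn.parallel.DistributedDataParallel(CheckpointedBlock().cuda())
# optimizer = ZeroRedundancyOptimizer(
#     model.parameters(), optimizer_class=torch.optim.AdamW, lr=3e-5
# )
```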
Train a baseline dual-encoder (with TSF-L@HR)

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_LARGE_336PX_DISTILBERT_BASE \
    --batch-size 8 \
    --use-checkpoint --use-zero \
    --norm-embed --freeze-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 8 --use_volta32
```
## Training and Evaluating the Narrator

The narrator is a *visually conditioned* large language model (VCLM), which comprises a pre-trained video encoder (obtained above), a text decoder (GPT-2 family), and a few gated cross-attention modules that attend to visual information while captioning. Both the video encoder and the text decoder are kept frozen, while the cross-attention modules are learnable (a schematic sketch of such a gated block follows the training commands below).

Note that we turn off PyTorch's automatic mixed precision (AMP) when training the narrator; we observe that training is unstable with AMP on. Also note that `$PATH` can be found in the `Vis. Encoder` column of [MODEL_ZOO.md#Narrator](./MODEL_ZOO.md#narrator). If you are using your own checkpoint (e.g. one pre-trained in the previous step), please make sure that the following keys have been dropped from the checkpoint: `epoch`, `optimizer`, and `scaler`.
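If you need to strip those keys yourself, a minimal sketch is shown below (the checkpoint path is illustrative):

```python
# Drop training-state keys from a dual-encoder checkpoint so it can be passed
# to the narrator via --resume (the path is illustrative, not a provided file).
import torch

ckpt_path = "modelzoo/my_dual_encoder.pth"            # your own checkpoint
ckpt = torch.load(ckpt_path, map_location="cpu")
for key in ("epoch", "optimizer", "scaler"):
    ckpt.pop(key, None)                               # remove the key if present
torch.save(ckpt, ckpt_path.replace(".pth", ".baseline.pth"))
```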
Train a baseline narrator (TSF-B as visual encoder, GPT-2 base as textual decoder)

```bash
python run_with_submitit_pretrain.py \
    --model VCLM_OPENAI_TIMESFORMER_BASE_GPT2 \
    --gated-xattn --freeze-lm-vclm --freeze-visual-vclm --freeze-visual-vclm-temporal \
    --fix-lr --batch-size 8 --clip-grad-value 1.0 --eval-freq 1 --disable-amp \
    --nodes 4 --use_volta32 --resume $PATH  # e.g. $PATH can be "modelzoo/clip_openai_timesformer_base.baseline.ep_0003.pth"
```
Train a strong narrator (TSF-L@HR as visual encoder, GPT-2 XL as textual decoder)

```bash
python run_with_submitit_pretrain.py \
    --model VCLM_OPENAI_TIMESFORMER_LARGE_336PX_GPT2_XL \
    --gated-xattn --freeze-lm-vclm --freeze-visual-vclm --freeze-visual-vclm-temporal --use-checkpoint \
    --fix-lr --batch-size 8 --clip-grad-value 1.0 --eval-freq 1 --disable-amp \
    --nodes 4 --use_volta32 --resume $PATH  # e.g. $PATH can be "modelzoo/clip_openai_timesformer_large_336px_distilbert_base.baseline.ep_0003.pth"
```
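For reference, the gated cross-attention modules enabled by `--gated-xattn` roughly follow a tanh-gated design in the spirit of Flamingo. The sketch below is schematic only; layer sizes and names are illustrative, not the repository's implementation.

```python
# Schematic sketch of a tanh-gated cross-attention block; sizes and names are
# illustrative, not LaViLa's actual modules.
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # The gate starts at zero, so training begins from the frozen LM's behavior.
        self.attn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        q = self.norm(text_tokens)
        attn_out, _ = self.attn(q, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.attn_gate) * attn_out


# Toy shapes: 8 text tokens attending over 16 visual tokens.
text = torch.randn(2, 8, 768)
visual = torch.randn(2, 16, 768)
print(GatedCrossAttentionBlock()(text, visual).shape)  # torch.Size([2, 8, 768])
```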
Evaluate the narrator on the Ego4D val split

```bash
torchrun --nproc_per_node=1 eval_narrator.py \
    --caption-top-p 0.95 --caption-temperature 0.7 \
    --eval-freq 10000 \
    --resume $VCLM_CHECKPOINT
```

Here `--eval-freq 10000` evaluates on a 1/10000 subset of the Ego4D val split for fast evaluation. This will output some common NLG metrics, such as BLEU-x, METEOR, ROUGE_L, and CIDEr (using the human narrations as ground truth).
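If you want to reproduce such metrics outside of `eval_narrator.py`, one option is the `pycocoevalcap` package; this is an assumption about tooling, not necessarily the exact scorer the script uses.

```python
# Minimal sketch of computing caption metrics with pycocoevalcap
# (illustrative; not necessarily the scorer used by eval_narrator.py).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# One list of ground-truth narrations and one generated caption per clip id (toy data).
gts = {"clip_0": ["#C C opens the drawer."], "clip_1": ["#C C picks up a knife."]}
res = {"clip_0": ["#C C opens a drawer."], "clip_1": ["#C C picks a knife."]}

bleu, _ = Bleu(4).compute_score(gts, res)      # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
rouge_l, _ = Rouge().compute_score(gts, res)
cider, _ = Cider().compute_score(gts, res)
print(bleu, rouge_l, cider)
# METEOR (pycocoevalcap.meteor) additionally requires a Java runtime.
```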
## Narrating video clips using LaViLa-Narrator
Infer the narrator

```bash
python run_with_submitit_infer_narrator.py \
    --metadata datasets/Ego4D/ego4d_train.pkl \
    --batch-size 64 \
    --resume $PATH --use-half \
    --nodes 4 --use_volta32
```
It will generate a pickle file (`$output_dir/total.pkl`), which is a list of quintuples: `(video_uid: str, start_time: float, end_time: float, narration_list: List[str], NLL_list: List[float])`.

For narrator-generated narrations on Ego4D ground-truth clips, we also provide a [replica](https://dl.fbaipublicfiles.com/lavila/metadata/ego4d/ego4d_train.narrator_63690737.return_10.pkl). Note that the narrator used here is our best-performing one.

## Rephrasing human narrations using LaViLa-Rephraser

The Rephraser is a standard LLM that can paraphrase narrations in existing clips. Specifically, we use an off-the-shelf T5-based paraphraser that is publicly available on [Hugging Face's model hub](https://huggingface.co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality). For more details, please refer to the [model card](https://huggingface.co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality).

For rephrased human narrations on Ego4D ground-truth clips, we provide a [replica](https://dl.fbaipublicfiles.com/lavila/metadata/ego4d/ego4d_train.rephraser.no_punkt_top3.pkl).

## Pre-training the LaViLa Dual-Encoder

Now we are ready to pre-train LaViLa's dual-encoder by combining the human annotations (augmented by the Rephraser) and the Narrator-generated narrations.
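Before launching, it can help to sanity-check that the narrator-generated metadata (passed below via `--metadata-aux`) loads and matches the quintuple layout described earlier; a minimal sketch:

```python
# Quick sanity check of the narrator-generated metadata; the unpacking assumes
# the quintuple layout described above.
import pickle

with open("datasets/Ego4D/ego4d_train.narrator_63690737.return_10.pkl", "rb") as f:
    narrator_meta = pickle.load(f)

video_uid, start_time, end_time, narration_list, nll_list = narrator_meta[0]
print(video_uid, start_time, end_time)
print(len(narration_list), "candidate narrations; first:", narration_list[0])
```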
Training a LaViLa dual-encoder

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_BASE \
    --metadata datasets/Ego4D/ego4d_train.rephraser.no_punkt_top3.pkl \
    --metadata-aux datasets/Ego4D/ego4d_train.narrator_63690737.return_10.pkl \
    --norm-embed --freeze-temperature \
    --freeze-pseudo-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 4 --use_volta32
```
## Down-stream Evaluation

With the pre-trained dual-encoder at hand, we can now run zero-shot or fine-tuned evaluations on down-stream benchmarks. Please refer to [MODEL_ZOO.md](./MODEL_ZOO.md#zero-shot) for more details.