# LAVILA Pretraining

In this doc, we provide a step-by-step guide (with commands) to train LaViLa. Note that we recommend running the following jobs on four 8x V100 (32GB) nodes (or eight nodes for the larger backbone) using submitit; see the submitit repository for installation instructions.

## Pre-training Dual-Encoder Baseline

We first pre-train a dual-encoder baseline with human annotations on Ego4D clips. The goal is (1) to establish a comparable baseline for LAVILA, and (2) to provide a video encoder for the narrator (see below). We use a default batch size of 32 per GPU, so that the total batch size for the InfoNCE loss is 32 * 8 * 4 = 1024.
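
For intuition, the sketch below shows a symmetric InfoNCE loss over a batch of L2-normalized video and text embeddings; this is a minimal illustration, not the repository's actual loss implementation, and the tensor names are placeholders.

```python
# Minimal sketch of a symmetric InfoNCE loss over a batch of embeddings.
# Illustrative only; `video_emb` / `text_emb` are placeholder names.
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, D) L2-normalized embeddings
    logits = video_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)              # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)          # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)

# With 32 clips per GPU, 8 GPUs per node, and 4 nodes, the effective batch
# size (and hence the number of negatives per sample) is 32 * 8 * 4 = 1024.
```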

**Train a baseline dual-encoder (with TSF-B):**

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_BASE \
    --norm-embed --freeze-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 4 --use_volta32
```

To fit a high-resolution TimeSformer-Large with a sufficient batch size, we use DistilBERT, a memory-efficient text encoder, instead of the original text encoder in CLIP. Additionally, we apply gradient checkpointing and the Zero Redundancy Optimizer (ZeRO).
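
For intuition on what these two options do (this is a generic PyTorch sketch, not the exact code path behind `--use-checkpoint` and `--use-zero`): gradient checkpointing recomputes activations during the backward pass instead of storing them, and ZeRO shards the optimizer state across data-parallel workers.

```python
# Hedged sketch: enabling gradient checkpointing and ZeRO in plain PyTorch.
# Module and variable names are placeholders, not LaViLa's exact classes.
import torch
from torch.utils.checkpoint import checkpoint
from torch.distributed.optim import ZeroRedundancyOptimizer

def forward_with_checkpointing(blocks, x):
    # Recompute each block's activations during backward instead of caching
    # them, trading extra compute for a lower peak memory footprint.
    for block in blocks:
        x = checkpoint(block, x)
    return x

# Shard optimizer state (e.g., Adam moments) across ranks so each GPU only
# holds 1/world_size of it. Requires torch.distributed to be initialized:
# optimizer = ZeroRedundancyOptimizer(
#     model.parameters(), optimizer_class=torch.optim.AdamW, lr=3e-5)
```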

**Train a baseline dual-encoder (with TSF-L@HR):**

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_LARGE_336PX_DISTILBERT_BASE \
    --batch-size 8 \
    --use-checkpoint --use-zero \
    --norm-embed --freeze-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 8 --use_volta32
```

## Training and Evaluating the Narrator

The narrator is a visually conditioned large language model (VCLM), which comprises a pre-trained video encoder (obtained above), a text decoder (from the GPT-2 family), and a few gated cross-attention modules that attend to visual information while captioning. Both the video encoder and the text decoder are kept frozen, while the cross-attention modules are learnable.
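
As a rough illustration of the gating idea (a Flamingo-style sketch, not LaViLa's exact module; layer names and shapes are assumptions), a tanh-gated cross-attention block can be written so that its learnable gate starts at zero, leaving the frozen language model untouched at initialization:

```python
# Hedged sketch of a tanh-gated cross-attention block; dimensions and layer
# names are illustrative, not LaViLa's actual implementation.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate starts at 0, so tanh(gate) = 0 and the block is initially a no-op.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text_tokens:   (B, T, D) hidden states from the frozen text decoder
        # visual_tokens: (B, N, D) features from the frozen video encoder
        attn_out, _ = self.attn(self.norm(text_tokens), visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attn_out
```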

Note that we turn off PyTorch's automatic mixed precision (AMP) when training the narrator; we observe that training is unstable if AMP is on.

Also note that `$PATH` can be found in the Vis. Encoder column of MODEL_ZOO.md#Narrator. If you are using your own checkpoint (e.g., one pre-trained in the previous step), please make sure that the following keys have been dropped from the checkpoint: `epoch`, `optimizer`, and `scaler`.
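
A minimal way to strip those keys from your own checkpoint might look like the following; the file paths are placeholders.

```python
# Drop training-state keys from a dual-encoder checkpoint before using it
# to initialize the narrator. Paths are examples; adjust to your setup.
import torch

ckpt = torch.load("modelzoo/my_dual_encoder.pth", map_location="cpu")
for key in ("epoch", "optimizer", "scaler"):
    ckpt.pop(key, None)
torch.save(ckpt, "modelzoo/my_dual_encoder.stripped.pth")
```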

**Train a baseline narrator (TSF-B as visual encoder and GPT-2 base as textual decoder):**

```bash
python run_with_submitit_pretrain.py \
    --model VCLM_OPENAI_TIMESFORMER_BASE_GPT2 \
    --gated-xattn --freeze-lm-vclm --freeze-visual-vclm --freeze-visual-vclm-temporal \
    --fix-lr --batch-size 8 --clip-grad-value 1.0 --eval-freq 1 --disable-amp \
    --nodes 4 --use_volta32 --resume $PATH   # e.g., $PATH can be "modelzoo/clip_openai_timesformer_base.baseline.ep_0003.pth"
```

**Train a strong narrator (TSF-L@HR as visual encoder and GPT-2 XL as textual decoder):**

```bash
python run_with_submitit_pretrain.py \
    --model VCLM_OPENAI_TIMESFORMER_LARGE_336PX_GPT2_XL \
    --gated-xattn --freeze-lm-vclm --freeze-visual-vclm --freeze-visual-vclm-temporal --use-checkpoint \
    --fix-lr --batch-size 8 --clip-grad-value 1.0 --eval-freq 1 --disable-amp \
    --nodes 4 --use_volta32 --resume $PATH   # e.g., $PATH can be "modelzoo/clip_openai_timesformer_large_336px_distilbert_base.baseline.ep_0003.pth"
```

**Evaluate the narrator on the Ego4D val split** (`--eval-freq 10000` evaluates on a 1/10000 subset of the val split for fast evaluation):

```bash
torchrun --nproc_per_node=1 eval_narrator.py \
    --caption-top-p 0.95 --caption-temperature 0.7 \
    --eval-freq 10000 \
    --resume $VCLM_CHECKPOINT
```

This will output some common NLG metrics, such as BLEU-x, METEOR, ROUGE_L, and CIDEr (using the human narrations as ground-truth).
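
If you want to score your own caption dumps offline, a minimal sketch with the pycocoevalcap package (assuming predictions and references are grouped per clip id; this is not the exact code used by eval_narrator.py) is:

```python
# Hedged sketch: scoring generated narrations against human narrations with
# pycocoevalcap. `refs` and `hyps` map a clip id to a list of strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

refs = {"clip_0": ["#C C opens the drawer."]}       # ground-truth narrations
hyps = {"clip_0": ["#C C pulls the drawer open."]}  # narrator outputs

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(refs, hyps)
    print(name, score)
```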

## Narrating video clips using LAVILA-Narrator

**Infer the narrator:**

```bash
python run_with_submitit_infer_narrator.py \
    --metadata datasets/Ego4D/ego4d_train.pkl \
    --batch-size 64 \
    --resume $PATH --use-half \
    --nodes 4 --use_volta32
```

It will generate a pickle file (`$output_dir/total.pkl`), which is a list of quintuples: `(video_uid: str, start_time: float, end_time: float, narration_list: List[str], NLL_list: List[float])`.
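
A minimal example of loading and inspecting this file, following the tuple layout described above (the output path is a placeholder):

```python
# Load the inferred narrations and inspect one entry.
import pickle

with open("output_dir/total.pkl", "rb") as f:  # replace with your actual output path
    samples = pickle.load(f)

video_uid, start_time, end_time, narration_list, nll_list = samples[0]
print(video_uid, start_time, end_time)
for narration, nll in zip(narration_list, nll_list):
    print(f"{nll:.3f}  {narration}")
```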

For narrator-generated narrations on Ego4D ground-truth clips, we also provide a replica. Note that the narrator used here is our best-performing one.

## Rephrasing human narrations using LAVILA-Rephraser

The Rephraser is a standard LLM that can paraphrase the narrations of existing clips. Specifically, we use an off-the-shelf T5-based paraphraser that is publicly available on Hugging Face's model hub. For more details, please refer to the model card.
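
As a rough sketch of how such a paraphraser can be driven through the transformers library (the model id below is a placeholder, not the specific checkpoint we use; see the model card for the real one):

```python
# Hedged sketch: paraphrasing a narration with a T5-based model from the
# Hugging Face hub. "some-org/t5-paraphraser" is a placeholder id.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "some-org/t5-paraphraser"  # placeholder; see the model card for the real id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

narration = "#C C opens the drawer."
inputs = tokenizer("paraphrase: " + narration, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=3, num_return_sequences=3, max_new_tokens=32)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))
```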

For rephrased human narrations on Ego4D ground-truth clips, we provide a replica.

## Pre-training LAVILA Dual-Encoder

Now we are ready to pre-train LAVILA's dual-encoder by combining the human annotations (augmented by the Rephraser) with the Narrator-generated narrations.

**Training a LaViLa dual-encoder:**

```bash
python run_with_submitit_pretrain.py --model CLIP_OPENAI_TIMESFORMER_BASE \
    --metadata datasets/Ego4D/ego4d_train.rephraser.no_punkt_top3.pkl \
    --metadata-aux datasets/Ego4D/ego4d_train.narrator_63690737.return_10.pkl \
    --norm-embed --freeze-temperature \
    --freeze-pseudo-temperature \
    --fix-lr --contrastive-use-vissl \
    --nodes 4 --use_volta32
```

## Down-stream Evaluation

With the pre-trained dual-encoder at hand, we can now perform zero-shot or fine-tuned evaluations on down-stream benchmarks. Please refer to MODEL_ZOO.md for more details.