
We provide off-the-shelf scripts in the scripts folder.

Training LanguageBind

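For example, to train LanguageBind on Depth-Language with 16 GPUs (2 nodes x 8 GPUs):

First, download the cache of pretrained weights and set CACHE_DIR to its location.
Second, set TRAIN_DATA to the path of your training data, prepared as described in the dataset preparation instructions.
Then run the command below. As written, it launches 8 processes on a single node (--nnodes=1); for the two-node setup, set --nnodes=2 and pass torchrun's multi-node options (for example --node_rank, --master_addr and --master_port) on each node.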

CACHE_DIR="path/to/pretrained/weight"
TRAIN_DATA="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
    -m main  \
    --train-data ${TRAIN_DATA} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --do_train \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_d_cls_data "NYUV2"
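
As a rough sanity check on the settings above, the effective global batch size and the number of optimizer steps per epoch can be estimated from the flags. The sketch below assumes that --batch-size is a per-process value, as in OpenCLIP-style training scripts; that is an assumption rather than something stated in this document, and the variable names are illustrative only.

```shell
# Values copied from the training command above.
NNODES=1             # --nnodes
NPROC_PER_NODE=8     # --nproc_per_node
BATCH_SIZE=128       # --batch-size (assumed to be per process)
ACCUM_FREQ=1         # --accum-freq
NUM_SAMPLES=3020000  # --train-num-samples

# Samples consumed per optimizer step across all processes.
GLOBAL_BATCH=$(( NNODES * NPROC_PER_NODE * BATCH_SIZE * ACCUM_FREQ ))
# Integer division: a trailing partial batch is ignored.
STEPS_PER_EPOCH=$(( NUM_SAMPLES / GLOBAL_BATCH ))

echo "global batch size: ${GLOBAL_BATCH}"
echo "steps per epoch:   ${STEPS_PER_EPOCH}"
```

Under the same assumption, setting NNODES=2 for the two-node configuration doubles the global batch size and roughly halves the number of steps per epoch, which is worth keeping in mind when choosing --warmup.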

Validating LanguageBind

For example, to validate LanguageBind on Depth-Language with 1 GPU:

First, set RESUME to the checkpoint you want to evaluate.
Second, prepare the downstream dataset as described under Downstream datasets.
Then run

CACHE_DIR="path/to/pretrained/weight"
RESUME="path/to/checkpoint.pt"  # Depth-Language checkpoint to evaluate (should match --clip-type "dl")
TRAIN_DATA="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main  \
    --train-data ${TRAIN_DATA} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
    --do_eval \
    --val_d_cls_data "NYUV2"
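
Because the command sets HF_DATASETS_OFFLINE=1 and TRANSFORMERS_OFFLINE=1, nothing is fetched from the Hugging Face Hub at run time, so a wrong path only shows up once the job has started. A minimal pre-flight check is sketched below, assuming RESUME points at a regular file and CACHE_DIR at a directory. The helper name check_inputs is hypothetical and not part of LanguageBind.

```shell
# Hypothetical helper: verify that the inputs of the validation command exist.
# Prints one line per missing input to stderr and returns non-zero if any is missing.
check_inputs() {
    resume="$1"
    cache_dir="$2"
    status=0
    if [ ! -f "$resume" ]; then
        echo "missing checkpoint: $resume" >&2
        status=1
    fi
    if [ ! -d "$cache_dir" ]; then
        echo "missing cache directory: $cache_dir" >&2
        status=1
    fi
    return "$status"
}
```

Calling check_inputs "${RESUME}" "${CACHE_DIR}" || exit 1 just before torchrun stops the script early instead of failing inside the distributed launch.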

Downstream datasets

Depth

The NYU V2 dataset is downloaded from this repo, and we reformat it to conform to the standard ImageNet format. Change the data_root here.
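
"Standard ImageNet format" is read here as one sub-directory per class under each split directory, with that class's samples inside it, matching the folder structure listed at the end of this document. A minimal sketch for inspecting a reformatted split under that assumption follows; the function name count_per_class is illustrative and not part of LanguageBind.

```shell
# Print "<class_name> <number_of_files>" for every class directory in a split.
# Assumes an ImageNet-style layout: <split_dir>/<class_name>/<sample files>.
count_per_class() {
    split_dir="$1"
    for class_dir in "$split_dir"/*/; do
        # Skip the unexpanded pattern when the split has no sub-directories.
        [ -d "$class_dir" ] || continue
        class_name=$(basename "$class_dir")
        # Count regular files at any depth below the class directory.
        n_files=$(find "$class_dir" -type f | wc -l | tr -d ' ')
        echo "$class_name $n_files"
    done
}
```

For the NYU V2 layout listed in the folder structure, running count_per_class on the data/val directory should print one line for each of the ten scene classes.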

Video

Video datasets are downloaded from this repo; their folder structure is shown under Folder structure. Change the data_root here.

Audio

Audio datasets are downloaded from this repo and we reformat them to conform to the standard ImageNet format. Change the data_root here.

Infrared (Thermal)

We download LLVIP from the official website and FLIR from here, and reformat both to conform to the standard ImageNet format. Change the data_root here. We also provide the processed data as follows.

Datasets	Baidu Yun	Google Cloud	Peking University Yun
LLVIP	Link	Link	Link
FLIR V1	Link	Link	Link
FLIR V2	Link	Link	Link

Folder structure

downstream_datasets
├── Audio
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── brushing_teeth
│   │       ├── can_opening
│   │       ├── car_horn
│   │       ├── cat
│   │       ├── chainsaw
│   │       ├── chirping_birds
│   │       ├── church_bells
│   │       ├── clapping
│   │       ├── clock_alarm
│   │       ├── clock_tick
│   │       ├── coughing
│   │       ├── cow
│   │       ├── crackling_fire
│   │       ├── crickets
│   │       ├── crow
│   │       ├── crying_baby
│   │       ├── dog
│   │       ├── door_wood_creaks
│   │       ├── door_wood_knock
│   │       ├── drinking_sipping
│   │       ├── engine
│   │       ├── fireworks
│   │       ├── footsteps
│   │       ├── frog
│   │       ├── glass_breaking
│   │       ├── hand_saw
│   │       ├── helicopter
│   │       ├── hen
│   │       ├── insects
│   │       ├── keyboard_typing
│   │       ├── laughing
│   │       ├── mouse_click
│   │       ├── pig
│   │       ├── pouring_water
│   │       ├── rain
│   │       ├── rooster
│   │       ├── sea_waves
│   │       ├── sheep
│   │       ├── siren
│   │       ├── sneezing
│   │       ├── snoring
│   │       ├── thunderstorm
│   │       ├── toilet_flush
│   │       ├── train
│   │       ├── vacuum_cleaner
│   │       ├── washing_machine
│   │       ├── water_drops
│   │       └── wind
├── Depth
│   ├── nyuv2
│   │   ├── data
│   │   │   └── val
│   │   │       ├── bathroom
│   │   │       ├── bedroom
│   │   │       ├── bookstore
│   │   │       ├── classroom
│   │   │       ├── dining_room
│   │   │       ├── home_office
│   │   │       ├── kitchen
│   │   │       ├── living_room
│   │   │       ├── office
│   │   │       └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   ├── llvip
│   │   ├── train
│   │   │   ├── background
│   │   │   └── person
│   │   └── val
│   │       ├── background
│   │       └── person
└── VideoTextRetrieval
    ├── vtRetdata
    │   ├── ActivityNet
    │   │   └── Videos
    │   │       └── Activity_Videos
    │   ├── Didemo
    │   │   └── videos
    │   ├── MSRVTT
    │   │   └── MSRVTT_Videos
    │   └── MSVD
    │       └── MSVD_Videos
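
Before running validation, it can help to confirm that data_root matches the tree above. The sketch below checks a handful of the directories listed there. The helper name check_layout is hypothetical, and the list of paths should be trimmed or extended to the datasets actually in use.

```shell
# Hypothetical helper: report which expected dataset directories are missing
# under a downstream_datasets root. Paths are taken from the tree above.
check_layout() {
    root="$1"
    missing=0
    for rel in \
        "Audio/esc50/test" \
        "Depth/nyuv2/data/val" \
        "Thermal/flirv1/val" \
        "Thermal/flirv2/val" \
        "Thermal/llvip/val" \
        "VideoTextRetrieval/vtRetdata/MSRVTT/MSRVTT_Videos"
    do
        if [ ! -d "$root/$rel" ]; then
            echo "missing: $root/$rel"
            missing=$(( missing + 1 ))
        fi
    done
    # Succeed only if every listed directory exists.
    [ "$missing" -eq 0 ]
}
```

A call such as check_layout /path/to/downstream_datasets prints one line per missing directory and returns a non-zero status if anything is absent.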