We provide off-the-shelf scripts in the scripts folder.

Training LanguageBind

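For example, to train LanguageBind on Depth-Language on one node with 8 GPUs (for 16 GPUs across 2 nodes, set --nnodes=2 and add torchrun's --node_rank, --master_addr and --master_port arguments on each node):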

CACHE_DIR="path/to/pretrained/weight"
TRAIN_DATA="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
    -m main  \
    --train-data ${TRAIN_DATA} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --do_train \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_d_cls_data "NYUV2"
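
When scaling the launch from one node to two, the global batch size and the number of optimizer steps per epoch change with it. The sketch below works through that arithmetic for the flags above and prints the torchrun launcher arguments for one node of a multi-node job. It assumes --batch-size is a per-GPU value (as in OpenCLIP) and that a partial final batch is dropped. The helper names and the example address 10.0.0.1 and port 29500 are illustrative assumptions, not part of the LanguageBind scripts.

```shell
#!/bin/sh
# Sketch only: helper names, address and port are illustrative assumptions.

# Print the torchrun launcher arguments for one node of a multi-node job.
# args: number of nodes, this node's rank, GPUs per node, master host, master port
torchrun_prefix() {
  echo "torchrun --nnodes=$1 --node_rank=$2 --nproc_per_node=$3 --master_addr=$4 --master_port=$5"
}

# Global batch = per-GPU batch x GPUs per node x nodes x accum-freq.
# Assumes --batch-size is per GPU.
global_batch() {
  echo $(( $1 * $2 * $3 * $4 ))
}

# Approximate optimizer steps per epoch (integer division drops a partial batch).
# args: number of training samples, global batch size
steps_per_epoch() {
  echo $(( $1 / $2 ))
}

# Compare the single-node command above with a 2-node launch.
for nodes in 1 2; do
  gb=$(global_batch 128 8 "${nodes}" 1)
  echo "nodes=${nodes} global_batch=${gb} steps=$(steps_per_epoch 3020000 "${gb}")"
done
# nodes=1 global_batch=1024 steps=2949
# nodes=2 global_batch=2048 steps=1474

# Launcher arguments for node 0 of a 2-node job; use rank 1 on the other node.
torchrun_prefix 2 0 8 "10.0.0.1" 29500
```

With --warmup 200, the warmup covers roughly 7% of the single-node epoch and about twice that share on two nodes, so the learning rate or warmup may need retuning when the node count changes.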

Validating LanguageBind

For example, to validate LanguageBind on Depth-Language with 1 GPU:

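  • First, set RESUME to the checkpoint to evaluate. The checkpoint should match the modality selected by --clip-type; the file name shown below is only an example.
  • Second, prepare the downstream dataset as described under "Downstream datasets".
  • Then run: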
CACHE_DIR="path/to/pretrained/weight"
RESUME="thermal_language.pt"
TRAIN_DATA="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main  \
    --train-data ${TRAIN_DATA} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
    --do_eval \
    --val_d_cls_data "NYUV2"
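
Because validation depends entirely on the checkpoint passed through --resume, it can help to confirm that RESUME points at a real file before starting torchrun. The following is a minimal sketch of such a pre-flight check; check_resume is a hypothetical helper, not something provided by the repository.

```shell
#!/bin/sh
# Sketch only: check_resume is a hypothetical helper name.

# Succeed only if the argument names an existing, non-empty file.
check_resume() {
  ckpt="$1"
  if [ -z "${ckpt}" ]; then
    echo "RESUME is not set" >&2
    return 1
  fi
  if [ ! -s "${ckpt}" ]; then
    echo "checkpoint not found or empty: ${ckpt}" >&2
    return 1
  fi
  echo "using checkpoint: ${ckpt}"
}
```

A typical use would be to run `check_resume "${RESUME}" || exit 1` right after setting RESUME and before the torchrun line.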

Downstream datasets

Depth

The NYU V2 dataset is downloaded from this repo, and we reformat it to conform to the standard ImageNet format. Change the data_root here.
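
The ImageNet format mentioned here places each sample in a folder named after its class, e.g. val/kitchen/<file>. The sketch below shows one way to produce that layout from a flat directory. It assumes a hypothetical labels file with one "<filename> <class>" pair per line; how labels are extracted from the raw download is not covered, and to_imagenet_layout is an illustrative name rather than a script from the repository.

```shell
#!/bin/sh
# Sketch only: the labels-file format and the helper name are assumptions.

# Copy files into one sub-folder per class.
# args: source directory, labels file, destination split directory (e.g. .../val)
to_imagenet_layout() {
  src="$1"
  labels="$2"
  dest="$3"
  # The "|| [ -n ... ]" clause also handles a last line without a trailing newline.
  while read -r name cls || [ -n "${name}" ]; do
    # Skip blank lines.
    [ -n "${name}" ] || continue
    mkdir -p "${dest}/${cls}"
    cp "${src}/${name}" "${dest}/${cls}/" || return 1
  done < "${labels}"
}
```

For example, `to_imagenet_layout raw_images labels.txt downstream_datasets/Depth/nyuv2/data/val` would create one folder per class under val, matching the folder structure shown later in this document.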

Video

Video datasets are downloaded from this repo; the expected layout is shown under "Folder structure". Change the data_root here.

Audio

Audio datasets are downloaded from this repo and we reformat them to conform to the standard ImageNet format. Change the data_root here.

Infrared (Thermal)

We download LLVIP from the official website and FLIR from here, and reformat both to conform to the standard ImageNet format. Change the data_root here. We also provide the processed data as follows.

Datasets    Baidu Yun    Google Cloud    Peking University Yun
LLVIP       Link         Link            Link
FLIR V1     Link         Link            Link
FLIR V2     Link         Link            Link

Folder structure

downstream_datasets
├── Audio
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── brushing_teeth
│   │       ├── can_opening
│   │       ├── car_horn
│   │       ├── cat
│   │       ├── chainsaw
│   │       ├── chirping_birds
│   │       ├── church_bells
│   │       ├── clapping
│   │       ├── clock_alarm
│   │       ├── clock_tick
│   │       ├── coughing
│   │       ├── cow
│   │       ├── crackling_fire
│   │       ├── crickets
│   │       ├── crow
│   │       ├── crying_baby
│   │       ├── dog
│   │       ├── door_wood_creaks
│   │       ├── door_wood_knock
│   │       ├── drinking_sipping
│   │       ├── engine
│   │       ├── fireworks
│   │       ├── footsteps
│   │       ├── frog
│   │       ├── glass_breaking
│   │       ├── hand_saw
│   │       ├── helicopter
│   │       ├── hen
│   │       ├── insects
│   │       ├── keyboard_typing
│   │       ├── laughing
│   │       ├── mouse_click
│   │       ├── pig
│   │       ├── pouring_water
│   │       ├── rain
│   │       ├── rooster
│   │       ├── sea_waves
│   │       ├── sheep
│   │       ├── siren
│   │       ├── sneezing
│   │       ├── snoring
│   │       ├── thunderstorm
│   │       ├── toilet_flush
│   │       ├── train
│   │       ├── vacuum_cleaner
│   │       ├── washing_machine
│   │       ├── water_drops
│   │       └── wind
├── Depth
│   ├── nyuv2
│   │   ├── data
│   │   │   └── val
│   │   │       ├── bathroom
│   │   │       ├── bedroom
│   │   │       ├── bookstore
│   │   │       ├── classroom
│   │   │       ├── dining_room
│   │   │       ├── home_office
│   │   │       ├── kitchen
│   │   │       ├── living_room
│   │   │       ├── office
│   │   │       └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   ├── llvip
│   │   ├── train
│   │   │   ├── background
│   │   │   └── person
│   │   └── val
│   │       ├── background
│   │       └── person
└── VideoTextRetrieval
    ├── vtRetdata
    │   ├── ActivityNet
    │   │   └── Videos
    │   │       └── Activity_Videos
    │   ├── Didemo
    │   │   └── videos
    │   ├── MSRVTT
    │   │   └── MSRVTT_Videos
    │   └── MSVD
    │       └── MSVD_Videos
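
Before running validation, the layout above can be checked with a short script. The following is a minimal sketch: it only tests that the split and video directories from the tree exist under a given root and counts class folders. check_layout and count_classes are hypothetical helper names, not part of the repository.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative; the paths come from the tree above.

# Report every expected directory that is missing; return 1 if any is missing.
check_layout() {
  root="$1"
  missing=0
  for d in \
    "Audio/esc50/test" \
    "Depth/nyuv2/data/val" \
    "Thermal/flirv1/val" \
    "Thermal/flirv2/val" \
    "Thermal/llvip/train" \
    "Thermal/llvip/val" \
    "VideoTextRetrieval/vtRetdata/ActivityNet/Videos/Activity_Videos" \
    "VideoTextRetrieval/vtRetdata/Didemo/videos" \
    "VideoTextRetrieval/vtRetdata/MSRVTT/MSRVTT_Videos" \
    "VideoTextRetrieval/vtRetdata/MSVD/MSVD_Videos"
  do
    if [ ! -d "${root}/${d}" ]; then
      echo "missing: ${root}/${d}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Count the class folders directly under a split directory.
count_classes() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}
```

For the tree above, `count_classes downstream_datasets/Audio/esc50/test` should report 50 class folders, and `count_classes downstream_datasets/Depth/nyuv2/data/val` should report 10.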