We provide **off-the-shelf** scripts in the [scripts folder](scripts).

## Training LanguageBind

For example, to **train** LanguageBind on **Depth-Language** with 16 GPUs (2 nodes x 8 GPUs):

* First, download the [cache of pretrained weights](https://github.com/PKU-YuanGroup/LanguageBind#-model-zoo) and set ```CACHE_DIR``` to its location.
* Next, set ```TRAIN_DATA``` to the path of the training data, prepared according to the [dataset preparation](https://github.com/PKU-YuanGroup/LanguageBind#-vidal-10m) instructions.
* Then run the command below. As written, it launches a single node with 8 GPUs (```--nnodes=1 --nproc_per_node 8```); for the 2-node setup, adjust ```--nnodes``` and the other multi-node ```torchrun``` options to match your cluster.

```bash
CACHE_DIR="path/to/pretrained/weight"
TRAIN_DATA="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
    -m main \
    --train-data ${TRAIN_DATA} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --do_train \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_d_cls_data "NYUV2"
```

## Validating LanguageBind

For example, to **validate** LanguageBind on **Depth-Language** with 1 GPU:

* First, set ```RESUME``` to the checkpoint you want to evaluate.
* Next, prepare the [downstream dataset](https://github.com/PKU-YuanGroup/LanguageBind/blob/main/TRAIN_AND_VALIDATE.md#downstream-datasets).
* Then run:

```bash
CACHE_DIR="path/to/pretrained/weight"
RESUME="thermal_language.pt"  # example value; point this at the checkpoint for the modality being evaluated
TRAIN_DATA="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main \
    --train-data ${TRAIN_DATA} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
    --do_eval \
    --val_d_cls_data "NYUV2"
```

## Downstream datasets

### Depth
The NYU V2 dataset is downloaded from [this repo](https://github.com/TUI-NICR/nicr-scene-analysis-datasets/tree/main/nicr_scene_analysis_datasets/datasets/nyuv2) and reformatted to conform to the standard ImageNet format. Change the ```data_root``` [here](https://github.com/PKU-YuanGroup/LanguageBind/blob/main/data/build_datasets.py#L148).

### Video
Video datasets are downloaded from [this repo](https://github.com/jpthu17/HBI); their layout is shown under "Folder structure". Change the ```data_root``` [here](https://github.com/PKU-YuanGroup/LanguageBind/blob/main/data/build_datasets.py#L74).

### Audio
Audio datasets are downloaded from [this repo](https://github.com/OFA-Sys/ONE-PEACE/blob/main/datasets.md#audio) and reformatted to conform to the standard ImageNet format. Change the ```data_root``` [here](https://github.com/PKU-YuanGroup/LanguageBind/blob/main/data/build_datasets.py#L127).

### Infrared (Thermal)
We download LLVIP from the [official website](https://bupt-ai-cz.github.io/LLVIP/) and FLIR from [here](https://www.flir.com/oem/adas/adas-dataset-form/).
We reformat them to conform to the standard ImageNet format. Change the ```data_root``` [here](https://github.com/PKU-YuanGroup/LanguageBind/blob/main/data/build_datasets.py#L160). We also provide the processed data as follows.

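| Datasets | Baidu Yun | Google Cloud | Peking University Yun |
|----------|-----------|--------------|-----------------------|
| LLVIP    | Link      | Link         | Link                  |
| FLIR V1  | Link      | Link         | Link                  |
| FLIR V2  | Link      | Link         | Link                  |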

### Folder structure
```bash
downstream_datasets
├── Audio
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── brushing_teeth
│   │       ├── can_opening
│   │       ├── car_horn
│   │       ├── cat
│   │       ├── chainsaw
│   │       ├── chirping_birds
│   │       ├── church_bells
│   │       ├── clapping
│   │       ├── clock_alarm
│   │       ├── clock_tick
│   │       ├── coughing
│   │       ├── cow
│   │       ├── crackling_fire
│   │       ├── crickets
│   │       ├── crow
│   │       ├── crying_baby
│   │       ├── dog
│   │       ├── door_wood_creaks
│   │       ├── door_wood_knock
│   │       ├── drinking_sipping
│   │       ├── engine
│   │       ├── fireworks
│   │       ├── footsteps
│   │       ├── frog
│   │       ├── glass_breaking
│   │       ├── hand_saw
│   │       ├── helicopter
│   │       ├── hen
│   │       ├── insects
│   │       ├── keyboard_typing
│   │       ├── laughing
│   │       ├── mouse_click
│   │       ├── pig
│   │       ├── pouring_water
│   │       ├── rain
│   │       ├── rooster
│   │       ├── sea_waves
│   │       ├── sheep
│   │       ├── siren
│   │       ├── sneezing
│   │       ├── snoring
│   │       ├── thunderstorm
│   │       ├── toilet_flush
│   │       ├── train
│   │       ├── vacuum_cleaner
│   │       ├── washing_machine
│   │       ├── water_drops
│   │       └── wind
├── Depth
│   ├── nyuv2
│   │   ├── data
│   │   │   └── val
│   │   │       ├── bathroom
│   │   │       ├── bedroom
│   │   │       ├── bookstore
│   │   │       ├── classroom
│   │   │       ├── dining_room
│   │   │       ├── home_office
│   │   │       ├── kitchen
│   │   │       ├── living_room
│   │   │       ├── office
│   │   │       └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   ├── llvip
│   │   ├── train
│   │   │   ├── background
│   │   │   └── person
│   │   └── val
│   │       ├── background
│   │       └── person
└── VideoTextRetrieval
    ├── vtRetdata
    │   ├── ActivityNet
    │   │   └── Videos
    │   │       └── Activity_Videos
    │   ├── Didemo
    │   │   └── videos
    │   ├── MSRVTT
    │   │   └── MSRVTT_Videos
    │   └── MSVD
    │       └── MSVD_Videos
```
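
Before running validation, it can help to confirm that each classification split follows the ImageNet-style layout shown above (one subfolder per class). The following is a minimal sketch, not part of the LanguageBind scripts: the helper names ```count_classes``` and ```check_split``` and the ```DATA_ROOT``` variable are assumptions made for illustration, and the expected class counts are simply read off the folder structure above.

```bash
#!/usr/bin/env bash
# Sketch: sanity-check an ImageNet-style layout (split_dir/class_name/files).
# count_classes, check_split and DATA_ROOT are hypothetical names, not LanguageBind APIs.

# Print the number of class folders directly under a split directory.
count_classes() {
    local split_dir="$1"
    if [ ! -d "$split_dir" ]; then
        echo "missing: $split_dir" >&2
        return 1
    fi
    # Only immediate subdirectories count as classes; loose files are ignored.
    find "$split_dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Compare the class count of a split against an expected value.
check_split() {
    local split_dir="$1" expected="$2" actual
    actual="$(count_classes "$split_dir")" || return 1
    if [ "$actual" -eq "$expected" ]; then
        echo "ok: $split_dir ($actual classes)"
    else
        echo "mismatch: $split_dir has $actual classes, expected $expected" >&2
        return 1
    fi
}

# Run the checks only when DATA_ROOT is set, e.g.
#   DATA_ROOT="path/to/downstream_datasets" bash check_layout.sh
if [ -n "${DATA_ROOT:-}" ]; then
    check_split "$DATA_ROOT/Audio/esc50/test" 50
    check_split "$DATA_ROOT/Depth/nyuv2/data/val" 10
    check_split "$DATA_ROOT/Thermal/flirv1/val" 4
    check_split "$DATA_ROOT/Thermal/flirv2/val" 12
    check_split "$DATA_ROOT/Thermal/llvip/train" 2
    check_split "$DATA_ROOT/Thermal/llvip/val" 2
fi
```

The check counts directories only, so it catches a missing or misnamed split but does not verify the files inside each class folder. The video retrieval data is not organized by class and is therefore left out.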