Running
We provide off-the-shelf scripts in the scripts folder.
Training LanguageBind
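For example, to train LanguageBind on Depth-Language with 16 GPUs (2 nodes x 8 GPUs):
- First, download the cache of pretrained weights and set CACHE_DIR to its location.
- Second, set TRAIN_DATA to the path of your training data, following the dataset preparation instructions.
- Then run the command below. As written, it launches 8 processes on a single node (--nnodes=1 --nproc_per_node 8); for the 2-node setup, adjust the torchrun launch options to match your cluster.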
CACHE_DIR="path/to/pretrained/weight"
TRAIN_DATA="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
-m main \
--train-data ${TRAIN_DATA} \
--train-num-samples 3020000 \
--clip-type "dl" --max-depth 10 \
--do_train \
--lock-text --lock-image --text-type "polish_mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 2 \
--lr 5e-4 --coef-lr 1e-3 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 1 --force-patch-dropout 0.5 \
--epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
--do_eval \
--val_d_cls_data "NYUV2"
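The batch-related flags in the command above interact, and the arithmetic can be sketched as follows. This is a minimal sketch, not the repository's own code: it assumes that --batch-size is the per-process batch size (as in OpenCLIP-style training scripts, which is an assumption about LanguageBind) and that an incomplete final batch is dropped. The helper names are hypothetical.

```python
def effective_batch_size(per_gpu_batch: int, world_size: int, accum_freq: int = 1) -> int:
    """Samples consumed per optimizer step across all processes (assumed per-process batch)."""
    return per_gpu_batch * world_size * accum_freq


def steps_per_epoch(num_samples: int, per_gpu_batch: int, world_size: int, accum_freq: int = 1) -> int:
    """Full optimizer steps in one pass over num_samples (assumes the remainder is dropped)."""
    return num_samples // effective_batch_size(per_gpu_batch, world_size, accum_freq)


if __name__ == "__main__":
    # Values from the command above:
    # --train-num-samples 3020000, --batch-size 128, --accum-freq 1
    for gpus in (8, 16):
        print(gpus, effective_batch_size(128, gpus), steps_per_epoch(3_020_000, 128, gpus))
```

Under these assumptions, the 8-process launch gives an effective batch of 1024 samples and 2949 steps per epoch, while a 16-process launch gives 2048 samples and 1474 steps.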
Validating LanguageBind
For example, to validate LanguageBind on Depth-Language with 1 GPU:
- First, set RESUME to the checkpoint you want to evaluate.
- Second, prepare the downstream dataset.
- Then run:
CACHE_DIR="path/to/pretrained/weight"
RESUME="thermal_language.pt"
TRAIN_DATA="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
-m main \
--train-data ${TRAIN_DATA} \
--train-num-samples 3020000 \
--clip-type "dl" --max-depth 10 \
--lock-text --lock-image --text-type "polish_mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 2 \
--lr 5e-4 --coef-lr 1e-3 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 1 --force-patch-dropout 0.5 \
--epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
--do_eval \
--val_d_cls_data "NYUV2"
Downstream datasets
Depth
The NYU V2 dataset is downloaded from this repo, and we reformat it to conform to the standard ImageNet format. Change the data_root here.
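The "standard ImageNet format" mentioned above is one subfolder per class inside each split directory. A minimal sketch of such a conversion is shown below, assuming you already have a list of (filename, label) pairs; the function name, the file names, and the demo paths are hypothetical and are not part of the LanguageBind code.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def to_class_folders(items: Iterable[Tuple[str, str]], src_dir: str, split_dir: str) -> int:
    """Copy each (filename, label) pair to split_dir/<label>/<filename>.

    Returns the number of files copied.
    """
    src, dst = Path(src_dir), Path(split_dir)
    copied = 0
    for filename, label in items:
        class_dir = dst / label
        class_dir.mkdir(parents=True, exist_ok=True)  # one folder per class
        shutil.copy2(src / filename, class_dir / filename)
        copied += 1
    return copied


if __name__ == "__main__":
    # Self-contained demo on a temporary directory with placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw"
        raw.mkdir()
        (raw / "0001.png").write_bytes(b"placeholder")
        val = Path(tmp) / "Depth" / "nyuv2" / "data" / "val"
        to_class_folders([("0001.png", "bedroom")], str(raw), str(val))
        print(sorted(p.relative_to(tmp).as_posix() for p in val.rglob("*.png")))
```

Copying rather than moving keeps the raw download intact, so the conversion can be rerun if the label mapping changes.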
Video
Video datasets are downloaded from this repo; the expected folder structure is shown below. Change the data_root here.
Audio
Audio datasets are downloaded from this repo, and we reformat them to conform to the standard ImageNet format. Change the data_root here.
Infrared (Thermal)
We download LLVIP from the official website and FLIR from here, then reformat them to conform to the standard ImageNet format. Change the data_root here. We also provide the processed data as follows.
Datasets | Baidu Yun | Google Cloud | Peking University Yun |
---|---|---|---|
LLVIP | Link | Link | Link |
FLIR V1 | Link | Link | Link |
FLIR V2 | Link | Link | Link |
Folder structure
downstream_datasets
├── Audio
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── brushing_teeth
│   │       ├── can_opening
│   │       ├── car_horn
│   │       ├── cat
│   │       ├── chainsaw
│   │       ├── chirping_birds
│   │       ├── church_bells
│   │       ├── clapping
│   │       ├── clock_alarm
│   │       ├── clock_tick
│   │       ├── coughing
│   │       ├── cow
│   │       ├── crackling_fire
│   │       ├── crickets
│   │       ├── crow
│   │       ├── crying_baby
│   │       ├── dog
│   │       ├── door_wood_creaks
│   │       ├── door_wood_knock
│   │       ├── drinking_sipping
│   │       ├── engine
│   │       ├── fireworks
│   │       ├── footsteps
│   │       ├── frog
│   │       ├── glass_breaking
│   │       ├── hand_saw
│   │       ├── helicopter
│   │       ├── hen
│   │       ├── insects
│   │       ├── keyboard_typing
│   │       ├── laughing
│   │       ├── mouse_click
│   │       ├── pig
│   │       ├── pouring_water
│   │       ├── rain
│   │       ├── rooster
│   │       ├── sea_waves
│   │       ├── sheep
│   │       ├── siren
│   │       ├── sneezing
│   │       ├── snoring
│   │       ├── thunderstorm
│   │       ├── toilet_flush
│   │       ├── train
│   │       ├── vacuum_cleaner
│   │       ├── washing_machine
│   │       ├── water_drops
│   │       └── wind
├── Depth
│   ├── nyuv2
│   │   ├── data
│   │   │   └── val
│   │   │       ├── bathroom
│   │   │       ├── bedroom
│   │   │       ├── bookstore
│   │   │       ├── classroom
│   │   │       ├── dining_room
│   │   │       ├── home_office
│   │   │       ├── kitchen
│   │   │       ├── living_room
│   │   │       ├── office
│   │   │       └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   ├── llvip
│   │   ├── train
│   │   │   ├── background
│   │   │   └── person
│   │   └── val
│   │       ├── background
│   │       └── person
└── VideoTextRetrieval
    ├── vtRetdata
    │   ├── ActivityNet
    │   │   └── Videos
    │   │       └── Activity_Videos
    │   ├── Didemo
    │   │   └── videos
    │   ├── MSRVTT
    │   │   └── MSRVTT_Videos
    │   └── MSVD
    │       └── MSVD_Videos
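Before launching an evaluation, it can help to confirm that a local copy matches the layout above. The following is a minimal sketch using only the Python standard library; the split paths are copied from the tree above, while the function names and the demo directory are hypothetical and not part of the LanguageBind code.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Classification split directories, taken from the folder structure above.
EXPECTED_SPLITS = [
    "Audio/esc50/test",
    "Depth/nyuv2/data/val",
    "Thermal/flirv1/val",
    "Thermal/flirv2/val",
    "Thermal/llvip/train",
    "Thermal/llvip/val",
]


def missing_splits(root: str, expected: List[str] = EXPECTED_SPLITS) -> List[str]:
    """Return the expected split directories that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_dir()]


def classes_per_split(root: str, expected: List[str] = EXPECTED_SPLITS) -> Dict[str, List[str]]:
    """Map each existing split directory to its sorted class subfolder names."""
    base = Path(root)
    return {
        rel: sorted(p.name for p in (base / rel).iterdir() if p.is_dir())
        for rel in expected
        if (base / rel).is_dir()
    }


if __name__ == "__main__":
    # Demo on a temporary directory that contains only the LLVIP validation split.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("background", "person"):
            (Path(tmp) / "Thermal" / "llvip" / "val" / name).mkdir(parents=True)
        print("missing:", missing_splits(tmp))
        print("classes:", classes_per_split(tmp))
```

For a complete download, `missing_splits` should return an empty list, and the entry for Audio/esc50/test should contain the 50 class names listed in the tree.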