ViFi-CLIP Baseline for Elderly Action Recognition (EAR) Challenge
Model Zoo
NOTE: All models in our experiments below use the publicly available ViT-B/16 based CLIP model.
Installation
For installation and other package requirements, please follow the instructions below:
# Create a conda environment
conda create -y -n vclip python=3.7
# Activate the environment
conda activate vclip
# Install requirements
pip install -r requirements.txt
pip install torch===1.8.1+cu111 torchvision===0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
git clone -b 22.04-dev https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
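After installation, an optional sanity check confirms that the CUDA build of PyTorch and apex are importable:
# Optional sanity check for the environment
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "from apex import amp; print('apex OK')"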
Data preparation
Please follow the instructions in DATASETS.md to prepare all datasets. We concatenate the csv files of the ETRI and Toyota Smarthome datasets for training the model. For testing, we create a csv file with the video names and a dummy class-label column.
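As a sketch (the csv file names below are illustrative, not the actual annotation files), the training csv can be produced by concatenating the per-dataset csv files:
# Illustrative names; substitute the actual ETRI and Toyota Smarthome csv files
cat etri_train.csv toyota_smarthome_train.csv > train.csv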
Before training or testing the model, please crop the videos to the human bounding boxes. Update the input video directory and the output (cropped) video directory in script_crop.sh, then run:
./script_crop.sh
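The directories are set inside the script; the variable names below are illustrative, so check script_crop.sh for the actual ones:
# Edit these paths in script_crop.sh before running it (names are illustrative)
INPUT_VIDEO_DIR=/path/to/raw_videos
OUTPUT_VIDEO_DIR=/path/to/cropped_videos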
Note:
- Following the ViFi-CLIP paper, we recommend keeping the total batch size as specified in the respective config files. Please use --accumulation-steps to maintain the total batch size; see the sketch after this list. Specifically, the effective total batch size here is 8 (GPUs_NUM) x 4 (TRAIN.BATCH_SIZE) x 16 (TRAIN.ACCUMULATION_STEPS) = 512.
- After setting up the dataset, the only argument in the config files that needs to be specified is the data path. All other settings are pre-set.
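A minimal sketch of how the pieces fit together, assuming the upstream ViFi-CLIP entry point (main.py with the -cfg and --accumulation-steps arguments); the actual invocation lives in script_train.sh, and the config name below is a placeholder:
# Keep GPUs x TRAIN.BATCH_SIZE x TRAIN.ACCUMULATION_STEPS = 512.
# E.g., on 4 GPUs instead of 8, double the accumulation steps: 4 x 4 x 32 = 512.
python -m torch.distributed.launch --nproc_per_node=4 main.py \
    -cfg configs/<your_config>.yaml --accumulation-steps 32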
Training
For all experiments, we provide config files in the configs folder. To train, run:
./script_train.sh
Evaluating models
To evaluate the trained model on the challenge dataset, please use the correct config (config_challenge_test.yaml) and corresponding model weights (work_dirs/challenge_baseline_new/best.pth):
./script_test.sh
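Roughly, script_test.sh wraps an invocation like the following (a sketch assuming the upstream ViFi-CLIP entry point and its --only_test/--resume flags; check the script for the exact command):
# Sketch of the evaluation call wrapped by script_test.sh
python -m torch.distributed.launch --nproc_per_node=1 main.py \
    -cfg configs/config_challenge_test.yaml \
    --resume work_dirs/challenge_baseline_new/best.pth \
    --only_test --output work_dirs/challenge_baseline_new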
Make sure to update the val.csv file. It contains the locations of the videos and a dummy video label ('0'); a sketch of its format follows. The test script will output a file output.csv, which is to be used for model evaluation.
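A minimal sketch of val.csv, assuming one video path per line followed by the dummy label 0 (the exact delimiter and paths depend on your dataset layout; see DATASETS.md):
/path/to/videos/clip_0001.mp4 0
/path/to/videos/clip_0002.mp4 0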
Contact
If you have any questions, please create an issue on this repository.
Citation
If you use our approach (code or model) in your research, please consider citing ViFi-CLIP and SKI-Models:
@inproceedings{tobeupdated,
  title={ViFi-CLIP Baseline for Elderly Action Recognition (EAR) Challenge},
  author={Srijan Das},
  year={2025}
}
Acknowledgements
Our code is based on the ViFi-CLIP and SKI-Models repositories. We sincerely thank the authors for releasing their code. If you use our model and code, please consider citing ViFi-CLIP and SKI-Models as well.