Vision-CAIR committed on
Commit
a6e7156
1 Parent(s): a2c1e4d

apply_latest_updates

Custom_training.md ADDED
@@ -0,0 +1,33 @@
1
+ # Customizing MiniGPT4-video for your own Video-text dataset
2
+
3
+ ## Add your own video dataloader
4
+ Construct your own dataloader in `minigpt4/datasets/datasets/video_datasets.py`, based on the existing dataloaders.<br>
5
+ Copy the `Video_loader_template` class and edit it to match the nature of your data (a minimal sketch follows below).
6
+
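Below is a minimal sketch of what such a dataloader could look like, modeled on the existing loaders in `minigpt4/datasets/datasets/video_datasets.py`. The class name, the annotation keys (`video_id`, `caption`, `question`), and the frames-on-disk layout are illustrative assumptions; adapt them to your data and double-check the returned dictionary keys against `Video_loader_template`.

```python
import os

import torch
from PIL import Image

from minigpt4.datasets.datasets.base_dataset import BaseDataset


class MyVideoDataset(BaseDataset):
    """Illustrative custom loader: one folder of extracted frames per video (assumed layout)."""

    def __init__(self, vis_processor, text_processor, vis_root, ann_paths,
                 subtitles_path=None, model_name='llama2'):
        super().__init__(vis_processor, text_processor, vis_root, ann_paths)
        # the existing loaders use 45 frames for llama2 and 90 for mistral
        self.length = 90 if model_name == 'mistral' else 45
        self.subtitles_path = subtitles_path  # unused in this sketch

    def __getitem__(self, index):
        ann = self.annotation[index]          # one entry of your annotation JSON
        video_id = ann["video_id"]            # assumed annotation key
        frames_dir = os.path.join(self.vis_root, video_id)
        frame_files = sorted(os.listdir(frames_dir))
        if not frame_files:
            raise RuntimeError(f"no frames found in {frames_dir}")

        # sample frames uniformly and build the <Img><ImageHere> placeholder string
        step = max(1, len(frame_files) // self.length)
        images, img_placeholder = [], ""
        for frame_file in frame_files[::step][:self.length]:
            frame = Image.open(os.path.join(frames_dir, frame_file)).convert("RGB")
            images.append(self.vis_processor(frame))
            img_placeholder += "<Img><ImageHere>"

        # pad with the last frame so every sample has exactly self.length frames
        while len(images) < self.length:
            images.append(images[-1])
            img_placeholder += "<Img><ImageHere>"

        instruction = ann.get("question", "Describe the video.")   # assumed key
        return {
            "image": torch.stack(images),
            "answer": self.text_processor(ann["caption"]),          # assumed key
            "image_id": video_id,
            "instruction_input": f"{img_placeholder}\n{instruction}",
            "length": self.length,
        }
```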
7
+ ## Create config file for your dataloader
8
+ Create your YAML config file at `minigpt4/configs/datasets/dataset_name/default.yaml`; it contains the paths to your dataset.<br>
9
+ Copy the template file `minigpt4/configs/datasets/template/default.yaml` and edit the paths to your dataset.
10
+
11
+
12
+ ## Register your dataloader
13
+ In the `minigpt4/datasets/builders/image_text_pair_builder.py` file:
14
+ Import your dataloader class from `minigpt4/datasets/datasets/video_datasets.py`. <br>
15
+ Copy and edit the VideoTemplateBuilder class.<br>
16
+ Set `train_dataset_cls = YourVideoLoaderClass`, the class you imported from `minigpt4/datasets/datasets/video_datasets.py` (see the sketch below).
17
+
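For reference, a filled-in registration might look like the sketch below, which follows the `VideoTemplateBuilder` pattern added to `image_text_pair_builder.py` in this commit. The builder name `my_video_dataset`, the class `MyVideoDataset`, and the config path are placeholder assumptions; the builder name must match the dataset name used in your config and training YAML files.

```python
from minigpt4.common.registry import registry
from minigpt4.datasets.builders.base_dataset_builder import BaseDatasetBuilder
from minigpt4.datasets.datasets.video_datasets import MyVideoDataset  # your loader class


@registry.register_builder("my_video_dataset")  # must match the dataset name in your YAML files
class MyVideoBuilder(BaseDatasetBuilder):
    train_dataset_cls = MyVideoDataset

    DATASET_CONFIG_DICT = {
        "default": "configs/datasets/my_video_dataset/default.yaml",
    }

    def build_datasets(self):
        self.build_processors()
        build_info = self.config.build_info  # paths read from your default.yaml

        datasets = dict()
        datasets['train'] = self.train_dataset_cls(
            vis_processor=self.vis_processors["train"],
            text_processor=self.text_processors["train"],
            vis_root=build_info.vis_root,
            ann_paths=build_info.ann_paths,
            subtitles_path=build_info.subtitles_path,
            model_name=build_info.model_name,
        )
        return datasets
```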
18
+ ## Edit training config file
19
+ Add your dataset to the `datasets` section of the training config YAML file, as shown below:
20
+ ```yaml
21
+ datasets:
22
+ dataset_name: # change this to your dataset name
23
+ batch_size: 4 # change this to your desired batch size
24
+ vis_processor:
25
+ train:
26
+ name: "blip2_image_train"
27
+ image_size: 224
28
+ text_processor:
29
+ train:
30
+ name: "blip_caption"
31
+ sample_ratio: 200 # if you include joint training with other datasets, set the sample ratio here
32
+ ```
33
+
README.md CHANGED
@@ -1,12 +1,212 @@
1
- ---
2
- title: MiniGPT4 Video Zero
3
- emoji: 🎞️🍿
4
- colorFrom: green
5
- colorTo: yellow
6
- sdk: gradio
7
- sdk_version: 4.27.0
8
- app_file: minigpt4_video_demo.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
2
+ <!-- technical report link -->
3
+ <!-- demo link -->
4
+ <a href='https://vision-cair.github.io/MiniGPT4-video/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
5
+ <a href='https://arxiv.org/abs/2404.03413'><img src='https://img.shields.io/badge/Paper-PDF-red'></a>
6
+ <a href='https://23e140b581cffa9101.gradio.live'><img src='https://img.shields.io/badge/Project-Demo-violet'></a>
7
+ <!-- <a href='https://github.com/Vision-CAIR/MiniGPT4-video'><img src='https://img.shields.io/badge/Github-Code-blue'></a> -->
8
+ ![demo_1](repo_imgs/sample_1.gif)
9
+ ![demo_2](repo_imgs/sample_2.gif)
10
+ ![demo_3](repo_imgs/sample_3.gif)
11
+ ## Overview
12
+ This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos.
13
+ Building upon the success of MiniGPT-v2, which excelled in translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos.
14
+ MiniGPT4-Video not only considers visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and textual components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively.
15
+ During inference, a speech-to-text model such as Whisper is used to generate subtitles for the video. Both the video and the subtitles are then fed to the MiniGPT4-Video model together with the instruction, and the model outputs the answer.
16
+ ![methodology](repo_imgs/MiniGPT4-video_fig.jpg)
17
+
18
+ ## :rocket: Demo
19
+ **1. Clone the repository** <br>
20
+ ```bash
21
+ git clone https://github.com/Vision-CAIR/MiniGPT4-video.git
22
+ cd MiniGPT4-video
23
+ ```
24
+
25
+ **2. Set up the environment** <br>
26
+ ```bash
27
+ conda env create -f environment.yml
28
+ ```
29
+ **3. Download the checkpoints**
30
+
31
+ | MiniGPT4-Video (Llama2 Chat 7B) | MiniGPT4-Video (Mistral 7B) |
32
+ :------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:
33
+ | [Download](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/video_llama_checkpoint_last.pth) | [Download](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/video_mistral_checkpoint_last.pth) |
34
+
35
+ **4. Run the demo** <br>
36
+
37
+ ```bash
38
+ # Llama2
39
+ python minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs/llama2_test_config.yaml
40
+ # Mistral
41
+ python minigpt4_video_demo.py --ckpt path_to_video_checkpoint --cfg-path test_configs/mistral_test_config.yaml
42
+ ```
43
+ ### Inference
44
+ Follow the previous steps, but replace step 4 with the following:
45
+
46
+ ```bash
47
+ # Llama2
48
+ python minigpt4_video_inference.py --ckpt path_to_video_checkpoint --cfg-path test_configs/llama2_test_config.yaml --video_path path_to_video --question "Your question here"
49
+ # Mistral
50
+ python minigpt4_video_inference.py --ckpt path_to_video_checkpoint --cfg-path test_configs/mistral_test_config.yaml --video_path path_to_video --question "Your question here"
51
+ ```
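As noted in the overview, inference relies on subtitles produced by a speech-to-text model such as Whisper. If you want to generate a WebVTT subtitle file for your video yourself, a minimal sketch using the openai-whisper package could look like the following (an assumption for illustration, not a script shipped with the repo; any ASR tool that writes `.vtt` works, and the `inference_subtitles/` output folder only mirrors the path used in the configs):

```python
import os

import whisper  # pip install openai-whisper (requires ffmpeg on the PATH)


def seconds_to_vtt(t: float) -> str:
    # format seconds as HH:MM:SS.mmm for WebVTT timestamps
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"


def generate_vtt(video_path: str, vtt_path: str, model_size: str = "small") -> None:
    model = whisper.load_model(model_size)
    result = model.transcribe(video_path)  # whisper extracts the audio track via ffmpeg
    os.makedirs(os.path.dirname(vtt_path) or ".", exist_ok=True)
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in result["segments"]:
            f.write(f"{seconds_to_vtt(seg['start'])} --> {seconds_to_vtt(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")


generate_vtt("path_to_video.mp4", "inference_subtitles/path_to_video.vtt")
```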
52
+ ## :fire: Training
53
+ ### To customize MiniGPT4-Video for your own Video-text dataset
54
+ <!-- point to file here Custom_training.md -->
55
+ You can find the steps to customize MiniGPT4-Video for your own video-text dataset in [Custom_training.md](Custom_training.md)
56
+ ### Training datasets
57
+ After downloading the datasets below, **go to the dataset configuration folder `minigpt4/configs/datasets` and set the paths for each dataset there.**<br>
58
+ Image-text training:<br>
59
+ You can find the steps to download the datasets in [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4/tree/main/dataset)<br>
60
+ + LAION <br>
61
+ + Conceptual Captions <br>
62
+ + SBU <br>
63
+
64
+ Video-text training:<br>
65
+
66
+ + [CMD](https://www.robots.ox.ac.uk/~vgg/data/condensed-movies/) <br>
67
+ + [Webvid](https://github.com/m-bain/webvid/) <br> <!-- + [Webvid](https://huggingface.co/datasets/TempoFunk/webvid-10M?row=2) <br> -->
68
+ + [Video Instructional Dataset 100K](https://huggingface.co/datasets/MBZUAI/VideoInstruct-100K) <br>
69
+
70
+ You can find the annotation files for the video-text datasets here: [download](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/tree/main/datasets/training_datasets) <br>
71
+
72
+
73
+ ### Model training:
74
+ You can edit the number of GPUs in each script below.<br>
75
+ #### Stage 1 (image text pretraining)
76
+
77
+ You can directly download the pretrained MiniGPT4 [checkpoint](https://drive.google.com/file/d/11nAPjEok8eAGGEG1N2vXo3kBLCg0WgUk/view?usp=sharing) aligned with Llama2. <br>
78
+
79
+ Or train it yourself:
80
+
81
+ ```bash
82
+ # pretrain
83
+ # Llama2
84
+ torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/224_minigpt4_llama2_image.yaml
85
+ # Mistral
86
+ torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/224_minigpt4_mistral_image.yaml
87
+
88
+ # align
89
+ # To launch the second stage alignment, first specify the path to the checkpoint file trained in pretrain stage.
90
+ # Llama2
91
+ torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/224_minigpt4_llama2_image_align.yaml
92
+ # Mistral
93
+ torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/224_minigpt4_mistral_image_align.yaml
94
+ ```
95
+ You can download our trained weights for this stage from here [Llama2](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/image_llama2_checkpoint.pth) [Mistral](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/image_mistral_checkpoint.pth)<br>
96
+ #### Stage 2 (video captioning pretraining)
97
+
98
+ For **Llama2**: <br>
99
+ Set the cfg-path in the script to `train_configs/224_v2_llama2_video_stage_2.yaml` <br>
100
+ Set the model name in `minigpt4/configs/datasets/cmd_video/default.yaml` and `minigpt4/configs/datasets/webvid/default.yaml` to llama2.<br>
101
+ For **Mistral**: <br>
102
+ Set the cfg-path in the script to `train_configs/224_v2_mistral_video_stage_2.yaml` <br>
103
+ Set the model name in `minigpt4/configs/datasets/cmd_video/default.yaml` and `minigpt4/configs/datasets/webvid/default.yaml` to mistral.<br>
104
+
105
+ ```bash
106
+ bash jobs_video/train/stage_2.sh
107
+ ```
108
+ You can download our trained weights for this stage from here [Llama2](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/video_captioning_llama_checkpoint_last.pth) [Mistral](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/video_captioning_mistral_checkpoint_last.pth)<br>
109
+
110
+ #### Stage 3 (video Instruction finetuning)
111
+
112
+ For **Llama2**: <br>
113
+ Set the cfg-path in the script to `train_configs/224_v2_llama2_video_stage_3.yaml` <br>
114
+ Set the model name in `minigpt4/configs/datasets/video_chatgpt/default.yaml` to llama2.<br>
115
+
116
+ For **Mistral**: <br>
117
+ Set the cfg-path in the script to `train_configs/224_v2_mistral_video_stage_3.yaml` <br>
118
+ Set the model name in `minigpt4/configs/datasets/video_chatgpt/default.yaml` to mistral.<br>
119
+
120
+ ```bash
121
+ bash jobs_video/train/stage_3.sh
122
+ ```
123
+ You can download our trained weights for this stage from here [Llama2](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/video_llama_checkpoint_last.pth) [Mistral](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/video_mistral_checkpoint_last.pth)<br>
124
+
125
+ ## :zap: Evaluation
126
+ To reproduce the results, use the best checkpoints for each model: <br>
127
+ [Llama2](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/video_captioning_llama_checkpoint_best.pth) [Mistral](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/blob/main/checkpoints/video_captioning_mistral_checkpoint_best.pth)<br>
128
+ We used the same evaluation protocol as [Video-ChatGPT](https://mbzuai-oryx.github.io/Video-ChatGPT/)<br>
129
+ <!-- ![short_results](repo_imgs/short_results.PNG) -->
130
+
131
+ |Method| Using Subtitles | Information Correctness | Detailed Orientation | Contextual Understanding | Temporal Understanding | Consistency |
132
+ |:--------------------:|:----:|:------------------------:|:---------------------:|:-------------------------:|:-----------------------:|:------------:|
133
+ | LLaMA Adapter | :x:| 2.03 | 2.32| 2.30| 1.98| 2.15 |
134
+ | Video LLaMA| :x:| 1.96 | 2.18| 2.16| 1.82| 1.79 |
135
+ | Video Chat| :x:| 2.23 | 2.50| 2.53| 1.94| 2.24 |
136
+ | Video-ChatGPT | :x:| 2.40 | 2.52| 2.62| 1.98| 2.37 |
137
+ | BT-Adapter-7B | :x:| 2.68 | 2.69| 3.27| 2.34| 2.46 |
138
+ | LLaMA-VID-7B| :x:| 2.96 | 3.00| 3.53| 2.46| 2.51 |
139
+ | **Ours-7B Llama2**| :x:| 2.93 | 2.97| 3.45| **2.47**| **2.60**|
140
+ | **Ours-7B Llama2**| :white_check_mark:| **3.08** | **3.02**| **3.57**| **2.65**| **2.67**|
141
+ | **Ours-7B Mistral** | :x:| 2.83|2.52 |3.01 |2.32 |2.40 |
142
+ | **Ours-7B Mistral**| :white_check_mark:| 2.91 | 2.57| 3.11|2.33 | 2.39|
143
+
144
+
145
+
146
+ |Method| Using Subtitles | MSVD Acc.↑ | MSVD Score↑ | MSRVTT Acc.↑ | MSRVTT Score↑ | TGIF Acc.↑ | TGIF Score↑ | ActivityNet Acc.↑ | ActivityNet Score↑ | TVQA Acc.↑ |
147
+ |:---------------------------------------:|:----------------:|:-----------:|:------------:|:--------------:|:---------------:|:-----------:|:------------:|:-------------------:|:--------------------:|:------------:|
148
+ | FrozenBiLM|:x:|32.2| --|16.8 |--| 41 |-- |24.7|--|29.7 |
149
+ | LLaMA Adapter|:x:|54.9| 3.1 |43.8 |2.7| -- |-- |34.2| 2.7| --|
150
+ | Video LLaMA|:x:|51.6| 2.5 |29|1.8| -- |-- |12.4| 1.1| --|
151
+ | Video Chat|:x:|56.3| 2.8 |45|2.5|34.4| 2.3 |26.5| 2.2|--|
152
+ | Video-ChatGPT|:x:|64.9| 3.3 |49.3 |2.8|51.4| 3.0 |35.2| 2.7|23.35|
153
+ | BT-Adapter-7B|:x:|67.7| 3.7 |57|3.2| -- |-- |45.7| 3.2| --|
154
+ | LLaMA-VID-7B |:x:|69.7| 3.7 |57.7 |3.2| -- |-- |**47.4**| **3.3**| --|
155
+ | **Ours-7B LLama2**|:x:|72.93|3.84|58.83|3.29|67.9|3.71| 45.85 |3.23|36.45|
156
+ | **Ours-7B Llama2**|:white_check_mark:|72.93|3.84|**59.73**|**3.3** |67.9|3.71| 46.3|3.4 |46.94|
157
+ | **Ours-7B Mistral**|:x:|**73.92**|**4.06**|58.26|3.52|**72.22**|**4.08**|44.25 |3.35|33.90|
158
+ | **Ours-7B Mistral**|:white_check_mark:|**73.92**|**4.06**|58.68|3.53 |**72.22**|**4.08**| 44.38|3.36 |**54.21** |
159
+
160
+ ### Download datasets for evaluation
161
+ + [MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) <br>
162
+ + [MSRVTT](https://cove.thecvf.com/datasets/839) <br>
163
+ + [TGIF](https://github.com/YunseokJANG/tgif-qa/blob/master/dataset/README.md) <br>
164
+ + [ActivityNet](https://mbzuaiac-my.sharepoint.com/:u:/g/personal/hanoona_bangalath_mbzuai_ac_ae/ESa302OCJMNHsMk7wuBbQc8BZH5CqlcdCWiSpXynQZDfAQ?e=CrOPbm) <br>
165
+ + [TVQA](https://tvqa.cs.unc.edu/) <br>
166
+ + [Video-ChatGPT benchmark](https://mbzuai-oryx.github.io/Video-ChatGPT/) <br>
167
+
168
+ You can find the evaluation dataset annotation files here: [download](https://huggingface.co/Vision-CAIR/MiniGPT4-Video/tree/main/datasets/evaluation_datasets) <br>
169
+
170
+ ### Run evaluation script
171
+ Set each evaluation script's parameters to include the path to the checkpoint, the dataset name, and whether or not to use subtitles. <br>
172
+
173
+ ```bash
174
+ # Llama2
175
+ bash jobs_video/eval/llama2_evaluation.sh
176
+ # Mistral
177
+ bash jobs_video/eval/mistral_evalualtion.sh
178
+ ```
179
+ Then use GPT-3.5 Turbo to compare the predictions with the ground truth and generate the accuracy and scores. <br>
180
+ Set these variables in both evaluate_benchmark.sh and evaluate_zeroshot.sh: <br>
181
+ ```bash
182
+ PRED="path_to_predictions"
183
+ OUTPUT_DIR="path_to_output_dir"
184
+ API_KEY="openAI_key"
185
+ NUM_TASKS=128
186
+ ```
187
+ Then, to evaluate the [Video-ChatGPT benchmark](https://mbzuai-oryx.github.io/Video-ChatGPT/), run the following script: <br>
188
+ ```bash
189
+ bash test_benchmark/quantitative_evaluation/evaluate_benchmark.sh
190
+ ```
191
+ To evaluate open-ended questions, run the following script: <br>
192
+ ```bash
193
+ bash test_benchmark/quantitative_evaluation/evaluate_zeroshot.sh
194
+ ```
195
+
196
+ If you're using MiniGPT4-Video in your research or applications, please cite using this BibTeX:
197
+ ```
198
+ @article{ataallah2024minigpt4,
199
+ title={MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens},
200
+ author={Ataallah, Kirolos and Shen, Xiaoqian and Abdelrahman, Eslam and Sleiman, Essam and Zhu, Deyao and Ding, Jian and Elhoseiny, Mohamed},
201
+ journal={arXiv preprint arXiv:2404.03413},
202
+ year={2024}
203
+ }
204
+ ```
205
+
206
+ ## Acknowledgements
207
+ [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4) <br>
208
+ [Video-ChatGPT](https://mbzuai-oryx.github.io/Video-ChatGPT)
209
+
210
+ ## License
211
+ This repository is under [BSD 3-Clause License](LICENSE.md).
212
+ Much of the code is based on [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4).
environment.yml CHANGED
@@ -1,4 +1,4 @@
1
- name: minigpt4_video_test_v100
2
  channels:
3
  - conda-forge
4
  dependencies:
@@ -143,7 +143,6 @@ dependencies:
143
  - ffmpeg-python==0.2.0
144
  - ffmpy==0.3.1
145
  - filelock==3.13.1
146
- - flash-attn==2.5.4
147
  - flask==3.0.2
148
  - flatbuffers==23.5.26
149
  - fonttools==4.47.0
 
1
+ name: minigpt4_video
2
  channels:
3
  - conda-forge
4
  dependencies:
 
143
  - ffmpeg-python==0.2.0
144
  - ffmpy==0.3.1
145
  - filelock==3.13.1
 
146
  - flask==3.0.2
147
  - flatbuffers==23.5.26
148
  - fonttools==4.47.0
eval_video.py CHANGED
@@ -5,7 +5,7 @@ from torch.utils.data import DataLoader
5
  from minigpt4.common.eval_utils import prepare_texts, init_model, eval_parser
6
  from minigpt4.conversation.conversation import CONV_VISION
7
  from minigpt4.processors.blip_processors import Blip2ImageTrainProcessor,BlipCaptionProcessor
8
- from minigpt4.datasets.datasets.video_datasets import VideoChatGPTEvalDataset,VideoChatGPTEval_consistancy,Video_validation_Dataset,TVQAEVAL,TVQAEVAL_Long
9
 
10
  parser = eval_parser()
11
  parser.add_argument("--dataset", type=str, default='msvd', help="dataset to evaluate")
 
5
  from minigpt4.common.eval_utils import prepare_texts, init_model, eval_parser
6
  from minigpt4.conversation.conversation import CONV_VISION
7
  from minigpt4.processors.blip_processors import Blip2ImageTrainProcessor,BlipCaptionProcessor
8
+ from minigpt4.datasets.datasets.video_datasets import VideoChatGPTEvalDataset,VideoChatGPTEval_consistancy,Video_validation_Dataset,TVQAEVAL
9
 
10
  parser = eval_parser()
11
  parser.add_argument("--dataset", type=str, default='msvd', help="dataset to evaluate")
jobs_video/train/stage_2.sh ADDED
@@ -0,0 +1,23 @@
1
+ #!/bin/bash
2
+ #SBATCH --partition=batch
3
+ #SBATCH --job-name=test
4
+ #SBATCH --output=test.out
5
+ #SBATCH --error=test.err
6
+ #SBATCH --time=23:00:00
7
+ #SBATCH --mem=110G
8
+ #SBATCH --gres=gpu:a100:4
9
+ #SBATCH --cpus-per-task=16
10
+ ## run the application:
11
+ job_name=test # Name of the experiment
12
+ cfg_path="train_configs/224_v2_llama2_video_stage_2.yaml" # path to the config file
13
+ number_of_gpus=1 # number of gpus
14
+ # cd ../../
15
+
16
+ read LOWERPORT UPPERPORT < /proc/sys/net/ipv4/ip_local_port_range
17
+ while :
18
+ do
19
+ PORT="`shuf -i $LOWERPORT-$UPPERPORT -n 1`"
20
+ ss -lpn | grep -q ":$PORT " || break
21
+ done
22
+ echo "Port is $PORT"
23
+ torchrun --master-port ${PORT} --nproc-per-node $number_of_gpus train.py --job_name ${job_name} --cfg-path ${cfg_path}
jobs_video/train/stage_3.sh ADDED
@@ -0,0 +1,23 @@
1
+ #!/bin/bash
2
+ #SBATCH --partition=batch
3
+ #SBATCH --job-name=test
4
+ #SBATCH --output=test.out
5
+ #SBATCH --error=test.err
6
+ #SBATCH --time=23:00:00
7
+ #SBATCH --mem=110G
8
+ #SBATCH --gres=gpu:a100:4
9
+ #SBATCH --cpus-per-task=16
10
+ ## run the application:
11
+ job_name="test" # Name of the experiment
12
+ cfg_path="train_configs/224_v2_llama2_video_stage_3.yaml" # path to the config file
13
+ number_of_gpus=1 # number of gpus
14
+ # cd ../../
15
+
16
+ read LOWERPORT UPPERPORT < /proc/sys/net/ipv4/ip_local_port_range
17
+ while :
18
+ do
19
+ PORT="`shuf -i $LOWERPORT-$UPPERPORT -n 1`"
20
+ ss -lpn | grep -q ":$PORT " || break
21
+ done
22
+ echo "Port is $PORT"
23
+ torchrun --master-port ${PORT} --nproc-per-node $number_of_gpus train.py --job_name ${job_name} --cfg-path ${cfg_path}
minigpt4/configs/datasets/cc_sbu/align.yaml CHANGED
@@ -2,4 +2,5 @@ datasets:
2
  cc_sbu_align:
3
  data_type: images
4
  build_info:
5
- storage: "/ibex/project/c2133/minigpt4_1/MiniGPT-4/minigpt4/configs/datasets/cc_sbu_align"
 
 
2
  cc_sbu_align:
3
  data_type: images
4
  build_info:
5
+ # storage: "/ibex/project/c2090/datasets/cc_sbu_align"
6
+ storage: "path/to/cc_sbu_align/dataset"
minigpt4/configs/datasets/cmd_video/default.yaml CHANGED
@@ -10,6 +10,7 @@ datasets:
10
 
11
  build_info:
12
  # Be careful not to append minus sign (-) before split to avoid itemizing
13
- vis_root: /ibex/ai/reference/videos/CondensedMovies/data/images
14
- ann_paths: [datasets/training_datasets/video_text_data/cmd/train.json]
15
- cc_path: datasets/training_datasets/video_text_data/cmd/caption.json
 
 
10
 
11
  build_info:
12
  # Be careful not to append minus sign (-) before split to avoid itemizing
13
+ vis_root: path/to/videos/
14
+ ann_paths: [path/to/annotations.json]
15
+ subtitles_path: path/to/subtitles_folder # folder that contains subtitles of .vtt format
16
+ model_name: 'llama2' # Language Model Name (available: llama2, mistral)
minigpt4/configs/datasets/template/default.yaml ADDED
@@ -0,0 +1,16 @@
1
+ # Copyright (c) 2022, salesforce.com, inc.
2
+ # All rights reserved.
3
+ # SPDX-License-Identifier: BSD-3-Clause
4
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
5
+
6
+ datasets:
7
+ dataset_name: # same as the name of the train_config yaml file
8
+ # data_dir: ${env.data_dir}/datasets
9
+ data_type: images # keep this as images for now, even if the data is videos
10
+
11
+ build_info: # this is the information needed to build the dataset
12
+ # Be careful not to append minus sign (-) before split to avoid itemizing
13
+ ann_paths: [path/to/annotations_json] # list of paths to annotation files
14
+ vis_root: path/to/videos_folder
15
+ subtitles_path: path/to/subtitles_folder
16
+ model_name: 'llama2' # Language Model Name (available: llama2, mistral)
minigpt4/configs/datasets/video_chatgpt/default.yaml CHANGED
@@ -10,11 +10,7 @@ datasets:
10
 
11
  build_info:
12
  # Be careful not to append minus sign (-) before split to avoid itemizing
13
- ann_paths: [datasets/training_datasets/video_text_data/video_instruct_100/VideoInstruct100K.json]
14
- vis_root: /ibex/project/c2090/datasets/VideoInstruct100K/
15
- valid:
16
- ann_path: "datasets/video_text_data/validation/all_datasets_samples_val_qa.json"
17
- videos_path: "/ibex/project/c2090/datasets/VideoInstruct100K/test_videos/all_datasets_samples_val"
18
- subtitles_path: "inference_subtitles"
19
- annotations_keys: ['question','answer','video_id']
20
- add_subtitles: True
 
10
 
11
  build_info:
12
  # Be careful not to append minus sign (-) before split to avoid itemizing
13
+ ann_paths: [path/to/annotations_json] # list of paths to annotation files
14
+ vis_root: path/to/videos_folder
15
+ subtitles_path: path/to/subtitles_folder # folder that contains subtitles of .vtt format
16
+ model_name: 'llama2' # Language Model Name (available: llama2, mistral)
 
 
 
 
minigpt4/configs/datasets/webvid/default.yaml CHANGED
@@ -10,6 +10,7 @@ datasets:
10
 
11
  build_info:
12
  # Be careful not to append minus sign (-) before split to avoid itemizing
13
- ann_paths: [datasets/training_datasets/video_text_data/webvid/train.json]
14
- vis_root: /ibex/ai/reference/videos/webvid/data/videos
15
- subtitles_path: /ibex/project/c2090/datasets/Webvid/webvid_val_subtitles/
 
 
10
 
11
  build_info:
12
  # Be careful not to append minus sign (-) before split to avoid itemizing
13
+ ann_paths: [path/to/annotations.json]
14
+ vis_root: path/to/videos/
15
+ subtitles_path: path/to/subtitles_folder/ # folder that contains subtitles of .vtt format
16
+ model_name: 'llama2' # Language Model Name (available: llama2, mistral)
minigpt4/datasets/builders/image_text_pair_builder.py CHANGED
@@ -5,6 +5,7 @@ import warnings
5
  from minigpt4.common.registry import registry
6
  from minigpt4.datasets.builders.base_dataset_builder import BaseDatasetBuilder
7
  from minigpt4.datasets.datasets.laion_dataset import LaionDataset
 
8
  from minigpt4.datasets.datasets.vg_dataset import ReferVisualGenomeDataset
9
  from minigpt4.datasets.datasets.open_images import OpenImageDataset,OpenBboxToObjectDataset
10
  from minigpt4.datasets.datasets.locna_dataset import LocNaCOCODataset
@@ -16,7 +17,7 @@ from minigpt4.datasets.datasets.coyo_dataset import COYOCaptionWDSDataset,COYOBo
16
  # , COYOBBoxPhraseDataset
17
  from minigpt4.datasets.datasets.grounded_detailed_image_caption_dataset import GroundedDetailDataset
18
  from minigpt4.datasets.datasets.reasoning_dataset import ReasoningDataset
19
- from minigpt4.datasets.datasets.video_datasets import CMDVideoDataset, WebVidDataset,VideoChatGPTDataset,Video_validation_Dataset
20
  from minigpt4.datasets.datasets.cot import CoTDataset
21
  from minigpt4.datasets.datasets.unnatural_instruction import UnnaturalDataset
22
  from minigpt4.datasets.datasets.caption_reasoning import CaptionReasonDataset
@@ -441,9 +442,68 @@ class CoyoBboxPhraseBuilder(BaseDatasetBuilder):
441
  return datasets
442
 
443
 
 
 
 
444
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
445
 
 
 
446
 
 
 
 
 
 
 
 
 
 
 
447
 
448
 
449
  @registry.register_builder("textcaps_ocr")
@@ -739,7 +799,8 @@ class CMDVideoBuilder(BaseDatasetBuilder):
739
  text_processor=self.text_processors["train"],
740
  vis_root=build_info.vis_root,
741
  ann_paths=build_info.ann_paths,
742
- cc_path=build_info.cc_path
 
743
  )
744
 
745
  return datasets
@@ -770,6 +831,7 @@ class WebVidBuilder(BaseDatasetBuilder):
770
  vis_root=build_info.vis_root,
771
  ann_paths=build_info.ann_paths,
772
  subtitles_path=build_info.subtitles_path,
 
773
  )
774
 
775
  return datasets
@@ -778,7 +840,6 @@ class WebVidBuilder(BaseDatasetBuilder):
778
  @registry.register_builder("video_chatgpt")
779
  class VideoChatGPTBuilder(BaseDatasetBuilder):
780
  train_dataset_cls = VideoChatGPTDataset
781
- eval_dataset_cls=Video_validation_Dataset
782
 
783
  DATASET_CONFIG_DICT = {
784
  "default": "configs/datasets/video_chatgpt/default.yaml",
@@ -800,6 +861,38 @@ class VideoChatGPTBuilder(BaseDatasetBuilder):
800
  text_processor=self.text_processors["train"],
801
  vis_root=build_info.vis_root,
802
  ann_paths=build_info.ann_paths,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
803
  )
804
 
805
  return datasets
 
5
  from minigpt4.common.registry import registry
6
  from minigpt4.datasets.builders.base_dataset_builder import BaseDatasetBuilder
7
  from minigpt4.datasets.datasets.laion_dataset import LaionDataset
8
+ from minigpt4.datasets.datasets.cc_sbu_dataset import CCSBUDataset, CCSBUAlignDataset
9
  from minigpt4.datasets.datasets.vg_dataset import ReferVisualGenomeDataset
10
  from minigpt4.datasets.datasets.open_images import OpenImageDataset,OpenBboxToObjectDataset
11
  from minigpt4.datasets.datasets.locna_dataset import LocNaCOCODataset
 
17
  # , COYOBBoxPhraseDataset
18
  from minigpt4.datasets.datasets.grounded_detailed_image_caption_dataset import GroundedDetailDataset
19
  from minigpt4.datasets.datasets.reasoning_dataset import ReasoningDataset
20
+ from minigpt4.datasets.datasets.video_datasets import CMDVideoDataset, WebVidDataset,VideoChatGPTDataset
21
  from minigpt4.datasets.datasets.cot import CoTDataset
22
  from minigpt4.datasets.datasets.unnatural_instruction import UnnaturalDataset
23
  from minigpt4.datasets.datasets.caption_reasoning import CaptionReasonDataset
 
442
  return datasets
443
 
444
 
445
+ @registry.register_builder("cc_sbu_align")
446
+ class CCSBUAlignBuilder(BaseDatasetBuilder):
447
+ train_dataset_cls = CCSBUAlignDataset
448
 
449
+ DATASET_CONFIG_DICT = {
450
+ "default": "configs/datasets/cc_sbu/align.yaml",
451
+ }
452
+
453
+ def build_datasets(self):
454
+ # at this point, all the annotations and image/videos should be all downloaded to the specified locations.
455
+ logging.info("Building datasets...")
456
+ self.build_processors()
457
+
458
+ build_info = self.config.build_info
459
+ storage_path = build_info.storage
460
+
461
+ datasets = dict()
462
+
463
+ if not os.path.exists(storage_path):
464
+ warnings.warn("storage path {} does not exist.".format(storage_path))
465
+
466
+ # create datasets
467
+ dataset_cls = self.train_dataset_cls
468
+ datasets['train'] = dataset_cls(
469
+ vis_processor=self.vis_processors["train"],
470
+ text_processor=self.text_processors["train"],
471
+ ann_paths=[os.path.join(storage_path, 'filter_cap.json')],
472
+ vis_root=os.path.join(storage_path, 'image'),
473
+ )
474
+
475
+ return datasets
476
+
477
+ @registry.register_builder("cc_sbu")
478
+ class CCSBUBuilder(BaseDatasetBuilder):
479
+ train_dataset_cls = CCSBUDataset
480
+
481
+ DATASET_CONFIG_DICT = {"default": "configs/datasets/cc_sbu/defaults.yaml"}
482
+
483
+ def _download_ann(self):
484
+ pass
485
+
486
+ def _download_vis(self):
487
+ pass
488
+
489
+ def build(self):
490
+ self.build_processors()
491
+
492
+ build_info = self.config.build_info
493
 
494
+ datasets = dict()
495
+ split = "train"
496
 
497
+ # create datasets
498
+ # [NOTE] return inner_datasets (wds.DataPipeline)
499
+ dataset_cls = self.train_dataset_cls
500
+ datasets[split] = dataset_cls(
501
+ vis_processor=self.vis_processors[split],
502
+ text_processor=self.text_processors[split],
503
+ location=build_info.storage,
504
+ ).inner_dataset
505
+
506
+ return datasets
507
 
508
 
509
  @registry.register_builder("textcaps_ocr")
 
799
  text_processor=self.text_processors["train"],
800
  vis_root=build_info.vis_root,
801
  ann_paths=build_info.ann_paths,
802
+ subtitles_path=build_info.subtitles_path,
803
+ model_name= build_info.model_name,
804
  )
805
 
806
  return datasets
 
831
  vis_root=build_info.vis_root,
832
  ann_paths=build_info.ann_paths,
833
  subtitles_path=build_info.subtitles_path,
834
+ model_name= build_info.model_name,
835
  )
836
 
837
  return datasets
 
840
  @registry.register_builder("video_chatgpt")
841
  class VideoChatGPTBuilder(BaseDatasetBuilder):
842
  train_dataset_cls = VideoChatGPTDataset
 
843
 
844
  DATASET_CONFIG_DICT = {
845
  "default": "configs/datasets/video_chatgpt/default.yaml",
 
861
  text_processor=self.text_processors["train"],
862
  vis_root=build_info.vis_root,
863
  ann_paths=build_info.ann_paths,
864
+ subtitles_path=build_info.subtitles_path,
865
+ model_name=build_info.model_name
866
+ )
867
+
868
+ return datasets
869
+
870
+ @registry.register_builder("Name of the builder as in the config file")
871
+ class VideoTemplateBuilder(BaseDatasetBuilder):
872
+ train_dataset_cls = ... # Add the dataset class here
873
+
874
+ DATASET_CONFIG_DICT = {
875
+ "default": "path to the config file",
876
+ }
877
+ print(DATASET_CONFIG_DICT)
878
+
879
+ def build_datasets(self):
880
+ # download, split, etc...
881
+ # only called on 1 GPU/TPU in distributed
882
+ self.build_processors()
883
+
884
+ build_info = self.config.build_info # information from the config file
885
+ datasets = dict()
886
+
887
+ # create datasets
888
+ dataset_cls = self.train_dataset_cls
889
+ datasets['train'] = dataset_cls(
890
+ vis_processor=self.vis_processors["train"], # Add the vis_processor here
891
+ text_processor=self.text_processors["train"], # Add the text_processor here
892
+ vis_root=build_info.vis_root, # Add videos path here
893
+ ann_paths=build_info.ann_paths, # Add annotations path here
894
+ subtitles_path=build_info.subtitles_path, # Add subtitles path here
895
+ model_name='llama2' # Add model name here (llama2 or mistral)
896
  )
897
 
898
  return datasets
minigpt4/datasets/datasets/cc_sbu_dataset.py ADDED
@@ -0,0 +1,47 @@
1
+ import os
2
+ from PIL import Image
3
+ import webdataset as wds
4
+ from minigpt4.datasets.datasets.base_dataset import BaseDataset
5
+ from minigpt4.datasets.datasets.caption_datasets import CaptionDataset
6
+
7
+
8
+ class CCSBUDataset(BaseDataset):
9
+ def __init__(self, vis_processor, text_processor, location):
10
+ super().__init__(vis_processor=vis_processor, text_processor=text_processor)
11
+
12
+ self.inner_dataset = wds.DataPipeline(
13
+ wds.ResampledShards(location),
14
+ wds.tarfile_to_samples(handler=wds.warn_and_continue),
15
+ wds.shuffle(1000, handler=wds.warn_and_continue),
16
+ wds.decode("pilrgb", handler=wds.warn_and_continue),
17
+ wds.to_tuple("jpg", "json", handler=wds.warn_and_continue),
18
+ wds.map_tuple(self.vis_processor, handler=wds.warn_and_continue),
19
+ wds.map(self.to_dict, handler=wds.warn_and_continue),
20
+ )
21
+
22
+ def to_dict(self, sample):
23
+ return {
24
+ "image": sample[0],
25
+ "answer": self.text_processor(sample[1]["caption"]),
26
+ }
27
+
28
+
29
+ class CCSBUAlignDataset(CaptionDataset):
30
+
31
+ def __getitem__(self, index):
32
+
33
+ # TODO this assumes image input, not general enough
34
+ ann = self.annotation[index]
35
+
36
+ img_file = '{}.jpg'.format(ann["image_id"])
37
+ image_path = os.path.join(self.vis_root, img_file)
38
+ image = Image.open(image_path).convert("RGB")
39
+
40
+ image = self.vis_processor(image)
41
+ caption = ann["caption"]
42
+
43
+ return {
44
+ "image": image,
45
+ "answer": caption,
46
+ "image_id": self.img_ids[ann["image_id"]],
47
+ }
minigpt4/datasets/datasets/video_datasets.py CHANGED
@@ -7,7 +7,8 @@
7
 
8
  import os
9
  from collections import OrderedDict
10
- import sys
 
11
  from minigpt4.datasets.datasets.base_dataset import BaseDataset
12
  from PIL import Image
13
  import random
@@ -98,7 +99,7 @@ class __DisplMixin:
98
 
99
 
100
  class CMDVideoDataset(BaseDataset, __DisplMixin):
101
- def __init__(self, vis_processor, text_processor, vis_root, ann_paths, cc_path):
102
  """
103
  vis_root (string): Root directory of images (e.g. coco/images/)
104
  ann_root (string): directory to store the annotation file
@@ -119,51 +120,89 @@ class CMDVideoDataset(BaseDataset, __DisplMixin):
119
  'Please provide a depiction of the video.',
120
  'Illustrate what is happening in the video.',
121
  ]
122
- self.img_ids = {}
123
- n = 0
124
- self.length = 90
125
- for ann in self.annotation:
126
- img_id = ann["image_id"]
127
- if img_id not in self.img_ids.keys():
128
- self.img_ids[img_id] = n
129
- n += 1
130
 
131
- self.cc = json.load(open(cc_path,'r'))
132
- self.image_sep = "<Img>"
133
- self.text_sep = "<Cap>"
 
 
 
 
 
134
 
135
  def __getitem__(self, index):
136
  ann = self.annotation[index]
137
  video_id = ann["image_id"]
138
- captions = self.cc[video_id] if video_id in self.cc else None
139
- answer = self.text_processor(ann["caption"])
140
  instruction = random.choice(self.instruction_pool)
141
- images = []
142
- img_placeholder = ""
143
- num_of_images=len(os.listdir(os.path.join(self.vis_root, video_id)))
144
- sampling_interval = int(num_of_images / self.length)
 
 
 
 
 
 
 
 
145
  if sampling_interval == 0:
146
  sampling_interval = 1
147
- for frame_id in range(0,num_of_images,sampling_interval):
148
- image_path = os.path.join(self.vis_root, video_id, f'frame_{frame_id}.jpg')
149
- image = Image.open(image_path).convert("RGB")
150
- image = self.vis_processor(image)
151
- images.append(image)
152
- img_placeholder += f"{self.image_sep}<ImageHere>"
153
- time_step = str(frame_id * 2)
154
- if captions is not None:
155
- if time_step in captions:
156
- img_placeholder += f"{self.text_sep}{captions[time_step]}"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
157
  if len(images) >= self.length:
158
  break
159
-
160
- if len(images) < self.length:
 
 
 
161
  last_item = images[-1]
162
  while len(images) < self.length:
163
  images.append(last_item)
 
164
  images = torch.stack(images)
165
- instruction = f"{img_placeholder}\n{instruction}"
166
- return {
167
  "image": images,
168
  "answer": answer,
169
  "image_id": video_id,
@@ -172,10 +211,8 @@ class CMDVideoDataset(BaseDataset, __DisplMixin):
172
  }
173
 
174
 
175
-
176
-
177
  class WebVidDataset(BaseDataset, __DisplMixin):
178
- def __init__(self, vis_processor, text_processor, vis_root, ann_paths,subtitles_path,add_subtitles=False):
179
  """
180
  vis_root (string): Root directory of images (e.g. coco/images/)
181
  ann_root (string): directory to store the annotation file
@@ -196,10 +233,13 @@ class WebVidDataset(BaseDataset, __DisplMixin):
196
  'Please provide a depiction of the video.',
197
  'Illustrate what is happening in the video.',
198
  ]
199
- self.img_ids = {}
200
- n = 0
201
- self.length = 90
202
- self.max_sub_len = 800
 
 
 
203
  self.add_subtitles = add_subtitles
204
  self.videos_has_subtitles = {}
205
  if self.add_subtitles:
@@ -207,23 +247,16 @@ class WebVidDataset(BaseDataset, __DisplMixin):
207
  for sub in os.listdir(self.subtitle_folder):
208
  video_id = sub.split('.')[0]
209
  self.videos_has_subtitles[video_id] = True
210
- for ann in self.annotation:
211
- img_id = ann["videoid"]
212
- if img_id not in self.img_ids.keys():
213
- self.img_ids[img_id] = n
214
- n += 1
215
  self.transform = transforms.Compose([
216
  transforms.ToPILImage(),
217
  ])
218
 
219
  def __getitem__(self, index):
220
  ann = self.annotation[index]
221
-
222
  video_id = ann["videoid"]
223
  images = []
224
  caption = ann["name"].split('-')[-1].split(':')[-1]
225
  # caption = self.text_processor(caption)
226
-
227
  video_path = os.path.join(self.vis_root, ann['page_dir'], f'{video_id}.mp4')
228
  has_subtitles = self.videos_has_subtitles.get(video_id, False)
229
  if self.add_subtitles and has_subtitles:
@@ -245,20 +278,22 @@ class WebVidDataset(BaseDataset, __DisplMixin):
245
  subtitle_text_in_interval = ""
246
  history_subtitles = {}
247
  number_of_sub_words=0
 
248
  while cap.isOpened():
249
  ret, frame = cap.read()
250
  if not ret:
251
  break
252
- # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
253
- # we choose 1 frame for every 2 seconds,so we need to combine the subtitles in the interval of 2 seconds
254
  if self.add_subtitles and has_subtitles:
255
  for subtitle in vtt_file:
256
  sub=subtitle.text.replace('\n',' ')
257
- if (subtitle.start_in_seconds <= (frame_count / int(clip.fps)) <= subtitle.end_in_seconds) and sub not in subtitle_text_in_interval:
258
  if not history_subtitles.get(sub,False):
259
- subtitle_text_in_interval+=sub+" "
 
 
260
  history_subtitles[sub]=True
261
- break
262
  if frame_count % sampling_interval == 0:
263
  frame = self.transform(frame[:,:,::-1])
264
  frame = self.vis_processor(frame)
@@ -267,6 +302,7 @@ class WebVidDataset(BaseDataset, __DisplMixin):
267
  if self.add_subtitles and has_subtitles and subtitle_text_in_interval != "" and number_of_sub_words<self.max_sub_len:
268
  img_placeholder+=f'<Cap>{subtitle_text_in_interval}'
269
  number_of_sub_words+=len(subtitle_text_in_interval.split(' '))
 
270
  subtitle_text_in_interval = ""
271
  frame_count += 1
272
  if len(images) >= self.length:
@@ -291,7 +327,7 @@ class WebVidDataset(BaseDataset, __DisplMixin):
291
  }
292
 
293
  class VideoChatGPTDataset(BaseDataset, __DisplMixin):
294
- def __init__(self, vis_processor, text_processor, vis_root, ann_paths,add_subtitles=True,llm_name="llama2"):
295
  """
296
  vis_root (string): Root directory of images (e.g. coco/images/)
297
  ann_root (string): directory to store the annotation file
@@ -299,12 +335,17 @@ class VideoChatGPTDataset(BaseDataset, __DisplMixin):
299
  super().__init__(vis_processor, text_processor, vis_root, ann_paths)
300
  self.img_ids = {}
301
  n=0
302
- self.length = 90
303
- self.max_sub_len = 800
 
 
 
 
 
304
  self.add_subtitles = add_subtitles
305
  self.videos_has_subtitles = {}
306
  if self.add_subtitles:
307
- self.subtitle_folder = os.path.join(self.vis_root,'subtitles')
308
  for sub in os.listdir(self.subtitle_folder):
309
  video_id = sub.split('.')[0]
310
  self.videos_has_subtitles[video_id] = True
@@ -315,7 +356,7 @@ class VideoChatGPTDataset(BaseDataset, __DisplMixin):
315
  n+= 1
316
 
317
  self.videos_extension={}
318
- for video in os.listdir(os.path.join(self.vis_root,'videos')):
319
  self.videos_extension[video.split('.')[0]]=video.split('.')[1]
320
 
321
  self.transform = transforms.Compose([
@@ -336,7 +377,7 @@ class VideoChatGPTDataset(BaseDataset, __DisplMixin):
336
  # Load the VTT subtitle file
337
  vtt_file = webvtt.read(subtitle_path)
338
 
339
- video_path = os.path.join(self.vis_root,'videos',f'{video_id}.{self.videos_extension[video_id]}')
340
  clip = VideoFileClip(video_path)
341
  total_num_frames = int(clip.duration * clip.fps)
342
  clip.close()
@@ -349,20 +390,22 @@ class VideoChatGPTDataset(BaseDataset, __DisplMixin):
349
  subtitle_text_in_interval = ""
350
  history_subtitles = {}
351
  number_of_sub_words=0
 
352
  while cap.isOpened():
353
  ret, frame = cap.read()
354
  if not ret:
355
  break
356
- # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
357
- # we choose 1 frame for every 2 seconds,so we need to combine the subtitles in the interval of 2 seconds
358
  if self.add_subtitles and has_subtitles:
359
  for subtitle in vtt_file:
360
  sub=subtitle.text.replace('\n',' ')
361
- if (subtitle.start_in_seconds <= (frame_count / int(clip.fps)) <= subtitle.end_in_seconds) and sub not in subtitle_text_in_interval:
362
  if not history_subtitles.get(sub,False):
363
- subtitle_text_in_interval+=sub+" "
 
 
364
  history_subtitles[sub]=True
365
- break
366
  if frame_count % sampling_interval == 0:
367
  frame = self.transform(frame[:,:,::-1])# BGR to RGB
368
  frame = self.vis_processor(frame)
@@ -372,6 +415,7 @@ class VideoChatGPTDataset(BaseDataset, __DisplMixin):
372
  if subtitle_text_in_interval != "":
373
  img_placeholder+=f'<Cap>{subtitle_text_in_interval}'
374
  number_of_sub_words+=len(subtitle_text_in_interval.split(' '))
 
375
  subtitle_text_in_interval = ""
376
  frame_count += 1
377
  if len(images) >= self.length:
@@ -513,8 +557,8 @@ class WebVidEvalDataset(torch.utils.data.Dataset):
513
  ret, frame = cap.read()
514
  if not ret:
515
  break
516
- # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
517
- # we choose 1 frame for every 2 seconds,so we need to combine the subtitles in the interval of 2 seconds
518
  if self.add_subtitles and has_subtitles:
519
  for subtitle in vtt_file:
520
  sub=subtitle.text.replace('\n',' ')
@@ -616,8 +660,8 @@ class VideoChatGPTEvalDataset(torch.utils.data.Dataset):
616
  ret, frame = cap.read()
617
  if not ret:
618
  break
619
- # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
620
- # we choose 1 frame for every 2 seconds,so we need to combine the subtitles in the interval of 2 seconds
621
  if self.add_subtitles and subtitle_path is not None:
622
  for subtitle in vtt_file:
623
  sub=subtitle.text.replace('\n',' ')
@@ -711,8 +755,8 @@ class Video_validation_Dataset(torch.utils.data.Dataset):
711
  ret, frame = cap.read()
712
  if not ret:
713
  break
714
- # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
715
- # we choose 1 frame for every 2 seconds,so we need to combine the subtitles in the interval of 2 seconds
716
  if self.add_subtitles and subtitle_path is not None:
717
  for subtitle in vtt_file:
718
  sub=subtitle.text.replace('\n',' ')
@@ -808,8 +852,8 @@ class VideoChatGPTEval_consistancy(torch.utils.data.Dataset):
808
  ret, frame = cap.read()
809
  if not ret:
810
  break
811
- # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
812
- # we choose 1 frame for every 2 seconds,so we need to combine the subtitles in the interval of 2 seconds
813
  if self.add_subtitles and subtitle_path is not None:
814
  for subtitle in vtt_file:
815
  sub=subtitle.text.replace('\n',' ')
@@ -900,8 +944,8 @@ class TVQAEVAL (torch.utils.data.Dataset):
900
  history_subtitles = {}
901
  number_of_sub_words=0
902
  for i,frame in enumerate(sorted(os.listdir(video_frames_path))):
903
- # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
904
- # we choose 1 frame for every 2 seconds,so we need to combine the subtitles in the interval of 2 seconds
905
  if self.add_subtitles:
906
  for subtitle in self.subtitles[video_id]:
907
  if (subtitle['start'] <= (i / self.fps) <= subtitle['end']) and subtitle['text'] not in subtitle_text_in_interval:
@@ -934,118 +978,111 @@ class TVQAEVAL (torch.utils.data.Dataset):
934
  return images,instruction,answer,self.length,video_id
935
 
936
 
937
- class TVQAEVAL_Long (torch.utils.data.Dataset):
938
- def __init__(self, vis_processor, videos_path, ann_path,subtitles_path,videos_features_path,add_subtitles=False,llm_name="llama2"):
939
- self.tv_shows_mapping={"Grey's Anatomy":"grey_frames", 'How I Met You Mother':"met_frames", 'Friends':"friends_frames", 'The Big Bang Theory':"bbt_frames", 'House M.D.':"house_frames", 'Castle':"castle_frames"}
940
- self.fps=3
941
- if llm_name=="llama2":
942
- self.length = 45
943
- self.max_sub_len = 400
944
- else:
 
 
 
 
 
945
  self.length = 90
946
  self.max_sub_len = 800
 
 
 
947
  self.add_subtitles = add_subtitles
948
- self.vis_processor=vis_processor
949
- self.videos_path=videos_path
950
- self.subtitles_path=subtitles_path
951
- with open(ann_path,'r') as f:
952
- self.annotation=json.load(f)
 
 
 
 
953
  self.transform = transforms.Compose([
954
  transforms.ToPILImage(),
955
  ])
956
- self.videos_features_path=videos_features_path
957
- self.processed_videos={}
958
- self.save_pkl="subtitles" if self.add_subtitles else "no_subtitles"
959
- for video_pkl in os.listdir(videos_features_path):
960
- video_id_sub=video_pkl.split('.')[0]
961
- self.processed_videos[video_id_sub]=True
962
- def extract_season_episode(self,video_name):
963
- # Define a regex pattern to match season and episode numbers
964
- pattern = r's(\d+)e(\d+)'
965
-
966
- # Use re.search to find the pattern in the video name
967
- match = re.search(pattern, video_name, re.IGNORECASE)
968
-
969
- if match:
970
- # Extract season and episode numbers from the matched groups
971
- season_number = int(match.group(1))
972
- episode_number = int(match.group(2))
973
- return f"season_{season_number}", f"episode_{episode_number}"
974
- else:
975
- # Return None if the pattern is not found
976
- return None, None
977
-
978
  def __len__(self):
979
  return len(self.annotation)
980
  def __getitem__(self, index):
981
  ann = self.annotation[index]
982
- season_number,episode_number=self.extract_season_episode(ann["vid_name"])
983
- folder_name=self.tv_shows_mapping[ann["show_name"]]
984
- self.videos_path
985
- video_id = f"{folder_name}_{season_number}_{episode_number}"
986
- answer=str(ann['answer_idx'])
987
- instruction=ann["q"]+" \n\n As you watched in this video Choose ONE suitable answer from these mutiple choices \n\n"
988
- for i in range(5):
989
- ans=ann[f"a{i}"]
990
- instruction+=f"option {i}: {ans} \n\n"
991
- # instruction+="\n Your output should be THE NUMBER OF THE CORRECT ANSWER FROM THE CHOICES FROM 0 TO 4 INCLUSIVE"
992
- instruction+=f"option 5: Can't answer based on the provided information \n\n"
993
- instruction+="\n Your output should be THE NUMBER OF THE CORRECT ANSWER FROM THE CHOICES FROM 0 TO 5 INCLUSIVE"
994
  images=[]
995
  img_placeholder = ""
996
- if self.processed_videos.get(f"{video_id}_{self.save_pkl}",False):
997
- with open(f"{self.videos_features_path}/{video_id}_{self.save_pkl}.pkl",'rb') as f:
998
- data=pickle.load(f)
999
- images=data['images']
1000
- img_placeholder = data['img_placeholder']
1001
- else:
1002
- video_frames_path = os.path.join(self.videos_path,folder_name,season_number,episode_number)
1003
- video_subtitle_path=os.path.join(self.subtitles_path,folder_name,season_number,episode_number+".srt")
1004
- video_subtitles=read_subtitles(video_subtitle_path)
1005
- total_num_frames=len(os.listdir(video_frames_path))
1006
- sampling_interval = round(total_num_frames / self.length)
1007
- if sampling_interval == 0:
1008
- sampling_interval = 1
1009
- subtitle_text_in_interval = ""
1010
- history_subtitles = {}
1011
- number_of_sub_words=0
1012
- number_of_interval_words=0
1013
- max_number_of_interval_words=10
1014
- for i,frame in enumerate(sorted(os.listdir(video_frames_path))):
1015
- # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
1016
- # we choose 1 frame for every 2 seconds,so we need to combine the subtitles in the interval of 2 seconds
1017
- if self.add_subtitles:
1018
- for subtitle in video_subtitles:
1019
- if (srt_time_to_seconds(subtitle.start) <= (i / self.fps) <= srt_time_to_seconds(subtitle.end)) and subtitle.text not in subtitle_text_in_interval:
1020
- if not history_subtitles.get(subtitle.text,False) and number_of_interval_words<max_number_of_interval_words:
1021
- subtitle_text_in_interval+=subtitle.text+" "
1022
- number_of_interval_words+=len(subtitle.text.split(' '))
1023
- history_subtitles[subtitle.text]=True
1024
- break
1025
- if i % sampling_interval == 0:
1026
- frame = Image.open(os.path.join(video_frames_path,frame)).convert("RGB")
1027
- frame = self.vis_processor(frame)
1028
- images.append(frame)
1029
- img_placeholder += '<Img><ImageHere>'
1030
- if self.add_subtitles and number_of_sub_words<self.max_sub_len:
1031
- if subtitle_text_in_interval != "":
1032
- img_placeholder+=f'<Cap>{subtitle_text_in_interval}'
1033
- number_of_sub_words+=len(subtitle_text_in_interval.split(' '))
1034
- subtitle_text_in_interval = ""
1035
- if len(images) >= self.length:
1036
- break
1037
- if len(images) ==0:
1038
- print("Video not found",video_frames_path)
1039
 
1040
- if 0 <len(images) < self.length:
1041
- last_item = images[-1]
1042
- while len(images) < self.length:
1043
- images.append(last_item)
1044
- img_placeholder += '<Img><ImageHere>'
1045
- images = torch.stack(images)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1046
 
1047
- with open(f"{self.videos_features_path}/{video_id}_{self.save_pkl}.pkl",'wb') as f:
1048
- pickle.dump({"images":images,"img_placeholder":img_placeholder},f)
1049
- self.processed_videos[f"{video_id}_{self.save_pkl}"]=True
1050
- instruction = img_placeholder + '\n\n' + instruction
1051
- return images,instruction,answer,self.length,video_id
 
 
 
 
 
 
 
 
 
 
 
 
 
7
 
8
  import os
9
  from collections import OrderedDict
10
+ import sys
11
+ sys.path.append('/ibex/project/c2090/kirolos/MiniGPT4-video-llama3')
12
  from minigpt4.datasets.datasets.base_dataset import BaseDataset
13
  from PIL import Image
14
  import random
 
99
 
100
 
101
  class CMDVideoDataset(BaseDataset, __DisplMixin):
102
+ def __init__(self, vis_processor, text_processor, vis_root, ann_paths, subtitles_path,model_name='llama2'):
103
  """
104
  vis_root (string): Root directory of images (e.g. coco/images/)
105
  ann_root (string): directory to store the annotation file
 
120
  'Please provide a depiction of the video.',
121
  'Illustrate what is happening in the video.',
122
  ]
123
+
124
+ self.model_name=model_name
125
+ if self.model_name =='mistral':
126
+ self.length = 90
127
+ self.max_sub_len = 800
128
+ else:
129
+ self.length = 45
130
+ self.max_sub_len = 400
131
 
132
+ self.subtitle_folder = subtitles_path
133
+ self.videos_has_subtitles={}
134
+ for sub in os.listdir(self.subtitle_folder):
135
+ video_id = sub.split('.')[0]
136
+ self.videos_has_subtitles[video_id] = True
137
+ self.transform = transforms.Compose([
138
+ transforms.ToPILImage(),
139
+ ])
140
 
141
  def __getitem__(self, index):
142
  ann = self.annotation[index]
143
  video_id = ann["image_id"]
144
+ answer =ann['caption']
 
145
  instruction = random.choice(self.instruction_pool)
146
+ has_subtitles = self.videos_has_subtitles.get(video_id, False)
147
+ if has_subtitles:
148
+ subtitle_path = os.path.join(self.subtitle_folder, f'{video_id}.en.vtt')
149
+ # Load the VTT subtitle file
150
+ vtt_file = webvtt.read(subtitle_path)
151
+ video_path = os.path.join(self.vis_root, f'{video_id}.mp4')
152
+ clip = VideoFileClip(video_path)
153
+ total_num_frames = int(clip.duration * clip.fps)
154
+ clip.close()
155
+ cap = cv2.VideoCapture(video_path)
156
+ frame_count = 0
157
+ sampling_interval = int(total_num_frames / self.length)
158
  if sampling_interval == 0:
159
  sampling_interval = 1
160
+ img_placeholder = ""
161
+ subtitle_text_in_interval = ""
162
+ number_of_sub_words=0
163
+ images=[]
164
+ history_subtitles = {}
165
+ previous_sub = ""
166
+ while cap.isOpened():
167
+ ret, frame = cap.read()
168
+ if not ret:
169
+ break
170
+ # Find the corresponding subtitle for the each frame and combine the interval subtitles into one subtitle
171
+ if has_subtitles:
172
+ for subtitle in vtt_file:
173
+ sub=subtitle.text.replace('\n',' ')
174
+ if (subtitle.start_in_seconds <= (frame_count / int(clip.fps)) <= subtitle.end_in_seconds):
175
+ if not history_subtitles.get(sub,False):
176
+ for word in sub.split(' '):
177
+ if word not in subtitle_text_in_interval and word not in previous_sub:
178
+ subtitle_text_in_interval+=word+" "
179
+ history_subtitles[sub]=True
180
+ if frame_count % sampling_interval == 0:
181
+ frame = self.transform(frame[:,:,::-1])# BGR to RGB
182
+ frame = self.vis_processor(frame)
183
+ images.append(frame)
184
+ img_placeholder += '<Img><ImageHere>'
185
+ if has_subtitles and number_of_sub_words<self.max_sub_len:
186
+ if subtitle_text_in_interval != "":
187
+ img_placeholder+=f'<Cap>{subtitle_text_in_interval}'
188
+ number_of_sub_words+=len(subtitle_text_in_interval.split(' '))
189
+ previous_sub = subtitle_text_in_interval
190
+ subtitle_text_in_interval = ""
191
+ frame_count += 1
192
  if len(images) >= self.length:
193
  break
194
+ cap.release()
195
+ if len(images) ==0:
196
+ print("Video not found",video_path)
197
+
198
+ if 0 <len(images) < self.length:
199
  last_item = images[-1]
200
  while len(images) < self.length:
201
  images.append(last_item)
202
+ img_placeholder += '<Img><ImageHere>'
203
  images = torch.stack(images)
204
+ instruction = img_placeholder + '\n' + instruction
205
+ return{
206
  "image": images,
207
  "answer": answer,
208
  "image_id": video_id,
 
211
  }
212
 
213
 
 
 
214
  class WebVidDataset(BaseDataset, __DisplMixin):
215
+ def __init__(self, vis_processor, text_processor, vis_root, ann_paths,subtitles_path,model_name,add_subtitles=False):
216
  """
217
  vis_root (string): Root directory of images (e.g. coco/images/)
218
  ann_root (string): directory to store the annotation file
 
233
  'Please provide a depiction of the video.',
234
  'Illustrate what is happening in the video.',
235
  ]
236
+ self.model_name=model_name
237
+ if self.model_name =='mistral':
238
+ self.length = 90
239
+ self.max_sub_len = 800
240
+ else:
241
+ self.length = 45
242
+ self.max_sub_len = 400
243
  self.add_subtitles = add_subtitles
244
  self.videos_has_subtitles = {}
245
  if self.add_subtitles:
 
247
  for sub in os.listdir(self.subtitle_folder):
248
  video_id = sub.split('.')[0]
249
  self.videos_has_subtitles[video_id] = True
 
 
 
 
 
250
  self.transform = transforms.Compose([
251
  transforms.ToPILImage(),
252
  ])
253
 
254
  def __getitem__(self, index):
255
  ann = self.annotation[index]
 
256
  video_id = ann["videoid"]
257
  images = []
258
  caption = ann["name"].split('-')[-1].split(':')[-1]
259
  # caption = self.text_processor(caption)
 
260
  video_path = os.path.join(self.vis_root, ann['page_dir'], f'{video_id}.mp4')
261
  has_subtitles = self.videos_has_subtitles.get(video_id, False)
262
  if self.add_subtitles and has_subtitles:
 
278
  subtitle_text_in_interval = ""
279
  history_subtitles = {}
280
  number_of_sub_words=0
281
+ previous_sub = ""
282
  while cap.isOpened():
283
  ret, frame = cap.read()
284
  if not ret:
285
  break
286
+ # Find the corresponding subtitle for the each frame and combine the interval subtitles into one subtitle
287
+
288
  if self.add_subtitles and has_subtitles:
289
  for subtitle in vtt_file:
290
  sub=subtitle.text.replace('\n',' ')
291
+ if (subtitle.start_in_seconds <= (frame_count / int(clip.fps)) <= subtitle.end_in_seconds):
292
  if not history_subtitles.get(sub,False):
293
+ for word in sub.split(' '):
294
+ if word not in subtitle_text_in_interval and word not in previous_sub:
295
+ subtitle_text_in_interval+=word+" "
296
  history_subtitles[sub]=True
 
297
  if frame_count % sampling_interval == 0:
298
  frame = self.transform(frame[:,:,::-1])
299
  frame = self.vis_processor(frame)
 
302
  if self.add_subtitles and has_subtitles and subtitle_text_in_interval != "" and number_of_sub_words<self.max_sub_len:
303
  img_placeholder+=f'<Cap>{subtitle_text_in_interval}'
304
  number_of_sub_words+=len(subtitle_text_in_interval.split(' '))
305
+ previous_sub = subtitle_text_in_interval
306
  subtitle_text_in_interval = ""
307
  frame_count += 1
308
  if len(images) >= self.length:
 
327
  }
328
 
329
  class VideoChatGPTDataset(BaseDataset, __DisplMixin):
330
+ def __init__(self, vis_processor, text_processor, vis_root, ann_paths,subtitles_path,model_name='llama2',add_subtitles=True):
331
  """
332
  vis_root (string): Root directory of images (e.g. coco/images/)
333
  ann_root (string): directory to store the annotation file
 
335
  super().__init__(vis_processor, text_processor, vis_root, ann_paths)
336
  self.img_ids = {}
337
  n=0
338
+ self.model_name=model_name
339
+ if self.model_name =='mistral':
340
+ self.length = 90
341
+ self.max_sub_len = 800
342
+ else:
343
+ self.length = 45
344
+ self.max_sub_len = 400
345
  self.add_subtitles = add_subtitles
346
  self.videos_has_subtitles = {}
347
  if self.add_subtitles:
348
+ self.subtitle_folder = subtitles_path
349
  for sub in os.listdir(self.subtitle_folder):
350
  video_id = sub.split('.')[0]
351
  self.videos_has_subtitles[video_id] = True
 
356
  n+= 1
357
 
358
  self.videos_extension={}
359
+ for video in os.listdir(self.vis_root):
360
  self.videos_extension[video.split('.')[0]]=video.split('.')[1]
361
 
362
  self.transform = transforms.Compose([
 
377
  # Load the VTT subtitle file
378
  vtt_file = webvtt.read(subtitle_path)
379
 
380
+ video_path = os.path.join(self.vis_root,f'{video_id}.{self.videos_extension[video_id]}')
381
  clip = VideoFileClip(video_path)
382
  total_num_frames = int(clip.duration * clip.fps)
383
  clip.close()
 
390
  subtitle_text_in_interval = ""
391
  history_subtitles = {}
392
  number_of_sub_words=0
393
+ previous_sub = ""
394
  while cap.isOpened():
395
  ret, frame = cap.read()
396
  if not ret:
397
  break
398
+ # Find the corresponding subtitle for each frame and combine the interval subtitles into one subtitle
399
+
400
  if self.add_subtitles and has_subtitles:
401
  for subtitle in vtt_file:
402
  sub=subtitle.text.replace('\n',' ')
403
+ if (subtitle.start_in_seconds <= (frame_count / int(clip.fps)) <= subtitle.end_in_seconds):
404
  if not history_subtitles.get(sub,False):
405
+ for word in sub.split(' '):
406
+ if word not in subtitle_text_in_interval and word not in previous_sub:
407
+ subtitle_text_in_interval+=word+" "
408
  history_subtitles[sub]=True
 
409
  if frame_count % sampling_interval == 0:
410
  frame = self.transform(frame[:,:,::-1])# BGR to RGB
411
  frame = self.vis_processor(frame)
 
415
  if subtitle_text_in_interval != "":
416
  img_placeholder+=f'<Cap>{subtitle_text_in_interval}'
417
  number_of_sub_words+=len(subtitle_text_in_interval.split(' '))
418
+ previous_sub = subtitle_text_in_interval
419
  subtitle_text_in_interval = ""
420
  frame_count += 1
421
  if len(images) >= self.length:
 
557
  ret, frame = cap.read()
558
  if not ret:
559
  break
560
+ # Find the corresponding subtitle for each frame and combine the interval subtitles into one subtitle
561
+
562
  if self.add_subtitles and has_subtitles:
563
  for subtitle in vtt_file:
564
  sub=subtitle.text.replace('\n',' ')
 
660
  ret, frame = cap.read()
661
  if not ret:
662
  break
663
+ # Find the corresponding subtitle for each frame and combine the interval subtitles into one subtitle
664
+
665
  if self.add_subtitles and subtitle_path is not None:
666
  for subtitle in vtt_file:
667
  sub=subtitle.text.replace('\n',' ')
 
755
  ret, frame = cap.read()
756
  if not ret:
757
  break
758
+ # Find the corresponding subtitle for each frame and combine the interval subtitles into one subtitle
759
+
760
  if self.add_subtitles and subtitle_path is not None:
761
  for subtitle in vtt_file:
762
  sub=subtitle.text.replace('\n',' ')
 
852
  ret, frame = cap.read()
853
  if not ret:
854
  break
855
+ # Find the corresponding subtitle for each frame and combine the interval subtitles into one subtitle
856
+
857
  if self.add_subtitles and subtitle_path is not None:
858
  for subtitle in vtt_file:
859
  sub=subtitle.text.replace('\n',' ')
 
944
  history_subtitles = {}
945
  number_of_sub_words=0
946
  for i,frame in enumerate(sorted(os.listdir(video_frames_path))):
947
+ # Find the corresponding subtitle for each frame and combine the interval subtitles into one subtitle
948
+
949
  if self.add_subtitles:
950
  for subtitle in self.subtitles[video_id]:
951
  if (subtitle['start'] <= (i / self.fps) <= subtitle['end']) and subtitle['text'] not in subtitle_text_in_interval:
 
978
  return images,instruction,answer,self.length,video_id
979
 
980
 
981
+
982
+
983
+
984
+
985
+ class Video_loader_template(BaseDataset, __DisplMixin):
986
+ def __init__(self, vis_processor, text_processor, vis_root, ann_paths,subtitles_path,model_name='llama2',add_subtitles=True):
987
+ """
988
+ vis_root (string): Root directory of images (e.g. coco/images/)
989
+ ann_root (string): directory to store the annotation file
990
+ """
991
+ super().__init__(vis_processor, text_processor, vis_root, ann_paths)
992
+ self.model_name=model_name
993
+ if self.model_name =='mistral':
994
  self.length = 90
995
  self.max_sub_len = 800
996
+ else:
997
+ self.length = 45
998
+ self.max_sub_len = 400
999
  self.add_subtitles = add_subtitles
1000
+ self.videos_has_subtitles = {}
1001
+ if self.add_subtitles:
1002
+ self.subtitle_folder = subtitles_path
1003
+ for sub in os.listdir(self.subtitle_folder):
1004
+ video_id = sub.split('.')[0]
1005
+ self.videos_has_subtitles[video_id] = True
1006
+ self.videos_extension={}
1007
+ for video in os.listdir(os.path.join(self.vis_root,'videos')):
1008
+ self.videos_extension[video.split('.')[0]]=video.split('.')[1]
1009
  self.transform = transforms.Compose([
1010
  transforms.ToPILImage(),
1011
  ])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1012
  def __len__(self):
1013
  return len(self.annotation)
1014
  def __getitem__(self, index):
1015
  ann = self.annotation[index]
1016
+ video_id = ann["video_id"] # video_id
1017
+ answer=ann["a"] # answer (ground truth)
1018
+ instruction=ann["q"] # question (instruction)
 
 
 
 
 
 
 
 
 
1019
  images=[]
1020
  img_placeholder = ""
1021
+ has_subtitles = self.videos_has_subtitles.get(video_id, False)
1022
+ if self.add_subtitles and has_subtitles:
1023
+ subtitle_path = os.path.join(self.subtitle_folder, f'{video_id}.vtt')
1024
+ # Load the VTT subtitle file
1025
+ vtt_file = webvtt.read(subtitle_path)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1026
 
1027
+ video_path = os.path.join(self.vis_root,'videos',f'{video_id}.{self.videos_extension[video_id]}')
1028
+ clip = VideoFileClip(video_path)
1029
+ total_num_frames = int(clip.duration * clip.fps)
1030
+ clip.close()
1031
+ cap = cv2.VideoCapture(video_path)
1032
+ frame_count = 0
1033
+ # Choose the sampling interval from the total number of frames and the desired number of sampled frames (self.length)
1034
+ sampling_interval = int(total_num_frames / self.length)
1035
+ if sampling_interval == 0:
1036
+ sampling_interval = 1
1037
+ img_placeholder = ""
1038
+ subtitle_text_in_interval = ""
1039
+ history_subtitles = {}
1040
+ number_of_sub_words=0
1041
+ # Iterate over the video, keep one frame per sampling interval, and attach the subtitles if enabled
1042
+ while cap.isOpened():
1043
+ ret, frame = cap.read()
1044
+ if not ret:
1045
+ break
1046
+ # Find the corresponding subtitle for each frame and combine the interval subtitles into one subtitle
1047
+ if self.add_subtitles and has_subtitles:
1048
+ for subtitle in vtt_file:
1049
+ sub=subtitle.text.replace('\n',' ')
1050
+ if (subtitle.start_in_seconds <= (frame_count / int(clip.fps)) <= subtitle.end_in_seconds) and sub not in subtitle_text_in_interval:
1051
+ if not history_subtitles.get(sub,False):
1052
+ subtitle_text_in_interval+=sub+" "
1053
+ history_subtitles[sub]=True
1054
+ break
1055
+ if frame_count % sampling_interval == 0:
1056
+ frame = self.transform(frame[:,:,::-1])# BGR to RGB
1057
+ frame = self.vis_processor(frame)
1058
+ images.append(frame)
1059
+ img_placeholder += '<Img><ImageHere>'
1060
+ if self.add_subtitles and has_subtitles and number_of_sub_words<self.max_sub_len:
1061
+ if subtitle_text_in_interval != "":
1062
+ img_placeholder+=f'<Cap>{subtitle_text_in_interval}'
1063
+ number_of_sub_words+=len(subtitle_text_in_interval.split(' '))
1064
+ subtitle_text_in_interval = ""
1065
+ frame_count += 1
1066
+ if len(images) >= self.length:
1067
+ break
1068
+ cap.release()
1069
+ if len(images) ==0:
1070
+ print("Video not found",video_path)
1071
 
1072
+ if 0 < len(images) < self.length:
1073
+ last_item = images[-1]
1074
+ while len(images) < self.length:
1075
+ images.append(last_item)
1076
+ img_placeholder += '<Img><ImageHere>'
1077
+ images = torch.stack(images)
1078
+ # Combine the images and the instruction
1079
+ instruction = img_placeholder + '\n' + instruction
1080
+ # Return the images, instruction, answer, video_id, and the length of the video
1081
+ return{
1082
+ "image": images,
1083
+ "answer": answer,
1084
+ "image_id": video_id,
1085
+ "instruction_input": instruction,
1086
+ "length": self.length,
1087
+ }
1088
+
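The loaders in this file all share the same frame-sampling and prompt-interleaving pattern: derive a sampling interval from the clip length, emit one `<Img><ImageHere>` placeholder per sampled frame, and append the accumulated subtitle text behind a `<Cap>` tag until the word budget (`max_sub_len`) is exhausted. The snippet below is a minimal, self-contained sketch of just that placeholder logic; the function name and the `subtitles` dict (keyed by frame index) are illustrative and not part of the repo.

```python
# Sketch of the frame-sampling / placeholder-interleaving pattern used by the loaders above.
def build_placeholder(total_num_frames, max_frames=45, subtitles=None, max_sub_words=400):
    """Build the interleaved <Img>/<Cap> placeholder string for one video."""
    sampling_interval = max(total_num_frames // max_frames, 1)
    placeholder, n_sub_words, n_frames = "", 0, 0
    for frame_idx in range(total_num_frames):
        if frame_idx % sampling_interval != 0:
            continue
        placeholder += '<Img><ImageHere>'              # one image block per sampled frame
        n_frames += 1
        if subtitles and n_sub_words < max_sub_words:  # respect the subtitle word budget
            caption = subtitles.get(frame_idx, "")
            if caption:
                placeholder += f'<Cap>{caption}'
                n_sub_words += len(caption.split())
        if n_frames >= max_frames:                     # never exceed the frame budget
            break
    return placeholder

# Example: a 900-frame clip without subtitles yields exactly 45 image placeholders.
print(build_placeholder(900).count('<ImageHere>'))    # -> 45
```

In the actual loaders the same loop also decodes the frame with OpenCV, converts BGR to RGB, and runs `self.vis_processor`; the budgets are doubled for the Mistral variant (90 frames / 800 subtitle words instead of 45 / 400).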
minigpt4/models/mini_gpt4_llama_v2.py CHANGED
@@ -87,11 +87,6 @@ class MiniGPT4_llama_v2(Blip2Base):
87
  self.max_context_len = max_context_len
88
  self.chat_template = chat_template
89
 
90
- # print('Loading VIT')
91
- # self.visual_encoder, self.ln_vision = self.init_vision_encoder(
92
- # vit_model, img_size, drop_path_rate, use_grad_checkpoint, vit_precision
93
- # )
94
-
95
  if freeze_vit:
96
  # vit_precision="fp32"
97
  print("vit precision", vit_precision)
@@ -147,18 +142,6 @@ class MiniGPT4_llama_v2(Blip2Base):
147
  # device_map={'':0}
148
 
149
  )
150
- # bnb_config = BitsAndBytesConfig(
151
- # load_in_4bit=True,
152
- # bnb_4bit_use_double_quant=True,
153
- # bnb_4bit_quant_type="nf4",
154
- # bnb_4bit_compute_dtype=torch.bfloat16,
155
- # )
156
- # self.llama_model = llm_model.from_pretrained(
157
- # llama_model,
158
- # torch_dtype=torch.bfloat16,
159
- # device_map={'':torch.cuda.current_device()},
160
- # quantization_config=bnb_config,
161
- # )
162
  else:
163
  self.llama_model = llm_model.from_pretrained(
164
  llama_model,
@@ -182,24 +165,10 @@ class MiniGPT4_llama_v2(Blip2Base):
182
  )
183
  self.llama_model = get_peft_model(self.llama_model, loraconfig)
184
 
185
- # if ckpt_path:
186
- # print('load the llm under lora')
187
- # ckpt = torch.load(ckpt_path)
188
- # set_peft_model_state_dict(self.llama_model,ckpt)
189
-
190
-
191
-
192
  self.llama_model.print_trainable_parameters()
193
 
194
  if self.use_grad_checkpoint_llm:
195
  self.llama_model.gradient_checkpointing_enable()
196
-
197
- # if not self.low_resource:
198
- # for name, param in self.llama_model.named_parameters():
199
- # if "embed_token" in name:
200
- # param.data = param.data.float()
201
- # param.requires_grad = True
202
-
203
 
204
  print('Loading LLAMA Done')
205
 
@@ -256,15 +225,6 @@ class MiniGPT4_llama_v2(Blip2Base):
256
  mixed_embs = [emb for pair in zip(seg_embs[:-1], img_list) for emb in pair] + [seg_embs[-1]]
257
 
258
  mixed_embs = torch.cat(mixed_embs, dim=1)
259
- # # truncate the length of tokens to the max context window
260
- # mixed_embs_without_instruction = [emb for pair in zip(seg_embs[:-1], img_list) for emb in pair]
261
- # mixed_embs_without_instruction=torch.cat(mixed_embs_without_instruction, dim=1)
262
- # # check if the number of token in the second dimention is more than the max context window then truncate it
263
- # context_window=self.max_context_len-seg_embs[-1].shape[1]
264
- # if mixed_embs_without_instruction.shape[1] > context_window :
265
- # mixed_embs_without_instruction = mixed_embs_without_instruction[:, 0:context_window]
266
- # mixed_embs=torch.cat([mixed_embs_without_instruction,seg_embs[-1]], dim=1)
267
- # print("mixed_embs",mixed_embs.shape)
268
 
269
  return mixed_embs
270
 
@@ -288,7 +248,8 @@ class MiniGPT4_llama_v2(Blip2Base):
288
  else:
289
  # return the multi-modal embedding in right padding
290
  emb_lists = []
291
-
 
292
  for idx, (each_img_embed, each_prompt) in enumerate(zip(img_embeds, prompts)):
293
  pn = each_img_embed.shape[-2]
294
  if lengths is not None:
@@ -299,12 +260,8 @@ class MiniGPT4_llama_v2(Blip2Base):
299
  interleave_emb = []
300
  for idx, seg in enumerate(p_segs[:-1]):
301
  p_tokens = self.llama_tokenizer(seg, return_tensors="pt", add_special_tokens=False).to(img_embeds.device)
302
- # print("p_embed device",p_tokens.input_ids.device)
303
- # print("p_tokens",img_embeds.device)
304
- # print("emb layer", list(self.llama_model.base_model.model.model.embed_tokens.parameters())[0].device)
305
  p_embed = self.embed_tokens(p_tokens.input_ids)
306
 
307
- # print("model device",self.llama_model.get_device())
308
  interleave_emb.append(torch.cat([p_embed, each_img_embed[None][:, idx*pn:(idx+1)*pn]], dim=1))
309
 
310
  wrapped_emb = torch.cat(interleave_emb, dim=1)
@@ -356,17 +313,6 @@ class MiniGPT4_llama_v2(Blip2Base):
356
  input_atts[i][input_len:]
357
  ])
358
  )
359
- # print('===================================')
360
- # print('check input emb: ', input_embs[i][this_input_ones-2:this_input_ones])
361
- # print('check pad emb: ', input_embs[i][this_input_ones:this_input_ones+2])
362
- # print('check out emb: ', output_embs[i][:2])
363
- # print('check out pad emb: ', output_embs[i][-2:])
364
- # print('+++++++++++++++++++++++++++++++++++')
365
- #
366
- # print('check attn before: ', input_atts[i][:this_input_ones])
367
- # print('check attn after: ', input_atts[i][this_input_ones:])
368
- # print('check attn gt before: ', output_atts[i][:3])
369
- # print('check attn gt after: ', output_atts[i][-3:])
370
 
371
  cat_embs = torch.stack(cat_embs)
372
  cat_atts = torch.stack(cat_atts)
@@ -433,7 +379,6 @@ class MiniGPT4_llama_v2(Blip2Base):
433
  ### prepare input tokens
434
  if 'image' in samples:
435
  img_embeds, img_atts = self.encode_img(samples["image"])
436
- # print("img_embeds shape",img_embeds.shape)
437
  else:
438
  img_embeds = img_atts = None
439
 
@@ -453,12 +398,15 @@ class MiniGPT4_llama_v2(Blip2Base):
453
  cond_embeds, cond_atts = regress_embeds[:, :0], regress_atts[:, :0]
454
 
455
  else:
456
- instruction = samples["instruction_input"] if "instruction_input" in samples else None
 
 
 
 
 
457
 
458
- # print("instruction before", instruction)
459
  if self.remove_template:
460
  instruction = remove_special_tokens(instruction)
461
- # print("instruction after", instruction)
462
 
463
  if self.chat_template:
464
  instruction = ["[INST] " + instruct + "[/INST]" for instruct in instruction]
@@ -502,9 +450,6 @@ class MiniGPT4_llama_v2(Blip2Base):
502
  # concat the embedding to condition and the embedding to regress
503
  inputs_embeds, attention_mask, input_lens = \
504
  self.concat_emb_input_output(cond_embeds, cond_atts, regress_embeds, regress_atts)
505
- print("inputs_embeds shape",inputs_embeds.shape)
506
- print("cond_embeds shape",cond_embeds.shape)
507
- print("regress_embeds shape",regress_embeds.shape)
508
  # get bos token embedding
509
  bos = torch.ones_like(part_targets[:, :1]) * self.llama_tokenizer.bos_token_id
510
  bos_embeds = self.embed_tokens(bos)
@@ -513,16 +458,12 @@ class MiniGPT4_llama_v2(Blip2Base):
513
  # add bos token at the begining
514
  inputs_embeds = torch.cat([bos_embeds, inputs_embeds], dim=1)
515
  attention_mask = torch.cat([bos_atts, attention_mask], dim=1)
516
- # print length of instruction_input and answer words
517
- # for i in range (len(samples["instruction_input"])):
518
- # print("instruction_input length",len(samples["instruction_input"][i].split(" ")))
519
- # print("answer length",len(samples["answer"][i].split(" ")))
520
- # ensemble the final targets
521
  targets = torch.ones([inputs_embeds.shape[0], inputs_embeds.shape[1]],
522
  dtype=torch.long).to(self.device).fill_(-100)
523
  for i, target in enumerate(part_targets):
524
  targets[i, input_lens[i]+1:input_lens[i]+len(target)+1] = target # plus 1 for bos
525
- print("targets shape",targets.shape)
526
  with self.maybe_autocast():
527
  outputs = self.llama_model(
528
  inputs_embeds=inputs_embeds,
@@ -569,7 +510,6 @@ class MiniGPT4_llama_v2(Blip2Base):
569
  img_embeds = self.llama_proj(img_embeds) # project to llama input size (200,64,5632) -> (200,64,4096)
570
  atts_img = torch.ones(img_embeds.size()[:-1], dtype=torch.long).to(self.device)
571
 
572
- print("img_embeds shape",img_embeds.shape)
573
  if lengths is not None:
574
  image_lists = []
575
  img_embeds = img_embeds.reshape(len(lengths), -1, img_embeds.shape[-2], img_embeds.shape[-1])
@@ -592,8 +532,6 @@ class MiniGPT4_llama_v2(Blip2Base):
592
  emb_len = emb.shape[1]
593
  embs[i, -emb_len:] = emb[0]
594
  attn_mask[i, -emb_len:] = 1
595
- # print("inputs_embeds shape",embs.shape)
596
- # print("attention_mask shape",attn_mask.shape)
597
  # check whether the input embedding tokens are within the model context window (4096); if not, truncate to the max context window
598
  if self.model_type == "Llama":
599
  context_window = 3700
@@ -602,8 +540,6 @@ class MiniGPT4_llama_v2(Blip2Base):
602
  if embs.shape[1] > context_window:
603
  embs = embs[:, -context_window:]
604
  attn_mask = attn_mask[:, -context_window:]
605
- print("inputs_embeds shape",embs.shape)
606
- print("attention_mask shape",attn_mask.shape)
607
  with self.maybe_autocast():
608
  if return_video_temporal_features:
609
  last_hidden_state = self.llama_model(
@@ -665,15 +601,8 @@ class MiniGPT4_llama_v2(Blip2Base):
665
  stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(
666
  stops=[torch.tensor([i]).to(self.device) for i in stop_words_ids])])
667
 
668
- # seg_tokens=[]
669
- # for i, text in enumerate(texts):
670
- # seg_tokens.append(self.llama_tokenizer(text, return_tensors="pt", add_special_tokens=True).to(self.device).input_ids)
671
-
672
  batch_embs = [torch.cat([self.embed_tokens(seg_t)]) for seg_t in seg_tokens]
673
 
674
- # seg_embs = torch.cat(seg_embs, dim=1)
675
- # print("seg_embs shape",seg_embs.shape)
676
- # batch_embs=[seg_embs]
677
  batch_size = len(batch_embs)
678
  max_len = max([emb.shape[1] for emb in batch_embs])
679
  emb_dim = batch_embs[0].shape[2]
@@ -687,9 +616,6 @@ class MiniGPT4_llama_v2(Blip2Base):
687
  embs[i, -emb_len:] = emb[0]
688
  attn_mask[i, -emb_len:] = 1
689
 
690
-
691
- print("inputs_embeds shape",embs.shape)
692
- print("attention_mask shape",attn_mask.shape)
693
  with self.maybe_autocast():
694
  outputs = self.llama_model.generate(
695
  inputs_embeds=embs,
@@ -892,4 +818,4 @@ def assign_imgs(batched_instruct_list, batched_img_embeds):
892
  n_assigned.append(None)
893
  batched_assigned.append(assigned_img)
894
 
895
- return batched_assigned
 
87
  self.max_context_len = max_context_len
88
  self.chat_template = chat_template
89
 
 
 
 
 
 
90
  if freeze_vit:
91
  # vit_precision="fp32"
92
  print("vit precision", vit_precision)
 
142
  # device_map={'':0}
143
 
144
  )
 
 
 
 
 
 
 
 
 
 
 
 
145
  else:
146
  self.llama_model = llm_model.from_pretrained(
147
  llama_model,
 
165
  )
166
  self.llama_model = get_peft_model(self.llama_model, loraconfig)
167
 
 
 
 
 
 
 
 
168
  self.llama_model.print_trainable_parameters()
169
 
170
  if self.use_grad_checkpoint_llm:
171
  self.llama_model.gradient_checkpointing_enable()
 
 
 
 
 
 
 
172
 
173
  print('Loading LLAMA Done')
174
 
 
225
  mixed_embs = [emb for pair in zip(seg_embs[:-1], img_list) for emb in pair] + [seg_embs[-1]]
226
 
227
  mixed_embs = torch.cat(mixed_embs, dim=1)
 
 
 
 
 
 
 
 
 
228
 
229
  return mixed_embs
230
 
 
248
  else:
249
  # return the multi-modal embedding in right padding
250
  emb_lists = []
251
+ if type(prompts) == str:
252
+ prompts = [prompts] * len(img_embeds)
253
  for idx, (each_img_embed, each_prompt) in enumerate(zip(img_embeds, prompts)):
254
  pn = each_img_embed.shape[-2]
255
  if lengths is not None:
 
260
  interleave_emb = []
261
  for idx, seg in enumerate(p_segs[:-1]):
262
  p_tokens = self.llama_tokenizer(seg, return_tensors="pt", add_special_tokens=False).to(img_embeds.device)
 
 
 
263
  p_embed = self.embed_tokens(p_tokens.input_ids)
264
 
 
265
  interleave_emb.append(torch.cat([p_embed, each_img_embed[None][:, idx*pn:(idx+1)*pn]], dim=1))
266
 
267
  wrapped_emb = torch.cat(interleave_emb, dim=1)
 
313
  input_atts[i][input_len:]
314
  ])
315
  )
 
 
 
 
 
 
 
 
 
 
 
316
 
317
  cat_embs = torch.stack(cat_embs)
318
  cat_atts = torch.stack(cat_atts)
 
379
  ### prepare input tokens
380
  if 'image' in samples:
381
  img_embeds, img_atts = self.encode_img(samples["image"])
 
382
  else:
383
  img_embeds = img_atts = None
384
 
 
398
  cond_embeds, cond_atts = regress_embeds[:, :0], regress_atts[:, :0]
399
 
400
  else:
401
+ if "instruction_input" in samples:
402
+ instruction = samples["instruction_input"]
403
+ elif len(self.prompt_list) > 1:
404
+ instruction = random.choice(self.prompt_list)
405
+ else:
406
+ instruction = None
407
 
 
408
  if self.remove_template:
409
  instruction = remove_special_tokens(instruction)
 
410
 
411
  if self.chat_template:
412
  instruction = ["[INST] " + instruct + "[/INST]" for instruct in instruction]
 
450
  # concat the embedding to condition and the embedding to regress
451
  inputs_embeds, attention_mask, input_lens = \
452
  self.concat_emb_input_output(cond_embeds, cond_atts, regress_embeds, regress_atts)
 
 
 
453
  # get bos token embedding
454
  bos = torch.ones_like(part_targets[:, :1]) * self.llama_tokenizer.bos_token_id
455
  bos_embeds = self.embed_tokens(bos)
 
458
  # add bos token at the begining
459
  inputs_embeds = torch.cat([bos_embeds, inputs_embeds], dim=1)
460
  attention_mask = torch.cat([bos_atts, attention_mask], dim=1)
461
+
 
 
 
 
462
  targets = torch.ones([inputs_embeds.shape[0], inputs_embeds.shape[1]],
463
  dtype=torch.long).to(self.device).fill_(-100)
464
  for i, target in enumerate(part_targets):
465
  targets[i, input_lens[i]+1:input_lens[i]+len(target)+1] = target # plus 1 for bos
466
+
467
  with self.maybe_autocast():
468
  outputs = self.llama_model(
469
  inputs_embeds=inputs_embeds,
 
510
  img_embeds = self.llama_proj(img_embeds) # project to llama input size (200,64,5632) -> (200,64,4096)
511
  atts_img = torch.ones(img_embeds.size()[:-1], dtype=torch.long).to(self.device)
512
 
 
513
  if lengths is not None:
514
  image_lists = []
515
  img_embeds = img_embeds.reshape(len(lengths), -1, img_embeds.shape[-2], img_embeds.shape[-1])
 
532
  emb_len = emb.shape[1]
533
  embs[i, -emb_len:] = emb[0]
534
  attn_mask[i, -emb_len:] = 1
 
 
535
  # check whether the input embedding tokens are within the model context window (4096); if not, truncate to the max context window
536
  if self.model_type == "Llama":
537
  context_window = 3700
 
540
  if embs.shape[1] > context_window:
541
  embs = embs[:, -context_window:]
542
  attn_mask = attn_mask[:, -context_window:]
 
 
543
  with self.maybe_autocast():
544
  if return_video_temporal_features:
545
  last_hidden_state = self.llama_model(
 
601
  stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(
602
  stops=[torch.tensor([i]).to(self.device) for i in stop_words_ids])])
603
 
 
 
 
 
604
  batch_embs = [torch.cat([self.embed_tokens(seg_t)]) for seg_t in seg_tokens]
605
 
 
 
 
606
  batch_size = len(batch_embs)
607
  max_len = max([emb.shape[1] for emb in batch_embs])
608
  emb_dim = batch_embs[0].shape[2]
 
616
  embs[i, -emb_len:] = emb[0]
617
  attn_mask[i, -emb_len:] = 1
618
 
 
 
 
619
  with self.maybe_autocast():
620
  outputs = self.llama_model.generate(
621
  inputs_embeds=embs,
 
818
  n_assigned.append(None)
819
  batched_assigned.append(assigned_img)
820
 
821
+ return batched_assigned
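The model code above repeatedly uses the same batching idiom for variable-length embedding sequences: each sequence is written into the right edge of a zero tensor, the attention mask marks only the real tokens, and if the batch exceeds the context window (3700 for Llama in this file) only the most recent tokens are kept. The helper below is a self-contained sketch of that pattern; the function name and shapes are illustrative, not the model's API.

```python
# Sketch of the right-aligned (left-padded) batching + context-window truncation pattern.
import torch

def pad_and_truncate(batch_embs, context_window=3700):
    batch_size = len(batch_embs)
    max_len = max(e.shape[1] for e in batch_embs)
    emb_dim = batch_embs[0].shape[2]
    embs = torch.zeros(batch_size, max_len, emb_dim)
    attn_mask = torch.zeros(batch_size, max_len, dtype=torch.long)
    for i, emb in enumerate(batch_embs):
        emb_len = emb.shape[1]
        embs[i, -emb_len:] = emb[0]        # place real tokens at the right edge
        attn_mask[i, -emb_len:] = 1        # left padding stays masked out
    if embs.shape[1] > context_window:     # keep only the most recent tokens
        embs = embs[:, -context_window:]
        attn_mask = attn_mask[:, -context_window:]
    return embs, attn_mask

# Example: two sequences of 5 and 8 embedding tokens with dimension 4.
e1, e2 = torch.randn(1, 5, 4), torch.randn(1, 8, 4)
embs, mask = pad_and_truncate([e1, e2])
print(embs.shape, mask.sum(dim=1))         # torch.Size([2, 8, 4]) tensor([5, 8])
```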
minigpt4/runners/runner_base.py CHANGED
@@ -428,10 +428,10 @@ class RunnerBase:
428
  # wandb.log({"epoch": cur_epoch, "GPT4_Accuracy": val_log['agg_metrics']})
429
  # print("Validation finished")
430
 
431
- else:
432
- # if no validation split is provided, we just save the checkpoint at the end of each epoch.
433
- if not self.evaluate_only:
434
- self._save_checkpoint(cur_epoch, is_best=False)
435
 
436
  if self.evaluate_only:
437
  break
 
428
  # wandb.log({"epoch": cur_epoch, "GPT4_Accuracy": val_log['agg_metrics']})
429
  # print("Validation finished")
430
 
431
+ # else:
432
+ # if no validation split is provided, we just save the checkpoint at the end of each epoch.
433
+ if not self.evaluate_only:
434
+ self._save_checkpoint(cur_epoch, is_best=False)
435
 
436
  if self.evaluate_only:
437
  break
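With this change, checkpointing no longer depends on a validation split being configured: unless the runner is in evaluate-only mode, a checkpoint is written at the end of every epoch. The toy class below only illustrates the resulting control flow; it is not the repo's `RunnerBase`.

```python
# Illustrative stand-in for the epoch loop after this change (not the repo's RunnerBase).
class ToyRunner:
    def __init__(self, evaluate_only=False):
        self.evaluate_only = evaluate_only

    def _save_checkpoint(self, cur_epoch, is_best=False):
        print(f"saved checkpoint for epoch {cur_epoch} (is_best={is_best})")

    def train(self, max_epoch=3):
        for cur_epoch in range(max_epoch):
            # ... train_epoch() and optional validation would run here ...
            if not self.evaluate_only:
                self._save_checkpoint(cur_epoch, is_best=False)  # now runs every epoch
            if self.evaluate_only:
                break

ToyRunner().train()  # prints one "saved checkpoint" line per epoch
```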
minigpt4_video_demo.py CHANGED
@@ -155,7 +155,7 @@ def run (video_path,instruction,model,vis_processor,gen_subtitles=True):
155
  subtitle_path=None
156
  prepared_images,prepared_instruction=prepare_input(vis_processor,video_path,subtitle_path,instruction)
157
  if prepared_images is None:
158
- return "Video cann't be open ,check the video path again"
159
  length=len(prepared_images)
160
  prepared_images=prepared_images.unsqueeze(0)
161
  conv = CONV_VISION.copy()
@@ -166,10 +166,10 @@ def run (video_path,instruction,model,vis_processor,gen_subtitles=True):
166
  prompt = [conv.get_prompt()]
167
  answers = model.generate(prepared_images, prompt, max_new_tokens=args.max_new_tokens, do_sample=True, lengths=[length],num_beams=2)
168
  # remove the subtitle file and the video file
169
- if subtitle_path:
170
- os.system(f"rm {subtitle_path}")
171
- #if video_path.split('.')[-1] == 'mp4' or video_path.split('.')[-1] == 'mkv' or video_path.split('.')[-1] == 'avi':
172
- # os.system(f"rm {video_path}")
173
  return answers[0]
174
 
175
  def run_single_image (image_path,instruction,model,vis_processor):
@@ -268,7 +268,7 @@ description = """<h5>This is the demo of MiniGPT4-video Model.</h5>"""
268
  project_page = """<p><a href='https://vision-cair.github.io/MiniGPT4-video/'><img src='https://img.shields.io/badge/Project-Page-Green'></a></p>"""
269
  code_link="""<p><a href='https://github.com/Vision-CAIR/MiniGPT4-video'><img src='https://img.shields.io/badge/Github-Code-blue'></a></p>"""
270
  paper_link="""<p><a href=''><img src='https://img.shields.io/badge/Paper-PDF-red'></a></p>"""
271
- #video_path=""
272
  with gr.Blocks(title="MiniGPT4-video 🎞️🍿",css=text_css ) as demo :
273
  # with gr.Row():
274
  # with gr.Column(scale=2):
@@ -330,7 +330,7 @@ with gr.Blocks(title="MiniGPT4-video 🎞️🍿",css=text_css ) as demo :
330
  # )
331
  with gr.Row():
332
  with gr.Column():
333
- youtube_link = gr.Textbox(label="Enter the youtube link", placeholder="Paste YouTube URL here")
334
  video_player = gr.Video(autoplay=False)
335
  download_finish = gr.State(value=False)
336
  youtube_link.change(
 
155
  subtitle_path=None
156
  prepared_images,prepared_instruction=prepare_input(vis_processor,video_path,subtitle_path,instruction)
157
  if prepared_images is None:
158
+ return "Please re-upload the video while changing the instructions."
159
  length=len(prepared_images)
160
  prepared_images=prepared_images.unsqueeze(0)
161
  conv = CONV_VISION.copy()
 
166
  prompt = [conv.get_prompt()]
167
  answers = model.generate(prepared_images, prompt, max_new_tokens=args.max_new_tokens, do_sample=True, lengths=[length],num_beams=2)
168
  # remove the subtitle file and the video file
169
+ # if subtitle_path:
170
+ # os.system(f"rm {subtitle_path}")
171
+ # if video_path.split('.')[-1] == 'mp4' or video_path.split('.')[-1] == 'mkv' or video_path.split('.')[-1] == 'avi':
172
+ # os.system(f"rm {video_path}")
173
  return answers[0]
174
 
175
  def run_single_image (image_path,instruction,model,vis_processor):
 
268
  project_page = """<p><a href='https://vision-cair.github.io/MiniGPT4-video/'><img src='https://img.shields.io/badge/Project-Page-Green'></a></p>"""
269
  code_link="""<p><a href='https://github.com/Vision-CAIR/MiniGPT4-video'><img src='https://img.shields.io/badge/Github-Code-blue'></a></p>"""
270
  paper_link="""<p><a href=''><img src='https://img.shields.io/badge/Paper-PDF-red'></a></p>"""
271
+ video_path=""
272
  with gr.Blocks(title="MiniGPT4-video 🎞️🍿",css=text_css ) as demo :
273
  # with gr.Row():
274
  # with gr.Column(scale=2):
 
330
  # )
331
  with gr.Row():
332
  with gr.Column():
333
+ youtube_link = gr.Textbox(label="Enter the youtube link", placeholder="Paste YouTube URL with this format 'https://www.youtube.com/watch?v=video_id'")
334
  video_player = gr.Video(autoplay=False)
335
  download_finish = gr.State(value=False)
336
  youtube_link.change(
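Both the demo above and the inference script below wrap the prepared `<Img><ImageHere>` instruction in a conversation template before calling `model.generate`: the instruction becomes the user turn and the assistant turn is left empty. The sketch below mimics that flow with a toy class; `ToyConversation` is a stand-in, not the repo's `CONV_VISION` object, and the exact role markers it prints are illustrative.

```python
# Toy stand-in for the conversation wrapper used by the demo (not the repo's CONV_VISION).
from dataclasses import dataclass, field

@dataclass
class ToyConversation:
    system: str = ""
    roles: tuple = ("USER", "ASSISTANT")
    messages: list = field(default_factory=list)

    def append_message(self, role, message):
        self.messages.append((role, message))

    def get_prompt(self):
        turns = [f"{role}: {msg}" if msg else f"{role}:" for role, msg in self.messages]
        return (self.system + " " if self.system else "") + " ".join(turns)

conv = ToyConversation()
prepared_instruction = "<Img><ImageHere><Cap>hello world\nWhat is said in the video?"
conv.append_message(conv.roles[0], prepared_instruction)  # user turn: placeholders + question
conv.append_message(conv.roles[1], None)                  # empty assistant turn for generation
print([conv.get_prompt()])                                # the list passed to model.generate
```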
minigpt4_video_inference.py CHANGED
@@ -1,94 +1,180 @@
1
- import json
 
 
 
 
 
 
 
2
  from tqdm import tqdm
 
 
 
 
3
  from pytubefix import YouTube
 
 
 
 
 
 
 
4
 
5
- import xml.etree.ElementTree as ET
6
- import os
7
-
8
- with open ('VideoInstruct100K.json','r') as f :
9
- data=json.load(f)
10
-
11
- # Usage
12
- existed_video_id={}
13
- for video_name in os.listdir('videos'):
14
- video_id = video_name.split('.')[0]
15
- existed_video_id[video_id]=True
16
-
17
-
18
-
19
- def download_video_with_subtitles(video_id):
20
- # Create a YouTube object.
21
- yt = YouTube(f'https://www.youtube.com/watch?v={video_id}')
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
22
 
23
- video_filename = f"{video_id}.mp4"
24
- video_downloaded=False
25
- try :
26
- # Get the video stream with the highest resolution and download the video.
27
- stream = yt.streams.get_highest_resolution()
28
- stream.download(output_path='videos', filename=video_filename)
29
- video_downloaded=True
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
30
  except Exception as e:
31
- print(f"Error downloading video {video_id}: {str(e)}")
32
- video_downloaded=False
33
- if not video_downloaded:
34
- return False,False
35
-
36
- # Get the video's available captions (subtitles).
37
- captions = yt.captions.all()
38
 
39
- # Download the captions if available in xml format.
40
- caption_downloaded = False
41
- for caption in captions:
42
- caption_code = caption.code
43
- # select only english captions
44
- if 'en' in caption_code:
45
- caption.download(title=f"{video_id}", output_path='subtitles_xml',srt=False)
46
- caption_downloaded = True
47
- return video_downloaded,caption_downloaded
48
- def convert_xml_vtt(xml_path, vtt_path):
49
- # Parse the XML subtitle file
50
- tree = ET.parse(xml_path)
51
- root = tree.getroot()
 
 
 
 
 
52
 
53
- # Initialize a list to store VTT subtitle entries
54
- vtt_subtitle = []
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
55
 
56
- # Function to convert time in milliseconds to WebVTT format
57
- def ms_to_vtt_time(milliseconds):
58
- seconds, milliseconds = divmod(milliseconds, 1000)
59
- minutes, seconds = divmod(seconds, 60)
60
- return f"{minutes:02d}:{seconds:02d}.{milliseconds:03d}"
61
 
62
- # Iterate through subtitle elements
63
- toggle = True
64
- for p in root.findall(".//p"):
65
- if toggle:
66
- start_time = int(p.get("t"))
67
- subtitle_text = " ".join(s.text.strip() for s in p.findall(".//s"))
68
- # duration = int(p.get("d")) if p.get("d") is not None else 0
69
- if not toggle:
70
- end_time = int(p.get("t"))
71
- # Format and append the VTT entry to the list
72
- vtt_subtitle.append(f"{ms_to_vtt_time(start_time)} --> {ms_to_vtt_time(end_time)}\n{subtitle_text}\n")
73
- toggle = not toggle
74
- # Join the VTT entries into a single string
75
- vtt_content = "WEBVTT\n\n" + "\n".join(vtt_subtitle)
76
 
77
- # Save the VTT content to a file
78
- with open(vtt_path, "w", encoding="utf-8") as vtt_file:
79
- vtt_file.write(vtt_content)
80
- import os
81
- os.makedirs('videos', exist_ok=True)
82
- os.makedirs('subtitles_vtt', exist_ok=True)
83
- os.makedirs('subtitles_xml', exist_ok=True)
84
- for video_path in tqdm(data,desc='Downloading videos') :
85
- video_id=video_path.split('/')[-1].split('.')[0]
86
- if existed_video_id.get(video_id,False):
87
- continue
88
- video_downloaded,caption_downloaded=download_video_with_subtitles(video_id)
89
- if caption_downloaded:
90
- # convert xml to vtt
91
- xml_file_path=f'subtitles_xml/{video_id} (a.en).xml'
92
- convert_xml_vtt(xml_file_path,f'subtitles_vtt/{video_id}.vtt')
93
-
94
-
 
1
+ import torch
2
+ import webvtt
3
+ import os
4
+ import cv2
5
+ from minigpt4.common.eval_utils import prepare_texts, init_model
6
+ from minigpt4.conversation.conversation import CONV_VISION
7
+ from torchvision import transforms
8
+ import json
9
  from tqdm import tqdm
10
+ import soundfile as sf
11
+ import argparse
12
+ import moviepy.editor as mp
13
+ import gradio as gr
14
  from pytubefix import YouTube
15
+ import shutil
16
+ from PIL import Image
17
+ from moviepy.editor import VideoFileClip
18
+ import torch
19
+ import random
20
+ import numpy as np
21
+ import torch.backends.cudnn as cudnn
22
 
23
+ def prepare_input(vis_processor,video_path,subtitle_path,instruction):
24
+ cap = cv2.VideoCapture(video_path)
25
+ if subtitle_path is not None:
26
+ # Load the VTT subtitle file
27
+ vtt_file = webvtt.read(subtitle_path)
28
+ print("subtitle loaded successfully")
29
+ clip = VideoFileClip(video_path)
30
+ total_num_frames = int(clip.duration * clip.fps)
31
+ # print("Video duration = ",clip.duration)
32
+ clip.close()
33
+ else :
34
+ # calculate the total number of frames in the video using opencv
35
+ total_num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
36
+ max_images_length = 45
37
+ max_sub_len = 400
38
+ images = []
39
+ frame_count = 0
40
+ sampling_interval = int(total_num_frames / max_images_length)
41
+ if sampling_interval == 0:
42
+ sampling_interval = 1
43
+ img_placeholder = ""
44
+ subtitle_text_in_interval = ""
45
+ history_subtitles = {}
46
+ raw_frames=[]
47
+ number_of_words=0
48
+ transform=transforms.Compose([
49
+ transforms.ToPILImage(),
50
+ ])
51
+ while cap.isOpened():
52
+ ret, frame = cap.read()
53
+ if not ret:
54
+ break
55
+ # Find the corresponding subtitle for the frame and combine the interval subtitles into one subtitle
56
+ # we choose 1 frame every 2 seconds, so we combine the subtitles within each 2-second interval
57
+ if subtitle_path is not None:
58
+ for subtitle in vtt_file:
59
+ sub=subtitle.text.replace('\n',' ')
60
+ if (subtitle.start_in_seconds <= (frame_count / int(clip.fps)) <= subtitle.end_in_seconds) and sub not in subtitle_text_in_interval:
61
+ if not history_subtitles.get(sub,False):
62
+ subtitle_text_in_interval+=sub+" "
63
+ history_subtitles[sub]=True
64
+ break
65
+ if frame_count % sampling_interval == 0:
66
+ raw_frames.append(Image.fromarray(cv2.cvtColor(frame.copy(), cv2.COLOR_BGR2RGB)))
67
+ frame = transform(frame[:,:,::-1]) # convert to RGB
68
+ frame = vis_processor(frame)
69
+ images.append(frame)
70
+ img_placeholder += '<Img><ImageHere>'
71
+ if subtitle_path is not None and subtitle_text_in_interval != "" and number_of_words< max_sub_len:
72
+ img_placeholder+=f'<Cap>{subtitle_text_in_interval}'
73
+ number_of_words+=len(subtitle_text_in_interval.split(' '))
74
+ subtitle_text_in_interval = ""
75
+ frame_count += 1
76
 
77
+ if len(images) >= max_images_length:
78
+ break
79
+ cap.release()
80
+ cv2.destroyAllWindows()
81
+ if len(images) == 0:
82
+ # skip the video if no frames were extracted
83
+ return None,None
84
+ images = torch.stack(images)
85
+ instruction = img_placeholder + '\n' + instruction
86
+ return images,instruction
87
+ def extract_audio(video_path, audio_path):
88
+ video_clip = mp.VideoFileClip(video_path)
89
+ audio_clip = video_clip.audio
90
+ audio_clip.write_audiofile(audio_path, codec="libmp3lame", bitrate="320k")
91
+
92
+ def generate_subtitles(video_path):
93
+ video_id=video_path.split('/')[-1].split('.')[0]
94
+ audio_path = f"workspace/inference_subtitles/mp3/{video_id}"+'.mp3'
95
+ os.makedirs("workspace/inference_subtitles/mp3",exist_ok=True)
96
+ if existed_subtitles.get(video_id,False):
97
+ return f"workspace/inference_subtitles/{video_id}"+'.vtt'
98
+ try:
99
+ extract_audio(video_path,audio_path)
100
+ print("successfully extracted")
101
+ os.system(f"whisper {audio_path} --language English --model large --output_format vtt --output_dir workspace/inference_subtitles")
102
+ # remove the audio file
103
+ os.system(f"rm {audio_path}")
104
+ print("subtitle successfully generated")
105
+ return f"workspace/inference_subtitles/{video_id}"+'.vtt'
106
  except Exception as e:
107
+ print("error",e)
108
+ print("error",video_path)
109
+ return None
110
+
 
 
 
111
 
112
+ def run (video_path,instruction,model,vis_processor,gen_subtitles=True):
113
+ if gen_subtitles:
114
+ subtitle_path=generate_subtitles(video_path)
115
+ else :
116
+ subtitle_path=None
117
+ prepared_images,prepared_instruction=prepare_input(vis_processor,video_path,subtitle_path,instruction)
118
+ if prepared_images is None:
119
+ return "Video cann't be open ,check the video path again"
120
+ length=len(prepared_images)
121
+ prepared_images=prepared_images.unsqueeze(0)
122
+ conv = CONV_VISION.copy()
123
+ conv.system = ""
124
+ # for multi-turn conversation, comment out the 2 lines above and make conv a global variable
125
+ conv.append_message(conv.roles[0], prepared_instruction)
126
+ conv.append_message(conv.roles[1], None)
127
+ prompt = [conv.get_prompt()]
128
+ answers = model.generate(prepared_images, prompt, max_new_tokens=args.max_new_tokens, do_sample=True, lengths=[length],num_beams=1)
129
+ return answers[0]
130
 
131
+
132
+ def get_arguments():
133
+ parser = argparse.ArgumentParser(description="Inference parameters")
134
+ parser.add_argument("--cfg-path", help="path to configuration file.",default="test_configs/llama2_test_config.yaml")
135
+ parser.add_argument("--ckpt", type=str,default='checkpoints/video_llama_checkpoint_last.pth', help="path to checkpoint")
136
+ parser.add_argument("--add_subtitles",action= 'store_true',help="whether to add subtitles")
137
+ parser.add_argument("--question", type=str, help="question to ask")
138
+ parser.add_argument("--video_path", type=str, help="Path to the video file")
139
+ parser.add_argument("--max_new_tokens", type=int, default=512, help="max number of generated tokens")
140
+ parser.add_argument("--lora_r", type=int, default=64, help="lora rank of the model")
141
+ parser.add_argument("--lora_alpha", type=int, default=16, help="lora alpha")
142
+ parser.add_argument(
143
+ "--options",
144
+ nargs="+",
145
+ help="override some settings in the used config, the key-value pair "
146
+ "in xxx=yyy format will be merged into config file (deprecate), "
147
+ "change to --cfg-options instead.",
148
+ )
149
+ return parser.parse_args()
150
+ args=get_arguments()
151
+ def setup_seeds(seed):
152
+ random.seed(seed)
153
+ np.random.seed(seed)
154
+ torch.manual_seed(seed)
155
+ torch.cuda.manual_seed(seed)
156
+ cudnn.benchmark = False
157
+ cudnn.deterministic = True
158
 
159
+ import yaml
160
+ with open('test_configs/llama2_test_config.yaml') as file:
161
+ config = yaml.load(file, Loader=yaml.FullLoader)
162
+ seed=config['run']['seed']
163
+ print("seed",seed)
164
 
165
+ model, vis_processor = init_model(args)
166
+ conv = CONV_VISION.copy()
167
+ conv.system = ""
168
+ inference_subtitles_folder="inference_subtitles"
169
+ os.makedirs(inference_subtitles_folder,exist_ok=True)
170
+ existed_subtitles={}
171
+ for sub in os.listdir(inference_subtitles_folder):
172
+ existed_subtitles[sub.split('.')[0]]=True
 
 
 
 
 
 
173
 
174
+ if __name__ == "__main__":
175
+ video_path=args.video_path
176
+ instruction=args.question
177
+ add_subtitles=args.add_subtitles
178
+ # setup_seeds(seed)
179
+ pred=run(video_path,instruction,model,vis_processor,gen_subtitles=add_subtitles)
180
+ print(pred)
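Given the argument parser above, a typical invocation of the inference script looks like the sketch below; the video path, question, and checkpoint are placeholders rather than files shipped with the repo, and `--add_subtitles` is only useful if Whisper is installed for subtitle generation.

```python
# Usage sketch: launching minigpt4_video_inference.py with the flags defined above.
import subprocess

cmd = [
    "python", "minigpt4_video_inference.py",
    "--cfg-path", "test_configs/llama2_test_config.yaml",
    "--ckpt", "checkpoints/video_llama_checkpoint_last.pth",   # placeholder checkpoint path
    "--video_path", "path/to/your_video.mp4",                  # placeholder video
    "--question", "What is happening in this video?",
    "--add_subtitles",                                         # optional Whisper subtitles
]
subprocess.run(cmd, check=True)
```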
 
 
 
 
 
 
 
 
 
 
 
train_configs/224_minigpt4_llama2_image.yaml CHANGED
@@ -1,7 +1,9 @@
1
  model:
2
- arch: minigpt4
3
- model_type: mini_gpt4_llama_v2
4
  llama_model: "meta-llama/Llama-2-7b-chat-hf"
 
 
5
 
6
 
7
  datasets:
@@ -42,7 +44,7 @@ run:
42
  iters_per_epoch: 5000
43
 
44
  seed: 42
45
- output_dir: "output/minigpt4_stage1_pretrain"
46
 
47
  amp: True
48
  resume_ckpt_path: null
 
1
  model:
2
+ arch: mini_gpt4_llama_v2
3
+ model_type: pretrain_vicuna
4
  llama_model: "meta-llama/Llama-2-7b-chat-hf"
5
+ max_txt_len: 160
6
+ max_context_len: 512
7
 
8
 
9
  datasets:
 
44
  iters_per_epoch: 5000
45
 
46
  seed: 42
47
+ output_dir: "output/minigpt4_stage1_pretrain_llama2"
48
 
49
  amp: True
50
  resume_ckpt_path: null
train_configs/224_minigpt4_llama2_image_align.yaml ADDED
@@ -0,0 +1,53 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ model:
2
+ arch: mini_gpt4_llama_v2
3
+ model_type: pretrain_vicuna
4
+ llama_model: "meta-llama/Llama-2-7b-chat-hf"
5
+
6
+ max_txt_len: 160
7
+ max_context_len: 512
8
+ end_sym: "</s>"
9
+ prompt_path: "train_configs/alignment.txt"
10
+ prompt_template: '[INST] {} [/INST] '
11
+ ckpt: put your pretrained ckpt here
12
+
13
+ datasets:
14
+ cc_sbu_align:
15
+ batch_size: 12
16
+ vis_processor:
17
+ train:
18
+ name: "blip2_image_train"
19
+ image_size: 224
20
+ text_processor:
21
+ train:
22
+ name: "blip_caption"
23
+
24
+ run:
25
+ task: image_text_pretrain
26
+ # optimizer
27
+ lr_sched: "linear_warmup_cosine_lr"
28
+ init_lr: 3e-5
29
+ min_lr: 1e-5
30
+ warmup_lr: 1e-6
31
+
32
+ weight_decay: 0.05
33
+ max_epoch: 5
34
+ iters_per_epoch: 200
35
+ num_workers: 4
36
+ warmup_steps: 200
37
+
38
+ seed: 42
39
+ output_dir: "output/minigpt4_stage2_finetune"
40
+
41
+ amp: True
42
+ resume_ckpt_path: null
43
+
44
+ evaluate: False
45
+ train_splits: ["train"]
46
+
47
+ device: "cuda"
48
+ world_size: 1
49
+ dist_url: "env://"
50
+ distributed: True
51
+
52
+ wandb_log: True
53
+ job_name: minigpt4_finetune
train_configs/224_minigpt4_mistral_image.yaml CHANGED
@@ -1,8 +1,9 @@
1
  model:
2
- arch: minigpt4
3
- model_type: mini_gpt4_llama_v2
4
  llama_model: "mistralai/Mistral-7B-Instruct-v0.2"
5
-
 
6
 
7
  datasets:
8
  laion:
@@ -42,7 +43,7 @@ run:
42
  iters_per_epoch: 5000
43
 
44
  seed: 42
45
- output_dir: "output/minigpt4_stage1_pretrain"
46
 
47
  amp: True
48
  resume_ckpt_path: null
@@ -56,4 +57,4 @@ run:
56
  distributed: True
57
 
58
  wandb_log: True
59
- job_name: minigpt4_llama2_pretrain
 
1
  model:
2
+ arch: mini_gpt4_llama_v2
3
+ model_type: pretrain_vicuna
4
  llama_model: "mistralai/Mistral-7B-Instruct-v0.2"
5
+ max_txt_len: 160
6
+ max_context_len: 512
7
 
8
  datasets:
9
  laion:
 
43
  iters_per_epoch: 5000
44
 
45
  seed: 42
46
+ output_dir: "output/minigpt4_stage1_pretrain_mistral"
47
 
48
  amp: True
49
  resume_ckpt_path: null
 
57
  distributed: True
58
 
59
  wandb_log: True
60
+ job_name: minigpt4_mistral_pretrain
train_configs/224_minigpt4_mistral_image_align.yaml ADDED
@@ -0,0 +1,53 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ model:
2
+ arch: mini_gpt4_llama_v2
3
+ model_type: pretrain_vicuna
4
+ llama_model: "mistralai/Mistral-7B-Instruct-v0.2"
5
+
6
+ max_txt_len: 160
7
+ max_context_len: 512
8
+ end_sym: "</s>"
9
+ prompt_path: "train_configs/alignment.txt"
10
+ prompt_template: '[INST] {} [/INST] '
11
+ ckpt: put your pretrained ckpt here
12
+
13
+ datasets:
14
+ cc_sbu_align:
15
+ batch_size: 12
16
+ vis_processor:
17
+ train:
18
+ name: "blip2_image_train"
19
+ image_size: 224
20
+ text_processor:
21
+ train:
22
+ name: "blip_caption"
23
+
24
+ run:
25
+ task: image_text_pretrain
26
+ # optimizer
27
+ lr_sched: "linear_warmup_cosine_lr"
28
+ init_lr: 3e-5
29
+ min_lr: 1e-5
30
+ warmup_lr: 1e-6
31
+
32
+ weight_decay: 0.05
33
+ max_epoch: 5
34
+ iters_per_epoch: 200
35
+ num_workers: 4
36
+ warmup_steps: 200
37
+
38
+ seed: 42
39
+ output_dir: "output/minigpt4_stage2_finetune"
40
+
41
+ amp: True
42
+ resume_ckpt_path: null
43
+
44
+ evaluate: False
45
+ train_splits: ["train"]
46
+
47
+ device: "cuda"
48
+ world_size: 1
49
+ dist_url: "env://"
50
+ distributed: True
51
+
52
+ wandb_log: True
53
+ job_name: minigpt4_finetune
train_configs/224_v2_llama2_video_stage_2.yaml CHANGED
@@ -8,7 +8,7 @@ model:
8
  image_size: 224
9
  end_sym: "</s>"
10
  llama_model: "meta-llama/Llama-2-7b-chat-hf"
11
- ckpt: "checkpoints/image_llama2_checkpoint.pth"
12
  use_grad_checkpoint: True
13
  chat_template: True
14
  lora_r: 64
@@ -56,7 +56,7 @@ run:
56
  iters_per_epoch: 1000
57
 
58
  seed: 42
59
- output_dir: "training_output/cmd_webvid_pretrain"
60
 
61
  amp: True
62
  resume_ckpt_path: null
 
8
  image_size: 224
9
  end_sym: "</s>"
10
  llama_model: "meta-llama/Llama-2-7b-chat-hf"
11
+ ckpt: "checkpoints/image_llama2_checkpoint.pth" # set the checkpoint to start the training from
12
  use_grad_checkpoint: True
13
  chat_template: True
14
  lora_r: 64
 
56
  iters_per_epoch: 1000
57
 
58
  seed: 42
59
+ output_dir: "training_output/cmd_webvid_pretrain/llama2"
60
 
61
  amp: True
62
  resume_ckpt_path: null
train_configs/224_v2_llama2_video_stage_3.yaml CHANGED
@@ -7,8 +7,8 @@ model:
7
  low_resource: False
8
  image_size: 224
9
  end_sym: "</s>"
10
- llama_model: "meta-llama/Llama-2-7b-chat-hf"
11
- ckpt: "checkpoints/video_captioning_llama_checkpoint_last.pth"
12
  use_grad_checkpoint: True
13
  chat_template: True
14
  lora_r: 64
@@ -44,7 +44,7 @@ run:
44
  iters_per_epoch: 1000
45
 
46
  seed: 42
47
- output_dir: "training_output/pretrained_video_instruct"
48
 
49
  amp: True
50
  resume_ckpt_path: null
 
7
  low_resource: False
8
  image_size: 224
9
  end_sym: "</s>"
10
+ llama_model: "meta-llama/Meta-Llama-3-8B-Instruct"
11
+ # ckpt: "checkpoints/video_captioning_llama_checkpoint_last.pth" # set the checkpoint to start the training from
12
  use_grad_checkpoint: True
13
  chat_template: True
14
  lora_r: 64
 
44
  iters_per_epoch: 1000
45
 
46
  seed: 42
47
+ output_dir: "training_output/pretrained_video_instruct/llama2"
48
 
49
  amp: True
50
  resume_ckpt_path: null
train_configs/224_v2_mistral_video_stage_2.yaml CHANGED
@@ -8,7 +8,7 @@ model:
8
  image_size: 224
9
  end_sym: "</s>"
10
  llama_model: "mistralai/Mistral-7B-Instruct-v0.2"
11
- ckpt: "checkpoints/image_mistral_checkpoint.pth"
12
  use_grad_checkpoint: True
13
  chat_template: True
14
  lora_r: 64
@@ -56,7 +56,7 @@ run:
56
  iters_per_epoch: 875
57
 
58
  seed: 42
59
- output_dir: "training_output/cmd_webvid_pretrain"
60
 
61
  amp: True
62
  resume_ckpt_path: null
 
8
  image_size: 224
9
  end_sym: "</s>"
10
  llama_model: "mistralai/Mistral-7B-Instruct-v0.2"
11
+ ckpt: "checkpoints/image_mistral_checkpoint.pth" # set the checkpoint to start the training from
12
  use_grad_checkpoint: True
13
  chat_template: True
14
  lora_r: 64
 
56
  iters_per_epoch: 875
57
 
58
  seed: 42
59
+ output_dir: "training_output/cmd_webvid_pretrain/mistral"
60
 
61
  amp: True
62
  resume_ckpt_path: null
train_configs/224_v2_mistral_video_stage_3.yaml CHANGED
@@ -8,7 +8,7 @@ model:
8
  image_size: 224
9
  end_sym: "</s>"
10
  llama_model: "mistralai/Mistral-7B-Instruct-v0.2"
11
- ckpt: "checkpoints/video_captioning_mistral_checkpoint_last.pth"
12
  use_grad_checkpoint: True
13
  chat_template: True
14
  lora_r: 64
@@ -46,7 +46,7 @@ run:
46
  iters_per_epoch: 875
47
 
48
  seed: 42
49
- output_dir: "training_output/pretrained_video_instruct"
50
 
51
  amp: True
52
  resume_ckpt_path: null
 
8
  image_size: 224
9
  end_sym: "</s>"
10
  llama_model: "mistralai/Mistral-7B-Instruct-v0.2"
11
+ ckpt: "checkpoints/video_captioning_mistral_checkpoint_last.pth" # set the checkpoint to start the training from
12
  use_grad_checkpoint: True
13
  chat_template: True
14
  lora_r: 64
 
46
  iters_per_epoch: 875
47
 
48
  seed: 42
49
+ output_dir: "training_output/pretrained_video_instruct/mistral"
50
 
51
  amp: True
52
  resume_ckpt_path: null
train_configs/alignment.txt ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ <Img><ImageHere></Img> Describe this image in detail.
2
+ <Img><ImageHere></Img> Take a look at this image and describe what you notice.
3
+ <Img><ImageHere></Img> Please provide a detailed description of the picture.
4
+ <Img><ImageHere></Img> Could you describe the contents of this image for me?
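For reference, these alignment prompts are used together with the `prompt_template: '[INST] {} [/INST] '` defined in the alignment configs added in this commit. Assuming the template is filled with a simple `str.format` call, as the `{}` placeholder suggests, a sampled prompt ends up looking like this (the `<ImageHere>` token is later replaced by image embeddings):

```python
# Sketch: combining an alignment prompt with the prompt_template from the configs above.
prompt_template = "[INST] {} [/INST] "
alignment_prompt = "<Img><ImageHere></Img> Describe this image in detail."
print(prompt_template.format(alignment_prompt))
# -> [INST] <Img><ImageHere></Img> Describe this image in detail. [/INST]
```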