cathyxl committed on
Commit
f239efc
1 Parent(s): 8fb958b
This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. .gitattributes +5 -0
  2. .gitignore +66 -0
  3. DATA.md +124 -0
  4. README.md +376 -5
  5. app.py +18 -0
  6. assert/data.png +3 -0
  7. assert/logo.png +3 -0
  8. assert/module.png +3 -0
  9. assert/performance.png +3 -0
  10. assert/teaser.jpg +3 -0
  11. assert/zeroshot.png +3 -0
  12. dataset/__init__.py +158 -0
  13. dataset/base_dataset.py +108 -0
  14. dataset/it_dataset.py +206 -0
  15. dataset/utils.py +41 -0
  16. dataset/video_utils.py +214 -0
  17. docs/PoolLLaVA_Report.pdf +3 -0
  18. example/1917.mp4 +3 -0
  19. example/bear.jpg +3 -0
  20. example/cooking.mp4 +3 -0
  21. example/dog.png +3 -0
  22. example/jesse_dance.mp4 +3 -0
  23. example/working.mp4 +3 -0
  24. example/yoga.mp4 +3 -0
  25. models/__init__.py +0 -0
  26. models/pllava/__init__.py +55 -0
  27. models/pllava/configuration_pllava.py +149 -0
  28. models/pllava/convert_pllava_weights_to_hf.py +1 -0
  29. models/pllava/modeling_pllava.py +626 -0
  30. models/pllava/processing_pllava.py +292 -0
  31. python_scripts/hf.py +80 -0
  32. requirements.no_torch.txt +244 -0
  33. requirements.torch.txt +4 -0
  34. requirements.txt +246 -0
  35. scripts/accel_config_deepspeed_zero2.yaml +21 -0
  36. scripts/accel_config_deepspeed_zero3_offload.yaml +22 -0
  37. scripts/accel_config_deepspeed_zero3_offload_multinode.yaml +25 -0
  38. scripts/accel_config_deepspeed_zero3_offload_multinode_1.yaml +25 -0
  39. scripts/accel_config_deepspeed_zero3_offload_multinode_2.yaml +25 -0
  40. scripts/accel_config_deepspeed_zero3_offload_singlegpu.yaml +23 -0
  41. scripts/accel_config_multigpu.yaml +16 -0
  42. scripts/accel_config_multinode.yaml +18 -0
  43. scripts/accel_config_singlegpu.yaml +16 -0
  44. scripts/demo.sh +32 -0
  45. scripts/eval.sh +104 -0
  46. scripts/eval_yiprompt.sh +53 -0
  47. scripts/gallery.sh +11 -0
  48. scripts/train_pllava.sh +34 -0
  49. scripts/train_pllava_13b.sh +50 -0
  50. scripts/train_pllava_34b.sh +50 -0
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.mp4 filter=lfs diff=lfs merge=lfs -text
37
+ *.mov filter=lfs diff=lfs merge=lfs -text
38
+ *.png filter=lfs diff=lfs merge=lfs -text
39
+ *.jpg filter=lfs diff=lfs merge=lfs -text
40
+ *.pdf filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,66 @@
1
+ # local #
2
+ tmp*/
3
+ cache/*
4
+ */cache*/
5
+ tmp*.py
6
+ tmp*
7
+ *pickle
8
+ data/
9
+
10
+ # Zip Files/Packages #
11
+ *.7z
12
+ *.dmg
13
+ *.gz
14
+ *.iso
15
+ *.jar
16
+ *.rar
17
+ *.tar
18
+ *.zip
19
+
20
+ # Logs and databases #
21
+ *.log
22
+ *.sql
23
+ *.sqlite
24
+ .ipynb_checkpoints/
25
+ *.swp
26
+ *.vscode/
27
+ *.idea/
28
+ *.pyc
29
+ __pycache__
30
+ slurm*out
31
+
32
+ # OS files #
33
+ .DS_Store
34
+ .DS_Store?
35
+ ._*
36
+ .Spotlight-V100
37
+ .Trashes
38
+ ehthumbs.db
39
+ Thumbs.db
40
+
41
+
42
+ .vim-arsync
43
+ scratch.norg
44
+ sync_to_red.sh
45
+
46
+ anno/
47
+ wandb/
48
+ logs/
49
+ accelerate_config/
50
+ *.pth
51
+ hf_*
52
+
53
+ # local folders
54
+ MODELS
55
+ DATAS
56
+ SAVED
57
+ EXPERIMENTS
58
+ REMOTE_HF
59
+ TEST
60
+
61
+ test_results
62
+ test_training
63
+ test_hdfs.py
64
+ magic_video_outputs/llava*
65
+ magic_video_outputs
66
+ pllava_video_outputs/
DATA.md ADDED
@@ -0,0 +1,124 @@
1
+ # Data
2
+ ## Instruction Training Data
3
+ <!-- > *originated from [Videochat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2)* -->
4
+
5
+
6
+ For training, we leveraged the video instruction tuning data from [Videochat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2).
7
+
8
+ #### 1. Download the JSON annotation files from Hugging Face.
9
+ [![Dataset meta](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-VideoChat2%20IT-blue)](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT)
10
+
11
+ <!-- > ![images](./assert/data.png) -->
12
+
13
+ #### 2. Download the raw videos from the following links.
14
+ The video directories can be found in tasks/train/instruction_data.py. You can also change them to your own saved paths.
15
+
16
+ - [VideoChat](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data): Based on [InternVid](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid), download the processed version directly [here](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/videochat2/data/videochat2_conversation_videos.zip)
17
+ - [VideoChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main/data)
18
+ - [Kinetics-710](https://github.com/OpenGVLab/UniFormerV2/blob/main/DATASET.md), download Kinetics 400/600/700 [here](https://openxlab.org.cn/datasets?keywords=kinetics).
19
+ - [SthSthV2](https://developer.qualcomm.com/software/ai-datasets/something-something): Option candidates were generated from [UMT](https://github.com/OpenGVLab/unmasked_teacher) top-20 predictions.
20
+ - [NExTQA](https://github.com/doc-doc/NExT-QA)
21
+ - [CLEVRER](https://clevrer.csail.mit.edu/)
22
+ - [WebVid](https://maxbain.com/webvid-dataset/)
23
+ - [YouCook2](https://youcook2.eecs.umich.edu/), download the processed version [here](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/videochat2/data/youcook_split_videos.zip).
24
+ - [TextVR](https://github.com/callsys/textvr)
25
+ - [TGIF](https://github.com/YunseokJANG/tgif-qa)
26
+ - [EgoQA](https://ego4d-data.org/), download the processed version [here](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/videochat2/data/egoqa_split_videos.zip).
27
+
28
+ #### 3. We also provide our processed json annotation files here.
29
+
30
+ [![Dataset meta](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-magic%5Fjsons-blue)](https://huggingface.co/datasets/cathyxl/magic_jsons)
31
+
32
+
33
+ <!-- We leveraged the training data from [Videochat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2). We only used the video part for video instruct tuning. -->
34
+
35
+ ## Evaluation Data & Others
36
+ Follow this section to obtain the evaluation open resources.
37
+
38
+ ### VCGBench
39
+
40
+ We refer to the VideoChatGPT video question answering evaluation as VCGBench in this repo. We followed the original [repo](https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main) to prepare the evaluation data.
41
+
42
+ ### MVBench
43
+ We follow the original [Videochat2 repo](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) in setting up the MVBench Evaluation. You can also find helpful resources at their [huggingface repo](https://huggingface.co/datasets/OpenGVLab/MVBench)
44
+
45
+
46
+ ### Videoqabench
47
+ We refer to all other video question answering benchmarks as videoqabench in this repo. They are mainly prepared following the original repos, listed below:
48
+ 1. [MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) & [MSRVTT](https://github.com/xudejing/video-question-answering)
49
+
50
+ 2. [ActivityNet](https://github.com/MILVLG/activitynet-qa/tree/master)
51
+ 3. [TGIF](https://github.com/raingo/TGIF-Release/tree/master)
52
+
53
+ Other fantastic repos integrating these benchmarks were also helpful in setting up the evaluation data:
54
+ - [VideoChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main)
55
+ - [VideoLlava](https://github.com/PKU-YuanGroup/Video-LLaVA/tree/main/videollava)
56
+ - [IG-VLM](https://github.com/imagegridworth/IG-VLM/tree/main)
57
+
58
+
59
+
60
+ ### Recaptioning
61
+ #### Inter4k
62
+
63
+ This is a dataset with 1000 samples of high-resolution videos. We prepare the data following the instructions from their [official website](https://alexandrosstergiou.github.io/datasets/Inter4K/index.html).
64
+
65
+ #### Extending Recaptioning
66
+ The recaptioning part is designed to be extendable.
67
+
68
+ The inference script [tasks/eval/recaption/pllava_recaption.py](tasks/eval/recaption/pllava_recaption.py) uses the dataset class [RecaptionDataset](tasks/eval/recaption/__init__.py#L197). The detailed information is kept in the data_list_info attribute as:
69
+ ```python
70
+ data_list_info = OrderedDict({
71
+ # "Panda70M": OrderedDict(
72
+ # json_relpath="Panda70M/annotations.json",
73
+ # prefix="DATAS/Recaption/Panda70M/videos",
74
+ # data_type="video",
75
+ # bound=False,
76
+ # key_rename_map={
77
+ # # 'caption': 'hint',
78
+ # },
79
+ # name_key='video_name',
80
+ # postfix=('mp4', 'mkv', 'webm'),
81
+ # recaption_type=RecaptionSample,
82
+ # ), # doesn't have start & end
83
+ "Inter4K": OrderedDict(
84
+ json_relpath="Inter4K/annotations.json",
85
+ prefix="DATAS/Recaption/Inter4K/60fps/UHD",
86
+ data_type="video",
87
+ bound=False,
88
+ key_rename_map={
89
+ # 'caption': 'hint',
90
+ },
91
+ name_key='video_name',
92
+ postfix=('mp4', 'mkv', 'webm'),
93
+ recaption_type=CaptionSample,
94
+ ), # doesn't have start & end
95
+ })
96
+ ```
97
+ It contains the path to an annotation JSON file containing a list, where each item is a sample waiting to be captioned. For example, Inter4K/annotations.json looks like:
98
+ ```json
99
+ [
100
+ {
101
+ "video_name": "973"
102
+ },
103
+ ...
104
+ ]
105
+ ```
106
+ and the directory DATAS/Recaption/Inter4K/60fps/UHD would look like:
107
+ ```
108
+ $ ls DATAS/Recaption/Inter4K/60fps/UHD
109
+ 1.mp4 134.mp4 170.mp4 ....
110
+ ```
111
+
112
+ Naively, only the videos are needed when captioning directly; therefore the annotation file only needs to contain the name of each video under the "prefix" directory.
113
+
114
+ Extending a dataset for captioning consists of the following steps:
115
+ 1. have all the videos downloaded
116
+ 2. construct an annotation.json file with the specific format (a minimal sketch is given after this list).
117
+ 3. configure the recaption dataset [here](tasks/eval/recaption/__init__.py#L197), where you would need to determine:
118
+ - json_relpath: the annotation relative path
119
+ - prefix: root directory for videos
120
+ - postfix: a list containing all the file extensions for these videos
121
+
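For step 2, here is a minimal sketch of building such an annotation file: it lists the videos under the "prefix" directory and writes one entry per video. The dataset name and paths below are hypothetical placeholders, not part of the repo.

```python
# Minimal sketch for step 2 (hypothetical paths): write an annotations.json
# listing every video under the "prefix" directory, names without extensions.
import json
import os

prefix = "DATAS/Recaption/MyVideos"        # root directory holding the raw videos
postfix = ("mp4", "mkv", "webm")           # accepted video file extensions
extensions = tuple("." + p for p in postfix)

samples = [
    {"video_name": os.path.splitext(name)[0]}   # e.g. "973" for 973.mp4
    for name in sorted(os.listdir(prefix))
    if name.lower().endswith(extensions)
]

# Place the file wherever your json_relpath points, e.g. MyVideos/annotations.json.
with open("DATAS/Recaption/MyVideos/annotations.json", "w") as f:
    json.dump(samples, f, indent=2)
```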
122
+ The other options are experimental, so stick with the default settings as in Inter4K. The recommended video length is around 5-20 seconds.
123
+
124
+ P.S. "bound" is meant to ensure the video passed to the model doesn't contain scene transitions or the like. This part wasn't tested, so set bound to false and make sure the original video files are single clips. But always feel free to explore and contribute to PLLaVA!
README.md CHANGED
@@ -1,12 +1,383 @@
1
  ---
2
- title: Pllava 7b Demo
3
- emoji: 🌖
4
  colorFrom: blue
5
- colorTo: blue
6
  sdk: gradio
7
- sdk_version: 4.28.3
8
  app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: Pllava 7b Demo
3
+ emoji: 👁
4
  colorFrom: blue
5
+ colorTo: yellow
6
  sdk: gradio
7
+ sdk_version: 4.27.0
8
  app_file: app.py
9
  pinned: false
10
  ---
11
 
12
+ <div align="center">
13
+
14
+ <h2><a href="https://pllava.github.io/">PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning</a></h2>
15
+
16
+ [Lin Xu](https://scholar.google.com/citations?user=_Gu69coAAAAJ), [Yilin Zhao](https://ermu2001.github.io/me.io/), [Daquan Zhou](https://scholar.google.com/citations?user=DdCAbWwAAAAJ), [Zhijie Lin](https://scholar.google.com/citations?user=xXMj6_EAAAAJ), [See-Kiong Ng](https://scholar.google.com/citations?user=_wsommYAAAAJ), [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en)
17
+
18
+ </div>
19
+
20
+ <!-- [![Paper](https://img.shields.io/badge/cs.CV-2311.17005-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2311.17005) -->
21
+
22
+ **Project Page: [PLLaVA](https://pllava.github.io/)**
23
+
24
+ [![arXiv](https://img.shields.io/badge/arXiv-2404.16994-b31b1b.svg)](https://arxiv.org/abs/2404.16994)
25
+ [![YouTube Video](https://img.shields.io/badge/YouTube-Video-red)](https://www.youtube.com/watch?v=nAEje8tu18U)
26
+ [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/ermu2001/pllava-34b)
27
+
28
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=pllava-parameter-free-llava-extension-from-1)
29
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=pllava-parameter-free-llava-extension-from-1)
30
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=pllava-parameter-free-llava-extension-from-1)
31
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-question-answering-on-mvbench)](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=pllava-parameter-free-llava-extension-from-1)
32
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/zeroshot-video-question-answer-on-tgif-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-tgif-qa?p=pllava-parameter-free-llava-extension-from-1)
33
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=pllava-parameter-free-llava-extension-from-1)
34
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=pllava-parameter-free-llava-extension-from-1)
35
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=pllava-parameter-free-llava-extension-from-1)
36
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=pllava-parameter-free-llava-extension-from-1)
37
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=pllava-parameter-free-llava-extension-from-1)
38
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/pllava-parameter-free-llava-extension-from-1/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=pllava-parameter-free-llava-extension-from-1)
39
+
40
+
41
+
42
+
43
+
44
+
45
+
46
+ ![]()
47
+ <div align="center">
48
+ <a href="https://pllava.github.io">
49
+ <img src="assert/logo.png">
50
+ </a>
51
+ </div>
52
+
53
+ <div align="center">
54
+ <video src="https://github.com/magic-research/PLLaVA/assets/55656210/a6619702-12d3-489d-bfcc-0ef7105544b2" width="100%">
55
+ </div>
56
+
57
+
58
+
59
+
60
+
61
+
62
+ ## Overview
63
+
64
+ Welcome to PLLAVA!
65
+
66
+ The primary purpose of this repository is to support research and the development of prototype models. It is designed to facilitate ease of experimentation and enable a clear overview of results. Please note that this repo is currently undergoing development and reconstruction.
67
+
68
+ It's important to mention that we have not optimized the response speed of the application or the frontend logic. Our goal is to maintain simplicity, clarity, and ease of development, making it accessible for both researchers and students. If you have suggestions or want to enhance the application's performance, please feel free to contact us or contribute to the project.
69
+
70
+
71
+ We briefly introduce our work in the [PLLAVA](#%EF%B8%8F-pllava) section. For more details, feel free to read our paper. Check out the [Usage](#hammer-usage) section to start using this repo. If you find our work interesting, please star us; your support is all we want. If you find our work helpful, feel free to [cite](#page_facing_up-citation) us.
72
+
73
+ ## :fire: Updates
74
+
75
+ - **2024/4/24**: Release:
76
+ - We are releasing our code/models/datasets.
77
+
78
+ ## 🏖️ PLLAVA
79
+ <div align="center">
80
+ <a href="https://www.youtube.com/embed/nAEje8tu18U?si=GXxjgP93j77FzDbw">
81
+ <img src="assert/teaser.jpg">
82
+ </a>
83
+ </div>
84
+
85
+
86
+ ### Abstract
87
+
88
+ Vision-language pre-training (VLP) has significantly elevated performance across a range of vision-language applications. Yet, the pre-training process for video-related tasks demands an exceptionally high degree of computational and data resources. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-training model for video data. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames on video datasets leads to performance saturation or even a drop in caption-related tasks. Besides, it is also vulnerable to prompts and tends to provide short descriptions. We conducted a deep analysis and observed that the performance saturation and the vulnerability might be related to the dominant patches that exist in some single video patches. We then propose a simple pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from some extreme tokens. The new model is termed Pooling LLaVA, or PLLaVA in short. With the proposed pooling strategy, we achieve new state-of-the-art performance on all evaluated datasets. Notably, on the recent popular Video ChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, which is the new state-of-the-art score on the leaderboard and is 0.31 higher than the previous SOTA results from GPT4V (IG-VLM). On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, which is the new state-of-the-art result and is 14.5% higher than GPT4V (IG-VLM).
89
+
90
+ <div align="center"><img src="assert/module.png"></div>
91
+
92
+
93
+ ### SEARCHING FOR OPTIMAL POOLING STRATEGY
94
+ There are two dimensions for the pooling strategy: the spatial dimension and the temporal dimension. We empirically found that reducing the spatial dimension with a larger temporal dimension could lead to better model performance, compared to reducing the temporal dimension directly.
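As an illustration of this idea only (not the repo's exact implementation, and with made-up tensor sizes), adaptive average pooling can shrink the spatial grid of the per-frame features while keeping the full temporal dimension:

```python
# Illustrative sketch: pool per-frame vision features spatially while
# keeping the temporal dimension, via adaptive average pooling.
import torch
import torch.nn as nn

B, T, N, C = 2, 16, 576, 1024                 # batch, frames, patches per frame (24x24), hidden dim
feats = torch.randn(B, T, N, C)               # per-frame patch features from the image encoder

H = W = int(N ** 0.5)                         # recover the 24x24 patch grid
feats = feats.view(B, T, H, W, C).permute(0, 4, 1, 2, 3)    # (B, C, T, H, W)

# Keep all T frames, reduce the spatial grid from 24x24 to 12x12.
pooled = nn.AdaptiveAvgPool3d((T, 12, 12))(feats)           # (B, C, T, 12, 12)

# Flatten back into a visual token sequence for the LLM: 16 * 12 * 12 tokens.
video_tokens = pooled.permute(0, 2, 3, 4, 1).reshape(B, -1, C)
print(video_tokens.shape)                     # torch.Size([2, 2304, 1024])
```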
95
+
96
+ <div align="center"><img src="assert/zeroshot.png"></div>
97
+
98
+
99
+ ### STATE-OF-THE-ART PERFORMANCE
100
+ We compare the performance of PLLAVA with recent popular methods over both question-answer and captioning datasets. The results are shown below.
101
+
102
+ <div align="center"><img src="assert/performance.png"></div>
103
+
104
+ ## :hammer: Usage
105
+
106
+ This section provides guidance on how to run, train, and evaluate our models.
107
+
108
+ ### Install
109
+ First, you will need to set up the environment and download some pre-trained weights.
110
+
111
+ This repo is built up using [transformers](https://github.com/huggingface/transformers) for model construction along with [accelerate](https://github.com/huggingface/accelerate) for distributed training. Follow the instructions to install the needed environment.
112
+
113
+ 1. Above all, the following environment setup is for Python 3.10. If you choose to use conda for environment setup, we recommend creating the virtual environment with:
114
+ ```bash
115
+ conda create -n pllava python=3.10
116
+ ```
117
+
118
+ 1. Then, install [pytorch](https://pytorch.org/) from the official website. The code runs on torch 2.2.1 with cu118 or cu121. Select the version that suits your driver version.
119
+
120
+ ```
121
+ torch 2.2.1+cu118
122
+ torchaudio 2.2.1+cu118
123
+ torchvision 0.17.1+cu118
124
+ ```
125
+
126
+ If your driver version is higher than cu121, you can probably install everything with the following command:
127
+ ```bash
128
+ pip install -r requirements.txt
129
+ ```
130
+
131
+ Otherwise, you need to install a torch build suited to your server first, then install the other packages:
132
+ ```bash
133
+ pip install -r requirements.torch.txt # pick your own torch requirements (this file is for cu11), or install torch directly following the official website.
134
+ pip install -r requirements.no_torch.txt # install the remaining packages
135
+ ```
136
+
137
+ 1. Prepare the model.
138
+ We prefer to have Hugging Face models explicitly downloaded to a MODELS directory. However, if you are familiar with huggingface-hub usage, feel free to organize the models yourself.
139
+ ```
140
+ python python_scripts/hf.py
141
+ ```
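If you would rather fetch a checkpoint programmatically, the following sketch mirrors the `snapshot_download` call used in `app.py` of this repo; adjust the repo id and target directory for the model size you want.

```python
# Sketch: download a released checkpoint into the MODELS directory,
# mirroring the snapshot_download call in app.py.
from huggingface_hub import snapshot_download

snapshot_download(
    "ermu2001/pllava-7b",            # or ermu2001/pllava-13b / ermu2001/pllava-34b
    local_dir="MODELS/pllava-7b",
    repo_type="model",
)
```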
142
+
143
+ Here is some detailed information about the obtained models:
144
+
145
+
146
+ | Model | Link | Initialized From |
147
+ | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
148
+ | pllava-7b | [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/ermu2001/pllava-7b) | [llava-hf/llava-v1.6-vicuna-7b-hf · Hugging Face](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) |
149
+ | pllava-13b | [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/ermu2001/pllava-13b) | [llava-hf/llava-v1.6-vicuna-13b-hf · Hugging Face](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) |
150
+ | pllava-34b | [![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm-dark.svg)](https://huggingface.co/ermu2001/pllava-34b) | [llava-hf/llava-v1.6-34b-hf · Hugging Face](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) |
151
+
152
+ The model directory should look like this; you only need the weights and directory of the model you intend to use.
153
+
154
+ ```
155
+ $ tree MODELS
156
+ MODELS
157
+ |-- pllava-13b
158
+ | |-- added_tokens.json
159
+ | |-- config.json
160
+ | |-- generation_config.json
161
+ | |-- model-00001-of-00006.safetensors
162
+ | |-- model-00002-of-00006.safetensors
163
+ | |-- model-00003-of-00006.safetensors
164
+ | |-- model-00004-of-00006.safetensors
165
+ | |-- model-00005-of-00006.safetensors
166
+ | |-- model-00006-of-00006.safetensors
167
+ | |-- model.safetensors.index.json
168
+ | |-- preprocessor_config.json
169
+ | |-- processor_config.json
170
+ | |-- special_tokens_map.json
171
+ | |-- tokenizer.json
172
+ | |-- tokenizer.model
173
+ | `-- tokenizer_config.json
174
+ |-- pllava-34b
175
+ | |-- added_tokens.json
176
+ | |-- config.json
177
+ | |-- generation_config.json
178
+ | |-- model-00001-of-00015.safetensors
179
+ | |-- model-00002-of-00015.safetensors
180
+ | |-- model-00003-of-00015.safetensors
181
+ | |-- model-00004-of-00015.safetensors
182
+ | |-- model-00005-of-00015.safetensors
183
+ | |-- model-00006-of-00015.safetensors
184
+ | |-- model-00007-of-00015.safetensors
185
+ | |-- model-00008-of-00015.safetensors
186
+ | |-- model-00009-of-00015.safetensors
187
+ | |-- model-00010-of-00015.safetensors
188
+ | |-- model-00011-of-00015.safetensors
189
+ | |-- model-00012-of-00015.safetensors
190
+ | |-- model-00013-of-00015.safetensors
191
+ | |-- model-00014-of-00015.safetensors
192
+ | |-- model-00015-of-00015.safetensors
193
+ | |-- model.safetensors-deprecated
194
+ | |-- model.safetensors.index.json
195
+ | |-- preprocessor_config.json
196
+ | |-- processor_config.json
197
+ | |-- special_tokens_map.json
198
+ | |-- tokenizer.json
199
+ | |-- tokenizer.model
200
+ | `-- tokenizer_config.json
201
+ |-- pllava-7b
202
+ |-- added_tokens.json
203
+ |-- config.json
204
+ |-- generation_config.json
205
+ |-- model-00001-of-00003.safetensors
206
+ |-- model-00002-of-00003.safetensors
207
+ |-- model-00003-of-00003.safetensors
208
+ |-- model.safetensors.index.json
209
+ |-- preprocessor_config.json
210
+ |-- processor_config.json
211
+ |-- special_tokens_map.json
212
+ |-- tokenizer.json
213
+ |-- tokenizer.model
214
+ `-- tokenizer_config.json
215
+ ```
216
+
217
+ With the above steps, you should be able to proceed with the following usages.
218
+
219
+ ### Run Application
220
+
221
+ To run our models, make sure you have downloaded a model's pretrained weights from Hugging Face. Then run the following script with the corresponding paths. Since we only train with LoRA and the projector, the model to run is determined by:
222
+
223
+ - **model_dir**: the model directory, i.e. one with a config.json compatible with transformers. This refers to the base model's directory, for example "llava-hf/llava-v1.6-vicuna-7b-hf", "ermu2001/pllava-7b", or "MODELS/pllava-7b". (defaults to: MODELS/pllava-7b)
224
+ - **weights_dir**: your weights directory. It can be the same as model_dir, but if you have a separate directory for the LoRA weights, set weights_dir to that directory to load them. This directory should be local and needs to contain a config.json file. (defaults to: ${model_dir})
225
+
226
+ ```bash
227
+ model_dir="model directory"
228
+ weights_dir="weights directory"
229
+ bash scripts/demo.sh ${model_dir} ${weights_dir}
230
+ ```
231
+
232
+ Now check out the application demo and try playing with PLLAVA!
233
+
234
+ ### Train
235
+
236
+ Follow these steps to reproduce our results or train your own variant:
237
+
238
+ #### 1. Data Preparation
239
+
240
+ To train our model from a starting image-aligned Vision LLM, you need to download the data first. Our data setup is mainly based on the original Videochat2 training data. Check out [Instruction Data](./DATA.md) to prepare the instruction training data. Ideally, setting up a root data directory and altering the code [here](./tasks/train/instruction_data.py#L6) will accommodate the data for training most smoothly.
241
+
242
+ #### 2. Start Training
243
+
244
+ Now you're only a few steps away from starting the training. Follow the instructions:
245
+
246
+ ##### Setup Accelerator
247
+
248
+ Customize an accelerate training config. For example, a simple config using multiple GPUs with no distribution strategy (only torch DDP) would look like:
249
+
250
+ ```yaml
251
+ compute_environment: LOCAL_MACHINE
252
+ debug: false
253
+ distributed_type: MULTI_GPU
254
+ downcast_bf16: 'no'
255
+ gpu_ids: all
256
+ machine_rank: 0
257
+ main_training_function: main
258
+ mixed_precision: bf16
259
+ num_machines: 1
260
+ num_processes: 8
261
+ rdzv_backend: static
262
+ same_network: true
263
+ tpu_env: []
264
+ tpu_use_cluster: false
265
+ tpu_use_sudo: false
266
+ use_cpu: false
267
+ ```
268
+
269
+ Check out the [Accelerate](https://huggingface.co/docs/accelerate/index) documentation for more details.
270
+
271
+ ##### Go over the training configuration
272
+
273
+ Next, you should go over the basic training configuration of the training process [here](tasks/train/config_pllava_nframe.py). Passing this file as the first argument to the [training script](tasks/train/train_pllava_nframe_accel.py) will use every argument in the file. You can customize some of the hyperparameters for your own training by passing them as "key" "value" pairs in the following arguments. An example training script can be found [here](scripts/train_pllava.sh).
274
+
275
+ We recommend customizing a [configuration](tasks/train/config_pllava_nframe.py) to set up your own training!
276
+
277
+ With the above steps, you will be able to start the training process. The outputs are well organized in the output directory, each a qualified model directory that can be passed to the demo as weights_dir, since we only save the LoRA weights and projector weights to avoid redundancy.
278
+
279
+ ### Evaluation
280
+
281
+ This section mainly introduces how to reproduce the evaluation or evaluate your own model.
282
+
283
+ #### Set up Evaluation Data
284
+
285
+ Make sure you set up the "DATAS" directory as in [DATA.md](DATA.md); then you will be able to run the inference smoothly! The evaluation data directory under DATAS would look like:
286
+
287
+ ```
288
+ DATAS/:
289
+ DATAS/VideoQA:
290
+ DATAS/VideoQA/TGIF_QA:
291
+ test_a.json
292
+ test_q.json
293
+ DATAS/VideoQA/TGIF_QA/videos:
294
+ tumblr_m4387mGrlc1r6m5e8o1_250.gif
295
+ ...
296
+ DATAS/VideoQA/TGIF_QA/videos_mp4:
297
+ tumblr_m4387mGrlc1r6m5e8o1_250.mp4
298
+ ...
299
+ DATAS/VideoQA/TGIF_QA/video_gif:
300
+ tumblr_m4387mGrlc1r6m5e8o1_250.gif
301
+ ...
302
+ DATAS/VideoQA/MSVD_Zero_Shot_QA:
303
+ test_a.json
304
+ test_q.json
305
+ DATAS/VideoQA/MSVD_Zero_Shot_QA/videos:
306
+ -4wsuPCjDBc_5_15.avi
307
+ DATAS/VideoQA/MSVD_Zero_Shot_QA/msvd_qa:
308
+ DATAS/VideoQA/ActivityNet:
309
+ test_a.json
310
+ test_q.json
311
+ DATAS/VideoQA/ActivityNet/all_test:
312
+ v_--tFD65KaK4.mp4
313
+ ...
314
+ DATAS/VideoQA/MSRVTT_Zero_Shot_QA:
315
+ test_a.json
316
+ test_q.json
317
+ DATAS/VideoQA/MSRVTT_Zero_Shot_QA/videos:
318
+ DATAS/VideoQA/MSRVTT_Zero_Shot_QA/videos/all:
319
+ video0.mp4
320
+ ...
321
+
322
+ DATAS/MVBench:
323
+ ...
324
+
325
+ DATAS/Recaption/Inter4K:
326
+ annotations.json
327
+ DATAS/Recaption/Inter4K/60fps:
328
+ DATAS/Recaption/Inter4K/60fps/UHD:
329
+ 1.mp4
330
+ ...
331
+
332
+ ```
333
+
334
+ #### Start Evaluating
335
+
336
+ Once you have constructed the evaluation data, you can start the evaluation as shown [here](scripts/eval.sh). This script is for evaluating the 7B/13B models. As the pllava-34b model uses slightly different prompting, it is evaluated with this [script](scripts/eval_yiprompt.sh).
337
+
338
+ ```
339
+ bash scripts/eval.sh
340
+ ```
341
+
342
+ As with running the demo, you need to set model_dir and weights_dir to evaluate a model. Feel free to comment out some commands to produce a partial evaluation.
343
+
344
+ #### Look over the Results
345
+
346
+ The evaluation results are presented with our results gallery demo:
347
+
348
+ ```bash
349
+ bash scripts/gallery.sh
350
+ ```
351
+
352
+ Feel free to use the compare version to compare different models' results, or use the single gallery version to check out one model's results. They are basically the same. Check out this [script](scripts/gallery.sh) for more details.
353
+
354
+ #### For Captioning and Recaptioning
355
+ Follow the instructions at [DATA.md](DATA.md#extending-recaptioning) and you can extend the recaptioning data in a few steps.
356
+
357
+ Feel free to point out high-quality video datasets; we will proceed with captioning them.
358
+
359
+
360
+ # :page_facing_up: Citation
361
+
362
+ If you find this project useful in your research, please consider citing:
363
+
364
+ ```BibTeX
365
+ @misc{xu2024pllava,
366
+ title={PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning},
367
+ author={Lin Xu and Yilin Zhao and Daquan Zhou and Zhijie Lin and See Kiong Ng and Jiashi Feng},
368
+ year={2024},
369
+ eprint={2404.16994},
370
+ archivePrefix={arXiv},
371
+ primaryClass={cs.CV}
372
+ }
373
+ ```
374
+
375
+ # :dizzy: Acknowledgement
376
+
377
+ This code base is mainly built upon [Videochat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2). SALUTE.
378
+
379
+ We would also like to recognize and commend the following open-source projects; thank you for your great contributions to the open-source community:
380
+
381
+ - [LLaVA](https://github.com/haotian-liu/LLaVA): Fantastic Open Source Image LLM Model.
382
+ - [VideoChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main): Great Evaluation Benchmarking Framework.
383
+ - [VideoLlava](https://github.com/PKU-YuanGroup/Video-LLaVA/tree/main/videollava): Video LLM repo with helpful resources.
app.py ADDED
@@ -0,0 +1,18 @@
1
+ import sys
2
+ from huggingface_hub import snapshot_download
3
+ snapshot_download(
4
+ 'ermu2001/pllava-7b',
5
+ local_dir='MODELS/pllava-7b',
6
+ repo_type='model',
7
+ local_dir_use_symlinks=True,
8
+ )
9
+
10
+ sys.argv.extend([
11
+ "--pretrained_model_name_or_path", "MODELS/pllava-7b",
12
+ "--num_frames", "16",
13
+ "--use_lora",
14
+ "--weight_dir", "MODELS/pllava-7b",
15
+ "--lora_alpha", "4",
16
+ "--conv_mode", "plain",
17
+ ])
18
+ import tasks.eval.demo.pllava_demo
assert/data.png ADDED

Git LFS Details

  • SHA256: 72bd5fa48454bfcb6ee1c5b26c3baffd2397502a27bb666860069f0a5755a51b
  • Pointer size: 131 Bytes
  • Size of remote file: 224 kB
assert/logo.png ADDED

Git LFS Details

  • SHA256: df1ae4a260b20b749eaaef02d9bad7057cbba958fff92e23e28d1d3b91224668
  • Pointer size: 132 Bytes
  • Size of remote file: 1.32 MB
assert/module.png ADDED

Git LFS Details

  • SHA256: 7933116caeb3552590bc80c543f37456261dcb9984d75a6f81555f4d38ccfa65
  • Pointer size: 131 Bytes
  • Size of remote file: 226 kB
assert/performance.png ADDED

Git LFS Details

  • SHA256: 9bced5f433da0a6424d8bd1bd776f6cb16407ae94d5cf2fbc09ba09e407c37ac
  • Pointer size: 131 Bytes
  • Size of remote file: 106 kB
assert/teaser.jpg ADDED

Git LFS Details

  • SHA256: f204476020f3995d37a5f7c5b341f8eb739cbb0b5e1e529a8c4e722e5976de54
  • Pointer size: 131 Bytes
  • Size of remote file: 372 kB
assert/zeroshot.png ADDED

Git LFS Details

  • SHA256: d6ee8e95e824759b2f93d63db9c4c57f81775576c8b2932b875dd4176b702dab
  • Pointer size: 131 Bytes
  • Size of remote file: 147 kB
dataset/__init__.py ADDED
@@ -0,0 +1,158 @@
1
+ import torch
2
+ from torch.utils.data import ConcatDataset, DataLoader
3
+ from torchvision import transforms
4
+ from torchvision.transforms import InterpolationMode
5
+ from dataset.it_dataset import ITImgTrainDataset, ITVidTrainDataset
6
+
7
+
8
+ def get_media_type(dataset_config):
9
+ if len(dataset_config) == 3 and dataset_config[2] == "video":
10
+ return "video"
11
+ elif dataset_config[-1] == "only_video":
12
+ return "only_video"
13
+ else:
14
+ return "image"
15
+
16
+
17
+ def create_dataset(dataset_type, config):
18
+ if "clip" in config.model.get("vit_model", 'vit'):
19
+ mean = (0.485, 0.456, 0.406)
20
+ std = (0.229, 0.224, 0.225)
21
+ else:
22
+ vision_enc_name = config.model.vision_encoder.name
23
+ if "swin" in vision_enc_name or "vit" in vision_enc_name:
24
+ mean = (0.485, 0.456, 0.406)
25
+ std = (0.229, 0.224, 0.225)
26
+ elif "beit" in vision_enc_name:
27
+ mean = (0.5, 0.5, 0.5) # for all beit model except IN1K finetuning
28
+ std = (0.5, 0.5, 0.5)
29
+ elif "clip" in vision_enc_name:
30
+ mean = (0.48145466, 0.4578275, 0.40821073)
31
+ std = (0.26862954, 0.26130258, 0.27577711)
32
+ else:
33
+ raise ValueError
34
+
35
+ normalize = transforms.Normalize(mean, std)
36
+
37
+ # loaded images and videos are torch.Tensor of torch.uint8 format,
38
+ # ordered as (T, 1 or 3, H, W) where T=1 for image
39
+ type_transform = transforms.Lambda(lambda x: x.float().div(255.0))
40
+
41
+ if config.inputs.video_input.random_aug:
42
+ aug_transform = transforms.RandAugment()
43
+ else:
44
+ aug_transform = transforms.Lambda(lambda x: x)
45
+
46
+ train_transform = transforms.Compose(
47
+ [
48
+ aug_transform,
49
+ transforms.RandomResizedCrop(
50
+ config.inputs.image_res,
51
+ scale=(0.5, 1.0),
52
+ interpolation=InterpolationMode.BICUBIC,
53
+ ),
54
+ transforms.RandomHorizontalFlip(),
55
+ type_transform,
56
+ normalize,
57
+ ]
58
+ )
59
+ test_transform = transforms.Compose(
60
+ [
61
+ transforms.Resize(
62
+ (config.inputs.image_res, config.inputs.image_res),
63
+ interpolation=InterpolationMode.BICUBIC,
64
+ ),
65
+ type_transform,
66
+ normalize,
67
+ ]
68
+ )
69
+
70
+ video_reader_type = config.inputs.video_input.get("video_reader_type", "decord")
71
+ video_only_dataset_kwargs_train = dict(
72
+ video_reader_type=video_reader_type,
73
+ sample_type=config.inputs.video_input.sample_type,
74
+ num_frames=config.inputs.video_input.num_frames,
75
+ num_tries=3, # false tolerance
76
+ )
77
+
78
+ if dataset_type == "pt_train":
79
+ raise ValueError("NOT PRETRAINING YET")
80
+ elif dataset_type in ["it_train"]:
81
+ # convert to list of lists
82
+ train_files = (
83
+ [config.train_file] if isinstance(config.train_file[0], str) else config.train_file
84
+ )
85
+ train_media_types = sorted(list({get_media_type(e) for e in train_files}))
86
+
87
+ train_datasets = []
88
+ for m in train_media_types:
89
+ dataset_cls = ITImgTrainDataset if m == "image" else ITVidTrainDataset
90
+ # dataset of the same media_type will be mixed in a single Dataset object
91
+ _train_files = [e for e in train_files if get_media_type(e) == m]
92
+
93
+ datasets = []
94
+ for train_file in _train_files:
95
+ dataset_kwargs = dict(
96
+ ann_file=train_file,
97
+ transform=train_transform,
98
+ mm_alone=config.preprocess.get("mm_alone", True),
99
+ add_second_msg=config.preprocess.get("add_second_msg", True),
100
+ skip_short_sample=config.preprocess.get("skip_short_sample", False),
101
+ clip_transform=config.preprocess.get("clip_transform", False),
102
+ random_shuffle=config.preprocess.get("random_shuffle", True),
103
+ system=config.preprocess.get("system", ""),
104
+ role=config.preprocess.get('roles', ("Human", "Assistant")),
105
+ end_signal=config.preprocess.get('end_signal', "###"),
106
+ begin_signal=config.preprocess.get('begin_signal', ""),
107
+ )
108
+ if m == "video":
109
+ video_only_dataset_kwargs_train.update({
110
+ "start_token": config.model.get("start_token", "<Video>"),
111
+ "end_token": config.model.get("end_token", "</Video>"),
112
+ })
113
+ dataset_kwargs.update(video_only_dataset_kwargs_train)
114
+ if "tgif" in train_file[1]:
115
+ video_only_dataset_kwargs_train.update({
116
+ "video_reader_type": "gif"
117
+ })
118
+ dataset_kwargs.update(video_only_dataset_kwargs_train)
119
+ elif "webvid" in train_file[1]:
120
+ video_only_dataset_kwargs_train.update({
121
+ "video_reader_type": "hdfs"
122
+ })
123
+ else:
124
+ video_only_dataset_kwargs_train.update({
125
+ "video_reader_type": "decord"
126
+ })
127
+ dataset_kwargs.update(video_only_dataset_kwargs_train)
128
+ datasets.append(dataset_cls(**dataset_kwargs))
129
+ dataset = ConcatDataset(datasets)
130
+ train_datasets.append(dataset)
131
+ return train_datasets
132
+
133
+
134
+ def create_loader(datasets, samplers, batch_size, num_workers, is_trains, collate_fns):
135
+ loaders = []
136
+ for dataset, sampler, bs, n_worker, is_train, collate_fn in zip(
137
+ datasets, samplers, batch_size, num_workers, is_trains, collate_fns
138
+ ):
139
+ if is_train:
140
+ shuffle = sampler is None
141
+ drop_last = True
142
+ else:
143
+ shuffle = False
144
+ drop_last = False
145
+ loader = DataLoader(
146
+ dataset,
147
+ batch_size=bs,
148
+ num_workers=n_worker,
149
+ pin_memory=False,
150
+ sampler=sampler,
151
+ shuffle=shuffle,
152
+ collate_fn=collate_fn,
153
+ drop_last=drop_last,
154
+ persistent_workers=True if n_worker > 0 else False,
155
+ )
156
+ loaders.append(loader)
157
+ return loaders
158
+
dataset/base_dataset.py ADDED
@@ -0,0 +1,108 @@
1
+ import logging
2
+ import os
3
+ import json
4
+ import random
5
+ from torch.utils.data import Dataset
6
+ import time
7
+ from dataset.utils import load_image_from_path
8
+
9
+ try:
10
+ from petrel_client.client import Client
11
+ has_client = True
12
+ except ImportError:
13
+ has_client = False
14
+
15
+ logger = logging.getLogger(__name__)
16
+
17
+
18
+ class ImageVideoBaseDataset(Dataset):
19
+ """Base class that implements the image and video loading methods"""
20
+
21
+ media_type = "video"
22
+
23
+ def __init__(self):
24
+ assert self.media_type in ["image", "video", "only_video"]
25
+ self.data_root = None
26
+ self.anno_list = (
27
+ None # list(dict), each dict contains {"image": str, # image or video path}
28
+ )
29
+ self.transform = None
30
+ self.video_reader = None
31
+ self.num_tries = None
32
+
33
+ self.client = None
34
+ if has_client:
35
+ self.client = Client('~/petreloss.conf')
36
+
37
+ def __getitem__(self, index):
38
+ raise NotImplementedError
39
+
40
+ def __len__(self):
41
+ raise NotImplementedError
42
+
43
+ def get_anno(self, index):
44
+ """obtain the annotation for one media (video or image)
45
+
46
+ Args:
47
+ index (int): The media index.
48
+
49
+ Returns: dict.
50
+ - "image": the filename, video also use "image".
51
+ - "caption": The caption for this file.
52
+
53
+ """
54
+ anno = self.anno_list[index]
55
+ if self.data_root is not None:
56
+ anno["image"] = os.path.join(self.data_root, anno["image"])
57
+ return anno
58
+
59
+ def load_and_transform_media_data(self, index, data_path):
60
+ if self.media_type == "image":
61
+ return self.load_and_transform_media_data_image(index, data_path, clip_transform=self.clip_transform)
62
+ else:
63
+ return self.load_and_transform_media_data_video(index, data_path, clip_transform=self.clip_transform)
64
+
65
+ def load_and_transform_media_data_image(self, index, data_path, clip_transform=False):
66
+ image = load_image_from_path(data_path, client=self.client)
67
+ if not clip_transform:
68
+ image = self.transform(image)
69
+ return image, index
70
+
71
+ def load_and_transform_media_data_video(self, index, data_path, return_fps=False, clip=None, clip_transform=False):
72
+ for _ in range(self.num_tries):
73
+ try:
74
+ max_num_frames = self.max_num_frames if hasattr(self, "max_num_frames") else -1
75
+ if "webvid" in data_path:
76
+ hdfs_dir="hdfs://harunava/home/byte_ailab_us_cvg/user/weimin.wang/videogen_data/webvid_data/10M_full_train"
77
+ video_name = os.path.basename(data_path)
78
+ video_id, extension = os.path.splitext(video_name)
79
+ ind_file = os.path.join(hdfs_dir, self.keys_indexfile[video_id])
80
+ frames, frame_indices, fps = self.video_reader(ind_file, video_id, self.num_frames, self.sample_type,
81
+ max_num_frames=max_num_frames, client=self.client, clip=clip)
82
+ else:
83
+ frames, frame_indices, fps = self.video_reader(
84
+ data_path, self.num_frames, self.sample_type,
85
+ max_num_frames=max_num_frames, client=self.client, clip=clip
86
+ )
87
+ except Exception as e:
88
+ logger.warning(
89
+ f"Caught exception {e} when loading video {data_path}, "
90
+ f"randomly sample a new video as replacement"
91
+ )
92
+ index = random.randint(0, len(self) - 1)
93
+ ann = self.get_anno(index)
94
+ data_path = ann["image"]
95
+ continue
96
+ # shared aug for video frames
97
+ if not clip_transform:
98
+ frames = self.transform(frames)
99
+ if return_fps:
100
+ sec = [str(round(f / fps, 1)) for f in frame_indices]
101
+ return frames, index, sec
102
+ else:
103
+ return frames, index
104
+ else:
105
+ raise RuntimeError(
106
+ f"Failed to fetch video after {self.num_tries} tries. "
107
+ f"This might indicate that you have many corrupted videos."
108
+ )
dataset/it_dataset.py ADDED
@@ -0,0 +1,206 @@
1
+ import logging
2
+ import os
3
+ import json
4
+ import sqlite3
5
+ import random
6
+ from os.path import basename
7
+
8
+ import numpy as np
9
+ import datetime
10
+
11
+ from dataset.base_dataset import ImageVideoBaseDataset
12
+ from dataset.video_utils import VIDEO_READER_FUNCS
13
+
14
+ logger = logging.getLogger(__name__)
15
+ IMAGE_TOKEN="<image>"
16
+
17
+ class ITImgTrainDataset(ImageVideoBaseDataset):
18
+ media_type = "image"
19
+
20
+ def __init__(
21
+ self, ann_file, transform,
22
+ system="", role=("Human", "Assistant"),
23
+ mm_alone=True,
24
+ add_second_msg=True,
25
+ start_token="<Image>", end_token="</Image>",
26
+ random_shuffle=True, # if True, shuffle the QA list
27
+ begin_signal=None,
28
+ end_signal=None,
29
+ clip_transform=False,
30
+ skip_short_sample=False,
31
+ ):
32
+ super().__init__()
33
+ self.mm_alone = mm_alone
34
+ self.clip_transform = clip_transform
35
+ if len(ann_file) == 3 and ann_file[2] == "video":
36
+ self.media_type = "video"
37
+ else:
38
+ self.media_type = "image"
39
+ self.label_file, self.data_root = ann_file[:2]
40
+
41
+ logger.info('Load json file')
42
+ with open(self.label_file, 'r') as f:
43
+ self.anno = json.load(f)
44
+ self.num_examples = len(self.anno)
45
+ self.transform = transform
46
+ annos = []
47
+ for ann in self.anno:
48
+ filename = ann['video'] if 'video' in ann else ann['image']
49
+ if self.media_type =='video' and "webvid" in self.data_root:
50
+ video_id, extension = os.path.splitext(os.path.basename(filename))
51
+ if video_id not in self.keys_indexfile:
52
+ pass
53
+ else:
54
+ annos.append(ann)
55
+ else:
56
+
57
+ if filename is None or filename=="None":
58
+ pass
59
+ else:
60
+ if os.path.exists(os.path.join(self.data_root, filename)):
61
+ annos.append(ann)
62
+ else:
63
+ ...
64
+ self.anno = annos
65
+ self.num_examples = len(self.anno)
66
+
67
+
68
+ # prompt parameters
69
+ if system:
70
+ assert system[-1] == " ", "' ' should be add in the end of system, thus '###' will be tokenized into one token."
71
+ # currently not support add start_token and end_token in the system, since the msg should be added properly
72
+ self.begin_signal = [begin_signal for _ in role] if isinstance(begin_signal, str) else begin_signal
73
+ self.end_signal = [end_signal for _ in role] if isinstance(end_signal, str) else end_signal
74
+ self.start_token = start_token
75
+ self.end_token = end_token
76
+ self.system = system
77
+ self.role = role
78
+ self.random_shuffle = random_shuffle
79
+ # instruction location and number
80
+ logger.info(f"Random shuffle: {self.random_shuffle}")
81
+
82
+ def get_anno(self, index):
83
+ filename = self.anno[index][self.media_type]
84
+ qa = self.anno[index]["QA"]
85
+
86
+ if "start" in self.anno[index] and "end" in self.anno[index]:
87
+ anno = {
88
+ "image": os.path.join(self.data_root, filename), "qa": qa,
89
+ "start": self.anno[index]["start"], "end": self.anno[index]["end"],
90
+ }
91
+ else:
92
+ anno = {"image": os.path.join(self.data_root, filename), "qa": qa}
93
+ return anno
94
+
95
+ def __len__(self):
96
+ return self.num_examples
97
+
98
+ def process_qa(self, qa, msg=""):
99
+ cur_instruction = ""
100
+ # randomly shuffle qa for conversation
101
+ if self.random_shuffle and len(qa) > 1:
102
+ random.shuffle(qa)
103
+ if "i" in qa[0].keys() and qa[0]["i"] != "":
104
+ cur_instruction = qa[0]["i"] + self.end_signal[0]
105
+
106
+ conversation = self.system
107
+ # add instruction as system message
108
+ if cur_instruction:
109
+ conversation += cur_instruction
110
+
111
+ # rstrip() for the extra " " in msg
112
+ if self.mm_alone:
113
+ conversation += (
114
+ self.begin_signal[0] + self.role[0] +
115
+ self.start_token + self.end_token + msg.rstrip() + self.end_signal[0]
116
+ )
117
+
118
+ for i, sentence in enumerate(qa):
119
+ q = self.start_token + self.end_token+"\n"+ qa[0]["q"] if (not self.mm_alone) and (i == 0) else sentence["q"]
120
+ a = sentence["a"]
121
+ if q != "":
122
+ conversation += (self.begin_signal[0] + self.role[0] + q + self.end_signal[1])
123
+ else:
124
+ # no question, often in caption dataset
125
+ pass
126
+ conversation += (self.begin_signal[0] + self.role[1] + a + self.end_signal[1])
127
+
128
+
129
+ if cur_instruction:
130
+ cur_instruction += qa[0]["q"]
131
+ return conversation, cur_instruction.strip()
132
+
133
+ def __getitem__(self, index):
134
+ try:
135
+ ann = self.get_anno(index)
136
+ image, index = self.load_and_transform_media_data_image(index, ann["image"], clip_transform=self.clip_transform)
137
+ conversation, instruction = self.process_qa(ann["qa"])
138
+ return image, conversation, instruction, index
139
+ except Exception as e:
140
+ logger.warning(f"Caught exception {e} when loading image {ann['image']}")
141
+ index = np.random.randint(0, len(self))
142
+ return self.__getitem__(index)
143
+
144
+
145
+ class ITVidTrainDataset(ITImgTrainDataset):
146
+ media_type = "video"
147
+
148
+ def __init__(
149
+ self, ann_file, transform,
150
+ num_frames=4, video_reader_type="decord", sample_type="rand", num_tries=3,
151
+ mm_alone=True,
152
+ system="", role=("Human", "Assistant"),
153
+ start_token="<Video>", end_token="</Video>",
154
+ add_second_msg=True,
155
+ random_shuffle=True,
156
+ begin_signal=None,
157
+ end_signal=None,
158
+ clip_transform=False,
159
+ skip_short_sample=False,
160
+
161
+ ):
162
+ # "id index file for webvid"
163
+ if "webvid" in ann_file[1]:
164
+ with open("/mnt/bn/dq-storage-ckpt/xulin/datasets/videos/webvid_10m/keys_indexfile.json") as f:
165
+ self.keys_indexfile = json.load(f) # the corresponding index file for each webvid id
166
+
167
+ super().__init__(
168
+ ann_file, transform,
169
+ system=system, role=role,
170
+ mm_alone=mm_alone,
171
+ start_token=start_token, end_token=end_token,
172
+ random_shuffle=random_shuffle,
173
+ begin_signal=begin_signal,
174
+ end_signal=end_signal,
175
+ clip_transform=clip_transform,
176
+ skip_short_sample=skip_short_sample,
177
+ )
178
+ self.num_frames = num_frames
179
+ self.video_reader_type = video_reader_type
180
+ self.video_reader = VIDEO_READER_FUNCS[video_reader_type]
181
+ self.sample_type = sample_type
182
+ self.num_tries = num_tries
183
+ self.add_second_msg = add_second_msg
184
+
185
+ logger.info(f"Use {video_reader_type} for data in {ann_file}")
186
+ if add_second_msg:
187
+ logger.info(f"Add second message: The video contains X frames sampled at T seconds.")
188
+
189
+ def __getitem__(self, index):
190
+ try:
191
+ ann = self.get_anno(index)
192
+
193
+ msg = ""
194
+ clip = None
195
+ if "start" in ann and "end" in ann:
196
+ clip = [ann["start"], ann["end"]]
197
+ video, index, sec = self.load_and_transform_media_data_video(index, ann["image"], return_fps=True, clip=clip, clip_transform=self.clip_transform)
198
+ if self.add_second_msg:
199
+ # " " should be added in the start and end
200
+ msg = f" The video contains {len(sec)} frames sampled at {', '.join(sec)} seconds. "
201
+ conversation, instruction = self.process_qa(ann["qa"], msg)
202
+ return video, conversation, instruction, index
203
+ except Exception as e:
204
+ logger.warning(f"Caught exception {e} when loading video {ann['image']}")
205
+ index = np.random.randint(0, len(self))
206
+ return self.__getitem__(index)
dataset/utils.py ADDED
@@ -0,0 +1,41 @@
1
+ from utils.distributed import is_main_process, get_rank, get_world_size
2
+ import io
3
+ import json
4
+ import re
5
+ import numpy as np
6
+ from os.path import join
7
+ from tqdm import trange
8
+ from PIL import Image
9
+ from PIL import ImageFile
10
+ from torchvision.transforms import PILToTensor
11
+ ImageFile.LOAD_TRUNCATED_IMAGES = True
12
+ Image.MAX_IMAGE_PIXELS = None
13
+
14
+
15
+ def load_image_from_path(image_path, client):
16
+ if image_path.startswith('s3') or image_path.startswith('p2'):
17
+ value = client.Get(image_path)
18
+ img_bytes = np.frombuffer(value, dtype=np.uint8)
19
+ buff = io.BytesIO(img_bytes)
20
+ image = Image.open(buff).convert('RGB')
21
+ else:
22
+ image = Image.open(image_path).convert('RGB') # PIL Image
23
+ image = PILToTensor()(image).unsqueeze(0) # (1, C, H, W), torch.uint8
24
+ return image
25
+
26
+ def pre_text(text, max_l=None, pre_text=True):
27
+ if pre_text:
28
+ text = re.sub(r"([,.'!?\"()*#:;~])", '', text.lower())
29
+ text = text.replace('-', ' ').replace('/', ' ').replace('<person>', 'person')
30
+
31
+ text = re.sub(r"\s{2,}", ' ', text)
32
+ text = text.rstrip('\n').strip(' ')
33
+
34
+ if max_l: # truncate
35
+ words = text.split(' ')
36
+ if len(words) > max_l:
37
+ text = ' '.join(words[:max_l])
38
+ else:
39
+ pass
40
+ return text
41
+
dataset/video_utils.py ADDED
@@ -0,0 +1,214 @@
1
+ """
2
+ Modified from https://github.com/m-bain/frozen-in-time/blob/22a91d78405ec6032fdf521ae1ff5573358e632f/base/base_dataset.py
3
+ """
4
+ import random
5
+ import io
6
+ import os
7
+ import av
8
+ import cv2
9
+ import decord
10
+ import imageio
11
+ from decord import VideoReader
12
+
13
+ # from dataloader import KVReader
14
+ import torch
15
+ import numpy as np
16
+ import math
17
+ # import tensorflow as tf
18
+ decord.bridge.set_bridge("torch")
19
+
20
+ import logging
21
+ logger = logging.getLogger(__name__)
22
+
23
+ def pts_to_secs(pts: int, time_base: float, start_pts: int) -> float:
24
+ """
25
+ Converts a present time with the given time base and start_pts offset to seconds.
26
+
27
+ Returns:
28
+ time_in_seconds (float): The corresponding time in seconds.
29
+
30
+ https://github.com/facebookresearch/pytorchvideo/blob/main/pytorchvideo/data/utils.py#L54-L64
31
+ """
32
+ if pts == math.inf:
33
+ return math.inf
34
+
35
+ return int(pts - start_pts) * time_base
36
+
37
+
38
+ def get_pyav_video_duration(video_reader):
39
+ video_stream = video_reader.streams.video[0]
40
+ video_duration = pts_to_secs(
41
+ video_stream.duration,
42
+ video_stream.time_base,
43
+ video_stream.start_time
44
+ )
45
+ return float(video_duration)
46
+
47
+
48
+ def get_frame_indices_by_fps():
49
+ pass
50
+
51
+
52
+ def get_frame_indices(num_frames, vlen, sample='rand', fix_start=None, input_fps=1, max_num_frames=-1):
53
+ if sample in ["rand", "middle"]: # uniform sampling
54
+ acc_samples = min(num_frames, vlen)
55
+ # split the video into `acc_samples` intervals, and sample from each interval.
56
+ intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int)
57
+ ranges = []
58
+ for idx, interv in enumerate(intervals[:-1]):
59
+ ranges.append((interv, intervals[idx + 1] - 1))
60
+ if sample == 'rand':
61
+ try:
62
+ frame_indices = [random.choice(range(x[0], x[1])) for x in ranges]
63
+ except IndexError:  # random.choice fails on an empty interval (very short videos); fall back to a random permutation
64
+ frame_indices = np.random.permutation(vlen)[:acc_samples]
65
+ frame_indices.sort()
66
+ frame_indices = list(frame_indices)
67
+ elif fix_start is not None:
68
+ frame_indices = [x[0] + fix_start for x in ranges]
69
+ elif sample == 'middle':
70
+ frame_indices = [(x[0] + x[1]) // 2 for x in ranges]
71
+ else:
72
+ raise NotImplementedError
73
+
74
+ if len(frame_indices) < num_frames: # padded with last frame
75
+ padded_frame_indices = [frame_indices[-1]] * num_frames
76
+ padded_frame_indices[:len(frame_indices)] = frame_indices
77
+ frame_indices = padded_frame_indices
78
+ elif "fps" in sample: # fps0.5, sequentially sample frames at 0.5 fps
79
+ output_fps = float(sample[3:])
80
+ duration = float(vlen) / input_fps
81
+ delta = 1 / output_fps # gap between frames, this is also the clip length each frame represents
82
+ frame_seconds = np.arange(0 + delta / 2, duration + delta / 2, delta)
83
+ frame_indices = np.around(frame_seconds * input_fps).astype(int)
84
+ frame_indices = [e for e in frame_indices if e < vlen]
85
+ if max_num_frames > 0 and len(frame_indices) > max_num_frames:
86
+ frame_indices = frame_indices[:max_num_frames]
87
+ # frame_indices = np.linspace(0 + delta / 2, duration + delta / 2, endpoint=False, num=max_num_frames)
88
+ else:
89
+ raise ValueError
90
+ return frame_indices
91
+
92
+
93
+ def read_frames_av(
94
+ video_path, num_frames, sample='rand', fix_start=None,
95
+ max_num_frames=-1, client=None, clip=None,
96
+ ):
97
+ reader = av.open(video_path)
98
+ frames = [torch.from_numpy(f.to_rgb().to_ndarray()) for f in reader.decode(video=0)]
99
+ vlen = len(frames)
100
+ duration = get_pyav_video_duration(reader)
101
+ fps = vlen / float(duration)
102
+ frame_indices = get_frame_indices(
103
+ num_frames, vlen, sample=sample, fix_start=fix_start,
104
+ input_fps=fps, max_num_frames=max_num_frames
105
+ )
106
+ frames = torch.stack([frames[idx] for idx in frame_indices]) # (T, H, W, C), torch.uint8
107
+ frames = frames.permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8
108
+ return frames, frame_indices, fps
109
+
110
+
111
+ def read_frames_gif(
112
+ video_path, num_frames, sample='rand', fix_start=None,
113
+ max_num_frames=-1, client=None, clip=None,
114
+ ):
115
+ if video_path.startswith('s3') or video_path.startswith('p2'):
116
+ video_bytes = client.get(video_path)
117
+ gif = imageio.get_reader(io.BytesIO(video_bytes))
118
+ else:
119
+ gif = imageio.get_reader(video_path)
120
+ vlen = len(gif)
121
+ frame_indices = get_frame_indices(
122
+ num_frames, vlen, sample=sample, fix_start=fix_start,
123
+ max_num_frames=max_num_frames
124
+ )
125
+ frames = []
126
+ for index, frame in enumerate(gif):
127
+ # for index in frame_idxs:
128
+ if index in frame_indices:
129
+ frame = cv2.cvtColor(frame, cv2.COLOR_RGBA2RGB)
130
+ frame = torch.from_numpy(frame).byte()
131
+ # # (H x W x C) to (C x H x W)
132
+ frame = frame.permute(2, 0, 1)
133
+ frames.append(frame)
134
+ frames = torch.stack(frames) # .float() / 255
135
+
136
+ return frames, frame_indices, 25. # for tgif
137
+
138
+
139
+ def read_frames_hdfs(ind_file, vid, num_frames, sample='rand',fix_start=None,
140
+ max_num_frames=-1, client=None, clip=None):
141
+ _context_features = {'title': tf.io.FixedLenFeature([], dtype=tf.string)}
142
+ _sequence_features = {'data': tf.io.FixedLenSequenceFeature([], dtype=tf.string)}
143
+ num_parallel_reader = 1
144
+ filename, extension = os.path.splitext(ind_file)
145
+ reader = KVReader(filename, num_parallel_reader)
146
+ key = vid
147
+ values = reader.read_many([key])
148
+ item = values[0]
149
+ contexts, sequences = tf.io.parse_single_sequence_example(
150
+ serialized=item,
151
+ context_features=_context_features,
152
+ sequence_features=_sequence_features)
153
+
154
+ # text = contexts['title'].numpy().decode("utf-8")
155
+ rawframes = sequences['data']
156
+ vlen = len(rawframes)
157
+ sample="rand"
158
+
159
+ frame_indices = get_frame_indices(num_frames, vlen, sample=sample,
160
+ fix_start=fix_start,
161
+ max_num_frames=max_num_frames)
162
+ def read_image(raw_data):
163
+ return tf.image.decode_jpeg(raw_data, channels=3, dct_method='INTEGER_ACCURATE').numpy()
164
+
165
+ frames = []
166
+ for index, frame in enumerate(rawframes):
167
+ if index in frame_indices:
168
+ frame = read_image(frame)
169
+ frame = torch.as_tensor(frame)
170
+ frames.append(frame)
171
+
172
+ frames = torch.stack(frames)
173
+ # print("in hdfs========>",frames[0])
174
+ frames = frames.permute(0, 3, 1, 2)
175
+ return frames, frame_indices, 25 # the fps is not stored in the HDFS index, so assume 25
176
+
177
+
178
+ def read_frames_decord(
179
+ video_path, num_frames, sample='rand', fix_start=None,
180
+ max_num_frames=-1, client=None, clip=None
181
+ ):
182
+ if video_path.startswith('s3') or video_path.startswith('p2'):
183
+ video_bytes = client.get(video_path)
184
+ video_reader = VideoReader(io.BytesIO(video_bytes), num_threads=1)
185
+ else:
186
+ video_reader = VideoReader(video_path, num_threads=1)
187
+ vlen = len(video_reader)
188
+ fps = video_reader.get_avg_fps()
189
+ duration = vlen / float(fps)
190
+
191
+ if clip:
192
+ start, end = clip
193
+ duration = end - start
194
+ vlen = int(duration * fps)
195
+ start_index = int(start * fps)
196
+
197
+ frame_indices = get_frame_indices(
198
+ num_frames, vlen, sample=sample, fix_start=fix_start,
199
+ input_fps=fps, max_num_frames=max_num_frames
200
+ )
201
+ if clip:
202
+ frame_indices = [f + start_index for f in frame_indices]
203
+
204
+ frames = video_reader.get_batch(frame_indices) # (T, H, W, C), torch.uint8
205
+ frames = frames.permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8
206
+ return frames, frame_indices, float(fps)
207
+
208
+
209
+ VIDEO_READER_FUNCS = {
210
+ 'av': read_frames_av,
211
+ 'decord': read_frames_decord,
212
+ 'gif': read_frames_gif,
213
+ 'hdfs': read_frames_hdfs,
214
+ }
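`VIDEO_READER_FUNCS` maps a backend name to a reader that returns `(frames, frame_indices, fps)`, with `frames` shaped `(T, C, H, W)` in `torch.uint8`. A minimal usage sketch (it assumes the decord backend is installed and uses the bundled `example/working.mp4`; `num_frames=16` is an arbitrary choice):

from dataset.video_utils import VIDEO_READER_FUNCS

read_frames = VIDEO_READER_FUNCS["decord"]  # dispatches to read_frames_decord
frames, frame_indices, fps = read_frames(
    "example/working.mp4", num_frames=16, sample="middle"
)
print(frames.shape, frames.dtype, fps)  # e.g. torch.Size([16, 3, H, W]) torch.uint8 <video fps>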
docs/PoolLLaVA_Report.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6b9f175bd915cdc6f9791a95149992fde1f48ebfffa6c8bff9e6365b7186c57d
3
+ size 3850702
example/1917.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:99f5f2a10985964ddc0555a8fa12b9d41f130b49ad62879a9e150d91834e93d5
3
+ size 1535936
example/bear.jpg ADDED

Git LFS Details

  • SHA256: 286b3a5693322edf01870a561e35016ed46a7cb4b9194c58e2f3526eab1f9efc
  • Pointer size: 131 Bytes
  • Size of remote file: 376 kB
example/cooking.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6a1395530cc13c0441ae99ce66477f533f6009ebdb913064aec91e38eaf3b8e9
3
+ size 876622
example/dog.png ADDED

Git LFS Details

  • SHA256: 919b6e24d3cc7d7998181029fb76e94d8149e6a9d2c4930445fa217f6715716d
  • Pointer size: 131 Bytes
  • Size of remote file: 563 kB
example/jesse_dance.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f1fc41c6ebae0692726ea56b33ba711f21186fd4203ac54cd43a5cd898be4350
3
+ size 1221420
example/working.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:09372cdb6b0ea272868b4469d5067674670a948962f1236196e8f23e1f7ce764
3
+ size 4718899
example/yoga.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:74b65d9bec7f83e487b7f923076c01d476dd2ef7ed83928a696ab6f88c7751b7
3
+ size 776184
models/__init__.py ADDED
File without changes
models/pllava/__init__.py ADDED
@@ -0,0 +1,55 @@
1
+ # Copyright 2023 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ from typing import TYPE_CHECKING
15
+
16
+ from transformers.utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
17
+
18
+
19
+ _import_structure = {"configuration_pllava": ["PLLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP", "PllavaConfig"]}
20
+
21
+ try:
22
+ if not is_torch_available():
23
+ raise OptionalDependencyNotAvailable()
24
+ except OptionalDependencyNotAvailable:
25
+ pass
26
+ else:
27
+ _import_structure["modeling_pllava"] = [
28
+ "PLLAVA_PRETRAINED_MODEL_ARCHIVE_LIST",
29
+ "PllavaForConditionalGeneration",
30
+ "PllavaPreTrainedModel",
31
+ ]
32
+ _import_structure["processing_pllava"] = ["PllavaProcessor"]
33
+
34
+
35
+ if TYPE_CHECKING:
36
+ from .configuration_pllava import PLLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP, PllavaConfig
37
+
38
+ try:
39
+ if not is_torch_available():
40
+ raise OptionalDependencyNotAvailable()
41
+ except OptionalDependencyNotAvailable:
42
+ pass
43
+ else:
44
+ from .modeling_pllava import (
45
+ PLLAVA_PRETRAINED_MODEL_ARCHIVE_LIST,
46
+ PllavaForConditionalGeneration,
47
+ PllavaPreTrainedModel,
48
+ )
49
+ from .processing_pllava import PllavaProcessor
50
+
51
+
52
+ else:
53
+ import sys
54
+
55
+ sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
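Because of the `_LazyModule` indirection above, `modeling_pllava` and `processing_pllava` are only imported the first time one of their exported names is touched. A usage sketch (the checkpoint path `MODELS/pllava-7b` is a placeholder for wherever the weights were downloaded):

from models.pllava import PllavaConfig, PllavaForConditionalGeneration, PllavaProcessor

config = PllavaConfig.from_pretrained("MODELS/pllava-7b")   # first access triggers the lazy import
model = PllavaForConditionalGeneration.from_pretrained("MODELS/pllava-7b", config=config)
processor = PllavaProcessor.from_pretrained("MODELS/pllava-7b")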
models/pllava/configuration_pllava.py ADDED
@@ -0,0 +1,149 @@
1
+ # coding=utf-8
2
+ # Copyright 2023 Microsoft Research & University of Wisconsin-Madison and the HuggingFace Inc. team. All rights reserved.
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ """ Llava model configuration"""
15
+
16
+ from transformers.configuration_utils import PretrainedConfig
17
+ from transformers.utils import logging
18
+ from transformers.models.auto import CONFIG_MAPPING
19
+
20
+
21
+ logger = logging.get_logger(__name__)
22
+
23
+ PLLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
24
+ "llava-hf/llava-v1.5-7b": "https://huggingface.co/llava-hf/llava-v1.5-7b/resolve/main/config.json",
25
+ }
26
+
27
+
28
+ class PllavaConfig(PretrainedConfig):
29
+ r"""
30
+ This is the configuration class to store the configuration of a [`LlavaForConditionalGeneration`]. It is used to instantiate an
31
+ Llava model according to the specified arguments, defining the model architecture. Instantiating a configuration
32
+ with the defaults will yield a similar configuration to that of the Llava-9B.
33
+
34
+ e.g. [llava-hf/llava-9b](https://huggingface.co/llava-hf/llava-9b)
35
+
36
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
37
+ documentation from [`PretrainedConfig`] for more information.
38
+
39
+ Args:
40
+ vision_config (`LlavaVisionConfig`, *optional*):
41
+ Custom vision config or dict
42
+ text_config (`Union[AutoConfig, dict]`, *optional*):
43
+ The config object of the text backbone. Can be any of `LlamaConfig` or `MistralConfig`.
44
+ ignore_index (`int`, *optional*, defaults to -100):
45
+ The ignore index for the loss function.
46
+ image_token_index (`int`, *optional*, defaults to 32000):
47
+ The image token index to encode the image prompt.
48
+ projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
49
+ The activation function used by the multimodal projector.
50
+ vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
51
+ The feature selection strategy used to select the vision feature from the CLIP backbone.
52
+ vision_feature_layer (`int`, *optional*, defaults to -2):
53
+ The index of the layer to select the vision feature.
54
+ vocab_size (`int`, *optional*, defaults to 32000):
55
+ Vocabulary size of the Llava model. Defines the number of different tokens that can be represented by the
56
+ `inputs_ids` passed when calling [`~LlavaForConditionalGeneration`]
57
+
58
+ Example:
59
+
60
+ ```python
61
+ >>> from transformers import LlavaForConditionalGeneration, LlavaConfig, CLIPVisionConfig, LlamaConfig
62
+
63
+ >>> # Initializing a CLIP-vision config
64
+ >>> vision_config = CLIPVisionConfig()
65
+
66
+ >>> # Initializing a Llama config
67
+ >>> text_config = LlamaConfig()
68
+
69
+ >>> # Initializing a Llava llava-1.5-7b style configuration
70
+ >>> configuration = LlavaConfig(vision_config, text_config)
71
+
72
+ >>> # Initializing a model from the llava-1.5-7b style configuration
73
+ >>> model = LlavaForConditionalGeneration(configuration)
74
+
75
+ >>> # Accessing the model configuration
76
+ >>> configuration = model.config
77
+ ```"""
78
+
79
+ model_type = "llava"
80
+ is_composition = False
81
+
82
+ def __init__(
83
+ self,
84
+ vision_config=None,
85
+ text_config=None,
86
+ ignore_index=-100,
87
+ image_token_index=32000,
88
+ projector_hidden_act="gelu",
89
+ vision_feature_select_strategy="default",
90
+ vision_feature_layer=-2,
91
+ vocab_size=32000,
92
+ pooling_method='avg',
93
+ pooling_shape=(8, 16, 16),
94
+ frame_shape=(24, 24), # llava 1.5 pretrained frame shape
95
+ num_frames=1, # llava 1.5 was pretrained on single-frame (image) inputs
96
+ use_pooling=True,
97
+ gradient_checkpointing=False,
98
+ **kwargs,
99
+ ):
100
+ self.ignore_index = ignore_index
101
+ self.image_token_index = image_token_index
102
+ self.projector_hidden_act = projector_hidden_act
103
+ self.vision_feature_select_strategy = vision_feature_select_strategy
104
+ self.vision_feature_layer = vision_feature_layer
105
+ self.vocab_size = vocab_size
106
+ self.use_pooling = use_pooling
107
+ self.gradient_checkpointing = gradient_checkpointing
108
+
109
+ self.vision_config = vision_config
110
+
111
+ self.pooling_method = pooling_method # should be in 'max', 'avg'
112
+ self.pooling_shape = pooling_shape # (t, h, w) target shape of the adaptive pooling over (frames, height, width)
113
+ self.frame_shape = frame_shape # (h, w) patch grid of a single frame from the vision tower
114
+ self.num_frames = num_frames
115
+ if isinstance(self.vision_config, dict):
116
+ vision_config["model_type"] = (
117
+ vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
118
+ )
119
+ self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
120
+ elif vision_config is None:
121
+ self.vision_config = CONFIG_MAPPING["clip_vision_model"](
122
+ intermediate_size=4096,
123
+ hidden_size=1024,
124
+ patch_size=14,
125
+ image_size=336,
126
+ num_hidden_layers=24,
127
+ num_attention_heads=16,
128
+ vocab_size=32000,
129
+ projection_dim=768,
130
+ )
131
+ self.vocab_size = self.vocab_size  # no-op kept from the upstream Llava config; vocab_size stays at its default here
132
+
133
+ self.text_config = text_config
134
+
135
+ if isinstance(self.text_config, dict):
136
+ text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
137
+ self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
138
+ self.vocab_size = self.text_config.vocab_size
139
+ self.text_config.gradient_checkpointing = self.gradient_checkpointing
140
+
141
+ elif text_config is None:
142
+ tmp_config = {"_attn_implementation":"flash_attention_2",
143
+ "gradient_checkpointing": self.gradient_checkpointing}
144
+ self.text_config = CONFIG_MAPPING["llama"](**tmp_config)
145
+ self.text_config.gradient_checkpointing = self.gradient_checkpointing
146
+ # self.text_config["_attn_implementation"]="flash_attention_2" # xl: temporal hard code
147
+
148
+
149
+ super().__init__(**kwargs)
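Relative to the stock Llava config, `PllavaConfig` adds the pooling controls: `pooling_method`, `pooling_shape`, `num_frames`, `frame_shape` and `use_pooling`. A minimal construction sketch (the 16-frame numbers are illustrative, not the released training settings):

from models.pllava.configuration_pllava import PllavaConfig

config = PllavaConfig(
    num_frames=16,               # frames fed to the vision tower per video
    frame_shape=(24, 24),        # 336 / 14 = 24 patches per side for CLIP ViT-L/336
    pooling_shape=(16, 12, 12),  # (t, h, w) target of the AdaptiveAvgPool3d in the projector
    pooling_method="avg",
)
print(config.pooling_shape, config.vision_config.image_size)  # (16, 12, 12) 336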
models/pllava/convert_pllava_weights_to_hf.py ADDED
@@ -0,0 +1 @@
1
+ # Not yet
models/pllava/modeling_pllava.py ADDED
@@ -0,0 +1,626 @@
1
+ # coding=utf-8
2
+ # Copyright 2023 the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ PyTorch Llava model."""
16
+ from dataclasses import dataclass
17
+ from typing import List, Optional, Tuple, Union
18
+ import math
19
+
20
+ import torch
21
+ import torch.utils.checkpoint
22
+ from torch import nn
23
+ import os
24
+ from transformers import PreTrainedModel
25
+ from transformers.activations import ACT2FN
26
+ from transformers.cache_utils import Cache
27
+ from transformers.modeling_outputs import ModelOutput
28
+ from transformers.utils import (
29
+ add_start_docstrings,
30
+ add_start_docstrings_to_model_forward,
31
+ logging,
32
+ replace_return_docstrings,
33
+ )
34
+ from transformers.models.auto import AutoModel, AutoModelForCausalLM
35
+ import einops
36
+
37
+ from .configuration_pllava import PllavaConfig
38
+ import pickle
39
+
40
+ logger = logging.get_logger(__name__)
41
+
42
+ _CONFIG_FOR_DOC = "LlavaConfig"
43
+
44
+ PLLAVA_PRETRAINED_MODEL_ARCHIVE_LIST = [
45
+ "",
46
+ "",
47
+ "",
48
+ # See all Llava models at https://huggingface.co/models?filter=llava
49
+ ]
50
+
51
+
52
+ @dataclass
53
+ # Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->Llava
54
+ class PllavaCausalLMOutputWithPast(ModelOutput):
55
+ """
56
+ Base class for Llava causal language model (or autoregressive) outputs.
57
+
58
+ Args:
59
+ loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
60
+ Language modeling loss (for next-token prediction).
61
+ logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
62
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
63
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
64
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
65
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`)
66
+
67
+ Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
68
+ `past_key_values` input) to speed up sequential decoding.
69
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
70
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
71
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
72
+
73
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
74
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
75
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
76
+ sequence_length)`.
77
+
78
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
79
+ heads.
80
+ image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
81
+ Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
82
+ sequence_length, hidden_size)`.
83
+
84
+ image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
85
+ """
86
+
87
+ loss: Optional[torch.FloatTensor] = None
88
+ logits: torch.FloatTensor = None
89
+ past_key_values: Optional[List[torch.FloatTensor]] = None
90
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
91
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
92
+ image_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
93
+
94
+ class PllavaMultiModalProjector(nn.Module):
95
+ supported_highres = ['pad_crop_four', 'slide', ]
96
+ def __init__(self, config: PllavaConfig):
97
+ super().__init__()
98
+ self.use_pooling = config.use_pooling
99
+ self.frame_shape=config.frame_shape
100
+ self.num_frames = config.num_frames
101
+ self.pooling_shape = config.pooling_shape
102
+
103
+ self.pooling = nn.AdaptiveAvgPool3d(config.pooling_shape)
104
+ self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
105
+ self.act = ACT2FN[config.projector_hidden_act]
106
+ self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)
107
+
108
+ def convert_Fembeddings2video(self, input, num_videos, frame_shape):
109
+ input = einops.rearrange(input,
110
+ '(num_videos num_frames) (h w) embed_dims -> num_videos embed_dims num_frames h w',
111
+ num_videos=num_videos, h=frame_shape[0])
112
+ return input
113
+
114
+ def convert_video2Fembeddings(self, input):
115
+ input = einops.rearrange(input, 'num_videos embed_dims num_frames h w -> (num_videos num_frames) (h w) embed_dims ', )
116
+ return input
117
+
118
+ def convert_video2MMembeddings(self, input):
119
+ input = einops.rearrange(input, 'num_videos embed_dims num_frames h w -> num_videos (num_frames h w) embed_dims ', )
120
+ return input
121
+
122
+ def forward(self, image_features, media_type, batch_size=None, num_videos=None):
123
+ frame_shape = self.frame_shape
124
+ num_frames = self.num_frames
125
+ assert media_type in ( 'video', 'image'), f'only image or video, but got media_type {media_type}'
126
+ hidden_states = image_features
127
+
128
+ if media_type == 'image':
129
+ hidden_states = hidden_states.repeat(num_frames, 1, 1)
130
+
131
+ total_frames, spatial_seqlen, embed_dims = hidden_states.shape
132
+ #TODO: temporal code, should ensure num_frames == total frames in data loading later
133
+ if total_frames < num_frames and self.use_pooling: #
134
+ multiplier = int(num_frames/total_frames)+1
135
+ hidden_states= hidden_states.repeat_interleave(multiplier, dim=0)[:num_frames]
136
+ total_frames, spatial_seqlen, embed_dims = hidden_states.shape
137
+
138
+ assert total_frames % num_frames == 0
139
+ assert frame_shape[0] * frame_shape[1] == spatial_seqlen
140
+ hidden_states = self.linear_1(hidden_states)
141
+ hidden_states = self.act(hidden_states)
142
+ hidden_states = self.linear_2(hidden_states)
143
+ hidden_states_videos = self.convert_Fembeddings2video(hidden_states, num_videos * batch_size, frame_shape)
144
+ hidden_states_videos = self.pooling(hidden_states_videos)
145
+ hidden_states = einops.rearrange(hidden_states_videos, 'batch_size_num_videos embed_dims num_frames h w -> batch_size_num_videos num_frames (h w) embed_dims', )
146
+ hidden_states = einops.rearrange(hidden_states, 'batch_size_num_videos num_frames hw embed_dims -> batch_size_num_videos (num_frames hw) embed_dims ')
147
+ return hidden_states
148
+
149
+
150
+
151
+ PLLAVA_START_DOCSTRING = r"""
152
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
153
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
154
+ etc.)
155
+
156
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
157
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
158
+ and behavior.
159
+
160
+ Parameters:
161
+ config ([`LlavaConfig`] or [`LlavaVisionConfig`]):
162
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
163
+ load the weights associated with the model, only the configuration. Check out the
164
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
165
+ """
166
+
167
+
168
+ @add_start_docstrings(
169
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
170
+ PLLAVA_START_DOCSTRING,
171
+ )
172
+ class PllavaPreTrainedModel(PreTrainedModel):
173
+ config_class = PllavaConfig
174
+ base_model_prefix = "model"
175
+ supports_gradient_checkpointing = True
176
+ _no_split_modules = ["LlavaVisionAttention"]
177
+ _skip_keys_device_placement = "past_key_values"
178
+ _supports_flash_attn_2 = True
179
+
180
+ def _init_weights(self, module):
181
+ # important: this ported version of Llava isn't meant for training from scratch - only
182
+ # inference and fine-tuning - so the proper init weights code has been removed - the original codebase
183
+ # https://github.com/haotian-liu/LLaVA/tree/main/llava should serve for that purpose
184
+ std = (
185
+ self.config.initializer_range
186
+ if hasattr(self.config, "initializer_range")
187
+ else self.config.text_config.initializer_range
188
+ )
189
+
190
+ if hasattr(module, "class_embedding"):
191
+ module.class_embedding.data.normal_(mean=0.0, std=std)
192
+
193
+ # if isinstance(module, (nn.Linear, nn.Conv2d)):
194
+ # module.weight.data.normal_(mean=0.0, std=std)
195
+ # if module.bias is not None:
196
+ # module.bias.data.zero_()
197
+
198
+ elif isinstance(module, nn.Embedding):
199
+ module.weight.data.normal_(mean=0.0, std=std)
200
+ if module.padding_idx is not None:
201
+ module.weight.data[module.padding_idx].zero_()
202
+
203
+ elif isinstance(module, PllavaMultiModalProjector):
204
+ # module.register_embed.data.normal_(mean=0.0, std=std)
205
+ if getattr(self.config, "register", False):  # `register` is not defined on every PllavaConfig
206
+ module.register_embed.data.zero_()
207
+
208
+ @property
209
+ def _supports_sdpa(self):
210
+ """
211
+ Retrieve language_model's attribute to check whether the model supports
212
+ SDPA or not.
213
+ """
214
+ return self.language_model._supports_sdpa
215
+
216
+
217
+ PLLAVA_INPUTS_DOCSTRING = r"""
218
+ Args:
219
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
220
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
221
+ it.
222
+
223
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
224
+ [`PreTrainedTokenizer.__call__`] for details.
225
+
226
+ [What are input IDs?](../glossary#input-ids)
227
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`):
228
+ The tensors corresponding to the input images. Pixel values can be obtained using
229
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details ([`LlavaProcessor`] uses
230
+ [`CLIPImageProcessor`] for processing images).
231
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
232
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
233
+
234
+ - 1 for tokens that are **not masked**,
235
+ - 0 for tokens that are **masked**.
236
+
237
+ [What are attention masks?](../glossary#attention-mask)
238
+
239
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
240
+ [`PreTrainedTokenizer.__call__`] for details.
241
+
242
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
243
+ `past_key_values`).
244
+
245
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
246
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
247
+ information on the default strategy.
248
+
249
+ - 1 indicates the head is **not masked**,
250
+ - 0 indicates the head is **masked**.
251
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
252
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
253
+ config.n_positions - 1]`. [What are position IDs?](../glossary#position-ids)
254
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
255
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
256
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
257
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
258
+
259
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
260
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
261
+
262
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
263
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
264
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
265
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
266
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
267
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
268
+ model's internal embedding lookup matrix.
269
+ use_cache (`bool`, *optional*):
270
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
271
+ `past_key_values`).
272
+ output_attentions (`bool`, *optional*):
273
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
274
+ tensors for more detail.
275
+ output_hidden_states (`bool`, *optional*):
276
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
277
+ more detail.
278
+ return_dict (`bool`, *optional*):
279
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
280
+ """
281
+
282
+
283
+ @add_start_docstrings(
284
+ """The LLAVA model which consists of a vision backbone and a language model.""",
285
+ PLLAVA_START_DOCSTRING,
286
+ )
287
+ class PllavaForConditionalGeneration(PllavaPreTrainedModel):
288
+ def __init__(self, config: PllavaConfig):
289
+ super().__init__(config)
290
+ self.config = config
291
+ self.vision_tower = AutoModel.from_config(config.vision_config)
292
+ self.multi_modal_projector = PllavaMultiModalProjector(config)
293
+ self.vocab_size = config.vocab_size
294
+ # self.language_model = AutoModelForCausalLM.from_config(config.text_config, torch_dtype=config.torch_dtype, attn_implementation="flash_attention_2")
295
+ self.language_model = AutoModelForCausalLM.from_config(config.text_config, torch_dtype=config.torch_dtype, attn_implementation="eager")
296
+ self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else self.config.text_config.pad_token_id
297
+ assert self.pad_token_id is not None, 'provide the model with a pad_token_id; it is used when arranging the new embeddings'
298
+ self.post_init()
299
+
300
+ def get_input_embeddings(self):
301
+ return self.language_model.get_input_embeddings()
302
+
303
+ def set_input_embeddings(self, value):
304
+ self.language_model.set_input_embeddings(value)
305
+
306
+ def get_output_embeddings(self):
307
+ return self.language_model.get_output_embeddings()
308
+
309
+ def set_output_embeddings(self, new_embeddings):
310
+ self.language_model.set_output_embeddings(new_embeddings)
311
+
312
+ def set_decoder(self, decoder):
313
+ self.language_model.set_decoder(decoder)
314
+
315
+ def get_decoder(self):
316
+ return self.language_model.get_decoder()
317
+
318
+ def tie_weights(self):
319
+ return self.language_model.tie_weights()
320
+
321
+ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
322
+ model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
323
+ # update vocab size
324
+ self.config.text_config.vocab_size = model_embeds.num_embeddings
325
+ self.config.vocab_size = model_embeds.num_embeddings
326
+ self.vocab_size = model_embeds.num_embeddings
327
+ return model_embeds
328
+
329
+ def _merge_input_ids_with_image_features(self, image_features, inputs_embeds, input_ids, attention_mask, labels):
330
+ num_images, num_image_patches, embed_dim = image_features.shape
331
+ batch_size, sequence_length = input_ids.shape
332
+ left_padding = not torch.sum(input_ids[:, -1] == torch.tensor(self.pad_token_id))
333
+ # 1. Create a mask to know where special image tokens are
334
+ special_image_token_mask = input_ids == self.config.image_token_index
335
+ num_special_image_tokens = torch.sum(special_image_token_mask, dim=-1)
336
+ # Compute the maximum embed dimension
337
+ max_embed_dim = (num_special_image_tokens.max() * (num_image_patches - 1)) + sequence_length
338
+ batch_indices, non_image_indices = torch.where(input_ids != self.config.image_token_index)
339
+
340
+ # 2. Compute the positions where text should be written
341
+ # Calculate new positions for text tokens in merged image-text sequence.
342
+ # `special_image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
343
+ # `torch.cumsum` computes how each image token shifts subsequent text token positions.
344
+ # - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
345
+ new_token_positions = torch.cumsum((special_image_token_mask * (num_image_patches - 1) + 1), -1) - 1
346
+ nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
347
+ if left_padding:
348
+ new_token_positions += nb_image_pad[:, None] # offset for left padding
349
+ text_to_overwrite = new_token_positions[batch_indices, non_image_indices]
350
+
351
+ # 3. Create the full embedding, already padded to the maximum position
352
+ final_embedding = torch.zeros(
353
+ batch_size, max_embed_dim, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
354
+ )
355
+ final_attention_mask = torch.zeros(
356
+ batch_size, max_embed_dim, dtype=attention_mask.dtype, device=inputs_embeds.device
357
+ )
358
+ if labels is not None:
359
+ final_labels = torch.full(
360
+ (batch_size, max_embed_dim), self.config.ignore_index, dtype=input_ids.dtype, device=input_ids.device
361
+ )
362
+ # In case the Vision model or the Language model has been offloaded to CPU, we need to manually
363
+ # set the corresponding tensors into their correct target device.
364
+ target_device = inputs_embeds.device
365
+ batch_indices, non_image_indices, text_to_overwrite = (
366
+ batch_indices.to(target_device),
367
+ non_image_indices.to(target_device),
368
+ text_to_overwrite.to(target_device),
369
+ )
370
+ attention_mask = attention_mask.to(target_device)
371
+
372
+ # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
373
+ # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
374
+ final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
375
+ final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
376
+ if labels is not None:
377
+ final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]
378
+
379
+ # 5. Fill the embeddings corresponding to the images. Anything that is still zeros needs filling
380
+ image_to_overwrite = torch.all(final_embedding == 0, dim=-1)
381
+ image_to_overwrite &= image_to_overwrite.cumsum(-1) > nb_image_pad[:, None].to(target_device)
382
+
383
+ # # something really weird here.
384
+ # temp1 = (image_to_overwrite.cumsum(-1) > nb_image_pad[:, None].to(target_device)) & image_to_overwrite
385
+ # # this is for right padding
386
+ # temp2 = (image_to_overwrite.cumsum(-1) <= num_special_image_tokens.max() * num_image_patches - nb_image_pad[:, None]) & image_to_overwrite
387
+
388
+ if image_to_overwrite.sum() != image_features.shape[:-1].numel():
389
+ raise ValueError(
390
+ f"The input provided to the model are wrong. The number of image tokens is {torch.sum(special_image_token_mask)} while"
391
+ f" the number of image given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
392
+ )
393
+
394
+ final_embedding[image_to_overwrite] = image_features.contiguous().reshape(-1, embed_dim).to(target_device)
395
+ final_attention_mask |= image_to_overwrite
396
+ position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
397
+
398
+ if labels is None:
399
+ final_labels = None
400
+
401
+ return final_embedding, final_attention_mask, final_labels, position_ids
402
+
403
+ @add_start_docstrings_to_model_forward(PLLAVA_INPUTS_DOCSTRING)
404
+ @replace_return_docstrings(output_type=PllavaCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
405
+ def forward(
406
+ self,
407
+ input_ids: torch.LongTensor = None,
408
+ pixel_values: torch.FloatTensor = None,
409
+ attention_mask: Optional[torch.Tensor] = None,
410
+ media_type: str = None,
411
+ position_ids: Optional[torch.LongTensor] = None,
412
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
413
+ inputs_embeds: Optional[torch.FloatTensor] = None,
414
+ vision_feature_layer: Optional[int] = None,
415
+ vision_feature_select_strategy: Optional[str] = None,
416
+ labels: Optional[torch.LongTensor] = None,
417
+ use_cache: Optional[bool] = None,
418
+ output_attentions: Optional[bool] = None,
419
+ output_hidden_states: Optional[bool] = None,
420
+ return_dict: Optional[bool] = None,
421
+ ) -> Union[Tuple, PllavaCausalLMOutputWithPast]:
422
+ r"""
423
+ Args:
424
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
425
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
426
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
427
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
428
+
429
+ Returns:
430
+
431
+ Example:
432
+
433
+ ```python
434
+ >>> from PIL import Image
435
+ >>> import requests
436
+ >>> from transformers import AutoProcessor, LlavaForConditionalGeneration
437
+
438
+ >>> model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
439
+ >>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
440
+
441
+ >>> prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"
442
+ >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
443
+ >>> image = Image.open(requests.get(url, stream=True).raw)
444
+
445
+ >>> inputs = processor(text=prompt, images=image, return_tensors="pt")
446
+
447
+ >>> # Generate
448
+ >>> generate_ids = model.generate(**inputs, max_length=30)
449
+ >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
450
+ "\nUSER: What's the content of the image?\nASSISTANT: The image features a stop sign on a street corner"
451
+ ```"""
452
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
453
+ output_hidden_states = (
454
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
455
+ )
456
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
457
+ vision_feature_layer = (
458
+ vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
459
+ )
460
+ vision_feature_select_strategy = (
461
+ vision_feature_select_strategy
462
+ if vision_feature_select_strategy is not None
463
+ else self.config.vision_feature_select_strategy
464
+ )
465
+
466
+ if inputs_embeds is None:
467
+ # 1. Extract the input embeddings
468
+ no_img_input_ids = torch.where(input_ids!=self.config.image_token_index, input_ids, self.pad_token_id) # some models use the full embedding table, so swap image tokens for the pad token before the lookup
469
+ inputs_embeds = self.get_input_embeddings()(no_img_input_ids)
470
+ batch_size = inputs_embeds.shape[0]
471
+ # 2. Merge text and images
472
+ if pixel_values is not None and input_ids.shape[1] != 1:
473
+ image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
474
+ # this is not memory efficient: output_hidden_states=True keeps the hidden states of every layer
475
+ selected_image_feature = image_outputs.hidden_states[vision_feature_layer] # ( b, img_seqlen, embed_dim)
476
+ if vision_feature_select_strategy == "default":
477
+ selected_image_feature = selected_image_feature[:, 1:]
478
+ elif vision_feature_select_strategy == "full":
479
+ raise ValueError("not implemented")
480
+ selected_image_feature = selected_image_feature
481
+ else:
482
+ raise ValueError(
483
+ f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}"
484
+ )
485
+
486
+ image_features = self.multi_modal_projector(selected_image_feature,
487
+ media_type,
488
+ batch_size=batch_size,
489
+ num_videos=pixel_values.shape[0]//self.config.num_frames//batch_size,)
490
+
491
+ inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
492
+ image_features, inputs_embeds, input_ids, attention_mask, labels
493
+ )
494
+ if labels is None:
495
+ labels = torch.full_like(attention_mask, self.config.ignore_index).to(torch.long)
496
+ else:
497
+ # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
498
+ # generation with cache
499
+ if past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
500
+ # Retrieve the first layer to inspect the logits and mask out the hidden states
501
+ # that are set to 0
502
+ first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
503
+
504
+ # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
505
+ batch_index, non_attended_tokens = torch.where(first_layer_past_key_value.float().sum(-2) == 0)
506
+
507
+ # Get the target length
508
+ target_seqlen = first_layer_past_key_value.shape[-1] + 1
509
+
510
+ extended_attention_mask = torch.ones(
511
+ (attention_mask.shape[0], target_seqlen - attention_mask.shape[1]),
512
+ dtype=attention_mask.dtype,
513
+ device=attention_mask.device,
514
+ )
515
+
516
+ # Filter out only the tokens that can be un-attended, this can happen
517
+ # if one uses Llava + Fused modules where the cache on the
518
+ # first iteration is already big enough, or if one passes custom cache
519
+ valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
520
+ new_batch_index = batch_index[valid_indices]
521
+ new_non_attended_tokens = non_attended_tokens[valid_indices]
522
+
523
+ # Zero-out the places where we don't need to attend
524
+ extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0
525
+
526
+ attention_mask = torch.cat((attention_mask, extended_attention_mask), dim=1)
527
+ position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
528
+
529
+ outputs = self.language_model(
530
+ attention_mask=attention_mask,
531
+ position_ids=position_ids,
532
+ past_key_values=past_key_values,
533
+ inputs_embeds=inputs_embeds,
534
+ use_cache=use_cache,
535
+ output_attentions=output_attentions,
536
+ output_hidden_states=output_hidden_states,
537
+ return_dict=return_dict,
538
+ )
539
+
540
+ logits = outputs[0]
541
+
542
+ loss = None
543
+ if labels is not None:
544
+ # Shift so that tokens < n predict n
545
+ if attention_mask is not None:
546
+ shift_attention_mask = attention_mask[..., 1:]
547
+ shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
548
+ shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
549
+ else:
550
+ shift_logits = logits[..., :-1, :].contiguous()
551
+ shift_labels = labels[..., 1:].contiguous()
552
+ # Flatten the tokens
553
+ loss_fct = nn.CrossEntropyLoss()
554
+ loss = loss_fct(
555
+ shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1).to(shift_logits.device)
556
+ )
557
+
558
+ if not return_dict:
559
+ output = (logits,) + outputs[1:]
560
+ return (loss,) + output if loss is not None else output
561
+
562
+ return PllavaCausalLMOutputWithPast(
563
+ loss=loss,
564
+ logits=logits,
565
+ past_key_values=outputs.past_key_values,
566
+ hidden_states=outputs.hidden_states,
567
+ attentions=outputs.attentions,
568
+ )
569
+
570
+ def prepare_inputs_for_generation(
571
+ self, input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, attention_mask=None, **kwargs
572
+ ):
573
+ if past_key_values is not None:
574
+ if isinstance(past_key_values, Cache):
575
+ cache_length = past_key_values.get_seq_length()
576
+ past_length = past_key_values.seen_tokens
577
+ else:
578
+ cache_length = past_length = past_key_values[0][0].shape[2]
579
+
580
+ # Keep only the unprocessed tokens:
581
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
582
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
583
+ # input)
584
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
585
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
586
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
587
+ # input_ids based on the past_length.
588
+ elif past_length < input_ids.shape[1]:
589
+ input_ids = input_ids[:, past_length:]
590
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
591
+ elif self.config.image_token_index in input_ids:
592
+ input_ids = input_ids[:, input_ids.shape[1] - 1 :]
593
+ # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
594
+ # older attention values, as their corresponding values are not part of the input.
595
+ if cache_length < past_length and attention_mask is not None:
596
+ attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]) :]
597
+
598
+ position_ids = kwargs.get("position_ids", None)
599
+ if attention_mask is not None and position_ids is None:
600
+ # create position_ids on the fly for batch generation
601
+ position_ids = attention_mask.long().cumsum(-1) - 1
602
+ position_ids.masked_fill_(attention_mask == 0, 1)
603
+ if past_key_values:
604
+ position_ids = position_ids[:, -input_ids.shape[1] :]
605
+
606
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
607
+ if inputs_embeds is not None and past_key_values is None:
608
+ model_inputs = {"inputs_embeds": inputs_embeds}
609
+ else:
610
+ model_inputs = {"input_ids": input_ids}
611
+ media_type = kwargs.get('media_type', None)
612
+
613
+ model_inputs.update(
614
+ {
615
+ "position_ids": position_ids,
616
+ "past_key_values": past_key_values,
617
+ "use_cache": kwargs.get("use_cache"),
618
+ "attention_mask": attention_mask,
619
+ "pixel_values": pixel_values,
620
+ "media_type": media_type,
621
+ }
622
+ )
623
+ return model_inputs
624
+
625
+ def _reorder_cache(self, *args, **kwargs):
626
+ return self.language_model._reorder_cache(*args, **kwargs)
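`PllavaMultiModalProjector` is where the pooling happens: per-frame CLIP patch features are projected to the LLM width, reshaped into a (frames, h, w) feature volume, reduced with `AdaptiveAvgPool3d(pooling_shape)`, and flattened back into visual tokens. A shape-only sketch of that path (one video of 16 frames and the default `pooling_shape=(8, 16, 16)` are arbitrary example sizes):

import torch
import einops

num_videos, num_frames, h, w, llm_dim = 1, 16, 24, 24, 4096
pooling_shape = (8, 16, 16)  # (t, h, w), the PllavaConfig default

projected = torch.randn(num_videos * num_frames, h * w, llm_dim)  # features after linear_2
video = einops.rearrange(
    projected, "(v f) (h w) d -> v d f h w", v=num_videos, h=h
)                                                                  # (1, 4096, 16, 24, 24)
pooled = torch.nn.AdaptiveAvgPool3d(pooling_shape)(video)          # (1, 4096, 8, 16, 16)
tokens = einops.rearrange(pooled, "v d f h w -> v (f h w) d")
print(tokens.shape)  # torch.Size([1, 2048, 4096]), i.e. 2048 visual tokens per video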
models/pllava/processing_pllava.py ADDED
@@ -0,0 +1,292 @@
1
+ # coding=utf-8
2
+ # Copyright 2023 The HuggingFace Inc. team.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """
16
+ Processor class for Llava.
17
+ """
18
+
19
+
20
+ import itertools
21
+ from typing import List, Optional, Union
22
+ import PIL.Image
23
+ import numpy as np
24
+
25
+ from transformers import AutoTokenizer
26
+ from transformers.feature_extraction_utils import BatchFeature
27
+ from transformers.image_utils import (
28
+ ImageInput,
29
+ make_list_of_images,
30
+ valid_images,
31
+ infer_channel_dimension_format,
32
+ to_numpy_array,
33
+ get_image_size,
34
+ ChannelDimension,
35
+ )
36
+ from transformers.image_processing_utils import get_size_dict
37
+ from transformers.image_utils import PILImageResampling
38
+ from transformers.processing_utils import ProcessorMixin
39
+ from transformers.image_transforms import resize, pad, PaddingMode, to_channel_dimension_format, get_resize_output_image_size
40
+ from transformers.tokenization_utils_base import PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
41
+ from transformers.utils import TensorType
42
+
43
+
44
+ class PllavaProcessor(ProcessorMixin):
45
+ r"""
46
+ Constructs a Llava processor which wraps a Llava image processor and a Llava tokenizer into a single processor.
47
+
48
+ [`LlavaProcessor`] offers all the functionalities of [`CLIPImageProcessor`] and [`LlamaTokenizerFast`]. See the
49
+ [`~LlavaProcessor.__call__`] and [`~LlavaProcessor.decode`] for more information.
50
+
51
+ Args:
52
+ image_processor ([`CLIPImageProcessor`], *optional*):
53
+ The image processor is a required input.
54
+ tokenizer ([`LlamaTokenizerFast`], *optional*):
55
+ The tokenizer is a required input.
56
+ """
57
+
58
+ attributes = ["image_processor", "tokenizer"]
59
+ image_processor_class = "CLIPImageProcessor"
60
+ tokenizer_class = "AutoTokenizer"
61
+
62
+ def __init__(self, image_processor=None, tokenizer=None,
63
+ shortest_edge=336,
64
+ longest_edge=762,
65
+ center_pad=False):
66
+ self.shortest_edge = shortest_edge
67
+ self.longest_edge = longest_edge
68
+ self.center_pad = center_pad
69
+ super().__init__(image_processor, tokenizer)
70
+
71
+ def resize_crop_longshort(self, videos: list[list[np.ndarray]], input_data_format):
72
+ video_spatial_sizes = [get_image_size(images[0], input_data_format) for images in videos]
73
+ long_short_rates = [max(size) / min(size) for size in video_spatial_sizes]
74
+ min_long_short_rate = min(long_short_rates)
75
+ min_long_short_video_idx = long_short_rates.index(min_long_short_rate)
76
+
77
+ clip_resolution = self.image_processor.size['shortest_edge']
78
+ out_video_spatial_size = video_spatial_sizes[min_long_short_video_idx]
79
+ out_videos_short_edge = max(min(size) for size in video_spatial_sizes)
80
+ resize_longest_edge = max(max(size) for size in video_spatial_sizes)
81
+ resize_longest_edge = min(640, resize_longest_edge)
82
+ out_videos_short_edge = min(out_videos_short_edge, int(resize_longest_edge / min_long_short_rate))
83
+ out_videos_short_edge = max(out_videos_short_edge, clip_resolution)
84
+
85
+
86
+ if out_video_spatial_size[0] > out_video_spatial_size[1]: # h > w:
87
+ out_video_spatial_size = (int(out_videos_short_edge * min_long_short_rate), out_videos_short_edge )
88
+ else:
89
+ out_video_spatial_size = ( out_videos_short_edge, int(out_videos_short_edge * min_long_short_rate) )
90
+ videos = [
91
+ [self.resize(frame, input_data_format=input_data_format, shortest_edge=out_videos_short_edge, longest_edge=9999) for frame in frames]
92
+ for frames in videos
93
+ ]
94
+ out_videos = []
95
+ for frames in videos:
96
+ out_frames = []
97
+ video_spatial_size = get_image_size(frames[0], input_data_format)
98
+ assert min(video_spatial_size) == out_videos_short_edge
99
+ overhead = (max(video_spatial_size) - max(out_video_spatial_size)) // 2
100
+ slice_start, slice_end = overhead // 2, overhead // 2 + max(out_video_spatial_size)
101
+ hslice, wslice = (slice(slice_start, slice_end), slice(None, None)) if video_spatial_size[0] > video_spatial_size[1] \
102
+ else (slice(None, None), slice(slice_start, slice_end)) # h > w
103
+ for frame in frames:
104
+ if input_data_format == ChannelDimension.FIRST:
105
+ out_frames.append(frame[..., hslice, wslice])
106
+ elif input_data_format == ChannelDimension.LAST:
107
+ out_frames.append(frame[..., hslice, wslice, :])
108
+ out_videos.append(out_frames)
109
+
110
+ return out_videos
111
+
112
+ @staticmethod
113
+ def _compute_num_blocks_and_overlaps(input_shape, resolution):
114
+ input_shape = np.array(input_shape)
115
+ resolution = np.array(resolution)
116
+ assert input_shape.max() >= resolution
117
+ num_blocks = np.ceil(input_shape / resolution).astype(np.int32).tolist()
118
+ overlaps = [0 if size % resolution==0
119
+ else int(np.floor((resolution - size % resolution) / (num_block - 1))) for num_block, size in zip(num_blocks, input_shape)]
120
+ return num_blocks, overlaps
121
+
122
+ def resize(
123
+ self,
124
+ image: np.ndarray,
125
+ resample: PILImageResampling = PILImageResampling.BICUBIC, # type: ignore
126
+ data_format: Optional[Union[str, ChannelDimension]] = None,
127
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
128
+ shortest_edge: int = None,
129
+ longest_edge: int = None,
130
+ **kwargs,
131
+ ) -> np.ndarray:
132
+ """
133
+ Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
134
+ resized to keep the input aspect ratio.
135
+
136
+ Args:
137
+ image (`np.ndarray`):
138
+ Image to resize.
139
+ size (`Dict[str, int]`):
140
+ Size of the output image.
141
+ resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
142
+ Resampling filter to use when resizing the image.
143
+ data_format (`str` or `ChannelDimension`, *optional*):
144
+ The channel dimension format of the image. If not provided, it will be the same as the input image.
145
+ input_data_format (`ChannelDimension` or `str`, *optional*):
146
+ The channel dimension format of the input image. If not provided, it will be inferred.
147
+ """
148
+ shortest_edge = getattr(self, 'shortest_edge', None) if shortest_edge is None else shortest_edge
149
+ longest_edge = getattr(self, 'longest_edge', None) if longest_edge is None else longest_edge
150
+ default_to_square = False
151
+ output_size = get_resize_output_image_size(
152
+ image,
153
+ size=shortest_edge,
154
+ default_to_square=default_to_square,
155
+ max_size=longest_edge,
156
+ input_data_format=input_data_format,
157
+ )
158
+ clip_resolution = self.image_processor.size['shortest_edge']
159
+ if min(output_size) < clip_resolution:
160
+ output_size = get_resize_output_image_size(
161
+ image,
162
+ size=shortest_edge,
163
+ default_to_square=default_to_square,
164
+ input_data_format=input_data_format,
165
+ )
166
+ return resize(
167
+ image,
168
+ size=output_size,
169
+ resample=resample,
170
+ data_format=data_format,
171
+ input_data_format=input_data_format,
172
+ **kwargs,
173
+ )
174
+
175
+ def __call__(
176
+ self,
177
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
178
+ images: ImageInput = None,
179
+ center_pad = None,
180
+ padding: Union[bool, str, PaddingStrategy] = False,
181
+ truncation: Union[bool, str, TruncationStrategy] = None,
182
+ max_length=None,
183
+ return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
184
+ ) -> BatchFeature:
185
+ """
186
+ Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
187
+ and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
188
+ the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
189
+ CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
190
+ of the above two methods for more information.
191
+
192
+ Args:
193
+ text (`str`, `List[str]`, `List[List[str]]`):
194
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
195
+ (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
196
+ `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
197
+ images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
198
+ The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
199
+ tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
200
+ number of channels, H and W are image height and width.
201
+ padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
202
+ Select a strategy to pad the returned sequences (according to the model's padding side and padding
203
+ index) among:
204
+ - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
205
+ sequence is provided).
206
+ - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
207
+ acceptable input length for the model if that argument is not provided.
208
+ - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
209
+ lengths).
210
+ max_length (`int`, *optional*):
211
+ Maximum length of the returned list and optionally padding length (see above).
212
+ truncation (`bool`, *optional*):
213
+ Activates truncation to cut input sequences longer than `max_length` to `max_length`.
214
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
215
+ If set, will return tensors of a particular framework. Acceptable values are:
216
+
217
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
218
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
219
+ - `'np'`: Return NumPy `np.ndarray` objects.
220
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
221
+
222
+ Returns:
223
+ [`BatchFeature`]: A [`BatchFeature`] with the following fields:
224
+
225
+ - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
226
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
227
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
228
+ `None`).
229
+ - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
230
+ """
231
+ data=dict()
232
+ if images is not None:
233
+ if isinstance(images, list) and isinstance(images[0], PIL.Image.Image):
234
+ videos = [images] # one video
235
+ else:
236
+ videos = images
237
+
238
+ pixel_values_list = []
239
+ videos = [[to_numpy_array(image) for image in make_list_of_images(images)] for images in videos]
240
+ # images = [self.resize(image, ) if min(get_image_size(image, input_data_format)) < clip_resolution else image for image in images]
241
+ input_data_format = infer_channel_dimension_format(videos[0][0])
242
+ videos = self.resize_crop_longshort(videos, input_data_format)
243
+
244
+ for images in videos:
245
+ if not valid_images(images):
246
+ raise ValueError(
247
+ "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
248
+ "torch.Tensor, tf.Tensor or jax.ndarray."
249
+ )
250
+
251
+ center_pad = center_pad if center_pad is not None else self.center_pad
252
+ if center_pad:
253
+ images = [self.pad_to_square(image, 0, input_data_format, input_data_format) for image in images]
254
+
255
+ pixel_values = self.image_processor(images, return_tensors='np')["pixel_values"]
256
+ pixel_values_list.append(pixel_values)
257
+
258
+ pixel_values = np.concatenate(pixel_values_list)
259
+ data.update(pixel_values=pixel_values)
260
+
261
+ else:
262
+ data.update(pixel_values = None)
263
+
264
+ if text is not None:
265
+ text_inputs = self.tokenizer(
266
+ text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
267
+ )
268
+ data.update(**text_inputs)
269
+ return BatchFeature(data, tensor_type=return_tensors)
270
+
271
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
272
+ def batch_decode(self, *args, **kwargs):
273
+ """
274
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
275
+ refer to the docstring of this method for more information.
276
+ """
277
+ return self.tokenizer.batch_decode(*args, **kwargs)
278
+
279
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
280
+ def decode(self, *args, **kwargs):
281
+ """
282
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
283
+ the docstring of this method for more information.
284
+ """
285
+ return self.tokenizer.decode(*args, **kwargs)
286
+
287
+ @property
288
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
289
+ def model_input_names(self):
290
+ tokenizer_input_names = self.tokenizer.model_input_names
291
+ image_processor_input_names = self.image_processor.model_input_names
292
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
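Note on the processing logic above: resize_crop_longshort picks a shared short edge for all videos in a batch and then center-crops every frame to the aspect ratio of the least-elongated video. A minimal standalone sketch of that output-size selection, using hypothetical (height, width) inputs and a stand-in clip_resolution instead of the actual CLIP image processor:

# Illustrative sketch of the size selection in resize_crop_longshort (hypothetical inputs).
video_spatial_sizes = [(480, 600), (720, 1280)]   # (height, width) of the first frame of each video
clip_resolution = 336                             # stand-in for image_processor.size['shortest_edge']

long_short_rates = [max(s) / min(s) for s in video_spatial_sizes]
min_rate = min(long_short_rates)                  # aspect ratio of the least-elongated video
ref_size = video_spatial_sizes[long_short_rates.index(min_rate)]

short_edge = max(min(s) for s in video_spatial_sizes)
longest_edge = min(640, max(max(s) for s in video_spatial_sizes))
short_edge = max(min(short_edge, int(longest_edge / min_rate)), clip_resolution)

# Keep the orientation (portrait/landscape) of the reference video.
if ref_size[0] > ref_size[1]:
    out_size = (int(short_edge * min_rate), short_edge)
else:
    out_size = (short_edge, int(short_edge * min_rate))
print(out_size)                                   # (512, 640) for these inputs

All frames are then resized to this shared short edge and cropped to out_size, mirroring the loop in the method above.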
python_scripts/hf.py ADDED
@@ -0,0 +1,80 @@
1
+ import os.path as osp
2
+ import os
3
+ import re
4
+ import multiprocessing
5
+ import functools
6
+ import huggingface_hub
7
+ from huggingface_hub import snapshot_download
8
+
9
+
10
+ def upload(repo_id, local_dir, path_in_repo, repo_type, token):
11
+ huggingface_hub.upload_folder(
12
+ repo_id=repo_id,
13
+ folder_path=local_dir,
14
+ path_in_repo=path_in_repo,
15
+ token=token,
16
+ repo_type=repo_type
17
+ )
18
+
19
+ def download(repo_id, local_dir, repo_type, token, filter_re=None):
20
+ files = huggingface_hub.list_repo_files(repo_id, repo_type=repo_type, token=token)
21
+ if filter_re is not None:
22
+ files = [file for file in files if re.search(filter_re, file) is not None]
23
+ pool = multiprocessing.Pool(8)
24
+ download_func = functools.partial(
25
+ huggingface_hub.hf_hub_download,
26
+ repo_id,
27
+ repo_type=repo_type,
28
+ local_dir=local_dir,
29
+ local_dir_use_symlinks=True,
30
+ token=token
31
+ )
32
+ pool.map(download_func, files)
33
+ print(f'downloaded files {files}')
34
+
35
+
36
+ def upload_file(repo_id, file_path, repo_type, token):
37
+ huggingface_hub.upload_file(
38
+ repo_id=repo_id,
39
+ path_or_fileobj=file_path,
40
+ path_in_repo=file_path,
41
+ token=token,
42
+ repo_type=repo_type,
43
+ )
44
+
45
+ if __name__ == '__main__':
46
+ read_token = '...'
47
+ write_token = '...'
48
+ repo_id = '...'
49
+ local_dir = '...'
50
+ repo_type = '...'
51
+
52
+
53
+ # #############
54
+ # # Examples on most simple hf usage
55
+ # # download
56
+ # filters = []
57
+ # for filter_re in filters:
58
+ # download(repo_id,
59
+ # local_dir,
60
+ # repo_type,
61
+ # read_token,
+ # filter_re)
62
+
63
+ # # upload
64
+ # upload(repo_id, local_dir, local_dir, repo_type, write_token)
65
+ # #############
66
+
67
+ # download models
68
+ repo_ids = [
69
+ 'ermu2001/pllava-7b',
70
+ 'ermu2001/pllava-13b',
71
+ ]
72
+ for repo_id in repo_ids:
73
+ local_dir = repo_id.replace('ermu2001', 'MODELS')
74
+ snapshot_download(
75
+ repo_id,
76
+ local_dir=local_dir,
77
+ repo_type='model',
78
+ local_dir_use_symlinks=True,
79
+ token=read_token,
80
+ )
requirements.no_torch.txt ADDED
@@ -0,0 +1,244 @@
1
+ absl-py==2.1.0
2
+ accelerate==0.26.1
3
+ addict==2.4.0
4
+ aiofiles==23.2.1
5
+ aliyun-python-sdk-core==2.15.0
6
+ aliyun-python-sdk-kms==2.16.2
7
+ altair==5.2.0
8
+ annotated-types==0.6.0
9
+ antlr4-python3-runtime==4.9.3
10
+ anyio==4.3.0
11
+ anykeystore==0.2
12
+ apex==0.9.10.dev0
13
+ appdirs==1.4.4
14
+ argcomplete==3.2.3
15
+ attrs==23.2.0
16
+ av==10.0.0
17
+ beautifulsoup4==4.12.3
18
+ blessed==1.20.0
19
+ blessings==1.7
20
+ boto3==1.34.63
21
+ botocore==1.34.63
22
+ Brotli==1.1.0
23
+ cachetools==5.3.3
24
+ certifi==2024.2.2
25
+ cffi==1.16.0
26
+ charset-normalizer==3.3.2
27
+ click==8.1.7
28
+ colorama==0.4.6
29
+ contourpy==1.2.0
30
+ crcmod==1.7
31
+ cryptacular==1.6.2
32
+ cryptography==42.0.5
33
+ cycler==0.12.1
34
+ dacite==1.7.0
35
+ decorator==4.4.2
36
+ decord==0.6.0
37
+ deepspeed==0.14.0
38
+ defusedxml==0.7.1
39
+ Deprecated==1.2.14
40
+ dill==0.3.8
41
+ distro==1.9.0
42
+ dnspython==2.6.1
43
+ docker-pycreds==0.4.0
44
+ einops==0.6.1
45
+ exceptiongroup==1.2.0
46
+ fastapi==0.110.0
47
+ ffmpeg==1.4
48
+ ffmpy==0.3.2
49
+ fiftyone==0.23.6
50
+ fiftyone-brain==0.16.1
51
+ fiftyone_db==1.1.2
52
+ filelock==3.9.0
53
+ flash-attn==2.5.6
54
+ fonttools==4.49.0
55
+ fsspec==2024.2.0
56
+ ftfy==6.1.3
57
+ future==1.0.0
58
+ fvcore==0.1.5.post20221221
59
+ gdown==5.1.0
60
+ gitdb==4.0.11
61
+ GitPython==3.1.42
62
+ glob2==0.7
63
+ google-auth==2.28.2
64
+ google-auth-oauthlib==1.2.0
65
+ gpustat==1.1.1
66
+ gradio==4.21.0
67
+ gradio_client==0.12.0
68
+ graphql-core==3.2.3
69
+ greenlet==3.0.3
70
+ grpcio==1.62.1
71
+ h11==0.14.0
72
+ h2==4.1.0
73
+ hjson==3.1.0
74
+ hpack==4.0.0
75
+ httpcore==1.0.4
76
+ httpx==0.27.0
77
+ huggingface-hub==0.21.4
78
+ humanize==4.9.0
79
+ hupper==1.12.1
80
+ Hypercorn==0.16.0
81
+ hyperframe==6.0.1
82
+ idna==3.6
83
+ idscheck==2.3.0
84
+ imageio==2.27.0
85
+ imageio-ffmpeg==0.4.9
86
+ importlib_metadata==7.0.2
87
+ importlib_resources==6.3.0
88
+ inflate64==1.0.0
89
+ iopath==0.1.10
90
+ Jinja2==3.1.2
91
+ jmespath==0.10.0
92
+ joblib==1.3.2
93
+ jsonlines==4.0.0
94
+ jsonschema==4.21.1
95
+ jsonschema-specifications==2023.12.1
96
+ kaleido==0.2.1
97
+ kiwisolver==1.4.5
98
+ lazy_loader==0.3
99
+ Markdown==3.6
100
+ markdown-it-py==3.0.0
101
+ MarkupSafe==2.1.3
102
+ matplotlib==3.8.3
103
+ mdurl==0.1.2
104
+ mmcv-full==1.7.2
105
+ model-index==0.1.11
106
+ mongoengine==0.24.2
107
+ motor==3.3.2
108
+ moviepy==1.0.3
109
+ mpmath==1.3.0
110
+ multivolumefile==0.2.3
111
+ networkx==3.2.1
112
+ ninja==1.11.1.1
113
+ numpy
114
+ oauthlib==3.2.2
115
+ omegaconf==2.3.0
116
+ openai==1.14.0
117
+ opencv-python==4.9.0.80
118
+ opencv-python-headless==4.9.0.80
119
+ opendatalab==0.0.10
120
+ openmim==0.3.9
121
+ openxlab==0.0.36
122
+ ordered-set==4.1.0
123
+ orjson==3.9.15
124
+ oss2==2.17.0
125
+ packaging==24.0
126
+ pandas==1.5.3
127
+ PasteDeploy==3.1.0
128
+ pathtools==0.1.2
129
+ pbkdf2==1.3
130
+ peft==0.10.0
131
+ pillow==10.2.0
132
+ plaster==1.1.2
133
+ plaster-pastedeploy==1.0.1
134
+ platformdirs==4.2.0
135
+ plotly==5.20.0
136
+ portalocker==2.8.2
137
+ pprintpp==0.4.0
138
+ priority==2.0.0
139
+ proglog==0.1.10
140
+ protobuf==4.23.4
141
+ psutil==5.9.4
142
+ py-cpuinfo==9.0.0
143
+ py7zr==0.21.0
144
+ pyasn1==0.5.1
145
+ pyasn1-modules==0.3.0
146
+ pybcj==1.0.2
147
+ pycparser==2.21
148
+ pycryptodome==3.20.0
149
+ pycryptodomex==3.20.0
150
+ pydantic==2.6.4
151
+ pydantic_core==2.16.3
152
+ pydub==0.25.1
153
+ Pygments==2.17.2
154
+ pymongo==4.6.2
155
+ pynvml==11.5.0
156
+ pyparsing==3.1.2
157
+ pyppmd==1.1.0
158
+ pyramid==2.0.2
159
+ pyramid-mailer==0.15.1
160
+ PySocks==1.7.1
161
+ python-dateutil==2.9.0.post0
162
+ python-multipart==0.0.9
163
+ python3-openid==3.2.0
164
+ pytz==2023.4
165
+ PyYAML==6.0
166
+ pyzstd==0.15.9
167
+ rarfile==4.1
168
+ referencing==0.33.0
169
+ regex==2023.12.25
170
+ repoze.sendmail==4.4.1
171
+ requests==2.28.2
172
+ requests-oauthlib==1.4.0
173
+ retrying==1.3.4
174
+ rich==13.4.2
175
+ rpds-py==0.18.0
176
+ rsa==4.9
177
+ ruff==0.3.2
178
+ s3transfer==0.10.1
179
+ safetensors==0.4.2
180
+ scikit-image==0.22.0
181
+ scikit-learn==1.4.1.post1
182
+ scipy==1.10.1
183
+ semantic-version==2.10.0
184
+ sentencepiece==0.2.0
185
+ sentry-sdk==1.42.0
186
+ setproctitle==1.3.3
187
+ shellingham==1.5.4
188
+ six==1.16.0
189
+ smmap==5.0.1
190
+ sniffio==1.3.1
191
+ sortedcontainers==2.4.0
192
+ soupsieve==2.5
193
+ SQLAlchemy==2.0.28
194
+ sse-starlette==0.10.3
195
+ sseclient-py==1.8.0
196
+ starlette==0.36.3
197
+ strawberry-graphql==0.138.1
198
+ sympy==1.12
199
+ tabulate==0.9.0
200
+ taskgroup==0.0.0a4
201
+ tenacity==8.2.3
202
+ tensorboard==2.15.1
203
+ tensorboard-data-server==0.7.2
204
+ tensorboardX==2.6.2.2
205
+ termcolor==2.3.0
206
+ texttable==1.7.0
207
+ threadpoolctl==3.3.0
208
+ tifffile==2024.2.12
209
+ timm==0.6.12
210
+ tokenizers==0.15.2
211
+ tomli==2.0.1
212
+ tomlkit==0.12.0
213
+ toolz==0.12.1
214
+ tqdm==4.65.2
215
+ transaction==4.0
216
+ transformers==4.37.1
217
+ translationstring==1.4
218
+ triton==2.2.0
219
+ typer==0.9.0
220
+ typing_extensions==4.8.0
221
+ tzdata==2024.1
222
+ tzlocal==5.2
223
+ universal-analytics-python3==1.1.1
224
+ urllib3==1.26.18
225
+ uvicorn==0.28.0
226
+ velruse==1.1.1
227
+ venusian==3.1.0
228
+ voxel51-eta==0.12.6
229
+ wandb==0.14.0
230
+ wcwidth==0.2.13
231
+ WebOb==1.8.7
232
+ websockets==11.0.3
233
+ Werkzeug==3.0.1
234
+ wrapt==1.16.0
235
+ wsproto==1.2.0
236
+ WTForms==3.1.2
237
+ wtforms-recaptcha==0.3.2
238
+ xmltodict==0.13.0
239
+ yacs==0.1.8
240
+ yapf==0.40.2
241
+ zipp==3.18.1
242
+ zope.deprecation==5.0
243
+ zope.interface==6.2
244
+ zope.sqlalchemy==3.1
requirements.torch.txt ADDED
@@ -0,0 +1,4 @@
1
+ --index-url https://download.pytorch.org/whl/cu118
2
+ torch==2.2.1
3
+ torchaudio==2.2.1
4
+ torchvision==0.17.1
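requirements.torch.txt pins the CUDA 11.8 wheels via a dedicated index URL, separate from requirements.no_torch.txt. A quick sanity check that the right build is active after installation (the expected values are assumptions based on the pins above):

# Verify the cu118 torch build from requirements.torch.txt.
import torch

print(torch.__version__)           # expected: 2.2.1+cu118
print(torch.version.cuda)          # expected: 11.8
print(torch.cuda.is_available())   # True on a machine with a visible NVIDIA GPU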
requirements.txt ADDED
@@ -0,0 +1,246 @@
1
+ absl-py==2.1.0
2
+ accelerate==0.26.1
3
+ addict==2.4.0
4
+ aiofiles==23.2.1
5
+ aliyun-python-sdk-core==2.15.0
6
+ aliyun-python-sdk-kms==2.16.2
7
+ altair==5.2.0
8
+ annotated-types==0.6.0
9
+ antlr4-python3-runtime==4.9.3
10
+ anyio==4.3.0
11
+ anykeystore==0.2
12
+ apex==0.9.10.dev0
13
+ appdirs==1.4.4
14
+ argcomplete==3.2.3
15
+ attrs==23.2.0
16
+ av==10.0.0
17
+ beautifulsoup4==4.12.3
18
+ blessed==1.20.0
19
+ blessings==1.7
20
+ boto3==1.34.63
21
+ botocore==1.34.63
22
+ Brotli==1.1.0
23
+ cachetools==5.3.3
24
+ certifi==2024.2.2
25
+ cffi==1.16.0
26
+ charset-normalizer==3.3.2
27
+ click==8.1.7
28
+ colorama==0.4.6
29
+ contourpy==1.2.0
30
+ crcmod==1.7
31
+ cryptacular==1.6.2
32
+ cryptography==42.0.5
33
+ cycler==0.12.1
34
+ dacite==1.7.0
35
+ decorator==4.4.2
36
+ decord==0.6.0
37
+ deepspeed==0.14.0
38
+ defusedxml==0.7.1
39
+ Deprecated==1.2.14
40
+ dill==0.3.8
41
+ distro==1.9.0
42
+ dnspython==2.6.1
43
+ docker-pycreds==0.4.0
44
+ einops==0.6.1
45
+ exceptiongroup==1.2.0
46
+ fastapi==0.110.0
47
+ ffmpeg==1.4
48
+ ffmpy==0.3.2
49
+ fiftyone==0.23.6
50
+ fiftyone-brain==0.16.1
51
+ fiftyone_db==1.1.2
52
+ filelock==3.9.0
53
+ fonttools==4.49.0
54
+ fsspec==2024.2.0
55
+ ftfy==6.1.3
56
+ future==1.0.0
57
+ fvcore==0.1.5.post20221221
58
+ gdown==5.1.0
59
+ gitdb==4.0.11
60
+ GitPython==3.1.42
61
+ glob2==0.7
62
+ google-auth==2.28.2
63
+ google-auth-oauthlib==1.2.0
64
+ gpustat==1.1.1
65
+ gradio==4.21.0
66
+ gradio_client==0.12.0
67
+ graphql-core==3.2.3
68
+ greenlet==3.0.3
69
+ grpcio==1.62.1
70
+ h11==0.14.0
71
+ h2==4.1.0
72
+ hjson==3.1.0
73
+ hpack==4.0.0
74
+ httpcore==1.0.4
75
+ httpx==0.27.0
76
+ huggingface-hub==0.21.4
77
+ humanize==4.9.0
78
+ hupper==1.12.1
79
+ Hypercorn==0.16.0
80
+ hyperframe==6.0.1
81
+ idna==3.6
82
+ idscheck==2.3.0
83
+ imageio==2.27.0
84
+ imageio-ffmpeg==0.4.9
85
+ importlib_metadata==7.0.2
86
+ importlib_resources==6.3.0
87
+ inflate64==1.0.0
88
+ iopath==0.1.10
89
+ Jinja2==3.1.2
90
+ jmespath==0.10.0
91
+ joblib==1.3.2
92
+ jsonlines==4.0.0
93
+ jsonschema==4.21.1
94
+ jsonschema-specifications==2023.12.1
95
+ kaleido==0.2.1
96
+ kiwisolver==1.4.5
97
+ lazy_loader==0.3
98
+ Markdown==3.6
99
+ markdown-it-py==3.0.0
100
+ MarkupSafe==2.1.3
101
+ matplotlib==3.8.3
102
+ mdurl==0.1.2
103
+ mmcv-full==1.7.2
104
+ model-index==0.1.11
105
+ mongoengine==0.24.2
106
+ motor==3.3.2
107
+ moviepy==1.0.3
108
+ mpmath==1.3.0
109
+ multivolumefile==0.2.3
110
+ networkx==3.2.1
111
+ ninja==1.11.1.1
112
+ numpy==1.23.5
113
+ oauthlib==3.2.2
114
+ omegaconf==2.3.0
115
+ openai==1.14.0
116
+ opencv-python==4.9.0.80
117
+ opencv-python-headless==4.9.0.80
118
+ opendatalab==0.0.10
119
+ openmim==0.3.9
120
+ openxlab==0.0.36
121
+ ordered-set==4.1.0
122
+ orjson==3.9.15
123
+ oss2==2.17.0
124
+ packaging==24.0
125
+ pandas==1.5.3
126
+ PasteDeploy==3.1.0
127
+ pathtools==0.1.2
128
+ pbkdf2==1.3
129
+ peft==0.10.0
130
+ pillow==10.2.0
131
+ plaster==1.1.2
132
+ plaster-pastedeploy==1.0.1
133
+ platformdirs==4.2.0
134
+ plotly==5.20.0
135
+ portalocker==2.8.2
136
+ pprintpp==0.4.0
137
+ priority==2.0.0
138
+ proglog==0.1.10
139
+ protobuf==4.23.4
140
+ psutil==5.9.4
141
+ py-cpuinfo==9.0.0
142
+ py7zr==0.21.0
143
+ pyasn1==0.5.1
144
+ pyasn1-modules==0.3.0
145
+ pybcj==1.0.2
146
+ pycparser==2.21
147
+ pycryptodome==3.20.0
148
+ pycryptodomex==3.20.0
149
+ pydantic==2.6.4
150
+ pydantic_core==2.16.3
151
+ pydub==0.25.1
152
+ Pygments==2.17.2
153
+ pymongo==4.6.2
154
+ pynvml==11.5.0
155
+ pyparsing==3.1.2
156
+ pyppmd==1.1.0
157
+ pyramid==2.0.2
158
+ pyramid-mailer==0.15.1
159
+ PySocks==1.7.1
160
+ python-dateutil==2.9.0.post0
161
+ python-multipart==0.0.9
162
+ python3-openid==3.2.0
163
+ pytz==2023.4
164
+ PyYAML==6.0
165
+ pyzstd==0.15.9
166
+ rarfile==4.1
167
+ referencing==0.33.0
168
+ regex==2023.12.25
169
+ repoze.sendmail==4.4.1
170
+ requests==2.28.2
171
+ requests-oauthlib==1.4.0
172
+ retrying==1.3.4
173
+ rich==13.4.2
174
+ rpds-py==0.18.0
175
+ rsa==4.9
176
+ ruff==0.3.2
177
+ s3transfer==0.10.1
178
+ safetensors==0.4.2
179
+ scikit-image==0.22.0
180
+ scikit-learn==1.4.1.post1
181
+ scipy==1.10.1
182
+ semantic-version==2.10.0
183
+ sentencepiece==0.2.0
184
+ sentry-sdk==1.42.0
185
+ setproctitle==1.3.3
186
+ shellingham==1.5.4
187
+ six==1.16.0
188
+ smmap==5.0.1
189
+ sniffio==1.3.1
190
+ sortedcontainers==2.4.0
191
+ soupsieve==2.5
192
+ SQLAlchemy==2.0.28
193
+ sse-starlette==0.10.3
194
+ sseclient-py==1.8.0
195
+ starlette==0.36.3
196
+ strawberry-graphql==0.138.1
197
+ sympy==1.12
198
+ tabulate==0.9.0
199
+ taskgroup==0.0.0a4
200
+ tenacity==8.2.3
201
+ tensorboard==2.15.1
202
+ tensorboard-data-server==0.7.2
203
+ tensorboardX==2.6.2.2
204
+ termcolor==2.3.0
205
+ texttable==1.7.0
206
+ threadpoolctl==3.3.0
207
+ tifffile==2024.2.12
208
+ timm==0.6.12
209
+ tokenizers==0.15.2
210
+ tomli==2.0.1
211
+ tomlkit==0.12.0
212
+ toolz==0.12.1
213
+ torch==2.2.1
214
+ torchaudio==2.2.1
215
+ torchvision==0.17.1
216
+ tqdm==4.65.2
217
+ transaction==4.0
218
+ transformers
219
+ translationstring==1.4
220
+ triton==2.2.0
221
+ typer==0.9.0
222
+ typing_extensions==4.8.0
223
+ tzdata==2024.1
224
+ tzlocal==5.2
225
+ universal-analytics-python3==1.1.1
226
+ urllib3==1.26.18
227
+ uvicorn==0.28.0
228
+ velruse==1.1.1
229
+ venusian==3.1.0
230
+ voxel51-eta==0.12.6
231
+ wandb==0.14.0
232
+ wcwidth==0.2.13
233
+ WebOb==1.8.7
234
+ websockets==11.0.3
235
+ Werkzeug==3.0.1
236
+ wrapt==1.16.0
237
+ wsproto==1.2.0
238
+ WTForms==3.1.2
239
+ wtforms-recaptcha==0.3.2
240
+ xmltodict==0.13.0
241
+ yacs==0.1.8
242
+ yapf==0.40.2
243
+ zipp==3.18.1
244
+ zope.deprecation==5.0
245
+ zope.interface==6.2
246
+ zope.sqlalchemy==3.1
scripts/accel_config_deepspeed_zero2.yaml ADDED
@@ -0,0 +1,21 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ gradient_accumulation_steps: 8
5
+ offload_optimizer_device: none
6
+ offload_param_device: none
7
+ zero3_init_flag: false
8
+ zero_stage: 2
9
+ distributed_type: DEEPSPEED
10
+ downcast_bf16: 'no'
11
+ machine_rank: 0
12
+ main_training_function: main
13
+ mixed_precision: bf16
14
+ num_machines: 1
15
+ num_processes: 4
16
+ rdzv_backend: static
17
+ same_network: true
18
+ tpu_env: []
19
+ tpu_use_cluster: false
20
+ tpu_use_sudo: false
21
+ use_cpu: false
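With num_processes: 4 and gradient_accumulation_steps: 8, this config implies a global effective batch size of 32 times the per-device batch. A tiny sketch of that arithmetic; the per-device batch size is an assumption here, since it is set by the training config rather than this file:

# Effective batch size implied by accel_config_deepspeed_zero2.yaml.
num_processes = 4        # from this yaml
grad_accum_steps = 8     # from this yaml
per_device_batch = 8     # assumption: set via the training config's batch_size

global_batch = per_device_batch * num_processes * grad_accum_steps
print(global_batch)      # 256 under these assumptions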
scripts/accel_config_deepspeed_zero3_offload.yaml ADDED
@@ -0,0 +1,22 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ gradient_accumulation_steps: 2
5
+ offload_optimizer_device: cpu
6
+ offload_param_device: cpu
7
+ zero3_init_flag: true
8
+ zero3_save_16bit_model: true
9
+ zero_stage: 3
10
+ distributed_type: DEEPSPEED
11
+ downcast_bf16: 'no'
12
+ machine_rank: 0
13
+ main_training_function: main
14
+ mixed_precision: bf16
15
+ num_machines: 1
16
+ num_processes: 8
17
+ rdzv_backend: static
18
+ same_network: true
19
+ tpu_env: []
20
+ tpu_use_cluster: false
21
+ tpu_use_sudo: false
22
+ use_cpu: false
scripts/accel_config_deepspeed_zero3_offload_multinode.yaml ADDED
@@ -0,0 +1,25 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ deepspeed_multinode_launcher: standard
5
+ gradient_accumulation_steps: 2
6
+ offload_optimizer_device: cpu
7
+ offload_param_device: cpu
8
+ zero3_init_flag: true
9
+ zero3_save_16bit_model: true
10
+ zero_stage: 3
11
+ distributed_type: DEEPSPEED
12
+ downcast_bf16: 'no'
13
+ machine_rank: 0
14
+ main_process_ip: fdbd:dc61:18:8::20
15
+ main_process_port: 6876
16
+ main_training_function: main
17
+ mixed_precision: bf16
18
+ num_machines: 2
19
+ num_processes: 16
20
+ rdzv_backend: static
21
+ same_network: true
22
+ tpu_env: []
23
+ tpu_use_cluster: false
24
+ tpu_use_sudo: false
25
+ use_cpu: false
scripts/accel_config_deepspeed_zero3_offload_multinode_1.yaml ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ deepspeed_multinode_launcher: standard
5
+ gradient_accumulation_steps: 2
6
+ offload_optimizer_device: cpu
7
+ offload_param_device: cpu
8
+ zero3_init_flag: true
9
+ zero3_save_16bit_model: true
10
+ zero_stage: 3
11
+ distributed_type: DEEPSPEED
12
+ downcast_bf16: 'no'
13
+ machine_rank: 0
14
+ main_process_ip: fdbd:dc61:18:8::20
15
+ main_process_port: 6876
16
+ main_training_function: main
17
+ mixed_precision: bf16
18
+ num_machines: 2
19
+ num_processes: 16
20
+ rdzv_backend: static
21
+ same_network: true
22
+ tpu_env: []
23
+ tpu_use_cluster: false
24
+ tpu_use_sudo: false
25
+ use_cpu: false
scripts/accel_config_deepspeed_zero3_offload_multinode_2.yaml ADDED
@@ -0,0 +1,25 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ deepspeed_multinode_launcher: standard
5
+ gradient_accumulation_steps: 2
6
+ offload_optimizer_device: cpu
7
+ offload_param_device: cpu
8
+ zero3_init_flag: true
9
+ zero3_save_16bit_model: true
10
+ zero_stage: 3
11
+ distributed_type: DEEPSPEED
12
+ downcast_bf16: 'no'
13
+ machine_rank: 1
14
+ main_process_ip: fdbd:dc61:18:8::20
15
+ main_process_port: 6876
16
+ main_training_function: main
17
+ mixed_precision: bf16
18
+ num_machines: 2
19
+ num_processes: 16
20
+ rdzv_backend: static
21
+ same_network: true
22
+ tpu_env: []
23
+ tpu_use_cluster: false
24
+ tpu_use_sudo: false
25
+ use_cpu: false
scripts/accel_config_deepspeed_zero3_offload_singlegpu.yaml ADDED
@@ -0,0 +1,23 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ gradient_accumulation_steps: 16
5
+ gradient_clipping: 1.0
6
+ offload_optimizer_device: cpu
7
+ offload_param_device: cpu
8
+ zero3_init_flag: true
9
+ zero3_save_16bit_model: true
10
+ zero_stage: 3
11
+ distributed_type: DEEPSPEED
12
+ downcast_bf16: 'no'
13
+ machine_rank: 0
14
+ main_training_function: main
15
+ mixed_precision: bf16
16
+ num_machines: 1
17
+ num_processes: 1
18
+ rdzv_backend: static
19
+ same_network: true
20
+ tpu_env: []
21
+ tpu_use_cluster: false
22
+ tpu_use_sudo: false
23
+ use_cpu: false
scripts/accel_config_multigpu.yaml ADDED
@@ -0,0 +1,16 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ distributed_type: MULTI_GPU
4
+ downcast_bf16: 'no'
5
+ gpu_ids: 2,3,4,5
6
+ machine_rank: 0
7
+ main_training_function: main
8
+ mixed_precision: bf16
9
+ num_machines: 1
10
+ num_processes: 4
11
+ rdzv_backend: static
12
+ same_network: true
13
+ tpu_env: []
14
+ tpu_use_cluster: false
15
+ tpu_use_sudo: false
16
+ use_cpu: false
scripts/accel_config_multinode.yaml ADDED
@@ -0,0 +1,18 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ distributed_type: MULTI_GPU
4
+ downcast_bf16: 'no'
5
+ gpu_ids: all
6
+ machine_rank: 1
7
+ main_process_ip: 10.193.16.150
8
+ main_process_port: 6784
9
+ main_training_function: main
10
+ mixed_precision: bf16
11
+ num_machines: 2
12
+ num_processes: 16
13
+ rdzv_backend: static
14
+ same_network: true
15
+ tpu_env: []
16
+ tpu_use_cluster: false
17
+ tpu_use_sudo: false
18
+ use_cpu: false
scripts/accel_config_singlegpu.yaml ADDED
@@ -0,0 +1,16 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ distributed_type: 'NO'
4
+ downcast_bf16: 'no'
5
+ gpu_ids: '0'
6
+ machine_rank: 0
7
+ main_training_function: main
8
+ mixed_precision: bf16
9
+ num_machines: 1
10
+ num_processes: 1
11
+ rdzv_backend: static
12
+ same_network: true
13
+ tpu_env: []
14
+ tpu_use_cluster: false
15
+ tpu_use_sudo: false
16
+ use_cpu: false
scripts/demo.sh ADDED
@@ -0,0 +1,32 @@
1
+ model_dir=${1:-"MODELS/pllava-7b"}
2
+ weight_dir=${2:-"${model_dir}"}
3
+ num_frames=16
4
+ lora_alpha=4
5
+
6
+ echo Running DEMO from model_dir: ${model_dir}
7
+ echo Running DEMO from weights_dir: ${weight_dir}
8
+ echo Running DEMO On Devices: ${CUDA_VISIBLE_DEVICES}
9
+
10
+
11
+ # # 34B: needs to use dispatch for a model this large.
12
+ # CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} python -m tasks.eval.demo.pllava_demo \
13
+ # --pretrained_model_name_or_path ${model_dir} \
14
+ # --num_frames ${num_frames} \
15
+ # --use_lora \
16
+ # --weight_dir ${weight_dir} \
17
+ # --lora_alpha ${lora_alpha} \
18
+ # --conv_mode eval_vcg_llava_next \
19
+ # --use_multi_gpus \
20
+
21
+
22
+ # 7B and 13B: there can be problems if the model is split across A100 40G GPUs... probably because of an unknown bug in accelerate dispatch
23
+ CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-"0,1"} python -m tasks.eval.demo.pllava_demo \
24
+ --pretrained_model_name_or_path ${model_dir} \
25
+ --num_frames ${num_frames} \
26
+ --use_lora \
27
+ --weight_dir ${weight_dir} \
28
+ --lora_alpha ${lora_alpha} \
29
+ --conv_mode plain \
30
+ --use_multi_gpus
31
+
32
+
scripts/eval.sh ADDED
@@ -0,0 +1,104 @@
1
+ # export CUDA_VISIBLE_DEVICES=2,6,7
2
+ export OPENAI_API_KEY=...
3
+ num_frames=16
4
+ test_ratio=1
5
+
6
+ # 13b, uses offload thus saving the full model
7
+ model_dir=MODELS/pllava-13b
8
+ weight_dir=MODELS/pllava-13b
9
+ SAVE_DIR=test_results/test_pllava_13b
10
+ lora_alpha=4
11
+ conv_mode=eval_vcgbench
12
+ python -m tasks.eval.vcgbench.pllava_eval_vcgbench \
13
+ --pretrained_model_name_or_path ${model_dir} \
14
+ --save_path ${SAVE_DIR}/vcgbench \
15
+ --num_frames ${num_frames} \
16
+ --use_lora \
17
+ --lora_alpha ${lora_alpha} \
18
+ --weight_dir ${weight_dir} \
19
+ --pooling_shape 16-12-12 \
20
+ --test_ratio ${test_ratio} \
21
+ --conv_mode ${conv_mode}
22
+
23
+ conv_mode=eval_mvbench
24
+ python -m tasks.eval.mvbench.pllava_eval_mvbench \
25
+ --pretrained_model_name_or_path ${model_dir} \
26
+ --save_path ${SAVE_DIR}/mvbench \
27
+ --use_lora \
28
+ --lora_alpha ${lora_alpha} \
29
+ --num_frames ${num_frames} \
30
+ --weight_dir ${weight_dir} \
31
+ --pooling_shape 16-12-12 \
32
+ --conv_mode ${conv_mode}
33
+
34
+ conv_mode=eval_videoqabench
35
+ python -m tasks.eval.videoqabench.pllava_eval_videoqabench \
36
+ --pretrained_model_name_or_path ${model_dir} \
37
+ --save_path ${SAVE_DIR}/videoqabench \
38
+ --num_frames ${num_frames} \
39
+ --use_lora \
40
+ --lora_alpha ${lora_alpha} \
41
+ --weight_dir ${weight_dir} \
42
+ --test_ratio ${test_ratio} \
43
+ --conv_mode ${conv_mode}
44
+
45
+
46
+ conv_mode=eval_recaption
47
+ python -m tasks.eval.recaption.pllava_recaption \
48
+ --pretrained_model_name_or_path ${model_dir} \
49
+ --save_path ${SAVE_DIR}/recaption \
50
+ --num_frames ${num_frames} \
51
+ --use_lora \
52
+ --weight_dir ${weight_dir} \
53
+ --lora_alpha ${lora_alpha} \
54
+ --test_ratio ${test_ratio} \
55
+ --conv_mode ${conv_mode}
56
+
57
+
58
+ model_dir=MODELS/pllava-7b
59
+ weight_dir=MODELS/pllava-7b
60
+ SAVE_DIR=test_results/test_pllava_7b
61
+ lora_alpha=4
62
+
63
+ conv_mode=eval_vcgbench
64
+ python -m tasks.eval.vcgbench.pllava_eval_vcgbench \
65
+ --pretrained_model_name_or_path ${model_dir} \
66
+ --save_path ${SAVE_DIR}/vcgbench \
67
+ --num_frames ${num_frames} \
68
+ --use_lora \
69
+ --lora_alpha ${lora_alpha} \
70
+ --weight_dir ${weight_dir} \
71
+ --pooling_shape 16-12-12 \
72
+ --test_ratio ${test_ratio}
73
+
74
+
75
+ conv_mode=eval_mvbench
76
+ python -m tasks.eval.mvbench.pllava_eval_mvbench \
77
+ --pretrained_model_name_or_path ${model_dir} \
78
+ --save_path ${SAVE_DIR}/mvbench \
79
+ --use_lora \
80
+ --lora_alpha ${lora_alpha} \
81
+ --num_frames ${num_frames} \
82
+ --weight_dir ${weight_dir} \
83
+ --pooling_shape 16-12-12
84
+
85
+
86
+ conv_mode=eval_videoqabench
87
+ python -m tasks.eval.videoqabench.pllava_eval_videoqabench \
88
+ --pretrained_model_name_or_path ${model_dir} \
89
+ --save_path ${SAVE_DIR}/videoqabench \
90
+ --num_frames ${num_frames} \
91
+ --use_lora \
92
+ --lora_alpha ${lora_alpha} \
93
+ --weight_dir ${weight_dir} \
94
+ --test_ratio ${test_ratio}
95
+
96
+ conv_mode=eval_recaption
97
+ python -m tasks.eval.recaption.pllava_recaption \
98
+ --pretrained_model_name_or_path ${model_dir} \
99
+ --save_path ${SAVE_DIR}/recaption \
100
+ --num_frames ${num_frames} \
101
+ --use_lora \
102
+ --lora_alpha ${lora_alpha} \
103
+ --weight_dir ${weight_dir} \
104
+ --test_ratio ${test_ratio}
scripts/eval_yiprompt.sh ADDED
@@ -0,0 +1,53 @@
1
+ # export CUDA_VISIBLE_DEVICES=0,3,4,5,6,7
2
+ export OPENAI_API_KEY=...
3
+ num_frames=16
4
+ test_ratio=200
5
+
6
+ model_dir=MODELS/pllava-34b
7
+ weight_dir=MODELS/pllava-34b
8
+ SAVE_DIR=test_results/test_pllava_34b
9
+ lora_alpha=4
10
+ conv_mode=eval_vcg_llavanext
11
+ python -m tasks.eval.vcgbench.pllava_eval_vcgbench \
12
+ --pretrained_model_name_or_path ${model_dir} \
13
+ --save_path ${SAVE_DIR}/vcgbench \
14
+ --num_frames ${num_frames} \
15
+ --use_lora \
16
+ --lora_alpha ${lora_alpha} \
17
+ --weight_dir ${weight_dir} \
18
+ --pooling_shape 16-12-12 \
19
+ --test_ratio ${test_ratio} \
20
+ --conv_mode $conv_mode
21
+
22
+ conv_mode=eval_mvbench_llavanext
23
+ python -m tasks.eval.mvbench.pllava_eval_mvbench \
24
+ --pretrained_model_name_or_path ${model_dir} \
25
+ --save_path ${SAVE_DIR}/mvbench \
26
+ --use_lora \
27
+ --lora_alpha ${lora_alpha} \
28
+ --num_frames ${num_frames} \
29
+ --weight_dir ${weight_dir} \
30
+ --pooling_shape 16-12-12 \
31
+ --conv_mode $conv_mode
32
+
33
+ conv_mode=eval_videoqa_llavanext
34
+ python -m tasks.eval.videoqabench.pllava_eval_videoqabench \
35
+ --pretrained_model_name_or_path ${model_dir} \
36
+ --save_path ${SAVE_DIR}/videoqabench \
37
+ --num_frames ${num_frames} \
38
+ --use_lora \
39
+ --lora_alpha ${lora_alpha} \
40
+ --weight_dir ${weight_dir} \
41
+ --test_ratio ${test_ratio} \
42
+ --conv_mode ${conv_mode}
43
+
44
+ conv_mode=eval_recaption_llavanext
45
+ python -m tasks.eval.recaption.pllava_recaption \
46
+ --pretrained_model_name_or_path ${model_dir} \
47
+ --save_path ${SAVE_DIR}/recaption \
48
+ --num_frames ${num_frames} \
49
+ --use_lora \
50
+ --weight_dir ${weight_dir} \
51
+ --lora_alpha ${lora_alpha} \
52
+ --test_ratio ${test_ratio} \
53
+ --conv_mode $conv_mode
scripts/gallery.sh ADDED
@@ -0,0 +1,11 @@
1
+ export OPENAI_API_KEY=...
2
+ SAVE_DIR=${1:-"test_results"}
3
+
4
+ # # gallery view
5
+ # python -m tasks.eval.show_gallery \
6
+ # --root_dir ${SAVE_DIR}
7
+
8
+ # # compare view
9
+ python -m tasks.eval.demo.show_compare \
10
+ --root_dir ${SAVE_DIR}
11
+
scripts/train_pllava.sh ADDED
@@ -0,0 +1,34 @@
1
+ echo "PYTHONPATH: ${PYTHONPATH}"
2
+ which_python=$(which python)
3
+ echo "which python: ${which_python}"
4
+ export PYTHONPATH=${PYTHONPATH}:${which_python}
5
+ export PYTHONPATH=${PYTHONPATH}:.
6
+ echo "PYTHONPATH: ${PYTHONPATH}"
7
+
8
+ OUTPUT_DIR=./pllava_video_outputs/test_train_7b_reconstruct
9
+
10
+ # # Naive Env
11
+ # rm -rf ${OUTPUT_DIR}
12
+ pooling_shape=(16,12,12)
13
+ accelerate launch --main_process_port 6876 --config_file scripts/accel_config_multigpu.yaml tasks/train/train_pllava_nframe_accel.py \
14
+ tasks/train/config_pllava_nframe.py \
15
+ output_dir ${OUTPUT_DIR} \
16
+ train_corpus videochat2_video \
17
+ save_steps 10000 \
18
+ num_workers 8 \
19
+ num_frames 16 \
20
+ model.pooling_method avg \
21
+ model.repo_id llava-hf/llava-v1.6-vicuna-7b-hf \
22
+ model.use_lora True \
23
+ model.pooling_shape $pooling_shape \
24
+ optimizer.lr 2e-5 \
25
+ scheduler.epochs 3 \
26
+ scheduler.warmup_ratio 0.2 \
27
+ scheduler.min_lr_multi 0.25 \
28
+ scheduler.is_videochat2_custom True \
29
+ preprocess.mm_alone False \
30
+ preprocess.random_shuffle False \
31
+ preprocess.add_second_msg False \
32
+ train_corpus videochat2_instruction_debug
33
+
34
+
scripts/train_pllava_13b.sh ADDED
@@ -0,0 +1,50 @@
1
+ echo "PYTHONPATH: ${PYTHONPATH}"
2
+ which_python=$(which python)
3
+ echo "which python: ${which_python}"
4
+ export PYTHONPATH=${PYTHONPATH}:${which_python}
5
+ export PYTHONPATH=${PYTHONPATH}:.
6
+ echo "PYTHONPATH: ${PYTHONPATH}"
7
+
8
+ OUTPUT_DIR=./pllava_video_outputs/pllava_13b
9
+
10
+
11
+ pooling_shape=(16,12,12)
12
+ num_save_samples=80000
13
+ num_gpus=8
14
+ full_batch_size=128
15
+ batch_size=8
16
+ save_steps=$[$num_save_samples/($batch_size*$num_gpus)]
17
+ ckpt_steps=$[$save_steps/10]
18
+ gradient_accumulation_steps=$[$full_batch_size/($batch_size*$num_gpus)]
19
+ echo $batch_size
20
+ echo $gradient_accumulation_steps
21
+ repo_id=llava-hf/llava-v1.6-vicuna-13b-hf
22
+ accelerate launch --main_process_port 6876 --config_file scripts/accel_config_deepspeed_zero3_offload.yaml tasks/train/train_pllava_nframe_accel.py \
23
+ tasks/train/config_pllava_nframe.py \
24
+ output_dir ${OUTPUT_DIR} \
25
+ train_corpus videochat2_instruction_debug \
26
+ save_steps $save_steps \
27
+ ckpt_steps $ckpt_steps \
28
+ num_workers 8 \
29
+ num_frames 16 \
30
+ gradient_accumulation_steps $gradient_accumulation_steps \
31
+ batch_size $batch_size \
32
+ deepspeed True \
33
+ model.pooling_method avg \
34
+ model.use_lora True \
35
+ model.use_pooling True \
36
+ model.repo_id $repo_id \
37
+ gradient_checkpointing True \
38
+ preprocess.center_pad False \
39
+ preprocess.clip_transform False \
40
+ optimizer.lr 2e-5 \
41
+ scheduler.epochs 3 \
42
+ scheduler.warmup_ratio 0.2 \
43
+ scheduler.min_lr_multi 0.25 \
44
+ model.pooling_shape $pooling_shape \
45
+ scheduler.is_videochat2_custom True \
46
+ preprocess.mm_alone False \
47
+ preprocess.random_shuffle False \
48
+ preprocess.add_second_msg False
49
+
50
+
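The bash arithmetic above derives the step counts from the target sample, batch, and GPU counts. A quick Python recreation of those expressions with the 13B values, to make the resulting schedule explicit:

# Recomputing the derived values in scripts/train_pllava_13b.sh.
num_save_samples = 80000
num_gpus = 8
full_batch_size = 128
batch_size = 8

save_steps = num_save_samples // (batch_size * num_gpus)   # 1250
ckpt_steps = save_steps // 10                               # 125
grad_accum = full_batch_size // (batch_size * num_gpus)     # 2
print(save_steps, ckpt_steps, grad_accum)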
scripts/train_pllava_34b.sh ADDED
@@ -0,0 +1,50 @@
1
+ echo "PYTHONPATH: ${PYTHONPATH}"
2
+ which_python=$(which python)
3
+ echo "which python: ${which_python}"
4
+ export PYTHONPATH=${PYTHONPATH}:${which_python}
5
+ export PYTHONPATH=${PYTHONPATH}:.
6
+ echo "PYTHONPATH: ${PYTHONPATH}"
7
+
8
+ machine_rank=${1:-"0"} # machine rank
9
+
10
+ OUTPUT_DIR=./pllava_video_outputs/pllava_34b_videchat2-video
11
+
12
+ pooling_shape=(16,12,12)
13
+ num_save_samples=80000
14
+ num_gpus=8
15
+ full_batch_size=128
16
+ batch_size=4
17
+ save_steps=$[$num_save_samples/($batch_size*$num_gpus)]
18
+ ckpt_steps=$[$save_steps/10]
19
+ gradient_accumulation_steps=$[$full_batch_size/($batch_size*$num_gpus)]
20
+ echo $batch_size
21
+ echo $gradient_accumulation_steps
22
+ repo_id=llava-hf/llava-v1.6-34b-hf
23
+ accelerate launch --main_process_port 6876 --config_file scripts/accel_config_deepspeed_zero3_offload.yaml tasks/train/train_pllava_nframe_accel.py \
24
+ tasks/train/config_pllava_nframe_yiprompt.py \
25
+ output_dir ${OUTPUT_DIR} \
26
+ train_corpus videochat2_instruction_debug \
27
+ save_steps $save_steps \
28
+ ckpt_steps $ckpt_steps \
29
+ num_workers 8 \
30
+ num_frames 16 \
31
+ deepspeed True \
32
+ gradient_accumulation_steps $gradient_accumulation_steps \
33
+ batch_size $batch_size \
34
+ model.pooling_method avg \
35
+ model.use_lora True \
36
+ model.use_pooling True \
37
+ model.repo_id $repo_id \
38
+ gradient_checkpointing True \
39
+ preprocess.center_pad False \
40
+ preprocess.clip_transform True \
41
+ optimizer.lr 2e-5 \
42
+ scheduler.epochs 3 \
43
+ scheduler.warmup_ratio 0.2 \
44
+ scheduler.min_lr_multi 0.25 \
45
+ model.pooling_shape $pooling_shape \
46
+ scheduler.is_videochat2_custom True \
47
+ preprocess.image_token_index 64002 \
48
+ preprocess.mm_alone False \
49
+ preprocess.random_shuffle False \
50
+ preprocess.add_second_msg False