LinB203 committed on
Commit
0e023c7
1 Parent(s): 514c1e1
Files changed (4)
  1. LICENSE +201 -0
  2. TRAIN_AND_VALIDATE.md +279 -0
  3. app.py +257 -0
  4. pyproject.toml +36 -0
LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
TRAIN_AND_VALIDATE.md ADDED
@@ -0,0 +1,279 @@
+ ## Data preparation
+
+ ### data for training
+ - The image pretraining dataset is from [LLaVA](https://github.com/haotian-liu/LLaVA).
+ - The image tuning dataset is from [LLaVA](https://github.com/haotian-liu/LLaVA).
+ - The video pretraining dataset is from [Valley](https://github.com/RupertLuo/Valley).
+ - The video tuning dataset is from [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT).
+ - Download the training annotations from [Baidu Disk](https://pan.baidu.com/s/1BipI3_f--GRWqaWTGYp-Jg?pwd=wkl0), [Google Drive](https://drive.google.com/file/d/11-1NBXNeiNQE2wPbue1dFph_Na_EHRYG/view?usp=drive_link), or [Peking University Disk](https://disk.pku.edu.cn:443/link/84783AB54553DFA150C1C5E82C16EB29).
+
+ We also provide the processed data as follows.
+ <div align="center">
+ <table border="1" width="100%">
+ <tr align="center">
+ <th>Datasets</th><th>Baidu Disk</th>
+ </tr>
+ <tr align="center">
+ <td>Image pretraining</td><td><a href="">Link</a></td>
+ </tr>
+ <tr align="center">
+ <td>Image tuning</td><td><a href="">Link</a></td>
+ </tr>
+ <tr align="center">
+ <td>Video pretraining</td><td><a href="">Link</a></td>
+ </tr>
+ <tr align="center">
+ <td>Video tuning</td><td><a href="">Link</a></td>
+ </tr>
+ </table>
+ </div>
+
+ After downloading all of them, organize the data as follows in `DATA_ROOT`; a sketch for assembling the layout follows the tree.
+
+ ```Shell
+ DATA_ROOT
+ ├── llava_image
+ ├── llava_image_tune
+ ├── valley
+ └── videochatgpt_tune
+ ```
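+
+ A minimal sketch for assembling the layout (assuming the four archives above have already been downloaded and extracted into the current directory; adjust names and paths to whatever you actually extracted):
+
+ ```Shell
+ DATA_ROOT=/path/to/DATA_ROOT
+ mkdir -p "$DATA_ROOT"
+ # move the extracted folders into the layout shown above
+ mv llava_image llava_image_tune valley videochatgpt_tune "$DATA_ROOT"/
+ ```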
+
+ ### data for validating
+ - For image, follow LLaVA's instructions. ***You MUST first download [eval.zip](https://drive.google.com/file/d/1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy/view?usp=sharing)***. It contains custom annotations, scripts, and the prediction files from LLaVA v1.5. Extract it to `eval`. This also provides the general structure for all datasets.
+ - For video, videos and annotations can be downloaded from Video-ChatGPT. We also provide the processed data as follows; a sketch for slotting it into place follows the directory tree below.
+ <div align="center">
+ <table border="1" width="100%">
+ <tr align="center">
+ <th>Datasets</th><th>Baidu Disk</th><th>Google Drive</th><th>Peking University Disk</th>
+ </tr>
+ <tr align="center">
+ <td>Activitynet_Zero_Shot_QA</td><td><a href="https://pan.baidu.com/s/1d_AVx9Mz_57nA3exhQZGyA?pwd=9amr">Link</a></td><td>-</td><td>-</td>
+ </tr>
+ <tr align="center">
+ <td>MSRVTT_Zero_Shot_QA</td><td><a href="https://pan.baidu.com/s/1QHUtwHXm4Vc-Wc12XFCFsA?pwd=1rj8">Link</a></td><td><a href="https://drive.google.com/file/d/1yXh9lz7flQ5Ui2IRSd6Qi6RqSEeUJwl3/view?usp=drive_link">Link</a></td><td>-</td>
+ </tr>
+ <tr align="center">
+ <td>MSVD_Zero_Shot_QA</td><td><a href="https://pan.baidu.com/s/1PJSHkjHG2BPl_ddUnBj9AA?pwd=jj34">Link</a></td><td><a href="https://drive.google.com/file/d/1_q4eiSdb7i8P3Hmh4lCfgY1uBGyzU_7X/view?usp=drive_link">Link</a></td><td><a href="https://disk.pku.edu.cn:443/link/8B0D01747D8AA65534820B7E60CBFEFC">Link</a></td>
+ </tr>
+ <tr align="center">
+ <td>TGIF_Zero_Shot_QA</td><td><a href="https://pan.baidu.com/s/11ubtWbTtubyBmN9UPvAyow?pwd=98yr">Link</a></td><td><a href="https://drive.google.com/file/d/1so6L9rg_gdC8Segur7rKML-ffd4Ix_I6/view?usp=drive_link">Link</a></td><td><a href="https://disk.pku.edu.cn:443/link/B9AB387EFE8817158F181FF3D7A97163">Link</a></td>
+ </tr>
+ </table>
+ </div>
+
+ After downloading all of them, organize the data as follows in `eval`.
+
+ ```Shell
+ eval
+ ├── GPT_Zero_Shot_QA
+ │   ├── Activitynet_Zero_Shot_QA
+ │   ├── MSRVTT_Zero_Shot_QA
+ │   ├── MSVD_Zero_Shot_QA
+ │   └── TGIF_Zero_Shot_QA
+ ├── gqa
+ │   ├── answers
+ │   ├── data
+ │   └── llava_gqa_testdev_balanced.jsonl
+ ├── llava-bench-in-the-wild
+ │   ├── answers
+ │   ├── answers_gpt4.jsonl
+ │   ├── bard_0718.jsonl
+ │   ├── bing_chat_0629.jsonl
+ │   ├── context.jsonl
+ │   ├── images
+ │   ├── questions.jsonl
+ │   ├── README.md
+ │   └── reviews
+ ├── mmbench
+ │   ├── answers
+ │   ├── answers_upload
+ │   ├── mmbench_dev_20230712.tsv
+ │   └── mmbench_dev_en_20231003.tsv
+ ├── MME
+ │   ├── answers
+ │   ├── convert_answer_to_mme.py
+ │   └── llava_mme.jsonl
+ ├── mm-vet
+ │   ├── answers
+ │   ├── bard_set.json
+ │   ├── convert_answers.py
+ │   ├── images
+ │   ├── llava-mm-vet.jsonl
+ │   ├── mm-vet.json
+ │   └── results
+ ├── pope
+ │   ├── answers
+ │   ├── coco
+ │   ├── llava_pope_test.jsonl
+ │   └── val2014
+ ├── scienceqa
+ │   ├── answers
+ │   ├── images
+ │   ├── llava_test_CQM-A.json
+ │   ├── pid_splits.json
+ │   └── problems.json
+ ├── seed_bench
+ │   ├── answers
+ │   ├── answers_upload
+ │   ├── extract_video_frames.py
+ │   └── llava-seed-bench.jsonl
+ ├── textvqa
+ │   ├── answers
+ │   ├── llava_textvqa_val_v051_ocr.jsonl
+ │   ├── TextVQA_0.5.1_val.json
+ │   └── train_images
+ ├── vizwiz
+ │   ├── answers
+ │   ├── answers_upload
+ │   ├── llava_test.jsonl
+ │   ├── test
+ │   ├── test.json
+ │   ├── train.json
+ │   └── val.json
+ └── vqav2
+     ├── answers
+     ├── answers_upload
+     ├── llava_vqav2_mscoco_test2015.jsonl
+     ├── llava_vqav2_mscoco_test-dev2015.jsonl
+     └── test2015
+ ```
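+
+ A minimal sketch for slotting the video-QA downloads into place (assuming each archive extracts to a folder named as in the table; adjust to the actual archive layout):
+
+ ```Shell
+ mkdir -p eval/GPT_Zero_Shot_QA
+ mv Activitynet_Zero_Shot_QA MSRVTT_Zero_Shot_QA MSVD_Zero_Shot_QA TGIF_Zero_Shot_QA eval/GPT_Zero_Shot_QA/
+ ```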
+
+ ## Training
+ Specify your `DATA_ROOT` according to the data preparation; a sample invocation follows the list.
+ - Stage 1 pretraining script: [pretrain.sh](scripts/v1_5/pretrain.sh).
+ - Stage 2 tuning script: [finetune.sh](scripts/v1_5/finetune.sh).
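+
+ For example (a sketch, not a verbatim interface: the scripts may hard-code their paths internally, in which case edit `DATA_ROOT` inside them instead of exporting it):
+
+ ```Shell
+ export DATA_ROOT=/path/to/DATA_ROOT   # hypothetical; check how the scripts read the path
+ bash scripts/v1_5/pretrain.sh         # stage 1: pretraining
+ bash scripts/v1_5/finetune.sh         # stage 2: instruction tuning
+ ```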
+
+ ## Validating
+ Our image validation code comes from LLaVA and our video validation code comes from Video-ChatGPT; thanks to both projects for their contributions!
+
+ You can refer to the official repositories for validation, but we also provide [off-the-shelf](scripts/v1_5/eval) scripts.
+
+ ### MSRVTT-QA
+ 1. Run inference to get the results.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msrvtt.sh
+ ```
+
+ 2. GPT-Assistant evaluation.
+ ```Shell
+ bash scripts/v1_5/eval/eval_qa_msrvtt.sh
+ ```
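+
+ The GPT-assisted scripts (here and in the sections below) call the OpenAI API to score answers. The snippet below assumes the key is read from the environment; check the script header for the exact variable or argument it expects:
+
+ ```Shell
+ export OPENAI_API_KEY=sk-...   # hypothetical variable name; the script may take the key as an argument
+ bash scripts/v1_5/eval/eval_qa_msrvtt.sh
+ ```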
+
+ ### MSVD-QA
+ 1. Run inference to get the results.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msvd.sh
+ ```
+
+ 2. GPT-Assistant evaluation.
+ ```Shell
+ bash scripts/v1_5/eval/eval_qa_msvd.sh
+ ```
+
+ ### TGIF-QA
+ 1. Run inference to get the results.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_tgif.sh
+ ```
+
+ 2. GPT-Assistant evaluation.
+ ```Shell
+ bash scripts/v1_5/eval/eval_qa_tgif.sh
+ ```
+
+ ### ActivityNet-QA
+ 1. Run inference to get the results.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_activitynet.sh
+ ```
+
+ 2. GPT-Assistant evaluation.
+ ```Shell
+ bash scripts/v1_5/eval/eval_qa_activitynet.sh
+ ```
+
+ ### VQAv2
+
+ 1. Download [`test2015`](http://images.cocodataset.org/zips/test2015.zip) and put it under `eval/vqav2`.
+ 2. Multi-GPU inference.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_vqav2.sh
+ ```
+ 3. Submit the results in `eval/vqav2/answers_upload` to the [evaluation server](https://eval.ai/web/challenges/challenge-page/830/my-submission).
+
+ ### GQA
+
+ 1. Download the data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html) and put it under `eval/gqa/data`.
+ 2. Multi-GPU inference.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_gqa.sh
+ ```
+
+ ### VizWiz
+
+ 1. Download [`test.json`](https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip) and extract [`test.zip`](https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip) to `test`. Put them under `eval/vizwiz`.
+ 2. Single-GPU inference.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_vizwiz.sh
+ ```
+ 3. Submit the results in `eval/vizwiz/answers_upload` to the [evaluation server](https://eval.ai/web/challenges/challenge-page/1911/my-submission).
+
+ ### ScienceQA
+
+ 1. Under `eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA).
+ 2. Single-GPU inference and evaluation.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_sqa.sh
+ ```
+
+ ### TextVQA
+
+ 1. Download [`TextVQA_0.5.1_val.json`](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip) and extract them to `eval/textvqa`.
+ 2. Single-GPU inference and evaluation.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_textvqa.sh
+ ```
+
+ ### POPE
+
+ 1. Download `coco` from [POPE](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco) and put it under `eval/pope`.
+ 2. Single-GPU inference and evaluation.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_pope.sh
+ ```
+
+ ### MMBench
+
+ 1. Download [`mmbench_dev_20230712.tsv`](https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv) and put it under `eval/mmbench`.
+ 2. Single-GPU inference.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmbench.sh
+ ```
+ 3. Submit the results in `eval/mmbench/answers_upload/mmbench_dev_20230712` to the [evaluation server](https://opencompass.org.cn/leaderboard-multimodal).
+
+ ### LLaVA-Bench-in-the-Wild
+
+ 1. Extract the contents of [`llava-bench-in-the-wild`](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) to `eval/llava-bench-in-the-wild`.
+ 2. Single-GPU inference and evaluation.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_llavabench.sh
+ ```
+
+ ### MM-Vet
+
+ 1. Extract [`mm-vet.zip`](https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip) to `eval/mm-vet`.
+ 2. Single-GPU inference.
+ ```Shell
+ CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmvet.sh
+ ```
app.py ADDED
@@ -0,0 +1,257 @@
+ import shutil
+ import subprocess
+
+ import torch
+ import gradio as gr
+ from fastapi import FastAPI
+ import os
+ from PIL import Image
+ import tempfile
+ from decord import VideoReader, cpu
+ from transformers import TextStreamer
+
+ from llava.constants import DEFAULT_X_TOKEN, X_TOKEN_INDEX
+ from llava.conversation import conv_templates, SeparatorStyle, Conversation
+ from llava.serve.gradio_utils import Chat, tos_markdown, learn_more_markdown, title_markdown, block_css
+
+
+ def save_image_to_local(image):
+     filename = os.path.join('temp', next(tempfile._get_candidate_names()) + '.jpg')
+     image = Image.open(image)
+     image.save(filename)
+     # print(filename)
+     return filename
+
+
+ def save_video_to_local(video_path):
+     filename = os.path.join('temp', next(tempfile._get_candidate_names()) + '.mp4')
+     shutil.copyfile(video_path, filename)
+     return filename
+
+
+ # `state` mirrors what the chatbot displays; `state_` tracks the model-side
+ # conversation; images_tensor is a pair [list of tensors, list of modality tags].
+ def generate(image1, video, textbox_in, first_run, state, state_, images_tensor):
+     flag = 1
+     if not textbox_in:
+         # empty textbox: regenerate path, reuse the last user message
+         if len(state_.messages) > 0:
+             textbox_in = state_.messages[-1][1]
+             state_.messages.pop(-1)
+             flag = 0
+         else:
+             return "Please enter instruction"
+
+     image1 = image1 if image1 else "none"
+     video = video if video else "none"
+     # assert not (os.path.exists(image1) and os.path.exists(video))
+
+     # first call: the gr.State() values arrive uninitialized, so build fresh conversations
+     if type(state) is not Conversation:
+         state = conv_templates[conv_mode].copy()
+         state_ = conv_templates[conv_mode].copy()
+         images_tensor = [[], []]
+
+     first_run = False if len(state.messages) > 0 else True
+
+     text_en_in = textbox_in.replace("picture", "image")
+
+     # encode whichever inputs exist
+     image_processor = handler.image_processor
+     if os.path.exists(image1) and not os.path.exists(video):
+         tensor = image_processor.preprocess(image1, return_tensors='pt')['pixel_values'][0]
+         # print(tensor.shape)
+         tensor = tensor.to(handler.model.device, dtype=dtype)
+         images_tensor[0] = images_tensor[0] + [tensor]
+         images_tensor[1] = images_tensor[1] + ['image']
+     video_processor = handler.video_processor
+     if not os.path.exists(image1) and os.path.exists(video):
+         tensor = video_processor(video, return_tensors='pt')['pixel_values'][0]
+         # print(tensor.shape)
+         tensor = tensor.to(handler.model.device, dtype=dtype)
+         images_tensor[0] = images_tensor[0] + [tensor]
+         images_tensor[1] = images_tensor[1] + ['video']
+     if os.path.exists(image1) and os.path.exists(video):
+         # both inputs present: process the video first, then the image,
+         # matching the token order in the prompt built below
+         tensor = video_processor(video, return_tensors='pt')['pixel_values'][0]
+         tensor = tensor.to(handler.model.device, dtype=dtype)
+         images_tensor[0] = images_tensor[0] + [tensor]
+         images_tensor[1] = images_tensor[1] + ['video']
+
+         tensor = image_processor.preprocess(image1, return_tensors='pt')['pixel_values'][0]
+         tensor = tensor.to(handler.model.device, dtype=dtype)
+         images_tensor[0] = images_tensor[0] + [tensor]
+         images_tensor[1] = images_tensor[1] + ['image']
+
+     # prepend the modality placeholder tokens expected by the model
+     if os.path.exists(image1) and not os.path.exists(video):
+         text_en_in = DEFAULT_X_TOKEN['IMAGE'] + '\n' + text_en_in
+     if not os.path.exists(image1) and os.path.exists(video):
+         text_en_in = DEFAULT_X_TOKEN['VIDEO'] + '\n' + text_en_in
+     if os.path.exists(image1) and os.path.exists(video):
+         text_en_in = DEFAULT_X_TOKEN['VIDEO'] + '\n' + text_en_in + '\n' + DEFAULT_X_TOKEN['IMAGE']
+
+     text_en_out, state_ = handler.generate(images_tensor, text_en_in, first_run=first_run, state=state_)
+     state_.messages[-1] = (state_.roles[1], text_en_out)
+
+     text_en_out = text_en_out.split('#')[0]  # keep only the text before any trailing '#' marker
+     textbox_out = text_en_out
+
+     show_images = ""
+     if os.path.exists(image1):
+         filename = save_image_to_local(image1)
+         show_images += f'<img src="./file={filename}" style="display: inline-block;width: 250px;max-height: 400px;">'
+     if os.path.exists(video):
+         filename = save_video_to_local(video)
+         show_images += f'<video controls playsinline width="500" style="display: inline-block;" src="./file={filename}"></video>'
+
+     if flag:
+         state.append_message(state.roles[0], textbox_in + "\n" + show_images)
+     state.append_message(state.roles[1], textbox_out)
+
+     return (state, state_, state.to_gradio_chatbot(), False, gr.update(value=None, interactive=True), images_tensor, gr.update(value=image1 if os.path.exists(image1) else None, interactive=True), gr.update(value=video if os.path.exists(video) else None, interactive=True))
+
+
+ def regenerate(state, state_):
+     state.messages.pop(-1)
+     state_.messages.pop(-1)
+     if len(state.messages) > 0:
+         return state, state_, state.to_gradio_chatbot(), False
+     return (state, state_, state.to_gradio_chatbot(), True)
+
+
+ def clear_history(state, state_):
+     state = conv_templates[conv_mode].copy()
+     state_ = conv_templates[conv_mode].copy()
+     return (gr.update(value=None, interactive=True),
+             gr.update(value=None, interactive=True),
+             gr.update(value=None, interactive=True),
+             True, state, state_, state.to_gradio_chatbot(), [[], []])
+
+
+ conv_mode = "llava_v1"
+ model_path = 'LanguageBind/Video-LLaVA-7B'
+ device = 'cuda'
+ load_8bit = False
+ load_4bit = True
+ dtype = torch.float16
+ handler = Chat(model_path, conv_mode=conv_mode, load_8bit=load_8bit, load_4bit=load_4bit, device=device)
+ # handler.model.to(dtype=dtype)
+ if not os.path.exists("temp"):
+     os.makedirs("temp")
+
+ app = FastAPI()
+
+ textbox = gr.Textbox(
+     show_label=False, placeholder="Enter text and press ENTER", container=False
+ )
+ with gr.Blocks(title='Video-LLaVA🚀', theme=gr.themes.Default(), css=block_css) as demo:
+     gr.Markdown(title_markdown)
+     state = gr.State()
+     state_ = gr.State()
+     first_run = gr.State()
+     images_tensor = gr.State()
+
+     with gr.Row():
+         with gr.Column(scale=3):
+             image1 = gr.Image(label="Input Image", type="filepath")
+             video = gr.Video(label="Input Video")
+
+             cur_dir = os.path.dirname(os.path.abspath(__file__))
+             gr.Examples(
+                 examples=[
+                     [
+                         f"{cur_dir}/examples/extreme_ironing.jpg",
+                         "What is unusual about this image?",
+                     ],
+                     [
+                         f"{cur_dir}/examples/waterview.jpg",
+                         "What are the things I should be cautious about when I visit here?",
+                     ],
+                     [
+                         f"{cur_dir}/examples/glove.jpg",
+                         "What happens when the glove drops?",
+                     ],
+                     [
+                         f"{cur_dir}/examples/desert.jpg",
+                         "If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert?",
+                     ],
+                 ],
+                 inputs=[image1, textbox],
+             )
+
+         with gr.Column(scale=7):
+             chatbot = gr.Chatbot(label="Video-LLaVA", bubble_full_width=True).style(height=850)
+             with gr.Row():
+                 with gr.Column(scale=8):
+                     textbox.render()
+                 with gr.Column(scale=1, min_width=50):
+                     submit_btn = gr.Button(
+                         value="Send", variant="primary", interactive=True
+                     )
+             with gr.Row(elem_id="buttons") as button_row:
+                 upvote_btn = gr.Button(value="👍 Upvote", interactive=True)
+                 downvote_btn = gr.Button(value="👎 Downvote", interactive=True)
+                 flag_btn = gr.Button(value="⚠️ Flag", interactive=True)
+                 # stop_btn = gr.Button(value="⏹️ Stop Generation", interactive=False)
+                 regenerate_btn = gr.Button(value="🔄 Regenerate", interactive=True)
+                 clear_btn = gr.Button(value="🗑️ Clear history", interactive=True)
+
+     with gr.Row():
+         gr.Examples(
+             examples=[
+                 [
+                     f"{cur_dir}/examples/sample_img_22.png",
+                     f"{cur_dir}/examples/sample_demo_22.mp4",
+                     "Are the instruments in the pictures used in the video?",
+                 ],
+                 [
+                     f"{cur_dir}/examples/sample_img_13.png",
+                     f"{cur_dir}/examples/sample_demo_13.mp4",
+                     "Does the flag in the image appear in the video?",
+                 ],
+                 [
+                     f"{cur_dir}/examples/sample_img_8.png",
+                     f"{cur_dir}/examples/sample_demo_8.mp4",
+                     "Are the image and the video depicting the same place?",
+                 ],
+             ],
+             inputs=[image1, video, textbox],
+         )
+         gr.Examples(
+             examples=[
+                 [
+                     f"{cur_dir}/examples/sample_demo_1.mp4",
+                     "Why is this video funny?",
+                 ],
+                 [
+                     f"{cur_dir}/examples/sample_demo_3.mp4",
+                     "Can you identify any safety hazards in this video?"
+                 ],
+                 [
+                     f"{cur_dir}/examples/sample_demo_9.mp4",
+                     "Describe the video.",
+                 ],
+                 [
+                     f"{cur_dir}/examples/sample_demo_22.mp4",
+                     "Describe the activity in the video.",
+                 ],
+             ],
+             inputs=[video, textbox],
+         )
+     gr.Markdown(tos_markdown)
+     gr.Markdown(learn_more_markdown)
+
+     submit_btn.click(generate, [image1, video, textbox, first_run, state, state_, images_tensor],
+                      [state, state_, chatbot, first_run, textbox, images_tensor, image1, video])
+
+     regenerate_btn.click(regenerate, [state, state_], [state, state_, chatbot, first_run]).then(
+         generate, [image1, video, textbox, first_run, state, state_, images_tensor], [state, state_, chatbot, first_run, textbox, images_tensor, image1, video])
+
+     clear_btn.click(clear_history, [state, state_],
+                     [image1, video, textbox, first_run, state, state_, chatbot, images_tensor])
+
+ # app = gr.mount_gradio_app(app, demo, path="/")
+ demo.launch()
+
+
+ # uvicorn llava.serve.gradio_web_server:app
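+
+ # Usage sketch: running `python app.py` starts the Gradio UI via demo.launch().
+ # To serve through FastAPI instead, uncomment the gr.mount_gradio_app(...) line,
+ # remove demo.launch(), and run e.g. `uvicorn app:app`; the module path in the
+ # comment above refers to the original repo layout, not this standalone file.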
pyproject.toml ADDED
@@ -0,0 +1,36 @@
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "llava"
+ version = "1.1.3"
+ description = "Towards GPT-4 like large language and visual assistant."
+ readme = "README.md"
+ requires-python = ">=3.8"
+ classifiers = [
+     "Programming Language :: Python :: 3",
+     "License :: OSI Approved :: Apache Software License",
+ ]
+ dependencies = [
+     "torch==2.0.1", "torchvision==0.15.2",
+     "transformers==4.31.0", "tokenizers>=0.12.1,<0.14", "sentencepiece==0.1.99", "shortuuid",
+     "accelerate==0.21.0", "peft==0.4.0", "bitsandbytes==0.41.0",
+     "pydantic<2,>=1", "markdown2[all]", "numpy", "scikit-learn==1.2.2",
+     "gradio==3.35.2", "gradio_client==0.2.9",
+     "requests", "httpx==0.24.0", "uvicorn", "fastapi",
+     "einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13",
+ ]
+
+ [project.optional-dependencies]
+ train = ["deepspeed==0.9.5", "ninja", "wandb"]
+
+ [project.urls]
+ "Homepage" = "https://llava-vl.github.io"
+ "Bug Tracker" = "https://github.com/haotian-liu/LLaVA/issues"
+
+ [tool.setuptools.packages.find]
+ exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]
+
+ [tool.wheel]
+ exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]
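+
+ # Typical installation with this file, from the repo root (a sketch):
+ #   pip install -e .            # core runtime dependencies
+ #   pip install -e ".[train]"   # adds the training extras: deepspeed, ninja, wandb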