---
license: apache-2.0
datasets:
- Lin-Chen/ShareGPT4V
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
- zh
tags:
- llava
- vision-language
- llm
- lmm
---
<h2 align="center"> <a href="https://arxiv.org/abs/2402.14289">TinyLLaVA: A Framework of Small-scale Large Multimodal Models</a></h2>

<h5 align="center">

[![github](https://img.shields.io/badge/GitHub-TinyLLaVA-blue)](https://github.com/DLCV-BUAA/TinyLLaVABench) [![arXiv](https://img.shields.io/badge/Arxiv-2402.14289-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2402.14289) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/LICENSE)
</h5>

## &#x1F389; News
* **[2024.03.10]**  Base recipe out!
* **[2024.03.10]**  Finetune scripts out!
* **[2024.02.25]**  Updated evaluation scripts and docs!
* **[2024.02.25]**  Data descriptions out. Released TinyLLaVA-1.5B and TinyLLaVA-2.0B!
* **[2024.02.24]**  Example code on inference and model loading added!
* **[2024.02.23]**  Evaluation code and scripts released!
* **[2024.02.21]**  Created the [TinyLLaVABench](https://github.com/DLCV-BUAA/TinyLLavaBench) repository on GitHub!
* **[2024.02.21]**  Our paper, [TinyLLaVA: A Framework of Small-scale Large Multimodal Models](https://arxiv.org/abs/2402.14289), is out!
* **[2024.01.11]**  Our first model, [TinyLLaVA-1.4B](https://huggingface.co/bczhou/tiny-llava-v1-hf), is out!

## &#x231B; TODO
- [ ] Add support for Ollama and llama.cpp.
- [x] Developers' guide / How to build demo locally.
- [x] Training and custom finetuning docs.
- [x] Model Zoo descriptions.
- [x] Examples and inference.
- [x] Release code for training.
- [x] Add descriptions for evaluation.
- [x] Add descriptions for data preparation.
- [x] Release TinyLLaVA-1.5B and TinyLLaVA-2.0B.
- [x] Release TinyLLaVA-3.1B.
- [x] Release the evaluation code and weights (2024.2.23).
### &#x1F525; High performance, but with fewer parameters

- Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.

## Contents

- [Install](#x1f527-requirements-and-installation)
- [Model Zoo](#x1f433-model-zoo)
- [Demo](#demo)
- [Quick Start](#x1f527-quick-start)
- [Run Inference](#x1f527-run-inference)
- [Evaluation](#evaluation)
- [Data](#data-preparation)
- [Train](#train)
- [Custom Finetune](#custom-finetune)


## &#x1F527; Requirements and Installation

We recommend the following installation steps.

1. Clone this repository and navigate to the TinyLLaVABench folder
```bash
git clone https://github.com/DLCV-BUAA/TinyLLaVABench.git
cd TinyLLaVABench
```

2. Install the package
```Shell
conda create -n tinyllava python=3.10 -y
conda activate tinyllava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training
```Shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
### Upgrade to the latest code base

```Shell
git pull
pip install -e .

# if you see some import errors when you upgrade, please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir
```

## &#x1F433; Model Zoo
### Legacy Model
- [tiny-llava-hf](https://huggingface.co/bczhou/tiny-llava-v1-hf)

### Pretrained Models
- [TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B)
- [TinyLLaVA-2.0B](https://huggingface.co/bczhou/TinyLLaVA-2.0B)
- [TinyLLaVA-1.5B](https://huggingface.co/bczhou/TinyLLaVA-1.5B)

### Model Details
| Name          | LLM               | Checkpoint                                     | LLaVA-Bench-Wild | MME      | MMBench | MM-Vet | SQA-image | VQA-v2 | GQA   | TextVQA |
|---------------|-------------------|------------------------------------------------|------------------|----------|---------|--------|-----------|--------|-------|---------|
| TinyLLaVA-3.1B | Phi-2             | [TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B) | 75.8             | 1464.9   | 66.9    | 32.0   | 69.1      | 79.9   | 62.0  | 59.1    |
| TinyLLaVA-2.0B | StableLM-2-1.6B   | [TinyLLaVA-2.0B](https://huggingface.co/bczhou/TinyLLaVA-2.0B) | 66.4             | 1433.8     | 63.3    | 32.6   | 64.7      | 78.9   | 61.9  | 56.4    |
| TinyLLaVA-1.5B | TinyLlama         | [TinyLLaVA-1.5B](https://huggingface.co/bczhou/TinyLLaVA-1.5B) | 60.8             | 1276.5     | 55.2     | 25.8   | 60.3      | 76.9   | 60.3  | 51.7    |


## Demo

### Gradio Web Demo

Launch a local web demo by running:
```shell
python tinyllava/serve/app.py --model-path bczhou/TinyLLaVA-3.1B --model-name TinyLLaVA-3.1B
```

### CLI Inference

We also support running inference from the CLI. To use our model, run:
```shell
python -m tinyllava.serve.cli \
    --model-path bczhou/TinyLLaVA-3.1B \
    --image-file "./tinyllava/serve/examples/extreme_ironing.jpg" 
```


## &#x1F527; Quick Start

<details>
<summary>Load model</summary>
    
```Python
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-3.1B"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
```
</details>

## &#x1F527; Run Inference
Here's an example of running inference with [TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B):
<details>
<summary>Run Inference</summary>
    
```Python
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-3.1B"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": "phi",
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
```
</details>

### Important
We use a different `conv_mode` for each model. Replace the `conv_mode` in `args` according to the following table (a small lookup sketch follows it):
| model          	| conv_mode 	|
|----------------	|-----------	|
| TinyLLaVA-3.1B 	| phi       	|
| TinyLLaVA-2.0B 	| phi       	|
| TinyLLaVA-1.5B 	| v1        	|
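
As a convenience, here is a minimal lookup sketch that simply mirrors the table above, so the right `conv_mode` can be picked programmatically (the keys are the Hugging Face repo IDs used elsewhere in this card):

```python
# Minimal sketch: map each checkpoint to the conv_mode listed in the table above.
CONV_MODES = {
    "bczhou/TinyLLaVA-3.1B": "phi",
    "bczhou/TinyLLaVA-2.0B": "phi",
    "bczhou/TinyLLaVA-1.5B": "v1",
}

model_path = "bczhou/TinyLLaVA-1.5B"
conv_mode = CONV_MODES[model_path]  # -> "v1"; pass this as `conv_mode` in `args`
```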

## Evaluation
To ensure reproducibility, we evaluate the models with greedy decoding.

See [Evaluation.md](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/main/docs/Evaluation.md)

## Data Preparation

In our paper, we used two different datasets, the [LLaVA dataset](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#pretrain-feature-alignment) and the [ShareGPT4V dataset](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md), and compared how they differ. In this section, we provide information on data preparation.

### Pretraining Images
* LLaVA: The pretraining images of LLaVA are from the 558K subset of the LAION-CC-SBU dataset.
* ShareGPT4V: The pretraining images of ShareGPT4V are a mixture of the 558K LAION-CC-SBU subset, the SAM dataset, and the COCO dataset.

### Pretraining Annotations
* LLaVA: The pretraining annotations of LLaVA are [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
* ShareGPT4V: The pretraining annotations of ShareGPT4V are [here](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json).


### SFT Images & Annotations
The two SFT datasets are largely the same, except that the 23K detailed description data in LLaVA-1.5-SFT are replaced with detailed captions randomly sampled from the [100K ShareGPT4V data](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/sharegpt4v_instruct_gpt4-vision_cap100k.json).

### Download data

1. Download relevant images

- LAION-CC-SBU-558K: [images.zip](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip)
- COCO: This dataset is from the [COCO2017 challenge](https://cocodataset.org/). Download: [train2017](http://images.cocodataset.org/zips/train2017.zip)
- WebData: This dataset is curated by the [ShareGPT4V project](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V). Download: [images](https://drive.google.com/drive/folders/1tCUQ-sq6vdshZVkF0ZeF3K4eztkXJgax?usp=sharing). Only for academic usage.
- SAM: This dataset is collected by [Meta](https://ai.meta.com/datasets/segment-anything-downloads/). Download: [images](https://ai.meta.com/datasets/segment-anything-downloads/). We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K images from [here](https://drive.google.com/file/d/1dKumdOKSXtV7lIXdrG7jsIK_z2vZv2gs/view?usp=drive_link).
- GQA: [GQA project page](https://cs.stanford.edu/people/dorarad/gqa/about.html). Download: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
- OCR-VQA: [OCR-VQA project page](https://ocr-vqa.github.io/). Download: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing). We save all files as `.jpg`.
- TextVQA: [TextVQA project page](https://textvqa.org/). Download: [trainvalimages](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
- VisualGenome: [VisualGenome project page](https://homes.cs.washington.edu/~ranjay/visualgenome/index.html). Download: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)


2. Download relevant annotations

- LLaVA's pretraining annotations: [blip_laion_cc_sbu_558k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- LLaVA's SFT annotations: [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json)
- ShareGPT4V's pretraining annotations: [share-captioner_coco_lcs_sam_1246k_1107.json](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json)
- ShareGPT4V's SFT annotations: [sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json)


### Organize Data

Organize the image files and annotation files as follows in `path/to/your/data`:

```none
data
├── llava
│   ├── llava_pretrain
│   │   ├── images
│   │   ├── blip_laion_cc_sbu_558k.json
├── coco
│   ├── train2017
├── sam
│   ├── images
├── gqa
│   ├── images
├── ocr_vqa
│   ├── images
├── textvqa
│   ├── train_images
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
├── share_textvqa
│   ├── images
├── web-celebrity
│   ├── images
├── web-landmark
│   ├── images
├── wikiart
│   ├── images
├── text_files
│   ├── llava_v1_5_mix665k.json
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
```
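
To catch path mistakes early, you can optionally sanity-check the layout with a small script such as the sketch below (it only assumes the directory structure shown above and a data root of your choosing):

```python
from pathlib import Path

# Adjust this to wherever you placed the data (i.e. path/to/your/data).
DATA_ROOT = Path("path/to/your/data")

# Paths taken directly from the layout above.
EXPECTED = [
    "llava/llava_pretrain/images",
    "llava/llava_pretrain/blip_laion_cc_sbu_558k.json",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "text_files/llava_v1_5_mix665k.json",
    "text_files/share-captioner_coco_lcs_sam_1246k_1107.json",
    "text_files/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]

missing = [p for p in EXPECTED if not (DATA_ROOT / p).exists()]
if missing:
    print("Missing paths:\n" + "\n".join(missing))
else:
    print("All expected paths found.")
```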

## Train

**In this section, we describe the base recipe.**
### Hyperparameters
The hyperparameters used in pretraining and finetuning are provided below.

1. Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------| ---: | ---: | ---: |-----------:| ---: |
| TinyLLaVA-3.1B | 256 | 1e-3 | 1 |       3072 | 0 |

2. Finetuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------| ---: | ---: | ---: |-----------:| ---: |
| TinyLLaVA-3.1B | 128 | 2e-5 | 1 |       3072 | 0 |
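
For reference, the same numbers in a config-style form (these dictionaries are purely illustrative; the authoritative values are whatever the official training scripts linked below set):

```python
# Illustrative summary of the tables above for TinyLLaVA-3.1B; this is not a
# config file consumed by the training scripts.
PRETRAIN_HPARAMS = {
    "global_batch_size": 256,
    "learning_rate": 1e-3,
    "epochs": 1,
    "max_length": 3072,
    "weight_decay": 0.0,
}

FINETUNE_HPARAMS = {
    "global_batch_size": 128,
    "learning_rate": 2e-5,
    "epochs": 1,
    "max_length": 3072,
    "weight_decay": 0.0,
}
```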

### Pretrain

**Replace the paths with your own paths.**

Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/main/scripts/tiny_llava/pretrain.sh).
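
A typical invocation is sketched below; the exact variable names for the data, image-folder, and output paths are defined inside the script and should be edited there first:

```Shell
# After replacing the paths inside pretrain.sh with your own, launch it from the
# repository root.
bash scripts/tiny_llava/pretrain.sh
```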

### Finetune

**Replace the paths with your own paths.**

Training script with DeepSpeed ZeRO-3: [`finetune.sh`](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/main/scripts/tiny_llava/finetune.sh).

## Custom Finetune

Check out our custom finetuning with LoRA [here](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/dev/docs/CUTOM_FINETUNE.md).


#### - Prompt Template
The model supports multi-image and multi-prompt generation. When using the model, make sure to follow the correct prompt template (`USER: <image>xxx\nASSISTANT:`), where the `<image>` token is a placeholder special token for the image embeddings.
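
For example, a multi-image, multi-turn prompt can be assembled as a plain string following that template (a sketch only; the number of `<image>` tokens must match the number of images you pass to the processor or pipeline):

```python
# One <image> placeholder per image, following the USER/ASSISTANT template.
prompt = (
    "USER: <image>\nWhat is shown in the first image?\nASSISTANT: A red bus.\n"
    "USER: <image>\nHow does the second image differ?\nASSISTANT:"
)
```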

## Model Inference from `pipeline` and `transformers`
#### - Using `pipeline`:
Below we use the [`"bczhou/tiny-llava-v1-hf"`](https://huggingface.co/bczhou/tiny-llava-v1-hf) checkpoint.

```python
from transformers import pipeline
from PIL import Image
import requests
model_id = "bczhou/tiny-llava-v1-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0])
>>> {"generated_text': 'USER:  \nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT: The label 15 represents lava, which is a type of volcanic rock."}
```

#### - Using pure `transformers`:
Below is an example script to run generation in `float16` precision on a GPU device:

```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "bczhou/tiny-llava-v1-hf"
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```

## &#x270F; Citation

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil:.

```BibTeX
@misc{zhou2024tinyllava,
      title={TinyLLaVA: A Framework of Small-scale Large Multimodal Models}, 
      author={Baichuan Zhou and Ying Hu and Xi Weng and Junlong Jia and Jie Luo and Xien Liu and Ji Wu and Lei Huang},
      year={2024},
      eprint={2402.14289},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```


## ❀️ Community efforts
* Our codebase is built upon the [LLaVA](https://github.com/haotian-liu/LLaVA) project. Great work!
* Our project uses data from the [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V) project. Great work!