alessandro trinca tornidor committed
Commit f5f1590 · Parent: a170680

[feat] update README.md to prepare merge

Files changed (1):
  1. README.md +314 -1
README.md CHANGED

pinned: false
---

(Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference)

[![Gradio](https://img.shields.io/badge/Gradio-Online%20Demo-blue)](http://103.170.5.190:7860/)
[![Open in OpenXLab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/openxlab-app/LISA)

# LISA: Reasoning Segmentation via Large Language Model

<font size=7><div align='center'><b>LISA</b>: Large <b>L</b>anguage <b>I</b>nstructed <b>S</b>egmentation <b>A</b>ssistant</div></font>

<font size=7><div align='center'><a href="https://arxiv.org/pdf/2308.00692.pdf">**Paper**</a> | <a href="https://huggingface.co/xinlai">**Models**</a> | [**Training**](#training) | [**Inference**](#inference) | [**Local Deployment**](#deployment) | [**Dataset**](#dataset) | <a href="http://103.170.5.190:7860/">**Online Demo**</a></div></font>

<!-- <p align="center"> <img src="resources/imgs/teaser.jpg" width="100%"> </p> -->

<table class="center">
<tr>
<td style="text-align:center;"><b>Input</b></td>
<td style="text-align:center;"><b>Output</b></td>
<td style="text-align:center;"><b>Input</b></td>
<td style="text-align:center;"><b>Output</b></td>
</tr>
<tr>
<td><img src="resources/imgs/obama.jpg"></td>
<td><img src="vis_output/obama.jpg"></td>
<td><img src="resources/imgs/trump.jpg"></td>
<td><img src="vis_output/trump.jpg"></td>
</tr>
<tr>
<td width=25% style="text-align:center;color:gray;">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain the reason."</td>
<td width=25% style="text-align:center;">"Sure, the segmentation result is [SEG]. The President of the United States in the image is <ins>President Obama.</ins>"</td>
<td width=25% style="text-align:center;color:gray;">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain why."</td>
<td width=25% style="text-align:center;">"Sure, [SEG]. In the image, the President of the United States is <ins>President Trump.</ins>"</td>
</tr>
<tr>
<td><img src="resources/imgs/stand_higher.jpg"></td>
<td><img src="vis_output/stand_higher.jpg"></td>
<td><img src="resources/imgs/camera_lens.jpg"></td>
<td><img src="vis_output/camera_lens.jpg"></td>
</tr>
<tr>
<td width=25% style="text-align:center;color:gray;">"What can <ins>make the woman stand higher</ins>? Please output segmentation mask and explain why."</td>
<td width=25% style="text-align:center;">"Sure, [SEG]. The woman is <ins>standing higher by using a ladder</ins>..."</td>
<td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the camera lens that is more suitable for photographing nearby objects</ins> in this image?"</td>
<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
</tr>
<tr>
<td><img src="resources/imgs/dog_with_horn.jpg"></td>
<td><img src="vis_output/dog_with_horn.jpg"></td>
<td><img src="resources/imgs/wash_hands.jpg"></td>
<td><img src="vis_output/wash_hands.jpg"></td>
</tr>
<tr>
<td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the unusual part</ins> in this image and explain why?"</td>
<td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the unusual part is <ins>the dog wearing a reindeer antler headband</ins>..."</td>
<td width=25% style="text-align:center;color:gray;">"Where to <ins>wash hands</ins> in this image? Please output segmentation mask."</td>
<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
</tr>
<tr>
<td><img src="resources/imgs/jackma.jpg"></td>
<td><img src="vis_output/jackma.jpg"></td>
<td><img src="resources/imgs/blackpink.jpg"></td>
<td><img src="vis_output/blackpink.jpg"></td>
</tr>
<tr>
<td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the founder of Alibaba</ins> in this image and explain why?"</td>
<td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is <ins>Jack Ma</ins>, the co-founder of Alibaba Group..."</td>
<td width=25% style="text-align:center;color:gray;">"Please segment <ins>Lisa</ins> in this figure."</td>
<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
</tr>
</table>

<p align="center"> <img src="resources/imgs/fig_overview.jpg" width="100%"> </p>

## News
- [x] [2023.8.30] Released three new models: [LISA-7B-v1](https://huggingface.co/xinlai/LISA-7B-v1), [LISA-7B-v1-explanatory](https://huggingface.co/xinlai/LISA-7B-v1-explanatory), and [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory). Welcome to check them out!
- [x] [2023.8.23] Refactored the code and released a new model, [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1). Welcome to check it out!
- [x] [2023.8.9] Training code is released!
- [x] [2023.8.4] [Online Demo](http://103.170.5.190:7860/) is released!
- [x] [2023.8.4] The [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory) model are released!
- [x] [2023.8.3] Inference code and the [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) model are released. Welcome to check them out!
- [x] [2023.8.2] The [paper](https://arxiv.org/pdf/2308.00692.pdf) is released and the GitHub repo is created.

**LISA: Reasoning Segmentation via Large Language Model [[Paper](https://arxiv.org/abs/2308.00692)]** <br />
[Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ&hl=zh-CN),
[Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ&hl=en),
[Yukang Chen](https://scholar.google.com/citations?user=6p0ygKUAAAAJ&hl=en),
[Yanwei Li](https://scholar.google.com/citations?user=I-UCPPcAAAAJ&hl=zh-CN),
[Yuhui Yuan](https://scholar.google.com/citations?user=PzyvzksAAAAJ&hl=en),
[Shu Liu](https://scholar.google.com.hk/citations?user=BUEDUFkAAAAJ&hl=zh-CN),
[Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ&hl=en)<br />

## Abstract
In this work, we propose a new segmentation task --- ***reasoning segmentation***. The task is designed to output a segmentation mask given a complex and implicit query text. We establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks.
For more details, please refer to the [paper](https://arxiv.org/abs/2308.00692).

## Highlights
**LISA** unlocks new segmentation capabilities for multi-modal LLMs, and can handle cases involving:
1. complex reasoning;
2. world knowledge;
3. explanatory answers;
4. multi-turn conversation.

**LISA** also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model on merely 239 reasoning segmentation image-instruction pairs yields a further performance boost.

## Experimental results
<p align="center"> <img src="resources/imgs/table1.jpg" width="80%"> </p>

## Installation
```
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
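
After installing, a quick sanity check can confirm that PyTorch sees your GPU and that `flash-attn` is importable (a minimal sketch; it assumes `requirements.txt` pulls in PyTorch):
```python
# Post-install sanity check (assumes requirements.txt installed PyTorch).
import importlib.util

import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# flash-attn was installed above; check that the package is importable.
if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn not found; re-run: pip install flash-attn --no-build-isolation")
else:
    print("flash-attn is installed.")
```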

## Training
### Training Data Preparation
The training data consists of 4 types of data:

1. Semantic segmentation datasets: [ADE20K](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip), [COCO-Stuff](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip), [Mapillary](https://www.mapillary.com/dataset/vistas), [PACO-LVIS](https://github.com/facebookresearch/paco/tree/main#dataset-setup), [PASCAL-Part](https://github.com/facebookresearch/VLPart/tree/main/datasets#pascal-part), [COCO Images](http://images.cocodataset.org/zips/train2017.zip)

   Note: for COCO-Stuff we use the annotation file `stuffthingmaps_trainval2017.zip`, and we only use the PACO-LVIS part of PACO. COCO images should be placed in the `dataset/coco/` directory.

2. Referring segmentation datasets: [refCOCO](https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip), [refCOCO+](https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip), [refCOCOg](https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip), [refCLEF](https://web.archive.org/web/20220413011817/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip) ([saiapr_tc-12](https://web.archive.org/web/20220515000000/http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip))

   Note: the original download links for the refCOCO-series data are down, so we have replaced them with new ones. If the download is very slow or unstable, we also provide a [OneDrive link](https://mycuhk-my.sharepoint.com/:f:/g/personal/1155154502_link_cuhk_edu_hk/Em5yELVBvfREodKC94nOFLoBLro_LPxsOxNV44PHRWgLcA?e=zQPjsc). **You must also follow the rules that the original datasets require.**

3. Visual Question Answering dataset: [LLaVA-Instruct-150k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json)

4. Reasoning segmentation dataset: [ReasonSeg](https://github.com/dvlab-research/LISA#dataset)

Download them from the above links and organize them as follows.

```
├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── mapillary
│   │   ├── config_v2.0.json
│   │   ├── testing
│   │   ├── training
│   │   └── validation
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
```
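
Before launching training, it can help to confirm that the layout above is in place. Here is a minimal, illustrative check (the sub-paths are taken from the tree above, and `./dataset` matches the default `--dataset_dir` used below):
```python
# Illustrative check that the expected dataset sub-paths exist.
from pathlib import Path

DATASET_DIR = Path("./dataset")  # default --dataset_dir in the training command below
EXPECTED = [
    "ade20k/images",
    "coco/train2017",
    "cocostuff/train2017",
    "llava_dataset/llava_instruct_150k.json",
    "mapillary/training",
    "reason_seg/ReasonSeg/train",
    "refer_seg/images/mscoco/images/train2014",
    "vlpart/paco/annotations",
    "vlpart/pascal_part/VOCdevkit",
]

missing = [p for p in EXPECTED if not (DATASET_DIR / p).exists()]
print("All expected paths found." if not missing else f"Missing: {missing}")
```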

### Pre-trained weights

#### LLaVA
To train LISA-7B or LISA-13B, you need to follow the [instructions](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) to merge the LLaVA delta weights. Typically, we use the final weights `LLaVA-Lightning-7B-v1-1` and `LLaVA-13B-v1-1`, merged from `liuhaotian/LLaVA-Lightning-7B-delta-v1-1` and `liuhaotian/LLaVA-13b-delta-v1-1`, respectively. For Llama2, we can directly use the full LLaVA weights `liuhaotian/llava-llama-2-13b-chat-lightning-preview`.

#### SAM ViT-H weights
Download the SAM ViT-H pre-trained weights from this [link](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth).
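
If you prefer to script the download, here is a small sketch using the URL above (the target filename is just a convention):
```python
# Download the SAM ViT-H checkpoint from the official URL given above.
import urllib.request
from pathlib import Path

SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
target = Path("sam_vit_h_4b8939.pth")

if not target.exists():
    urllib.request.urlretrieve(SAM_URL, str(target))  # large download (a few GB)
print(f"SAM checkpoint available at {target.resolve()}")
```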

### Training
```
deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LLaVA" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --dataset="sem_seg||refer_seg||vqa||reason_seg" \
  --sample_rates="9,3,3,1" \
  --exp_name="lisa-7b"
```
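
Here, `--dataset` selects the four data types and `--sample_rates` weights how often each one is drawn during training. A minimal sketch of the idea (illustrative only, not the repository's actual sampler):
```python
# Illustrative: turn "sem_seg||refer_seg||vqa||reason_seg" with rates "9,3,3,1"
# into per-dataset sampling probabilities and draw one dataset.
import random

datasets = "sem_seg||refer_seg||vqa||reason_seg".split("||")
rates = [float(r) for r in "9,3,3,1".split(",")]
probs = [r / sum(rates) for r in rates]
print(dict(zip(datasets, probs)))  # sem_seg ends up sampled ~56% of the time

choice = random.choices(datasets, weights=rates, k=1)[0]
print("next sample comes from:", choice)
```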
When training is finished, get the full model weights with:
```
cd ./runs/lisa-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
```

### Merge LoRA Weight
Merge the LoRA weights in `pytorch_model.bin` and save the resulting model to your desired path in the Hugging Face format:
```
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="PATH_TO_LLaVA" \
  --weight="PATH_TO_pytorch_model.bin" \
  --save_path="PATH_TO_SAVED_MODEL"
```

For example:
```
CUDA_VISIBLE_DEVICES="" python3 merge_lora_weights_and_save_hf_model.py \
  --version="./LLaVA/LLaVA-Lightning-7B-v1-1" \
  --weight="lisa-7b/pytorch_model.bin" \
  --save_path="./LISA-7B"
```

### Validation
```
deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LISA_HF_Model_Directory" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --exp_name="lisa-7b" \
  --eval_only
```

Note: the `v1` models are trained on both the `train` and `val` sets, so please use a `v0` model to reproduce the validation results. (To use the `v0` models, first check out the legacy version with `git checkout 0e26916`.)

## Inference

To chat with [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1) or [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory):
(Note that `chat.py` currently does not support the `v0` models, i.e., `LISA-13B-llama2-v0` and `LISA-13B-llama2-v0-explanatory`; if you want to use them, please first check out the legacy version with `git checkout 0e26916`.)
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1'
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1-explanatory'
```
To use the `bf16` or `fp16` data type for inference:
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='bf16'
```
To use `8bit` or `4bit` quantization for inference (this enables running the 13B model on a single 24G or 12G GPU, at some cost in generation quality):
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_8bit
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_4bit
```
Hint: for the 13B model, 16-bit inference consumes about 30 GB of VRAM on a single GPU, 8-bit inference about 16 GB, and 4-bit inference about 9 GB.

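A small helper sketch that suggests flags based on the GPU's total memory, using the rough numbers from the hint above (it assumes a CUDA device is visible):
```python
# Suggest chat.py precision flags from available VRAM
# (13B model: ~30 GB for 16-bit, ~16 GB for 8-bit, ~9 GB for 4-bit).
import torch

total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
if total_gib >= 30:
    flags = "--precision='bf16'"
elif total_gib >= 16:
    flags = "--precision='fp16' --load_in_8bit"
else:
    flags = "--precision='fp16' --load_in_4bit"
print(f"suggested: CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' {flags}")
```
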
After launching `chat.py`, input the text prompt and then the image path. For example:
```
- Please input your prompt: Where can the driver see the car speed in this image? Please output segmentation mask.
- Please input the image path: imgs/example1.jpg

- Please input your prompt: Can you segment the food that tastes spicy and hot?
- Please input the image path: imgs/example2.jpg
```
The results should look like this:
<p align="center"> <img src="resources/imgs/example1.jpg" width="22%"> <img src="vis_output/example1_masked_img_0.jpg" width="22%"> <img src="resources/imgs/example2.jpg" width="25%"> <img src="vis_output/example2_masked_img_0.jpg" width="25%"> </p>

268
+
269
+ ## Deployment
270
+ ```
271
+ CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1 --load_in_4bit'
272
+ CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1-explanatory --load_in_4bit'
273
+ ```
274
+ By default, we use 4-bit quantization. Feel free to delete the `--load_in_4bit` argument for 16-bit inference or replace it with `--load_in_8bit` argument for 8-bit inference.
275
+
276
+
## Dataset
In ReasonSeg, we have collected 1218 images (239 train, 200 val, and 779 test). The training and validation sets can be downloaded from <a href="https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing">**this link**</a>.

Each image is paired with an annotation JSON file:
```
image_1.jpg, image_1.json
image_2.jpg, image_2.json
...
image_n.jpg, image_n.json
```
Important keys contained in the JSON files:
```
- "text": text instructions.
- "is_sentence": whether the text instructions are long sentences.
- "shapes": target polygons.
```

The elements of "shapes" fall into two categories, **"target"** and **"ignore"**. The former is required for evaluation, while the latter marks ambiguous regions and is therefore disregarded during evaluation.

We provide a <a href="https://github.com/dvlab-research/LISA/blob/main/utils/data_processing.py">**script**</a> that demonstrates how to process the annotations:
```
python3 utils/data_processing.py
```
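
For a quick look at a single annotation, here is a minimal, illustrative sketch; the key names follow the list above, but the exact per-shape fields should be checked against `utils/data_processing.py`:
```python
# Illustrative: inspect one ReasonSeg annotation file.
import json

with open("image_1.json") as f:  # pairs with image_1.jpg
    ann = json.load(f)

print("instructions:", ann["text"])
print("long sentence?", ann["is_sentence"])

# "target" polygons are used for evaluation; "ignore" marks ambiguous regions.
targets = [s for s in ann["shapes"] if s.get("label") == "target"]
ignored = [s for s in ann["shapes"] if s.get("label") == "ignore"]
print(f"{len(targets)} target polygon(s), {len(ignored)} ignore polygon(s)")
```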

Besides, we leveraged GPT-3.5 to rephrase the instructions, so images in the training set may have **more than one instruction (but fewer than six)** in the "text" field. During training, users may randomly select one as the text query to obtain a better model, as sketched below.

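A minimal sketch of that random selection (illustrative; it reuses the JSON layout described above):
```python
# Illustrative: sample one of the rephrased instructions as the training query.
import json
import random

with open("image_1.json") as f:
    ann = json.load(f)

instructions = ann["text"]          # possibly several rephrased instructions
if isinstance(instructions, str):   # guard in case a single string is stored
    instructions = [instructions]
query = random.choice(instructions)
print("sampled query:", query)
```
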
## Citation
If you find this project useful in your research, please consider citing:

```
@article{lai2023lisa,
  title={LISA: Reasoning Segmentation via Large Language Model},
  author={Lai, Xin and Tian, Zhuotao and Chen, Yukang and Li, Yanwei and Yuan, Yuhui and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2308.00692},
  year={2023}
}
@article{yang2023improved,
  title={An Improved Baseline for Reasoning Segmentation with Large Language Model},
  author={Yang, Senqiao and Qu, Tianyuan and Lai, Xin and Tian, Zhuotao and Peng, Bohao and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2312.17240},
  year={2023}
}
```

## Acknowledgement
- This work is built upon [LLaVA](https://github.com/haotian-liu/LLaVA) and [SAM](https://github.com/facebookresearch/segment-anything).