# LISA: Reasoning Segmentation via Large Language Model
<font size=10><div align='center'><b>LISA</b>: Large <b>L</b>anguage <b>I</b>nstructed <b>S</b>egmentation <b>A</b>ssistant</div></font>
<font size=10><div align='center'> <a href=https://arxiv.org/pdf/2308.00692.pdf>**Paper**</a> | <a href="https://huggingface.co/xinlai">**Models**</a> | [**Inference**](#inference) | [**Dataset**](#dataset) | <a href="http://103.170.5.190:7860/">**Online Demo**</a></div></font>
<p align="center"> <img src="imgs/fig_overview.jpg" width="100%"> </p>
<p align="center"> <img src="imgs/teaser.jpg" width="100%"> </p>
## News
- [x] [2023.8.4] [Online Demo](http://103.170.5.190:7860/) is released!
- [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explainatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explainatory) model are released!
- [x] [2023.8.3] Inference code and the [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) model are released. Welcome to check them out!
- [x] [2023.8.2] [Paper](https://arxiv.org/pdf/2308.00692.pdf) is released and GitHub repo is created.
## TODO
- [ ] Training Code Release
**LISA: Reasoning Segmentation via Large Language Model [[Paper](https://arxiv.org/abs/2308.00692)]** <br />
[Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ&hl=zh-CN),
[Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ&hl=en),
[Yukang Chen](https://scholar.google.com/citations?user=6p0ygKUAAAAJ&hl=en),
[Yanwei Li](https://scholar.google.com/citations?user=I-UCPPcAAAAJ&hl=zh-CN),
[Yuhui Yuan](https://scholar.google.com/citations?user=PzyvzksAAAAJ&hl=en),
[Shu Liu](https://scholar.google.com.hk/citations?user=BUEDUFkAAAAJ&hl=zh-CN),
[Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ&hl=en)<br />
## Abstract
In this work, we propose a new segmentation task: ***reasoning segmentation***. The task is designed to output a segmentation mask given a complex and implicit query text. We establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multi-modal Large Language Models (LLMs) while also being able to produce segmentation masks.
For more details, please refer to the [paper](https://arxiv.org/abs/2308.00692).
## Highlights
**LISA** unlocks new segmentation capabilities of multi-modal LLMs, and can handle cases involving:
1. complex reasoning;
2. world knowledge;
3. explanatory answers;
4. multi-turn conversation.

**LISA** also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model on merely 239 reasoning segmentation image-instruction pairs further boosts performance.
## Experimental results
<p align="center"> <img src="imgs/table1.jpg" width="80%"> </p>
## Installation
```
pip install -r requirements.txt
```
## Training
### Training Data Preparation
The training data consists of four types of data:
1. Semantic segmentation datasets: [ADE20K](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip), [COCO-Stuff](https://github.com/nightrome/cocostuff#downloads), [Mapillary](https://www.mapillary.com/dataset/vistas), [PACO-LVIS](https://github.com/facebookresearch/paco/tree/main#dataset-setup), [PASCAL-Part](http://roozbehm.info/pascal-parts/pascal-parts.html)
2. Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg [\[Download\]](https://github.com/lichengunc/refer#download)
3. Visual Question Answering dataset: [LLaVA-Instruct-150k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json)
4. Reasoning segmentation dataset: [ReasonSeg](https://github.com/dvlab-research/LISA#dataset)

Download them from the above links, and organize them as follows.
```
├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   ├── cocostuff
│   │   ├── annotations
│   │   └── train2017
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── mapillary
│   │   ├── config_v2.0.json
│   │   ├── testing
│   │   ├── training
│   │   └── validation
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
```
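Before launching training, a quick sanity check can confirm this layout is in place. The snippet below is a convenience sketch, with the paths taken from the tree above (adjust them to your setup):
```
import os

# Expected paths, taken from the directory tree above.
EXPECTED = [
    "ade20k/images",
    "ade20k/annotations",
    "coco/train2017",
    "cocostuff/train2017",
    "llava_dataset/llava_instruct_150k.json",
    "mapillary/training",
    "reason_seg/ReasonSeg/train",
    "refer_seg/images",
    "vlpart/paco/annotations",
    "vlpart/pascal_part/VOCdevkit",
]

missing = [p for p in EXPECTED if not os.path.exists(os.path.join("./dataset", p))]
print("All expected paths found." if not missing else f"Missing: {missing}")
```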
### Pre-trained weights
#### LLaVA
To train LISA-7B or 13B, you need to follow the [instructions](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) to merge the LLaVA delta weights. Typically, we use the final weights `LLaVA-Lightning-7B-v1-1` and `LLaVA-13B-v1-1` merged from `liuhaotian/LLaVA-Lightning-7B-delta-v1-1` and `liuhaotian/LLaVA-13b-delta-v1-1`, respectively. For Llama2, we can directly use the LLaVA full weights `liuhaotian/llava-llama-2-13b-chat-lightning-preview`.
#### SAM ViT-H weights
Download SAM ViT-H pre-trained weights from the [link](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth).
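If you prefer to script the download, the same checkpoint (a multi-gigabyte file) can be fetched directly:
```
import urllib.request

# Fetch the SAM ViT-H checkpoint from the link above.
SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
urllib.request.urlretrieve(SAM_URL, "sam_vit_h_4b8939.pth")
```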
### Training
```
deepspeed --master_port=24999 train_ds.py --version="PATH_TO_LLaVA_Weights" --dataset_dir='./dataset' --vision_pretrained="PATH_TO_SAM_Weights" --exp_name="lisa-7b"
```
When training is finished, merge the DeepSpeed checkpoint into a full model weight file:
```
cd ./runs/lisa-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
```
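To verify the merge succeeded, a quick sketch (the checkpoint path matches the command above) loads the resulting file as a plain state dict:
```
import torch

# The merged file should be an ordinary state dict of named tensors,
# with no DeepSpeed shard structures remaining.
state_dict = torch.load("./runs/lisa-7b/pytorch_model.bin", map_location="cpu")
print(f"Loaded {len(state_dict)} tensors")
for name in list(state_dict)[:5]:  # peek at the first few parameter names
    print(name, tuple(state_dict[name].shape))
```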
### Validation
```
deepspeed --master_port=24999 train_ds.py --version="PATH_TO_LLaVA_Weights" --dataset_dir='./dataset' --vision_pretrained="PATH_TO_SAM_Weights" --exp_name="lisa-7b" --weight='PATH_TO_pytorch_model.bin' --eval_only
```
## Inference
To chat with [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) or [LISA-13B-llama2-v0-explainatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explainatory): (Note that LISA-13B-llama2-v0 currently does not support explanatory answers.)
```
CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0'
```
To use `bf16` or `fp16` data type for inference:
```
CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0' --precision='bf16'
```
To use `8bit` or `4bit` quantization for inference (this enables running the 13B model on a single 24GB or 12GB GPU, respectively, at some cost to generation quality):
```
CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0' --precision='fp16' --load_in_8bit
CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0' --precision='fp16' --load_in_4bit
```
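For intuition, quantized loading of this kind looks roughly like the sketch below, written with the generic `transformers` + `bitsandbytes` API. This is illustrative only: `chat.py` wires quantization into LISA's own model loading, so the Auto class and the direct use of the model id here are assumptions.
```
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative only: 4-bit weight storage with fp16 compute via bitsandbytes.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "xinlai/LISA-13B-llama2-v0",  # assumption: LISA may require its own model class
    quantization_config=quant_config,
    device_map="auto",  # place quantized layers on the available GPU
)
```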
After that, input the text prompt and then the image path. For example:
```
- Please input your prompt: Where can the driver see the car speed in this image? Please output segmentation mask.
- Please input the image path: imgs/example1.jpg

- Please input your prompt: Can you segment the food that tastes spicy and hot?
- Please input the image path: imgs/example2.jpg
```
The results should look like this:
<p align="center"> <img src="imgs/example1.jpg" width="22%"> <img src="vis_output/example1_masked_img_0.jpg" width="22%"> <img src="imgs/example2.jpg" width="25%"> <img src="vis_output/example2_masked_img_0.jpg" width="25%"> </p>
## Dataset
In ReasonSeg, we have collected 1218 images (239 train, 200 val, and 779 test). The training and validation sets can be downloaded from <a href="https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing">**this link**</a>.
Each image is provided with an annotation JSON file:
```
image_1.jpg, image_1.json
image_2.jpg, image_2.json
...
image_n.jpg, image_n.json
```
Important keys contained in JSON files:
```
- "text": text instructions.
- "is_sentence": whether the text instructions are long sentences.
- "shapes": target polygons.
```
The elements of "shapes" fall into two categories: **"target"** and **"ignore"**. The former is required for evaluation, while the latter marks ambiguous regions and is therefore disregarded during evaluation.
We provide a <a href="https://github.com/dvlab-research/LISA/blob/main/utils/data_processing.py">**script**</a> that demonstrates how to process the annotations:
```
python3 utils/data_processing.py
```
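For reference, below is a minimal sketch of the kind of polygon-to-mask conversion such processing involves. It assumes labelme-style shapes whose entries carry "label" and "points" keys (an assumption; the provided script is the authoritative reference):
```
import json

import cv2  # opencv-python
import numpy as np

def polygons_to_masks(json_path, height, width):
    """Rasterize "target" polygons into a binary mask, and "ignore"
    polygons into a separate mask to be excluded from evaluation."""
    with open(json_path) as f:
        ann = json.load(f)
    target = np.zeros((height, width), dtype=np.uint8)
    ignore = np.zeros((height, width), dtype=np.uint8)
    for shape in ann["shapes"]:
        pts = np.array(shape["points"], dtype=np.int32)  # assumed key name
        canvas = target if shape["label"] == "target" else ignore
        cv2.fillPoly(canvas, [pts], 1)
    return target, ignore
```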
Besides, we leveraged GPT-3.5 to rephrase instructions, so images in the training set may have **more than one instruction (but fewer than six)** in the "text" field. During training, randomly selecting one of them as the text query can yield a better model.
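A minimal sketch of that per-step sampling, reusing the hypothetical annotation file names from the listing above:
```
import json
import random

# Sample one rephrased instruction per training step.
with open("image_1.json") as f:  # hypothetical file name from the listing above
    ann = json.load(f)
texts = ann["text"] if isinstance(ann["text"], list) else [ann["text"]]
query = random.choice(texts)
```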
## Citation
If you find this project useful in your research, please consider citing:
```
@article{reason-seg,
  title={LISA: Reasoning Segmentation via Large Language Model},
  author={Xin Lai and Zhuotao Tian and Yukang Chen and Yanwei Li and Yuhui Yuan and Shu Liu and Jiaya Jia},
  journal={arXiv:2308.00692},
  year={2023}
}
```
## Acknowledgement
- This work is built upon [LLaVA](https://github.com/haotian-liu/LLaVA) and [SAM](https://github.com/facebookresearch/segment-anything).