---
license: apache-2.0
---
<div align="center">
<h1 align="center">
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
</h1>

[Paper](https://arxiv.org/abs/2507.06272)
[Model](https://huggingface.co/echo840/LIRA)
[Open Issues](https://github.com/echo840/LIRA/issues?q=is%3Aopen+is%3Aissue)
[Closed Issues](https://github.com/echo840/LIRA/issues?q=is%3Aissue+is%3Aclosed)
[GitHub](https://github.com/echo840/LIRA)
</div>

> **LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance**<br>
> Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai <br>

[Paper](https://arxiv.org/abs/2507.06272)
[Model](https://huggingface.co/echo840/LIRA)

## Abstract
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision that mitigates hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the `<seg>` token. To quantify this relationship and the model's potential semantic inference ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance on both segmentation and comprehension tasks.
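
For intuition only, the sketch below shows the kind of semantic/pixel-level feature fusion that SEFE describes. The module name, tensor shapes, and fusion scheme are illustrative assumptions, not the released implementation (see the `omg_llava` package for the actual model code).

```python
# Conceptual sketch of semantic + pixel-level feature fusion (illustrative only;
# not LIRA's released SEFE module). Names and shapes are assumptions.
import torch
import torch.nn as nn

class SemanticPixelFusion(nn.Module):
    def __init__(self, sem_dim: int, pix_dim: int, out_dim: int):
        super().__init__()
        self.proj_sem = nn.Linear(sem_dim, out_dim)  # project semantic tokens
        self.proj_pix = nn.Linear(pix_dim, out_dim)  # project pixel-level tokens
        self.fuse = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.GELU())

    def forward(self, sem_feats: torch.Tensor, pix_feats: torch.Tensor) -> torch.Tensor:
        # sem_feats: (B, N, sem_dim), pix_feats: (B, N, pix_dim), aligned per token
        fused = torch.cat([self.proj_sem(sem_feats), self.proj_pix(pix_feats)], dim=-1)
        return self.fuse(fused)  # (B, N, out_dim)
```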

## Overview
<a href="https://zimgs.com/i/EjHWis"><img src="https://v1.ax1x.com/2025/09/26/EjHWis.png" alt="EjHWis.png" border="0" /></a>

## Results
<a href="https://zimgs.com/i/EjHv7a"><img src="https://v1.ax1x.com/2025/09/26/EjHv7a.jpg" alt="EjHv7a.jpg" border="0" /></a>

## Weights
1. Download the LIRA model:
```bash
python download_model.py -n echo840/LIRA
```

2. Download the InternVL2 base model (for a script-free alternative, see the sketch after this list):
```bash
python download_model.py -n OpenGVLab/InternVL2-2B  # or OpenGVLab/InternVL2-8B
```
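
If you prefer to fetch the checkpoints without the helper script, the `huggingface_hub` API does the equivalent. This is a generic sketch with assumed local directory names, not a description of what `download_model.py` does internally.

```python
# Generic alternative to download_model.py: pull full model snapshots from the Hub.
# The local_dir values are assumptions; place the files wherever your configs expect them.
from huggingface_hub import snapshot_download

for repo_id, local_dir in [
    ("echo840/LIRA", "./model_weight"),
    ("OpenGVLab/InternVL2-2B", "./pretrained/InternVL2-2B"),
]:
    path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"Downloaded {repo_id} to {path}")
```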

## Demo
```bash
python ./omg_llava/tools/app_lira.py ./omg_llava/configs/finetune/LIRA-2B.py ./model_weight/LIRA-2B.pth
```

## Train
1. Pretrain:
```bash
bash ./scripts/pretrain.sh
```

2. After training, use the provided tool to convert the DeepSpeed checkpoint to .pth format (a general sketch of this conversion follows the list):
```bash
python omg_llava/tools/convert_deepspeed2pth.py \
    ${PATH_TO_CONFIG} \
    ${PATH_TO_DeepSpeed_PTH} \
    --save-path ./pretrained/${PTH_NAME}.pth
```

3. Finetune:
```bash
bash ./scripts/finetune.sh
```
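
For reference, step 2 corresponds to the standard DeepSpeed procedure of consolidating ZeRO-sharded checkpoint shards into a single fp32 state dict. The sketch below uses DeepSpeed's public utility with placeholder paths; prefer the repo's `convert_deepspeed2pth.py`, which additionally takes the training config into account.

```python
# General DeepSpeed -> .pth conversion sketch (not the repo's convert_deepspeed2pth.py).
# checkpoint_dir and the output path are placeholders.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "./work_dirs/pretrain"  # directory holding the DeepSpeed checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, "./pretrained/lira_pretrain.pth")
```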

## Evaluation
```bash
bash ./scripts/eval_gcg.sh     # Evaluation on Grounded Conversation Generation tasks
bash ./scripts/eval_refseg.sh  # Evaluation on Referring Segmentation tasks
bash ./scripts/eval_vqa.sh     # Evaluation on Comprehension tasks
```
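
For context, referring segmentation results are commonly reported as cumulative IoU (cIoU) over all samples. The snippet below is a generic illustration of that metric, not the code invoked by `eval_refseg.sh`.

```python
# Cumulative IoU (cIoU) over a set of predicted/ground-truth masks (illustration only).
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """pred_masks, gt_masks: iterables of same-shaped boolean (or 0/1) arrays."""
    total_inter, total_union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        total_inter += np.logical_and(pred, gt).sum()
        total_union += np.logical_or(pred, gt).sum()
    return total_inter / max(total_union, 1)

# Toy example with 2x2 masks: intersection 2, union 4 -> 0.5
print(cumulative_iou([np.array([[1, 0], [1, 1]])], [np.array([[1, 1], [0, 1]])]))
```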

## Acknowledgments
Our code is built upon [OMG-LLaVA](https://github.com/lxtGH/OMG-Seg) and [InternVL2](https://github.com/OpenGVLab/InternVL), and we sincerely thank them for providing the code and base models. We also thank [OPERA](https://github.com/shikiw/OPERA) for providing the evaluation code for the CHAIR metric.

## Citation
If you wish to refer to the results reported here, please cite LIRA with the following BibTeX entry:
```BibTeX
@misc{li2025lirainferringsegmentationlarge,
      title={LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance},
      author={Zhang Li and Biao Yang and Qiang Liu and Shuo Zhang and Zhiyin Ma and Liang Yin and Linger Deng and Yabo Sun and Yuliang Liu and Xiang Bai},
      year={2025},
      eprint={2507.06272},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.06272},
}
```