echo840 committed
Commit 8c224a2 · verified · Parent(s): 4e09d24

Update README.md

Files changed (1): README.md (+102 −3)
---
license: apache-2.0
---
<div align="center">
<h1 align="center">
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
</h1>

[![arXiv](https://img.shields.io/badge/Arxiv-LIRA-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2507.06272)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-black.svg?logo=HuggingFace)](https://huggingface.co/echo840/LIRA)
[![GitHub issues](https://img.shields.io/github/issues/echo840/LIRA?color=critical&label=Issues)](https://github.com/echo840/LIRA/issues?q=is%3Aopen+is%3Aissue)
[![GitHub closed issues](https://img.shields.io/github/issues-closed/echo840/LIRA?color=success&label=Issues)](https://github.com/echo840/LIRA/issues?q=is%3Aissue+is%3Aclosed)
[![GitHub views](https://komarev.com/ghpvc/?username=echo840&repo=LIRA&color=brightgreen&label=Views)](https://github.com/echo840/LIRA)
</div>


> **LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance**<br>
> Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai<br>

[![arXiv](https://img.shields.io/badge/Arxiv-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2507.06272)
[![Source_code](https://img.shields.io/badge/Code-Available-white)](https://github.com/echo840/LIRA)
[![Model Weight](https://img.shields.io/badge/HuggingFace-gray)](https://huggingface.co/echo840/LIRA)

## Abstract
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the `<seg>` token. To quantify this relationship and the model's potential semantic inference ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
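
As a rough intuition for the SEFE idea above (fusing semantic features with pixel-level features before mask prediction), here is a deliberately simplified toy sketch. The projection-and-concatenation design, dimensions, and module names are illustrative assumptions, not LIRA's actual architecture; see the paper for the real SEFE.

```python
# Toy illustration of semantic/pixel feature fusion (NOT LIRA's actual SEFE).
import torch
import torch.nn as nn

class ToyFusion(nn.Module):
    def __init__(self, sem_dim=1024, pix_dim=256, out_dim=256):
        super().__init__()
        # Project both streams to a shared width, then mix with a small MLP.
        self.sem_proj = nn.Linear(sem_dim, out_dim)
        self.pix_proj = nn.Linear(pix_dim, out_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * out_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, sem_feat, pix_feat):
        fused = torch.cat([self.sem_proj(sem_feat), self.pix_proj(pix_feat)], dim=-1)
        return self.mlp(fused)  # per-token features consumed downstream

# 196 tokens of semantic (1024-d) and pixel (256-d) features for one image.
feats = ToyFusion()(torch.randn(1, 196, 1024), torch.randn(1, 196, 256))
print(feats.shape)  # torch.Size([1, 196, 256])
```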

## Overview
<a href="https://zimgs.com/i/EjHWis"><img src="https://v1.ax1x.com/2025/09/26/EjHWis.png" alt="Overview of the LIRA framework" border="0" /></a>


## Results
<a href="https://zimgs.com/i/EjHv7a"><img src="https://v1.ax1x.com/2025/09/26/EjHv7a.jpg" alt="Main results of LIRA" border="0" /></a>

## Weights
1. Download the LIRA model weights:
```bash
python download_model.py -n echo840/LIRA
```

2. Download the InternVL2 base model (an alternative using `huggingface_hub` is sketched below):
```bash
python download_model.py -n OpenGVLab/InternVL2-2B  # or OpenGVLab/InternVL2-8B
```
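
If `download_model.py` is not available in your checkout, the same repositories can be fetched directly with the `huggingface_hub` client. A minimal sketch; the `local_dir` layout below is an assumption for illustration, not a directory structure the repo requires:

```python
# Sketch: fetch the weights with huggingface_hub instead of download_model.py.
# The local_dir paths are illustrative assumptions, not repo requirements.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="echo840/LIRA", local_dir="./model_weight")
snapshot_download(repo_id="OpenGVLab/InternVL2-2B", local_dir="./pretrained/InternVL2-2B")
```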

## Demo
```bash
python ./omg_llava/tools/app_lira.py ./omg_llava/configs/finetune/LIRA-2B.py ./model_weight/LIRA-2B.pth
```
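
Here the first positional argument is the model config and the second is the `.pth` checkpoint. Launching the 8B variant would presumably substitute the corresponding config and weight files in the same positions (e.g. `LIRA-8B.py` / `LIRA-8B.pth`; these names are assumed here rather than taken from the repo).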

## Train

1. Pretrain:
```bash
bash ./scripts/pretrain.sh
```

2. After training, use the conversion tool to convert the DeepSpeed checkpoint to `.pth` format (a filled-in example follows this list):
```bash
python omg_llava/tools/convert_deepspeed2pth.py \
    ${PATH_TO_CONFIG} \
    ${PATH_TO_DeepSpeed_PTH} \
    --save-path ./pretrained/${PTH_NAME}.pth
```

3. Finetune:
```bash
bash ./scripts/finetune.sh
```
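
For concreteness, a hypothetical invocation of the converter after pretraining; the config path, checkpoint path, and output name below are illustrative assumptions, not fixed paths from the repo:

```bash
# Hypothetical example: all three paths are assumptions for illustration.
python omg_llava/tools/convert_deepspeed2pth.py \
    ./omg_llava/configs/pretrain/LIRA-2B_pretrain.py \
    ./work_dirs/LIRA-2B_pretrain/iter_5000.pth \
    --save-path ./pretrained/LIRA-2B_pretrain.pth
```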

## Evaluation
```bash
bash ./scripts/eval_gcg.sh     # Evaluation on Grounded Conversation Generation tasks.

bash ./scripts/eval_refseg.sh  # Evaluation on Referring Segmentation tasks.

bash ./scripts/eval_vqa.sh     # Evaluation on Comprehension tasks.
```
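
As background for the referring-segmentation numbers: results in this line of work are commonly reported as cumulative IoU (cIoU), i.e. intersection and union accumulated over all predicted and ground-truth masks before dividing. A minimal NumPy sketch of that metric follows; it is a generic illustration of the standard definition, not the repo's evaluation code:

```python
# Generic cIoU sketch: accumulate intersection and union across the dataset.
# Illustrates the standard metric, not LIRA's actual evaluation script.
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """pred_masks, gt_masks: iterables of boolean arrays of matching shapes."""
    inter, union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter += np.logical_and(pred, gt).sum()
        union += np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

# Toy example with one 2x2 mask pair: intersection 2, union 4.
pred = [np.array([[True, False], [True, True]])]
gt = [np.array([[True, True], [False, True]])]
print(cumulative_iou(pred, gt))  # 0.5
```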

## Acknowledgments
Our code is built upon [OMG-LLaVA](https://github.com/lxtGH/OMG-Seg) and [InternVL2](https://github.com/OpenGVLab/InternVL), and we sincerely thank the authors for providing the code and base models. We also thank [OPERA](https://github.com/shikiw/OPERA) for providing the evaluation code for CHAIR.

## Citation
If you wish to refer to the baseline results published here, please use the following BibTeX entry:
```BibTeX
@misc{li2025lirainferringsegmentationlarge,
      title={LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance},
      author={Zhang Li and Biao Yang and Qiang Liu and Shuo Zhang and Zhiyin Ma and Liang Yin and Linger Deng and Yabo Sun and Yuliang Liu and Xiang Bai},
      year={2025},
      eprint={2507.06272},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.06272},
}
```