Vily1998 committed on
Commit f0cfe85 · 1 Parent(s): 8457d7e

update README

README.md CHANGED
@@ -1,3 +1,138 @@
  ---
  license: gpl-3.0
  ---
+ # LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2501.03895-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2501.03895)
+ [![model](https://img.shields.io/badge/%F0%9F%A4%97%20huggingface%20-llava--mini--llama--3.1--8b-orange.svg)](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b)
+
+ > **[Shaolei Zhang](https://zhangshaolei1998.github.io/), [Qingkai Fang](https://fangqingkai.github.io/), [Zhe Yang](https://nlp.ict.ac.cn/yjdw/xs/ssyjs/202210/t20221020_52708.html), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**
+
+
+ LLaVA-Mini is a unified large multimodal model that efficiently supports the understanding of images, high-resolution images, and videos. Guided by insights into the interpretability of LMMs, LLaVA-Mini significantly improves efficiency while preserving vision capabilities. The [code](https://github.com/ictnlp/LLaVA-Mini), [model](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b), and [demo](https://github.com/ictnlp/LLaVA-Mini#-demo) of LLaVA-Mini are available now!
+
+ Refer to our [GitHub repo](https://github.com/ictnlp/LLaVA-Mini) for details of LLaVA-Mini!
+
+ > [!Note]
+ > LLaVA-Mini requires only **1 token** to represent each image, which improves the efficiency of image and video understanding, including:
+ > - **Computational effort**: 77% FLOPs reduction
+ > - **Response latency**: reduced from 100 milliseconds to 40 milliseconds
+ > - **VRAM usage**: reduced from 360 MB/image to 0.6 MB/image, which supports 3-hour video processing
+
+
+ <p align="center" width="100%">
+ <img src="./assets/performance.png" alt="performance" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ 💡 **Highlights**:
+ 1. **Good Performance**: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (a compression rate of 0.17%).
+ 2. **High Efficiency**: LLaVA-Mini reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and processes over 10,000 video frames on GPU hardware with 24 GB of memory.
+ 3. **Insights**: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conducted a preliminary analysis of how large multimodal models (LMMs) process visual tokens. Please refer to our [paper](https://arxiv.org/pdf/2501.03895) for the detailed analysis and our conclusions.
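
The efficiency figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: the token counts and per-image memory figures come from the README, while the 1 frame-per-second sampling rate and the helper names are assumptions made here, not something the README specifies.

```python
# Back-of-the-envelope check of the figures quoted in the README.
TOKENS_LLAVA_V1_5 = 576      # vision tokens per image in LLaVA-v1.5 (from the README)
TOKENS_LLAVA_MINI = 1        # vision tokens per image in LLaVA-Mini (from the README)
MB_PER_IMAGE_BEFORE = 360.0  # quoted VRAM per image before compression
MB_PER_IMAGE_AFTER = 0.6     # quoted VRAM per image with LLaVA-Mini

# Assumption (not stated in the README): the video is sampled at 1 frame per second.
ASSUMED_FPS = 1.0


def compression_rate_percent(tokens_after: int, tokens_before: int) -> float:
    """Share of the original vision tokens that is kept, in percent."""
    return 100.0 * tokens_after / tokens_before


def frames_in_video(hours: float, fps: float) -> int:
    """Number of sampled frames in a video of the given length."""
    return int(hours * 3600 * fps)


def vision_memory_gb(frames: int, mb_per_frame: float) -> float:
    """Per-frame memory summed over all frames, in decimal gigabytes."""
    return frames * mb_per_frame / 1000.0


if __name__ == "__main__":
    rate = compression_rate_percent(TOKENS_LLAVA_MINI, TOKENS_LLAVA_V1_5)
    frames = frames_in_video(hours=3, fps=ASSUMED_FPS)
    print(f"compression rate: {rate:.2f}%")
    print(f"frames in a 3-hour video at {ASSUMED_FPS:g} fps: {frames}")
    print(f"per-frame memory, before: {vision_memory_gb(frames, MB_PER_IMAGE_BEFORE):.0f} GB")
    print(f"per-frame memory, after:  {vision_memory_gb(frames, MB_PER_IMAGE_AFTER):.2f} GB")
```

Under the 1 fps assumption, a 3-hour video yields 10,800 frames, which is consistent with the "over 10,000 frames" figure. These totals cover only the quoted per-image cost and exclude the memory taken by the model weights.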
+
+ ## 🖥 Demo
+ <p align="center" width="100%">
+ <img src="./assets/llava_mini.gif" alt="llava_mini" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ - Download the LLaVA-Mini model from [here](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b).
+
+ - Run these scripts and interact with LLaVA-Mini in your browser:
+
+ ```bash
+ # Launch a controller
+ python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &
+
+ # Build the API of LLaVA-Mini
+ CUDA_VISIBLE_DEVICES=0 python -m llavamini.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &
+
+ # Start the interactive interface
+ python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --port 7860
+ ```
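
Because the controller and the model worker above are started in the background with `&`, it can help to confirm that each service is listening before moving on to the next step. The following is a minimal sketch using only the Python standard library; the port numbers are the ones used in the commands above, and the helper names `port_is_open` and `wait_for_port` are hypothetical, not part of LLaVA-Mini.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host: str, port: int, deadline_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Poll until the port accepts connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if port_is_open(host, port):
            return True
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    # Ports taken from the demo commands: controller, model worker, Gradio UI.
    services = [("controller", 10000), ("model worker", 40000), ("web UI", 7860)]
    for name, port in services:
        status = "listening" if port_is_open("localhost", port, timeout=0.5) else "not reachable"
        print(f"{name} on port {port}: {status}")
```

An open port only shows that a process is accepting connections; it does not guarantee that the model has finished loading.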
+
+ ## 🔥 Quick Start
+ ### Requirements
+ - Install packages:
+
+ ```bash
+ conda create -n llavamini python=3.10 -y
+ conda activate llavamini
+ pip install -e .
+ pip install -e ".[train]"
+ pip install flash-attn --no-build-isolation
+ ```
+
+ ### Command Interaction
+ - Image understanding, using `--image-file`:
+
+ ```bash
+ # Image Understanding
+ CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
+ --model-path ICTNLP/llava-mini-llama-3.1-8b \
+ --image-file llavamini/serve/examples/baby_cake.png \
+ --conv-mode llava_llama_3_1 --model-name "llava-mini" \
+ --query "What's the text on the cake?"
+ ```
+
+ - Video understanding, using `--video-file`:
+
+ ```bash
+ # Video Understanding
+ CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
+ --model-path ICTNLP/llava-mini-llama-3.1-8b \
+ --video-file llavamini/serve/examples/fifa.mp4 \
+ --conv-mode llava_llama_3_1 --model-name "llava-mini" \
+ --query "What happened in this video?"
+ ```
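
The two commands above differ only in the media flag, so they can be assembled programmatically. Here is a minimal sketch of such a wrapper: the script path and every flag are copied from the commands above, while the helper `build_command` and the list of video suffixes are hypothetical and for illustration only (the README does not say which video formats the script accepts).

```python
import shlex
from pathlib import Path

# Assumption: an illustrative set of suffixes to treat as video; not from the README.
VIDEO_SUFFIXES = {".mp4", ".avi", ".mov", ".mkv"}


def build_command(
    media_path: str,
    query: str,
    model_path: str = "ICTNLP/llava-mini-llama-3.1-8b",
) -> list[str]:
    """Assemble the run_llava_mini.py argument list for an image or a video."""
    suffix = Path(media_path).suffix.lower()
    media_flag = "--video-file" if suffix in VIDEO_SUFFIXES else "--image-file"
    return [
        "python", "llavamini/eval/run_llava_mini.py",
        "--model-path", model_path,
        media_flag, media_path,
        "--conv-mode", "llava_llama_3_1",
        "--model-name", "llava-mini",
        "--query", query,
    ]


if __name__ == "__main__":
    cmd = build_command("llavamini/serve/examples/fifa.mp4", "What happened in this video?")
    # Print a shell-safe version of the command instead of running the model here.
    print(shlex.join(cmd))
```

To actually run the command, the list can be passed to `subprocess.run(cmd, check=True)` from the repository root, with `CUDA_VISIBLE_DEVICES` set in the environment as in the commands above.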
+
+ ### Reproduction and Evaluation
+
+ - Refer to [Evaluation.md](docs/Evaluation.md) for the evaluation of LLaVA-Mini on image and video benchmarks.
+
+ ### Cases
+ - LLaVA-Mini achieves high-quality image and video understanding.
+
+ <p align="center" width="100%">
+ <img src="./assets/case1.png" alt="case1" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ <details>
+ <summary>More cases</summary>
+ <p align="center" width="100%">
+ <img src="./assets/case2.png" alt="case2" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ <p align="center" width="100%">
+ <img src="./assets/case3.png" alt="case3" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ <p align="center" width="100%">
+ <img src="./assets/case4.png" alt="case4" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ </details>
+
+ - LLaVA-Mini dynamically compresses each image to capture important visual information (brighter areas are weighted more heavily during compression).
+
+ <p align="center" width="100%">
+ <img src="./assets/compression.png" alt="compression" style="width: 100%; min-width: 300px; display: block; margin: auto;">
+ </p>
+
+ ## 🖋 Citation
+
+ If this repository is useful to you, please cite it as:
+
+ ```
+ @misc{llavamini,
+ title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token},
+ author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
+ year={2025},
+ eprint={2501.03895},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2501.03895},
+ }
+ ```
+
+ If you have any questions, please feel free to submit an issue or contact `zhangshaolei20z@ict.ac.cn`.
assets/case1.png ADDED

Git LFS Details

  • SHA256: b0da055cc3134f68748ef7cfa7c8bf06b16e02286be063fb2de13d05b8b4b6ab
  • Pointer size: 132 Bytes
  • Size of remote file: 5.94 MB
assets/case2.png ADDED

Git LFS Details

  • SHA256: d39b0f85b628bad800845aee1937f19af50202be7f1e01c85c512c3e54ab0693
  • Pointer size: 132 Bytes
  • Size of remote file: 2.98 MB
assets/case3.png ADDED

Git LFS Details

  • SHA256: bb4903a350872b085ddfac202af3aa8bbd8cd1c5d0fe31f1ef08699d07d6df2b
  • Pointer size: 132 Bytes
  • Size of remote file: 2.04 MB
assets/case4.png ADDED

Git LFS Details

  • SHA256: 1ba17b87d19d495d3df2b94376ab3a459ed1ae9de250bf76edcddd1f533715e0
  • Pointer size: 132 Bytes
  • Size of remote file: 2.52 MB
assets/compression.png ADDED

Git LFS Details

  • SHA256: 8b7f4bd208b13136e38a19fd663fe951f5b236d9ecd83bea4980eadc3fbed3b8
  • Pointer size: 132 Bytes
  • Size of remote file: 3 MB
assets/llava_mini.gif ADDED

Git LFS Details

  • SHA256: 84701c8a76d64452ac4cf4a6bc58909be639278a385bd7acac963750365089ee
  • Pointer size: 132 Bytes
  • Size of remote file: 6.39 MB
assets/performance.png ADDED

Git LFS Details

  • SHA256: eac59679d2eea4fe584c2cb09648398ede644eb1e9fc97ebc1a24007c62a2b63
  • Pointer size: 132 Bytes
  • Size of remote file: 1.41 MB