zhangshaolei nielsr HF staff committed on
Commit 690d4c0 · verified · 1 Parent(s): f0cfe85

Add pipeline tag (#1)


- Add pipeline tag (e40d08a44399d83c7db779f22f76c7ab2958bc59)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +138 -137
README.md CHANGED
@@ -1,138 +1,139 @@
1
- ---
2
- license: gpl-3.0
3
- ---
4
- # LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
5
-
6
- [![arXiv](https://img.shields.io/badge/arXiv-2501.03895-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2501.03895)
7
- [![model](https://img.shields.io/badge/%F0%9F%A4%97%20huggingface%20-llava--mini--llama--3.1--8b-orange.svg)](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b)
8
-
9
- > **[Shaolei Zhang](https://zhangshaolei1998.github.io/), [Qingkai Fang](https://fangqingkai.github.io/), [Zhe Yang](https://nlp.ict.ac.cn/yjdw/xs/ssyjs/202210/t20221020_52708.html), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**
10
-
11
-
12
- LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Guided by the interpretability within LMM, LLaVA-Mini significantly improves efficiency while ensuring vision capabilities. [Code](https://github.com/ictnlp/LLaVA-Mini), [model](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b) and [demo](https://github.com/ictnlp/LLaVA-Mini#-demo) of LLaVA-Mini are available now!
13
-
14
- Refer to our [GitHub repo]((https://github.com/ictnlp/LLaVA-Mini)) for details of LLaVA-Mini!
15
-
16
- > [!Note]
17
- > LLaVA-Mini only requires **1 token** to represent each image, which improves the efficiency of image and video understanding, including:
18
- > - **Computational effort**: 77% FLOPs reduction
19
- > - **Response latency**: reduce from 100 milliseconds to 40 milliseconds
20
- > - **VRAM memory usage**: reduce from 360 MB/image to 0.6 MB/image, support 3-hour video processing
21
-
22
-
23
- <p align="center" width="100%">
24
- <img src="./assets/performance.png" alt="performance" style="width: 100%; min-width: 300px; display: block; margin: auto;">
25
- </p>
26
-
27
- 💡**Highlight**:
28
- 1. **Good Performance**: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (compression rate of 0.17%).
29
- 2. **High Efficiency**: LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.
30
- 3. **Insights**: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conduct a preliminary analysis to explore how large multimodal models (LMMs) process visual tokens. Please refer to our [paper](https://arxiv.org/pdf/2501.03895) for a detailed analysis and our conclusions.
31
-
32
- ## 🖥 Demo
33
- <p align="center" width="100%">
34
- <img src="./assets/llava_mini.gif" alt="llava_mini" style="width: 100%; min-width: 300px; display: block; margin: auto;">
35
- </p>
36
-
37
- - Download LLaVA-Mini model from [here](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b).
38
-
39
- - Run these scripts and Interact with LLaVA-Mini in your browser:
40
-
41
- ```bash
42
- # Launch a controller
43
- python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &
44
-
45
- # Build the API of LLaVA-Mini
46
- CUDA_VISIBLE_DEVICES=0 python -m llavamini.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &
47
-
48
- # Start the interactive interface
49
- python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --port 7860
50
- ```
51
-
52
- ## 🔥 Quick Start
53
- ### Requirements
54
- - Install packages:
55
-
56
- ```bash
57
- conda create -n llavamini python=3.10 -y
58
- conda activate llavamini
59
- pip install -e .
60
- pip install -e ".[train]"
61
- pip install flash-attn --no-build-isolation
62
- ```
63
-
64
- ### Command Interaction
65
- - Image understanding, using `--image-file `:
66
-
67
- ```bash
68
- # Image Understanding
69
- CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
70
- --model-path ICTNLP/llava-mini-llama-3.1-8b \
71
- --image-file llavamini/serve/examples/baby_cake.png \
72
- --conv-mode llava_llama_3_1 --model-name "llava-mini" \
73
- --query "What's the text on the cake?"
74
- ```
75
-
76
- - Video understanding, using `--video-file `:
77
-
78
- ```bash
79
- # Video Understanding
80
- CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
81
- --model-path ICTNLP/llava-mini-llama-3.1-8b \
82
- --video-file llavamini/serve/examples/fifa.mp4 \
83
- --conv-mode llava_llama_3_1 --model-name "llava-mini" \
84
- --query "What happened in this video?"
85
- ```
86
-
87
- ### Reproduction and Evaluation
88
-
89
- - Refer to [Evaluation.md](docs/Evaluation.md) for the evaluation of LLaVA-Mini on image/video benchmarks.
90
-
91
- ### Cases
92
- - LLaVA-Mini achieves high-quality image understanding and video understanding.
93
-
94
- <p align="center" width="100%">
95
- <img src="./assets/case1.png" alt="case1" style="width: 100%; min-width: 300px; display: block; margin: auto;">
96
- </p>
97
-
98
- <details>
99
- <summary>More cases</summary>
100
- <p align="center" width="100%">
101
- <img src="./assets/case2.png" alt="case2" style="width: 100%; min-width: 300px; display: block; margin: auto;">
102
- </p>
103
-
104
- <p align="center" width="100%">
105
- <img src="./assets/case3.png" alt="case3" style="width: 100%; min-width: 300px; display: block; margin: auto;">
106
- </p>
107
-
108
- <p align="center" width="100%">
109
- <img src="./assets/case4.png" alt="case4" style="width: 100%; min-width: 300px; display: block; margin: auto;">
110
- </p>
111
-
112
- </details>
113
-
114
- - LLaVA-Mini dynamically compresses image to capture important visual information (brighter areas are more heavily weighted during compression).
115
-
116
- <p align="center" width="100%">
117
- <img src="./assets/compression.png" alt="compression" style="width: 100%; min-width: 300px; display: block; margin: auto;">
118
- </p>
119
-
120
-
121
-
122
- ## 🖋Citation
123
-
124
- If this repository is useful for you, please cite as:
125
-
126
- ```
127
- @misc{llavamini,
128
- title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token},
129
- author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
130
- year={2025},
131
- eprint={2501.03895},
132
- archivePrefix={arXiv},
133
- primaryClass={cs.CV},
134
- url={https://arxiv.org/abs/2501.03895},
135
- }
136
- ```
137
-
 
138
  If you have any questions, please feel free to submit an issue or contact `zhangshaolei20z@ict.ac.cn`.
 
1
+ ---
2
+ license: gpl-3.0
3
+ pipeline_tag: image-text-to-text
4
+ ---
5
+ # LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
6
+
7
+ [![arXiv](https://img.shields.io/badge/arXiv-2501.03895-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2501.03895)
8
+ [![model](https://img.shields.io/badge/%F0%9F%A4%97%20huggingface%20-llava--mini--llama--3.1--8b-orange.svg)](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b)
9
+
10
+ > **[Shaolei Zhang](https://zhangshaolei1998.github.io/), [Qingkai Fang](https://fangqingkai.github.io/), [Zhe Yang](https://nlp.ict.ac.cn/yjdw/xs/ssyjs/202210/t20221020_52708.html), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**
11
+
12
+
13
+ LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Guided by the interpretability within LMM, LLaVA-Mini significantly improves efficiency while ensuring vision capabilities. [Code](https://github.com/ictnlp/LLaVA-Mini), [model](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b) and [demo](https://github.com/ictnlp/LLaVA-Mini#-demo) of LLaVA-Mini are available now!
14
+
15
+ Refer to our [GitHub repo]((https://github.com/ictnlp/LLaVA-Mini)) for details of LLaVA-Mini!
16
+
17
+ > [!Note]
18
+ > LLaVA-Mini only requires **1 token** to represent each image, which improves the efficiency of image and video understanding, including:
19
+ > - **Computational effort**: 77% FLOPs reduction
20
+ > - **Response latency**: reduce from 100 milliseconds to 40 milliseconds
21
+ > - **VRAM memory usage**: reduce from 360 MB/image to 0.6 MB/image, support 3-hour video processing
22
+
23
+
24
+ <p align="center" width="100%">
25
+ <img src="./assets/performance.png" alt="performance" style="width: 100%; min-width: 300px; display: block; margin: auto;">
26
+ </p>
27
+
28
+ 💡**Highlight**:
29
+ 1. **Good Performance**: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (compression rate of 0.17%).
30
+ 2. **High Efficiency**: LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.
31
+ 3. **Insights**: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conduct a preliminary analysis to explore how large multimodal models (LMMs) process visual tokens. Please refer to our [paper](https://arxiv.org/pdf/2501.03895) for a detailed analysis and our conclusions.
32
+
33
+ ## 🖥 Demo
34
+ <p align="center" width="100%">
35
+ <img src="./assets/llava_mini.gif" alt="llava_mini" style="width: 100%; min-width: 300px; display: block; margin: auto;">
36
+ </p>
37
+
38
+ - Download LLaVA-Mini model from [here](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b).
39
+
40
+ - Run these scripts and Interact with LLaVA-Mini in your browser:
41
+
42
+ ```bash
43
+ # Launch a controller
44
+ python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &
45
+
46
+ # Build the API of LLaVA-Mini
47
+ CUDA_VISIBLE_DEVICES=0 python -m llavamini.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &
48
+
49
+ # Start the interactive interface
50
+ python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --port 7860
51
+ ```
52
+
53
+ ## 🔥 Quick Start
54
+ ### Requirements
55
+ - Install packages:
56
+
57
+ ```bash
58
+ conda create -n llavamini python=3.10 -y
59
+ conda activate llavamini
60
+ pip install -e .
61
+ pip install -e ".[train]"
62
+ pip install flash-attn --no-build-isolation
63
+ ```
64
+
65
+ ### Command Interaction
66
+ - Image understanding, using `--image-file `:
67
+
68
+ ```bash
69
+ # Image Understanding
70
+ CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
71
+ --model-path ICTNLP/llava-mini-llama-3.1-8b \
72
+ --image-file llavamini/serve/examples/baby_cake.png \
73
+ --conv-mode llava_llama_3_1 --model-name "llava-mini" \
74
+ --query "What's the text on the cake?"
75
+ ```
76
+
77
+ - Video understanding, using `--video-file `:
78
+
79
+ ```bash
80
+ # Video Understanding
81
+ CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
82
+ --model-path ICTNLP/llava-mini-llama-3.1-8b \
83
+ --video-file llavamini/serve/examples/fifa.mp4 \
84
+ --conv-mode llava_llama_3_1 --model-name "llava-mini" \
85
+ --query "What happened in this video?"
86
+ ```
87
+
88
+ ### Reproduction and Evaluation
89
+
90
+ - Refer to [Evaluation.md](docs/Evaluation.md) for the evaluation of LLaVA-Mini on image/video benchmarks.
91
+
92
+ ### Cases
93
+ - LLaVA-Mini achieves high-quality image understanding and video understanding.
94
+
95
+ <p align="center" width="100%">
96
+ <img src="./assets/case1.png" alt="case1" style="width: 100%; min-width: 300px; display: block; margin: auto;">
97
+ </p>
98
+
99
+ <details>
100
+ <summary>More cases</summary>
101
+ <p align="center" width="100%">
102
+ <img src="./assets/case2.png" alt="case2" style="width: 100%; min-width: 300px; display: block; margin: auto;">
103
+ </p>
104
+
105
+ <p align="center" width="100%">
106
+ <img src="./assets/case3.png" alt="case3" style="width: 100%; min-width: 300px; display: block; margin: auto;">
107
+ </p>
108
+
109
+ <p align="center" width="100%">
110
+ <img src="./assets/case4.png" alt="case4" style="width: 100%; min-width: 300px; display: block; margin: auto;">
111
+ </p>
112
+
113
+ </details>
114
+
115
+ - LLaVA-Mini dynamically compresses image to capture important visual information (brighter areas are more heavily weighted during compression).
116
+
117
+ <p align="center" width="100%">
118
+ <img src="./assets/compression.png" alt="compression" style="width: 100%; min-width: 300px; display: block; margin: auto;">
119
+ </p>
120
+
121
+
122
+
123
+ ## 🖋Citation
124
+
125
+ If this repository is useful for you, please cite as:
126
+
127
+ ```
128
+ @misc{llavamini,
129
+ title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token},
130
+ author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
131
+ year={2025},
132
+ eprint={2501.03895},
133
+ archivePrefix={arXiv},
134
+ primaryClass={cs.CV},
135
+ url={https://arxiv.org/abs/2501.03895},
136
+ }
137
+ ```
138
+
139
  If you have any questions, please feel free to submit an issue or contact `zhangshaolei20z@ict.ac.cn`.
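
Although the diff rewrites every line of README.md, the only visible content change is the new `pipeline_tag: image-text-to-text` key in the YAML front matter. A minimal Python sketch of confirming that, assuming the front matter holds only flat `key: value` pairs; `read_front_matter` is a hypothetical helper written for this illustration, not part of any Hugging Face tooling, and nested metadata would need a real YAML parser:

```python
def read_front_matter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs from a leading `---` ... `---` block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # opening delimiter without a close: treat as absent


TITLE = "# LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token\n"

# Heads of README.md before and after the commit, copied from the diff.
OLD = "---\nlicense: gpl-3.0\n---\n" + TITLE
NEW = "---\nlicense: gpl-3.0\npipeline_tag: image-text-to-text\n---\n" + TITLE

old_meta = read_front_matter(OLD)
new_meta = read_front_matter(NEW)

# Keys present after the commit but not before it.
added = {k: v for k, v in new_meta.items() if k not in old_meta}
print(added)  # → {'pipeline_tag': 'image-text-to-text'}
```

Comparing parsed metadata rather than raw lines ignores whitespace-only or line-ending differences, which is one plausible reason a one-key change can show up as `+138 -137`.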