Commit 1878da4 by xiechunyu (merge of parents e982bb6 and fc7b70e)

Files changed (1): README.md (+124 -124)
---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
- FreedomIntelligence/ALLaVA-4V-Chinese
- shareAI/ShareGPT-Chinese-English-90k
language:
- zh
- en
pipeline_tag: visual-question-answering
---
<br>
<br>

# Model Card for 360VL
<p align="center">
 <img src="https://github.com/360CVGroup/360VL/blob/master/qh360_vl/360vl.PNG?raw=true" width=100%/>
</p>

**360VL** is built on the Llama 3 language model and is the industry's first open-source large multimodal model based on **Llama3-70B** [[🤗Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)]. Beyond adopting Llama 3 as its language model, 360VL also introduces a globally aware multi-branch projector architecture, which gives the model stronger image understanding capabilities.
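
The card does not spell out the projector design, so the following is only a minimal sketch of what a globally aware multi-branch projector could look like: one branch projects every visual patch token into the LLM embedding space, while a second branch pools the patches into a single global descriptor that is prepended as an extra token. All names and dimensions here are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only -- NOT the released 360VL projector. It merely shows the idea
# of combining a per-patch (local) branch with a pooled (global) branch.
import torch
import torch.nn as nn

class GloballyAwareMultiBranchProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 8192):
        super().__init__()
        # Local branch: map each visual patch token into the LLM embedding space.
        self.local_proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Global branch: summarize the whole image into one token.
        self.global_proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vision_dim) from the vision tower
        local_tokens = self.local_proj(patch_tokens)                # (B, N, llm_dim)
        global_token = self.global_proj(patch_tokens.mean(dim=1))   # (B, llm_dim)
        # Prepend the global summary token so the LLM attends to a whole-image
        # descriptor alongside the fine-grained patch tokens.
        return torch.cat([global_token.unsqueeze(1), local_tokens], dim=1)
```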

## Model Zoo

360VL has released the following versions.

| Model | Download |
|---|---|
| 360VL-8B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-8B) |
| 360VL-70B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-70B) |
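
Either repository can be fetched with the standard `huggingface_hub` API; the short sketch below (not part of the original card) downloads a snapshot whose local path can then be used as the `checkpoint` in the Quick Start section.

```python
# Download a local copy of a 360VL checkpoint with the standard huggingface_hub API.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="qihoo360/360VL-8B")  # or "qihoo360/360VL-70B"
print(local_dir)  # pass this path as `checkpoint` in the Quick Start snippet
```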

## Features

360VL offers the following features:

- Multi-round text-image conversations: 360VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.

- Bilingual text support: 360VL supports conversations in both English and Chinese, including text recognition in images.

- Strong image comprehension: 360VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.

- Fine-grained image resolution: 360VL supports image understanding at a higher resolution of 672&times;672.

## Performance

| Model | Checkpoints | MMB<sub>T</sub> | MMB<sub>D</sub> | MMB-CN<sub>T</sub> | MMB-CN<sub>D</sub> | MMMU<sub>V</sub> | MMMU<sub>T</sub> | MME |
|:--------------------|:------------:|:----:|:------:|:------:|:-------:|:-------:|:-------:|:-------:|
| Qwen-VL-Chat | [🤗LINK](https://huggingface.co/Qwen/Qwen-VL-Chat) | 61.8 | 60.6 | 56.3 | 56.7 | 37 | 32.9 | 1860 |
| mPLUG-Owl2 | [🤖LINK](https://www.modelscope.cn/models/iic/mPLUG-Owl2/summary) | 66.0 | 66.5 | 60.3 | 59.5 | 34.7 | 32.1 | 1786.4 |
| CogVLM | [🤗LINK](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf) | 65.8 | 63.7 | 55.9 | 53.8 | 37.3 | 30.1 | 1736.6 |
| Monkey-Chat | [🤗LINK](https://huggingface.co/echo840/Monkey-Chat) | 72.4 | 71 | 67.5 | 65.8 | 40.7 | - | 1887.4 |
| MM1-7B-Chat | [LINK](https://ar5iv.labs.arxiv.org/html/2403.09611) | - | 72.3 | - | - | 37.0 | 35.6 | 1858.2 |
| IDEFICS2-8B | [🤗LINK](https://huggingface.co/HuggingFaceM4/idefics2-8b) | 75.7 | 75.3 | 68.6 | 67.3 | 43.0 | 37.7 | 1847.6 |
| Honeybee | [LINK](https://github.com/kakaobrain/honeybee) | 74.3 | 74.3 | - | - | 36.2 | - | 1950 |
| SVIT-v1.5-13B | [🤗LINK](https://huggingface.co/Isaachhe/svit-v1.5-13b-full) | 69.1 | - | 63.1 | - | 38.0 | 33.3 | 1889 |
| LLaVA-v1.5-13B | [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.5-13b) | 69.2 | 69.2 | 65 | 63.6 | 36.4 | 33.6 | 1826.7 |
| LLaVA-v1.6-13B | [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) | 70 | 70.7 | 68.5 | 64.3 | 36.2 | - | 1901 |
| Yi-VL-34B | [🤗LINK](https://huggingface.co/01-ai/Yi-VL-34B) | 72.4 | 71.1 | 70.7 | 71.4 | 45.1 | 41.6 | 2050.2 |
| **360VL-8B** | [🤗LINK](https://huggingface.co/qihoo360/360VL-8B) | 75.3 | 73.7 | 71.1 | 68.6 | 39.7 | 37.1 | 1899.1 |
| **360VL-70B** | [🤗LINK](https://huggingface.co/qihoo360/360VL-70B) | 78.1 | 80.4 | 76.9 | 77.7 | 50.8 | 44.3 | 1983.2 |

In the header, <sub>T</sub> and <sub>D</sub> denote the MMBench test and dev splits, and the MMMU <sub>V</sub> and <sub>T</sub> columns denote its val and test splits.

## Quick Start 🤗

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qh360_vl-70B"  # local path to the checkpoint (e.g. a download of qihoo360/360VL-70B)

# Load the model and tokenizer; trust_remote_code is required for the custom 360VL code.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision tower and its image processor.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token

image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"

# Stop generation at the Llama 3 end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Build the multimodal prompt (text + image) expected by the model.
inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens, dropping the prompt.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
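
The snippet above loads the whole model onto a single GPU, which is fine for 360VL-8B but usually exceeds a single device's memory for the 70B checkpoint in float16 (roughly 140 GB of weights). The sketch below shows two standard `transformers` alternatives: sharding across all visible GPUs, or 4-bit quantization with bitsandbytes. These are generic options, not settings prescribed by the model card, so they may need adjustment for the custom 360VL code.

```python
# Memory-saving loading options (illustrative; not from the original model card).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

checkpoint = "qh360_vl-70B"  # same local path / Hub ID as in the Quick Start

# Option 1: shard the float16 weights across every visible GPU.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
).eval()

# Option 2: 4-bit quantization (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, quantization_config=bnb_config, device_map="auto", trust_remote_code=True
).eval()
```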

**Model type:**
360VL-70B is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data.
It is an auto-regressive language model based on the transformer architecture.
Base LLM: [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)

**Model date:**
360VL-70B was trained in May 2024.

## License
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under the Apache License 2.0.

**Where to send questions or comments about the model:**
https://github.com/360CVGroup/360VL

## Related Projects
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!
- [Meta Llama 3](https://github.com/meta-llama/llama3)
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
- [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://github.com/kakaobrain/honeybee)