Oyoy1235 committed
Commit 686ce70 • 1 Parent(s): e37c108

update Imp-v1.5-4B-phi3

Files changed (3)
  1. README copy.md +0 -96
  2. README.md +92 -0
  3. config.json +1 -1
README copy.md DELETED
@@ -1,96 +0,0 @@
- ---
- license: apache-2.0
- pipeline_tag: text-generation
- datasets:
- - liuhaotian/LLaVA-Pretrain
- - liuhaotian/LLaVA-Instruct-150K
- ---
- # 😈 Imp
-
- > A very small man can cast a very large shadow.
- >
- >           ——*George R.R. Martin, A Clash of Kings*
-
-
- \[Technical report (coming soon)\]  [[Demo](https://xmbot.net/imp/)\]  [[Github](https://github.com/MILVLG/imp)\]
-
- ## Introduction
-
- The Imp project aims to provide a family of a strong multimodal `small` language models (MSLMs). Our `imp-v1.5-4b` is a strong MSLM with only **4B** parameters, which is build upon a small yet powerful SLM [Phi-3 ](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)(3.8B) and a powerful visual encoder [SigLIP ](https://huggingface.co/google/siglip-so400m-patch14-384)(0.4B), and trained on 1M mixed dataset.
-
- As shown in the Table below, `imp-v1.5-4b` significantly outperforms the counterparts of similar model sizes on various multimodal benchmarks.
-
-
- We release our model weights and provide an example below to run our model . Detailed technical report and corresponding training/evaluation code will be released soon on our [GitHub repo](https://github.com/MILVLG/imp). We will persistently improve our model and release the next versions to further improve model performance :)
-
-
- ## How to use
-
-
- **Install dependencies**
- ```bash
- pip install transformers # latest version is ok, but we recommend v4.36.0
- pip install -q pillow accelerate einops
- ```
-
- You can use the following code for model inference. The format of text instruction is similar to [LLaVA](https://github.com/haotian-liu/LLaVA). A Colab page to run this example is provided [here](https://colab.research.google.com/drive/1EBYky6xIPjnlPppo2gZaiNK6gEsjXgom?usp=drive_link#scrollTo=2-VpU6QzWCVZ). Note that the example can only be run on GPUs currently.
-
- ```Python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from PIL import Image
-
- torch.set_default_device("cuda")
-
- #Create model
- model = AutoModelForCausalLM.from_pretrained(
-     "MILVLG/imp-v1.5-4b",
-     torch_dtype=torch.float16,
-     device_map="auto",
-     trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1.5-4b", trust_remote_code=True)
-
- #Set inputs
- text = "<|user|>\n<image>\nWhat are the colors of the bus in the image?\n<|end|>\n<|assistant|>\n"
- image = Image.open("images/bus.jpg")
-
- input_ids = tokenizer(text, return_tensors='pt').input_ids
- image_tensor = model.image_preprocess(image)
-
- #Generate the answer
- output_ids = model.generate(
-     input_ids,
-     max_new_tokens=100,
-     images=image_tensor,
-     use_cache=True)[0]
- print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
- ```
-
- ## Model evaluation
- We conduct evaluation on 9 commonly-used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks, to compare our Imp model with LLaVA (7B) and existing MSLMs of similar model sizes.
-
- | Models | Size | VQAv2 | GQA | SQA(IMG) | TextVQA | POPE | MME(P) | MMB |MMB_CN|MM-Vet|
- |:--------:|:-----:|:----:|:-------------:|:--------:|:-----:|:----:|:-------:|:-------:|:-------:|:-------:|
- | imp-v1.5-4b| 4B | 81.46 | 63.51 | 77.99|60.16 | 86.86| 1507.7 |73.28 |61.08|44.6|
- <!-- | [LLaVA-v1.5-lora](https://huggingface.co/liuhaotian/llava-v1.5-7b) | 7B |79.10 | 63.00| 68.40 |58.20| 86.40 | 1476.9 | 66.10 |- |30.2| -->
-
-
-
- ## License
- This project is licensed under the Apache License 2.0 - see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) file for details.
-
- ## About us
- This project is maintained by the [MILVLG](https://github.com/MILVLG)@Hangzhou Dianzi University (HDU) led by Prof. Zhou Yu and Jun Yu, and is mainly developed by Zhenwei Shao and Xuecheng Ouyang. We hope our model may serve as a strong baseline to inspire future research on MSLM, as well as its derivative applications on mobile devices and robots.
-
- ## Citation
-
- If you use our model or refer our work in your studies, please cite:
-
- ```bibtex
- @misc{imp2024,
-     author = {Shao, Zhenwei and Ouyang, Xuecheng and Yu, Zhou and Yu, Jun},
-     title = {Imp: An Emprical Study of Multimodal Small Language Models},
-     year = {2024},
-     url = {https://huggingface.co/MILVLG/imp-v1-3b}
- }
- ```

README.md CHANGED
@@ -1,3 +1,95 @@
  ---
  license: apache-2.0
+ pipeline_tag: text-generation
+ datasets:
+ - liuhaotian/LLaVA-Pretrain
+ - liuhaotian/LLaVA-Instruct-150K
  ---
+ # 😈 Imp
+
+ > A very small man can cast a very large shadow.
+ >
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;——*George R.R. Martin, A Clash of Kings*
+
+
+ \[Technical report (coming soon)\]&nbsp;&nbsp;[[Demo](https://xmbot.net/imp/)\]&nbsp;&nbsp;[[Github](https://github.com/MILVLG/imp)\]
+
+ ## Introduction
+
+ The Imp project aims to provide a family of strong multimodal `small` language models (MSLMs). Our `Imp-v1.5-4B-Phi3` is a strong MSLM with only **4B** parameters, which is built upon a small yet powerful SLM, [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) (3.8B), and a powerful visual encoder, [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) (0.4B), and trained on a 1M mixed dataset.
+
+
+
+ We release our model weights and provide an example below to run our model. A detailed technical report and the corresponding training/evaluation code will be released soon on our [GitHub repo](https://github.com/MILVLG/imp). We will keep improving our model and releasing new versions to further improve its performance :)
+
+
+ ## How to use
+
+
+ **Install dependencies**
+ ```bash
+ pip install transformers # the latest version works, but we recommend v4.36.0
+ pip install -q pillow accelerate einops
+ ```
+
+ You can use the following code for model inference. The format of the text instruction is similar to [LLaVA](https://github.com/haotian-liu/LLaVA). A Colab notebook that runs this example is provided [here](https://colab.research.google.com/drive/1EBYky6xIPjnlPppo2gZaiNK6gEsjXgom?usp=drive_link#scrollTo=2-VpU6QzWCVZ). Note that the example can currently only be run on GPUs.
+
+ ```Python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from PIL import Image
+
+ torch.set_default_device("cuda")
+
+ # Create the model
+ model = AutoModelForCausalLM.from_pretrained(
+     "MILVLG/imp-v1.5-4b",
+     torch_dtype=torch.float16,
+     device_map="auto",
+     trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1.5-4b", trust_remote_code=True)
+
+ # Set the inputs
+ text = "<|user|>\n<image>\nWhat are the colors of the bus in the image?\n<|end|>\n<|assistant|>\n"
+ image = Image.open("images/bus.jpg")
+
+ input_ids = tokenizer(text, return_tensors='pt').input_ids
+ image_tensor = model.image_preprocess(image)
+
+ # Generate the answer
+ output_ids = model.generate(
+     input_ids,
+     max_new_tokens=100,
+     images=image_tensor,
+     use_cache=True)[0]
+ print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
+ ```
+
+ ## Model evaluation
+ We conduct evaluations on 9 commonly used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks, to compare our Imp model with LLaVA (7B) and existing MSLMs of similar model sizes.
+
+ | Models | Size | VQAv2 | GQA | SQA(IMG) | TextVQA | POPE | MME(P) | MMB | MMB_CN | MM-Vet |
+ |:--------:|:-----:|:----:|:-------------:|:--------:|:-----:|:----:|:-------:|:-------:|:-------:|:-------:|
+ | Imp-v1.5-4B-Phi3 | 4B | 81.46 | 63.51 | 77.99 | 60.16 | 86.86 | 1507.7 | 73.28 | 61.08 | 44.6 |
+ <!-- | [LLaVA-v1.5-lora](https://huggingface.co/liuhaotian/llava-v1.5-7b) | 7B | 79.10 | 63.00 | 68.40 | 58.20 | 86.40 | 1476.9 | 66.10 | - | 30.2 | -->
+
+
+
+ ## License
+ This project is licensed under the Apache License 2.0 - see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) file for details.
+
+ ## About us
+ This project is maintained by [MILVLG](https://github.com/MILVLG) at Hangzhou Dianzi University (HDU), led by Prof. Zhou Yu and Prof. Jun Yu, and is mainly developed by Zhenwei Shao and Xuecheng Ouyang. We hope our model may serve as a strong baseline to inspire future research on MSLMs, as well as their derivative applications on mobile devices and robots.
+
+ ## Citation
+
+ If you use our model or refer to our work in your studies, please cite:
+
+ ```bibtex
+ @misc{imp2024,
+     author = {Shao, Zhenwei and Ouyang, Xuecheng and Yu, Zhou and Yu, Jun},
+     title = {Imp: An Empirical Study of Multimodal Small Language Models},
+     year = {2024},
+     url = {https://huggingface.co/MILVLG/imp-v1-3b}
+ }
+ ```
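
Note: the float16 example in the updated README assumes a GPU with enough memory for the full 4B-parameter model. For smaller GPUs, a minimal sketch of a 4-bit quantized load via the standard `transformers`/`bitsandbytes` path is shown below; whether the repository's custom `imp_phi3` modeling code is compatible with 4-bit quantization is an assumption not confirmed by this commit, and `bitsandbytes` must be installed separately.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization to reduce GPU memory; compatibility with the
# repo's custom imp_phi3 modeling code is assumed, not verified here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1.5-4b",           # repo id as used in the README example
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1.5-4b", trust_remote_code=True)
# Prompt construction and generation then follow the README example unchanged.
```
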
config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "MILVLG/imp-v1.5-4b",
+ "_name_or_path": "MILVLG/Imp-v1.5-4B-Phi3",
  "activation_function": "gelu_new",
  "architectures": [
  "ImpPhi3ForCausalLM"