Text Generation
Transformers
Safetensors
imp_phi3
conversational
custom_code
Oyoy1235 commited on
Commit
e37c108
1 Parent(s): 20bc564

Imp-v1.5-4b update

Browse files
LICENSE ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
README copy.md ADDED
@@ -0,0 +1,96 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: text-generation
4
+ datasets:
5
+ - liuhaotian/LLaVA-Pretrain
6
+ - liuhaotian/LLaVA-Instruct-150K
7
+ ---
8
+ # 😈 Imp
9
+
10
+ > A very small man can cast a very large shadow.
11
+ >
12
+ >           ——*George R.R. Martin, A Clash of Kings*
13
+
14
+
15
+ \[Technical report (coming soon)\]  [[Demo](https://xmbot.net/imp/)\]  [[Github](https://github.com/MILVLG/imp)\]
16
+
17
+ ## Introduction
18
+
19
+ The Imp project aims to provide a family of a strong multimodal `small` language models (MSLMs). Our `imp-v1.5-4b` is a strong MSLM with only **4B** parameters, which is build upon a small yet powerful SLM [Phi-3 ](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)(3.8B) and a powerful visual encoder [SigLIP ](https://huggingface.co/google/siglip-so400m-patch14-384)(0.4B), and trained on 1M mixed dataset.
20
+
21
+ As shown in the Table below, `imp-v1.5-4b` significantly outperforms the counterparts of similar model sizes on various multimodal benchmarks.
22
+
23
+
24
+ We release our model weights and provide an example below to run our model . Detailed technical report and corresponding training/evaluation code will be released soon on our [GitHub repo](https://github.com/MILVLG/imp). We will persistently improve our model and release the next versions to further improve model performance :)
25
+
26
+
27
+ ## How to use
28
+
29
+
30
+ **Install dependencies**
31
+ ```bash
32
+ pip install transformers # latest version is ok, but we recommend v4.36.0
33
+ pip install -q pillow accelerate einops
34
+ ```
35
+
36
+ You can use the following code for model inference. The format of text instruction is similar to [LLaVA](https://github.com/haotian-liu/LLaVA). A Colab page to run this example is provided [here](https://colab.research.google.com/drive/1EBYky6xIPjnlPppo2gZaiNK6gEsjXgom?usp=drive_link#scrollTo=2-VpU6QzWCVZ). Note that the example can only be run on GPUs currently.
37
+
38
+ ```Python
39
+ import torch
40
+ from transformers import AutoModelForCausalLM, AutoTokenizer
41
+ from PIL import Image
42
+
43
+ torch.set_default_device("cuda")
44
+
45
+ #Create model
46
+ model = AutoModelForCausalLM.from_pretrained(
47
+ "MILVLG/imp-v1.5-4b",
48
+ torch_dtype=torch.float16,
49
+ device_map="auto",
50
+ trust_remote_code=True)
51
+ tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1.5-4b", trust_remote_code=True)
52
+
53
+ #Set inputs
54
+ text = "<|user|>\n<image>\nWhat are the colors of the bus in the image?\n<|end|>\n<|assistant|>\n"
55
+ image = Image.open("images/bus.jpg")
56
+
57
+ input_ids = tokenizer(text, return_tensors='pt').input_ids
58
+ image_tensor = model.image_preprocess(image)
59
+
60
+ #Generate the answer
61
+ output_ids = model.generate(
62
+ input_ids,
63
+ max_new_tokens=100,
64
+ images=image_tensor,
65
+ use_cache=True)[0]
66
+ print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
67
+ ```
68
+
69
+ ## Model evaluation
70
+ We conduct evaluation on 9 commonly-used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks, to compare our Imp model with LLaVA (7B) and existing MSLMs of similar model sizes.
71
+
72
+ | Models | Size | VQAv2 | GQA | SQA(IMG) | TextVQA | POPE | MME(P) | MMB |MMB_CN|MM-Vet|
73
+ |:--------:|:-----:|:----:|:-------------:|:--------:|:-----:|:----:|:-------:|:-------:|:-------:|:-------:|
74
+ | imp-v1.5-4b| 4B | 81.46 | 63.51 | 77.99|60.16 | 86.86| 1507.7 |73.28 |61.08|44.6|
75
+ <!-- | [LLaVA-v1.5-lora](https://huggingface.co/liuhaotian/llava-v1.5-7b) | 7B |79.10 | 63.00| 68.40 |58.20| 86.40 | 1476.9 | 66.10 |- |30.2| -->
76
+
77
+
78
+
79
+ ## License
80
+ This project is licensed under the Apache License 2.0 - see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) file for details.
81
+
82
+ ## About us
83
+ This project is maintained by the [MILVLG](https://github.com/MILVLG)@Hangzhou Dianzi University (HDU) led by Prof. Zhou Yu and Jun Yu, and is mainly developed by Zhenwei Shao and Xuecheng Ouyang. We hope our model may serve as a strong baseline to inspire future research on MSLM, as well as its derivative applications on mobile devices and robots.
84
+
85
+ ## Citation
86
+
87
+ If you use our model or refer our work in your studies, please cite:
88
+
89
+ ```bibtex
90
+ @misc{imp2024,
91
+ author = {Shao, Zhenwei and Ouyang, Xuecheng and Yu, Zhou and Yu, Jun},
92
+ title = {Imp: An Emprical Study of Multimodal Small Language Models},
93
+ year = {2024},
94
+ url = {https://huggingface.co/MILVLG/imp-v1-3b}
95
+ }
96
+ ```
added_tokens.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "<|assistant|>": 32001,
3
+ "<|endoftext|>": 32000,
4
+ "<|end|>": 32007,
5
+ "<|placeholder1|>": 32002,
6
+ "<|placeholder2|>": 32003,
7
+ "<|placeholder3|>": 32004,
8
+ "<|placeholder4|>": 32005,
9
+ "<|placeholder5|>": 32008,
10
+ "<|placeholder6|>": 32009,
11
+ "<|system|>": 32006,
12
+ "<|user|>": 32010,
13
+ "<image>": 32011
14
+ }
config.json ADDED
@@ -0,0 +1,79 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "MILVLG/imp-v1.5-4b",
3
+ "activation_function": "gelu_new",
4
+ "architectures": [
5
+ "ImpPhi3ForCausalLM"
6
+ ],
7
+ "attn_pdrop": 0.0,
8
+ "auto_map": {
9
+ "AutoConfig": "configuration_imp.ImpPhi3Config",
10
+ "AutoModelForCausalLM": "modeling_imp.ImpPhi3ForCausalLM"
11
+ },
12
+ "embd_pdrop": 0.0,
13
+ "bos_token_id": 1,
14
+ "eos_token_id": 32007,
15
+ "flash_attn": false,
16
+ "flash_rotary": false,
17
+ "freeze_mm_mlp_adapter": false,
18
+ "hidden_act": "silu",
19
+ "hidden_size": 3072,
20
+ "image_aspect_ratio": "pad",
21
+ "initializer_range": 0.02,
22
+ "intermediate_size": 8192,
23
+ "max_position_embeddings": 4096,
24
+ "mm_hidden_size": 1152,
25
+ "mm_projector_lr": 2e-05,
26
+ "mm_projector_type": "mlp2x_gelu",
27
+ "mm_use_im_patch_token": false,
28
+ "mm_use_im_start_end": false,
29
+ "mm_vision_select_feature": "patch",
30
+ "mm_vision_select_layer": -2,
31
+ "mm_vision_tower": "google/siglip-so400m-patch14-384",
32
+ "model_type": "imp_phi3",
33
+ "num_attention_heads": 32,
34
+ "num_hidden_layers": 32,
35
+ "num_key_value_heads": 32,
36
+ "original_max_position_embeddings": 4096,
37
+ "pad_token_id": 32000,
38
+ "resid_pdrop": 0.0,
39
+ "rms_norm_eps": 1e-05,
40
+ "rope_scaling": null,
41
+ "rope_theta": 10000.0,
42
+ "sliding_window": 2047,
43
+ "tie_word_embeddings": false,
44
+ "tokenizer_model_max_length": 2560,
45
+ "tokenizer_padding_side": "right",
46
+ "torch_dtype": "bfloat16",
47
+ "transformers_version": "4.36.0",
48
+ "tune_mm_mlp_adapter": false,
49
+ "use_cache": true,
50
+ "use_mm_proj": true,
51
+
52
+
53
+ "fused_dense": false,
54
+ "image_token": "<image>",
55
+ "image_token_index": 32011,
56
+ "img_processor": null,
57
+ "layer_norm_epsilon": 1e-05,
58
+ "n_embd": 2560,
59
+ "n_head": 32,
60
+ "n_head_kv": null,
61
+ "n_inner": null,
62
+ "n_layer": 32,
63
+ "n_positions": 3072,
64
+ "rotary_dim": 32,
65
+ "vision_tower_config": {
66
+ "attention_dropout": 0.0,
67
+ "hidden_act": "gelu_pytorch_tanh",
68
+ "hidden_size": 1152,
69
+ "image_size": 384,
70
+ "intermediate_size": 4304,
71
+ "layer_norm_eps": 1e-06,
72
+ "model_type": "siglip_vision_model",
73
+ "num_attention_heads": 16,
74
+ "num_channels": 3,
75
+ "num_hidden_layers": 27,
76
+ "patch_size": 14
77
+ },
78
+ "vocab_size": 32064
79
+ }
configuration_imp.py ADDED
@@ -0,0 +1,313 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) MILVLG team.
2
+ # Licensed under the Apache 2.0 license.
3
+ #
4
+ # Some code here is copied from the project Phi-2 (https://huggingface.co/microsoft/phi-2),
5
+ # SigLIP@transformers==4.37.0.dev0 (https://huggingface.co/google/siglip-so400m-patch14-384),
6
+ # and Llava (https://github.com/haotian-liu/LLaVA), and modified by
7
+ # Zhenwei Shao (shaozw@hdu.edu.cn) @ MILVLG. We thank them for their great works.
8
+ #
9
+ # We keep their original copyright statements as follows, which should be inherited:
10
+ # ------------------------------- Phi-2 ---------------------------------------------
11
+ # Copyright (c) Microsoft Corporation.
12
+ # Licensed under the MIT license.
13
+ # https://huggingface.co/google/siglip-so400m-patch14-384
14
+ #
15
+ # Copyright (c) 2022, Tri Dao, trid@cs.stanford.edu.
16
+ # Licensed under the BSD 3-Clause License.
17
+ # ------------------------------- SigLIP --------------------------------------------
18
+ # Copyright 2024 Google AI and The HuggingFace Team. All rights reserved.
19
+ #
20
+ # Licensed under the Apache License, Version 2.0 (the "License");
21
+ # you may not use this file except in compliance with the License.
22
+ # You may obtain a copy of the License at
23
+ #
24
+ # http://www.apache.org/licenses/LICENSE-2.0
25
+ #
26
+ # Unless required by applicable law or agreed to in writing, software
27
+ # distributed under the License is distributed on an "AS IS" BASIS,
28
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
29
+ # See the License for the specific language governing permissions and
30
+ # limitations under the License.
31
+ # ------------------------------- Llava ---------------------------------------------
32
+ # Copyright 2023 Haotian Liu
33
+ #
34
+ # Licensed under the Apache License, Version 2.0 (the "License");
35
+ # you may not use this file except in compliance with the License.
36
+ # You may obtain a copy of the License at
37
+ #
38
+ # http://www.apache.org/licenses/LICENSE-2.0
39
+ #
40
+ # Unless required by applicable law or agreed to in writing, software
41
+ # distributed under the License is distributed on an "AS IS" BASIS,
42
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
43
+ # See the License for the specific language governing permissions and
44
+ # limitations under the License.
45
+ # -----------------------------------------------------------------------------------
46
+
47
+
48
+ import os
49
+ import math
50
+ from typing import Optional, Union
51
+
52
+ from transformers import PretrainedConfig
53
+ from transformers.utils import logging
54
+
55
+ logger = logging.get_logger(__name__)
56
+
57
+
58
+ class Phi3Config(PretrainedConfig):
59
+ r"""
60
+ This is the configuration class to store the configuration of a [`Phi3Model`]. It is used to instantiate a Phi-3
61
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
62
+ defaults will yield a similar configuration to that of the
63
+ [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct).
64
+
65
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
66
+ documentation from [`PretrainedConfig`] for more information.
67
+
68
+ Args:
69
+ vocab_size (`int`, *optional*, defaults to 32064):
70
+ Vocabulary size of the Phi-3 model. Defines the number of different tokens that can be represented by the
71
+ `inputs_ids` passed when calling [`Phi3Model`].
72
+ hidden_size (`int`, *optional*, defaults to 3072):
73
+ Dimension of the hidden representations.
74
+ intermediate_size (`int`, *optional*, defaults to 8192):
75
+ Dimension of the MLP representations.
76
+ num_hidden_layers (`int`, *optional*, defaults to 32):
77
+ Number of hidden layers in the Transformer decoder.
78
+ num_attention_heads (`int`, *optional*, defaults to 32):
79
+ Number of attention heads for each attention layer in the Transformer decoder.
80
+ num_key_value_heads (`int`, *optional*):
81
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
82
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
83
+ `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
84
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
85
+ by meanpooling all the original heads within that group. For more details checkout [this
86
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
87
+ `num_attention_heads`.
88
+ resid_pdrop (`float`, *optional*, defaults to 0.0):
89
+ Dropout probability for mlp outputs.
90
+ embd_pdrop (`int`, *optional*, defaults to 0.0):
91
+ The dropout ratio for the embeddings.
92
+ attention_dropout (`float`, *optional*, defaults to 0.0):
93
+ The dropout ratio after computing the attention scores.
94
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
95
+ The non-linear activation function (function or string) in the decoder.
96
+ max_position_embeddings (`int`, *optional*, defaults to 4096):
97
+ The maximum sequence length that this model might ever be used with.
98
+ original_max_position_embeddings (`int`, *optional*, defaults to 4096):
99
+ The maximum sequence length that this model was trained with. This is used to determine the size of the
100
+ original RoPE embeddings when using long scaling.
101
+ initializer_range (`float`, *optional*, defaults to 0.02):
102
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
103
+ rms_norm_eps (`float`, *optional*, defaults to 1e-05):
104
+ The epsilon value used for the RMSNorm.
105
+ use_cache (`bool`, *optional*, defaults to `True`):
106
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
107
+ relevant if `config.is_decoder=True`. Whether to tie weight embeddings or not.
108
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
109
+ Whether to tie weight embeddings
110
+ rope_theta (`float`, *optional*, defaults to 10000.0):
111
+ The base period of the RoPE embeddings.
112
+ rope_scaling (`dict`, *optional*):
113
+ The scaling strategy for the RoPE embeddings. If `None`, no scaling is applied. If a dictionary, it must
114
+ contain the following keys: `type`, `short_factor` and `long_factor`. The `type` must be either `su` or `yarn` and
115
+ the `short_factor` and `long_factor` must be lists of numbers with the same length as the hidden size
116
+ divided by the number of attention heads divided by 2.
117
+ bos_token_id (`int`, *optional*, defaults to 1):
118
+ The id of the "beginning-of-sequence" token.
119
+ eos_token_id (`int`, *optional*, defaults to 32000):
120
+ The id of the "end-of-sequence" token.
121
+ pad_token_id (`int`, *optional*, defaults to 32000):
122
+ The id of the padding token.
123
+ sliding_window (`int`, *optional*):
124
+ Sliding window attention window size. If `None`, no sliding window is applied.
125
+
126
+ Example:
127
+
128
+ ```python
129
+ >>> from transformers import Phi3Model, Phi3Config
130
+
131
+ >>> # Initializing a Phi-3 style configuration
132
+ >>> configuration = Phi3Config.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
133
+
134
+ >>> # Initializing a model from the configuration
135
+ >>> model = Phi3Model(configuration)
136
+
137
+ >>> # Accessing the model configuration
138
+ >>> configuration = model.config
139
+ ```"""
140
+
141
+ model_type = "phi3"
142
+ keys_to_ignore_at_inference = ["past_key_values"]
143
+
144
+ def __init__(
145
+ self,
146
+ vocab_size=32064,
147
+ hidden_size=3072,
148
+ intermediate_size=8192,
149
+ num_hidden_layers=32,
150
+ num_attention_heads=32,
151
+ num_key_value_heads=None,
152
+ resid_pdrop=0.0,
153
+ embd_pdrop=0.0,
154
+ attention_dropout=0.0,
155
+ hidden_act="silu",
156
+ max_position_embeddings=4096,
157
+ original_max_position_embeddings=4096,
158
+ initializer_range=0.02,
159
+ rms_norm_eps=1e-5,
160
+ use_cache=True,
161
+ tie_word_embeddings=False,
162
+ rope_theta=10000.0,
163
+ rope_scaling=None,
164
+ bos_token_id=1,
165
+ eos_token_id=32000,
166
+ pad_token_id=32000,
167
+ sliding_window=None,
168
+ **kwargs,
169
+ ):
170
+ self.vocab_size = vocab_size
171
+ self.hidden_size = hidden_size
172
+ self.intermediate_size = intermediate_size
173
+ self.num_hidden_layers = num_hidden_layers
174
+ self.num_attention_heads = num_attention_heads
175
+
176
+ if num_key_value_heads is None:
177
+ num_key_value_heads = num_attention_heads
178
+
179
+ self.num_key_value_heads = num_key_value_heads
180
+ self.resid_pdrop = resid_pdrop
181
+ self.embd_pdrop = embd_pdrop
182
+ self.attention_dropout = attention_dropout
183
+ self.hidden_act = hidden_act
184
+ self.max_position_embeddings = max_position_embeddings
185
+ self.original_max_position_embeddings = original_max_position_embeddings
186
+ self.initializer_range = initializer_range
187
+ self.rms_norm_eps = rms_norm_eps
188
+ self.use_cache = use_cache
189
+ self.rope_theta = rope_theta
190
+ self.rope_scaling = rope_scaling
191
+ self._rope_scaling_validation()
192
+ self.sliding_window = sliding_window
193
+
194
+ super().__init__(
195
+ bos_token_id=bos_token_id,
196
+ eos_token_id=eos_token_id,
197
+ pad_token_id=pad_token_id,
198
+ tie_word_embeddings=tie_word_embeddings,
199
+ **kwargs,
200
+ )
201
+
202
+ def _rope_scaling_validation(self):
203
+ """
204
+ Validate the `rope_scaling` configuration.
205
+ """
206
+ if self.rope_scaling is None:
207
+ return
208
+
209
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 3:
210
+ raise ValueError(
211
+ "`rope_scaling` must be a dictionary with three fields, `type`, `short_factor` and `long_factor`, "
212
+ f"got {self.rope_scaling}"
213
+ )
214
+ rope_scaling_type = self.rope_scaling.get("type", None)
215
+ rope_scaling_short_factor = self.rope_scaling.get("short_factor", None)
216
+ rope_scaling_long_factor = self.rope_scaling.get("long_factor", None)
217
+ if rope_scaling_type is None or rope_scaling_type not in ["su", "yarn"]:
218
+ raise ValueError(f"`rope_scaling`'s type field must be one of ['su', 'yarn'], got {rope_scaling_type}")
219
+ if not (
220
+ isinstance(rope_scaling_short_factor, list)
221
+ and all(isinstance(x, (int, float)) for x in rope_scaling_short_factor)
222
+ ):
223
+ raise ValueError(
224
+ f"`rope_scaling`'s short_factor field must be a list of numbers, got {rope_scaling_short_factor}"
225
+ )
226
+ if not len(rope_scaling_short_factor) == self.hidden_size // self.num_attention_heads // 2:
227
+ raise ValueError(
228
+ f"`rope_scaling`'s short_factor field must have length {self.hidden_size // self.num_attention_heads // 2}, got {len(rope_scaling_short_factor)}"
229
+ )
230
+ if not (
231
+ isinstance(rope_scaling_long_factor, list)
232
+ and all(isinstance(x, (int, float)) for x in rope_scaling_long_factor)
233
+ ):
234
+ raise ValueError(
235
+ f"`rope_scaling`'s long_factor field must be a list of numbers, got {rope_scaling_long_factor}"
236
+ )
237
+ if not len(rope_scaling_long_factor) == self.hidden_size // self.num_attention_heads // 2:
238
+ raise ValueError(
239
+ f"`rope_scaling`'s long_factor field must have length {self.hidden_size // self.num_attention_heads // 2}, got {len(rope_scaling_long_factor)}"
240
+ )
241
+
242
+
243
+
244
+ class SiglipVisionConfig(PretrainedConfig):
245
+
246
+ model_type = "siglip_vision_model"
247
+
248
+ def __init__(
249
+ self,
250
+ hidden_size=768,
251
+ intermediate_size=3072,
252
+ num_hidden_layers=12,
253
+ num_attention_heads=12,
254
+ num_channels=3,
255
+ image_size=224,
256
+ patch_size=16,
257
+ hidden_act="gelu_pytorch_tanh",
258
+ layer_norm_eps=1e-6,
259
+ attention_dropout=0.0,
260
+ **kwargs,
261
+ ):
262
+ super().__init__(**kwargs)
263
+
264
+ self.hidden_size = hidden_size
265
+ self.intermediate_size = intermediate_size
266
+ self.num_hidden_layers = num_hidden_layers
267
+ self.num_attention_heads = num_attention_heads
268
+ self.num_channels = num_channels
269
+ self.patch_size = patch_size
270
+ self.image_size = image_size
271
+ self.attention_dropout = attention_dropout
272
+ self.layer_norm_eps = layer_norm_eps
273
+ self.hidden_act = hidden_act
274
+
275
+ @classmethod
276
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
277
+ cls._set_token_in_kwargs(kwargs)
278
+
279
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
280
+
281
+ # get the vision config dict if we are loading from SiglipConfig
282
+ if config_dict.get("model_type") == "siglip":
283
+ config_dict = config_dict["vision_config"]
284
+
285
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
286
+ logger.warning(
287
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
288
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
289
+ )
290
+
291
+ return cls.from_dict(config_dict, **kwargs)
292
+
293
+
294
+ class ImpPhi3Config(Phi3Config):
295
+ model_type = "imp_phi3"
296
+
297
+ def __init__(self, **kwargs):
298
+ super().__init__(**kwargs)
299
+ self.image_token_index = getattr(self, "image_token_index", 50296)
300
+ self.image_token = getattr(self, "image_token", "<image>")
301
+
302
+ if not hasattr(self, "vision_tower_config") and hasattr(self, "mm_vision_tower"):
303
+ vision_tower_config = SiglipVisionConfig.from_pretrained(self.mm_vision_tower)
304
+ self.vision_tower_config = vision_tower_config.to_diff_dict()
305
+
306
+ @property
307
+ def vision_tower_cfg(self):
308
+ cfg = SiglipVisionConfig.from_dict(self.vision_tower_config)
309
+ # imp-v1 only supports `patch` feature for now w/o cls token
310
+ # cfg.mm_vision_select_feature = self.mm_vision_select_feature
311
+ cfg.mm_vision_select_layer = self.mm_vision_select_layer
312
+ cfg.mm_vision_tower = self.mm_vision_tower
313
+ return cfg
generation_config.json ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": [
5
+ 32000,
6
+ 32001,
7
+ 32007
8
+ ],
9
+ "max_new_tokens": 2000,
10
+ "pad_token_id": 2,
11
+ "transformers_version": "4.36.0"
12
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7fc741406a274d9e77885cef89cc57d4e4ce75cb50c3b07d079e1835a27f2613
3
+ size 952015208
model-00002-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:582f5813afe4c1677272bc5812749ab2f3f108d3e2e75c4b440a180c050eb041
3
+ size 1006684976
model-00003-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:65c38648b356ffba04485d18588eb042fb96b6f0c17716924f0e83711490835b
3
+ size 975240328
model-00004-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6810ac18f6536369f2967f69bc8063a95d45fe961ef18ef8ea0e7c38a5ce0917
3
+ size 962644808
model-00005-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e7e18a5f170fddd42ea0ff4cc3765ec41f2661897ff3cbe20237b6ba2655152b
3
+ size 1006685000
model-00006-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:251073aa76eda06de96c588bd2baff4f707cbae92c057ba6bae3168f737acd60
3
+ size 975240344
model-00007-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6be89bb18770df00837859bdf4a41ae2bc8abf28b01b1edddf5aeb05cdf91325
3
+ size 962644808
model-00008-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c845e1653a1505dc8c39ea86e7ef14d5df40cb56e786a90e57cdfd6a56ee242c
3
+ size 1023878536
model-00009-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d38fcded882b7e25c4fa224bf1fea6a675dd8abf3db234645a1a81d471654930
3
+ size 598672880
model.safetensors.index.json ADDED
@@ -0,0 +1,627 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "metadata": {
3
+ "total_size": 8463619136
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00009-of-00009.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00009.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00009.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00009.safetensors",
10
+ "model.layers.0.mlp.gate_up_proj.weight": "model-00001-of-00009.safetensors",
11
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00009.safetensors",
12
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00009.safetensors",
13
+ "model.layers.0.self_attn.qkv_proj.weight": "model-00001-of-00009.safetensors",
14
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00009.safetensors",
15
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00009.safetensors",
16
+ "model.layers.1.mlp.gate_up_proj.weight": "model-00001-of-00009.safetensors",
17
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00009.safetensors",
18
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00009.safetensors",
19
+ "model.layers.1.self_attn.qkv_proj.weight": "model-00001-of-00009.safetensors",
20
+ "model.layers.10.input_layernorm.weight": "model-00003-of-00009.safetensors",
21
+ "model.layers.10.mlp.down_proj.weight": "model-00003-of-00009.safetensors",
22
+ "model.layers.10.mlp.gate_up_proj.weight": "model-00003-of-00009.safetensors",
23
+ "model.layers.10.post_attention_layernorm.weight": "model-00003-of-00009.safetensors",
24
+ "model.layers.10.self_attn.o_proj.weight": "model-00003-of-00009.safetensors",
25
+ "model.layers.10.self_attn.qkv_proj.weight": "model-00003-of-00009.safetensors",
26
+ "model.layers.11.input_layernorm.weight": "model-00003-of-00009.safetensors",
27
+ "model.layers.11.mlp.down_proj.weight": "model-00003-of-00009.safetensors",
28
+ "model.layers.11.mlp.gate_up_proj.weight": "model-00003-of-00009.safetensors",
29
+ "model.layers.11.post_attention_layernorm.weight": "model-00003-of-00009.safetensors",
30
+ "model.layers.11.self_attn.o_proj.weight": "model-00003-of-00009.safetensors",
31
+ "model.layers.11.self_attn.qkv_proj.weight": "model-00003-of-00009.safetensors",
32
+ "model.layers.12.input_layernorm.weight": "model-00004-of-00009.safetensors",
33
+ "model.layers.12.mlp.down_proj.weight": "model-00004-of-00009.safetensors",
34
+ "model.layers.12.mlp.gate_up_proj.weight": "model-00004-of-00009.safetensors",
35
+ "model.layers.12.post_attention_layernorm.weight": "model-00004-of-00009.safetensors",
36
+ "model.layers.12.self_attn.o_proj.weight": "model-00003-of-00009.safetensors",
37
+ "model.layers.12.self_attn.qkv_proj.weight": "model-00004-of-00009.safetensors",
38
+ "model.layers.13.input_layernorm.weight": "model-00004-of-00009.safetensors",
39
+ "model.layers.13.mlp.down_proj.weight": "model-00004-of-00009.safetensors",
40
+ "model.layers.13.mlp.gate_up_proj.weight": "model-00004-of-00009.safetensors",
41
+ "model.layers.13.post_attention_layernorm.weight": "model-00004-of-00009.safetensors",
42
+ "model.layers.13.self_attn.o_proj.weight": "model-00004-of-00009.safetensors",
43
+ "model.layers.13.self_attn.qkv_proj.weight": "model-00004-of-00009.safetensors",
44
+ "model.layers.14.input_layernorm.weight": "model-00004-of-00009.safetensors",
45
+ "model.layers.14.mlp.down_proj.weight": "model-00004-of-00009.safetensors",
46
+ "model.layers.14.mlp.gate_up_proj.weight": "model-00004-of-00009.safetensors",
47
+ "model.layers.14.post_attention_layernorm.weight": "model-00004-of-00009.safetensors",
48
+ "model.layers.14.self_attn.o_proj.weight": "model-00004-of-00009.safetensors",
49
+ "model.layers.14.self_attn.qkv_proj.weight": "model-00004-of-00009.safetensors",
50
+ "model.layers.15.input_layernorm.weight": "model-00004-of-00009.safetensors",
51
+ "model.layers.15.mlp.down_proj.weight": "model-00004-of-00009.safetensors",
52
+ "model.layers.15.mlp.gate_up_proj.weight": "model-00004-of-00009.safetensors",
53
+ "model.layers.15.post_attention_layernorm.weight": "model-00004-of-00009.safetensors",
54
+ "model.layers.15.self_attn.o_proj.weight": "model-00004-of-00009.safetensors",
55
+ "model.layers.15.self_attn.qkv_proj.weight": "model-00004-of-00009.safetensors",
56
+ "model.layers.16.input_layernorm.weight": "model-00005-of-00009.safetensors",
57
+ "model.layers.16.mlp.down_proj.weight": "model-00005-of-00009.safetensors",
58
+ "model.layers.16.mlp.gate_up_proj.weight": "model-00005-of-00009.safetensors",
59
+ "model.layers.16.post_attention_layernorm.weight": "model-00005-of-00009.safetensors",
60
+ "model.layers.16.self_attn.o_proj.weight": "model-00004-of-00009.safetensors",
61
+ "model.layers.16.self_attn.qkv_proj.weight": "model-00004-of-00009.safetensors",
62
+ "model.layers.17.input_layernorm.weight": "model-00005-of-00009.safetensors",
63
+ "model.layers.17.mlp.down_proj.weight": "model-00005-of-00009.safetensors",
64
+ "model.layers.17.mlp.gate_up_proj.weight": "model-00005-of-00009.safetensors",
65
+ "model.layers.17.post_attention_layernorm.weight": "model-00005-of-00009.safetensors",
66
+ "model.layers.17.self_attn.o_proj.weight": "model-00005-of-00009.safetensors",
67
+ "model.layers.17.self_attn.qkv_proj.weight": "model-00005-of-00009.safetensors",
68
+ "model.layers.18.input_layernorm.weight": "model-00005-of-00009.safetensors",
69
+ "model.layers.18.mlp.down_proj.weight": "model-00005-of-00009.safetensors",
70
+ "model.layers.18.mlp.gate_up_proj.weight": "model-00005-of-00009.safetensors",
71
+ "model.layers.18.post_attention_layernorm.weight": "model-00005-of-00009.safetensors",
72
+ "model.layers.18.self_attn.o_proj.weight": "model-00005-of-00009.safetensors",
73
+ "model.layers.18.self_attn.qkv_proj.weight": "model-00005-of-00009.safetensors",
74
+ "model.layers.19.input_layernorm.weight": "model-00005-of-00009.safetensors",
75
+ "model.layers.19.mlp.down_proj.weight": "model-00005-of-00009.safetensors",
76
+ "model.layers.19.mlp.gate_up_proj.weight": "model-00005-of-00009.safetensors",
77
+ "model.layers.19.post_attention_layernorm.weight": "model-00005-of-00009.safetensors",
78
+ "model.layers.19.self_attn.o_proj.weight": "model-00005-of-00009.safetensors",
79
+ "model.layers.19.self_attn.qkv_proj.weight": "model-00005-of-00009.safetensors",
80
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00009.safetensors",
81
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00009.safetensors",
82
+ "model.layers.2.mlp.gate_up_proj.weight": "model-00001-of-00009.safetensors",
83
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00009.safetensors",
84
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00009.safetensors",
85
+ "model.layers.2.self_attn.qkv_proj.weight": "model-00001-of-00009.safetensors",
86
+ "model.layers.20.input_layernorm.weight": "model-00006-of-00009.safetensors",
87
+ "model.layers.20.mlp.down_proj.weight": "model-00006-of-00009.safetensors",
88
+ "model.layers.20.mlp.gate_up_proj.weight": "model-00005-of-00009.safetensors",
89
+ "model.layers.20.post_attention_layernorm.weight": "model-00006-of-00009.safetensors",
90
+ "model.layers.20.self_attn.o_proj.weight": "model-00005-of-00009.safetensors",
91
+ "model.layers.20.self_attn.qkv_proj.weight": "model-00005-of-00009.safetensors",
92
+ "model.layers.21.input_layernorm.weight": "model-00006-of-00009.safetensors",
93
+ "model.layers.21.mlp.down_proj.weight": "model-00006-of-00009.safetensors",
94
+ "model.layers.21.mlp.gate_up_proj.weight": "model-00006-of-00009.safetensors",
95
+ "model.layers.21.post_attention_layernorm.weight": "model-00006-of-00009.safetensors",
96
+ "model.layers.21.self_attn.o_proj.weight": "model-00006-of-00009.safetensors",
97
+ "model.layers.21.self_attn.qkv_proj.weight": "model-00006-of-00009.safetensors",
98
+ "model.layers.22.input_layernorm.weight": "model-00006-of-00009.safetensors",
99
+ "model.layers.22.mlp.down_proj.weight": "model-00006-of-00009.safetensors",
100
+ "model.layers.22.mlp.gate_up_proj.weight": "model-00006-of-00009.safetensors",
101
+ "model.layers.22.post_attention_layernorm.weight": "model-00006-of-00009.safetensors",
102
+ "model.layers.22.self_attn.o_proj.weight": "model-00006-of-00009.safetensors",
103
+ "model.layers.22.self_attn.qkv_proj.weight": "model-00006-of-00009.safetensors",
104
+ "model.layers.23.input_layernorm.weight": "model-00006-of-00009.safetensors",
105
+ "model.layers.23.mlp.down_proj.weight": "model-00006-of-00009.safetensors",
106
+ "model.layers.23.mlp.gate_up_proj.weight": "model-00006-of-00009.safetensors",
107
+ "model.layers.23.post_attention_layernorm.weight": "model-00006-of-00009.safetensors",
108
+ "model.layers.23.self_attn.o_proj.weight": "model-00006-of-00009.safetensors",
109
+ "model.layers.23.self_attn.qkv_proj.weight": "model-00006-of-00009.safetensors",
110
+ "model.layers.24.input_layernorm.weight": "model-00006-of-00009.safetensors",
111
+ "model.layers.24.mlp.down_proj.weight": "model-00006-of-00009.safetensors",
112
+ "model.layers.24.mlp.gate_up_proj.weight": "model-00006-of-00009.safetensors",
113
+ "model.layers.24.post_attention_layernorm.weight": "model-00006-of-00009.safetensors",
114
+ "model.layers.24.self_attn.o_proj.weight": "model-00006-of-00009.safetensors",
115
+ "model.layers.24.self_attn.qkv_proj.weight": "model-00006-of-00009.safetensors",
116
+ "model.layers.25.input_layernorm.weight": "model-00007-of-00009.safetensors",
117
+ "model.layers.25.mlp.down_proj.weight": "model-00007-of-00009.safetensors",
118
+ "model.layers.25.mlp.gate_up_proj.weight": "model-00007-of-00009.safetensors",
119
+ "model.layers.25.post_attention_layernorm.weight": "model-00007-of-00009.safetensors",
120
+ "model.layers.25.self_attn.o_proj.weight": "model-00006-of-00009.safetensors",
121
+ "model.layers.25.self_attn.qkv_proj.weight": "model-00007-of-00009.safetensors",
122
+ "model.layers.26.input_layernorm.weight": "model-00007-of-00009.safetensors",
123
+ "model.layers.26.mlp.down_proj.weight": "model-00007-of-00009.safetensors",
124
+ "model.layers.26.mlp.gate_up_proj.weight": "model-00007-of-00009.safetensors",
125
+ "model.layers.26.post_attention_layernorm.weight": "model-00007-of-00009.safetensors",
126
+ "model.layers.26.self_attn.o_proj.weight": "model-00007-of-00009.safetensors",
127
+ "model.layers.26.self_attn.qkv_proj.weight": "model-00007-of-00009.safetensors",
128
+ "model.layers.27.input_layernorm.weight": "model-00007-of-00009.safetensors",
129
+ "model.layers.27.mlp.down_proj.weight": "model-00007-of-00009.safetensors",
130
+ "model.layers.27.mlp.gate_up_proj.weight": "model-00007-of-00009.safetensors",
131
+ "model.layers.27.post_attention_layernorm.weight": "model-00007-of-00009.safetensors",
132
+ "model.layers.27.self_attn.o_proj.weight": "model-00007-of-00009.safetensors",
133
+ "model.layers.27.self_attn.qkv_proj.weight": "model-00007-of-00009.safetensors",
134
+ "model.layers.28.input_layernorm.weight": "model-00007-of-00009.safetensors",
135
+ "model.layers.28.mlp.down_proj.weight": "model-00007-of-00009.safetensors",
136
+ "model.layers.28.mlp.gate_up_proj.weight": "model-00007-of-00009.safetensors",
137
+ "model.layers.28.post_attention_layernorm.weight": "model-00007-of-00009.safetensors",
138
+ "model.layers.28.self_attn.o_proj.weight": "model-00007-of-00009.safetensors",
139
+ "model.layers.28.self_attn.qkv_proj.weight": "model-00007-of-00009.safetensors",
140
+ "model.layers.29.input_layernorm.weight": "model-00008-of-00009.safetensors",
141
+ "model.layers.29.mlp.down_proj.weight": "model-00008-of-00009.safetensors",
142
+ "model.layers.29.mlp.gate_up_proj.weight": "model-00008-of-00009.safetensors",
143
+ "model.layers.29.post_attention_layernorm.weight": "model-00008-of-00009.safetensors",
144
+ "model.layers.29.self_attn.o_proj.weight": "model-00007-of-00009.safetensors",
145
+ "model.layers.29.self_attn.qkv_proj.weight": "model-00007-of-00009.safetensors",
146
+ "model.layers.3.input_layernorm.weight": "model-00002-of-00009.safetensors",
147
+ "model.layers.3.mlp.down_proj.weight": "model-00002-of-00009.safetensors",
148
+ "model.layers.3.mlp.gate_up_proj.weight": "model-00002-of-00009.safetensors",
149
+ "model.layers.3.post_attention_layernorm.weight": "model-00002-of-00009.safetensors",
150
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00009.safetensors",
151
+ "model.layers.3.self_attn.qkv_proj.weight": "model-00001-of-00009.safetensors",
152
+ "model.layers.30.input_layernorm.weight": "model-00008-of-00009.safetensors",
153
+ "model.layers.30.mlp.down_proj.weight": "model-00008-of-00009.safetensors",
154
+ "model.layers.30.mlp.gate_up_proj.weight": "model-00008-of-00009.safetensors",
155
+ "model.layers.30.post_attention_layernorm.weight": "model-00008-of-00009.safetensors",
156
+ "model.layers.30.self_attn.o_proj.weight": "model-00008-of-00009.safetensors",
157
+ "model.layers.30.self_attn.qkv_proj.weight": "model-00008-of-00009.safetensors",
158
+ "model.layers.31.input_layernorm.weight": "model-00008-of-00009.safetensors",
159
+ "model.layers.31.mlp.down_proj.weight": "model-00008-of-00009.safetensors",
160
+ "model.layers.31.mlp.gate_up_proj.weight": "model-00008-of-00009.safetensors",
161
+ "model.layers.31.post_attention_layernorm.weight": "model-00008-of-00009.safetensors",
162
+ "model.layers.31.self_attn.o_proj.weight": "model-00008-of-00009.safetensors",
163
+ "model.layers.31.self_attn.qkv_proj.weight": "model-00008-of-00009.safetensors",
164
+ "model.layers.4.input_layernorm.weight": "model-00002-of-00009.safetensors",
165
+ "model.layers.4.mlp.down_proj.weight": "model-00002-of-00009.safetensors",
166
+ "model.layers.4.mlp.gate_up_proj.weight": "model-00002-of-00009.safetensors",
167
+ "model.layers.4.post_attention_layernorm.weight": "model-00002-of-00009.safetensors",
168
+ "model.layers.4.self_attn.o_proj.weight": "model-00002-of-00009.safetensors",
169
+ "model.layers.4.self_attn.qkv_proj.weight": "model-00002-of-00009.safetensors",
170
+ "model.layers.5.input_layernorm.weight": "model-00002-of-00009.safetensors",
171
+ "model.layers.5.mlp.down_proj.weight": "model-00002-of-00009.safetensors",
172
+ "model.layers.5.mlp.gate_up_proj.weight": "model-00002-of-00009.safetensors",
173
+ "model.layers.5.post_attention_layernorm.weight": "model-00002-of-00009.safetensors",
174
+ "model.layers.5.self_attn.o_proj.weight": "model-00002-of-00009.safetensors",
175
+ "model.layers.5.self_attn.qkv_proj.weight": "model-00002-of-00009.safetensors",
176
+ "model.layers.6.input_layernorm.weight": "model-00002-of-00009.safetensors",
177
+ "model.layers.6.mlp.down_proj.weight": "model-00002-of-00009.safetensors",
178
+ "model.layers.6.mlp.gate_up_proj.weight": "model-00002-of-00009.safetensors",
179
+ "model.layers.6.post_attention_layernorm.weight": "model-00002-of-00009.safetensors",
180
+ "model.layers.6.self_attn.o_proj.weight": "model-00002-of-00009.safetensors",
181
+ "model.layers.6.self_attn.qkv_proj.weight": "model-00002-of-00009.safetensors",
182
+ "model.layers.7.input_layernorm.weight": "model-00003-of-00009.safetensors",
183
+ "model.layers.7.mlp.down_proj.weight": "model-00003-of-00009.safetensors",
184
+ "model.layers.7.mlp.gate_up_proj.weight": "model-00002-of-00009.safetensors",
185
+ "model.layers.7.post_attention_layernorm.weight": "model-00003-of-00009.safetensors",
186
+ "model.layers.7.self_attn.o_proj.weight": "model-00002-of-00009.safetensors",
187
+ "model.layers.7.self_attn.qkv_proj.weight": "model-00002-of-00009.safetensors",
188
+ "model.layers.8.input_layernorm.weight": "model-00003-of-00009.safetensors",
189
+ "model.layers.8.mlp.down_proj.weight": "model-00003-of-00009.safetensors",
190
+ "model.layers.8.mlp.gate_up_proj.weight": "model-00003-of-00009.safetensors",
191
+ "model.layers.8.post_attention_layernorm.weight": "model-00003-of-00009.safetensors",
192
+ "model.layers.8.self_attn.o_proj.weight": "model-00003-of-00009.safetensors",
193
+ "model.layers.8.self_attn.qkv_proj.weight": "model-00003-of-00009.safetensors",
194
+ "model.layers.9.input_layernorm.weight": "model-00003-of-00009.safetensors",
195
+ "model.layers.9.mlp.down_proj.weight": "model-00003-of-00009.safetensors",
196
+ "model.layers.9.mlp.gate_up_proj.weight": "model-00003-of-00009.safetensors",
197
+ "model.layers.9.post_attention_layernorm.weight": "model-00003-of-00009.safetensors",
198
+ "model.layers.9.self_attn.o_proj.weight": "model-00003-of-00009.safetensors",
199
+ "model.layers.9.self_attn.qkv_proj.weight": "model-00003-of-00009.safetensors",
200
+ "model.mm_projector.0.bias": "model-00009-of-00009.safetensors",
201
+ "model.mm_projector.0.weight": "model-00009-of-00009.safetensors",
202
+ "model.mm_projector.2.bias": "model-00009-of-00009.safetensors",
203
+ "model.mm_projector.2.weight": "model-00009-of-00009.safetensors",
204
+ "model.norm.weight": "model-00008-of-00009.safetensors",
205
+ "model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.bias": "model-00008-of-00009.safetensors",
206
+ "model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00008-of-00009.safetensors",
207
+ "model.vision_tower.vision_tower.vision_model.embeddings.position_embedding.weight": "model-00008-of-00009.safetensors",
208
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.bias": "model-00008-of-00009.safetensors",
209
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.weight": "model-00008-of-00009.safetensors",
210
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm2.bias": "model-00008-of-00009.safetensors",
211
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm2.weight": "model-00008-of-00009.safetensors",
212
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.mlp.fc1.bias": "model-00008-of-00009.safetensors",
213
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight": "model-00008-of-00009.safetensors",
214
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.mlp.fc2.bias": "model-00008-of-00009.safetensors",
215
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.mlp.fc2.weight": "model-00008-of-00009.safetensors",
216
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
217
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
218
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
219
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
220
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
221
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
222
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
223
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
224
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.layer_norm1.bias": "model-00008-of-00009.safetensors",
225
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.layer_norm1.weight": "model-00008-of-00009.safetensors",
226
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.layer_norm2.bias": "model-00008-of-00009.safetensors",
227
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.layer_norm2.weight": "model-00008-of-00009.safetensors",
228
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.mlp.fc1.bias": "model-00008-of-00009.safetensors",
229
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.mlp.fc1.weight": "model-00008-of-00009.safetensors",
230
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.mlp.fc2.bias": "model-00008-of-00009.safetensors",
231
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.mlp.fc2.weight": "model-00008-of-00009.safetensors",
232
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
233
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
234
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
235
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
236
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
237
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
238
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
239
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
240
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.layer_norm1.bias": "model-00008-of-00009.safetensors",
241
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.layer_norm1.weight": "model-00008-of-00009.safetensors",
242
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.layer_norm2.bias": "model-00008-of-00009.safetensors",
243
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.layer_norm2.weight": "model-00008-of-00009.safetensors",
244
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias": "model-00008-of-00009.safetensors",
245
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.mlp.fc1.weight": "model-00008-of-00009.safetensors",
246
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.mlp.fc2.bias": "model-00008-of-00009.safetensors",
247
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.mlp.fc2.weight": "model-00008-of-00009.safetensors",
248
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
249
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
250
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
251
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
252
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
253
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
254
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
255
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
256
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.layer_norm1.bias": "model-00008-of-00009.safetensors",
257
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.layer_norm1.weight": "model-00008-of-00009.safetensors",
258
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.layer_norm2.bias": "model-00008-of-00009.safetensors",
259
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.layer_norm2.weight": "model-00008-of-00009.safetensors",
260
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.mlp.fc1.bias": "model-00008-of-00009.safetensors",
261
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.mlp.fc1.weight": "model-00008-of-00009.safetensors",
262
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.mlp.fc2.bias": "model-00008-of-00009.safetensors",
263
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.mlp.fc2.weight": "model-00008-of-00009.safetensors",
264
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
265
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
266
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
267
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
268
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
269
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
270
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
271
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
272
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.layer_norm1.bias": "model-00008-of-00009.safetensors",
273
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.layer_norm1.weight": "model-00008-of-00009.safetensors",
274
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.layer_norm2.bias": "model-00008-of-00009.safetensors",
275
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.layer_norm2.weight": "model-00008-of-00009.safetensors",
276
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.mlp.fc1.bias": "model-00008-of-00009.safetensors",
277
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.mlp.fc1.weight": "model-00008-of-00009.safetensors",
278
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.mlp.fc2.bias": "model-00008-of-00009.safetensors",
279
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.mlp.fc2.weight": "model-00008-of-00009.safetensors",
280
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
281
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
282
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
283
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
284
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
285
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
286
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
287
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
288
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.layer_norm1.bias": "model-00008-of-00009.safetensors",
289
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.layer_norm1.weight": "model-00008-of-00009.safetensors",
290
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.layer_norm2.bias": "model-00009-of-00009.safetensors",
291
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.layer_norm2.weight": "model-00009-of-00009.safetensors",
292
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc1.bias": "model-00008-of-00009.safetensors",
293
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc1.weight": "model-00008-of-00009.safetensors",
294
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc2.bias": "model-00009-of-00009.safetensors",
295
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc2.weight": "model-00009-of-00009.safetensors",
296
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
297
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
298
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
299
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
300
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
301
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
302
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
303
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
304
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.layer_norm1.bias": "model-00009-of-00009.safetensors",
305
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.layer_norm1.weight": "model-00009-of-00009.safetensors",
306
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.layer_norm2.bias": "model-00009-of-00009.safetensors",
307
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.layer_norm2.weight": "model-00009-of-00009.safetensors",
308
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.mlp.fc1.bias": "model-00009-of-00009.safetensors",
309
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.mlp.fc1.weight": "model-00009-of-00009.safetensors",
310
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.mlp.fc2.bias": "model-00009-of-00009.safetensors",
311
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.mlp.fc2.weight": "model-00009-of-00009.safetensors",
312
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
313
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
314
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
315
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
316
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
317
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
318
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
319
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
320
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.layer_norm1.bias": "model-00009-of-00009.safetensors",
321
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.layer_norm1.weight": "model-00009-of-00009.safetensors",
322
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.layer_norm2.bias": "model-00009-of-00009.safetensors",
323
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.layer_norm2.weight": "model-00009-of-00009.safetensors",
324
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.mlp.fc1.bias": "model-00009-of-00009.safetensors",
325
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.mlp.fc1.weight": "model-00009-of-00009.safetensors",
326
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.mlp.fc2.bias": "model-00009-of-00009.safetensors",
327
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.mlp.fc2.weight": "model-00009-of-00009.safetensors",
328
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
329
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
330
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
331
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
332
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
333
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
334
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
335
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
336
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.layer_norm1.bias": "model-00009-of-00009.safetensors",
337
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.layer_norm1.weight": "model-00009-of-00009.safetensors",
338
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.layer_norm2.bias": "model-00009-of-00009.safetensors",
339
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.layer_norm2.weight": "model-00009-of-00009.safetensors",
340
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.mlp.fc1.bias": "model-00009-of-00009.safetensors",
341
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.mlp.fc1.weight": "model-00009-of-00009.safetensors",
342
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.mlp.fc2.bias": "model-00009-of-00009.safetensors",
343
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.mlp.fc2.weight": "model-00009-of-00009.safetensors",
344
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
345
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
346
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
347
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
348
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
349
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
350
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
351
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
352
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm1.bias": "model-00009-of-00009.safetensors",
353
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm1.weight": "model-00009-of-00009.safetensors",
354
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm2.bias": "model-00009-of-00009.safetensors",
355
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.layer_norm2.weight": "model-00009-of-00009.safetensors",
356
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.mlp.fc1.bias": "model-00009-of-00009.safetensors",
357
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.mlp.fc1.weight": "model-00009-of-00009.safetensors",
358
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.mlp.fc2.bias": "model-00009-of-00009.safetensors",
359
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.mlp.fc2.weight": "model-00009-of-00009.safetensors",
360
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
361
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
362
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
363
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
364
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
365
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
366
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
367
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
368
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.layer_norm1.bias": "model-00009-of-00009.safetensors",
369
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.layer_norm1.weight": "model-00009-of-00009.safetensors",
370
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.layer_norm2.bias": "model-00009-of-00009.safetensors",
371
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.layer_norm2.weight": "model-00009-of-00009.safetensors",
372
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.mlp.fc1.bias": "model-00009-of-00009.safetensors",
373
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.mlp.fc1.weight": "model-00009-of-00009.safetensors",
374
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.mlp.fc2.bias": "model-00009-of-00009.safetensors",
375
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.mlp.fc2.weight": "model-00009-of-00009.safetensors",
376
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
377
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
378
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
379
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
380
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
381
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
382
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
383
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
384
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.layer_norm1.bias": "model-00009-of-00009.safetensors",
385
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.layer_norm1.weight": "model-00009-of-00009.safetensors",
386
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.layer_norm2.bias": "model-00009-of-00009.safetensors",
387
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.layer_norm2.weight": "model-00009-of-00009.safetensors",
388
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.mlp.fc1.bias": "model-00009-of-00009.safetensors",
389
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.mlp.fc1.weight": "model-00009-of-00009.safetensors",
390
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.mlp.fc2.bias": "model-00009-of-00009.safetensors",
391
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.mlp.fc2.weight": "model-00009-of-00009.safetensors",
392
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
393
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
394
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
395
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
396
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
397
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
398
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
399
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
400
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm1.bias": "model-00008-of-00009.safetensors",
401
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm1.weight": "model-00008-of-00009.safetensors",
402
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm2.bias": "model-00008-of-00009.safetensors",
403
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.layer_norm2.weight": "model-00008-of-00009.safetensors",
404
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.mlp.fc1.bias": "model-00008-of-00009.safetensors",
405
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.mlp.fc1.weight": "model-00008-of-00009.safetensors",
406
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.mlp.fc2.bias": "model-00008-of-00009.safetensors",
407
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.mlp.fc2.weight": "model-00008-of-00009.safetensors",
408
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
409
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
410
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
411
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
412
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
413
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
414
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
415
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
416
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.layer_norm1.bias": "model-00009-of-00009.safetensors",
417
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.layer_norm1.weight": "model-00009-of-00009.safetensors",
418
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.layer_norm2.bias": "model-00009-of-00009.safetensors",
419
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.layer_norm2.weight": "model-00009-of-00009.safetensors",
420
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.mlp.fc1.bias": "model-00009-of-00009.safetensors",
421
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.mlp.fc1.weight": "model-00009-of-00009.safetensors",
422
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.mlp.fc2.bias": "model-00009-of-00009.safetensors",
423
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.mlp.fc2.weight": "model-00009-of-00009.safetensors",
424
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
425
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
426
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
427
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
428
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
429
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
430
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
431
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
432
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.layer_norm1.bias": "model-00009-of-00009.safetensors",
433
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.layer_norm1.weight": "model-00009-of-00009.safetensors",
434
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.layer_norm2.bias": "model-00009-of-00009.safetensors",
435
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.layer_norm2.weight": "model-00009-of-00009.safetensors",
436
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.mlp.fc1.bias": "model-00009-of-00009.safetensors",
437
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.mlp.fc1.weight": "model-00009-of-00009.safetensors",
438
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.mlp.fc2.bias": "model-00009-of-00009.safetensors",
439
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.mlp.fc2.weight": "model-00009-of-00009.safetensors",
440
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
441
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
442
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
443
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
444
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
445
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
446
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
447
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
448
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.layer_norm1.bias": "model-00009-of-00009.safetensors",
449
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.layer_norm1.weight": "model-00009-of-00009.safetensors",
450
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.layer_norm2.bias": "model-00009-of-00009.safetensors",
451
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.layer_norm2.weight": "model-00009-of-00009.safetensors",
452
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.mlp.fc1.bias": "model-00009-of-00009.safetensors",
453
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.mlp.fc1.weight": "model-00009-of-00009.safetensors",
454
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.mlp.fc2.bias": "model-00009-of-00009.safetensors",
455
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.mlp.fc2.weight": "model-00009-of-00009.safetensors",
456
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
457
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
458
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
459
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
460
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
461
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
462
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
463
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
464
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.layer_norm1.bias": "model-00009-of-00009.safetensors",
465
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.layer_norm1.weight": "model-00009-of-00009.safetensors",
466
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.layer_norm2.bias": "model-00009-of-00009.safetensors",
467
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.layer_norm2.weight": "model-00009-of-00009.safetensors",
468
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc1.bias": "model-00009-of-00009.safetensors",
469
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc1.weight": "model-00009-of-00009.safetensors",
470
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc2.bias": "model-00009-of-00009.safetensors",
471
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc2.weight": "model-00009-of-00009.safetensors",
472
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
473
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
474
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
475
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
476
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
477
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
478
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
479
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
480
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.layer_norm1.bias": "model-00009-of-00009.safetensors",
481
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.layer_norm1.weight": "model-00009-of-00009.safetensors",
482
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.layer_norm2.bias": "model-00009-of-00009.safetensors",
483
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.layer_norm2.weight": "model-00009-of-00009.safetensors",
484
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.mlp.fc1.bias": "model-00009-of-00009.safetensors",
485
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.mlp.fc1.weight": "model-00009-of-00009.safetensors",
486
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.mlp.fc2.bias": "model-00009-of-00009.safetensors",
487
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.mlp.fc2.weight": "model-00009-of-00009.safetensors",
488
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
489
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
490
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
491
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
492
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
493
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
494
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
495
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
496
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.layer_norm1.bias": "model-00009-of-00009.safetensors",
497
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.layer_norm1.weight": "model-00009-of-00009.safetensors",
498
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.layer_norm2.bias": "model-00009-of-00009.safetensors",
499
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.layer_norm2.weight": "model-00009-of-00009.safetensors",
500
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.mlp.fc1.bias": "model-00009-of-00009.safetensors",
501
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.mlp.fc1.weight": "model-00009-of-00009.safetensors",
502
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.mlp.fc2.bias": "model-00009-of-00009.safetensors",
503
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.mlp.fc2.weight": "model-00009-of-00009.safetensors",
504
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.bias": "model-00009-of-00009.safetensors",
505
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.weight": "model-00009-of-00009.safetensors",
506
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.bias": "model-00009-of-00009.safetensors",
507
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight": "model-00009-of-00009.safetensors",
508
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.bias": "model-00009-of-00009.safetensors",
509
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.weight": "model-00009-of-00009.safetensors",
510
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.bias": "model-00009-of-00009.safetensors",
511
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.weight": "model-00009-of-00009.safetensors",
512
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.layer_norm1.bias": "model-00008-of-00009.safetensors",
513
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.layer_norm1.weight": "model-00008-of-00009.safetensors",
514
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.layer_norm2.bias": "model-00008-of-00009.safetensors",
515
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.layer_norm2.weight": "model-00008-of-00009.safetensors",
516
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.mlp.fc1.bias": "model-00008-of-00009.safetensors",
517
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.mlp.fc1.weight": "model-00008-of-00009.safetensors",
518
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.mlp.fc2.bias": "model-00008-of-00009.safetensors",
519
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.mlp.fc2.weight": "model-00008-of-00009.safetensors",
520
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
521
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
522
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
523
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
524
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
525
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
526
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
527
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
528
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm1.bias": "model-00008-of-00009.safetensors",
529
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm1.weight": "model-00008-of-00009.safetensors",
530
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm2.bias": "model-00008-of-00009.safetensors",
531
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm2.weight": "model-00008-of-00009.safetensors",
532
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias": "model-00008-of-00009.safetensors",
533
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc1.weight": "model-00008-of-00009.safetensors",
534
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc2.bias": "model-00008-of-00009.safetensors",
535
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc2.weight": "model-00008-of-00009.safetensors",
536
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
537
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
538
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
539
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
540
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
541
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
542
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
543
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
544
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm1.bias": "model-00008-of-00009.safetensors",
545
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm1.weight": "model-00008-of-00009.safetensors",
546
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm2.bias": "model-00008-of-00009.safetensors",
547
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm2.weight": "model-00008-of-00009.safetensors",
548
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc1.bias": "model-00008-of-00009.safetensors",
549
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc1.weight": "model-00008-of-00009.safetensors",
550
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc2.bias": "model-00008-of-00009.safetensors",
551
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc2.weight": "model-00008-of-00009.safetensors",
552
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
553
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
554
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
555
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
556
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
557
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
558
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
559
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
560
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm1.bias": "model-00008-of-00009.safetensors",
561
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm1.weight": "model-00008-of-00009.safetensors",
562
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm2.bias": "model-00008-of-00009.safetensors",
563
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm2.weight": "model-00008-of-00009.safetensors",
564
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc1.bias": "model-00008-of-00009.safetensors",
565
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc1.weight": "model-00008-of-00009.safetensors",
566
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc2.bias": "model-00008-of-00009.safetensors",
567
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc2.weight": "model-00008-of-00009.safetensors",
568
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
569
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
570
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
571
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
572
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
573
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
574
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
575
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
576
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm1.bias": "model-00008-of-00009.safetensors",
577
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm1.weight": "model-00008-of-00009.safetensors",
578
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm2.bias": "model-00008-of-00009.safetensors",
579
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm2.weight": "model-00008-of-00009.safetensors",
580
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc1.bias": "model-00008-of-00009.safetensors",
581
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc1.weight": "model-00008-of-00009.safetensors",
582
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc2.bias": "model-00008-of-00009.safetensors",
583
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc2.weight": "model-00008-of-00009.safetensors",
584
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
585
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
586
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
587
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
588
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
589
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
590
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
591
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
592
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm1.bias": "model-00008-of-00009.safetensors",
593
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm1.weight": "model-00008-of-00009.safetensors",
594
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm2.bias": "model-00008-of-00009.safetensors",
595
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm2.weight": "model-00008-of-00009.safetensors",
596
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc1.bias": "model-00008-of-00009.safetensors",
597
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc1.weight": "model-00008-of-00009.safetensors",
598
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc2.bias": "model-00008-of-00009.safetensors",
599
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc2.weight": "model-00008-of-00009.safetensors",
600
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
601
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
602
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
603
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
604
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
605
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
606
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
607
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
608
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm1.bias": "model-00008-of-00009.safetensors",
609
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm1.weight": "model-00008-of-00009.safetensors",
610
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm2.bias": "model-00008-of-00009.safetensors",
611
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm2.weight": "model-00008-of-00009.safetensors",
612
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc1.bias": "model-00008-of-00009.safetensors",
613
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc1.weight": "model-00008-of-00009.safetensors",
614
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc2.bias": "model-00008-of-00009.safetensors",
615
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc2.weight": "model-00008-of-00009.safetensors",
616
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.bias": "model-00008-of-00009.safetensors",
617
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.weight": "model-00008-of-00009.safetensors",
618
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.bias": "model-00008-of-00009.safetensors",
619
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.weight": "model-00008-of-00009.safetensors",
620
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias": "model-00008-of-00009.safetensors",
621
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight": "model-00008-of-00009.safetensors",
622
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias": "model-00008-of-00009.safetensors",
623
+ "model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00008-of-00009.safetensors",
624
+ "model.vision_tower.vision_tower.vision_model.post_layernorm.bias": "model-00009-of-00009.safetensors",
625
+ "model.vision_tower.vision_tower.vision_model.post_layernorm.weight": "model-00009-of-00009.safetensors"
626
+ }
627
+ }
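The weight_map above is the full index for the sharded checkpoint: every language-model layer, the mm_projector, and the vision_tower parameters are assigned to one of the nine model-*-of-00009.safetensors shard files. As a minimal sketch (not part of the commit), the index can be inspected with the standard json module and the safetensors API, assuming the shards are downloaded locally and the index is saved under its usual name, model.safetensors.index.json:

import json
from collections import Counter
from safetensors import safe_open

# Parse the shard index and count how many tensors each shard file holds.
with open("model.safetensors.index.json") as f:
    index = json.load(f)
weight_map = index["weight_map"]              # parameter name -> shard file
print(Counter(weight_map.values()))

# Lazily load a single tensor (e.g. the first mm_projector layer) from its shard.
name = "model.mm_projector.0.weight"
with safe_open(weight_map[name], framework="pt", device="cpu") as shard:
    print(name, tuple(shard.get_tensor(name).shape))

When the checkpoint is loaded through transformers, this index is resolved automatically; the sketch only shows what the mapping encodes.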
modeling_imp.py ADDED
@@ -0,0 +1,1521 @@
1
+ # Copyright (c) MILVLG team.
2
+ # Licensed under the Apache 2.0 license.
3
+ #
4
+ # Some code here is copied from the project Phi-2 (https://huggingface.co/microsoft/phi-2),
5
+ # SigLIP@transformers==4.37.0.dev0 (https://huggingface.co/google/siglip-so400m-patch14-384),
6
+ # and Llava (https://github.com/haotian-liu/LLaVA), and modified by
7
+ # Zhenwei Shao (shaozw@hdu.edu.cn) @ MILVLG. We thank them for their great work.
8
+ # Their original licenses and copyrights should be inherited (see the statements
9
+ # in `configuration_imp.py` for more details).
10
+
11
+
12
+ # Be careful: the way `past_key_values.seqlen_offset` is updated differs from
13
+ # the original Phi-2 implementation. See the comments below for details.
14
+
15
+ from __future__ import annotations
16
+ import os
17
+ import math
18
+ import re
19
+ from dataclasses import dataclass, field
20
+ from typing import Any, Dict, Optional, Tuple, Union, List
21
+ from abc import ABC, abstractmethod
22
+
23
+ import torch
24
+ import torch.nn as nn
25
+ from einops import rearrange, repeat
26
+ from transformers.cache_utils import Cache, DynamicCache
27
+ from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
28
+ from transformers import (
29
+ PretrainedConfig,
30
+ PreTrainedModel,
31
+ AutoConfig,
32
+ AutoModelForCausalLM
33
+ )
34
+ from transformers.modeling_utils import PreTrainedModel
35
+ from transformers.activations import ACT2FN
36
+ from transformers.modeling_outputs import (
37
+ BaseModelOutputWithPast,
38
+ CausalLMOutputWithPast,
39
+ SequenceClassifierOutputWithPast,
40
+ TokenClassifierOutput,
41
+ )
42
+ import sys
43
+ from .configuration_imp import Phi3Config, ImpPhi3Config
44
+ from .vision_encoder import VisionTower
45
+ # from .vision_encoder import CLIPVisionTower
46
+
47
+ try:
+     from flash_attn import flash_attn_func, flash_attn_varlen_func
+     from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input
+     from flash_attn.layers.rotary import RotaryEmbedding as FlashRotaryEmbedding
+     from flash_attn.modules.mha import FlashCrossAttention, FlashSelfAttention
+     from flash_attn.ops.fused_dense import FusedDense
+     # Mirror upstream transformers: sliding-window attention is only exposed from flash-attn v2.3 on.
+     _flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
+ except ImportError:
+     flash_attn_func, flash_attn_varlen_func = None, None
+     index_first_axis, pad_input, unpad_input = None, None, None
+     FlashRotaryEmbedding = None
+     FlashSelfAttention, FlashCrossAttention = None, None
+     FusedDense = None
+     _flash_supports_window_size = False
57
+
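+ # Editor's note: `Phi3FlashAttention2._upad_input` further below calls `_get_unpad_data`, which is
+ # neither defined nor imported in this file. The following is a minimal sketch of that helper,
+ # matching the version used in upstream transformers modeling code.
+ def _get_unpad_data(attention_mask):
+     # attention_mask: [batch, seq_len] with 1 for real tokens and 0 for padding
+     seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+     indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+     max_seqlen_in_batch = seqlens_in_batch.max().item()
+     # cumulative sequence lengths, prepended with 0, as expected by flash_attn_varlen_func
+     cu_seqlens = nn.functional.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+     return indices, cu_seqlens, max_seqlen_in_batch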
58
+
59
+ @dataclass
60
+ class InferenceParams:
61
+ """Inference parameters passed to model to efficiently calculate
62
+ and store context during inference.
63
+
64
+ Reference:
65
+ https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/utils/generation.py.
66
+
67
+ Args:
68
+ max_seqlen: Maximum sequence length.
69
+ max_batch_size: Maximum batch size.
70
+ seqlen_offset: Sequence length offset.
71
+ batch_size_offset: Batch size offset.
72
+ key_value_memory_dict: Key value memory dictionary.
73
+ lengths_per_sample: Lengths per sample.
74
+
75
+ """
76
+
77
+ max_seqlen: int = field(metadata={"help": "Maximum sequence length."})
78
+
79
+ max_batch_size: int = field(metadata={"help": "Maximum batch size."})
80
+
81
+ seqlen_offset: int = field(default=0, metadata={"help": "Sequence length offset."})
82
+
83
+ batch_size_offset: int = field(default=0, metadata={"help": "Batch size offset."})
84
+
85
+ key_value_memory_dict: Dict[str, Any] = field(
86
+ default_factory=dict, metadata={"help": "Key value memory dictionary."}
87
+ )
88
+
89
+ lengths_per_sample: torch.Tensor = field(default=None, metadata={"help": "Lengths per sample."})
90
+
91
+
92
+
93
+
94
+ # Copied from transformers.models.gemma.modeling_gemma.GemmaRotaryEmbedding with gemma->phi3, Gemma->Phi3
95
+ class Phi3RotaryEmbedding(nn.Module):
96
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
97
+ super().__init__()
98
+
99
+ self.dim = dim
100
+ self.max_position_embeddings = max_position_embeddings
101
+ self.base = base
102
+ self.register_buffer("inv_freq", None, persistent=False)
103
+
104
+ @torch.no_grad()
105
+ def forward(self, x, position_ids, seq_len=None):
106
+ # x: [bs, num_attention_heads, seq_len, head_size]
107
+ if self.inv_freq is None:
108
+ self.inv_freq = 1.0 / (
109
+ self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
110
+ )
111
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
112
+ position_ids_expanded = position_ids[:, None, :].float()
113
+ # Force float32 since bfloat16 loses precision on long contexts
114
+ # See https://github.com/huggingface/transformers/pull/29285
115
+ device_type = x.device.type
116
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
117
+ with torch.autocast(device_type=device_type, enabled=False):
118
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
119
+ emb = torch.cat((freqs, freqs), dim=-1)
120
+ cos = emb.cos()
121
+ sin = emb.sin()
122
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
123
+
124
+
125
+ class Phi3SuScaledRotaryEmbedding(Phi3RotaryEmbedding):
126
+ def __init__(self, dim, config, device=None):
127
+ super().__init__(dim, config.max_position_embeddings, config.rope_theta, device)
128
+
129
+ self.short_factor = config.rope_scaling["short_factor"]
130
+ self.long_factor = config.rope_scaling["long_factor"]
131
+ self.original_max_position_embeddings = config.original_max_position_embeddings
132
+
133
+ @torch.no_grad()
134
+ def forward(self, x, position_ids, seq_len=None):
135
+ seq_len = torch.max(position_ids) + 1
136
+ if seq_len > self.original_max_position_embeddings:
137
+ ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=x.device)
138
+ else:
139
+ ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
140
+
141
+ inv_freq_shape = torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim
142
+ self.inv_freq = 1.0 / (ext_factors * self.base**inv_freq_shape)
143
+
144
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
145
+ position_ids_expanded = position_ids[:, None, :].float()
146
+
147
+ # Force float32 since bfloat16 loses precision on long contexts
148
+ # See https://github.com/huggingface/transformers/pull/29285
149
+ device_type = x.device.type
150
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
151
+ with torch.autocast(device_type=device_type, enabled=False):
152
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
153
+ emb = torch.cat((freqs, freqs), dim=-1)
154
+
155
+ scale = self.max_position_embeddings / self.original_max_position_embeddings
156
+ if scale <= 1.0:
157
+ scaling_factor = 1.0
158
+ else:
159
+ scaling_factor = math.sqrt(1 + math.log(scale) / math.log(self.original_max_position_embeddings))
160
+
161
+ cos = emb.cos() * scaling_factor
162
+ sin = emb.sin() * scaling_factor
163
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
164
+
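+ # Editor's note (restating the code above, no new behaviour): for the "su" variant, once the
+ # extended context exceeds `original_max_position_embeddings`, cos/sin are additionally scaled by
+ #   sqrt(1 + log(max_position_embeddings / original_max_position_embeddings)
+ #            / log(original_max_position_embeddings)),
+ # whereas the "yarn" variant below uses 0.1 * log(scale) + 1.0 as the scaling factor.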
165
+
166
+ class Phi3YarnScaledRotaryEmbedding(Phi3RotaryEmbedding):
167
+ def __init__(self, dim, config, device=None):
168
+ super().__init__(dim, config.max_position_embeddings, config.rope_theta, device)
169
+
170
+ self.short_factor = config.rope_scaling["short_factor"]
171
+ self.long_factor = config.rope_scaling["long_factor"]
172
+ self.original_max_position_embeddings = config.original_max_position_embeddings
173
+
174
+ @torch.no_grad()
175
+ def forward(self, x, position_ids, seq_len=None):
176
+ seq_len = torch.max(position_ids) + 1
177
+ if seq_len > self.original_max_position_embeddings:
178
+ ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=x.device)
179
+ else:
180
+ ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
181
+
182
+ inv_freq_shape = torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim
183
+ self.inv_freq = 1.0 / (ext_factors * self.base**inv_freq_shape)
184
+
185
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
186
+ position_ids_expanded = position_ids[:, None, :].float()
187
+
188
+ # Force float32 since bfloat16 loses precision on long contexts
189
+ # See https://github.com/huggingface/transformers/pull/29285
190
+ device_type = x.device.type
191
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
192
+ with torch.autocast(device_type=device_type, enabled=False):
193
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
194
+ emb = torch.cat((freqs, freqs), dim=-1)
195
+
196
+ scale = self.max_position_embeddings / self.original_max_position_embeddings
197
+ if scale <= 1.0:
198
+ scaling_factor = 1.0
199
+ else:
200
+ scaling_factor = 0.1 * math.log(scale) + 1.0
201
+
202
+ cos = emb.cos() * scaling_factor
203
+ sin = emb.sin() * scaling_factor
204
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
205
+
206
+
207
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
208
+ def rotate_half(x):
209
+ """Rotates half the hidden dims of the input."""
210
+ x1 = x[..., : x.shape[-1] // 2]
211
+ x2 = x[..., x.shape[-1] // 2 :]
212
+ return torch.cat((-x2, x1), dim=-1)
213
+
214
+
215
+ # Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
216
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
217
+ """Applies Rotary Position Embedding to the query and key tensors.
218
+
219
+ Args:
220
+ q (`torch.Tensor`): The query tensor.
221
+ k (`torch.Tensor`): The key tensor.
222
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
223
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
224
+ position_ids (`torch.Tensor`, *optional*):
225
+ Deprecated and unused.
226
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
227
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
228
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
229
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
230
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
231
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
232
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
233
+ Returns:
234
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
235
+ """
236
+ cos = cos.unsqueeze(unsqueeze_dim)
237
+ sin = sin.unsqueeze(unsqueeze_dim)
238
+ q_embed = (q * cos) + (rotate_half(q) * sin)
239
+ k_embed = (k * cos) + (rotate_half(k) * sin)
240
+ return q_embed, k_embed
241
+
242
+
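+ # Editor's note, an illustrative sketch (not part of the original file) of how the rotary helpers
+ # above fit together; the shapes and head counts are assumptions chosen for illustration only.
+ #
+ #   rotary = Phi3RotaryEmbedding(dim=96, max_position_embeddings=4096)
+ #   q = torch.randn(1, 32, 16, 96)           # [batch, num_heads, seq_len, head_dim]
+ #   k = torch.randn(1, 8, 16, 96)            # [batch, num_kv_heads, seq_len, head_dim]
+ #   pos = torch.arange(16).unsqueeze(0)      # [batch, seq_len]
+ #   cos, sin = rotary(q, pos)                # each of shape [batch, seq_len, head_dim]
+ #   q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)   # same shapes as q and k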
243
+
244
+ class Phi3MLP(nn.Module):
245
+ def __init__(self, config):
246
+ super().__init__()
247
+
248
+ self.config = config
249
+ self.gate_up_proj = nn.Linear(config.hidden_size, 2 * config.intermediate_size, bias=False)
250
+ self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
251
+
252
+ self.activation_fn = ACT2FN[config.hidden_act]
253
+
254
+ def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
255
+ up_states = self.gate_up_proj(hidden_states)
256
+
257
+ gate, up_states = up_states.chunk(2, dim=-1)
258
+ up_states = up_states * self.activation_fn(gate)
259
+
260
+ return self.down_proj(up_states)
261
+
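+ # Editor's note (restating the forward above): `gate_up_proj` fuses the gate and up projections
+ # of a gated MLP into a single matmul, so the block computes
+ #   down_proj( activation(gate) * up ),   with   gate, up = gate_up_proj(x).chunk(2, dim=-1)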
262
+ class Phi3RMSNorm(nn.Module):
263
+ def __init__(self, hidden_size, eps=1e-6):
264
+ """
265
+ Phi3RMSNorm is equivalent to T5LayerNorm
266
+ """
267
+ super().__init__()
268
+ self.weight = nn.Parameter(torch.ones(hidden_size))
269
+ self.variance_epsilon = eps
270
+
271
+ def forward(self, hidden_states):
272
+ input_dtype = hidden_states.dtype
273
+ hidden_states = hidden_states.to(torch.float32)
274
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
275
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
276
+ return self.weight * hidden_states.to(input_dtype)
277
+
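+ # Editor's note (restating the forward above): RMSNorm rescales each hidden vector by the
+ # reciprocal of its root-mean-square, y = weight * x / sqrt(mean(x**2) + eps), i.e. LayerNorm
+ # without mean subtraction or bias, with the statistics computed in float32 for stability.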
278
+
279
+
280
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv with llama->phi
281
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
282
+ """
283
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
284
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
285
+ """
286
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
287
+ if n_rep == 1:
288
+ return hidden_states
289
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
290
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
291
+
292
+
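+ # Editor's note, illustrative only: with 32 query heads and 8 key/value heads, n_rep = 4, so
+ #
+ #   k = torch.randn(2, 8, 16, 96)            # [batch, num_kv_heads, seq_len, head_dim]
+ #   repeat_kv(k, 4).shape                    # torch.Size([2, 32, 16, 96])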
293
+
294
+ class Phi3Attention(nn.Module):
295
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
296
+
297
+ def __init__(self, config: Phi3Config, layer_idx: Optional[int] = None):
298
+ super().__init__()
299
+ self.config = config
300
+ self.layer_idx = layer_idx
301
+ if layer_idx is None:
302
+ # logger.warning_once(
303
+ # f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
304
+ # "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
305
+ # "when creating this class."
306
+ # )
307
+ pass
308
+
309
+ self.attention_dropout = config.attention_dropout
310
+ self.hidden_size = config.hidden_size
311
+ self.num_heads = config.num_attention_heads
312
+ self.head_dim = self.hidden_size // self.num_heads
313
+ self.num_key_value_heads = config.num_key_value_heads
314
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
315
+ self.max_position_embeddings = config.max_position_embeddings
316
+ self.original_max_position_embeddings = config.original_max_position_embeddings
317
+ self.rope_theta = config.rope_theta
318
+ self.rope_scaling = config.rope_scaling
319
+ self.is_causal = True
320
+
321
+ if (self.head_dim * self.num_heads) != self.hidden_size:
322
+ raise ValueError(
323
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
324
+ f" and `num_heads`: {self.num_heads})."
325
+ )
326
+
327
+ op_size = self.num_heads * self.head_dim + 2 * (self.num_key_value_heads * self.head_dim)
328
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
329
+ self.qkv_proj = nn.Linear(self.hidden_size, op_size, bias=False)
330
+ self._init_rope()
331
+
332
+ def _init_rope(self):
333
+ if self.rope_scaling is None:
334
+ self.rotary_emb = Phi3RotaryEmbedding(
335
+ self.head_dim,
336
+ max_position_embeddings=self.max_position_embeddings,
337
+ base=self.rope_theta,
338
+ )
339
+ else:
340
+ scaling_type = self.config.rope_scaling["type"]
341
+ if scaling_type == "su":
342
+ self.rotary_emb = Phi3SuScaledRotaryEmbedding(self.head_dim, self.config)
343
+ elif scaling_type == "yarn":
344
+ self.rotary_emb = Phi3YarnScaledRotaryEmbedding(self.head_dim, self.config)
345
+ else:
346
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
347
+
348
+ def forward(
349
+ self,
350
+ hidden_states: torch.Tensor,
351
+ attention_mask: Optional[torch.Tensor] = None,
352
+ position_ids: Optional[torch.LongTensor] = None,
353
+ past_key_value: Optional[Cache] = None,
354
+ output_attentions: bool = False,
355
+ use_cache: bool = False,
356
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
357
+ # logger.warning_once("You are not running the flash-attention implementation, expect numerical differences.")
358
+
359
+ bsz, q_len, _ = hidden_states.size()
360
+
361
+ qkv = self.qkv_proj(hidden_states)
362
+ query_pos = self.num_heads * self.head_dim
363
+ query_states = qkv[..., :query_pos]
364
+ key_states = qkv[..., query_pos : query_pos + self.num_key_value_heads * self.head_dim]
365
+ value_states = qkv[..., query_pos + self.num_key_value_heads * self.head_dim :]
366
+
367
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
368
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
369
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
370
+
371
+ kv_seq_len = key_states.shape[-2]
372
+ if past_key_value is not None:
373
+ if self.layer_idx is None:
374
+ raise ValueError(
375
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
376
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
377
+ "with a layer index."
378
+ )
379
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
380
+ cos, sin = self.rotary_emb(value_states, position_ids, seq_len=kv_seq_len)
381
+
382
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
383
+
384
+ if past_key_value is not None:
385
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
386
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
387
+
388
+ # repeat k/v heads if n_kv_heads < n_heads
389
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
390
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
391
+
392
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
393
+
394
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
395
+ raise ValueError(
396
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
397
+ f" {attn_weights.size()}"
398
+ )
399
+
400
+ if attention_mask is not None:
401
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
402
+ raise ValueError(
403
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
404
+ )
405
+ attn_weights = attn_weights + attention_mask
406
+
407
+ # upcast attention to fp32
408
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(value_states.dtype)
409
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
410
+
411
+ attn_output = torch.matmul(attn_weights, value_states)
412
+
413
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
414
+ raise ValueError(
415
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
416
+ f" {attn_output.size()}"
417
+ )
418
+
419
+ attn_output = attn_output.transpose(1, 2).contiguous()
420
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
421
+
422
+ attn_output = self.o_proj(attn_output)
423
+
424
+ if not output_attentions:
425
+ attn_weights = None
426
+
427
+ return attn_output, attn_weights, past_key_value
428
+
429
+
430
+ class Phi3FlashAttention2(Phi3Attention):
431
+ """
432
+ Phi-3 flash attention module. This module inherits from `Phi3Attention` as the weights of the module stay
433
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
434
+ flash attention and deal with padding tokens in case the input contains any of them.
435
+ """
436
+
437
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
438
+ def __init__(self, *args, **kwargs):
439
+ super().__init__(*args, **kwargs)
440
+
441
+ # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
442
+ # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
443
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
444
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
445
+
446
+ def forward(
447
+ self,
448
+ hidden_states: torch.Tensor,
449
+ attention_mask: Optional[torch.LongTensor] = None,
450
+ position_ids: Optional[torch.LongTensor] = None,
451
+ past_key_value: Optional[Cache] = None,
452
+ output_attentions: bool = False,
453
+ use_cache: bool = False,
454
+ **kwargs,
455
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
456
+ # Phi3FlashAttention2 attention does not support output_attentions
457
+
458
+ if not _flash_supports_window_size:
459
+ # logger.warning_once(
460
+ # "The current flash attention version does not support sliding window attention. Please use `attn_implementation='eager'` or upgrade flash-attn library."
461
+ # )
462
+ raise ValueError("The current flash attention version does not support sliding window attention.")
463
+
464
+ output_attentions = False
465
+
466
+ if "padding_mask" in kwargs:
467
+ warnings.warn(
468
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
469
+ )
470
+
471
+ # overwrite attention_mask with padding_mask
472
+ attention_mask = kwargs.pop("padding_mask")
473
+
474
+ bsz, q_len, _ = hidden_states.size()
475
+
476
+ qkv = self.qkv_proj(hidden_states)
477
+ query_pos = self.num_heads * self.head_dim
478
+ query_states = qkv[..., :query_pos]
479
+ key_states = qkv[..., query_pos : query_pos + self.num_key_value_heads * self.head_dim]
480
+ value_states = qkv[..., query_pos + self.num_key_value_heads * self.head_dim :]
481
+
482
+ # Flash attention requires the input to have the shape
483
+ # batch_size x seq_length x num_heads x head_dim
484
+ # therefore we just need to keep the original shape
485
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
486
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
487
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
488
+
489
+ kv_seq_len = key_states.shape[-2]
490
+ if past_key_value is not None:
491
+ if self.layer_idx is None:
492
+ raise ValueError(
493
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
494
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
495
+ "with a layer index."
496
+ )
497
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
498
+
499
+ # Because the input can be padded, the absolute sequence length depends on the max position id.
500
+ rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
501
+ cos, sin = self.rotary_emb(value_states, position_ids, seq_len=rotary_seq_len)
502
+
503
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
504
+
505
+ use_sliding_windows = (
506
+ _flash_supports_window_size
507
+ and getattr(self.config, "sliding_window", None) is not None
508
+ and kv_seq_len > self.config.sliding_window
509
+ )
510
+
511
+ if past_key_value is not None:
512
+ # Activate cache slicing only if the config has a `sliding_window` attribute
513
+ cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
514
+ if (
515
+ getattr(self.config, "sliding_window", None) is not None
516
+ and kv_seq_len > self.config.sliding_window
517
+ and cache_has_contents
518
+ ):
519
+ slicing_tokens = 1 - self.config.sliding_window
520
+
521
+ past_key = past_key_value[self.layer_idx][0]
522
+ past_value = past_key_value[self.layer_idx][1]
523
+
524
+ past_key = past_key[:, :, slicing_tokens:, :].contiguous()
525
+ past_value = past_value[:, :, slicing_tokens:, :].contiguous()
526
+
527
+ if past_key.shape[-2] != self.config.sliding_window - 1:
528
+ raise ValueError(
529
+ f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
530
+ f" {past_key.shape}"
531
+ )
532
+
533
+ if attention_mask is not None:
534
+ attention_mask = attention_mask[:, slicing_tokens:]
535
+ attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
536
+
537
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
538
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
539
+
540
+ # repeat k/v heads if n_kv_heads < n_heads
541
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
542
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
543
+
544
+ attn_dropout = self.attention_dropout if self.training else 0.0
545
+
546
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
547
+ # therefore the input hidden states get silently cast to float32. Hence, we need to
549
+ # cast them back to the correct dtype just to be sure everything works as expected.
550
+ # This might slow down training & inference, so it is recommended not to cast the LayerNorms
550
+ # in fp32.
551
+
552
+ if query_states.dtype == torch.float32:
553
+ if torch.is_autocast_enabled():
554
+ target_dtype = torch.get_autocast_gpu_dtype()
555
+ # Handle the case where the model is quantized
556
+ elif hasattr(self.config, "_pre_quantization_dtype"):
557
+ target_dtype = self.config._pre_quantization_dtype
558
+ else:
559
+ target_dtype = self.qkv_proj.weight.dtype
560
+
561
+ # logger.warning_once(
562
+ # f"The input hidden states seems to be silently casted in float32, this might be related to"
563
+ # f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
564
+ # f" {target_dtype}."
565
+ # )
566
+
567
+ query_states = query_states.to(target_dtype)
568
+ key_states = key_states.to(target_dtype)
569
+ value_states = value_states.to(target_dtype)
570
+
571
+ # Reshape to the expected shape for Flash Attention
572
+ query_states = query_states.transpose(1, 2)
573
+ key_states = key_states.transpose(1, 2)
574
+ value_states = value_states.transpose(1, 2)
575
+
576
+ attn_output = self._flash_attention_forward(
577
+ query_states,
578
+ key_states,
579
+ value_states,
580
+ attention_mask,
581
+ q_len,
582
+ dropout=attn_dropout,
583
+ use_sliding_windows=use_sliding_windows,
584
+ )
585
+
586
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
587
+ attn_output = self.o_proj(attn_output)
588
+
589
+ if not output_attentions:
590
+ attn_weights = None
591
+
592
+ return attn_output, attn_weights, past_key_value
593
+
594
+ # Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2._flash_attention_forward
595
+ def _flash_attention_forward(
596
+ self,
597
+ query_states,
598
+ key_states,
599
+ value_states,
600
+ attention_mask,
601
+ query_length,
602
+ dropout=0.0,
603
+ softmax_scale=None,
604
+ use_sliding_windows=False,
605
+ ):
606
+ """
607
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token,
608
+ first unpad the input, then compute the attention scores, and finally pad the attention scores back.
609
+
610
+ Args:
611
+ query_states (`torch.Tensor`):
612
+ Input query states to be passed to Flash Attention API
613
+ key_states (`torch.Tensor`):
614
+ Input key states to be passed to Flash Attention API
615
+ value_states (`torch.Tensor`):
616
+ Input value states to be passed to Flash Attention API
617
+ attention_mask (`torch.Tensor`):
618
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
619
+ position of padding tokens and 1 for the position of non-padding tokens.
620
+ dropout (`float`):
621
+ Attention dropout
622
+ softmax_scale (`float`, *optional*):
623
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
624
+ use_sliding_windows (`bool`, *optional*):
625
+ Whether to activate sliding window attention.
626
+ """
627
+ if not self._flash_attn_uses_top_left_mask:
628
+ causal = self.is_causal
629
+ else:
630
+ # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
631
+ causal = self.is_causal and query_length != 1
632
+
633
+ # Contains at least one padding token in the sequence
634
+ if attention_mask is not None:
635
+ batch_size = query_states.shape[0]
636
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
637
+ query_states, key_states, value_states, attention_mask, query_length
638
+ )
639
+
640
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
641
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
642
+
643
+ if not use_sliding_windows:
644
+ attn_output_unpad = flash_attn_varlen_func(
645
+ query_states,
646
+ key_states,
647
+ value_states,
648
+ cu_seqlens_q=cu_seqlens_q,
649
+ cu_seqlens_k=cu_seqlens_k,
650
+ max_seqlen_q=max_seqlen_in_batch_q,
651
+ max_seqlen_k=max_seqlen_in_batch_k,
652
+ dropout_p=dropout,
653
+ softmax_scale=softmax_scale,
654
+ causal=causal,
655
+ )
656
+ else:
657
+ attn_output_unpad = flash_attn_varlen_func(
658
+ query_states,
659
+ key_states,
660
+ value_states,
661
+ cu_seqlens_q=cu_seqlens_q,
662
+ cu_seqlens_k=cu_seqlens_k,
663
+ max_seqlen_q=max_seqlen_in_batch_q,
664
+ max_seqlen_k=max_seqlen_in_batch_k,
665
+ dropout_p=dropout,
666
+ softmax_scale=softmax_scale,
667
+ causal=causal,
668
+ window_size=(self.config.sliding_window, self.config.sliding_window),
669
+ )
670
+
671
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
672
+ else:
673
+ if not use_sliding_windows:
674
+ attn_output = flash_attn_func(
675
+ query_states,
676
+ key_states,
677
+ value_states,
678
+ dropout,
679
+ softmax_scale=softmax_scale,
680
+ causal=causal,
681
+ )
682
+ else:
683
+ attn_output = flash_attn_func(
684
+ query_states,
685
+ key_states,
686
+ value_states,
687
+ dropout,
688
+ softmax_scale=softmax_scale,
689
+ causal=causal,
690
+ window_size=(self.config.sliding_window, self.config.sliding_window),
691
+ )
692
+
693
+ return attn_output
694
+
695
+ # Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2._upad_input
696
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
697
+ batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
698
+
699
+ # On the first iteration we need to properly re-create the padding mask
700
+ # by slicing it at the proper place
701
+ if kv_seq_len != attention_mask.shape[-1]:
702
+ attention_mask_num_tokens = attention_mask.shape[-1]
703
+ attention_mask = attention_mask[:, attention_mask_num_tokens - kv_seq_len :]
704
+
705
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
706
+
707
+ key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
708
+ value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
709
+
710
+ if query_length == kv_seq_len:
711
+ query_layer = index_first_axis(
712
+ query_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k
713
+ )
714
+ cu_seqlens_q = cu_seqlens_k
715
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
716
+ indices_q = indices_k
717
+ elif query_length == 1:
718
+ max_seqlen_in_batch_q = 1
719
+ cu_seqlens_q = torch.arange(
720
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
721
+ ) # There is a memcpy here, that is very bad.
722
+ indices_q = cu_seqlens_q[:-1]
723
+ query_layer = query_layer.squeeze(1)
724
+ else:
725
+ # The -q_len: slice assumes left padding.
726
+ attention_mask = attention_mask[:, -query_length:]
727
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
728
+
729
+ return (
730
+ query_layer,
731
+ key_layer,
732
+ value_layer,
733
+ indices_q,
734
+ (cu_seqlens_q, cu_seqlens_k),
735
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
736
+ )
737
+
738
+
739
+ # copied from transformers.models.llama.modeling_llama.LlamaSdpaAttention with Llama->Phi3
740
+ # TODO @Arthur no longer copied from LLama after static cache
741
+ class Phi3SdpaAttention(Phi3Attention):
742
+ """
743
+ Phi3 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
744
+ `Phi3Attention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
745
+ SDPA API.
746
+ """
747
+
748
+ # Adapted from Phi3Attention.forward
749
+ def forward(
750
+ self,
751
+ hidden_states: torch.Tensor,
752
+ attention_mask: Optional[torch.Tensor] = None,
753
+ position_ids: Optional[torch.LongTensor] = None,
754
+ past_key_value: Optional[Cache] = None,
755
+ output_attentions: bool = False,
756
+ use_cache: bool = False,
757
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
758
+ if output_attentions:
759
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
760
+ # logger.warning_once(
761
+ # "Phi3Model is using Phi3SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
762
+ # 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
763
+ # )
764
+ return super().forward(
765
+ hidden_states=hidden_states,
766
+ attention_mask=attention_mask,
767
+ position_ids=position_ids,
768
+ past_key_value=past_key_value,
769
+ output_attentions=output_attentions,
770
+ use_cache=use_cache,
771
+ )
772
+
773
+ bsz, q_len, _ = hidden_states.size()
774
+
775
+ qkv = self.qkv_proj(hidden_states)
776
+ query_pos = self.num_heads * self.head_dim
777
+ query_states = qkv[..., :query_pos]
778
+ key_states = qkv[..., query_pos : query_pos + self.num_key_value_heads * self.head_dim]
779
+ value_states = qkv[..., query_pos + self.num_key_value_heads * self.head_dim :]
780
+
781
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
782
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
783
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
784
+
785
+ kv_seq_len = key_states.shape[-2]
786
+ if past_key_value is not None:
787
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
788
+ cos, sin = self.rotary_emb(value_states, position_ids, seq_len=kv_seq_len)
789
+
790
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
791
+
792
+ if past_key_value is not None:
793
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
794
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
795
+
796
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
797
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
798
+
799
+ if attention_mask is not None:
800
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
801
+ raise ValueError(
802
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
803
+ )
804
+
805
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
806
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
807
+ if query_states.device.type == "cuda" and attention_mask is not None:
808
+ query_states = query_states.contiguous()
809
+ key_states = key_states.contiguous()
810
+ value_states = value_states.contiguous()
811
+
812
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
813
+ query_states,
814
+ key_states,
815
+ value_states,
816
+ attn_mask=attention_mask,
817
+ dropout_p=self.attention_dropout if self.training else 0.0,
818
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
819
+ is_causal=self.is_causal and attention_mask is None and q_len > 1,
820
+ )
821
+
822
+ attn_output = attn_output.transpose(1, 2).contiguous()
823
+ attn_output = attn_output.view(bsz, q_len, self.hidden_size)
824
+
825
+ attn_output = self.o_proj(attn_output)
826
+
827
+ return attn_output, None, past_key_value
828
+
829
+
830
+
831
+
832
+ PHI3_ATTENTION_CLASSES = {
833
+ "eager": Phi3Attention,
834
+ "flash_attention_2": Phi3FlashAttention2,
835
+ "sdpa": Phi3SdpaAttention,
836
+ }
837
+
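+ # Editor's note, an illustrative usage sketch (not part of the original file): the key used to
+ # look up this mapping comes from `config._attn_implementation`, which is normally set via the
+ # `attn_implementation` argument at load time. The model path below is a placeholder.
+ #
+ #   model = AutoModelForCausalLM.from_pretrained(
+ #       "path/to/Imp-v1.5-4b", trust_remote_code=True, attn_implementation="eager"
+ #   )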
838
+ class Phi3DecoderLayer(nn.Module):
839
+ def __init__(self, config: Phi3Config, layer_idx: int):
840
+ super().__init__()
841
+
842
+ self.config = config
843
+ self.self_attn = PHI3_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx=layer_idx)
844
+
845
+ self.mlp = Phi3MLP(config)
846
+ self.input_layernorm = Phi3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
847
+
848
+ self.resid_attn_dropout = nn.Dropout(config.resid_pdrop)
849
+ self.resid_mlp_dropout = nn.Dropout(config.resid_pdrop)
850
+ self.post_attention_layernorm = Phi3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
851
+
852
+ def forward(
853
+ self,
854
+ hidden_states: torch.Tensor,
855
+ attention_mask: Optional[torch.Tensor] = None,
856
+ position_ids: Optional[torch.LongTensor] = None,
857
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
858
+ output_attentions: Optional[bool] = False,
859
+ use_cache: Optional[bool] = False,
860
+ **kwargs,
861
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
862
+ if "padding_mask" in kwargs:
863
+ warnings.warn(
864
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
865
+ )
866
+ """
867
+ Args:
868
+ hidden_states (`torch.FloatTensor`):
869
+ input to the layer of shape `(batch, seq_len, embed_dim)`
870
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
871
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
872
+ position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
873
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range
874
+ `[0, config.n_positions - 1]`. [What are position IDs?](../glossary#position-ids)
875
+ output_attentions (`bool`, *optional*):
876
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
877
+ returned tensors for more detail.
878
+ use_cache (`bool`, *optional*):
879
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
880
+ (see `past_key_values`).
881
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
882
+ """
883
+
884
+ residual = hidden_states
885
+
886
+ hidden_states = self.input_layernorm(hidden_states)
887
+
888
+ # Self Attention
889
+ attn_outputs, self_attn_weights, present_key_value = self.self_attn(
890
+ hidden_states=hidden_states,
891
+ attention_mask=attention_mask,
892
+ position_ids=position_ids,
893
+ past_key_value=past_key_value,
894
+ output_attentions=output_attentions,
895
+ use_cache=use_cache,
896
+ )
897
+
898
+ hidden_states = residual + self.resid_attn_dropout(attn_outputs)
899
+
900
+ residual = hidden_states
901
+ hidden_states = self.post_attention_layernorm(hidden_states)
902
+ hidden_states = self.mlp(hidden_states)
903
+ hidden_states = residual + self.resid_mlp_dropout(hidden_states)
904
+
905
+ outputs = (hidden_states,)
906
+
907
+ if output_attentions:
908
+ outputs += (self_attn_weights,)
909
+
910
+ if use_cache:
911
+ outputs += (present_key_value,)
912
+
913
+ return outputs
914
+
915
+
916
+
917
+ class Phi3PreTrainedModel(PreTrainedModel):
918
+ config_class = Phi3Config
919
+ base_model_prefix = "model"
920
+ supports_gradient_checkpointing = True
921
+ _no_split_modules = ["Phi3DecoderLayer"]
922
+ _skip_keys_device_placement = "past_key_values"
923
+ _supports_flash_attn_2 = True
924
+ _supports_sdpa = False
925
+ _supports_cache_class = True
926
+
927
+ _version = "0.0.5"
928
+
929
+ def _init_weights(self, module):
930
+ std = self.config.initializer_range
931
+ if isinstance(module, nn.Linear):
932
+ module.weight.data.normal_(mean=0.0, std=std)
933
+ if module.bias is not None:
934
+ module.bias.data.zero_()
935
+ elif isinstance(module, nn.Embedding):
936
+ module.weight.data.normal_(mean=0.0, std=std)
937
+ if module.padding_idx is not None:
938
+ module.weight.data[module.padding_idx].zero_()
939
+
940
+ def prepare_inputs_for_generation(
941
+ self,
942
+ input_ids: torch.LongTensor,
943
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
944
+ inputs_embeds: Optional[torch.FloatTensor] = None,
945
+ attention_mask: Optional[Union[torch.LongTensor, torch.BoolTensor]] = None,
946
+ **kwargs,
947
+ ) -> Dict[str, Any]:
948
+ if past_key_values is not None:
949
+ if isinstance(past_key_values, Cache):
950
+ cache_length = past_key_values.get_seq_length()
951
+ past_length = past_key_values.seen_tokens
952
+ max_cache_length = past_key_values.get_max_length()
953
+ else:
954
+ cache_length = past_length = past_key_values[0][0].shape[2]
955
+ max_cache_length = None
956
+
957
+ # Keep only the unprocessed tokens:
958
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
959
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
960
+ # input)
961
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
962
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
963
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
964
+ # input_ids based on the past_length.
965
+ elif past_length < input_ids.shape[1]:
966
+ input_ids = input_ids[:, past_length:]
967
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
968
+
969
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
970
+ if (
971
+ max_cache_length is not None
972
+ and attention_mask is not None
973
+ and cache_length + input_ids.shape[1] > max_cache_length
974
+ ):
975
+ attention_mask = attention_mask[:, -max_cache_length:]
976
+
977
+ position_ids = kwargs.get("position_ids", None)
978
+ if attention_mask is not None and position_ids is None:
979
+ # create position_ids on the fly for batch generation
980
+ position_ids = attention_mask.long().cumsum(-1) - 1
981
+ position_ids.masked_fill_(attention_mask == 0, 1)
982
+ if past_key_values:
983
+ position_ids = position_ids[:, -input_ids.shape[1] :]
984
+
985
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
986
+ if inputs_embeds is not None and past_key_values is None:
987
+ model_inputs = {"inputs_embeds": inputs_embeds}
988
+ else:
989
+ model_inputs = {"input_ids": input_ids}
990
+
991
+ model_inputs.update(
992
+ {
993
+ "position_ids": position_ids,
994
+ "past_key_values": past_key_values,
995
+ "use_cache": kwargs.get("use_cache"),
996
+ "attention_mask": attention_mask,
997
+ }
998
+ )
999
+ return model_inputs
1000
+
1001
+
1002
+
1003
+
1004
+ class LlavaMetaModel(ABC):
1005
+ """
1006
+ Define the APIs for building the image-perception components.
1007
+ This implementation is based on the LLaVA project.
1008
+ """
1009
+
1010
+ def get_vision_tower(self):
1011
+ vision_tower = getattr(self, 'vision_tower', None)
1012
+ if type(vision_tower) is list:
1013
+ vision_tower = vision_tower[0]
1014
+ return vision_tower
1015
+
1016
+ def build_vision_tower(self, config):
1017
+ self.vision_tower = VisionTower(config.vision_tower_cfg)
1018
+ # self.vision_tower = CLIPVisionTower(config.vision_tower_cfg)
1019
+
1020
+ def build_vision_projector(self, config):
1021
+ projector_type = getattr(config, 'mm_projector_type', 'linear')
1022
+
1023
+ if projector_type == 'linear':
1024
+ self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)
1025
+ return
1026
+
1027
+ mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projector_type)
1028
+ if mlp_gelu_match:
1029
+ mlp_depth = int(mlp_gelu_match.group(1))
1030
+ modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
1031
+ for _ in range(1, mlp_depth):
1032
+ modules.append(nn.GELU())
1033
+ modules.append(nn.Linear(config.hidden_size, config.hidden_size))
1034
+ self.mm_projector = nn.Sequential(*modules)
1035
+ return
1036
+
1037
+ if projector_type == 'identity':
1038
+ self.mm_projector = nn.Identity()
1039
+ return
1040
+
1041
+ raise ValueError(f'Unknown projector type: {projector_type}')
1042
+
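+ # Editor's note, illustrative only: a `mm_projector_type` such as "mlp2x_gelu" matches the regex
+ # above and builds a two-layer GELU MLP:
+ #   Linear(mm_hidden_size -> hidden_size), GELU(), Linear(hidden_size -> hidden_size)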
1043
+
1044
+ class ImpPhi3Model(Phi3PreTrainedModel, LlavaMetaModel):
1045
+ """Imp model. This implementation is modified from the implementation of Phi-3."""
1046
+
1047
+ config_class = ImpPhi3Config
1048
+
1049
+ def __init__(self, config: ImpPhi3Config) -> None:
1050
+ super().__init__(config)
1051
+ self.padding_idx = config.pad_token_id
1052
+ self.vocab_size = config.vocab_size
1053
+
1054
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
1055
+ self.embed_dropout = nn.Dropout(config.embd_pdrop)
1056
+ self.layers = nn.ModuleList(
1057
+ [Phi3DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
1058
+ )
1059
+ self._attn_implementation = config._attn_implementation
1060
+ self.norm = Phi3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1061
+
1062
+ self.gradient_checkpointing = False
1063
+
1064
+ if hasattr(config, "mm_vision_tower"):
1065
+ self.build_vision_tower(config)
1066
+ self.build_vision_projector(config)
1067
+ # Initialize weights and apply final processing
1068
+ self.post_init()
1069
+
1070
+
1071
+ def get_input_embeddings(self) -> nn.Embedding:
1072
+ return self.embed_tokens
1073
+
1074
+ def set_input_embeddings(self, new_embeddings: nn.Embedding) -> None:
1075
+ self.embed_tokens = new_embeddings
1076
+
1077
+ def forward(
1078
+ self,
1079
+ input_ids: torch.LongTensor = None,
1080
+ attention_mask: Optional[torch.Tensor] = None,
1081
+ position_ids: Optional[torch.LongTensor] = None,
1082
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1083
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1084
+ use_cache: Optional[bool] = None,
1085
+ output_attentions: Optional[bool] = None,
1086
+ output_hidden_states: Optional[bool] = None,
1087
+ return_dict: Optional[bool] = None,
1088
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
1089
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1090
+ output_hidden_states = (
1091
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1092
+ )
1093
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1094
+
1095
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1096
+
1097
+ # retrieve input_ids and inputs_embeds
1098
+ if input_ids is not None and inputs_embeds is not None:
1099
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
1100
+ elif input_ids is not None:
1101
+ batch_size, seq_length = input_ids.shape[:2]
1102
+ elif inputs_embeds is not None:
1103
+ batch_size, seq_length = inputs_embeds.shape[:2]
1104
+ else:
1105
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
1106
+
1107
+ past_key_values_length = 0
1108
+
1109
+ if self.gradient_checkpointing and self.training:
1110
+ if use_cache:
1111
+ # logger.warning_once(
1112
+ # "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
1113
+ # )
1114
+ use_cache = False
1115
+
1116
+ if use_cache:
1117
+ use_legacy_cache = not isinstance(past_key_values, Cache)
1118
+ if use_legacy_cache:
1119
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
1120
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
1121
+
1122
+ if position_ids is None:
1123
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1124
+ position_ids = torch.arange(
1125
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
1126
+ )
1127
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
1128
+ else:
1129
+ position_ids = position_ids.view(-1, seq_length).long()
1130
+
1131
+ if inputs_embeds is None:
1132
+ inputs_embeds = self.embed_tokens(input_ids)
1133
+
1134
+ if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
1135
+ is_padding_right = attention_mask[:, -1].sum().item() != batch_size
1136
+ if is_padding_right:
1137
+ raise ValueError(
1138
+ "You are attempting to perform batched generation with padding_side='right'"
1139
+ " this may lead to unexpected behaviour for Flash Attention version of Phi3. Make sure to "
1140
+ " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
1141
+ )
1142
+
1143
+ if self._attn_implementation == "flash_attention_2":
1144
+ # 2d mask is passed through the layers
1145
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
1146
+ else:
1147
+ # 4d mask is passed through the layers
1148
+ attention_mask = _prepare_4d_causal_attention_mask(
1149
+ attention_mask,
1150
+ (batch_size, seq_length),
1151
+ inputs_embeds,
1152
+ past_key_values_length,
1153
+ sliding_window=self.config.sliding_window,
1154
+ )
1155
+
1156
+ hidden_states = inputs_embeds
1157
+
1158
+ # decoder layers
1159
+ all_hidden_states = () if output_hidden_states else None
1160
+ all_self_attns = () if output_attentions else None
1161
+ next_decoder_cache = None
1162
+
1163
+ for decoder_layer in self.layers:
1164
+ if output_hidden_states:
1165
+ all_hidden_states += (hidden_states,)
1166
+
1167
+ if self.gradient_checkpointing and self.training:
1168
+ layer_outputs = self._gradient_checkpointing_func(
1169
+ decoder_layer.__call__,
1170
+ hidden_states,
1171
+ attention_mask,
1172
+ position_ids,
1173
+ past_key_values,
1174
+ output_attentions,
1175
+ use_cache,
1176
+ )
1177
+ else:
1178
+ layer_outputs = decoder_layer(
1179
+ hidden_states,
1180
+ attention_mask=attention_mask,
1181
+ position_ids=position_ids,
1182
+ past_key_value=past_key_values,
1183
+ output_attentions=output_attentions,
1184
+ use_cache=use_cache,
1185
+ )
1186
+
1187
+ hidden_states = layer_outputs[0]
1188
+
1189
+ if use_cache:
1190
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1191
+
1192
+ if output_attentions:
1193
+ all_self_attns += (layer_outputs[1],)
1194
+
1195
+ hidden_states = self.norm(hidden_states)
1196
+
1197
+ # add hidden states from the last decoder layer
1198
+ if output_hidden_states:
1199
+ all_hidden_states += (hidden_states,)
1200
+
1201
+ next_cache = None
1202
+ if use_cache:
1203
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
1204
+ if not return_dict:
1205
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1206
+ return BaseModelOutputWithPast(
1207
+ last_hidden_state=hidden_states,
1208
+ past_key_values=next_cache,
1209
+ hidden_states=all_hidden_states,
1210
+ attentions=all_self_attns,
1211
+ )
1212
+
1213
+
1214
+
1215
+ class LlavaMetaForCausalLM(ABC):
1216
+ """This implementation is based on the LLaVA project."""
1217
+
1218
+ def init_constants(self, config):
1219
+ self.IGNORE_INDEX = getattr(config, 'ignore_index', -100)
1220
+ self.IMAGE_TOKEN_INDEX = getattr(config, 'image_token_index', 50296)
1221
+ self.DEFAULT_IMAGE_TOKEN = getattr(config, 'image_token', "<image>")
1222
+
1223
+ @abstractmethod
1224
+ def get_model(self):
1225
+ pass
1226
+
1227
+ def get_vision_tower(self):
1228
+ return self.get_model().get_vision_tower()
1229
+
1230
+ def encode_images(self, images):
1231
+ image_features = self.get_model().get_vision_tower()(images)
1232
+ image_features = self.get_model().mm_projector(image_features)
1233
+ return image_features
1234
+
1235
+ def prepare_inputs_labels_for_multimodal(
1236
+ self, input_ids, position_ids, attention_mask, past_key_values, labels, images
1237
+ ):
1238
+ vision_tower = self.get_vision_tower()
1239
+ # if vision_tower is None or images is None or past_key_values.seqlen_offset != 0:
1240
+ if past_key_values is not None:
1241
+ target_shape = past_key_values[0][0].shape[2] + 1
1242
+ attention_mask = torch.ones(
1243
+ (attention_mask.shape[0], target_shape),
1244
+ dtype=attention_mask.dtype,
1245
+ device=attention_mask.device
1246
+ )
1247
+ position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
1248
+ # print('position_ids', position_ids.shape)
1249
+ # print(input_ids[:, -1:].item())
1250
+ return input_ids[:, -1:], position_ids, attention_mask, past_key_values, None, labels
1251
+
1252
+ if type(images) is list or images.ndim == 5:
1253
+ concat_images = torch.cat([image for image in images], dim=0)
1254
+ # concat_images.requires_grad_(True)
1255
+ image_features = self.encode_images(concat_images)
1256
+ split_sizes = [image.shape[0] for image in images]
1257
+ image_features = torch.split(image_features, split_sizes, dim=0)
1258
+ image_features = [x.flatten(0, 1).to(self.device) for x in image_features]
1259
+ else:
1260
+ # images.requires_grad_(True)
1261
+ image_features = self.encode_images(images).to(self.device)
1262
+
1263
+ # TODO: image start / end is not implemented here to support pretraining.
1264
+ if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
1265
+ raise NotImplementedError
1266
+
1267
+ # Let's just add dummy tensors if they do not exist,
1268
+ # it is a headache to deal with None all the time.
1269
+ # But it is not ideal, and if you have a better idea,
1270
+ # please open an issue / submit a PR, thanks.
1271
+ _labels = labels
1272
+ _position_ids = position_ids
1273
+ _attention_mask = attention_mask
1274
+ if attention_mask is None:
1275
+ attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
1276
+ else:
1277
+ attention_mask = attention_mask.bool()
1278
+ if position_ids is None:
1279
+ position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
1280
+ if labels is None:
1281
+ labels = torch.full_like(input_ids, self.IGNORE_INDEX)
1282
+
1283
+ # remove the padding using attention_mask -- TODO: double check
1284
+ input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)]
1285
+ labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
1286
+
1287
+ new_input_embeds = []
1288
+ new_labels = []
1289
+ cur_image_idx = 0
1290
+ for batch_idx, cur_input_ids in enumerate(input_ids):
1291
+ num_images = (cur_input_ids == self.IMAGE_TOKEN_INDEX).sum()
1292
+ if num_images == 0:
1293
+ cur_image_features = image_features[cur_image_idx]
1294
+ cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids)
1295
+ cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
1296
+ new_input_embeds.append(cur_input_embeds)
1297
+ new_labels.append(labels[batch_idx])
1298
+ cur_image_idx += 1
1299
+ continue
1300
+
1301
+ image_token_indices = [-1] + torch.where(cur_input_ids == self.IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]]
1302
+ cur_input_ids_noim = []
1303
+ cur_labels = labels[batch_idx]
1304
+ cur_labels_noim = []
1305
+ for i in range(len(image_token_indices) - 1):
1306
+ cur_input_ids_noim.append(cur_input_ids[image_token_indices[i]+1:image_token_indices[i+1]])
1307
+ cur_labels_noim.append(cur_labels[image_token_indices[i]+1:image_token_indices[i+1]])
1308
+ split_sizes = [x.shape[0] for x in cur_labels_noim]
1309
+ cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim))
1310
+ # print(cur_input_embeds.shape)
1311
+ cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
1312
+ cur_new_input_embeds = []
1313
+ cur_new_labels = []
1314
+
1315
+ for i in range(num_images + 1):
1316
+ cur_new_input_embeds.append(cur_input_embeds_no_im[i])
1317
+ cur_new_labels.append(cur_labels_noim[i])
1318
+ if i < num_images:
1319
+ cur_image_features = image_features[cur_image_idx]
1320
+ cur_image_idx += 1
1321
+ cur_new_input_embeds.append(cur_image_features)
1322
+ cur_new_labels.append(torch.full((cur_image_features.shape[0],), self.IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))
1323
+
1324
+ cur_new_input_embeds = torch.cat(cur_new_input_embeds)
1325
+ cur_new_labels = torch.cat(cur_new_labels)
1326
+
1327
+ new_input_embeds.append(cur_new_input_embeds)
1328
+ new_labels.append(cur_new_labels)
1329
+
1330
+ # Truncate sequences to max length as image embeddings can make the sequence longer
1331
+ tokenizer_model_max_length = getattr(self.config, 'tokenizer_model_max_length', None)
1332
+ if tokenizer_model_max_length is not None:
1333
+ new_input_embeds = [x[:tokenizer_model_max_length] for x in new_input_embeds]
1334
+ new_labels = [x[:tokenizer_model_max_length] for x in new_labels]
1335
+
1336
+ # Combine them
1337
+ max_len = max(x.shape[0] for x in new_input_embeds)
1338
+ batch_size = len(new_input_embeds)
1339
+
1340
+ new_input_embeds_padded = []
1341
+ new_labels_padded = torch.full((batch_size, max_len), self.IGNORE_INDEX, dtype=new_labels[0].dtype, device=new_labels[0].device)
1342
+ attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device)
1343
+ position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)
1344
+
1345
+ for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
1346
+ cur_len = cur_new_embed.shape[0]
1347
+ if getattr(self.config, 'tokenizer_padding_side', 'right') == "left":
1348
+ new_input_embeds_padded.append(torch.cat((
1349
+ torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device),
1350
+ cur_new_embed
1351
+ ), dim=0))
1352
+ if cur_len > 0:
1353
+ new_labels_padded[i, -cur_len:] = cur_new_labels
1354
+ attention_mask[i, -cur_len:] = True
1355
+ position_ids[i, -cur_len:] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
1356
+ else:
1357
+ new_input_embeds_padded.append(torch.cat((
1358
+ cur_new_embed,
1359
+ torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)
1360
+ ), dim=0))
1361
+ if cur_len > 0:
1362
+ new_labels_padded[i, :cur_len] = cur_new_labels
1363
+ attention_mask[i, :cur_len] = True
1364
+ position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
1365
+
1366
+ new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)
1367
+
1368
+ if new_input_embeds.shape[-2] > 2000:
1369
+ self.need_clear_cache = True
1370
+
1371
+ if _labels is None:
1372
+ new_labels = None
1373
+ else:
1374
+ new_labels = new_labels_padded
1375
+
1376
+ if _attention_mask is None:
1377
+ attention_mask = None
1378
+ else:
1379
+ attention_mask = attention_mask.to(dtype=_attention_mask.dtype)
1380
+
1381
+ if _position_ids is None:
1382
+ position_ids = None
1383
+
1384
+ return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels
1385
+
1386
+
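The block above is the LLaVA-style multimodal packing step: each `<image>` placeholder in the input ids is replaced by the corresponding projected image features, the labels at those positions are set to the ignore index, and the batch is re-padded afterwards. As a rough illustration of just the splicing part, here is a minimal standalone sketch; the constant values and the `embed` callable are hypothetical stand-ins for `self.IMAGE_TOKEN_INDEX`, `self.IGNORE_INDEX`, and `self.get_model().embed_tokens`, whose real values come from the model config.

```python
import torch

# Hypothetical stand-ins; the real values are set via init_constants(config).
IMAGE_TOKEN_INDEX = -200
IGNORE_INDEX = -100

def splice_image_features(input_ids, labels, image_features, embed):
    """Replace each image placeholder id with a block of image feature vectors."""
    image_pos = torch.where(input_ids == IMAGE_TOKEN_INDEX)[0].tolist()
    bounds = [-1] + image_pos + [input_ids.shape[0]]
    out_embeds, out_labels = [], []
    for i in range(len(bounds) - 1):
        text_ids = input_ids[bounds[i] + 1 : bounds[i + 1]]
        out_embeds.append(embed(text_ids))                        # embed the text chunk
        out_labels.append(labels[bounds[i] + 1 : bounds[i + 1]])
        if i < len(image_pos):                                    # then splice in the image block
            out_embeds.append(image_features)
            out_labels.append(torch.full((image_features.shape[0],), IGNORE_INDEX,
                                         dtype=labels.dtype))
    return torch.cat(out_embeds), torch.cat(out_labels)

# Toy example: 5 text tokens with one image placeholder and 3 "patch" features.
hidden = 8
embed = lambda ids: torch.randn(ids.shape[0], hidden)             # stand-in for embed_tokens
ids = torch.tensor([1, 5, IMAGE_TOKEN_INDEX, 7, 2])
lbls = torch.tensor([IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 7, 2])
feats = torch.randn(3, hidden)
emb, new_lbls = splice_image_features(ids, lbls, feats, embed)
print(emb.shape, new_lbls.shape)   # torch.Size([7, 8]) torch.Size([7])
```

The method above additionally truncates to `tokenizer_model_max_length` and rebuilds `attention_mask` and `position_ids` according to the configured padding side.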
1387
+ class ImpPhi3ForCausalLM(Phi3PreTrainedModel, LlavaMetaForCausalLM):
1388
+ """ImpPhi3 for Causal Language Modeling."""
1389
+
1390
+ config_class = ImpPhi3Config
1391
+
1392
+ def __init__(self, config: ImpPhi3Config) -> None:
1393
+ super().__init__(config)
1394
+
1395
+ self.model = ImpPhi3Model(config)
1396
+ self.vocab_size = config.vocab_size
1397
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1398
+ self.need_clear_cache = False
1399
+
1400
+ self.post_init()
1401
+ self.init_constants(config)
1402
+
1403
+ # def get_output_embeddings(self) -> nn.Linear:
1404
+ # return self.lm_head
1405
+
1406
+ # def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
1407
+ # self.lm_head = new_embeddings
1408
+
1409
+ def get_model(self):
1410
+ return self.model
1411
+
1412
+
1413
+ def image_preprocess(self, images):
1414
+ return self.get_vision_tower().image_processor(images)['pixel_values']
1415
+
1416
+
1417
+ def forward(
1418
+ self,
1419
+ input_ids: torch.LongTensor = None,
1420
+ attention_mask: Optional[torch.Tensor] = None,
1421
+ position_ids: Optional[torch.LongTensor] = None,
1422
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1423
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1424
+ labels: Optional[torch.LongTensor] = None,
1425
+ use_cache: Optional[bool] = None,
1426
+ output_attentions: Optional[bool] = None,
1427
+ output_hidden_states: Optional[bool] = None,
1428
+ images: Optional[torch.FloatTensor] = None,
1429
+ return_dict: Optional[bool] = None,
1430
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1431
+
1432
+ if inputs_embeds is None:
1433
+ (
1434
+ input_ids,
1435
+ position_ids,
1436
+ attention_mask,
1437
+ past_key_values,
1438
+ inputs_embeds,
1439
+ labels
1440
+ ) = self.prepare_inputs_labels_for_multimodal(
1441
+ input_ids,
1442
+ position_ids,
1443
+ attention_mask,
1444
+ past_key_values,
1445
+ labels,
1446
+ images
1447
+ )
1448
+
1449
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1450
+ output_hidden_states = (
1451
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1452
+ )
1453
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1454
+
1455
+ outputs = self.model(
1456
+ input_ids=input_ids,
1457
+ attention_mask=attention_mask,
1458
+ position_ids=position_ids,
1459
+ past_key_values=past_key_values,
1460
+ inputs_embeds=inputs_embeds,
1461
+ use_cache=use_cache,
1462
+ output_attentions=output_attentions,
1463
+ output_hidden_states=output_hidden_states,
1464
+ return_dict=return_dict,
1465
+ )
1466
+
1467
+ hidden_states = outputs[0]
1468
+ logits = self.lm_head(hidden_states)
1469
+ # logits = logits.float()
1470
+
1471
+ loss = None
1472
+ if labels is not None:
1473
+ # Shift so that tokens < n predict n
1474
+ shift_logits = logits[..., :-1, :].contiguous()
1475
+ shift_labels = labels[..., 1:].contiguous()
1476
+ # Flatten the tokens
1477
+ loss_fct = CrossEntropyLoss()
1478
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1479
+ shift_labels = shift_labels.view(-1)
1480
+ # Enable model parallelism
1481
+ shift_labels = shift_labels.to(shift_logits.device)
1482
+ loss = loss_fct(shift_logits, shift_labels)
1483
+
1484
+ if not return_dict:
1485
+ output = (logits,) + outputs[1:]
1486
+ return (loss,) + output if loss is not None else output
1487
+
1488
+ return CausalLMOutputWithPast(
1489
+ loss=loss,
1490
+ logits=logits,
1491
+ past_key_values=outputs.past_key_values,
1492
+ hidden_states=outputs.hidden_states,
1493
+ attentions=outputs.attentions,
1494
+ )
1495
+
1496
+
1497
+ # # inputs_embeds.requires_grad_(True)
1498
+ # return super().forward(
1499
+ # input_ids=input_ids,
1500
+ # attention_mask=attention_mask,
1501
+ # position_ids=position_ids,
1502
+ # past_key_values=past_key_values,
1503
+ # inputs_embeds=inputs_embeds,
1504
+ # labels=labels,
1505
+ # use_cache=use_cache,
1506
+ # output_attentions=output_attentions,
1507
+ # output_hidden_states=output_hidden_states,
1508
+ # return_dict=return_dict
1509
+ # )
1510
+
1511
+ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
1512
+ images = kwargs.pop("images", None)
1513
+ _inputs = super().prepare_inputs_for_generation(
1514
+ input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs
1515
+ )
1516
+ if images is not None:
1517
+ _inputs['images'] = images
1518
+ return _inputs
1519
+
1520
+ AutoConfig.register("imp_phi3", ImpPhi3Config)
1521
+ AutoModelForCausalLM.register(ImpPhi3Config, ImpPhi3ForCausalLM)
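Registering the config and model classes with `AutoConfig`/`AutoModelForCausalLM` lets the checkpoint be loaded through the Auto classes with `trust_remote_code=True`. A minimal usage sketch follows; the repository id and image path are placeholders, `model.image_preprocess` and the `images=` keyword of `generate` are defined in the code above, and the prompt format follows the chat template shipped in `tokenizer_config.json` below.

```python
# Minimal end-to-end sketch; repo id and image path are assumed placeholders.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "MILVLG/Imp-v1.5-4B-Phi3"   # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

prompt = "<|user|>\n<image>\nWhat is shown in this picture?<|end|>\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

image = Image.open("example.jpg")
pixel_values = model.image_preprocess(image)   # wraps the SigLIP processor in vision_encoder.py

output_ids = model.generate(input_ids, images=pixel_values, max_new_tokens=100, use_cache=True)
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True))
```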
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<|endoftext|>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<unk>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,138 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "32011": {
6
+ "content": "<image>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "0": {
14
+ "content": "<unk>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "1": {
22
+ "content": "<s>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "2": {
30
+ "content": "</s>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "32000": {
38
+ "content": "<|endoftext|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "32001": {
46
+ "content": "<|assistant|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "32002": {
54
+ "content": "<|placeholder1|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "32003": {
62
+ "content": "<|placeholder2|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "32004": {
70
+ "content": "<|placeholder3|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "32005": {
78
+ "content": "<|placeholder4|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "32006": {
86
+ "content": "<|system|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "32007": {
94
+ "content": "<|end|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "32008": {
102
+ "content": "<|placeholder5|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "32009": {
110
+ "content": "<|placeholder6|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "32010": {
118
+ "content": "<|user|>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": true
124
+ }
125
+ },
126
+ "bos_token": "<s>",
127
+ "chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}",
128
+ "clean_up_tokenization_spaces": false,
129
+ "eos_token": "</s>",
130
+ "legacy": false,
131
+ "model_max_length": 4096,
132
+ "pad_token": "<|endoftext|>",
133
+ "padding_side": "left",
134
+ "sp_model_kwargs": {},
135
+ "tokenizer_class": "LlamaTokenizer",
136
+ "unk_token": "<unk>",
137
+ "use_default_system_prompt": false
138
+ }
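The `chat_template` above wraps each user turn in `<|user|> ... <|end|>` and opens an `<|assistant|>` turn right after it. A small sketch of rendering a prompt with it (the repository id is a placeholder):

```python
from transformers import AutoTokenizer

repo_id = "MILVLG/Imp-v1.5-4B-Phi3"   # placeholder for wherever this checkpoint is hosted
tokenizer = AutoTokenizer.from_pretrained(repo_id)

messages = [{"role": "user", "content": "<image>\nDescribe this image."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
# Expected, per the template string above:
# <s><|user|>
# <image>
# Describe this image.<|end|>
# <|assistant|>
```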
vision_encoder.py ADDED
@@ -0,0 +1,613 @@
1
+ # Copyright (c) MILVLG team.
2
+ # Licensed under the Apache 2.0 license.
3
+ #
4
+ # Some code here is copied from the project Phi-2 (https://huggingface.co/microsoft/phi-2),
5
+ # SigLIP@transformers==4.37.0.dev0 (https://huggingface.co/google/siglip-so400m-patch14-384),
6
+ # and Llava (https://github.com/haotian-liu/LLaVA), and modified by
7
+ # Zhenwei Shao (shaozw@hdu.edu.cn) @ MILVLG. We thank them for their great works.
8
+ # And their original licenses and copyright should be inherited (see the statements
9
+ # in `configuration_imp.py` for more details).
10
+
11
+
12
+ from typing import Any, Optional, Tuple, Union, List, Dict
13
+ from dataclasses import dataclass
14
+ import math
15
+ import warnings
16
+ from functools import partial, reduce
17
+
18
+
19
+ import numpy as np
20
+ from PIL import Image
21
+ import torch
22
+ import torch.utils.checkpoint
23
+ from torch import nn
24
+
25
+ from transformers.image_processing_utils import BatchFeature
26
+ from transformers.image_transforms import (
27
+ convert_to_rgb,
28
+ normalize,
29
+ rescale,
30
+ resize,
31
+ to_channel_dimension_format,
32
+ )
33
+ from transformers.image_utils import (
34
+ ChannelDimension,
35
+ PILImageResampling,
36
+ to_numpy_array,
37
+ )
38
+ from transformers.activations import ACT2FN
39
+ from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
40
+ from transformers.modeling_utils import PreTrainedModel
41
+ from transformers.utils import ModelOutput
42
+
43
+ from .configuration_imp import SiglipVisionConfig
44
+
45
+
46
+ # ============================================================================
47
+ # A simple image preprocessor for SigLIP models.
48
+ # ============================================================================
49
+
50
+ def expand2square(pil_img, background_color):
51
+ width, height = pil_img.size
52
+ if width == height:
53
+ return pil_img
54
+ elif width > height:
55
+ result = Image.new(pil_img.mode, (width, width), background_color)
56
+ result.paste(pil_img, (0, (width - height) // 2))
57
+ return result
58
+ else:
59
+ result = Image.new(pil_img.mode, (height, height), background_color)
60
+ result.paste(pil_img, ((height - width) // 2, 0))
61
+ return result
62
+
63
+ def simple_image_processor(
64
+ images,
65
+ image_mean=(0.5, 0.5, 0.5),
66
+ image_std=(0.5, 0.5, 0.5),
67
+ size=(384, 384),
68
+ resample=PILImageResampling.BICUBIC,
69
+ rescale_factor=1 / 255,
70
+ data_format=ChannelDimension.FIRST,
71
+ return_tensors="pt"
72
+ ):
73
+
74
+ if isinstance(images, Image.Image):
75
+ images = [images]
76
+ else:
77
+ assert isinstance(images, list)
78
+
79
+ new_images = []
80
+ for image in images:
81
+ image = expand2square(image, tuple(int(x*255) for x in image_mean))
82
+ new_images.append(image)
83
+ images=new_images
84
+
85
+ transforms = [
86
+ convert_to_rgb,
87
+ to_numpy_array,
88
+ partial(resize, size=size, resample=resample, data_format=data_format),
89
+ partial(rescale, scale=rescale_factor, data_format=data_format),
90
+ partial(normalize, mean=image_mean, std=image_std, data_format=data_format),
91
+ partial(to_channel_dimension_format, channel_dim=data_format, input_channel_dim=data_format),
92
+ ]
93
+
94
+ images = reduce(lambda x, f: [*map(f, x)], transforms, images)
95
+ data = {"pixel_values": images}
96
+
97
+ return BatchFeature(data=data, tensor_type=return_tensors)
98
+
99
+ # ============================================================================
100
+ # Definitions for SigLIP models.
101
+ # ============================================================================
102
+
103
+ @dataclass
104
+ # Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->Siglip
105
+ class SiglipVisionModelOutput(ModelOutput):
106
+ """
107
+ Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
108
+
109
+ Args:
110
+ image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
111
+ The image embeddings obtained by applying the projection layer to the pooler_output.
112
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
113
+ Sequence of hidden-states at the output of the last layer of the model.
114
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
115
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
116
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
117
+
118
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
119
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
120
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
121
+ sequence_length)`.
122
+
123
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
124
+ heads.
125
+ """
126
+
127
+ image_embeds: Optional[torch.FloatTensor] = None
128
+ last_hidden_state: torch.FloatTensor = None
129
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
130
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
131
+
132
+
133
+ class SiglipVisionEmbeddings(nn.Module):
134
+ def __init__(self, config: SiglipVisionConfig):
135
+ super().__init__()
136
+ self.config = config
137
+ self.embed_dim = config.hidden_size
138
+ self.image_size = config.image_size
139
+ self.patch_size = config.patch_size
140
+
141
+ self.patch_embedding = nn.Conv2d(
142
+ in_channels=config.num_channels,
143
+ out_channels=self.embed_dim,
144
+ kernel_size=self.patch_size,
145
+ stride=self.patch_size,
146
+ padding="valid",
147
+ )
148
+
149
+ self.num_patches = (self.image_size // self.patch_size) ** 2
150
+ self.num_positions = self.num_patches
151
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
152
+ self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)
153
+
154
+ def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
155
+ patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid]
156
+ embeddings = patch_embeds.flatten(2).transpose(1, 2)
157
+
158
+ embeddings = embeddings + self.position_embedding(self.position_ids)
159
+ return embeddings
160
+
161
+
162
+
163
+ class SiglipAttention(nn.Module):
164
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
165
+
166
+ # Copied from transformers.models.clip.modeling_clip.CLIPAttention.__init__
167
+ def __init__(self, config):
168
+ super().__init__()
169
+ self.config = config
170
+ self.embed_dim = config.hidden_size
171
+ self.num_heads = config.num_attention_heads
172
+ self.head_dim = self.embed_dim // self.num_heads
173
+ if self.head_dim * self.num_heads != self.embed_dim:
174
+ raise ValueError(
175
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
176
+ f" {self.num_heads})."
177
+ )
178
+ self.scale = self.head_dim**-0.5
179
+ self.dropout = config.attention_dropout
180
+
181
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
182
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
183
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
184
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
185
+
186
+ def forward(
187
+ self,
188
+ hidden_states: torch.Tensor,
189
+ attention_mask: Optional[torch.Tensor] = None,
190
+ output_attentions: Optional[bool] = False,
191
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
192
+ """Input shape: Batch x Time x Channel"""
193
+
194
+ batch_size, q_len, _ = hidden_states.size()
195
+
196
+ query_states = self.q_proj(hidden_states)
197
+ key_states = self.k_proj(hidden_states)
198
+ value_states = self.v_proj(hidden_states)
199
+
200
+ query_states = query_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
201
+ key_states = key_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
202
+ value_states = value_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
203
+
204
+ k_v_seq_len = key_states.shape[-2]
205
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale
206
+
207
+ if attn_weights.size() != (batch_size, self.num_heads, q_len, k_v_seq_len):
208
+ raise ValueError(
209
+ f"Attention weights should be of size {(batch_size, self.num_heads, q_len, k_v_seq_len)}, but is"
210
+ f" {attn_weights.size()}"
211
+ )
212
+
213
+ if attention_mask is not None:
214
+ if attention_mask.size() != (batch_size, 1, q_len, k_v_seq_len):
215
+ raise ValueError(
216
+ f"Attention mask should be of size {(batch_size, 1, q_len, k_v_seq_len)}, but is {attention_mask.size()}"
217
+ )
218
+ attn_weights = attn_weights + attention_mask
219
+
220
+ # upcast attention to fp32
221
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
222
+ attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
223
+ attn_output = torch.matmul(attn_weights, value_states)
224
+
225
+ if attn_output.size() != (batch_size, self.num_heads, q_len, self.head_dim):
226
+ raise ValueError(
227
+ f"`attn_output` should be of size {(batch_size, self.num_heads, q_len, self.head_dim)}, but is"
228
+ f" {attn_output.size()}"
229
+ )
230
+
231
+ attn_output = attn_output.transpose(1, 2).contiguous()
232
+ attn_output = attn_output.reshape(batch_size, q_len, self.embed_dim)
233
+
234
+ attn_output = self.out_proj(attn_output)
235
+
236
+ return attn_output, attn_weights
237
+
238
+
239
+ # Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->Siglip
240
+ class SiglipMLP(nn.Module):
241
+ def __init__(self, config):
242
+ super().__init__()
243
+ self.config = config
244
+ self.activation_fn = ACT2FN[config.hidden_act]
245
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
246
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
247
+
248
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
249
+ hidden_states = self.fc1(hidden_states)
250
+ hidden_states = self.activation_fn(hidden_states)
251
+ hidden_states = self.fc2(hidden_states)
252
+ return hidden_states
253
+
254
+
255
+ # Copied from transformers.models.clip.modeling_clip.CLIPEncoderLayer with CLIP->Siglip
256
+ class SiglipEncoderLayer(nn.Module):
257
+ def __init__(self, config: SiglipVisionConfig):
258
+ super().__init__()
259
+ self.embed_dim = config.hidden_size
260
+ self.self_attn = SiglipAttention(config)
261
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
262
+ self.mlp = SiglipMLP(config)
263
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
264
+
265
+ # Ignore copy
266
+ def forward(
267
+ self,
268
+ hidden_states: torch.Tensor,
269
+ attention_mask: torch.Tensor,
270
+ output_attentions: Optional[bool] = False,
271
+ ) -> Tuple[torch.FloatTensor]:
272
+ """
273
+ Args:
274
+ hidden_states (`torch.FloatTensor`):
275
+ Input to the layer of shape `(batch, seq_len, embed_dim)`.
276
+ attention_mask (`torch.FloatTensor`):
277
+ Attention mask of shape `(batch, 1, q_len, k_v_seq_len)` where padding elements are indicated by very large negative values.
278
+ output_attentions (`bool`, *optional*, defaults to `False`):
279
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
280
+ returned tensors for more detail.
281
+ """
282
+ residual = hidden_states
283
+
284
+ hidden_states = self.layer_norm1(hidden_states)
285
+ hidden_states, attn_weights = self.self_attn(
286
+ hidden_states=hidden_states,
287
+ attention_mask=attention_mask,
288
+ output_attentions=output_attentions,
289
+ )
290
+ hidden_states = residual + hidden_states
291
+
292
+ residual = hidden_states
293
+ hidden_states = self.layer_norm2(hidden_states)
294
+ hidden_states = self.mlp(hidden_states)
295
+ hidden_states = residual + hidden_states
296
+
297
+ outputs = (hidden_states,)
298
+
299
+ if output_attentions:
300
+ outputs += (attn_weights,)
301
+
302
+ return outputs
303
+
304
+
305
+ class SiglipPreTrainedModel(PreTrainedModel):
306
+ """
307
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
308
+ models.
309
+ """
310
+
311
+ config_class = SiglipVisionConfig
312
+ base_model_prefix = "siglip"
313
+ supports_gradient_checkpointing = True
314
+
315
+ def _init_weights(self, module):
316
+ """Initialize the weights"""
317
+ pass
318
+
319
+ # Copied from transformers.models.clip.modeling_clip.CLIPEncoder with CLIP->Siglip
320
+ class SiglipEncoder(nn.Module):
321
+ """
322
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
323
+ [`SiglipEncoderLayer`].
324
+
325
+ Args:
326
+ config: SiglipVisionConfig
327
+ """
328
+
329
+ def __init__(self, config: SiglipVisionConfig):
330
+ super().__init__()
331
+ self.config = config
332
+ self.layers = nn.ModuleList([SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)])
333
+ self.gradient_checkpointing = False
334
+
335
+ # Ignore copy
336
+ def forward(
337
+ self,
338
+ inputs_embeds,
339
+ attention_mask: Optional[torch.Tensor] = None,
340
+ output_attentions: Optional[bool] = None,
341
+ output_hidden_states: Optional[bool] = None,
342
+ return_dict: Optional[bool] = None,
343
+ ) -> Union[Tuple, BaseModelOutput]:
344
+ r"""
345
+ Args:
346
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
347
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
348
+ This is useful if you want more control over how to convert `input_ids` indices into associated vectors
349
+ than the model's internal embedding lookup matrix.
350
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
351
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
352
+
353
+ - 1 for tokens that are **not masked**,
354
+ - 0 for tokens that are **masked**.
355
+
356
+ [What are attention masks?](../glossary#attention-mask)
357
+ output_attentions (`bool`, *optional*):
358
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
359
+ returned tensors for more detail.
360
+ output_hidden_states (`bool`, *optional*):
361
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
362
+ for more detail.
363
+ return_dict (`bool`, *optional*):
364
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
365
+ """
366
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
367
+ output_hidden_states = (
368
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
369
+ )
370
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
371
+
372
+ encoder_states = () if output_hidden_states else None
373
+ all_attentions = () if output_attentions else None
374
+
375
+ hidden_states = inputs_embeds
376
+ for encoder_layer in self.layers:
377
+ if output_hidden_states:
378
+ encoder_states = encoder_states + (hidden_states,)
379
+ if self.gradient_checkpointing and self.training:
380
+ layer_outputs = self._gradient_checkpointing_func(
381
+ encoder_layer.__call__,
382
+ hidden_states,
383
+ attention_mask,
384
+ output_attentions,
385
+ )
386
+ else:
387
+ layer_outputs = encoder_layer(
388
+ hidden_states,
389
+ attention_mask,
390
+ output_attentions=output_attentions,
391
+ )
392
+
393
+ hidden_states = layer_outputs[0]
394
+
395
+ if output_attentions:
396
+ all_attentions = all_attentions + (layer_outputs[1],)
397
+
398
+ if output_hidden_states:
399
+ encoder_states = encoder_states + (hidden_states,)
400
+
401
+ if not return_dict:
402
+ return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
403
+ return BaseModelOutput(
404
+ last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
405
+ )
406
+
407
+
408
+ class SiglipVisionTransformer(nn.Module):
409
+ def __init__(self, config: SiglipVisionConfig):
410
+ super().__init__()
411
+ self.config = config
412
+ embed_dim = config.hidden_size
413
+
414
+ self.embeddings = SiglipVisionEmbeddings(config)
415
+ self.encoder = SiglipEncoder(config)
416
+ self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
417
+ self.head = SiglipMultiheadAttentionPoolingHead(config)
418
+
419
+ def forward(
420
+ self,
421
+ pixel_values,
422
+ output_attentions: Optional[bool] = None,
423
+ output_hidden_states: Optional[bool] = None,
424
+ return_dict: Optional[bool] = None,
425
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
426
+ r"""
427
+ Returns:
428
+
429
+ """
430
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
431
+ output_hidden_states = (
432
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
433
+ )
434
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
435
+
436
+ hidden_states = self.embeddings(pixel_values)
437
+
438
+ encoder_outputs = self.encoder(
439
+ inputs_embeds=hidden_states,
440
+ output_attentions=output_attentions,
441
+ output_hidden_states=output_hidden_states,
442
+ return_dict=return_dict,
443
+ )
444
+
445
+ last_hidden_state = encoder_outputs[0]
446
+ last_hidden_state = self.post_layernorm(last_hidden_state)
447
+
448
+ pooled_output = self.head(last_hidden_state)
449
+
450
+ if not return_dict:
451
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
452
+
453
+ return BaseModelOutputWithPooling(
454
+ last_hidden_state=last_hidden_state,
455
+ pooler_output=pooled_output,
456
+ hidden_states=encoder_outputs.hidden_states,
457
+ attentions=encoder_outputs.attentions,
458
+ )
459
+
460
+
461
+ class SiglipMultiheadAttentionPoolingHead(nn.Module):
462
+ """Multihead Attention Pooling."""
463
+
464
+ def __init__(self, config: SiglipVisionConfig):
465
+ super().__init__()
466
+
467
+ self.probe = nn.Parameter(torch.randn(1, 1, config.hidden_size))
468
+ self.attention = torch.nn.MultiheadAttention(config.hidden_size, config.num_attention_heads, batch_first=True)
469
+ self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
470
+ self.mlp = SiglipMLP(config)
471
+
472
+ def forward(self, hidden_state):
473
+ batch_size = hidden_state.shape[0]
474
+ probe = self.probe.repeat(batch_size, 1, 1)
475
+
476
+ hidden_state = self.attention(probe, hidden_state, hidden_state)[0]
477
+
478
+ residual = hidden_state
479
+ hidden_state = self.layernorm(hidden_state)
480
+ hidden_state = residual + self.mlp(hidden_state)
481
+
482
+ return hidden_state[:, 0]
483
+
484
+
485
+ class SiglipVisionModel(SiglipPreTrainedModel):
486
+ config_class = SiglipVisionConfig
487
+ main_input_name = "pixel_values"
488
+ _no_split_modules = ["SiglipEncoderLayer"]
489
+
490
+ def __init__(self, config: SiglipVisionConfig):
491
+ super().__init__(config)
492
+
493
+ self.vision_model = SiglipVisionTransformer(config)
494
+
495
+ # Initialize weights and apply final processing
496
+ self.post_init()
497
+
498
+ def get_input_embeddings(self) -> nn.Module:
499
+ return self.vision_model.embeddings.patch_embedding
500
+
501
+ def forward(
502
+ self,
503
+ pixel_values,
504
+ output_attentions: Optional[bool] = None,
505
+ output_hidden_states: Optional[bool] = None,
506
+ return_dict: Optional[bool] = None,
507
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
508
+ r"""
509
+ Returns:
510
+
511
+ Examples:
512
+
513
+ ```python
514
+ >>> from PIL import Image
515
+ >>> import requests
516
+ >>> from transformers import AutoProcessor, SiglipVisionModel
517
+
518
+ >>> model = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
519
+ >>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
520
+
521
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
522
+ >>> image = Image.open(requests.get(url, stream=True).raw)
523
+
524
+ >>> inputs = processor(images=image, return_tensors="pt")
525
+
526
+ >>> outputs = model(**inputs)
527
+ >>> last_hidden_state = outputs.last_hidden_state
528
+ >>> pooled_output = outputs.pooler_output # pooled features
529
+ ```"""
530
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
531
+
532
+ return self.vision_model(
533
+ pixel_values=pixel_values,
534
+ output_attentions=output_attentions,
535
+ output_hidden_states=output_hidden_states,
536
+ return_dict=return_dict,
537
+ )
538
+
539
+
540
+ # ============================================================================
541
+ # VisionTower module for Imp
542
+ # ============================================================================
543
+
544
+ class VisionTower(nn.Module):
545
+ def __init__(self, vision_tower_cfg, delay_load=False):
546
+ super().__init__()
547
+
548
+ self.is_loaded = False
549
+
550
+ self.config = vision_tower_cfg
551
+ self.vision_tower_name = vision_tower_cfg.mm_vision_tower
552
+ self.select_layer = vision_tower_cfg.mm_vision_select_layer
553
+ # self.select_feature = getattr(vision_tower_cfg, 'mm_vision_select_feature', 'patch')
554
+
555
+ self.image_processor = simple_image_processor
556
+
557
+ if not delay_load:
558
+ self.load_model()
559
+ else:
560
+ raise NotImplementedError("delay load is not implemented yet.")
561
+
562
+ def load_model(self):
563
+ if self.is_loaded:
564
+ return
565
+
566
+ # "google/siglip-so400m-patch14-384"
567
+ # self.vision_tower = SiglipVisionModel.from_pretrained(self.vision_tower_name)
568
+ self.vision_tower = SiglipVisionModel(self.config)
569
+ del self.vision_tower.vision_model.encoder.layers[(self.select_layer + 1):]
570
+ self.vision_tower.vision_model.head = nn.Identity()
571
+ self.vision_tower.vision_model.post_layernorm=nn.Identity()
572
+ self.vision_tower.requires_grad_(False)
573
+ self.vision_tower.eval()
574
+
575
+ self.is_loaded = True
576
+
577
+ @torch.no_grad()
578
+ def forward(self, images):
579
+ if type(images) is list:
580
+ image_features = []
581
+ for image in images:
582
+ image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
583
+ image_feature = image_forward_out.hidden_states[-1].to(image.dtype)
584
+ assert image_feature.shape[-2] == 729
585
+ image_features.append(image_feature)
586
+ else:
587
+ image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
588
+ image_features = image_forward_outs.hidden_states[-1].to(images.dtype)
589
+ assert image_features.shape[-2] == 729
590
+
591
+ return image_features
592
+
593
+ @property
594
+ def dummy_feature(self):
595
+ return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
596
+
597
+ @property
598
+ def dtype(self):
599
+ for p in self.vision_tower.parameters():
600
+ return p.dtype
601
+
602
+ @property
603
+ def device(self):
604
+ for p in self.vision_tower.parameters():
605
+ return p.device
606
+
607
+ @property
608
+ def hidden_size(self):
609
+ return self.config.hidden_size
610
+
611
+ @property
612
+ def num_patches(self):
613
+ return (self.config.image_size // self.config.patch_size) ** 2
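`simple_image_processor` pads the image to a square with the mean color, resizes it to 384x384, and normalizes it into [-1, 1]; `VisionTower` then returns the hidden states of the selected SigLIP layer, i.e. 729 (27x27) patch features per image, which is what the assertion in `forward` checks. A minimal sketch of the preprocessing step alone, assuming this module is importable:

```python
from PIL import Image

# Assumes `simple_image_processor` from this file is importable, e.g.:
# from vision_encoder import simple_image_processor

img = Image.new("RGB", (640, 480), color=(120, 180, 60))    # dummy non-square image
batch = simple_image_processor(img)                          # pads, resizes, normalizes
pixel_values = batch["pixel_values"]
print(pixel_values.shape)                                    # torch.Size([1, 3, 384, 384])
print(pixel_values.min().item(), pixel_values.max().item())  # values lie within [-1, 1]

# With a constructed VisionTower (its config comes from the parent Imp model),
# the forward pass would yield one feature per 14x14 patch:
# feats = vision_tower(pixel_values)   # shape [1, 729, hidden_size]
```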
vocab.json ADDED
The diff for this file is too large to render. See raw diff