xiezhe24 committed on
Commit
7923a5a
1 Parent(s): f21a6af

Update model.

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
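The hunk above adds `tokenizer.json` to the set of paths handled by Git LFS. As a rough illustration of which file names those patterns catch, here is a minimal sketch that approximates gitattributes matching with Python's `fnmatch`. This is an assumption made for illustration only: real gitattributes matching follows gitignore-style rules, and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# Patterns copied from the .gitattributes lines shown in this hunk.
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "tokenizer.json"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does a top-level file name match any LFS pattern?"""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


if __name__ == "__main__":
    for name in ["tokenizer.json", "config.json", "weights.zip"]:
        print(f"{name}: {is_lfs_tracked(name)}")
```

For an authoritative answer on a real checkout, `git check-attr filter -- <path>` reports the attribute Git actually applies.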
LICENSE ADDED
@@ -0,0 +1,202 @@
1
+
2
+ Apache License
3
+ Version 2.0, January 2004
4
+ http://www.apache.org/licenses/
5
+
6
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7
+
8
+ 1. Definitions.
9
+
10
+ "License" shall mean the terms and conditions for use, reproduction,
11
+ and distribution as defined by Sections 1 through 9 of this document.
12
+
13
+ "Licensor" shall mean the copyright owner or entity authorized by
14
+ the copyright owner that is granting the License.
15
+
16
+ "Legal Entity" shall mean the union of the acting entity and all
17
+ other entities that control, are controlled by, or are under common
18
+ control with that entity. For the purposes of this definition,
19
+ "control" means (i) the power, direct or indirect, to cause the
20
+ direction or management of such entity, whether by contract or
21
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
22
+ outstanding shares, or (iii) beneficial ownership of such entity.
23
+
24
+ "You" (or "Your") shall mean an individual or Legal Entity
25
+ exercising permissions granted by this License.
26
+
27
+ "Source" form shall mean the preferred form for making modifications,
28
+ including but not limited to software source code, documentation
29
+ source, and configuration files.
30
+
31
+ "Object" form shall mean any form resulting from mechanical
32
+ transformation or translation of a Source form, including but
33
+ not limited to compiled object code, generated documentation,
34
+ and conversions to other media types.
35
+
36
+ "Work" shall mean the work of authorship, whether in Source or
37
+ Object form, made available under the License, as indicated by a
38
+ copyright notice that is included in or attached to the work
39
+ (an example is provided in the Appendix below).
40
+
41
+ "Derivative Works" shall mean any work, whether in Source or Object
42
+ form, that is based on (or derived from) the Work and for which the
43
+ editorial revisions, annotations, elaborations, or other modifications
44
+ represent, as a whole, an original work of authorship. For the purposes
45
+ of this License, Derivative Works shall not include works that remain
46
+ separable from, or merely link (or bind by name) to the interfaces of,
47
+ the Work and Derivative Works thereof.
48
+
49
+ "Contribution" shall mean any work of authorship, including
50
+ the original version of the Work and any modifications or additions
51
+ to that Work or Derivative Works thereof, that is intentionally
52
+ submitted to Licensor for inclusion in the Work by the copyright owner
53
+ or by an individual or Legal Entity authorized to submit on behalf of
54
+ the copyright owner. For the purposes of this definition, "submitted"
55
+ means any form of electronic, verbal, or written communication sent
56
+ to the Licensor or its representatives, including but not limited to
57
+ communication on electronic mailing lists, source code control systems,
58
+ and issue tracking systems that are managed by, or on behalf of, the
59
+ Licensor for the purpose of discussing and improving the Work, but
60
+ excluding communication that is conspicuously marked or otherwise
61
+ designated in writing by the copyright owner as "Not a Contribution."
62
+
63
+ "Contributor" shall mean Licensor and any individual or Legal Entity
64
+ on behalf of whom a Contribution has been received by Licensor and
65
+ subsequently incorporated within the Work.
66
+
67
+ 2. Grant of Copyright License. Subject to the terms and conditions of
68
+ this License, each Contributor hereby grants to You a perpetual,
69
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70
+ copyright license to reproduce, prepare Derivative Works of,
71
+ publicly display, publicly perform, sublicense, and distribute the
72
+ Work and such Derivative Works in Source or Object form.
73
+
74
+ 3. Grant of Patent License. Subject to the terms and conditions of
75
+ this License, each Contributor hereby grants to You a perpetual,
76
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77
+ (except as stated in this section) patent license to make, have made,
78
+ use, offer to sell, sell, import, and otherwise transfer the Work,
79
+ where such license applies only to those patent claims licensable
80
+ by such Contributor that are necessarily infringed by their
81
+ Contribution(s) alone or by combination of their Contribution(s)
82
+ with the Work to which such Contribution(s) was submitted. If You
83
+ institute patent litigation against any entity (including a
84
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
85
+ or a Contribution incorporated within the Work constitutes direct
86
+ or contributory patent infringement, then any patent licenses
87
+ granted to You under this License for that Work shall terminate
88
+ as of the date such litigation is filed.
89
+
90
+ 4. Redistribution. You may reproduce and distribute copies of the
91
+ Work or Derivative Works thereof in any medium, with or without
92
+ modifications, and in Source or Object form, provided that You
93
+ meet the following conditions:
94
+
95
+ (a) You must give any other recipients of the Work or
96
+ Derivative Works a copy of this License; and
97
+
98
+ (b) You must cause any modified files to carry prominent notices
99
+ stating that You changed the files; and
100
+
101
+ (c) You must retain, in the Source form of any Derivative Works
102
+ that You distribute, all copyright, patent, trademark, and
103
+ attribution notices from the Source form of the Work,
104
+ excluding those notices that do not pertain to any part of
105
+ the Derivative Works; and
106
+
107
+ (d) If the Work includes a "NOTICE" text file as part of its
108
+ distribution, then any Derivative Works that You distribute must
109
+ include a readable copy of the attribution notices contained
110
+ within such NOTICE file, excluding those notices that do not
111
+ pertain to any part of the Derivative Works, in at least one
112
+ of the following places: within a NOTICE text file distributed
113
+ as part of the Derivative Works; within the Source form or
114
+ documentation, if provided along with the Derivative Works; or,
115
+ within a display generated by the Derivative Works, if and
116
+ wherever such third-party notices normally appear. The contents
117
+ of the NOTICE file are for informational purposes only and
118
+ do not modify the License. You may add Your own attribution
119
+ notices within Derivative Works that You distribute, alongside
120
+ or as an addendum to the NOTICE text from the Work, provided
121
+ that such additional attribution notices cannot be construed
122
+ as modifying the License.
123
+
124
+ You may add Your own copyright statement to Your modifications and
125
+ may provide additional or different license terms and conditions
126
+ for use, reproduction, or distribution of Your modifications, or
127
+ for any such Derivative Works as a whole, provided Your use,
128
+ reproduction, and distribution of the Work otherwise complies with
129
+ the conditions stated in this License.
130
+
131
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
132
+ any Contribution intentionally submitted for inclusion in the Work
133
+ by You to the Licensor shall be under the terms and conditions of
134
+ this License, without any additional terms or conditions.
135
+ Notwithstanding the above, nothing herein shall supersede or modify
136
+ the terms of any separate license agreement you may have executed
137
+ with Licensor regarding such Contributions.
138
+
139
+ 6. Trademarks. This License does not grant permission to use the trade
140
+ names, trademarks, service marks, or product names of the Licensor,
141
+ except as required for reasonable and customary use in describing the
142
+ origin of the Work and reproducing the content of the NOTICE file.
143
+
144
+ 7. Disclaimer of Warranty. Unless required by applicable law or
145
+ agreed to in writing, Licensor provides the Work (and each
146
+ Contributor provides its Contributions) on an "AS IS" BASIS,
147
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148
+ implied, including, without limitation, any warranties or conditions
149
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150
+ PARTICULAR PURPOSE. You are solely responsible for determining the
151
+ appropriateness of using or redistributing the Work and assume any
152
+ risks associated with Your exercise of permissions under this License.
153
+
154
+ 8. Limitation of Liability. In no event and under no legal theory,
155
+ whether in tort (including negligence), contract, or otherwise,
156
+ unless required by applicable law (such as deliberate and grossly
157
+ negligent acts) or agreed to in writing, shall any Contributor be
158
+ liable to You for damages, including any direct, indirect, special,
159
+ incidental, or consequential damages of any character arising as a
160
+ result of this License or out of the use or inability to use the
161
+ Work (including but not limited to damages for loss of goodwill,
162
+ work stoppage, computer failure or malfunction, or any and all
163
+ other commercial damages or losses), even if such Contributor
164
+ has been advised of the possibility of such damages.
165
+
166
+ 9. Accepting Warranty or Additional Liability. While redistributing
167
+ the Work or Derivative Works thereof, You may choose to offer,
168
+ and charge a fee for, acceptance of support, warranty, indemnity,
169
+ or other liability obligations and/or rights consistent with this
170
+ License. However, in accepting such obligations, You may act only
171
+ on Your own behalf and on Your sole responsibility, not on behalf
172
+ of any other Contributor, and only if You agree to indemnify,
173
+ defend, and hold each Contributor harmless for any liability
174
+ incurred by, or claims asserted against, such Contributor by reason
175
+ of your accepting any such warranty or additional liability.
176
+
177
+ END OF TERMS AND CONDITIONS
178
+
179
+ APPENDIX: How to apply the Apache License to your work.
180
+
181
+ To apply the Apache License to your work, attach the following
182
+ boilerplate notice, with the fields enclosed by brackets "[]"
183
+ replaced with your own identifying information. (Don't include
184
+ the brackets!) The text should be enclosed in the appropriate
185
+ comment syntax for the file format. We also recommend that a
186
+ file or class name and description of purpose be included on the
187
+ same "printed page" as the copyright notice for easier
188
+ identification within third-party archives.
189
+
190
+ Copyright 2024 Alibaba Cloud
191
+
192
+ Licensed under the Apache License, Version 2.0 (the "License");
193
+ you may not use this file except in compliance with the License.
194
+ You may obtain a copy of the License at
195
+
196
+ http://www.apache.org/licenses/LICENSE-2.0
197
+
198
+ Unless required by applicable law or agreed to in writing, software
199
+ distributed under the License is distributed on an "AS IS" BASIS,
200
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201
+ See the License for the specific language governing permissions and
202
+ limitations under the License.
NOTICE ADDED
@@ -0,0 +1,13 @@
1
+ Copyright 2024 Alibaba Cloud. All rights reserved.
2
+ This software contains code that was originally developed and copyrighted by Alibaba Cloud. The original code is subject to the terms and conditions of the Apache License (Version 2.0), which can be found in the accompanying LICENSE file.
3
+ ByteDance and Tsinghua University have made modifications and enhancements to the original code. The modifications are as follows:
4
+ - Fine-tuned the model on the Qwen2.5-14B-Instruct model for ChatTS.
5
+ - Modified `modeling_qwen2.py` and `configuration_qwen2.py` for the ChatTS model.
6
+ - Modified the `README.md` file to provide some information about the usage of the modified model.
7
+
8
+ Please note that any distribution of this software must include this NOTICE file intact, along with the original LICENSE file and any other relevant license information, to ensure compliance with all applicable copyright and licensing requirements.
9
+ ByteDance and Tsinghua University
10
+
11
+ December 2024
12
+
13
+ This NOTICE is provided to clarify the copyright status and licensing of the software, ensuring that all users and distributors are aware of their rights and obligations.
README.md CHANGED
@@ -1,3 +1,21 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ # ChatTS-14B Model
2
+ This model is fine-tuned from Qwen2.5-14B-Instruct (https://huggingface.co/Qwen/Qwen2.5-14B-Instruct). For more usage details, please refer to the `README.md` in the ChatTS repository.
3
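+ **Due to repository size limits, this repository does not include the model weight files themselves, only the code files the model requires and the LICENSE. For the weight files, see: []**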
4
+
5
+ # Reference
6
+ - Qwen2.5-14B-Instruct (https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
7
+ - transformers (https://github.com/huggingface/transformers.git)
8
+ - [ChatTS Paper](https://arxiv.org/pdf/2412.03104)
9
+
10
+ # License
11
+ This model is licensed under the [Apache License 2.0](LICENSE).
12
+
13
+ # Cite
14
+ ```
15
+ @article{xie2024chatts,
16
+ title={ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning},
17
+ author={Xie, Zhe and Li, Zeyan and He, Xiao and Xu, Longlong and Wen, Xidao and Zhang, Tieying and Chen, Jianjun and Shi, Rui and Pei, Dan},
18
+ journal={arXiv preprint arXiv:2412.03104},
19
+ year={2024}
20
+ }
21
+ ```
added_tokens.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<ts/>": 151666,
5
+ "<ts>": 151665,
6
+ "<|box_end|>": 151649,
7
+ "<|box_start|>": 151648,
8
+ "<|endoftext|>": 151643,
9
+ "<|file_sep|>": 151664,
10
+ "<|fim_middle|>": 151660,
11
+ "<|fim_pad|>": 151662,
12
+ "<|fim_prefix|>": 151659,
13
+ "<|fim_suffix|>": 151661,
14
+ "<|im_end|>": 151645,
15
+ "<|im_start|>": 151644,
16
+ "<|image_pad|>": 151655,
17
+ "<|object_ref_end|>": 151647,
18
+ "<|object_ref_start|>": 151646,
19
+ "<|quad_end|>": 151651,
20
+ "<|quad_start|>": 151650,
21
+ "<|repo_name|>": 151663,
22
+ "<|video_pad|>": 151656,
23
+ "<|vision_end|>": 151653,
24
+ "<|vision_pad|>": 151654,
25
+ "<|vision_start|>": 151652
26
+ }
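In the file above, the two time-series markers `<ts>` and `<ts/>` take the IDs immediately after `<|file_sep|>`. The sketch below checks that on an excerpt of the mapping and compares the highest ID with the `vocab_size` of 152064 from `config.json` in this commit. The helper name `summarize` is hypothetical, and the result only counts the IDs above the highest added token; it makes no claim about how the model uses them.

```python
import json

# Excerpt of added_tokens.json from this commit (IDs copied verbatim).
ADDED_TOKENS = json.loads("""
{
  "<|endoftext|>": 151643,
  "<|im_start|>": 151644,
  "<|im_end|>": 151645,
  "<|file_sep|>": 151664,
  "<ts>": 151665,
  "<ts/>": 151666
}
""")

VOCAB_SIZE = 152064  # "vocab_size" from config.json in this commit


def summarize(tokens: dict, vocab_size: int) -> dict:
    """Return the highest-ID token and how many IDs lie above it."""
    highest_token, highest_id = max(tokens.items(), key=lambda item: item[1])
    return {
        "highest_token": highest_token,
        "highest_id": highest_id,
        "ids_above_highest": vocab_size - (highest_id + 1),
    }


if __name__ == "__main__":
    print(summarize(ADDED_TOKENS, VOCAB_SIZE))
```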
config.json ADDED
@@ -0,0 +1,43 @@
1
+ {
2
+ "_name_or_path": "/mnt/bn/mllmhl/sft_checkpoints/qwen2.5-14b-ts-explaints-1124-stage1-sp/checkpoint-400",
3
+ "architectures": [
4
+ "Qwen2TSForCausalLM"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_qwen2.Qwen2TSConfig",
9
+ "AutoModel": "modeling_qwen2.Qwen2TSForCausalLM",
10
+ "AutoModelForCausalLM": "modeling_qwen2.Qwen2TSForCausalLM"
11
+ },
12
+ "bos_token_id": 151643,
13
+ "eos_token_id": 151645,
14
+ "hidden_act": "silu",
15
+ "hidden_size": 5120,
16
+ "ignore_index": -100,
17
+ "initializer_range": 0.02,
18
+ "intermediate_size": 13824,
19
+ "max_position_embeddings": 32768,
20
+ "max_window_layers": 70,
21
+ "model_type": "qwen2",
22
+ "num_attention_heads": 40,
23
+ "num_hidden_layers": 48,
24
+ "num_key_value_heads": 8,
25
+ "pad_token_id": 151643,
26
+ "rms_norm_eps": 1e-06,
27
+ "rope_theta": 1000000.0,
28
+ "sliding_window": 131072,
29
+ "tie_word_embeddings": false,
30
+ "torch_dtype": "float16",
31
+ "transformers_version": "4.46.2",
32
+ "ts": {
33
+ "hidden_size": 5120,
34
+ "num_features": 2,
35
+ "num_layers": 5,
36
+ "patch_size": 16
37
+ },
38
+ "ts_token_end_index": 151665,
39
+ "ts_token_start_index": 151666,
40
+ "use_cache": false,
41
+ "use_sliding_window": false,
42
+ "vocab_size": 152064
43
+ }
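A few attention shapes follow from the values in this `config.json`. The sketch below derives them from an excerpt of the file, using the MHA/MQA/GQA rule that the `num_key_value_heads` docstring in `configuration_qwen2.py` states. The helper names are hypothetical, and `head_dim = hidden_size // num_attention_heads` is the usual Qwen2 convention, assumed here rather than read from the modeling code.

```python
import json

# Excerpt of config.json from this commit (values copied verbatim).
CONFIG = json.loads("""
{
  "hidden_size": 5120,
  "num_attention_heads": 40,
  "num_key_value_heads": 8,
  "num_hidden_layers": 48
}
""")


def attention_kind(num_heads: int, num_kv_heads: int) -> str:
    """Classify the attention variant as the config docstring describes it."""
    if num_kv_heads == num_heads:
        return "MHA"
    if num_kv_heads == 1:
        return "MQA"
    return "GQA"


def derived_shapes(cfg: dict) -> dict:
    """Compute per-head sizes implied by the config (assumed convention)."""
    heads = cfg["num_attention_heads"]
    kv_heads = cfg["num_key_value_heads"]
    head_dim = cfg["hidden_size"] // heads
    return {
        "attention": attention_kind(heads, kv_heads),
        "head_dim": head_dim,
        "queries_per_kv_head": heads // kv_heads,
        "kv_projection_width": kv_heads * head_dim,
    }


if __name__ == "__main__":
    print(derived_shapes(CONFIG))
```

With 40 query heads sharing 8 key/value heads, each key/value head serves 5 query heads, which is the grouped-query case.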
configuration_qwen2.py ADDED
@@ -0,0 +1,367 @@
1
+ # coding=utf-8
2
+ # The following code is reused from the Qwen project (https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) of Alibaba Cloud.
3
+ # Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ # The code is modified by ByteDance and Tsinghua University from the original implementation of Qwen:
18
+ # - We changed Qwen2Config to Qwen2TSConfig to support time series modeling.
19
+
20
+ """ Qwen2 model configuration"""
21
+
22
+ from transformers import PretrainedConfig
23
+ from transformers.utils import logging
24
+ from typing import *
25
+
26
+
27
+ logger = logging.get_logger(__name__)
28
+
29
+
30
+ class Qwen2TSConfig(PretrainedConfig):
31
+ r"""
32
+ This is the configuration class to store the configuration of a [`Qwen2Model`]. It is used to instantiate a
33
+ Qwen2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
34
+ with the defaults will yield a similar configuration to that of
35
+ Qwen2-7B-beta [Qwen/Qwen2-7B-beta](https://huggingface.co/Qwen/Qwen2-7B-beta).
36
+
37
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
38
+ documentation from [`PretrainedConfig`] for more information.
39
+
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 151936):
43
+ Vocabulary size of the Qwen2 model. Defines the number of different tokens that can be represented by the
44
+ `input_ids` passed when calling [`Qwen2Model`].
45
+ hidden_size (`int`, *optional*, defaults to 4096):
46
+ Dimension of the hidden representations.
47
+ intermediate_size (`int`, *optional*, defaults to 22016):
48
+ Dimension of the MLP representations.
49
+ num_hidden_layers (`int`, *optional*, defaults to 32):
50
+ Number of hidden layers in the Transformer encoder.
51
+ num_attention_heads (`int`, *optional*, defaults to 32):
52
+ Number of attention heads for each attention layer in the Transformer encoder.
53
+ num_key_value_heads (`int`, *optional*, defaults to 32):
54
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
55
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
56
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
57
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
58
+ by mean-pooling all the original heads within that group. For more details, check out [this
59
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `32`.
60
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
61
+ The non-linear activation function (function or string) in the decoder.
62
+ max_position_embeddings (`int`, *optional*, defaults to 32768):
63
+ The maximum sequence length that this model might ever be used with.
64
+ initializer_range (`float`, *optional*, defaults to 0.02):
65
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
66
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
67
+ The epsilon used by the rms normalization layers.
68
+ use_cache (`bool`, *optional*, defaults to `True`):
69
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
70
+ relevant if `config.is_decoder=True`.
71
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
72
+ Whether the model's input and output word embeddings should be tied.
73
+ rope_theta (`float`, *optional*, defaults to 10000.0):
74
+ The base period of the RoPE embeddings.
75
+ use_sliding_window (`bool`, *optional*, defaults to `False`):
76
+ Whether to use sliding window attention.
77
+ sliding_window (`int`, *optional*, defaults to 4096):
78
+ Sliding window attention (SWA) window size. If not specified, will default to `4096`.
79
+ max_window_layers (`int`, *optional*, defaults to 28):
80
+ The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.
81
+ attention_dropout (`float`, *optional*, defaults to 0.0):
82
+ The dropout ratio for the attention probabilities.
83
+
84
+ ```python
85
+ >>> from transformers import Qwen2Model, Qwen2Config
86
+
87
+ >>> # Initializing a Qwen2 style configuration
88
+ >>> configuration = Qwen2Config()
89
+
90
+ >>> # Initializing a model from the Qwen2-7B style configuration
91
+ >>> model = Qwen2Model(configuration)
92
+
93
+ >>> # Accessing the model configuration
94
+ >>> configuration = model.config
95
+ ```"""
96
+
97
+ model_type = "qwen2"
98
+ keys_to_ignore_at_inference = ["past_key_values"]
99
+
100
+ def __init__(
101
+ self,
102
+ vocab_size=151936,
103
+ hidden_size=4096,
104
+ intermediate_size=22016,
105
+ num_hidden_layers=32,
106
+ num_attention_heads=32,
107
+ num_key_value_heads=32,
108
+ hidden_act="silu",
109
+ max_position_embeddings=32768,
110
+ initializer_range=0.02,
111
+ rms_norm_eps=1e-6,
112
+ use_cache=True,
113
+ tie_word_embeddings=False,
114
+ rope_theta=10000.0,
115
+ use_sliding_window=False,
116
+ sliding_window=4096,
117
+ max_window_layers=28,
118
+ attention_dropout=0.0,
119
+ **kwargs,
120
+ ):
121
+ self.vocab_size = vocab_size
122
+ self.max_position_embeddings = max_position_embeddings
123
+ self.hidden_size = hidden_size
124
+ self.intermediate_size = intermediate_size
125
+ self.num_hidden_layers = num_hidden_layers
126
+ self.num_attention_heads = num_attention_heads
127
+ self.use_sliding_window = use_sliding_window
128
+ self.sliding_window = sliding_window
129
+ self.max_window_layers = max_window_layers
130
+
131
+ # for backward compatibility
132
+ if num_key_value_heads is None:
133
+ num_key_value_heads = num_attention_heads
134
+
135
+ self.num_key_value_heads = num_key_value_heads
136
+ self.hidden_act = hidden_act
137
+ self.initializer_range = initializer_range
138
+ self.rms_norm_eps = rms_norm_eps
139
+ self.use_cache = use_cache
140
+ self.rope_theta = rope_theta
141
+ self.attention_dropout = attention_dropout
142
+
143
+ super().__init__(
144
+ tie_word_embeddings=tie_word_embeddings,
145
+ **kwargs,
146
+ )
147
+
148
+ TINYTIMEMIXER_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
149
+
150
+
151
+ class TinyTimeMixerConfig(PretrainedConfig):
152
+ r"""
153
+ This is the configuration class to store the configuration of a [`TinyTimeMixerModel`]. It is used to instantiate a
154
+ TinyTimeMixer model according to the specified arguments, defining the model architecture. Instantiating a
155
+ configuration with the defaults will yield a similar configuration to that of the TinyTimeMixer {} architecture.
156
+
157
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
158
+ documentation from [`PretrainedConfig`] for more information.
159
+
160
+ Args:
161
+ context_length (`int`, *optional*, defaults to 64):
162
+ The context/history length for the input sequence.
163
+ patch_length (`int`, *optional*, defaults to 8):
164
+ The patch length for the input sequence.
165
+ num_input_channels (`int`):
166
+ Number of input variates. For Univariate, set it to 1.
167
+ patch_stride (`int`, *optional*, defaults to 8):
168
+ Amount of points to stride. If its value is same as patch_length, we get non-overlapping patches.
169
+ d_model (`int`, *optional*, defaults to 16):
170
+ Hidden feature size of the model.
171
+ prediction_length (`int`, *optional*, defaults to 16):
172
+ Number of time steps to forecast for a forecasting task. Also known as the Forecast Horizon.
173
+ expansion_factor (`int`, *optional*, defaults to 2):
174
+ Expansion factor to use inside MLP. Recommended range is 2-5. Larger value indicates more complex model.
175
+ num_layers (`int`, *optional*, defaults to 3):
176
+ Number of layers to use. Recommended range is 3-15. Larger value indicates more complex model.
177
+ dropout (`float`, *optional*, defaults to 0.2):
178
+ The dropout probability for the `TinyTimeMixer` backbone. Recommended range is 0.2-0.7.
179
+ mode (`str`, *optional*, defaults to `"common_channel"`):
180
+ Mixer Mode. Determines how to process the channels. Allowed values: "common_channel", "mix_channel". In
181
+ "common_channel" mode, we follow Channel-independent modelling with no explicit channel-mixing. Channel
182
+ mixing happens in an implicit manner via shared weights across channels. (preferred first approach) In
183
+ "mix_channel" mode, we follow explicit channel-mixing in addition to patch and feature mixer. (preferred
184
+ approach when channel correlations are very important to model)
185
+ gated_attn (`bool`, *optional*, defaults to `True`):
186
+ Enable Gated Attention.
187
+ norm_mlp (`str`, *optional*, defaults to `"LayerNorm"`):
188
+ Normalization layer (BatchNorm or LayerNorm).
189
+ self_attn (`bool`, *optional*, defaults to `False`):
190
+ Enable Tiny self attention across patches. This can be enabled when the output of Vanilla TinyTimeMixer with
191
+ gated attention is not satisfactory. Enabling this leads to explicit pair-wise attention and modelling
192
+ across patches.
193
+ self_attn_heads (`int`, *optional*, defaults to 1):
194
+ Number of self-attention heads. Works only when `self_attn` is set to `True`.
195
+ use_positional_encoding (`bool`, *optional*, defaults to `False`):
196
+ Enable the use of positional embedding for the tiny self-attention layers. Works only when `self_attn` is
197
+ set to `True`.
198
+ positional_encoding_type (`str`, *optional*, defaults to `"sincos"`):
199
+ Positional encodings. Options `"random"` and `"sincos"` are supported. Works only when
200
+ `use_positional_encoding` is set to `True`
201
+ scaling (`string` or `bool`, *optional*, defaults to `"std"`):
202
+ Whether to scale the input targets via "mean" scaler, "std" scaler or no scaler if `None`. If `True`, the
203
+ scaler is set to "mean".
204
+ loss (`string`, *optional*, defaults to `"mse"`):
205
+ The loss function for the model. Defaults to mean squared error "mse". Allowed values: ["mse", "mae"]
206
+ init_std (`float`, *optional*, defaults to 0.02):
207
+ The standard deviation of the truncated normal weight initialization distribution.
208
+ post_init (`bool`, *optional*, defaults to `False`):
209
+ Whether to use custom weight initialization from `transformers` library, or the default initialization in
210
+ `PyTorch`. Setting it to `False` performs `PyTorch` weight initialization.
211
+ norm_eps (`float`, *optional*, defaults to 1e-05):
212
+ A value added to the denominator for numerical stability of normalization.
213
+ adaptive_patching_levels (`int`, *optional*, defaults to 0):
214
+ If adaptive_patching_levels is i, then we will have i levels with each level having n_layers.
215
+ Level id starts with 0. num_patches at level i will be multiplied by (2^i) and num_features at level i will be divided by (2^i).
216
+ For example, if adaptive_patching_levels is 3, then we will have 3 levels:
217
+ level 2: num_features//(2^2), num_patches*(2^2)
218
+ level 1: num_features//(2^1), num_patches*(2^1)
219
+ level 0: num_features//(2^0), num_patches*(2^0)
220
+ adaptive_patching_levels = 1 is the same as a one-level PatchTSMixer. This module is disabled when adaptive_patching_levels is 0 or a negative value. Defaults to 0 (off mode).
221
+ resolution_prefix_tuning (`bool`, *optional*, defaults to `False`):
222
+ Enable if your dataloader has time resolution information as defined in `get_freq_mapping` function in `modelling_tinytimemixer`.
223
+ frequency_token_vocab_size (`int`, *optional*, defaults to 5):
224
+ Vocab size to use when resolution_prefix_tuning is enabled.
225
+ head_dropout (`float`, *optional*, defaults to 0.2):
226
+ The dropout probability for the `TinyTimeMixer` head.
227
+ prediction_channel_indices (`list`, *optional*):
228
+ List of channel indices to forecast. If None, forecast all channels. Target data is expected to have all
229
+ channels and we explicitly filter the channels in prediction and target before loss computation. Please provide the indices
230
+ in sorted ascending order.
231
+ decoder_num_layers (`int`, *optional*, defaults to 8):
232
+ Number of layers to use in decoder
233
+ decoder_d_model (`int`, *optional*, defaults to 8):
234
+ Defines the hidden feature size of the decoder.
235
+ decoder_adaptive_patching_levels (`int`, *optional*, defaults to 0):
236
+ Adaptive Patching levels for decoder. Preferable to set it to 0 for decoder to keep it light weight.
237
+ decoder_raw_residual (`bool`, *optional*, defaults to `False`):
238
+ Flag to enable merging of raw embedding with encoder embedding for decoder input. Defaults to False.
239
+ decoder_mode (`string`, *optional*, defaults to `"common_channel"`):
240
+ Decoder channel mode. Use `"common_channel"` for channel-independent modelling and `"mix_channel"` for channel-mixing modelling.
241
+ use_decoder (`bool`, *optional*, defaults to `True`):
242
+ Enable to use decoder.
243
+ prediction_filter_length (`int`, *optional*, defaults to `None`):
244
+ Actual length in the prediction output to use for loss calculations.
245
+
246
+
247
+ Example:
248
+
249
+ ```python
250
+ >>> from transformers import TinyTimeMixerConfig, TinyTimeMixerModel
251
+
252
+ >>> # Initializing a default TinyTimeMixer configuration
253
+ >>> configuration = TinyTimeMixerConfig()
254
+
255
+ >>> # Randomly initializing a model (with random weights) from the configuration
256
+ >>> model = TinyTimeMixerModel(configuration)
257
+
258
+ >>> # Accessing the model configuration
259
+ >>> configuration = model.config
260
+ ```"""
261
+
262
+ model_type = "tinytimemixer"
263
+ attribute_map = {
264
+ "hidden_size": "d_model",
265
+ "num_hidden_layers": "num_layers",
266
+ }
267
+
268
+ def __init__(
269
+ self,
270
+ # Time series specific configuration
271
+ context_length: int = 64,
272
+ patch_length: int = 8,
273
+ num_input_channels: int = 1,
274
+ prediction_length: int = 16,
275
+ patch_stride: int = 8,
276
+ prediction_channel_indices: Optional[list] = None,
277
+ # General model configuration
278
+ d_model: int = 16,
279
+ expansion_factor: int = 2,
280
+ num_layers: int = 3,
281
+ dropout: float = 0.2,
282
+ mode: str = "common_channel",
283
+ gated_attn: bool = True,
284
+ norm_mlp: str = "LayerNorm",
285
+ self_attn: bool = False,
286
+ self_attn_heads: int = 1,
287
+ use_positional_encoding: bool = False,
288
+ positional_encoding_type: str = "sincos",
289
+ scaling: Optional[Union[str, bool]] = "std",
290
+ loss: str = "mse",
291
+ init_std: float = 0.02,
292
+ post_init: bool = False,
293
+ norm_eps: float = 1e-5,
294
+ adaptive_patching_levels: int = 0,
295
+ resolution_prefix_tuning: bool = False,
296
+ frequency_token_vocab_size: int = 5,
297
+ # General head configuration
298
+ head_dropout: float = 0.2,
299
+ # decoder parameters
300
+ decoder_num_layers: int = 8,
301
+ decoder_d_model: int = 8,
302
+ decoder_adaptive_patching_levels: int = 0,
303
+ decoder_raw_residual: bool = False,
304
+ decoder_mode: str = "common_channel",
305
+ use_decoder: bool = True,
306
+ # prediction length filtering
307
+ prediction_filter_length: Optional[int] = None,
308
+ **kwargs,
309
+ ):
310
+ self.num_input_channels = num_input_channels
311
+ self.context_length = context_length
312
+ self.patch_length = patch_length
313
+ self.expansion_factor = expansion_factor
314
+ self.num_layers = num_layers
315
+ self.dropout = dropout
316
+ self.mode = mode
317
+ self.gated_attn = gated_attn
318
+ self.norm_mlp = norm_mlp
319
+ self.scaling = scaling
320
+ self.head_dropout = head_dropout
321
+
322
+ self.patch_last = True
323
+ self.use_positional_encoding = use_positional_encoding
324
+ self.positional_encoding_type = positional_encoding_type
325
+ self.prediction_length = prediction_length
326
+ self.prediction_channel_indices = prediction_channel_indices
327
+ self.self_attn = self_attn
328
+ self.self_attn_heads = self_attn_heads
329
+ self.init_std = init_std
330
+ self.post_init = post_init
331
+ self.loss = loss
332
+ self.norm_eps = norm_eps
333
+
334
+ self.use_decoder = use_decoder
335
+
336
+ self.adaptive_patching_levels = adaptive_patching_levels
337
+ self.resolution_prefix_tuning = resolution_prefix_tuning
338
+ self.decoder_num_layers = decoder_num_layers
339
+ self.decoder_adaptive_patching_levels = decoder_adaptive_patching_levels
340
+ self.decoder_raw_residual = decoder_raw_residual
341
+ self.decoder_mode = decoder_mode
342
+ self.frequency_token_vocab_size = frequency_token_vocab_size
343
+ self.d_model = d_model
344
+ self.patch_stride = patch_stride
345
+ self.decoder_d_model = decoder_d_model
346
+ self.init_processing = False
347
+ self.prediction_filter_length = prediction_filter_length
348
+
349
+ super().__init__(**kwargs)
350
+
351
+ def check_and_init_preprocessing(self):
352
+ self.init_processing = True
353
+
354
+ if not hasattr(self, "num_patches"):
355
+ self.num_patches = (
356
+ max(self.context_length, self.patch_length) - self.patch_length
357
+ ) // self.patch_stride + 1
358
+
359
+ if self.resolution_prefix_tuning:
360
+ self.num_patches += 1
361
+
362
+ if self.prediction_filter_length is not None:
363
+ if self.prediction_filter_length > self.prediction_length or self.prediction_filter_length <= 0:
364
+ raise ValueError("prediction_filter_length should be positive and less than or equal to prediction_length")
365
+
366
+ if self.prediction_channel_indices is not None:
367
+ self.prediction_channel_indices.sort()
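The patch count derived in `check_and_init_preprocessing` can be checked by hand. The following is a minimal sketch, assuming a hypothetical standalone helper `count_patches` (not part of this commit) that mirrors the same arithmetic outside the config class:

```python
def count_patches(context_length, patch_length, patch_stride, resolution_prefix_tuning=False):
    """Mirror the num_patches arithmetic from check_and_init_preprocessing."""
    # Treat contexts shorter than one patch as a single patch, then count strided windows.
    num_patches = (max(context_length, patch_length) - patch_length) // patch_stride + 1
    # One extra slot is added when resolution prefix tuning is enabled.
    if resolution_prefix_tuning:
        num_patches += 1
    return num_patches


# Defaults from the signature above: context_length=64, patch_length=8, patch_stride=8.
default_patches = count_patches(64, 8, 8)
```

With the default values the context splits into non-overlapping windows, and a stride smaller than `patch_length` yields overlapping patches and therefore a larger count.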
generation_config.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "bos_token_id": 151643,
3
+ "do_sample": true,
4
+ "eos_token_id": [
5
+ 151645,
6
+ 151643
7
+ ],
8
+ "pad_token_id": 151643,
9
+ "repetition_penalty": 1.05,
10
+ "temperature": 0.7,
11
+ "top_k": 20,
12
+ "top_p": 0.8,
13
+ "transformers_version": "4.46.2"
14
+ }
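To illustrate how `temperature`, `top_k` and `top_p` from this file interact, here is a simplified pure-Python sketch. It is an assumption-laden illustration, not the `transformers` implementation: the function name `sampling_distribution` is hypothetical, `repetition_penalty` is ignored, and the filters are applied in the order temperature, top-k, top-p.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Numerically stable softmax over the surviving tokens.
    peak = max(scaled[i] for i in order)
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(order, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to one.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}
```

Because `do_sample` is `true`, the next token would then be drawn at random from the returned distribution rather than chosen greedily.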
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
modeling_qwen2.py ADDED
@@ -0,0 +1,1702 @@
1
+ # coding=utf-8
2
+ # The following code is reused from the QWen project (https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) of Alibaba Cloud.
3
+ # Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
4
+ #
5
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
6
+ # and OPT implementations in this library. It has been modified from its
7
+ # original forms to accommodate minor architectural differences compared
8
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
9
+ #
10
+ # Licensed under the Apache License, Version 2.0 (the "License");
11
+ # you may not use this file except in compliance with the License.
12
+ # You may obtain a copy of the License at
13
+ #
14
+ # http://www.apache.org/licenses/LICENSE-2.0
15
+ #
16
+ # Unless required by applicable law or agreed to in writing, software
17
+ # distributed under the License is distributed on an "AS IS" BASIS,
18
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
19
+ # See the License for the specific language governing permissions and
20
+ # limitations under the License.
21
+
22
+ # The code is modified by ByteDance and Tsinghua University from the original implementation of Qwen:
23
+ # - Support time series modality for Qwen2 model.
24
+
25
+ """ PyTorch Qwen2 model."""
26
+ import inspect
27
+ import math
28
+ import copy
29
+ from typing import List, Optional, Tuple, Union
30
+ from dataclasses import dataclass
31
+
32
+ import torch
33
+ import torch.nn.functional as F
34
+ import torch.utils.checkpoint
35
+ from torch import nn
36
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
37
+
38
+ from transformers.activations import ACT2FN
39
+ from transformers.cache_utils import Cache, DynamicCache
40
+ from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask, _prepare_4d_causal_attention_mask_for_sdpa
41
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
42
+ from transformers.modeling_utils import PreTrainedModel
43
+ from transformers import AutoConfig
44
+ from transformers.utils import (
45
+ add_start_docstrings,
46
+ add_start_docstrings_to_model_forward,
47
+ is_flash_attn_2_available,
48
+ is_flash_attn_greater_or_equal_2_10,
49
+ logging,
50
+ replace_return_docstrings,
51
+ ModelOutput
52
+ )
53
+ from .configuration_qwen2 import Qwen2TSConfig, TinyTimeMixerConfig
54
+
55
+ # from .modeling_tinytimemixer import TinyTimeMixerForPrediction
56
+ # from .configuration_tinytimemixer import TinyTimeMixerConfig
57
+
58
+ if is_flash_attn_2_available():
59
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
60
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
61
+
62
+ _flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
63
+
64
+
65
+ logger = logging.get_logger(__name__)
66
+
67
+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen2-7B-beta"
68
+ _CONFIG_FOR_DOC = "Qwen2TSConfig"
69
+
70
+
71
+ ########################Naive TS Embedding#####################
72
+ class TimeSeriesEmbedding(nn.Module):
73
+ def __init__(self, config):
74
+ super(TimeSeriesEmbedding, self).__init__()
75
+ self.patch_size = config['patch_size']
76
+ self.num_layers = config['num_layers']
77
+ self.hidden_size = config['hidden_size']
78
+ self.num_features = config['num_features']
79
+
80
+ layers = []
81
+ # Adjust the input size to include the mask channel
82
+ input_size = 1 * self.patch_size
83
+
84
+ for _ in range(self.num_layers - 1):
85
+ layers.append(nn.Linear(input_size, self.hidden_size))
86
+ layers.append(nn.GELU())
87
+ input_size = self.hidden_size
88
+ layers.append(nn.Linear(input_size, self.hidden_size))
89
+
90
+ self.mlp = nn.Sequential(*layers)
91
+
92
+ def forward(self, x: torch.Tensor):
93
+ batch_size = x.size(0)
94
+ x = x.reshape(batch_size, -1, self.num_features)
95
+
96
+ mask = x[:, :, -1]
97
+ valid_lengths = mask.sum(dim=1).long() # Shape: (batch_size)
98
+
99
+ patch_cnt = (valid_lengths + self.patch_size - 1) // self.patch_size # round up (ceiling division)
100
+ # print(f"[DEBUG] TimeSeriesEmbedding: {valid_lengths=}, {patch_cnt=}, {mask.shape=}")
101
+
102
+ patches_list = []
103
+ for i in range(batch_size):
104
+ vl = valid_lengths[i].item()
105
+ pc = patch_cnt[i].item()
106
+ if pc == 0:
107
+ continue
108
+ xi = x[i, :vl, :1]
109
+ total_padded_length = pc * self.patch_size
110
+ padding_length = total_padded_length - vl
111
+ if padding_length > 0:
112
+ padding = torch.zeros(padding_length, 1, device=x.device, dtype=x.dtype)
113
+ xi = torch.cat([xi, padding], dim=0)
114
+ xi = xi.reshape(pc, self.patch_size * 1)
115
+ patches_list.append(xi)
116
+
117
+ if patches_list:
118
+ x_patches = torch.cat(patches_list, dim=0) # Shape: (total_patch_cnt, patch_size)
119
+ x = self.mlp(x_patches)
120
+ else:
121
+ # If there are no valid patches, return an empty tensor
122
+ x = torch.empty(0, self.hidden_size, device=x.device)
123
+ # print(f"[DEBUG] TimeSeriesEmbedding OUTPUT: {x.shape=}, {patch_cnt=}")
124
+
125
+ return x, patch_cnt
126
+
127
+
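The per-sample patching in `TimeSeriesEmbedding.forward` comes down to a ceiling division followed by zero-padding. The following is a minimal pure-Python sketch of that step for a single series, assuming a hypothetical helper `patchify` that is not part of this commit and that leaves out the mask channel and the MLP:

```python
def patchify(values, patch_size):
    """Split a 1-D series into zero-padded patches of length patch_size."""
    # Ceiling division: a partially filled final patch still counts as one patch.
    patch_cnt = (len(values) + patch_size - 1) // patch_size
    # Pad with zeros so the length becomes an exact multiple of patch_size.
    padded = list(values) + [0.0] * (patch_cnt * patch_size - len(values))
    return [padded[i * patch_size:(i + 1) * patch_size] for i in range(patch_cnt)]
```

An empty series yields no patches, which corresponds to the `pc == 0` branch that skips the sample in the loop above.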
128
+ ########################QWEN2###################################
129
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
130
+ def _get_unpad_data(attention_mask):
131
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
132
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
133
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
134
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
135
+ return (
136
+ indices,
137
+ cu_seqlens,
138
+ max_seqlen_in_batch,
139
+ )
140
+
141
+
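The three values returned by `_get_unpad_data` are easier to read on a concrete mask. Here is a small pure-Python sketch of the same bookkeeping, assuming nested lists in place of tensors and a hypothetical name `get_unpad_data_lists`:

```python
def get_unpad_data_lists(attention_mask):
    """Compute flat token indices, cumulative lengths and max length from a 0/1 mask."""
    # Number of real (non-padding) tokens in each sequence.
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    flat = [v for row in attention_mask for v in row]
    indices = [i for i, v in enumerate(flat) if v != 0]
    # Cumulative lengths with a leading zero, marking sequence boundaries.
    cu_seqlens = [0]
    for n in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return indices, cu_seqlens, max(seqlens)
```

Consecutive pairs in `cu_seqlens` delimit each sequence inside the packed token buffer that the variable-length flash attention call consumes.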
142
+ # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Qwen2
143
+ class Qwen2RMSNorm(nn.Module):
144
+ def __init__(self, hidden_size, eps=1e-6):
145
+ """
146
+ Qwen2RMSNorm is equivalent to T5LayerNorm
147
+ """
148
+ super().__init__()
149
+ self.weight = nn.Parameter(torch.ones(hidden_size))
150
+ self.variance_epsilon = eps
151
+
152
+ def forward(self, hidden_states):
153
+ input_dtype = hidden_states.dtype
154
+ hidden_states = hidden_states.to(torch.float32)
155
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
156
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
157
+ return self.weight * hidden_states.to(input_dtype)
158
+
159
+
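For reference, the normalization in `Qwen2RMSNorm.forward` can be written for a single vector without tensors. This is a minimal sketch, assuming a hypothetical function `rms_norm` and plain Python lists; it omits the float32 upcast that the module performs:

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root-mean-square is about 1, then apply per-element weights."""
    # Mean of squares over the last (hidden) dimension.
    variance = sum(v * v for v in x) / len(x)
    # Reciprocal root-mean-square, with eps for numerical safety.
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]
```

Unlike standard layer normalization, no mean is subtracted and no bias is added; only the scale of the vector is changed.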
160
+ # Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->Qwen2
161
+ class Qwen2RotaryEmbedding(nn.Module):
162
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
163
+ super().__init__()
164
+
165
+ self.dim = dim
166
+ self.max_position_embeddings = max_position_embeddings
167
+ self.base = base
168
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
169
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
170
+
171
+ # Build here to make `torch.jit.trace` work.
172
+ self._set_cos_sin_cache(
173
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
174
+ )
175
+
176
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
177
+ self.max_seq_len_cached = seq_len
178
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
179
+
180
+ freqs = torch.outer(t, self.inv_freq)
181
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
182
+ emb = torch.cat((freqs, freqs), dim=-1)
183
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
184
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
185
+
186
+ def forward(self, x, seq_len=None):
187
+ # x: [bs, num_attention_heads, seq_len, head_size]
188
+ if seq_len > self.max_seq_len_cached:
189
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
190
+
191
+ return (
192
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
193
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
194
+ )
195
+
196
+
197
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
198
+ def rotate_half(x):
199
+ """Rotates half the hidden dims of the input."""
200
+ x1 = x[..., : x.shape[-1] // 2]
201
+ x2 = x[..., x.shape[-1] // 2 :]
202
+ return torch.cat((-x2, x1), dim=-1)
203
+
204
+
205
+ # Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
206
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
207
+ """Applies Rotary Position Embedding to the query and key tensors.
208
+
209
+ Args:
210
+ q (`torch.Tensor`): The query tensor.
211
+ k (`torch.Tensor`): The key tensor.
212
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
213
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
214
+ position_ids (`torch.Tensor`):
215
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
216
+ used to pass offsetted position ids when working with a KV-cache.
217
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
218
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
219
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
220
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
221
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
222
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
223
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
224
+ Returns:
225
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
226
+ """
227
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
228
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
229
+ q_embed = (q * cos) + (rotate_half(q) * sin)
230
+ k_embed = (k * cos) + (rotate_half(k) * sin)
231
+ return q_embed, k_embed
232
+
233
+
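The rotary embedding above can be followed on a single vector. The sketch below is a pure-Python illustration under stated assumptions: `rope` and `dot` are hypothetical helper names, one vector is rotated at a time, and the cos/sin cache is recomputed on every call.

```python
import math


def rotate_half(x):
    """Return (-x2, x1), where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope(x, position, base=10000.0):
    """Apply rotary position embedding to one vector at an integer position."""
    dim = len(x)
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    # Duplicate the angles, matching torch.cat((freqs, freqs), dim=-1).
    angles = [position * f for f in inv_freq] * 2
    rotated = rotate_half(x)
    return [x[i] * math.cos(angles[i]) + rotated[i] * math.sin(angles[i]) for i in range(dim)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))
```

The useful property is that the dot product of a rotated query and a rotated key depends only on the difference of their positions, which is what lets attention scores encode relative position.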
234
+ # Copied from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2
235
+ class Qwen2MLP(nn.Module):
236
+ def __init__(self, config):
237
+ super().__init__()
238
+ self.config = config
239
+ self.hidden_size = config.hidden_size
240
+ self.intermediate_size = config.intermediate_size
241
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
242
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
243
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
244
+ self.act_fn = ACT2FN[config.hidden_act]
245
+
246
+ def forward(self, x):
247
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
248
+
249
+
250
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
251
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
252
+ """
253
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
254
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
255
+ """
256
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
257
+ if n_rep == 1:
258
+ return hidden_states
259
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
260
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
261
+
262
+
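What `repeat_kv` does to the head dimension can be shown with labels instead of tensors. This is a minimal sketch, assuming a hypothetical helper `repeat_kv_heads` that works on a plain list of per-head entries:

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each key/value head n_rep times, keeping the copies adjacent."""
    if n_rep == 1:
        return list(kv_heads)
    # Same ordering as torch.repeat_interleave along the head dimension.
    return [head for head in kv_heads for _ in range(n_rep)]
```

With `num_key_value_groups = num_heads // num_key_value_heads`, each group of adjacent query heads ends up sharing one key/value head.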
263
+ class Qwen2Attention(nn.Module):
264
+ """
265
+ Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
266
+ and "Generating Long Sequences with Sparse Transformers".
267
+ """
268
+
269
+ def __init__(self, config: Qwen2TSConfig, layer_idx: Optional[int] = None):
270
+ super().__init__()
271
+ self.config = config
272
+ self.layer_idx = layer_idx
273
+ if layer_idx is None:
274
+ logger.warning_once(
275
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
276
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
277
+ "when creating this class."
278
+ )
279
+
280
+ self.hidden_size = config.hidden_size
281
+ self.num_heads = config.num_attention_heads
282
+ self.head_dim = self.hidden_size // self.num_heads
283
+ self.num_key_value_heads = config.num_key_value_heads
284
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
285
+ self.max_position_embeddings = config.max_position_embeddings
286
+ self.rope_theta = config.rope_theta
287
+ self.is_causal = True
288
+ self.attention_dropout = config.attention_dropout
289
+
290
+ if (self.head_dim * self.num_heads) != self.hidden_size:
291
+ raise ValueError(
292
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
293
+ f" and `num_heads`: {self.num_heads})."
294
+ )
295
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
296
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
297
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
298
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
299
+
300
+ self.rotary_emb = Qwen2RotaryEmbedding(
301
+ self.head_dim,
302
+ max_position_embeddings=self.max_position_embeddings,
303
+ base=self.rope_theta,
304
+ )
305
+
306
+ def forward(
307
+ self,
308
+ hidden_states: torch.Tensor,
309
+ attention_mask: Optional[torch.Tensor] = None,
310
+ position_ids: Optional[torch.LongTensor] = None,
311
+ past_key_value: Optional[Cache] = None,
312
+ output_attentions: bool = False,
313
+ use_cache: bool = False,
314
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
315
+ bsz, q_len, _ = hidden_states.size()
316
+
317
+ query_states = self.q_proj(hidden_states)
318
+ key_states = self.k_proj(hidden_states)
319
+ value_states = self.v_proj(hidden_states)
320
+
321
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
322
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
323
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
324
+
325
+ kv_seq_len = key_states.shape[-2]
326
+ if past_key_value is not None:
327
+ if self.layer_idx is None:
328
+ raise ValueError(
329
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
330
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
331
+ "with a layer index."
332
+ )
333
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
334
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
335
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
336
+
337
+ if past_key_value is not None:
338
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
339
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
340
+
341
+ # repeat k/v heads if n_kv_heads < n_heads
342
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
343
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
344
+
345
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
346
+
347
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
348
+ raise ValueError(
349
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
350
+ f" {attn_weights.size()}"
351
+ )
352
+
353
+ if attention_mask is not None:
354
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
355
+ raise ValueError(
356
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
357
+ )
358
+
359
+ attn_weights = attn_weights + attention_mask
360
+
361
+ # upcast attention to fp32
362
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
363
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
364
+ attn_output = torch.matmul(attn_weights, value_states)
365
+
366
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
367
+ raise ValueError(
368
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
369
+ f" {attn_output.size()}"
370
+ )
371
+
372
+ attn_output = attn_output.transpose(1, 2).contiguous()
373
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
374
+
375
+ attn_output = self.o_proj(attn_output)
376
+
377
+ if not output_attentions:
378
+ attn_weights = None
379
+
380
+ return attn_output, attn_weights, past_key_value
381
+
382
+
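The core of the eager attention path (scaled scores, softmax, weighted sum of values) can be sketched for a single head in pure Python. This illustration makes several assumptions: the function name `causal_attention` is hypothetical, the mask is taken to be purely causal, and dropout, caching and rotary embeddings are left out.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention over lists of per-position vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for i, q_i in enumerate(q):
        # Position i may only attend to positions 0..i (causal masking).
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append(
            [sum(weights[j] * v[j][d] for j in range(i + 1)) for d in range(len(v[0]))]
        )
    return outputs
```

In the module above the same effect comes from adding a mask of large negative values to `attn_weights` before the softmax, instead of restricting the loop range.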
383
+ class Qwen2FlashAttention2(Qwen2Attention):
384
+ """
385
+ Qwen2 flash attention module, following Qwen2 attention module. This module inherits from `Qwen2Attention`
386
+ as the weights of the module stays untouched. The only required change would be on the forward pass
387
+ where it needs to correctly call the public API of flash attention and deal with padding tokens
388
+ in case the input contains any of them. Additionally, for sliding window attention, we apply SWA only to the bottom
389
+ config.max_window_layers layers.
390
+ """
391
+
392
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
393
+ def __init__(self, *args, **kwargs):
394
+ super().__init__(*args, **kwargs)
395
+
396
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
397
+ # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
398
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
399
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
400
+
401
+ def forward(
402
+ self,
403
+ hidden_states: torch.Tensor,
404
+ attention_mask: Optional[torch.Tensor] = None,
405
+ position_ids: Optional[torch.LongTensor] = None,
406
+ past_key_value: Optional[Cache] = None,
407
+ output_attentions: bool = False,
408
+ use_cache: bool = False,
409
+ ):
410
+ bsz, q_len, _ = hidden_states.size()
411
+
412
+ query_states = self.q_proj(hidden_states)
413
+ key_states = self.k_proj(hidden_states)
414
+ value_states = self.v_proj(hidden_states)
415
+
416
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
417
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
418
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
419
+
420
+ kv_seq_len = key_states.shape[-2]
421
+ if past_key_value is not None:
422
+ if self.layer_idx is None:
423
+ raise ValueError(
424
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
425
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
426
+ "with a layer index."
427
+ )
428
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
429
+
430
+ # Because the input can be padded, the absolute sequence length depends on the max position id.
431
+ rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
432
+ cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
433
+
434
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
435
+
436
+ use_sliding_windows = (
437
+ _flash_supports_window_size
438
+ and getattr(self.config, "sliding_window", None) is not None
439
+ and kv_seq_len > self.config.sliding_window
440
+ and self.config.use_sliding_window
441
+ )
442
+
443
+ if not _flash_supports_window_size:
444
+ logger.warning_once(
445
+ "The current flash attention version does not support sliding window attention, for a more memory efficient implementation"
446
+ " make sure to upgrade flash-attn library."
447
+ )
448
+
449
+ if past_key_value is not None:
450
+ # Activate slicing cache only if the config has a value `sliding_windows` attribute
451
+ cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
452
+ if (
453
+ getattr(self.config, "sliding_window", None) is not None
454
+ and kv_seq_len > self.config.sliding_window
455
+ and cache_has_contents
456
+ ):
457
+ slicing_tokens = 1 - self.config.sliding_window
458
+
459
+ past_key = past_key_value[self.layer_idx][0]
460
+ past_value = past_key_value[self.layer_idx][1]
461
+
462
+ past_key = past_key[:, :, slicing_tokens:, :].contiguous()
463
+ past_value = past_value[:, :, slicing_tokens:, :].contiguous()
464
+
465
+ if past_key.shape[-2] != self.config.sliding_window - 1:
466
+ raise ValueError(
467
+ f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
468
+ f" {past_key.shape}"
469
+ )
470
+
471
+ if attention_mask is not None:
472
+ attention_mask = attention_mask[:, slicing_tokens:]
473
+ attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
474
+
475
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
476
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
477
+
478
+ # repeat k/v heads if n_kv_heads < n_heads
479
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
480
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
481
+ dropout_rate = 0.0 if not self.training else self.attention_dropout
482
+
483
+ # In PEFT, layer norms are usually cast to float32 for training stability,
484
+ # so the input hidden states get silently cast to float32. Hence, we need to
485
+ # cast them back to the lower-precision target dtype to make sure everything works as expected.
486
+ input_dtype = query_states.dtype
487
+ if input_dtype == torch.float32:
488
+ if torch.is_autocast_enabled():
489
+ target_dtype = torch.get_autocast_gpu_dtype()
490
+ # Handle the case where the model is quantized
491
+ elif hasattr(self.config, "_pre_quantization_dtype"):
492
+ target_dtype = self.config._pre_quantization_dtype
493
+ else:
494
+ target_dtype = self.q_proj.weight.dtype
495
+
496
+ logger.warning_once(
497
+ f"The input hidden states seem to have been silently cast to float32; this might be because"
498
+ f" embedding or layer norm layers were upcast to float32. The input will be cast back to"
499
+ f" {target_dtype}."
500
+ )
501
+
502
+ query_states = query_states.to(target_dtype)
503
+ key_states = key_states.to(target_dtype)
504
+ value_states = value_states.to(target_dtype)
505
+
506
+ # Reshape to the expected shape for Flash Attention
507
+ query_states = query_states.transpose(1, 2)
508
+ key_states = key_states.transpose(1, 2)
509
+ value_states = value_states.transpose(1, 2)
510
+
511
+ attn_output = self._flash_attention_forward(
512
+ query_states,
513
+ key_states,
514
+ value_states,
515
+ attention_mask,
516
+ q_len,
517
+ dropout=dropout_rate,
518
+ use_sliding_windows=use_sliding_windows,
519
+ )
520
+
521
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
522
+ attn_output = self.o_proj(attn_output)
523
+
524
+ if not output_attentions:
525
+ attn_weights = None
526
+
527
+ return attn_output, attn_weights, past_key_value
528
+
529
+ def _flash_attention_forward(
530
+ self,
531
+ query_states,
532
+ key_states,
533
+ value_states,
534
+ attention_mask,
535
+ query_length,
536
+ dropout=0.0,
537
+ softmax_scale=None,
538
+ use_sliding_windows=False,
539
+ ):
540
+ """
541
+ Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
542
+ first unpads the input, then computes the attention output and re-pads the final attention output.
543
+
544
+ Args:
545
+ query_states (`torch.Tensor`):
546
+ Input query states to be passed to Flash Attention API
547
+ key_states (`torch.Tensor`):
548
+ Input key states to be passed to Flash Attention API
549
+ value_states (`torch.Tensor`):
550
+ Input value states to be passed to Flash Attention API
551
+ attention_mask (`torch.Tensor`):
552
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
553
+ position of padding tokens and 1 for the position of non-padding tokens.
554
+ dropout (`float`):
555
+ Attention dropout
556
+ softmax_scale (`float`, *optional*):
557
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
558
+ use_sliding_windows (`bool`, *optional*):
559
+ Whether to activate sliding window attention.
560
+ """
561
+ if not self._flash_attn_uses_top_left_mask:
562
+ causal = self.is_causal
563
+ else:
564
+ # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
565
+ causal = self.is_causal and query_length != 1
566
+
567
+ # Decide whether to use SWA or not by layer index.
568
+ if use_sliding_windows and self.layer_idx >= self.config.max_window_layers:
569
+ use_sliding_windows = False
570
+
571
+ # Contains at least one padding token in the sequence
572
+ if attention_mask is not None:
573
+ batch_size = query_states.shape[0]
574
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
575
+ query_states, key_states, value_states, attention_mask, query_length
576
+ )
577
+
578
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
579
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
580
+
581
+ if not use_sliding_windows:
582
+ attn_output_unpad = flash_attn_varlen_func(
583
+ query_states,
584
+ key_states,
585
+ value_states,
586
+ cu_seqlens_q=cu_seqlens_q,
587
+ cu_seqlens_k=cu_seqlens_k,
588
+ max_seqlen_q=max_seqlen_in_batch_q,
589
+ max_seqlen_k=max_seqlen_in_batch_k,
590
+ dropout_p=dropout,
591
+ softmax_scale=softmax_scale,
592
+ causal=causal,
593
+ )
594
+ else:
595
+ attn_output_unpad = flash_attn_varlen_func(
596
+ query_states,
597
+ key_states,
598
+ value_states,
599
+ cu_seqlens_q=cu_seqlens_q,
600
+ cu_seqlens_k=cu_seqlens_k,
601
+ max_seqlen_q=max_seqlen_in_batch_q,
602
+ max_seqlen_k=max_seqlen_in_batch_k,
603
+ dropout_p=dropout,
604
+ softmax_scale=softmax_scale,
605
+ causal=causal,
606
+ window_size=(self.config.sliding_window, self.config.sliding_window),
607
+ )
608
+
609
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
610
+ else:
611
+ if not use_sliding_windows:
612
+ attn_output = flash_attn_func(
613
+ query_states,
614
+ key_states,
615
+ value_states,
616
+ dropout,
617
+ softmax_scale=softmax_scale,
618
+ causal=causal,
619
+ )
620
+ else:
621
+ attn_output = flash_attn_func(
622
+ query_states,
623
+ key_states,
624
+ value_states,
625
+ dropout,
626
+ softmax_scale=softmax_scale,
627
+ causal=causal,
628
+ window_size=(self.config.sliding_window, self.config.sliding_window),
629
+ )
630
+
631
+ return attn_output
632
+
633
+ # Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2._upad_input
634
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
635
+ batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
636
+
637
+ # On the first iteration we need to properly re-create the padding mask
638
+ # by slicing it on the proper place
639
+ if kv_seq_len != attention_mask.shape[-1]:
640
+ attention_mask_num_tokens = attention_mask.shape[-1]
641
+ attention_mask = attention_mask[:, attention_mask_num_tokens - kv_seq_len :]
642
+
643
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
644
+
645
+ key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
646
+ value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
647
+
648
+ if query_length == kv_seq_len:
649
+ query_layer = index_first_axis(
650
+ query_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k
651
+ )
652
+ cu_seqlens_q = cu_seqlens_k
653
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
654
+ indices_q = indices_k
655
+ elif query_length == 1:
656
+ max_seqlen_in_batch_q = 1
657
+ cu_seqlens_q = torch.arange(
658
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
659
+ ) # There is a memcpy here, that is very bad.
660
+ indices_q = cu_seqlens_q[:-1]
661
+ query_layer = query_layer.squeeze(1)
662
+ else:
663
+ # The -q_len: slice assumes left padding.
664
+ attention_mask = attention_mask[:, -query_length:]
665
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
666
+
667
+ return (
668
+ query_layer,
669
+ key_layer,
670
+ value_layer,
671
+ indices_q,
672
+ (cu_seqlens_q, cu_seqlens_k),
673
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
674
+ )
675
+
676
+
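The unpadding step in `_upad_input` relies on `_get_unpad_data`, which is not shown in this part of the diff. A minimal plain-Python sketch of the metadata it is assumed to produce (flat indices of non-padding tokens, cumulative sequence lengths, and the longest sequence length) might look like the following. The name `unpad_metadata` and the list-based inputs are hypothetical and used only for illustration; the real helper operates on tensors.

```python
from itertools import accumulate


def unpad_metadata(attention_mask):
    """Hypothetical list-based stand-in for the assumed behaviour of `_get_unpad_data`.

    attention_mask: list of rows of 0/1 values, where 1 marks a real token.
    """
    # Number of real (non-padding) tokens in each sequence.
    seqlens = [sum(row) for row in attention_mask]
    width = len(attention_mask[0])
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [
        r * width + c
        for r, row in enumerate(attention_mask)
        for c, keep in enumerate(row)
        if keep
    ]
    # Cumulative lengths with a leading 0: sequence i occupies
    # [cu_seqlens[i], cu_seqlens[i + 1]) in the packed tensor.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


# Left-padded batch of two sequences with lengths 2 and 3.
mask = [[0, 1, 1], [1, 1, 1]]
indices, cu_seqlens, max_seqlen = unpad_metadata(mask)
```

The packed layout is what lets a variable-length kernel such as `flash_attn_varlen_func` skip padding entirely: it only needs the boundaries in `cu_seqlens`, not the mask itself.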
677
+ # Copied from transformers.models.mistral.modeling_mistral.MistralSdpaAttention with Mistral->Qwen2
678
+ class Qwen2SdpaAttention(Qwen2Attention):
679
+ """
680
+ Qwen2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
681
+ `Qwen2Attention` as the weights of the module stay untouched. The only changes are in the forward pass to adapt to
682
+ the SDPA API.
683
+ """
684
+
685
+ # Adapted from Qwen2Attention.forward
686
+ def forward(
687
+ self,
688
+ hidden_states: torch.Tensor,
689
+ attention_mask: Optional[torch.Tensor] = None,
690
+ position_ids: Optional[torch.LongTensor] = None,
691
+ past_key_value: Optional[Cache] = None,
692
+ output_attentions: bool = False,
693
+ use_cache: bool = False,
694
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
695
+ if output_attentions:
696
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
697
+ logger.warning_once(
698
+ "Qwen2Model is using Qwen2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
699
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
700
+ )
701
+ return super().forward(
702
+ hidden_states=hidden_states,
703
+ attention_mask=attention_mask,
704
+ position_ids=position_ids,
705
+ past_key_value=past_key_value,
706
+ output_attentions=output_attentions,
707
+ use_cache=use_cache,
708
+ )
709
+
710
+ bsz, q_len, _ = hidden_states.size()
711
+
712
+ query_states = self.q_proj(hidden_states)
713
+ key_states = self.k_proj(hidden_states)
714
+ value_states = self.v_proj(hidden_states)
715
+
716
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
717
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
718
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
719
+
720
+ kv_seq_len = key_states.shape[-2]
721
+ if past_key_value is not None:
722
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
723
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
724
+
725
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
726
+
727
+ if past_key_value is not None:
728
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
729
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
730
+
731
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
732
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
733
+
734
+ if attention_mask is not None:
735
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
736
+ raise ValueError(
737
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
738
+ )
739
+
740
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
741
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
742
+ if query_states.device.type == "cuda" and attention_mask is not None:
743
+ query_states = query_states.contiguous()
744
+ key_states = key_states.contiguous()
745
+ value_states = value_states.contiguous()
746
+
747
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
748
+ query_states,
749
+ key_states,
750
+ value_states,
751
+ attn_mask=attention_mask,
752
+ dropout_p=self.attention_dropout if self.training else 0.0,
753
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
754
+ is_causal=self.is_causal and attention_mask is None and q_len > 1,
755
+ )
756
+
757
+ attn_output = attn_output.transpose(1, 2).contiguous()
758
+ attn_output = attn_output.view(bsz, q_len, self.hidden_size)
759
+
760
+ attn_output = self.o_proj(attn_output)
761
+
762
+ return attn_output, None, past_key_value
763
+
764
+
765
+ QWEN2_ATTENTION_CLASSES = {
766
+ "eager": Qwen2Attention,
767
+ "flash_attention_2": Qwen2FlashAttention2,
768
+ "sdpa": Qwen2SdpaAttention,
769
+ }
770
+
771
+
772
+ class Qwen2DecoderLayer(nn.Module):
773
+ def __init__(self, config: Qwen2TSConfig, layer_idx: int):
774
+ super().__init__()
775
+ self.hidden_size = config.hidden_size
776
+
777
+ if config.use_sliding_window and config._attn_implementation != "flash_attention_2":
778
+ logger.warning_once(
779
+ f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
780
+ "unexpected results may be encountered."
781
+ )
782
+ self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
783
+
784
+ self.mlp = Qwen2MLP(config)
785
+ self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
786
+ self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
787
+
788
+ def forward(
789
+ self,
790
+ hidden_states: torch.Tensor,
791
+ attention_mask: Optional[torch.Tensor] = None,
792
+ position_ids: Optional[torch.LongTensor] = None,
793
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
794
+ output_attentions: Optional[bool] = False,
795
+ use_cache: Optional[bool] = False,
796
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
797
+ """
798
+ Args:
799
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
800
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
801
+ `(batch, sequence_length)` where padding elements are indicated by 0.
802
+ output_attentions (`bool`, *optional*):
803
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
804
+ returned tensors for more detail.
805
+ use_cache (`bool`, *optional*):
806
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
807
+ (see `past_key_values`).
808
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
809
+ """
810
+
811
+ residual = hidden_states
812
+
813
+ hidden_states = self.input_layernorm(hidden_states)
814
+
815
+ # Self Attention
816
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
817
+ hidden_states=hidden_states,
818
+ attention_mask=attention_mask,
819
+ position_ids=position_ids,
820
+ past_key_value=past_key_value,
821
+ output_attentions=output_attentions,
822
+ use_cache=use_cache,
823
+ )
824
+ hidden_states = residual + hidden_states
825
+
826
+ # Fully Connected
827
+ residual = hidden_states
828
+ hidden_states = self.post_attention_layernorm(hidden_states)
829
+ hidden_states = self.mlp(hidden_states)
830
+ hidden_states = residual + hidden_states
831
+
832
+ outputs = (hidden_states,)
833
+
834
+ if output_attentions:
835
+ outputs += (self_attn_weights,)
836
+
837
+ if use_cache:
838
+ outputs += (present_key_value,)
839
+
840
+ return outputs
841
+
842
+
843
+ QWEN2_START_DOCSTRING = r"""
844
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
845
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
846
+ etc.)
847
+
848
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
849
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
850
+ and behavior.
851
+
852
+ Parameters:
853
+ config ([`Qwen2TSConfig`]):
854
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
855
+ load the weights associated with the model, only the configuration. Check out the
856
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
857
+ """
858
+
859
+
860
+ @add_start_docstrings(
861
+ "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
862
+ QWEN2_START_DOCSTRING,
863
+ )
864
+ class Qwen2PreTrainedModel(PreTrainedModel):
865
+ config_class = Qwen2TSConfig
866
+ base_model_prefix = "model"
867
+ supports_gradient_checkpointing = True
868
+ _no_split_modules = ["Qwen2DecoderLayer"]
869
+ _skip_keys_device_placement = "past_key_values"
870
+ _supports_flash_attn_2 = True
871
+ _supports_sdpa = True
872
+ _supports_cache_class = True
873
+
874
+ def _init_weights(self, module):
875
+ std = self.config.initializer_range
876
+ if isinstance(module, nn.Linear):
877
+ module.weight.data.normal_(mean=0.0, std=std)
878
+ if module.bias is not None:
879
+ module.bias.data.zero_()
880
+ elif isinstance(module, nn.Embedding):
881
+ module.weight.data.normal_(mean=0.0, std=std)
882
+ if module.padding_idx is not None:
883
+ module.weight.data[module.padding_idx].zero_()
884
+
885
+
886
+ class TSProjector(nn.Module):
887
+ def __init__(self, config: Qwen2TSConfig):
888
+ super().__init__()
889
+ self.config = config
890
+ self.linear_1 = nn.Linear(config.ts['d_model'], config.hidden_size, bias=True)
891
+ self.linear_2 = nn.LayerNorm(config.hidden_size, bias=True)
892
+ self.linear_3 = nn.Linear(config.hidden_size, config.hidden_size * 4, bias=True)
893
+ self.linear_4 = nn.LayerNorm(config.hidden_size * 4, bias=True)
894
+ self.act = nn.GELU()
895
+
896
+ def forward(self, ts_features):
897
+ hidden_states = self.linear_1(ts_features)
898
+ hidden_states = self.linear_2(hidden_states)
899
+ hidden_states = self.act(hidden_states)
900
+ hidden_states = self.linear_3(hidden_states)
901
+ hidden_states = self.linear_4(hidden_states)
902
+ hidden_states = hidden_states.reshape(hidden_states.size(0), -1, self.config.hidden_size)
903
+ return hidden_states
904
+
905
+
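The final `reshape` in `TSProjector.forward` changes the token count: `linear_3` widens each position to `hidden_size * 4`, and reshaping to `(batch, -1, hidden_size)` splits every widened position into four tokens. A small shape-only sketch, assuming `ts_features` has shape `(batch_size, num_patches, d_model)` (an assumption, since the encoder output is not shown here); `projector_output_shape` is a hypothetical helper:

```python
def projector_output_shape(batch_size, num_patches, hidden_size, expansion=4):
    """Shape arithmetic only; no tensors are created.

    Assumes the input is (batch_size, num_patches, d_model), which this
    part of the diff does not confirm.
    """
    # After linear_3 / linear_4: each patch holds hidden_size * expansion features.
    widened = (batch_size, num_patches, hidden_size * expansion)
    total_elements = widened[0] * widened[1] * widened[2]
    # reshape(batch, -1, hidden_size) infers the middle dimension.
    num_tokens = total_elements // (batch_size * hidden_size)
    return (batch_size, num_tokens, hidden_size)


shape = projector_output_shape(batch_size=2, num_patches=5, hidden_size=8)
```

Under that assumption, each time series patch contributes `expansion` embedding positions to the language model input rather than one.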
906
+ QWEN2_INPUTS_DOCSTRING = r"""
907
+ Args:
908
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
909
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
910
+ it.
911
+
912
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
913
+ [`PreTrainedTokenizer.__call__`] for details.
914
+
915
+ [What are input IDs?](../glossary#input-ids)
916
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
917
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
918
+
919
+ - 1 for tokens that are **not masked**,
920
+ - 0 for tokens that are **masked**.
921
+
922
+ [What are attention masks?](../glossary#attention-mask)
923
+
924
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
925
+ [`PreTrainedTokenizer.__call__`] for details.
926
+
927
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
928
+ `past_key_values`).
929
+
930
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
931
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
932
+ information on the default strategy.
933
+
934
+ - 1 indicates the head is **not masked**,
935
+ - 0 indicates the head is **masked**.
936
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
937
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
938
+ config.n_positions - 1]`.
939
+
940
+ [What are position IDs?](../glossary#position-ids)
941
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
942
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
943
+ blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
944
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
945
+
946
+ Two formats are allowed:
947
+ - a [`~cache_utils.Cache`] instance;
948
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
949
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
950
+ cache format.
951
+
952
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
953
+ legacy cache format will be returned.
954
+
955
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
956
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
957
+ of shape `(batch_size, sequence_length)`.
958
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
959
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
960
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
961
+ model's internal embedding lookup matrix.
962
+ use_cache (`bool`, *optional*):
963
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
964
+ `past_key_values`).
965
+ output_attentions (`bool`, *optional*):
966
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
967
+ tensors for more detail.
968
+ output_hidden_states (`bool`, *optional*):
969
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
970
+ more detail.
971
+ return_dict (`bool`, *optional*):
972
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
973
+ """
974
+
975
+
976
+ @add_start_docstrings(
977
+ "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
978
+ QWEN2_START_DOCSTRING,
979
+ )
980
+ class Qwen2Model(Qwen2PreTrainedModel):
981
+ """
982
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]
983
+
984
+ Args:
985
+ config: Qwen2TSConfig
986
+ """
987
+
988
+ def __init__(self, config: Qwen2TSConfig):
989
+ super().__init__(config)
990
+ self.padding_idx = config.pad_token_id
991
+ self.vocab_size = config.vocab_size
992
+
993
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
994
+ self.layers = nn.ModuleList(
995
+ [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
996
+ )
997
+ self._attn_implementation = config._attn_implementation
998
+ self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
999
+
1000
+ self.gradient_checkpointing = False
1001
+
1002
+ # Initialize weights and apply final processing
1003
+ self.post_init()
1004
+
1005
+ def get_input_embeddings(self):
1006
+ return self.embed_tokens
1007
+
1008
+ def set_input_embeddings(self, value):
1009
+ self.embed_tokens = value
1010
+
1011
+ @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
1012
+ def forward(
1013
+ self,
1014
+ input_ids: torch.LongTensor = None,
1015
+ attention_mask: Optional[torch.Tensor] = None,
1016
+ position_ids: Optional[torch.LongTensor] = None,
1017
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1018
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1019
+ use_cache: Optional[bool] = None,
1020
+ output_attentions: Optional[bool] = None,
1021
+ output_hidden_states: Optional[bool] = None,
1022
+ return_dict: Optional[bool] = None,
1023
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
1024
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1025
+ output_hidden_states = (
1026
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1027
+ )
1028
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1029
+
1030
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1031
+
1032
+ # retrieve input_ids and inputs_embeds
1033
+ if input_ids is not None and inputs_embeds is not None:
1034
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
1035
+ elif input_ids is not None:
1036
+ batch_size, seq_length = input_ids.shape
1037
+ elif inputs_embeds is not None:
1038
+ batch_size, seq_length, _ = inputs_embeds.shape
1039
+ else:
1040
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
1041
+
1042
+ if self.gradient_checkpointing and self.training:
1043
+ if use_cache:
1044
+ logger.warning_once(
1045
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
1046
+ )
1047
+ use_cache = False
1048
+
1049
+ past_key_values_length = 0
1050
+
1051
+ if use_cache:
1052
+ use_legacy_cache = not isinstance(past_key_values, Cache)
1053
+ if use_legacy_cache:
1054
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
1055
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
1056
+
1057
+ if position_ids is None:
1058
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1059
+ position_ids = torch.arange(
1060
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
1061
+ )
1062
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
1063
+ else:
1064
+ position_ids = position_ids.view(-1, seq_length).long()
1065
+
1066
+ if inputs_embeds is None:
1067
+ inputs_embeds = self.embed_tokens(input_ids)
1068
+
1069
+ if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
1070
+ is_padding_right = attention_mask[:, -1].sum().item() != batch_size
1071
+ if is_padding_right:
1072
+ raise ValueError(
1073
+ "You are attempting to perform batched generation with padding_side='right';"
1074
+ " this may lead to unexpected behaviour for the Flash Attention version of Qwen2. Make sure to"
1075
+ " call `tokenizer.padding_side = 'left'` before tokenizing the input."
1076
+ )
1077
+
1078
+ if self._attn_implementation == "flash_attention_2":
1079
+ # 2d mask is passed through the layers
1080
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
1081
+ elif self._attn_implementation == "sdpa" and not output_attentions:
1082
+ # output_attentions=True can not be supported when using SDPA, and we fall back on
1083
+ # the manual implementation that requires a 4D causal mask in all cases.
1084
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
1085
+ attention_mask,
1086
+ (batch_size, seq_length),
1087
+ inputs_embeds,
1088
+ past_key_values_length,
1089
+ sliding_window=self.config.sliding_window,
1090
+ )
1091
+ else:
1092
+ # 4d mask is passed through the layers
1093
+ attention_mask = _prepare_4d_causal_attention_mask(
1094
+ attention_mask,
1095
+ (batch_size, seq_length),
1096
+ inputs_embeds,
1097
+ past_key_values_length,
1098
+ sliding_window=self.config.sliding_window,
1099
+ )
1100
+
1101
+ hidden_states = inputs_embeds
1102
+
1103
+ # decoder layers
1104
+ all_hidden_states = () if output_hidden_states else None
1105
+ all_self_attns = () if output_attentions else None
1106
+ next_decoder_cache = None
1107
+
1108
+ for decoder_layer in self.layers:
1109
+ if output_hidden_states:
1110
+ all_hidden_states += (hidden_states,)
1111
+
1112
+ if self.gradient_checkpointing and self.training:
1113
+ layer_outputs = self._gradient_checkpointing_func(
1114
+ decoder_layer.__call__,
1115
+ hidden_states,
1116
+ attention_mask,
1117
+ position_ids,
1118
+ past_key_values,
1119
+ output_attentions,
1120
+ use_cache,
1121
+ )
1122
+ else:
1123
+ layer_outputs = decoder_layer(
1124
+ hidden_states,
1125
+ attention_mask=attention_mask,
1126
+ position_ids=position_ids,
1127
+ past_key_value=past_key_values,
1128
+ output_attentions=output_attentions,
1129
+ use_cache=use_cache,
1130
+ )
1131
+
1132
+ hidden_states = layer_outputs[0]
1133
+
1134
+ if use_cache:
1135
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1136
+
1137
+ if output_attentions:
1138
+ all_self_attns += (layer_outputs[1],)
1139
+
1140
+ hidden_states = self.norm(hidden_states)
1141
+
1142
+ # add hidden states from the last decoder layer
1143
+ if output_hidden_states:
1144
+ all_hidden_states += (hidden_states,)
1145
+
1146
+ next_cache = None
1147
+ if use_cache:
1148
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
1149
+
1150
+ if not return_dict:
1151
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1152
+ return BaseModelOutputWithPast(
1153
+ last_hidden_state=hidden_states,
1154
+ past_key_values=next_cache,
1155
+ hidden_states=all_hidden_states,
1156
+ attentions=all_self_attns,
1157
+ )
1158
+
1159
+
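The right-padding guard in `Qwen2Model.forward` checks only the last column of the 2D attention mask: if any row ends in 0, the batch is treated as right-padded. A plain-Python sketch of that check, with a hypothetical helper name and list-based masks in place of tensors:

```python
def is_padding_right(attention_mask):
    """Mirror of `attention_mask[:, -1].sum().item() != batch_size` on nested lists."""
    batch_size = len(attention_mask)
    # With left padding every row ends in a real token, so the last column sums to batch_size.
    return sum(row[-1] for row in attention_mask) != batch_size


left_padded = [[0, 1, 1], [1, 1, 1]]
right_padded = [[1, 1, 0], [1, 1, 1]]
left_result = is_padding_right(left_padded)
right_result = is_padding_right(right_padded)
```

The check is a heuristic: it cannot tell padding apart from any other reason a final position might be masked, which is presumably why it is applied only for batched generation with Flash Attention and `use_cache` enabled.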
1160
+ class Qwen2TSForCausalLM(Qwen2PreTrainedModel):
1161
+ _tied_weights_keys = ["lm_head.weight"]
1162
+
1163
+ def __init__(self, config):
1164
+ super().__init__(config)
1165
+ self.config = config
1166
+
1167
+ self.model = Qwen2Model(config)
1168
+ self.vocab_size = config.vocab_size
1169
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1170
+
1171
+ # TS embedding
1172
+ self.ts_encoder = TimeSeriesEmbedding(config.ts)
1173
+
1174
+ # Initialize weights and apply final processing
1175
+ self.post_init()
1176
+
1177
+ def get_input_embeddings(self):
1178
+ return self.model.embed_tokens
1179
+
1180
+ def set_input_embeddings(self, value):
1181
+ self.model.embed_tokens = value
1182
+
1183
+ def get_output_embeddings(self):
1184
+ return self.lm_head
1185
+
1186
+ def set_output_embeddings(self, new_embeddings):
1187
+ self.lm_head = new_embeddings
1188
+
1189
+ def set_decoder(self, decoder):
1190
+ self.model = decoder
1191
+
1192
+ def get_decoder(self):
1193
+ return self.model
1194
+
1195
+ def _get_real_length(self, timeseries, input_ids):
1196
+ # Return the embed length after inserting timeseries features
1197
+ if timeseries is None:
1198
+ return input_ids.size(1)
1199
+
1200
+ num_time_steps = timeseries.size(1) * timeseries.size(2) // self.config.ts['num_features']
1201
+ num_patches = num_time_steps // self.config.ts['patch_size']
1202
+ special_ts_token_mask_start = input_ids == self.config.ts_token_start_index
1203
+ num_special_ts_tokens = torch.sum(special_ts_token_mask_start, dim=-1)
1204
+ return num_special_ts_tokens * (num_patches - 2) + input_ids.size(1)
1205
+
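As a worked example of the length formula in `_get_real_length`: every `<ts>` start token adds `num_patches - 2` extra positions to the sequence. The `- 2` presumably accounts for the start and end marker tokens already present in `input_ids`; that reading is an assumption. The helper below is hypothetical and uses plain integers instead of tensors:

```python
def real_length(seq_len, num_ts_start_tokens, num_time_steps, patch_size):
    """Integer version of the formula used in `_get_real_length`."""
    num_patches = num_time_steps // patch_size
    # Each time series expands by (num_patches - 2) positions beyond its marker tokens.
    return num_ts_start_tokens * (num_patches - 2) + seq_len


# 10 prompt tokens containing 2 time series of 64 steps each, patch size 16.
length = real_length(seq_len=10, num_ts_start_tokens=2, num_time_steps=64, patch_size=16)
```

Note that the method in the diff applies a single `num_patches` value to all series in the batch, so this sketch makes the same simplification.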
1206
+ def _get_original_length(self, timeseries, input_ids, past_length):
1207
+ """
1208
+ Compute the original sequence length that corresponds to the transformed past_length, and return the number of <ts> tokens it contains.
1209
+
1210
+ Args:
1211
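+ timeseries (Tensor): Time series data tensor of shape (batch_size, num_time_steps).
1212
+ input_ids (Tensor): Original input IDs tensor of shape (batch_size, seq_length).
1213
+ past_length (int or Tensor): Transformed sequence length (including the inserted time series feature tokens); either a scalar or a tensor of shape (batch_size,).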
1214
+
1215
+ Returns:
1216
+ Tuple[Tensor, Tensor]:
1217
+ - original_length (Tensor): Original sequence length for each sample, of shape (batch_size,).
1218
+ - num_special_ts_tokens_within_past (Tensor): Number of <ts> tokens within past_length for each sample, of shape (batch_size,).
1219
+ """
1220
+ if timeseries is None:
1221
+ # If no time series features are inserted, the original length equals past_length
1222
+ if isinstance(past_length, int):
1223
+ original_length = torch.full((input_ids.size(0),), past_length, dtype=torch.long, device=input_ids.device)
1224
+ else:
1225
+ original_length = past_length
1226
+ num_special_ts_tokens_within_past = torch.zeros(input_ids.size(0), dtype=torch.long, device=input_ids.device)
1227
+ return original_length, num_special_ts_tokens_within_past
1228
+
1229
+ # Read the configuration parameters
1230
+ patch_size = self.config.ts['patch_size']
1231
+ num_patches = timeseries.size(1) * timeseries.size(2) // patch_size // self.config.ts['num_features']
1232
+ ts_token_start_index = self.config.ts_token_start_index
1233
+
1234
+ # Build a mask marking the positions of <ts> tokens
1235
+ ts_mask = (input_ids == ts_token_start_index).long() # (batch_size, seq_length)
1236
+
1237
+ # Cumulative count of <ts> tokens up to (and including) each position
1238
+ cumsum_ts = torch.cumsum(ts_mask, dim=1) # (batch_size, seq_length)
1239
+
1240
+ # Build position indices, starting from 1
1241
+ seq_length = input_ids.size(1)
1242
+ positions = torch.arange(1, seq_length + 1, device=input_ids.device).unsqueeze(0).expand_as(input_ids) # (batch_size, seq_length)
1243
+
1244
+ # Compute the transformed positions
1245
+ transformed_length = positions + cumsum_ts * (num_patches - 2) # (batch_size, seq_length)
1246
+
1247
+ # Handle past_length, which can be a scalar or a tensor
1248
+ if isinstance(past_length, int):
1249
+ past_length_tensor = torch.full((input_ids.size(0),), past_length, dtype=torch.long, device=input_ids.device)
1250
+ else:
1251
+ past_length_tensor = past_length.to(input_ids.device)
1252
+
1253
+ # Build a mask of the original positions whose transformed position does not exceed past_length
1254
+ mask = transformed_length <= past_length_tensor.unsqueeze(1) # (batch_size, seq_length)
1255
+
1256
+ # For each sample, count the positions that satisfy the condition, i.e. the original length
1257
+ original_length = torch.sum(mask, dim=1) # (batch_size,)
1258
+
1259
+ # Count the <ts> tokens that fall within the original_length range
1260
+ # Build a mask marking the <ts> tokens within the original_length range
1261
+ # First build a position index
1262
+ original_positions = torch.arange(1, seq_length + 1, device=input_ids.device).unsqueeze(0).expand_as(input_ids) # (batch_size, seq_length)
1263
+ original_mask = original_positions <= original_length.unsqueeze(1) # (batch_size, seq_length)
1264
+ ts_within_original_mask = ts_mask.bool() & original_mask.bool() # (batch_size, seq_length)
1265
+ num_special_ts_tokens_within_past = torch.sum(ts_within_original_mask, dim=1) # (batch_size,)
1266
+
1267
+ # Ensure original_length is not negative
1268
+ original_length = torch.clamp(original_length, min=0)
1269
+
1270
+ return original_length, num_special_ts_tokens_within_past
1271
+
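The inverse mapping in `_get_original_length` can be traced on a single sequence with plain lists. The sketch below follows the same steps (mark `<ts>` tokens, take the running count, compute each position's transformed length, count how many fit within `past_length`), but it is a hypothetical single-sample illustration, not the batched tensor code; the token ids are made up.

```python
from itertools import accumulate


def original_length(input_ids, ts_start_id, num_patches, past_length):
    """Single-sequence, list-based illustration of `_get_original_length`."""
    # 1 where the token is a <ts> start token, else 0.
    ts_mask = [1 if tok == ts_start_id else 0 for tok in input_ids]
    # Running count of <ts> tokens up to and including each position.
    cumsum_ts = list(accumulate(ts_mask))
    # 1-based position after time series features have been inserted.
    transformed = [
        pos + count * (num_patches - 2)
        for pos, count in zip(range(1, len(input_ids) + 1), cumsum_ts)
    ]
    # Original positions that are already covered by past_length.
    orig_len = sum(1 for t in transformed if t <= past_length)
    # <ts> tokens inside that covered prefix.
    num_ts = sum(ts_mask[:orig_len])
    return orig_len, num_ts


# Token 99 stands in for the <ts> start token; num_patches = 4 adds 2 positions per series.
ids = [5, 99, 100, 7, 8]
covered = original_length(ids, ts_start_id=99, num_patches=4, past_length=5)
```

Here the transformed lengths are `[1, 4, 5, 6, 7]`, so a `past_length` of 5 covers the first three original tokens, one of which is a `<ts>` token. A `past_length` that falls partway through an inserted series (for example 3) covers only the tokens before it.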
1272
+ def _merge_input_ids_with_time_series_features(
1273
+ self, time_series_features, inputs_embeds, input_ids, attention_mask, labels, patch_cnt
1274
+ ):
1275
+ total_time_steps, embed_dim = time_series_features.shape
1276
+ batch_size, sequence_length = input_ids.shape
1277
+ left_padding = False
1278
+
1279
+ # 1. Create a mask to know where special time series tokens are
1280
+ special_ts_token_mask_start = input_ids == self.config.ts_token_start_index
1281
+ special_ts_token_mask_end = input_ids == self.config.ts_token_end_index
1282
+ special_ts_token_mask = special_ts_token_mask_start | special_ts_token_mask_end
1283
+ # print("Special ts token mask:", special_ts_token_mask)
1284
+ num_special_ts_tokens = torch.sum(special_ts_token_mask_start, dim=-1)
1285
+ # Correctly calculate the total number of patches per batch
1286
+ num_total_patches = torch.zeros(batch_size, dtype=patch_cnt.dtype, device=patch_cnt.device)
1287
+ special_ts_token_mask_start_nonzero = special_ts_token_mask_start.nonzero()
1288
+ special_ts_token_mask_start_with_size = special_ts_token_mask_start.clone().long()
1289
+ patch_index = 0
1290
+ for i in range(batch_size):
1291
+ num_ts_in_batch = num_special_ts_tokens[i]
1292
+ num_total_patches[i] = patch_cnt[patch_index:patch_index + num_ts_in_batch].sum() - 2 * num_ts_in_batch
1293
+ for idx in range(patch_index, patch_index + num_ts_in_batch):
1294
+ batch_idx, seq_idx = special_ts_token_mask_start_nonzero[idx]
1295
+ special_ts_token_mask_start_with_size[batch_idx, seq_idx] *= (patch_cnt[idx].item() - 2)
1296
+ patch_index += num_ts_in_batch
1297
+
1298
+ # Compute the maximum merged sequence length (text tokens plus the patches inserted for each start/end token pair)
1299
+ max_embed_dim = sequence_length + num_total_patches.max()
1300
+
1301
+ # batch_indices, non_ts_indices = torch.where(~special_ts_token_mask)
1302
+ batch_indices, non_ts_indices = torch.where(~special_ts_token_mask)
1303
+ # print("non_ts_indices:", non_ts_indices)
1304
+ # print("batch_indices:", batch_indices)
1305
+
1306
+ # 2. Compute the positions where text should be written
1307
+ new_token_positions = torch.cumsum((special_ts_token_mask_start_with_size + 1), dim=-1) - 1
1308
+ # print("new_token_positions", new_token_positions)
1309
+ nb_ts_pad = max_embed_dim - 1 - new_token_positions[:, -1]
1310
+ if left_padding:
1311
+ new_token_positions += nb_ts_pad[:, None] # offset for left padding
1312
+ text_to_overwrite = new_token_positions[batch_indices, non_ts_indices]
1313
+ # print('nb_ts_pad', nb_ts_pad)
1314
+
1315
+ # 3. Create the full embedding, already padded to the maximum position
1316
+ final_embedding = torch.zeros(
1317
+ batch_size, max_embed_dim, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
1318
+ )
1319
+ final_attention_mask = torch.zeros(
1320
+ batch_size, max_embed_dim, dtype=attention_mask.dtype, device=inputs_embeds.device
1321
+ )
1322
+ if labels is not None:
1323
+ final_labels = torch.full(
1324
+ (batch_size, max_embed_dim), self.config.ignore_index, dtype=input_ids.dtype, device=input_ids.device
1325
+ )
1326
+ target_device = inputs_embeds.device
1327
+ batch_indices, non_ts_indices, text_to_overwrite = (
1328
+ batch_indices.to(target_device),
1329
+ non_ts_indices.to(target_device),
1330
+ text_to_overwrite.to(target_device),
1331
+ )
1332
+ attention_mask = attention_mask.to(target_device)
1333
+
1334
+ # 4. Fill the embeddings based on the mask
1335
+ final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_ts_indices]
1336
+ final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_ts_indices]
1337
+ # print('final_attention_mask=', final_attention_mask)
1338
+ if labels is not None:
1339
+ final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_ts_indices]
1340
+
1341
+ # 5. Fill the embeddings corresponding to the time series
1342
+ ts_to_overwrite = torch.full(
1343
+ (batch_size, max_embed_dim), True, dtype=torch.bool, device=inputs_embeds.device
1344
+ )
1345
+ ts_to_overwrite[batch_indices, text_to_overwrite] = False
1346
+ # print('ts_to_overwrite.long().cumsum(-1) - 1=', ts_to_overwrite.long().cumsum(-1) - 1)
1347
+ # print('nb_ts_pad=', nb_ts_pad[:, None])
1348
+ reversed_cumsum = ts_to_overwrite.flip(dims=[-1]).cumsum(-1).flip(dims=[-1]) - 1
1349
+ ts_to_overwrite &= reversed_cumsum >= nb_ts_pad[:, None].to(target_device)
1350
+ # print('ts_to_overwrite=', ts_to_overwrite)
1351
+
1352
+ if ts_to_overwrite.sum() != time_series_features.shape[:-1].numel():
1353
+ raise ValueError(
1354
+ f"The inputs provided to the model are wrong. The number of time series tokens is {torch.sum(special_ts_token_mask_start)} while"
1355
+ f" the number of time series given to the model is {len(patch_cnt)}. This prevents correct indexing and breaks batch generation."
1356
+ )
1357
+
1358
+ final_embedding[ts_to_overwrite] = time_series_features.contiguous().reshape(-1, embed_dim).to(target_device)
1359
+ # logger.warning(f"[DEBUG] {final_embedding[ts_to_overwrite][:, 0]=}")
1360
+ final_attention_mask |= ts_to_overwrite
1361
+ position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
1362
+
1363
+ # 6. Mask out the embedding at padding positions
1364
+ batch_indices, pad_indices = torch.where(input_ids == self.config.pad_token_id)
1365
+ indices_to_mask = new_token_positions[batch_indices, pad_indices]
1366
+
1367
+ final_embedding[batch_indices, indices_to_mask] = 0
1368
+
1369
+ if labels is None:
1370
+ final_labels = None
1371
+
1372
+ return final_embedding, final_attention_mask, position_ids, final_labels
1373
+
1374
+ @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
1375
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1376
+ def forward(
1377
+ self,
1378
+ input_ids: torch.LongTensor = None,
1379
+ timeseries: torch.FloatTensor = None,
1380
+ attention_mask: Optional[torch.Tensor] = None,
1381
+ position_ids: Optional[torch.LongTensor] = None,
1382
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1383
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1384
+ labels: Optional[torch.LongTensor] = None,
1385
+ use_cache: Optional[bool] = None,
1386
+ output_attentions: Optional[bool] = None,
1387
+ output_hidden_states: Optional[bool] = None,
1388
+ return_dict: Optional[bool] = None,
1389
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1390
+ r"""
1391
+ Args:
1392
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1393
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1394
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1395
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1396
+
1397
+ Returns:
1398
+
1399
+ Example:
1400
+
1401
+ ```python
1402
+ >>> from transformers import AutoTokenizer, Qwen2ForCausalLM
1403
+
1404
+ >>> model = Qwen2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1405
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1406
+
1407
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1408
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1409
+
1410
+ >>> # Generate
1411
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1412
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1413
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1414
+ ```"""
1415
+
1416
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1417
+ output_hidden_states = (
1418
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1419
+ )
1420
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1421
+
1422
+ if inputs_embeds is None:
1423
+ inputs_embeds = self.get_input_embeddings()(input_ids)
1424
+
1425
+ if timeseries is not None and timeseries.shape[0] > 0:
1426
+ use_cache = False
1427
+ # print(f"[DEBUG] input timeseries.shape: {timeseries.shape}")
1428
+
1429
+ # Call ts_encoder and print the input and output shapes
1430
+ ts_features, patch_cnt = self.ts_encoder(timeseries)
1431
+ # print(f"[DEBUG] ts_features.shape: {ts_features.shape}")
1432
+ # print(f"[DEBUG] patch_cnt: {patch_cnt}")
1433
+
1434
+ inputs_embeds = inputs_embeds.to(ts_features.dtype)
1435
+
1436
+ # Print the relevant shapes before merging
1437
+ # print(f"[DEBUG] Before merging:")
1438
+ # print(f"{inputs_embeds[0, -5:, :5]=}")
1439
+ # print(f"{attention_mask.sum()=}")
1440
+ # print(f" inputs_embeds.shape: {inputs_embeds.shape}")
1441
+ # print(f" input_ids.shape: {input_ids.shape}")
1442
+ # print(f" attention_mask.shape: {attention_mask.shape}")
1443
+ # if labels is not None:
1444
+ # print(f" labels.shape: {labels.shape}")
1445
+ # else:
1446
+ # print(f" labels: None")
1447
+ # print(f" patch_cnt.shape: {patch_cnt.shape}")
1448
+
1449
+ # Call _merge_input_ids_with_time_series_features and print the output shapes
1450
+ inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_time_series_features(
1451
+ ts_features, inputs_embeds, input_ids, attention_mask, labels, patch_cnt
1452
+ )
1453
+
1454
+ # print(f"[DEBUG] After merging:")
1455
+ # print(f" inputs_embeds.shape: {inputs_embeds.shape}")
1456
+ # print(f" attention_mask.shape: {attention_mask.shape}")
1457
+ # print(f"{attention_mask.sum()=}")
1458
+ # print(f"{inputs_embeds[0, -5:, :5]=}")
1459
+
1460
+ # print(f" position_ids.shape: {position_ids.shape}")
1461
+ # if labels is not None:
1462
+ # print(f" labels.shape: {labels.shape}")
1463
+ # else:
1464
+ # print(f" labels: None")
1465
+
1466
+ # Continue with the model's forward pass
1467
+ outputs = self.model(
1468
+ attention_mask=attention_mask,
1469
+ position_ids=position_ids,
1470
+ past_key_values=past_key_values,
1471
+ inputs_embeds=inputs_embeds,
1472
+ use_cache=use_cache,
1473
+ output_attentions=output_attentions,
1474
+ output_hidden_states=output_hidden_states,
1475
+ return_dict=return_dict,
1476
+ )
1477
+
1478
+ hidden_states = outputs[0]
1479
+ logits = self.lm_head(hidden_states)
1480
+ logits = logits.float()
1481
+
1482
+ loss = None
1483
+ if labels is not None:
1484
+ # Shift so that tokens < n predict n
1485
+ shift_logits = logits[..., :-1, :].contiguous()
1486
+ shift_labels = labels[..., 1:].contiguous()
1487
+ # Flatten the tokens
1488
+ loss_fct = CrossEntropyLoss()
1489
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1490
+ shift_labels = shift_labels.view(-1)
1491
+ # Enable model parallelism
1492
+ shift_labels = shift_labels.to(shift_logits.device)
1493
+ loss = loss_fct(shift_logits, shift_labels)
1494
+
1495
+ if not return_dict:
1496
+ output = (logits,) + outputs[1:]
1497
+ return (loss,) + output if loss is not None else output
1498
+
1499
+ return CausalLMOutputWithPast(
1500
+ loss=loss,
1501
+ logits=logits,
1502
+ past_key_values=outputs.past_key_values,
1503
+ hidden_states=outputs.hidden_states,
1504
+ attentions=outputs.attentions,
1505
+ )
1506
+
1507
+
1508
+ def prepare_inputs_for_generation(
1509
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, timeseries=None, **kwargs
1510
+ ):
1511
+ # Omit tokens covered by past_key_values
1512
+ if past_key_values is not None:
1513
+ if isinstance(past_key_values, Cache):
1514
+ cache_length = past_key_values.get_seq_length()
1515
+ past_length = past_key_values.seen_tokens
1516
+ max_cache_length = past_key_values.get_max_length()
1517
+ else:
1518
+ cache_length = past_length = past_key_values[0][0].shape[2]
1519
+ max_cache_length = None
1520
+
1521
+ # print(f"[prepare_inputs_for_generation] {cache_length=}, {past_length=}, {max_cache_length=}")
1522
+
1523
+ # Keep only the unprocessed tokens:
1524
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1525
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
1526
+ # input)
1527
+ real_len = self._get_real_length(timeseries, input_ids)
1528
+ origin_past_len, past_num_ts = self._get_original_length(timeseries, input_ids, past_length)
1529
+ if attention_mask is not None and attention_mask.shape[1] > real_len:
1530
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1531
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1532
+ # input_ids based on the past_length.
1533
+ elif past_length < real_len:
1534
+ input_ids = input_ids[:, origin_past_len:]
1535
+ if timeseries is not None:
1536
+ timeseries = timeseries[past_num_ts:]
1537
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1538
+
1539
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1540
+ if (
1541
+ max_cache_length is not None
1542
+ and attention_mask is not None
1543
+ and cache_length + input_ids.size(1) > max_cache_length
1544
+ ):
1545
+ attention_mask = attention_mask[:, -max_cache_length:]
1546
+
1547
+ position_ids = kwargs.get("position_ids", None)
1548
+ if attention_mask is not None and position_ids is None:
1549
+ # create position_ids on the fly for batch generation
1550
+ position_ids = attention_mask.long().cumsum(-1) - 1
1551
+ position_ids.masked_fill_(attention_mask == 0, 1)
1552
+ if past_key_values:
1553
+ position_ids = position_ids[:, -input_ids.size(1) :]
1554
+
1555
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1556
+ if inputs_embeds is not None and past_key_values is None:
1557
+ model_inputs = {"inputs_embeds": inputs_embeds}
1558
+ else:
1559
+ model_inputs = {"input_ids": input_ids}
1560
+
1561
+ model_inputs.update(
1562
+ {
1563
+ "position_ids": position_ids,
1564
+ "past_key_values": past_key_values,
1565
+ "use_cache": kwargs.get("use_cache"),
1566
+ "attention_mask": attention_mask,
1567
+ "timeseries": timeseries
1568
+ }
1569
+ )
1570
+ return model_inputs
1571
+
1572
+ @staticmethod
1573
+ def _reorder_cache(past_key_values, beam_idx):
1574
+ reordered_past = ()
1575
+ for layer_past in past_key_values:
1576
+ reordered_past += (
1577
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1578
+ )
1579
+ return reordered_past
1580
+
1581
+
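The position arithmetic in `_merge_input_ids_with_time_series_features` above can be illustrated for a single sequence in plain Python. This is only a sketch under stated assumptions, not the model's code: `expanded_positions` is a hypothetical helper, the token ids 100 and 101 are made-up stand-ins for `ts_token_start_index` and `ts_token_end_index`, and lists replace tensors. It follows the rule visible in the diff: each start token opens `patch_cnt - 2` extra slots, and the start and end tokens themselves are overwritten by time-series features.

```python
from itertools import accumulate


def expanded_positions(token_ids, patch_counts, ts_start_id, ts_end_id):
    """Map one token sequence to positions in the merged (expanded) sequence.

    Hypothetical helper for illustration; ids and names are assumptions.
    """
    counts = iter(patch_counts)
    # Extra slots opened at each start marker: patch_count - 2, because the
    # start and end markers are reused as two of the patch slots.
    extra = [next(counts) - 2 if tok == ts_start_id else 0 for tok in token_ids]
    # cumsum(extra + 1) - 1 gives each original token's 0-based new position.
    new_positions = [c - 1 for c in accumulate(e + 1 for e in extra)]
    # Text embeddings are written only where the token is not a ts marker.
    text_slots = [
        pos
        for tok, pos in zip(token_ids, new_positions)
        if tok not in (ts_start_id, ts_end_id)
    ]
    taken = set(text_slots)
    # Every remaining slot receives a time-series patch embedding.
    ts_slots = [p for p in range(new_positions[-1] + 1) if p not in taken]
    return new_positions, text_slots, ts_slots


positions, text_slots, ts_slots = expanded_positions([5, 100, 101, 7], [4], 100, 101)
print(positions, text_slots, ts_slots)
```

With one time series of 4 patches, the four-token input grows to six slots: text lands at slots 0 and 5, and slots 1 to 4 hold the four patch embeddings. The sketch ignores padding and batching, which the method in the diff handles through `nb_ts_pad` and the per-batch loop.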
1582
+ @add_start_docstrings(
1583
+ """
1584
+ The Qwen2 Model transformer with a sequence classification head on top (linear layer).
1585
+
1586
+ [`Qwen2ForSequenceClassification`] uses the last token in order to do the classification, as other causal models
1587
+ (e.g. GPT-2) do.
1588
+
1589
+ Since it does classification on the last token, it requires to know the position of the last token. If a
1590
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1591
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1592
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1593
+ each row of the batch).
1594
+ """,
1595
+ QWEN2_START_DOCSTRING,
1596
+ )
1597
+ class Qwen2ForSequenceClassification(Qwen2PreTrainedModel):
1598
+ def __init__(self, config):
1599
+ super().__init__(config)
1600
+ self.num_labels = config.num_labels
1601
+ self.model = Qwen2Model(config)
1602
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1603
+
1604
+ # Initialize weights and apply final processing
1605
+ self.post_init()
1606
+
1607
+ def get_input_embeddings(self):
1608
+ return self.model.embed_tokens
1609
+
1610
+ def set_input_embeddings(self, value):
1611
+ self.model.embed_tokens = value
1612
+
1613
+ @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
1614
+ def forward(
1615
+ self,
1616
+ input_ids: torch.LongTensor = None,
1617
+ attention_mask: Optional[torch.Tensor] = None,
1618
+ position_ids: Optional[torch.LongTensor] = None,
1619
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1620
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1621
+ labels: Optional[torch.LongTensor] = None,
1622
+ use_cache: Optional[bool] = None,
1623
+ output_attentions: Optional[bool] = None,
1624
+ output_hidden_states: Optional[bool] = None,
1625
+ return_dict: Optional[bool] = None,
1626
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1627
+ r"""
1628
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1629
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1630
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
1631
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1632
+ """
1633
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1634
+
1635
+ transformer_outputs = self.model(
1636
+ input_ids,
1637
+ attention_mask=attention_mask,
1638
+ position_ids=position_ids,
1639
+ past_key_values=past_key_values,
1640
+ inputs_embeds=inputs_embeds,
1641
+ use_cache=use_cache,
1642
+ output_attentions=output_attentions,
1643
+ output_hidden_states=output_hidden_states,
1644
+ return_dict=return_dict,
1645
+ )
1646
+ hidden_states = transformer_outputs[0]
1647
+ logits = self.score(hidden_states)
1648
+
1649
+ if input_ids is not None:
1650
+ batch_size = input_ids.shape[0]
1651
+ else:
1652
+ batch_size = inputs_embeds.shape[0]
1653
+
1654
+ if self.config.pad_token_id is None and batch_size != 1:
1655
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1656
+ if self.config.pad_token_id is None:
1657
+ sequence_lengths = -1
1658
+ else:
1659
+ if input_ids is not None:
1660
+ # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
1661
+ sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
1662
+ sequence_lengths = sequence_lengths % input_ids.shape[-1]
1663
+ sequence_lengths = sequence_lengths.to(logits.device)
1664
+ else:
1665
+ sequence_lengths = -1
1666
+
1667
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1668
+
1669
+ loss = None
1670
+ if labels is not None:
1671
+ labels = labels.to(logits.device)
1672
+ if self.config.problem_type is None:
1673
+ if self.num_labels == 1:
1674
+ self.config.problem_type = "regression"
1675
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1676
+ self.config.problem_type = "single_label_classification"
1677
+ else:
1678
+ self.config.problem_type = "multi_label_classification"
1679
+
1680
+ if self.config.problem_type == "regression":
1681
+ loss_fct = MSELoss()
1682
+ if self.num_labels == 1:
1683
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1684
+ else:
1685
+ loss = loss_fct(pooled_logits, labels)
1686
+ elif self.config.problem_type == "single_label_classification":
1687
+ loss_fct = CrossEntropyLoss()
1688
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1689
+ elif self.config.problem_type == "multi_label_classification":
1690
+ loss_fct = BCEWithLogitsLoss()
1691
+ loss = loss_fct(pooled_logits, labels)
1692
+ if not return_dict:
1693
+ output = (pooled_logits,) + transformer_outputs[1:]
1694
+ return ((loss,) + output) if loss is not None else output
1695
+
1696
+ return SequenceClassifierOutputWithPast(
1697
+ loss=loss,
1698
+ logits=pooled_logits,
1699
+ past_key_values=transformer_outputs.past_key_values,
1700
+ hidden_states=transformer_outputs.hidden_states,
1701
+ attentions=transformer_outputs.attentions,
1702
+ )
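The last-token pooling in `Qwen2ForSequenceClassification.forward` above relies on a small indexing trick: take the index of the first padding token, subtract one, and reduce modulo the sequence length so that a row without padding falls back to its final position. The sketch below reproduces that arithmetic for one row in plain Python. `last_token_index` is a hypothetical name, and `list.index(max(...))` stands in for `argmax`, which likewise returns the first index of the maximum.

```python
def last_token_index(row, pad_token_id):
    """Index of the token just before the first pad token in `row`.

    Hypothetical helper mirroring `eq(...).int().argmax(-1) - 1` followed by
    `% sequence_length` from the diff, for a single sequence.
    """
    is_pad = [int(tok == pad_token_id) for tok in row]
    # First index of the maximum; this is 0 when the row has no padding.
    first_pad = is_pad.index(max(is_pad))
    # The modulo turns -1 (no padding found) into the last valid index.
    return (first_pad - 1) % len(row)


print(last_token_index([11, 12, 13, 0, 0], pad_token_id=0))
print(last_token_index([11, 12, 13], pad_token_id=0))
```

The first call selects index 2, the last real token before right padding; the second also selects index 2, because with no pad token the result wraps around to the end of the row. The modulo form avoids negative indexing, which the source comment ties to ONNX compatibility.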
pytorch_model-00001-of-00006.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:83a53e7d1a251f3b98011ab11d444925ca258396b3bb9f8666e86e526a55946f
3
+ size 4986229446
pytorch_model-00002-of-00006.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:51e7a8f7dd02f15c748419734c081fbbe694ff545415b2de298904094db14f31
3
+ size 4954871698
pytorch_model-00003-of-00006.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:acefeb6c407b39e909f6c4a3f5e2e7f721ba4242396a0b6628d7e9716009c6ba
3
+ size 4954871762
pytorch_model-00004-of-00006.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:add7c02bd933343327595571db05da2f3fe91cce9874238eb9b245ec28a72131
3
+ size 4954871762
pytorch_model-00005-of-00006.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1b7bf3b267657166ebb79aa8cf7980ae6da30aed5e0577345684955d7150c1c6
3
+ size 4954871762
pytorch_model-00006-of-00006.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8e9abb8fce6a951e2c953f96043dc42c7d0b2a0376a3c77499f265c09abe5649
3
+ size 4944481872
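The six `.bin` entries above are Git LFS pointer files rather than the weights themselves: each holds a `version` line, an `oid` line with the hash algorithm and digest, and a `size` line in bytes. A minimal sketch of reading such a pointer follows, assuming the pointer text has already been stripped of the diff's `+` prefixes and line numbers. `parse_lfs_pointer` is a hypothetical helper name, and the sample values are copied from the first shard above.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict.

    Hypothetical helper for illustration; it does no validation beyond
    requiring the `version`, `oid`, and `size` keys to be present.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", split at the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:83a53e7d1a251f3b98011ab11d444925ca258396b3bb9f8666e86e526a55946f
size 4986229446
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```

The parsed `size` and `digest` can be compared against a downloaded shard to check that the file is complete before loading it.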
pytorch_model.bin.index.json ADDED
@@ -0,0 +1,596 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 29749997568
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "pytorch_model-00006-of-00006.bin",
7
+ "model.embed_tokens.weight": "pytorch_model-00001-of-00006.bin",
8
+ "model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00006.bin",
9
+ "model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00006.bin",
10
+ "model.layers.0.mlp.gate_proj.weight": "pytorch_model-00001-of-00006.bin",
11
+ "model.layers.0.mlp.up_proj.weight": "pytorch_model-00001-of-00006.bin",
12
+ "model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00006.bin",
13
+ "model.layers.0.self_attn.k_proj.bias": "pytorch_model-00001-of-00006.bin",
14
+ "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00006.bin",
15
+ "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00006.bin",
16
+ "model.layers.0.self_attn.q_proj.bias": "pytorch_model-00001-of-00006.bin",
17
+ "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00006.bin",
18
+ "model.layers.0.self_attn.v_proj.bias": "pytorch_model-00001-of-00006.bin",
19
+ "model.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00006.bin",
20
+ "model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00006.bin",
21
+ "model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00006.bin",
22
+ "model.layers.1.mlp.gate_proj.weight": "pytorch_model-00001-of-00006.bin",
23
+ "model.layers.1.mlp.up_proj.weight": "pytorch_model-00001-of-00006.bin",
24
+ "model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00006.bin",
25
+ "model.layers.1.self_attn.k_proj.bias": "pytorch_model-00001-of-00006.bin",
26
+ "model.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00006.bin",
27
+ "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00006.bin",
28
+ "model.layers.1.self_attn.q_proj.bias": "pytorch_model-00001-of-00006.bin",
29
+ "model.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00006.bin",
30
+ "model.layers.1.self_attn.v_proj.bias": "pytorch_model-00001-of-00006.bin",
31
+ "model.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00006.bin",
32
+ "model.layers.10.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
33
+ "model.layers.10.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
34
+ "model.layers.10.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
35
+ "model.layers.10.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
36
+ "model.layers.10.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
37
+ "model.layers.10.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
38
+ "model.layers.10.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
39
+ "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
40
+ "model.layers.10.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
41
+ "model.layers.10.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
42
+ "model.layers.10.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
43
+ "model.layers.10.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
44
+ "model.layers.11.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
45
+ "model.layers.11.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
46
+ "model.layers.11.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
47
+ "model.layers.11.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
48
+ "model.layers.11.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
49
+ "model.layers.11.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
50
+ "model.layers.11.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
51
+ "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
52
+ "model.layers.11.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
53
+ "model.layers.11.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
54
+ "model.layers.11.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
55
+ "model.layers.11.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
56
+ "model.layers.12.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
57
+ "model.layers.12.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
58
+ "model.layers.12.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
59
+ "model.layers.12.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
60
+ "model.layers.12.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
61
+ "model.layers.12.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
62
+ "model.layers.12.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
63
+ "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
64
+ "model.layers.12.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
65
+ "model.layers.12.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
66
+ "model.layers.12.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
67
+ "model.layers.12.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
68
+ "model.layers.13.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
69
+ "model.layers.13.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
70
+ "model.layers.13.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
71
+ "model.layers.13.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
72
+ "model.layers.13.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
73
+ "model.layers.13.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
74
+ "model.layers.13.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
75
+ "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
76
+ "model.layers.13.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
77
+ "model.layers.13.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
78
+ "model.layers.13.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
79
+ "model.layers.13.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
80
+ "model.layers.14.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
81
+ "model.layers.14.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
82
+ "model.layers.14.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
83
+ "model.layers.14.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
84
+ "model.layers.14.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
85
+ "model.layers.14.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
86
+ "model.layers.14.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
87
+ "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
88
+ "model.layers.14.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
89
+ "model.layers.14.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
90
+ "model.layers.14.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
91
+ "model.layers.14.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
92
+ "model.layers.15.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
93
+ "model.layers.15.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
94
+ "model.layers.15.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
95
+ "model.layers.15.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
96
+ "model.layers.15.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
97
+ "model.layers.15.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
98
+ "model.layers.15.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
99
+ "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
100
+ "model.layers.15.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
101
+ "model.layers.15.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
102
+ "model.layers.15.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
103
+ "model.layers.15.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
104
+ "model.layers.16.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
105
+ "model.layers.16.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
106
+ "model.layers.16.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
107
+ "model.layers.16.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
108
+ "model.layers.16.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
109
+ "model.layers.16.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
110
+ "model.layers.16.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
111
+ "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
112
+ "model.layers.16.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
113
+ "model.layers.16.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
114
+ "model.layers.16.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
115
+ "model.layers.16.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
116
+ "model.layers.17.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
117
+ "model.layers.17.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
118
+ "model.layers.17.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
119
+ "model.layers.17.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
120
+ "model.layers.17.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
121
+ "model.layers.17.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
122
+ "model.layers.17.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
123
+ "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
124
+ "model.layers.17.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
125
+ "model.layers.17.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
126
+ "model.layers.17.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
127
+ "model.layers.17.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
128
+ "model.layers.18.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
129
+ "model.layers.18.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
130
+ "model.layers.18.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
131
+ "model.layers.18.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
132
+ "model.layers.18.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
133
+ "model.layers.18.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
134
+ "model.layers.18.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
135
+ "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
136
+ "model.layers.18.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
137
+ "model.layers.18.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
138
+ "model.layers.18.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
139
+ "model.layers.18.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
140
+ "model.layers.19.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
141
+ "model.layers.19.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
142
+ "model.layers.19.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
143
+ "model.layers.19.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
144
+ "model.layers.19.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
145
+ "model.layers.19.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
146
+ "model.layers.19.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
147
+ "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
148
+ "model.layers.19.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
149
+ "model.layers.19.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
150
+ "model.layers.19.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
151
+ "model.layers.19.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
152
+ "model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00006.bin",
153
+ "model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00006.bin",
154
+ "model.layers.2.mlp.gate_proj.weight": "pytorch_model-00001-of-00006.bin",
155
+ "model.layers.2.mlp.up_proj.weight": "pytorch_model-00001-of-00006.bin",
156
+ "model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00006.bin",
157
+ "model.layers.2.self_attn.k_proj.bias": "pytorch_model-00001-of-00006.bin",
158
+ "model.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00006.bin",
159
+ "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00006.bin",
160
+ "model.layers.2.self_attn.q_proj.bias": "pytorch_model-00001-of-00006.bin",
161
+ "model.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00006.bin",
162
+ "model.layers.2.self_attn.v_proj.bias": "pytorch_model-00001-of-00006.bin",
163
+ "model.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00006.bin",
164
+ "model.layers.20.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
165
+ "model.layers.20.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
166
+ "model.layers.20.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
167
+ "model.layers.20.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
168
+ "model.layers.20.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
169
+ "model.layers.20.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
170
+ "model.layers.20.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
171
+ "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
172
+ "model.layers.20.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
173
+ "model.layers.20.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
174
+ "model.layers.20.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
175
+ "model.layers.20.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
176
+ "model.layers.21.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
177
+ "model.layers.21.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
178
+ "model.layers.21.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
179
+ "model.layers.21.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
180
+ "model.layers.21.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
181
+ "model.layers.21.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
182
+ "model.layers.21.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
183
+ "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
184
+ "model.layers.21.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
185
+ "model.layers.21.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
186
+ "model.layers.21.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
187
+ "model.layers.21.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
188
+ "model.layers.22.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
189
+ "model.layers.22.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
190
+ "model.layers.22.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
191
+ "model.layers.22.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
192
+ "model.layers.22.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
193
+ "model.layers.22.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
194
+ "model.layers.22.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
195
+ "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
196
+ "model.layers.22.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
197
+ "model.layers.22.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
198
+ "model.layers.22.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
199
+ "model.layers.22.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
200
+ "model.layers.23.input_layernorm.weight": "pytorch_model-00003-of-00006.bin",
201
+ "model.layers.23.mlp.down_proj.weight": "pytorch_model-00003-of-00006.bin",
202
+ "model.layers.23.mlp.gate_proj.weight": "pytorch_model-00003-of-00006.bin",
203
+ "model.layers.23.mlp.up_proj.weight": "pytorch_model-00003-of-00006.bin",
204
+ "model.layers.23.post_attention_layernorm.weight": "pytorch_model-00003-of-00006.bin",
205
+ "model.layers.23.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
206
+ "model.layers.23.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
207
+ "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
208
+ "model.layers.23.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
209
+ "model.layers.23.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
210
+ "model.layers.23.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
211
+ "model.layers.23.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
212
+ "model.layers.24.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
213
+ "model.layers.24.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
214
+ "model.layers.24.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
215
+ "model.layers.24.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
216
+ "model.layers.24.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
217
+ "model.layers.24.self_attn.k_proj.bias": "pytorch_model-00003-of-00006.bin",
218
+ "model.layers.24.self_attn.k_proj.weight": "pytorch_model-00003-of-00006.bin",
219
+ "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00003-of-00006.bin",
220
+ "model.layers.24.self_attn.q_proj.bias": "pytorch_model-00003-of-00006.bin",
221
+ "model.layers.24.self_attn.q_proj.weight": "pytorch_model-00003-of-00006.bin",
222
+ "model.layers.24.self_attn.v_proj.bias": "pytorch_model-00003-of-00006.bin",
223
+ "model.layers.24.self_attn.v_proj.weight": "pytorch_model-00003-of-00006.bin",
224
+ "model.layers.25.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
225
+ "model.layers.25.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
226
+ "model.layers.25.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
227
+ "model.layers.25.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
228
+ "model.layers.25.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
229
+ "model.layers.25.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
230
+ "model.layers.25.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
231
+ "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
232
+ "model.layers.25.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
233
+ "model.layers.25.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
234
+ "model.layers.25.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
235
+ "model.layers.25.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
236
+ "model.layers.26.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
237
+ "model.layers.26.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
238
+ "model.layers.26.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
239
+ "model.layers.26.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
240
+ "model.layers.26.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
241
+ "model.layers.26.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
242
+ "model.layers.26.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
243
+ "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
244
+ "model.layers.26.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
245
+ "model.layers.26.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
246
+ "model.layers.26.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
247
+ "model.layers.26.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
248
+ "model.layers.27.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
249
+ "model.layers.27.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
250
+ "model.layers.27.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
251
+ "model.layers.27.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
252
+ "model.layers.27.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
253
+ "model.layers.27.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
254
+ "model.layers.27.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
255
+ "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
256
+ "model.layers.27.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
257
+ "model.layers.27.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
258
+ "model.layers.27.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
259
+ "model.layers.27.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
260
+ "model.layers.28.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
261
+ "model.layers.28.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
262
+ "model.layers.28.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
263
+ "model.layers.28.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
264
+ "model.layers.28.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
265
+ "model.layers.28.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
266
+ "model.layers.28.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
267
+ "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
268
+ "model.layers.28.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
269
+ "model.layers.28.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
270
+ "model.layers.28.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
271
+ "model.layers.28.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
272
+ "model.layers.29.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
273
+ "model.layers.29.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
274
+ "model.layers.29.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
275
+ "model.layers.29.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
276
+ "model.layers.29.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
277
+ "model.layers.29.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
278
+ "model.layers.29.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
279
+ "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
280
+ "model.layers.29.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
281
+ "model.layers.29.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
282
+ "model.layers.29.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
283
+ "model.layers.29.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
284
+ "model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00006.bin",
285
+ "model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00006.bin",
286
+ "model.layers.3.mlp.gate_proj.weight": "pytorch_model-00001-of-00006.bin",
287
+ "model.layers.3.mlp.up_proj.weight": "pytorch_model-00001-of-00006.bin",
288
+ "model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00006.bin",
289
+ "model.layers.3.self_attn.k_proj.bias": "pytorch_model-00001-of-00006.bin",
290
+ "model.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00006.bin",
291
+ "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00006.bin",
292
+ "model.layers.3.self_attn.q_proj.bias": "pytorch_model-00001-of-00006.bin",
293
+ "model.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00006.bin",
294
+ "model.layers.3.self_attn.v_proj.bias": "pytorch_model-00001-of-00006.bin",
295
+ "model.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00006.bin",
296
+ "model.layers.30.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
297
+ "model.layers.30.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
298
+ "model.layers.30.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
299
+ "model.layers.30.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
300
+ "model.layers.30.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
301
+ "model.layers.30.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
302
+ "model.layers.30.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
303
+ "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
304
+ "model.layers.30.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
305
+ "model.layers.30.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
306
+ "model.layers.30.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
307
+ "model.layers.30.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
308
+ "model.layers.31.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
309
+ "model.layers.31.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
310
+ "model.layers.31.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
311
+ "model.layers.31.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
312
+ "model.layers.31.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
313
+ "model.layers.31.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
314
+ "model.layers.31.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
315
+ "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
316
+ "model.layers.31.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
317
+ "model.layers.31.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
318
+ "model.layers.31.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
319
+ "model.layers.31.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
320
+ "model.layers.32.input_layernorm.weight": "pytorch_model-00004-of-00006.bin",
321
+ "model.layers.32.mlp.down_proj.weight": "pytorch_model-00004-of-00006.bin",
322
+ "model.layers.32.mlp.gate_proj.weight": "pytorch_model-00004-of-00006.bin",
323
+ "model.layers.32.mlp.up_proj.weight": "pytorch_model-00004-of-00006.bin",
324
+ "model.layers.32.post_attention_layernorm.weight": "pytorch_model-00004-of-00006.bin",
325
+ "model.layers.32.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
326
+ "model.layers.32.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
327
+ "model.layers.32.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
328
+ "model.layers.32.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
329
+ "model.layers.32.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
330
+ "model.layers.32.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
331
+ "model.layers.32.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
332
+ "model.layers.33.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
333
+ "model.layers.33.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
334
+ "model.layers.33.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
335
+ "model.layers.33.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
336
+ "model.layers.33.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
337
+ "model.layers.33.self_attn.k_proj.bias": "pytorch_model-00004-of-00006.bin",
338
+ "model.layers.33.self_attn.k_proj.weight": "pytorch_model-00004-of-00006.bin",
339
+ "model.layers.33.self_attn.o_proj.weight": "pytorch_model-00004-of-00006.bin",
340
+ "model.layers.33.self_attn.q_proj.bias": "pytorch_model-00004-of-00006.bin",
341
+ "model.layers.33.self_attn.q_proj.weight": "pytorch_model-00004-of-00006.bin",
342
+ "model.layers.33.self_attn.v_proj.bias": "pytorch_model-00004-of-00006.bin",
343
+ "model.layers.33.self_attn.v_proj.weight": "pytorch_model-00004-of-00006.bin",
344
+ "model.layers.34.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
345
+ "model.layers.34.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
346
+ "model.layers.34.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
347
+ "model.layers.34.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
348
+ "model.layers.34.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
349
+ "model.layers.34.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
350
+ "model.layers.34.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
351
+ "model.layers.34.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
352
+ "model.layers.34.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
353
+ "model.layers.34.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
354
+ "model.layers.34.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
355
+ "model.layers.34.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
356
+ "model.layers.35.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
357
+ "model.layers.35.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
358
+ "model.layers.35.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
359
+ "model.layers.35.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
360
+ "model.layers.35.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
361
+ "model.layers.35.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
362
+ "model.layers.35.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
363
+ "model.layers.35.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
364
+ "model.layers.35.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
365
+ "model.layers.35.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
366
+ "model.layers.35.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
367
+ "model.layers.35.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
368
+ "model.layers.36.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
369
+ "model.layers.36.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
370
+ "model.layers.36.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
371
+ "model.layers.36.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
372
+ "model.layers.36.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
373
+ "model.layers.36.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
374
+ "model.layers.36.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
375
+ "model.layers.36.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
376
+ "model.layers.36.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
377
+ "model.layers.36.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
378
+ "model.layers.36.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
379
+ "model.layers.36.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
380
+ "model.layers.37.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
381
+ "model.layers.37.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
382
+ "model.layers.37.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
383
+ "model.layers.37.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
384
+ "model.layers.37.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
385
+ "model.layers.37.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
386
+ "model.layers.37.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
387
+ "model.layers.37.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
388
+ "model.layers.37.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
389
+ "model.layers.37.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
390
+ "model.layers.37.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
391
+ "model.layers.37.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
392
+ "model.layers.38.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
393
+ "model.layers.38.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
394
+ "model.layers.38.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
395
+ "model.layers.38.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
396
+ "model.layers.38.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
397
+ "model.layers.38.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
398
+ "model.layers.38.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
399
+ "model.layers.38.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
400
+ "model.layers.38.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
401
+ "model.layers.38.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
402
+ "model.layers.38.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
403
+ "model.layers.38.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
404
+ "model.layers.39.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
405
+ "model.layers.39.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
406
+ "model.layers.39.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
407
+ "model.layers.39.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
408
+ "model.layers.39.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
409
+ "model.layers.39.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
410
+ "model.layers.39.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
411
+ "model.layers.39.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
412
+ "model.layers.39.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
413
+ "model.layers.39.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
414
+ "model.layers.39.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
415
+ "model.layers.39.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
416
+ "model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00006.bin",
417
+ "model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00006.bin",
418
+ "model.layers.4.mlp.gate_proj.weight": "pytorch_model-00001-of-00006.bin",
419
+ "model.layers.4.mlp.up_proj.weight": "pytorch_model-00001-of-00006.bin",
420
+ "model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00006.bin",
421
+ "model.layers.4.self_attn.k_proj.bias": "pytorch_model-00001-of-00006.bin",
422
+ "model.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00006.bin",
423
+ "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00006.bin",
424
+ "model.layers.4.self_attn.q_proj.bias": "pytorch_model-00001-of-00006.bin",
425
+ "model.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00006.bin",
426
+ "model.layers.4.self_attn.v_proj.bias": "pytorch_model-00001-of-00006.bin",
427
+ "model.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00006.bin",
428
+ "model.layers.40.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
429
+ "model.layers.40.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
430
+ "model.layers.40.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
431
+ "model.layers.40.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
432
+ "model.layers.40.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
433
+ "model.layers.40.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
434
+ "model.layers.40.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
435
+ "model.layers.40.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
436
+ "model.layers.40.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
437
+ "model.layers.40.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
438
+ "model.layers.40.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
439
+ "model.layers.40.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
440
+ "model.layers.41.input_layernorm.weight": "pytorch_model-00005-of-00006.bin",
441
+ "model.layers.41.mlp.down_proj.weight": "pytorch_model-00005-of-00006.bin",
442
+ "model.layers.41.mlp.gate_proj.weight": "pytorch_model-00005-of-00006.bin",
443
+ "model.layers.41.mlp.up_proj.weight": "pytorch_model-00005-of-00006.bin",
444
+ "model.layers.41.post_attention_layernorm.weight": "pytorch_model-00005-of-00006.bin",
445
+ "model.layers.41.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
446
+ "model.layers.41.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
447
+ "model.layers.41.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
448
+ "model.layers.41.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
449
+ "model.layers.41.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
450
+ "model.layers.41.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
451
+ "model.layers.41.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
452
+ "model.layers.42.input_layernorm.weight": "pytorch_model-00006-of-00006.bin",
453
+ "model.layers.42.mlp.down_proj.weight": "pytorch_model-00006-of-00006.bin",
454
+ "model.layers.42.mlp.gate_proj.weight": "pytorch_model-00006-of-00006.bin",
455
+ "model.layers.42.mlp.up_proj.weight": "pytorch_model-00006-of-00006.bin",
456
+ "model.layers.42.post_attention_layernorm.weight": "pytorch_model-00006-of-00006.bin",
457
+ "model.layers.42.self_attn.k_proj.bias": "pytorch_model-00005-of-00006.bin",
458
+ "model.layers.42.self_attn.k_proj.weight": "pytorch_model-00005-of-00006.bin",
459
+ "model.layers.42.self_attn.o_proj.weight": "pytorch_model-00005-of-00006.bin",
460
+ "model.layers.42.self_attn.q_proj.bias": "pytorch_model-00005-of-00006.bin",
461
+ "model.layers.42.self_attn.q_proj.weight": "pytorch_model-00005-of-00006.bin",
462
+ "model.layers.42.self_attn.v_proj.bias": "pytorch_model-00005-of-00006.bin",
463
+ "model.layers.42.self_attn.v_proj.weight": "pytorch_model-00005-of-00006.bin",
464
+ "model.layers.43.input_layernorm.weight": "pytorch_model-00006-of-00006.bin",
465
+ "model.layers.43.mlp.down_proj.weight": "pytorch_model-00006-of-00006.bin",
466
+ "model.layers.43.mlp.gate_proj.weight": "pytorch_model-00006-of-00006.bin",
467
+ "model.layers.43.mlp.up_proj.weight": "pytorch_model-00006-of-00006.bin",
468
+ "model.layers.43.post_attention_layernorm.weight": "pytorch_model-00006-of-00006.bin",
469
+ "model.layers.43.self_attn.k_proj.bias": "pytorch_model-00006-of-00006.bin",
470
+ "model.layers.43.self_attn.k_proj.weight": "pytorch_model-00006-of-00006.bin",
471
+ "model.layers.43.self_attn.o_proj.weight": "pytorch_model-00006-of-00006.bin",
472
+ "model.layers.43.self_attn.q_proj.bias": "pytorch_model-00006-of-00006.bin",
473
+ "model.layers.43.self_attn.q_proj.weight": "pytorch_model-00006-of-00006.bin",
474
+ "model.layers.43.self_attn.v_proj.bias": "pytorch_model-00006-of-00006.bin",
475
+ "model.layers.43.self_attn.v_proj.weight": "pytorch_model-00006-of-00006.bin",
476
+ "model.layers.44.input_layernorm.weight": "pytorch_model-00006-of-00006.bin",
477
+ "model.layers.44.mlp.down_proj.weight": "pytorch_model-00006-of-00006.bin",
478
+ "model.layers.44.mlp.gate_proj.weight": "pytorch_model-00006-of-00006.bin",
479
+ "model.layers.44.mlp.up_proj.weight": "pytorch_model-00006-of-00006.bin",
480
+ "model.layers.44.post_attention_layernorm.weight": "pytorch_model-00006-of-00006.bin",
481
+ "model.layers.44.self_attn.k_proj.bias": "pytorch_model-00006-of-00006.bin",
482
+ "model.layers.44.self_attn.k_proj.weight": "pytorch_model-00006-of-00006.bin",
483
+ "model.layers.44.self_attn.o_proj.weight": "pytorch_model-00006-of-00006.bin",
484
+ "model.layers.44.self_attn.q_proj.bias": "pytorch_model-00006-of-00006.bin",
485
+ "model.layers.44.self_attn.q_proj.weight": "pytorch_model-00006-of-00006.bin",
486
+ "model.layers.44.self_attn.v_proj.bias": "pytorch_model-00006-of-00006.bin",
487
+ "model.layers.44.self_attn.v_proj.weight": "pytorch_model-00006-of-00006.bin",
488
+ "model.layers.45.input_layernorm.weight": "pytorch_model-00006-of-00006.bin",
489
+ "model.layers.45.mlp.down_proj.weight": "pytorch_model-00006-of-00006.bin",
490
+ "model.layers.45.mlp.gate_proj.weight": "pytorch_model-00006-of-00006.bin",
491
+ "model.layers.45.mlp.up_proj.weight": "pytorch_model-00006-of-00006.bin",
492
+ "model.layers.45.post_attention_layernorm.weight": "pytorch_model-00006-of-00006.bin",
493
+ "model.layers.45.self_attn.k_proj.bias": "pytorch_model-00006-of-00006.bin",
494
+ "model.layers.45.self_attn.k_proj.weight": "pytorch_model-00006-of-00006.bin",
495
+ "model.layers.45.self_attn.o_proj.weight": "pytorch_model-00006-of-00006.bin",
496
+ "model.layers.45.self_attn.q_proj.bias": "pytorch_model-00006-of-00006.bin",
497
+ "model.layers.45.self_attn.q_proj.weight": "pytorch_model-00006-of-00006.bin",
498
+ "model.layers.45.self_attn.v_proj.bias": "pytorch_model-00006-of-00006.bin",
499
+ "model.layers.45.self_attn.v_proj.weight": "pytorch_model-00006-of-00006.bin",
500
+ "model.layers.46.input_layernorm.weight": "pytorch_model-00006-of-00006.bin",
501
+ "model.layers.46.mlp.down_proj.weight": "pytorch_model-00006-of-00006.bin",
502
+ "model.layers.46.mlp.gate_proj.weight": "pytorch_model-00006-of-00006.bin",
503
+ "model.layers.46.mlp.up_proj.weight": "pytorch_model-00006-of-00006.bin",
504
+ "model.layers.46.post_attention_layernorm.weight": "pytorch_model-00006-of-00006.bin",
505
+ "model.layers.46.self_attn.k_proj.bias": "pytorch_model-00006-of-00006.bin",
506
+ "model.layers.46.self_attn.k_proj.weight": "pytorch_model-00006-of-00006.bin",
507
+ "model.layers.46.self_attn.o_proj.weight": "pytorch_model-00006-of-00006.bin",
508
+ "model.layers.46.self_attn.q_proj.bias": "pytorch_model-00006-of-00006.bin",
509
+ "model.layers.46.self_attn.q_proj.weight": "pytorch_model-00006-of-00006.bin",
510
+ "model.layers.46.self_attn.v_proj.bias": "pytorch_model-00006-of-00006.bin",
511
+ "model.layers.46.self_attn.v_proj.weight": "pytorch_model-00006-of-00006.bin",
512
+ "model.layers.47.input_layernorm.weight": "pytorch_model-00006-of-00006.bin",
513
+ "model.layers.47.mlp.down_proj.weight": "pytorch_model-00006-of-00006.bin",
514
+ "model.layers.47.mlp.gate_proj.weight": "pytorch_model-00006-of-00006.bin",
515
+ "model.layers.47.mlp.up_proj.weight": "pytorch_model-00006-of-00006.bin",
516
+ "model.layers.47.post_attention_layernorm.weight": "pytorch_model-00006-of-00006.bin",
517
+ "model.layers.47.self_attn.k_proj.bias": "pytorch_model-00006-of-00006.bin",
518
+ "model.layers.47.self_attn.k_proj.weight": "pytorch_model-00006-of-00006.bin",
519
+ "model.layers.47.self_attn.o_proj.weight": "pytorch_model-00006-of-00006.bin",
520
+ "model.layers.47.self_attn.q_proj.bias": "pytorch_model-00006-of-00006.bin",
521
+ "model.layers.47.self_attn.q_proj.weight": "pytorch_model-00006-of-00006.bin",
522
+ "model.layers.47.self_attn.v_proj.bias": "pytorch_model-00006-of-00006.bin",
523
+ "model.layers.47.self_attn.v_proj.weight": "pytorch_model-00006-of-00006.bin",
524
+ "model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00006.bin",
525
+ "model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00006.bin",
526
+ "model.layers.5.mlp.gate_proj.weight": "pytorch_model-00001-of-00006.bin",
527
+ "model.layers.5.mlp.up_proj.weight": "pytorch_model-00001-of-00006.bin",
528
+ "model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00006.bin",
529
+ "model.layers.5.self_attn.k_proj.bias": "pytorch_model-00001-of-00006.bin",
530
+ "model.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00006.bin",
531
+ "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00006.bin",
532
+ "model.layers.5.self_attn.q_proj.bias": "pytorch_model-00001-of-00006.bin",
533
+ "model.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00006.bin",
534
+ "model.layers.5.self_attn.v_proj.bias": "pytorch_model-00001-of-00006.bin",
535
+ "model.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00006.bin",
536
+ "model.layers.6.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
537
+ "model.layers.6.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
538
+ "model.layers.6.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
539
+ "model.layers.6.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
540
+ "model.layers.6.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
541
+ "model.layers.6.self_attn.k_proj.bias": "pytorch_model-00001-of-00006.bin",
542
+ "model.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00006.bin",
543
+ "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00006.bin",
544
+ "model.layers.6.self_attn.q_proj.bias": "pytorch_model-00001-of-00006.bin",
545
+ "model.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00006.bin",
546
+ "model.layers.6.self_attn.v_proj.bias": "pytorch_model-00001-of-00006.bin",
547
+ "model.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00006.bin",
548
+ "model.layers.7.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
549
+ "model.layers.7.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
550
+ "model.layers.7.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
551
+ "model.layers.7.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
552
+ "model.layers.7.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
553
+ "model.layers.7.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
554
+ "model.layers.7.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
555
+ "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
556
+ "model.layers.7.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
557
+ "model.layers.7.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
558
+ "model.layers.7.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
559
+ "model.layers.7.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
560
+ "model.layers.8.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
561
+ "model.layers.8.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
562
+ "model.layers.8.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
563
+ "model.layers.8.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
564
+ "model.layers.8.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
565
+ "model.layers.8.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
566
+ "model.layers.8.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
567
+ "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
568
+ "model.layers.8.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
569
+ "model.layers.8.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
570
+ "model.layers.8.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
571
+ "model.layers.8.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
572
+ "model.layers.9.input_layernorm.weight": "pytorch_model-00002-of-00006.bin",
573
+ "model.layers.9.mlp.down_proj.weight": "pytorch_model-00002-of-00006.bin",
574
+ "model.layers.9.mlp.gate_proj.weight": "pytorch_model-00002-of-00006.bin",
575
+ "model.layers.9.mlp.up_proj.weight": "pytorch_model-00002-of-00006.bin",
576
+ "model.layers.9.post_attention_layernorm.weight": "pytorch_model-00002-of-00006.bin",
577
+ "model.layers.9.self_attn.k_proj.bias": "pytorch_model-00002-of-00006.bin",
578
+ "model.layers.9.self_attn.k_proj.weight": "pytorch_model-00002-of-00006.bin",
579
+ "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00002-of-00006.bin",
580
+ "model.layers.9.self_attn.q_proj.bias": "pytorch_model-00002-of-00006.bin",
581
+ "model.layers.9.self_attn.q_proj.weight": "pytorch_model-00002-of-00006.bin",
582
+ "model.layers.9.self_attn.v_proj.bias": "pytorch_model-00002-of-00006.bin",
583
+ "model.layers.9.self_attn.v_proj.weight": "pytorch_model-00002-of-00006.bin",
584
+ "model.norm.weight": "pytorch_model-00006-of-00006.bin",
585
+ "ts_encoder.mlp.0.bias": "pytorch_model-00006-of-00006.bin",
586
+ "ts_encoder.mlp.0.weight": "pytorch_model-00006-of-00006.bin",
587
+ "ts_encoder.mlp.2.bias": "pytorch_model-00006-of-00006.bin",
588
+ "ts_encoder.mlp.2.weight": "pytorch_model-00006-of-00006.bin",
589
+ "ts_encoder.mlp.4.bias": "pytorch_model-00006-of-00006.bin",
590
+ "ts_encoder.mlp.4.weight": "pytorch_model-00006-of-00006.bin",
591
+ "ts_encoder.mlp.6.bias": "pytorch_model-00006-of-00006.bin",
592
+ "ts_encoder.mlp.6.weight": "pytorch_model-00006-of-00006.bin",
593
+ "ts_encoder.mlp.8.bias": "pytorch_model-00006-of-00006.bin",
594
+ "ts_encoder.mlp.8.weight": "pytorch_model-00006-of-00006.bin"
595
+ }
596
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,20 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<ts>",
4
+ "<ts/>"
5
+ ],
6
+ "eos_token": {
7
+ "content": "<|im_end|>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false
12
+ },
13
+ "pad_token": {
14
+ "content": "<|endoftext|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ }
20
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:309a7f94e51f1104e6687b31284915a0349755302b483d851f466650dc2ebc67
3
+ size 11422259
tokenizer_config.json ADDED
@@ -0,0 +1,213 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<ts>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ },
189
+ "151666": {
190
+ "content": "<ts/>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": true
196
+ }
197
+ },
198
+ "additional_special_tokens": [
199
+ "<ts>",
200
+ "<ts/>"
201
+ ],
202
+ "bos_token": null,
203
+ "chat_template": "{% set system_message = 'You are a helpful assistant.' %}{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ '<|im_start|>system\n' + system_message + '<|im_end|>\n' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\n' + content + '<|im_end|>\n<|im_start|>assistant\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\n' }}{% endif %}{% endfor %}",
204
+ "clean_up_tokenization_spaces": false,
205
+ "eos_token": "<|im_end|>",
206
+ "errors": "replace",
207
+ "model_max_length": 131072,
208
+ "pad_token": "<|endoftext|>",
209
+ "padding_side": "right",
210
+ "split_special_tokens": false,
211
+ "tokenizer_class": "Qwen2Tokenizer",
212
+ "unk_token": null
213
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff