Add ApplyRoPE and RMSNorm kernels written in OpenAI Triton

#6 by wangzihan99
NOTICE CHANGED
@@ -49,4 +49,232 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
49
  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
50
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
51
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
52
- SOFTWARE.
49
  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
50
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
51
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
52
+ SOFTWARE.
53
+
54
+ ------------- LICENSE FOR stanford_alpaca code --------------
55
+
56
+ Apache License
57
+ Version 2.0, January 2004
58
+ http://www.apache.org/licenses/
59
+
60
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
61
+
62
+ 1. Definitions.
63
+
64
+ "License" shall mean the terms and conditions for use, reproduction,
65
+ and distribution as defined by Sections 1 through 9 of this document.
66
+
67
+ "Licensor" shall mean the copyright owner or entity authorized by
68
+ the copyright owner that is granting the License.
69
+
70
+ "Legal Entity" shall mean the union of the acting entity and all
71
+ other entities that control, are controlled by, or are under common
72
+ control with that entity. For the purposes of this definition,
73
+ "control" means (i) the power, direct or indirect, to cause the
74
+ direction or management of such entity, whether by contract or
75
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
76
+ outstanding shares, or (iii) beneficial ownership of such entity.
77
+
78
+ "You" (or "Your") shall mean an individual or Legal Entity
79
+ exercising permissions granted by this License.
80
+
81
+ "Source" form shall mean the preferred form for making modifications,
82
+ including but not limited to software source code, documentation
83
+ source, and configuration files.
84
+
85
+ "Object" form shall mean any form resulting from mechanical
86
+ transformation or translation of a Source form, including but
87
+ not limited to compiled object code, generated documentation,
88
+ and conversions to other media types.
89
+
90
+ "Work" shall mean the work of authorship, whether in Source or
91
+ Object form, made available under the License, as indicated by a
92
+ copyright notice that is included in or attached to the work
93
+ (an example is provided in the Appendix below).
94
+
95
+ "Derivative Works" shall mean any work, whether in Source or Object
96
+ form, that is based on (or derived from) the Work and for which the
97
+ editorial revisions, annotations, elaborations, or other modifications
98
+ represent, as a whole, an original work of authorship. For the purposes
99
+ of this License, Derivative Works shall not include works that remain
100
+ separable from, or merely link (or bind by name) to the interfaces of,
101
+ the Work and Derivative Works thereof.
102
+
103
+ "Contribution" shall mean any work of authorship, including
104
+ the original version of the Work and any modifications or additions
105
+ to that Work or Derivative Works thereof, that is intentionally
106
+ submitted to Licensor for inclusion in the Work by the copyright owner
107
+ or by an individual or Legal Entity authorized to submit on behalf of
108
+ the copyright owner. For the purposes of this definition, "submitted"
109
+ means any form of electronic, verbal, or written communication sent
110
+ to the Licensor or its representatives, including but not limited to
111
+ communication on electronic mailing lists, source code control systems,
112
+ and issue tracking systems that are managed by, or on behalf of, the
113
+ Licensor for the purpose of discussing and improving the Work, but
114
+ excluding communication that is conspicuously marked or otherwise
115
+ designated in writing by the copyright owner as "Not a Contribution."
116
+
117
+ "Contributor" shall mean Licensor and any individual or Legal Entity
118
+ on behalf of whom a Contribution has been received by Licensor and
119
+ subsequently incorporated within the Work.
120
+
121
+ 2. Grant of Copyright License. Subject to the terms and conditions of
122
+ this License, each Contributor hereby grants to You a perpetual,
123
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
124
+ copyright license to reproduce, prepare Derivative Works of,
125
+ publicly display, publicly perform, sublicense, and distribute the
126
+ Work and such Derivative Works in Source or Object form.
127
+
128
+ 3. Grant of Patent License. Subject to the terms and conditions of
129
+ this License, each Contributor hereby grants to You a perpetual,
130
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
131
+ (except as stated in this section) patent license to make, have made,
132
+ use, offer to sell, sell, import, and otherwise transfer the Work,
133
+ where such license applies only to those patent claims licensable
134
+ by such Contributor that are necessarily infringed by their
135
+ Contribution(s) alone or by combination of their Contribution(s)
136
+ with the Work to which such Contribution(s) was submitted. If You
137
+ institute patent litigation against any entity (including a
138
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
139
+ or a Contribution incorporated within the Work constitutes direct
140
+ or contributory patent infringement, then any patent licenses
141
+ granted to You under this License for that Work shall terminate
142
+ as of the date such litigation is filed.
143
+
144
+ 4. Redistribution. You may reproduce and distribute copies of the
145
+ Work or Derivative Works thereof in any medium, with or without
146
+ modifications, and in Source or Object form, provided that You
147
+ meet the following conditions:
148
+
149
+ (a) You must give any other recipients of the Work or
150
+ Derivative Works a copy of this License; and
151
+
152
+ (b) You must cause any modified files to carry prominent notices
153
+ stating that You changed the files; and
154
+
155
+ (c) You must retain, in the Source form of any Derivative Works
156
+ that You distribute, all copyright, patent, trademark, and
157
+ attribution notices from the Source form of the Work,
158
+ excluding those notices that do not pertain to any part of
159
+ the Derivative Works; and
160
+
161
+ (d) If the Work includes a "NOTICE" text file as part of its
162
+ distribution, then any Derivative Works that You distribute must
163
+ include a readable copy of the attribution notices contained
164
+ within such NOTICE file, excluding those notices that do not
165
+ pertain to any part of the Derivative Works, in at least one
166
+ of the following places: within a NOTICE text file distributed
167
+ as part of the Derivative Works; within the Source form or
168
+ documentation, if provided along with the Derivative Works; or,
169
+ within a display generated by the Derivative Works, if and
170
+ wherever such third-party notices normally appear. The contents
171
+ of the NOTICE file are for informational purposes only and
172
+ do not modify the License. You may add Your own attribution
173
+ notices within Derivative Works that You distribute, alongside
174
+ or as an addendum to the NOTICE text from the Work, provided
175
+ that such additional attribution notices cannot be construed
176
+ as modifying the License.
177
+
178
+ You may add Your own copyright statement to Your modifications and
179
+ may provide additional or different license terms and conditions
180
+ for use, reproduction, or distribution of Your modifications, or
181
+ for any such Derivative Works as a whole, provided Your use,
182
+ reproduction, and distribution of the Work otherwise complies with
183
+ the conditions stated in this License.
184
+
185
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
186
+ any Contribution intentionally submitted for inclusion in the Work
187
+ by You to the Licensor shall be under the terms and conditions of
188
+ this License, without any additional terms or conditions.
189
+ Notwithstanding the above, nothing herein shall supersede or modify
190
+ the terms of any separate license agreement you may have executed
191
+ with Licensor regarding such Contributions.
192
+
193
+ 6. Trademarks. This License does not grant permission to use the trade
194
+ names, trademarks, service marks, or product names of the Licensor,
195
+ except as required for reasonable and customary use in describing the
196
+ origin of the Work and reproducing the content of the NOTICE file.
197
+
198
+ 7. Disclaimer of Warranty. Unless required by applicable law or
199
+ agreed to in writing, Licensor provides the Work (and each
200
+ Contributor provides its Contributions) on an "AS IS" BASIS,
201
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
202
+ implied, including, without limitation, any warranties or conditions
203
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
204
+ PARTICULAR PURPOSE. You are solely responsible for determining the
205
+ appropriateness of using or redistributing the Work and assume any
206
+ risks associated with Your exercise of permissions under this License.
207
+
208
+ 8. Limitation of Liability. In no event and under no legal theory,
209
+ whether in tort (including negligence), contract, or otherwise,
210
+ unless required by applicable law (such as deliberate and grossly
211
+ negligent acts) or agreed to in writing, shall any Contributor be
212
+ liable to You for damages, including any direct, indirect, special,
213
+ incidental, or consequential damages of any character arising as a
214
+ result of this License or out of the use or inability to use the
215
+ Work (including but not limited to damages for loss of goodwill,
216
+ work stoppage, computer failure or malfunction, or any and all
217
+ other commercial damages or losses), even if such Contributor
218
+ has been advised of the possibility of such damages.
219
+
220
+ 9. Accepting Warranty or Additional Liability. While redistributing
221
+ the Work or Derivative Works thereof, You may choose to offer,
222
+ and charge a fee for, acceptance of support, warranty, indemnity,
223
+ or other liability obligations and/or rights consistent with this
224
+ License. However, in accepting such obligations, You may act only
225
+ on Your own behalf and on Your sole responsibility, not on behalf
226
+ of any other Contributor, and only if You agree to indemnify,
227
+ defend, and hold each Contributor harmless for any liability
228
+ incurred by, or claims asserted against, such Contributor by reason
229
+ of your accepting any such warranty or additional liability.
230
+
231
+ END OF TERMS AND CONDITIONS
232
+
233
+ APPENDIX: How to apply the Apache License to your work.
234
+
235
+ To apply the Apache License to your work, attach the following
236
+ boilerplate notice, with the fields enclosed by brackets "[]"
237
+ replaced with your own identifying information. (Don't include
238
+ the brackets!) The text should be enclosed in the appropriate
239
+ comment syntax for the file format. We also recommend that a
240
+ file or class name and description of purpose be included on the
241
+ same "printed page" as the copyright notice for easier
242
+ identification within third-party archives.
243
+
244
+ Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
245
+
246
+ Licensed under the Apache License, Version 2.0 (the "License");
247
+ you may not use this file except in compliance with the License.
248
+ You may obtain a copy of the License at
249
+
250
+ http://www.apache.org/licenses/LICENSE-2.0
251
+
252
+ Unless required by applicable law or agreed to in writing, software
253
+ distributed under the License is distributed on an "AS IS" BASIS,
254
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
255
+ See the License for the specific language governing permissions and
256
+ limitations under the License.
257
+
258
+ ------------- LICENSE FOR PanQiWei AutoGPTQ code --------------
259
+
260
+ MIT License
261
+
262
+ Copyright (c) 2023 潘其威(William)
263
+
264
+ Permission is hereby granted, free of charge, to any person obtaining a copy
265
+ of this software and associated documentation files (the "Software"), to deal
266
+ in the Software without restriction, including without limitation the rights
267
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
268
+ copies of the Software, and to permit persons to whom the Software is
269
+ furnished to do so, subject to the following conditions:
270
+
271
+ The above copyright notice and this permission notice shall be included in all
272
+ copies or substantial portions of the Software.
273
+
274
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
275
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
276
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
277
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
278
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
279
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
280
+ SOFTWARE.
README.md CHANGED
@@ -16,11 +16,11 @@ inference: false
16
  <br>
17
 
18
  <p align="center">
19
- 🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a>&nbsp&nbsp | &nbsp&nbsp🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
20
  <br>
21
- <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp DingTalk (钉钉) &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
22
  </p>
23
- <br><br>
24
 
25
  ## 介绍(Introduction)
26
 
@@ -597,9 +597,9 @@ If you find our work helpful, feel free to give us a cite.
597
 
598
  ## 使用协议(License Agreement)
599
 
600
- 我们的代码和模型权重对学术研究完全开放,并支持商用。请查看[LICENSE](https://github.com/QwenLM/Qwen/blob/main/LICENSE)了解具体的开源协议细节。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。
601
 
602
- Our code and checkpoints are open to research purpose, and they are allowed for commercial purposes. Check [LICENSE](https://github.com/QwenLM/Qwen/blob/main/LICENSE) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
603
  <br>
604
 
605
 
 
16
  <br>
17
 
18
  <p align="center">
19
+ 🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a> &nbsp&nbsp | &nbsp&nbsp🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
20
  <br>
21
+ <a href="assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://dashscope.aliyun.com">API</a>
22
  </p>
23
+ <br>
24
 
25
  ## 介绍(Introduction)
26
 
 
597
 
598
  ## 使用协议(License Agreement)
599
 
600
+ 我们的代码和模型权重对学术研究完全开放,并支持商用。请查看[LICENSE](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT)了解具体的开源协议细节。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。
601
 
602
+ Our code and checkpoints are open to research purpose, and they are allowed for commercial purposes. Check [LICENSE](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
603
  <br>
604
 
605
 
assets/logo.jpg CHANGED
assets/wechat.png CHANGED
config.json CHANGED
@@ -16,7 +16,7 @@
16
  "initializer_range": 0.02,
17
  "kv_channels": 128,
18
  "layer_norm_epsilon": 1e-06,
19
- "max_position_embeddings": 8192,
20
  "model_type": "qwen",
21
  "no_bias": true,
22
  "num_attention_heads": 32,
@@ -44,6 +44,7 @@
44
  "use_cache": true,
45
  "use_dynamic_ntk": true,
46
  "use_flash_attn": "auto",
 
47
  "use_logn_attn": true,
48
  "vocab_size": 151936
49
  }
 
16
  "initializer_range": 0.02,
17
  "kv_channels": 128,
18
  "layer_norm_epsilon": 1e-06,
19
+ "max_position_embeddings": 32768,
20
  "model_type": "qwen",
21
  "no_bias": true,
22
  "num_attention_heads": 32,
 
44
  "use_cache": true,
45
  "use_dynamic_ntk": true,
46
  "use_flash_attn": "auto",
47
+ "use_triton": "auto",
48
  "use_logn_attn": true,
49
  "vocab_size": 151936
50
  }
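
Note (for reference, not part of the diff): the new `use_triton` entry mirrors the existing `use_flash_attn` switch. `"auto"` defers the decision to load time (it is resolved against `SUPPORT_TORCH2` in `modeling_qwen.py`), and an explicit boolean forces the Triton kernels on or off. A minimal loading sketch, assuming a remote-code Qwen checkpoint that ships this config; the checkpoint id is illustrative, and passing `use_triton` as a keyword relies on `from_pretrained` forwarding config-attribute kwargs, the same way the existing `bf16`/`fp16`/`use_flash_attn` flags are usually overridden:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"  # illustrative; any checkpoint shipping this config

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,  # required: configuration_qwen.py / modeling_qwen.py live in the repo
    use_triton="auto",       # or True / False to force the Triton kernels on or off
).eval()
```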
configuration_qwen.py CHANGED
@@ -32,6 +32,7 @@ class QWenConfig(PretrainedConfig):
32
  use_dynamic_ntk=True,
33
  use_logn_attn=True,
34
  use_flash_attn="auto",
 
35
  intermediate_size=22016,
36
  no_bias=True,
37
  tie_word_embeddings=False,
@@ -61,6 +62,7 @@ class QWenConfig(PretrainedConfig):
61
  self.use_dynamic_ntk = use_dynamic_ntk
62
  self.use_logn_attn = use_logn_attn
63
  self.use_flash_attn = use_flash_attn
 
64
  self.no_bias = no_bias
65
  self.use_cache_quantization = use_cache_quantization
66
  self.use_cache_kernel = use_cache_kernel
 
32
  use_dynamic_ntk=True,
33
  use_logn_attn=True,
34
  use_flash_attn="auto",
35
+ use_triton="auto",
36
  intermediate_size=22016,
37
  no_bias=True,
38
  tie_word_embeddings=False,
 
62
  self.use_dynamic_ntk = use_dynamic_ntk
63
  self.use_logn_attn = use_logn_attn
64
  self.use_flash_attn = use_flash_attn
65
+ self.use_triton = use_triton
66
  self.no_bias = no_bias
67
  self.use_cache_quantization = use_cache_quantization
68
  self.use_cache_kernel = use_cache_kernel
generation_config.json CHANGED
@@ -1,12 +1,12 @@
1
  {
2
- "chat_format": "chatml",
3
- "eos_token_id": 151643,
4
- "pad_token_id": 151643,
5
- "max_window_size": 6144,
6
- "max_new_tokens": 512,
7
- "do_sample": true,
8
- "top_k": 0,
9
- "top_p": 0.8,
10
- "repetition_penalty": 1.1,
11
- "transformers_version": "4.31.0"
12
- }
 
1
  {
2
+ "chat_format": "chatml",
3
+ "eos_token_id": 151643,
4
+ "pad_token_id": 151643,
5
+ "max_window_size": 24000,
6
+ "max_new_tokens": 512,
7
+ "do_sample": true,
8
+ "top_k": 0,
9
+ "top_p": 0.8,
10
+ "repetition_penalty": 1.1,
11
+ "transformers_version": "4.31.0"
12
+ }
modeling_qwen.py CHANGED
@@ -13,7 +13,6 @@ import torch
13
  import torch.nn.functional as F
14
  import torch.utils.checkpoint
15
  import warnings
16
- from torch.cuda.amp import autocast
17
 
18
  from torch.nn import CrossEntropyLoss
19
  from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList
@@ -36,7 +35,7 @@ except ImportError:
36
  from torch import nn
37
 
38
  SUPPORT_CUDA = torch.cuda.is_available()
39
- SUPPORT_BF16 = SUPPORT_CUDA and torch.cuda.is_bf16_supported()
40
  SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7
41
  SUPPORT_TORCH2 = hasattr(torch, '__version__') and int(torch.__version__.split(".")[0]) >= 2
42
 
@@ -77,11 +76,14 @@ We detect you have activated flash attention support, but running model computat
77
  """
78
 
79
  apply_rotary_emb_func = None
 
80
  rms_norm = None
 
81
  flash_attn_unpadded_func = None
 
82
 
83
  def _import_flash_attn():
84
- global apply_rotary_emb_func, rms_norm, flash_attn_unpadded_func
85
  try:
86
  from flash_attn.layers.rotary import apply_rotary_emb_func as __apply_rotary_emb_func
87
  apply_rotary_emb_func = __apply_rotary_emb_func
@@ -102,20 +104,42 @@ def _import_flash_attn():
102
 
103
  try:
104
  import flash_attn
 
105
  if not hasattr(flash_attn, '__version__'):
106
  from flash_attn.flash_attn_interface import flash_attn_unpadded_func as __flash_attn_unpadded_func
107
  else:
108
  if int(flash_attn.__version__.split(".")[0]) >= 2:
 
 
109
  from flash_attn.flash_attn_interface import flash_attn_varlen_func as __flash_attn_unpadded_func
110
  else:
111
  from flash_attn.flash_attn_interface import flash_attn_unpadded_func as __flash_attn_unpadded_func
112
  flash_attn_unpadded_func = __flash_attn_unpadded_func
 
113
  except ImportError:
114
  logger.warn(
115
  "Warning: import flash_attn fail, please install FlashAttention to get higher efficiency "
116
  "https://github.com/Dao-AILab/flash-attention"
117
  )
118
 
119
  def quantize_cache_v(fdata, bits, qmax, qmin):
120
  # b, s, head, h-dim->b, head, s, h-dim
121
  qtype = torch.uint8
@@ -182,6 +206,11 @@ class FlashSelfAttention(torch.nn.Module):
182
  seqlen_k = k.shape[1]
183
  seqlen_out = seqlen_q
184
 
185
  q, k, v = [rearrange(x, "b s ... -> (b s) ...") for x in [q, k, v]]
186
  cu_seqlens_q = torch.arange(
187
  0,
@@ -311,7 +340,7 @@ class QWenAttention(nn.Module):
311
  warnings.warn("Failed to import KV cache kernels.")
312
  self.cache_kernels = None
313
 
314
- def _attn(self, query, key, value, registered_causal_mask, attention_mask=None, head_mask=None):
315
  device = query.device
316
  if self.use_cache_quantization:
317
  qk, qk_scale, qk_zero = key
@@ -336,26 +365,13 @@ class QWenAttention(nn.Module):
336
  size_temp = value[0].size(-1)
337
  else:
338
  size_temp = value.size(-1)
339
- attn_weights = attn_weights / torch.full(
340
- [],
341
- size_temp ** 0.5,
342
- dtype=attn_weights.dtype,
343
- device=attn_weights.device,
344
- )
345
- if self.use_cache_quantization:
346
- query_length, key_length = query.size(-2), key[0].size(-2)
347
- else:
348
- query_length, key_length = query.size(-2), key.size(-2)
349
- causal_mask = registered_causal_mask[
350
- :, :, key_length - query_length : key_length, :key_length
351
- ]
352
  mask_value = torch.finfo(attn_weights.dtype).min
353
- mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(
354
- attn_weights.device
355
- )
356
- attn_weights = torch.where(
357
- causal_mask, attn_weights.to(attn_weights.dtype), mask_value
358
- )
359
 
360
  if attention_mask is not None:
361
  attn_weights = attn_weights + attention_mask
@@ -395,62 +411,6 @@ class QWenAttention(nn.Module):
395
 
396
  return attn_output, attn_weights
397
 
398
- def _upcast_and_reordered_attn(
399
- self, query, key, value, registered_causal_mask, attention_mask=None, head_mask=None
400
- ):
401
- bsz, num_heads, q_seq_len, dk = query.size()
402
- _, _, k_seq_len, _ = key.size()
403
-
404
- attn_weights = torch.empty(
405
- bsz * num_heads,
406
- q_seq_len,
407
- k_seq_len,
408
- dtype=torch.float32,
409
- device=query.device,
410
- )
411
-
412
- scale_factor = 1.0
413
- if self.scale_attn_weights:
414
- scale_factor /= float(value.size(-1)) ** 0.5
415
-
416
- with autocast(enabled=False):
417
- q, k = query.reshape(-1, q_seq_len, dk), key.transpose(-1, -2).reshape(
418
- -1, dk, k_seq_len
419
- )
420
- attn_weights = torch.baddbmm(
421
- attn_weights, q.float(), k.float(), beta=0, alpha=scale_factor
422
- )
423
- attn_weights = attn_weights.reshape(bsz, num_heads, q_seq_len, k_seq_len)
424
-
425
- query_length, key_length = query.size(-2), key.size(-2)
426
- causal_mask = registered_causal_mask[
427
- :, :, key_length - query_length : key_length, :key_length
428
- ]
429
- mask_value = torch.finfo(attn_weights.dtype).min
430
- mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(
431
- attn_weights.device
432
- )
433
- attn_weights = torch.where(causal_mask, attn_weights, mask_value)
434
-
435
- if attention_mask is not None:
436
- attn_weights = attn_weights + attention_mask
437
-
438
- attn_weights = nn.functional.softmax(attn_weights, dim=-1)
439
-
440
- if attn_weights.dtype != torch.float32:
441
- raise RuntimeError(
442
- "Error with upcasting, attn_weights does not have dtype torch.float32"
443
- )
444
- attn_weights = attn_weights.type(value.dtype)
445
- attn_weights = self.attn_dropout(attn_weights)
446
-
447
- if head_mask is not None:
448
- attn_weights = attn_weights * head_mask
449
-
450
- attn_output = torch.matmul(attn_weights, value)
451
-
452
- return attn_output, attn_weights
453
-
454
  def _split_heads(self, tensor, num_heads, attn_head_size):
455
  new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
456
  tensor = tensor.view(new_shape)
@@ -465,7 +425,6 @@ class QWenAttention(nn.Module):
465
  self,
466
  hidden_states: Optional[Tuple[torch.FloatTensor]],
467
  rotary_pos_emb_list: Optional[List[List[torch.Tensor]]] = None,
468
- registered_causal_mask: Optional[torch.Tensor] = None,
469
  layer_past: Optional[Tuple[torch.Tensor]] = None,
470
  attention_mask: Optional[torch.FloatTensor] = None,
471
  head_mask: Optional[torch.FloatTensor] = None,
@@ -539,7 +498,8 @@ class QWenAttention(nn.Module):
539
  else:
540
  present = None
541
 
542
- if self.use_logn_attn and not self.training:
 
543
  if self.use_cache_quantization:
544
  seq_start = key[0].size(2) - query.size(1)
545
  seq_end = key[0].size(2)
@@ -558,12 +518,19 @@ class QWenAttention(nn.Module):
558
  q, k, v = query, key, value
559
  attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
560
  else:
 
  query = query.permute(0, 2, 1, 3)
562
  if not self.use_cache_quantization:
563
  key = key.permute(0, 2, 1, 3)
564
  value = value.permute(0, 2, 1, 3)
565
  if (
566
- registered_causal_mask is None
567
  and self.use_flash_attn
568
  and flash_attn_unpadded_func is not None
569
  and not self.is_fp32
@@ -572,13 +539,12 @@ class QWenAttention(nn.Module):
572
  raise Exception(_ERROR_INPUT_CPU_QUERY_WITH_FLASH_ATTN_ACTIVATED)
573
 
574
  if not self.use_cache_quantization and SUPPORT_TORCH2:
575
- causal_mask = registered_causal_mask[
576
- :, :, key.size(-2) - query.size(-2): key.size(-2), :key.size(-2)
577
- ]
578
  if attention_mask is not None:
579
  attention_mask = attention_mask.expand(
580
  -1, -1, causal_mask.size(2), -1
581
- ).masked_fill(~causal_mask, torch.finfo(query.dtype).min)
 
 
582
  else:
583
  attention_mask = causal_mask
584
  attn_output = F.scaled_dot_product_attention(
@@ -587,7 +553,7 @@ class QWenAttention(nn.Module):
587
  attn_weight = None
588
  else:
589
  attn_output, attn_weight = self._attn(
590
- query, key, value, registered_causal_mask, attention_mask, head_mask
591
  )
592
  context_layer = self._merge_heads(
593
  attn_output, self.num_heads, self.head_dim
@@ -603,6 +569,8 @@ class QWenAttention(nn.Module):
603
  and not self.is_fp32
604
  ):
605
  raise ValueError("Cannot output attentions while using flash-attn")
 
 
606
  else:
607
  outputs += (attn_weight,)
608
 
@@ -628,6 +596,7 @@ class QWenMLP(nn.Module):
628
  output = self.c_proj(intermediate_parallel)
629
  return output
630
 
 
631
  class QWenBlock(nn.Module):
632
  def __init__(self, config):
633
  super().__init__()
@@ -650,7 +619,6 @@ class QWenBlock(nn.Module):
650
  self,
651
  hidden_states: Optional[Tuple[torch.FloatTensor]],
652
  rotary_pos_emb_list: Optional[List[List[torch.Tensor]]] = None,
653
- registered_causal_mask: Optional[torch.Tensor] = None,
654
  layer_past: Optional[Tuple[torch.Tensor]] = None,
655
  attention_mask: Optional[torch.FloatTensor] = None,
656
  head_mask: Optional[torch.FloatTensor] = None,
@@ -664,7 +632,6 @@ class QWenBlock(nn.Module):
664
  attn_outputs = self.attn(
665
  layernorm_output,
666
  rotary_pos_emb_list,
667
- registered_causal_mask=registered_causal_mask,
668
  layer_past=layer_past,
669
  attention_mask=attention_mask,
670
  head_mask=head_mask,
@@ -698,6 +665,7 @@ class QWenPreTrainedModel(PreTrainedModel):
698
  is_parallelizable = False
699
  supports_gradient_checkpointing = True
700
  _no_split_modules = ["QWenBlock"]
 
701
 
702
  def __init__(self, *inputs, **kwargs):
703
  super().__init__(*inputs, **kwargs)
@@ -764,21 +732,6 @@ class QWenModel(QWenPreTrainedModel):
764
 
765
  self.use_flash_attn = config.use_flash_attn
766
  self.is_fp32 = not (config.bf16 or config.fp16)
767
- if (
768
- self.use_flash_attn
769
- and flash_attn_unpadded_func is not None
770
- and not self.is_fp32
771
- ):
772
- self.registered_causal_mask = None
773
- else:
774
- max_positions = config.max_position_embeddings
775
- self.register_buffer(
776
- "registered_causal_mask",
777
- torch.tril(
778
- torch.ones((max_positions, max_positions), dtype=torch.bool)
779
- ).view(1, 1, max_positions, max_positions),
780
- persistent=False,
781
- )
782
 
783
  self.h = nn.ModuleList(
784
  [
@@ -950,7 +903,6 @@ class QWenModel(QWenPreTrainedModel):
950
  create_custom_forward(block),
951
  hidden_states,
952
  rotary_pos_emb_list,
953
- self.registered_causal_mask,
954
  None,
955
  attention_mask,
956
  head_mask[i],
@@ -962,7 +914,6 @@ class QWenModel(QWenPreTrainedModel):
962
  hidden_states,
963
  layer_past=layer_past,
964
  rotary_pos_emb_list=rotary_pos_emb_list,
965
- registered_causal_mask=self.registered_causal_mask,
966
  attention_mask=attention_mask,
967
  head_mask=head_mask[i],
968
  encoder_hidden_states=encoder_hidden_states,
@@ -1006,11 +957,6 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1006
  assert (
1007
  config.bf16 + config.fp16 + config.fp32 <= 1
1008
  ), "Only one of \"bf16\", \"fp16\", \"fp32\" can be true"
1009
- logger.warn(
1010
- "Warning: please make sure that you are using the latest codes and checkpoints, "
1011
- "especially if you used Qwen-7B before 09.25.2023."
1012
- "请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。"
1013
- )
1014
 
1015
  autoset_precision = config.bf16 + config.fp16 + config.fp32 == 0
1016
 
@@ -1052,6 +998,12 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1052
  if config.use_flash_attn:
1053
  _import_flash_attn()
1054
 
 
  self.transformer = QWenModel(config)
1056
  self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1057
 
@@ -1063,7 +1015,6 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1063
  self.lm_head.half()
1064
  self.post_init()
1065
 
1066
-
1067
  def get_output_embeddings(self):
1068
  return self.lm_head
1069
 
@@ -1073,22 +1024,13 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1073
  def prepare_inputs_for_generation(
1074
  self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs
1075
  ):
1076
- token_type_ids = kwargs.get("token_type_ids", None)
1077
  if past_key_values:
1078
  input_ids = input_ids[:, -1].unsqueeze(-1)
1079
- if token_type_ids is not None:
1080
- token_type_ids = token_type_ids[:, -1].unsqueeze(-1)
1081
-
1082
- attention_mask = kwargs.get("attention_mask", None)
1083
- position_ids = kwargs.get("position_ids", None)
1084
 
1085
- if attention_mask is not None and position_ids is None:
1086
- position_ids = attention_mask.long().cumsum(-1) - 1
1087
- position_ids.masked_fill_(attention_mask == 0, 1)
1088
- if past_key_values:
1089
- position_ids = position_ids[:, -1].unsqueeze(-1)
1090
  else:
1091
- position_ids = None
1092
 
1093
  if inputs_embeds is not None and past_key_values is None:
1094
  model_inputs = {"inputs_embeds": inputs_embeds}
@@ -1099,9 +1041,7 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1099
  {
1100
  "past_key_values": past_key_values,
1101
  "use_cache": kwargs.get("use_cache"),
1102
- "position_ids": position_ids,
1103
  "attention_mask": attention_mask,
1104
- "token_type_ids": token_type_ids,
1105
  }
1106
  )
1107
  return model_inputs
@@ -1372,8 +1312,7 @@ class RotaryEmbedding(torch.nn.Module):
1372
  self._ntk_alpha_cached = 1.0
1373
  self._ntk_alpha_cached_list = [1.0]
1374
 
1375
- def update_rotary_pos_emb_cache(self, max_seq_len, offset=0, ntk_alpha=1.0):
1376
- seqlen = max_seq_len + offset
1377
  if seqlen > self._seq_len_cached or ntk_alpha != self._ntk_alpha_cached:
1378
  base = self.base * ntk_alpha ** (self.dim / (self.dim - 2))
1379
  self.inv_freq = 1.0 / (
@@ -1396,10 +1335,10 @@ class RotaryEmbedding(torch.nn.Module):
1396
  cos, sin = emb.cos(), emb.sin()
1397
  self._rotary_pos_emb_cache = [cos, sin]
1398
 
1399
- def forward(self, max_seq_len, offset=0, ntk_alpha=1.0):
1400
- self.update_rotary_pos_emb_cache(max_seq_len, offset, ntk_alpha)
1401
  cos, sin = self._rotary_pos_emb_cache
1402
- return [cos[:, offset : offset + max_seq_len], sin[:, offset : offset + max_seq_len]]
1403
 
1404
 
1405
  def _rotate_half(x):
@@ -1411,21 +1350,30 @@ def _rotate_half(x):
1411
 
1412
 
1413
  def apply_rotary_pos_emb(t, freqs):
1414
  cos, sin = freqs
1415
- if apply_rotary_emb_func is not None and t.is_cuda:
1416
- t_ = t.float()
1417
- cos = cos.squeeze(0).squeeze(1)[:, : cos.shape[-1] // 2]
1418
- sin = sin.squeeze(0).squeeze(1)[:, : sin.shape[-1] // 2]
1419
- output = apply_rotary_emb_func(t_, cos, sin).type_as(t)
1420
- return output
1421
  else:
1422
- rot_dim = freqs[0].shape[-1]
1423
- cos, sin = freqs
1424
- t_, t_pass_ = t[..., :rot_dim], t[..., rot_dim:]
1425
- t_ = t_.float()
1426
- t_pass_ = t_pass_.float()
1427
- t_ = (t_ * cos) + (_rotate_half(t_) * sin)
1428
- return torch.cat((t_, t_pass_), dim=-1).type_as(t)
1429
 
1430
 
1431
  class RMSNorm(torch.nn.Module):
@@ -1438,7 +1386,9 @@ class RMSNorm(torch.nn.Module):
1438
  return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
1439
 
1440
  def forward(self, x):
1441
- if rms_norm is not None and x.is_cuda:
 
 
1442
  return rms_norm(x, self.weight, self.eps)
1443
  else:
1444
  output = self._norm(x.float()).type_as(x)
 
13
  import torch.nn.functional as F
14
  import torch.utils.checkpoint
15
  import warnings
 
16
 
17
  from torch.nn import CrossEntropyLoss
18
  from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList
 
35
  from torch import nn
36
 
37
  SUPPORT_CUDA = torch.cuda.is_available()
38
+ SUPPORT_BF16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 8
39
  SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7
40
  SUPPORT_TORCH2 = hasattr(torch, '__version__') and int(torch.__version__.split(".")[0]) >= 2
41
 
 
76
  """
77
 
78
  apply_rotary_emb_func = None
79
+ apply_rotary_emb_func_triton = None
80
  rms_norm = None
81
+ rms_norm_triton = None
82
  flash_attn_unpadded_func = None
83
+ flash_attn_func = None
84
 
85
  def _import_flash_attn():
86
+ global apply_rotary_emb_func, rms_norm, flash_attn_unpadded_func, flash_attn_func
87
  try:
88
  from flash_attn.layers.rotary import apply_rotary_emb_func as __apply_rotary_emb_func
89
  apply_rotary_emb_func = __apply_rotary_emb_func
 
104
 
105
  try:
106
  import flash_attn
107
+ _flash_attn_func = None
108
  if not hasattr(flash_attn, '__version__'):
109
  from flash_attn.flash_attn_interface import flash_attn_unpadded_func as __flash_attn_unpadded_func
110
  else:
111
  if int(flash_attn.__version__.split(".")[0]) >= 2:
112
+ if int(flash_attn.__version__.split(".")[1]) >= 1:
113
+ from flash_attn.flash_attn_interface import flash_attn_func as _flash_attn_func
114
  from flash_attn.flash_attn_interface import flash_attn_varlen_func as __flash_attn_unpadded_func
115
  else:
116
  from flash_attn.flash_attn_interface import flash_attn_unpadded_func as __flash_attn_unpadded_func
117
  flash_attn_unpadded_func = __flash_attn_unpadded_func
118
+ flash_attn_func = _flash_attn_func
119
  except ImportError:
120
  logger.warn(
121
  "Warning: import flash_attn fail, please install FlashAttention to get higher efficiency "
122
  "https://github.com/Dao-AILab/flash-attention"
123
  )
124
 
125
+ def _import_triton():
126
+ global apply_rotary_emb_func_triton, rms_norm_triton
127
+ try:
128
+ from .triton_kernels import apply_rotary_emb as __apply_rotary_emb, rms_norm as __rms_norm
129
+ if apply_rotary_emb_func is not None:
130
+ logger.warn(
131
+ "Using Triton rotary kernel instead of flash_attn for inference."
132
+ )
133
+ apply_rotary_emb_func_triton = __apply_rotary_emb
134
+ if rms_norm is not None:
135
+ logger.warn(
136
+ "Using Triton rms_norm kernel instead of flash_attn for inference."
137
+ )
138
+ rms_norm_triton = __rms_norm
139
+ except ImportError:
140
+ logger.warn("Warning: Failed to import Triton kernels.")
141
+ return
142
+
143
  def quantize_cache_v(fdata, bits, qmax, qmin):
144
  # b, s, head, h-dim->b, head, s, h-dim
145
  qtype = torch.uint8
 
206
  seqlen_k = k.shape[1]
207
  seqlen_out = seqlen_q
208
 
209
+ if flash_attn_func is not None and batch_size == 1:
210
+ dropout_p = self.dropout_p if self.training else 0
211
+ output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
212
+ return output
213
+
214
  q, k, v = [rearrange(x, "b s ... -> (b s) ...") for x in [q, k, v]]
215
  cu_seqlens_q = torch.arange(
216
  0,
 
340
  warnings.warn("Failed to import KV cache kernels.")
341
  self.cache_kernels = None
342
 
343
+ def _attn(self, query, key, value, causal_mask=None, attention_mask=None, head_mask=None):
344
  device = query.device
345
  if self.use_cache_quantization:
346
  qk, qk_scale, qk_zero = key
 
365
  size_temp = value[0].size(-1)
366
  else:
367
  size_temp = value.size(-1)
368
+ attn_weights = attn_weights / (size_temp ** 0.5)
369
+
 
370
  mask_value = torch.finfo(attn_weights.dtype).min
371
+ if causal_mask is not None:
372
+ attn_weights = torch.where(
373
+ causal_mask, attn_weights.to(attn_weights.dtype), mask_value
374
+ )
 
 
375
 
376
  if attention_mask is not None:
377
  attn_weights = attn_weights + attention_mask
 
411
 
412
  return attn_output, attn_weights
413
 
 
414
  def _split_heads(self, tensor, num_heads, attn_head_size):
415
  new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
416
  tensor = tensor.view(new_shape)
 
425
  self,
426
  hidden_states: Optional[Tuple[torch.FloatTensor]],
427
  rotary_pos_emb_list: Optional[List[List[torch.Tensor]]] = None,
 
428
  layer_past: Optional[Tuple[torch.Tensor]] = None,
429
  attention_mask: Optional[torch.FloatTensor] = None,
430
  head_mask: Optional[torch.FloatTensor] = None,
 
498
  else:
499
  present = None
500
 
501
+ key_size = key[0].size(2) if self.use_cache_quantization else key.size(1)
502
+ if key_size > self.seq_length and self.use_logn_attn and not self.training:
503
  if self.use_cache_quantization:
504
  seq_start = key[0].size(2) - query.size(1)
505
  seq_end = key[0].size(2)
 
518
  q, k, v = query, key, value
519
  attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
520
  else:
521
+ key_size = key[0].size(2) if self.use_cache_quantization else key.size(1)
522
+ if query.size(1) == key_size:
523
+ causal_mask = torch.tril(
524
+ torch.ones((key_size, key_size), dtype=torch.bool, device=query.device)
525
+ ).view(1, 1, key_size, key_size)
526
+ else:
527
+ causal_mask = None
528
  query = query.permute(0, 2, 1, 3)
529
  if not self.use_cache_quantization:
530
  key = key.permute(0, 2, 1, 3)
531
  value = value.permute(0, 2, 1, 3)
532
  if (
533
+ causal_mask is None
534
  and self.use_flash_attn
535
  and flash_attn_unpadded_func is not None
536
  and not self.is_fp32
 
539
  raise Exception(_ERROR_INPUT_CPU_QUERY_WITH_FLASH_ATTN_ACTIVATED)
540
 
541
  if not self.use_cache_quantization and SUPPORT_TORCH2:
 
 
 
542
  if attention_mask is not None:
543
  attention_mask = attention_mask.expand(
544
  -1, -1, causal_mask.size(2), -1
545
+ )
546
+ if causal_mask is not None:
547
+ attention_mask.masked_fill_(~causal_mask, torch.finfo(query.dtype).min)
548
  else:
549
  attention_mask = causal_mask
550
  attn_output = F.scaled_dot_product_attention(
 
553
  attn_weight = None
554
  else:
555
  attn_output, attn_weight = self._attn(
556
+ query, key, value, causal_mask, attention_mask, head_mask
557
  )
558
  context_layer = self._merge_heads(
559
  attn_output, self.num_heads, self.head_dim
 
569
  and not self.is_fp32
570
  ):
571
  raise ValueError("Cannot output attentions while using flash-attn")
572
+ elif not self.use_cache_quantization and SUPPORT_TORCH2:
573
+ raise ValueError("Cannot output attentions while using scaled_dot_product_attention")
574
  else:
575
  outputs += (attn_weight,)
576
 
 
596
  output = self.c_proj(intermediate_parallel)
597
  return output
598
 
599
+
600
  class QWenBlock(nn.Module):
601
  def __init__(self, config):
602
  super().__init__()
 
619
  self,
620
  hidden_states: Optional[Tuple[torch.FloatTensor]],
621
  rotary_pos_emb_list: Optional[List[List[torch.Tensor]]] = None,
 
622
  layer_past: Optional[Tuple[torch.Tensor]] = None,
623
  attention_mask: Optional[torch.FloatTensor] = None,
624
  head_mask: Optional[torch.FloatTensor] = None,
 
632
  attn_outputs = self.attn(
633
  layernorm_output,
634
  rotary_pos_emb_list,
 
635
  layer_past=layer_past,
636
  attention_mask=attention_mask,
637
  head_mask=head_mask,
 
665
  is_parallelizable = False
666
  supports_gradient_checkpointing = True
667
  _no_split_modules = ["QWenBlock"]
668
+ _skip_keys_device_placement = "past_key_values"
669
 
670
  def __init__(self, *inputs, **kwargs):
671
  super().__init__(*inputs, **kwargs)
 
732
 
733
  self.use_flash_attn = config.use_flash_attn
734
  self.is_fp32 = not (config.bf16 or config.fp16)
735
 
736
  self.h = nn.ModuleList(
737
  [
 
903
  create_custom_forward(block),
904
  hidden_states,
905
  rotary_pos_emb_list,
 
906
  None,
907
  attention_mask,
908
  head_mask[i],
 
914
  hidden_states,
915
  layer_past=layer_past,
916
  rotary_pos_emb_list=rotary_pos_emb_list,
 
917
  attention_mask=attention_mask,
918
  head_mask=head_mask[i],
919
  encoder_hidden_states=encoder_hidden_states,
 
957
  assert (
958
  config.bf16 + config.fp16 + config.fp32 <= 1
959
  ), "Only one of \"bf16\", \"fp16\", \"fp32\" can be true"
960
 
961
  autoset_precision = config.bf16 + config.fp16 + config.fp32 == 0
962
 
 
998
  if config.use_flash_attn:
999
  _import_flash_attn()
1000
 
1001
+ if config.use_triton == "auto":
1002
+ logger.warn("Try importing Triton kernels for faster inference...")
1003
+ config.use_triton = SUPPORT_TORCH2
1004
+ if config.use_triton:
1005
+ _import_triton()
1006
+
1007
  self.transformer = QWenModel(config)
1008
  self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1009
 
 
1015
  self.lm_head.half()
1016
  self.post_init()
1017
 
 
1018
  def get_output_embeddings(self):
1019
  return self.lm_head
1020
 
 
1024
  def prepare_inputs_for_generation(
1025
  self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs
1026
  ):
 
1027
  if past_key_values:
1028
  input_ids = input_ids[:, -1].unsqueeze(-1)
 
 
 
 
 
1029
 
1030
+ if input_ids.size(0) == 1:
1031
+ attention_mask = None
 
 
 
1032
  else:
1033
+ attention_mask = kwargs.get("attention_mask", None)
1034
 
1035
  if inputs_embeds is not None and past_key_values is None:
1036
  model_inputs = {"inputs_embeds": inputs_embeds}
 
1041
  {
1042
  "past_key_values": past_key_values,
1043
  "use_cache": kwargs.get("use_cache"),
 
1044
  "attention_mask": attention_mask,
 
1045
  }
1046
  )
1047
  return model_inputs
 
1312
  self._ntk_alpha_cached = 1.0
1313
  self._ntk_alpha_cached_list = [1.0]
1314
 
1315
+ def update_rotary_pos_emb_cache(self, seqlen, ntk_alpha=1.0):
 
1316
  if seqlen > self._seq_len_cached or ntk_alpha != self._ntk_alpha_cached:
1317
  base = self.base * ntk_alpha ** (self.dim / (self.dim - 2))
1318
  self.inv_freq = 1.0 / (
 
1335
  cos, sin = emb.cos(), emb.sin()
1336
  self._rotary_pos_emb_cache = [cos, sin]
1337
 
1338
+ def forward(self, max_seq_len, ntk_alpha=1.0):
1339
+ self.update_rotary_pos_emb_cache(max_seq_len, ntk_alpha)
1340
  cos, sin = self._rotary_pos_emb_cache
1341
+ return [cos[:, :max_seq_len], sin[:, :max_seq_len]]
1342
 
1343
 
1344
  def _rotate_half(x):
 
1350
 
1351
 
1352
  def apply_rotary_pos_emb(t, freqs):
1353
+ """ Apply rotary embedding to the first rotary_dim of the input
1354
+
1355
+ Arguments:
1356
+ t (tensor(batch_size, seq_len, n_head, head_dim)):
1357
+ the input embedding/hidden states
1358
+ freqs (list[tensor(1, seq_len, 1, rotary_dim), tensor(1, seq_len, 1, rotary_dim)]):
1359
+ the cached cos/sin position embeddings
1360
+ """
1361
+ rot_dim = freqs[0].shape[-1]
1362
  cos, sin = freqs
1363
+ t_float = t.float()
1364
+ if apply_rotary_emb_func_triton is not None and t.is_cuda and (not t.requires_grad):
1365
+ return apply_rotary_emb_func_triton(t, cos, sin)
1366
+ elif apply_rotary_emb_func is not None and t.is_cuda:
1367
+ # apply_rotary_emb in flash_attn requires cos/sin to be of
1368
+ # shape (seqlen, rotary_dim / 2) and apply rotary embedding
1369
+ # to the first rotary_dim of the input
1370
+ cos = cos.squeeze(0).squeeze(1)[:, : rot_dim // 2]
1371
+ sin = sin.squeeze(0).squeeze(1)[:, : rot_dim // 2]
1372
+ return apply_rotary_emb_func(t_float, cos, sin).type_as(t)
1373
  else:
1374
+ t_rot, t_pass = t_float[..., :rot_dim], t_float[..., rot_dim:]
1375
+ t_rot = (t_rot * cos) + (_rotate_half(t_rot) * sin)
1376
+ return torch.cat((t_rot, t_pass), dim=-1).type_as(t)
1377
 
1378
 
1379
  class RMSNorm(torch.nn.Module):
 
1386
  return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
1387
 
1388
  def forward(self, x):
1389
+ if rms_norm_triton is not None and x.is_cuda and (not x.requires_grad):
1390
+ return rms_norm_triton(x, self.weight, self.eps)
1391
+ elif rms_norm is not None and x.is_cuda:
1392
  return rms_norm(x, self.weight, self.eps)
1393
  else:
1394
  output = self._norm(x.float()).type_as(x)
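
Note (for reference, not part of the diff): the reworked dispatch in `apply_rotary_pos_emb` and `RMSNorm.forward` now tries the Triton kernels first (CUDA tensors that do not require grad, i.e. inference), then the flash-attn fused ops, then plain PyTorch. A small sketch restating that final fallback path, written against the shapes given in the new docstring (`t` is `(batch, seq_len, n_head, head_dim)`, `cos`/`sin` are `(1, seq_len, 1, rotary_dim)`):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # equivalent to _rotate_half in modeling_qwen.py
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rope_reference(t: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # pure-PyTorch branch of apply_rotary_pos_emb: rotate only the first rotary_dim dims
    rot_dim = cos.shape[-1]
    t_rot, t_pass = t[..., :rot_dim].float(), t[..., rot_dim:].float()
    t_rot = (t_rot * cos) + (rotate_half(t_rot) * sin)
    return torch.cat((t_rot, t_pass), dim=-1).type_as(t)

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # pure-PyTorch branch of RMSNorm.forward
    normed = x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + eps)
    return normed.type_as(x) * weight
```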
tokenizer_config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "model_max_length": 8192,
3
  "tokenizer_class": "QWenTokenizer",
4
  "auto_map": {
5
  "AutoTokenizer": [
 
1
  {
2
+ "model_max_length": 32768,
3
  "tokenizer_class": "QWenTokenizer",
4
  "auto_map": {
5
  "AutoTokenizer": [
triton_kernels.py ADDED
@@ -0,0 +1,115 @@
1
+ from typing import Any, Callable, Dict, Hashable, Tuple
2
+
3
+ import torch
4
+ import triton
5
+ import triton.language as tl
6
+ from triton.compiler import CompiledKernel
7
+ from triton.runtime import JITFunction
8
+
9
+ try:
10
+ import triton.language.math as tlmath # Triton 2.1
11
+ except ImportError:
12
+ import triton.language.libdevice as tlmath # Triton 2.0
13
+
14
+
15
+ class TritonKernel:
16
+ def __init__(
17
+ self,
18
+ kernel_fn: JITFunction,
19
+ grid_fn: Callable[[Tuple[Any, ...]], Tuple[int, int, int]],
20
+ ) -> None:
21
+ self.kernel_fn_ = kernel_fn
22
+ self.grid_fn_ = grid_fn
23
+ self.kernel_cache_: Dict[Hashable, CompiledKernel] = {}
24
+
25
+ def run(self, *args, **kwargs):
26
+ # Set current device
27
+ input_device = args[0].device
28
+ prev_dev_idx, cur_dev_idx = -1, torch.cuda.current_device()
29
+ if input_device.index != cur_dev_idx:
30
+ prev_dev_idx = cur_dev_idx
31
+ torch.cuda.set_device(input_device.index)
32
+
33
+ # Compute grid
34
+ grid = self.grid_fn_(args)
35
+
36
+ # Use cached kernel if possible
37
+ kernel_key = (input_device,) + tuple(kwargs.items())
38
+ if kernel_key in self.kernel_cache_:
39
+ kernel = self.kernel_cache_[kernel_key]
40
+ kernel[grid](*args)
41
+ else:
42
+ # Compile and store new kernel
43
+ kernel = self.kernel_fn_[grid](*args, **kwargs)
44
+ self.kernel_cache_[kernel_key] = kernel
45
+
46
+ # Restore previous device
47
+ torch.cuda.set_device(prev_dev_idx)
48
+
49
+
50
+ @triton.jit
51
+ def _apply_rope_fwd_kernel(X, Cos, Sin, Y, HEAD_DIM: tl.constexpr):
52
+ batch_idx, tok_idx, head_idx = tl.program_id(0), tl.program_id(1), tl.program_id(2)
53
+ seq_len, num_heads = tl.num_programs(1), tl.num_programs(2)
54
+ block_idx = tl.arange(0, HEAD_DIM)
55
+ x_base_idx = ((batch_idx * seq_len + tok_idx) * num_heads * 3 + head_idx) * HEAD_DIM
56
+ x = tl.load(X + x_base_idx + block_idx)
57
+ freq_idx = tok_idx * HEAD_DIM + block_idx
58
+ cos = tl.load(Cos + freq_idx)
59
+ rot_idx = (HEAD_DIM // 2 + block_idx) % HEAD_DIM
60
+ x_rot = tl.load(X + x_base_idx + rot_idx)
61
+ x_rot = tl.where(block_idx >= HEAD_DIM // 2, x_rot, -x_rot)
62
+ sin = tl.load(Sin + freq_idx)
63
+ y_idx = (
64
+ (batch_idx * seq_len + tok_idx) * num_heads + head_idx
65
+ ) * HEAD_DIM + block_idx
66
+ y = x * cos + x_rot * sin
67
+ tl.store(Y + y_idx, y.to(Y.dtype.element_ty))
68
+
69
+
70
+ apply_rope_fwd_kernel = TritonKernel(
71
+ _apply_rope_fwd_kernel, lambda args: tuple(args[0].shape[:3])
72
+ )
73
+
74
+
75
+ def apply_rotary_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
76
+ y = torch.empty(x.shape, dtype=x.dtype, device=x.device)
77
+ apply_rope_fwd_kernel.run(x, cos, sin, y, HEAD_DIM=x.size(-1))
78
+ return y
79
+
80
+
81
+ @triton.jit
82
+ def _rms_norm_fwd_kernel(X, W, Y, eps, hidden_dim, BLOCK_SIZE: tl.constexpr):
83
+ tok_idx = tl.program_id(0)
84
+
85
+ mean_sq = tl.zeros([BLOCK_SIZE], tl.float32)
86
+ for offset in range(0, hidden_dim, BLOCK_SIZE):
87
+ dim_idx = offset + tl.arange(0, BLOCK_SIZE)
88
+ x = tl.load(
89
+ X + tok_idx * hidden_dim + dim_idx, mask=dim_idx < hidden_dim, other=0
90
+ ).to(tl.float32)
91
+ mean_sq += x * x / hidden_dim
92
+ rrms = tlmath.rsqrt(tl.sum(mean_sq, 0) + eps)
93
+
94
+ for offset in range(0, hidden_dim, BLOCK_SIZE):
95
+ dim_idx = offset + tl.arange(0, BLOCK_SIZE)
96
+ dim_mask = dim_idx < hidden_dim
97
+ hidden_idx = tok_idx * hidden_dim + dim_idx
98
+ x = tl.load(X + hidden_idx, mask=dim_mask, other=0)
99
+ w = tl.load(W + dim_idx, mask=dim_mask, other=0)
100
+ y = x * rrms * w
101
+ tl.store(Y + hidden_idx, y.to(Y.dtype.element_ty), mask=dim_mask)
102
+
103
+
104
+ rms_norm_fwd_kernel = TritonKernel(
105
+ _rms_norm_fwd_kernel, lambda args: (args[0].shape[:-1].numel(), 1, 1)
106
+ )
107
+
108
+
109
+ def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float):
110
+ y = torch.empty_like(x)
111
+ hidden_dim = x.size(-1)
112
+ rms_norm_fwd_kernel.run(
113
+ x, weight, y, eps, hidden_dim, BLOCK_SIZE=triton.next_power_of_2(hidden_dim)
114
+ )
115
+ return y
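
Note (illustrative, not part of the diff): a quick smoke test of the two kernels could look roughly like the sketch below. Two assumptions baked into `_apply_rope_fwd_kernel` have to be respected: it runs on a CUDA device, and the `* 3` in `x_base_idx` means the input is expected to be a `(batch, seq, n_head, head_dim)` view into the packed QKV projection (a stride of `3 * n_head * head_dim` per token), which is how `modeling_qwen.py` calls it; a freshly allocated contiguous tensor would be read with the wrong strides. The shapes, the inline reference, and importing the file as a local `triton_kernels` module are all illustrative:

```python
import torch
from triton_kernels import apply_rotary_emb, rms_norm  # assumes triton_kernels.py is on the path

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

batch, seq, n_head, head_dim = 1, 16, 32, 128
device, dtype = "cuda", torch.float16

# Emulate the fused c_attn output and take the query slice, matching the packed
# layout the RoPE kernel hard-codes via its "* 3" stride.
qkv = torch.randn(batch, seq, 3 * n_head * head_dim, device=device, dtype=dtype)
query = qkv[..., : n_head * head_dim].view(batch, seq, n_head, head_dim)

# RoPE cos/sin cache shaped (1, seq, 1, head_dim), as produced by RotaryEmbedding.
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
freqs = torch.outer(torch.arange(seq, device=device).float(), inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)
cos, sin = emb.cos()[None, :, None, :], emb.sin()[None, :, None, :]

out = apply_rotary_emb(query, cos, sin)
ref = (query.float() * cos + rotate_half(query.float()) * sin).to(dtype)
print("rope max err:", (out - ref).abs().max().item())

# RMSNorm kernel vs. the eager reference.
hidden = n_head * head_dim
x = torch.randn(batch, seq, hidden, device=device, dtype=dtype)
w = torch.randn(hidden, device=device, dtype=dtype)
y = rms_norm(x, w, 1e-6)
y_ref = (x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + 1e-6)).to(dtype) * w
print("rmsnorm max err:", (y - y_ref).abs().max().item())
```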