kvaishnavi committed on
Commit
bee0906
1 Parent(s): 50f6a36

Upload Phi-3-small-8k-instruct ONNX models

LICENSE ADDED
@@ -0,0 +1,223 @@
1
+ MIT License
2
+
3
+ Copyright (c) Microsoft Corporation.
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE
22
+
23
+ Apache License
24
+ Version 2.0, January 2004
25
+ http://www.apache.org/licenses/
26
+
27
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
28
+
29
+ 1. Definitions.
30
+
31
+ "License" shall mean the terms and conditions for use, reproduction,
32
+ and distribution as defined by Sections 1 through 9 of this document.
33
+
34
+ "Licensor" shall mean the copyright owner or entity authorized by
35
+ the copyright owner that is granting the License.
36
+
37
+ "Legal Entity" shall mean the union of the acting entity and all
38
+ other entities that control, are controlled by, or are under common
39
+ control with that entity. For the purposes of this definition,
40
+ "control" means (i) the power, direct or indirect, to cause the
41
+ direction or management of such entity, whether by contract or
42
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
43
+ outstanding shares, or (iii) beneficial ownership of such entity.
44
+
45
+ "You" (or "Your") shall mean an individual or Legal Entity
46
+ exercising permissions granted by this License.
47
+
48
+ "Source" form shall mean the preferred form for making modifications,
49
+ including but not limited to software source code, documentation
50
+ source, and configuration files.
51
+
52
+ "Object" form shall mean any form resulting from mechanical
53
+ transformation or translation of a Source form, including but
54
+ not limited to compiled object code, generated documentation,
55
+ and conversions to other media types.
56
+
57
+ "Work" shall mean the work of authorship, whether in Source or
58
+ Object form, made available under the License, as indicated by a
59
+ copyright notice that is included in or attached to the work
60
+ (an example is provided in the Appendix below).
61
+
62
+ "Derivative Works" shall mean any work, whether in Source or Object
63
+ form, that is based on (or derived from) the Work and for which the
64
+ editorial revisions, annotations, elaborations, or other modifications
65
+ represent, as a whole, an original work of authorship. For the purposes
66
+ of this License, Derivative Works shall not include works that remain
67
+ separable from, or merely link (or bind by name) to the interfaces of,
68
+ the Work and Derivative Works thereof.
69
+
70
+ "Contribution" shall mean any work of authorship, including
71
+ the original version of the Work and any modifications or additions
72
+ to that Work or Derivative Works thereof, that is intentionally
73
+ submitted to Licensor for inclusion in the Work by the copyright owner
74
+ or by an individual or Legal Entity authorized to submit on behalf of
75
+ the copyright owner. For the purposes of this definition, "submitted"
76
+ means any form of electronic, verbal, or written communication sent
77
+ to the Licensor or its representatives, including but not limited to
78
+ communication on electronic mailing lists, source code control systems,
79
+ and issue tracking systems that are managed by, or on behalf of, the
80
+ Licensor for the purpose of discussing and improving the Work, but
81
+ excluding communication that is conspicuously marked or otherwise
82
+ designated in writing by the copyright owner as "Not a Contribution."
83
+
84
+ "Contributor" shall mean Licensor and any individual or Legal Entity
85
+ on behalf of whom a Contribution has been received by Licensor and
86
+ subsequently incorporated within the Work.
87
+
88
+ 2. Grant of Copyright License. Subject to the terms and conditions of
89
+ this License, each Contributor hereby grants to You a perpetual,
90
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
91
+ copyright license to reproduce, prepare Derivative Works of,
92
+ publicly display, publicly perform, sublicense, and distribute the
93
+ Work and such Derivative Works in Source or Object form.
94
+
95
+ 3. Grant of Patent License. Subject to the terms and conditions of
96
+ this License, each Contributor hereby grants to You a perpetual,
97
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
98
+ (except as stated in this section) patent license to make, have made,
99
+ use, offer to sell, sell, import, and otherwise transfer the Work,
100
+ where such license applies only to those patent claims licensable
101
+ by such Contributor that are necessarily infringed by their
102
+ Contribution(s) alone or by combination of their Contribution(s)
103
+ with the Work to which such Contribution(s) was submitted. If You
104
+ institute patent litigation against any entity (including a
105
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
106
+ or a Contribution incorporated within the Work constitutes direct
107
+ or contributory patent infringement, then any patent licenses
108
+ granted to You under this License for that Work shall terminate
109
+ as of the date such litigation is filed.
110
+
111
+ 4. Redistribution. You may reproduce and distribute copies of the
112
+ Work or Derivative Works thereof in any medium, with or without
113
+ modifications, and in Source or Object form, provided that You
114
+ meet the following conditions:
115
+
116
+ (a) You must give any other recipients of the Work or
117
+ Derivative Works a copy of this License; and
118
+
119
+ (b) You must cause any modified files to carry prominent notices
120
+ stating that You changed the files; and
121
+
122
+ (c) You must retain, in the Source form of any Derivative Works
123
+ that You distribute, all copyright, patent, trademark, and
124
+ attribution notices from the Source form of the Work,
125
+ excluding those notices that do not pertain to any part of
126
+ the Derivative Works; and
127
+
128
+ (d) If the Work includes a "NOTICE" text file as part of its
129
+ distribution, then any Derivative Works that You distribute must
130
+ include a readable copy of the attribution notices contained
131
+ within such NOTICE file, excluding those notices that do not
132
+ pertain to any part of the Derivative Works, in at least one
133
+ of the following places: within a NOTICE text file distributed
134
+ as part of the Derivative Works; within the Source form or
135
+ documentation, if provided along with the Derivative Works; or,
136
+ within a display generated by the Derivative Works, if and
137
+ wherever such third-party notices normally appear. The contents
138
+ of the NOTICE file are for informational purposes only and
139
+ do not modify the License. You may add Your own attribution
140
+ notices within Derivative Works that You distribute, alongside
141
+ or as an addendum to the NOTICE text from the Work, provided
142
+ that such additional attribution notices cannot be construed
143
+ as modifying the License.
144
+
145
+ You may add Your own copyright statement to Your modifications and
146
+ may provide additional or different license terms and conditions
147
+ for use, reproduction, or distribution of Your modifications, or
148
+ for any such Derivative Works as a whole, provided Your use,
149
+ reproduction, and distribution of the Work otherwise complies with
150
+ the conditions stated in this License.
151
+
152
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
153
+ any Contribution intentionally submitted for inclusion in the Work
154
+ by You to the Licensor shall be under the terms and conditions of
155
+ this License, without any additional terms or conditions.
156
+ Notwithstanding the above, nothing herein shall supersede or modify
157
+ the terms of any separate license agreement you may have executed
158
+ with Licensor regarding such Contributions.
159
+
160
+ 6. Trademarks. This License does not grant permission to use the trade
161
+ names, trademarks, service marks, or product names of the Licensor,
162
+ except as required for reasonable and customary use in describing the
163
+ origin of the Work and reproducing the content of the NOTICE file.
164
+
165
+ 7. Disclaimer of Warranty. Unless required by applicable law or
166
+ agreed to in writing, Licensor provides the Work (and each
167
+ Contributor provides its Contributions) on an "AS IS" BASIS,
168
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
169
+ implied, including, without limitation, any warranties or conditions
170
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
171
+ PARTICULAR PURPOSE. You are solely responsible for determining the
172
+ appropriateness of using or redistributing the Work and assume any
173
+ risks associated with Your exercise of permissions under this License.
174
+
175
+ 8. Limitation of Liability. In no event and under no legal theory,
176
+ whether in tort (including negligence), contract, or otherwise,
177
+ unless required by applicable law (such as deliberate and grossly
178
+ negligent acts) or agreed to in writing, shall any Contributor be
179
+ liable to You for damages, including any direct, indirect, special,
180
+ incidental, or consequential damages of any character arising as a
181
+ result of this License or out of the use or inability to use the
182
+ Work (including but not limited to damages for loss of goodwill,
183
+ work stoppage, computer failure or malfunction, or any and all
184
+ other commercial damages or losses), even if such Contributor
185
+ has been advised of the possibility of such damages.
186
+
187
+ 9. Accepting Warranty or Additional Liability. While redistributing
188
+ the Work or Derivative Works thereof, You may choose to offer,
189
+ and charge a fee for, acceptance of support, warranty, indemnity,
190
+ or other liability obligations and/or rights consistent with this
191
+ License. However, in accepting such obligations, You may act only
192
+ on Your own behalf and on Your sole responsibility, not on behalf
193
+ of any other Contributor, and only if You agree to indemnify,
194
+ defend, and hold each Contributor harmless for any liability
195
+ incurred by, or claims asserted against, such Contributor by reason
196
+ of your accepting any such warranty or additional liability.
197
+
198
+ END OF TERMS AND CONDITIONS
199
+
200
+ ============================================================================
201
+
202
+ Copyright 2016-2019 Intel Corporation
203
+ Copyright 2018 YANDEX LLC
204
+
205
+ Licensed under the Apache License, Version 2.0 (the "License");
206
+ you may not use this file except in compliance with the License.
207
+ You may obtain a copy of the License at
208
+
209
+ http://www.apache.org/licenses/LICENSE-2.0
210
+
211
+ Unless required by applicable law or agreed to in writing, software
212
+ distributed under the License is distributed on an "AS IS" BASIS,
213
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
214
+ See the License for the specific language governing permissions and
215
+ limitations under the License.
216
+
217
+ This distribution includes third party software ("third party programs").
218
+ This third party software, even if included with the distribution of
219
+ the Intel software, may be governed by separate license terms, including
220
+ without limitation, third party license terms, other Intel software license
221
+ terms, and open source software license terms. These separate license terms
222
+ govern your use of the third party programs as set forth in the
223
+ "THIRD-PARTY-PROGRAMS" file.
README.md ADDED
@@ -0,0 +1,102 @@
1
+ ---
2
+ license: mit
3
+ pipeline_tag: text-generation
4
+ tags:
5
+ - ONNX
6
+ - DML
7
+ - ONNXRuntime
8
+ - phi3
9
+ - nlp
10
+ - conversational
11
+ - custom_code
12
+ inference: false
13
+ ---
14
+
15
+ # Phi-3 Small-8K-Instruct ONNX CUDA models
16
+
17
+ <!-- Provide a quick summary of what the model is/does. -->
18
+ This repository hosts optimized versions of [Phi-3-small-8k-instruct](https://aka.ms/phi3-Small-8k-instruct) to accelerate inference with ONNX Runtime on machines with NVIDIA GPUs.
19
+
20
+ Phi-3 Small is a 7B-parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Small version comes in two variants, [8K](https://huggingface.co/microsoft/Phi-3-small-8k-instruct) and [128K](https://huggingface.co/microsoft/Phi-3-small-128k-instruct), which denote the context length (in tokens) each variant supports.
21
+
22
+ The base model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety. When assessed against benchmarks that test common sense, language understanding, math, code, long context, and logical reasoning, Phi-3-Small-8K-Instruct showed robust, state-of-the-art performance among models of the same size and the next size up.
23
+
24
+ Optimized variants of the Phi-3 Small models are published here in [ONNX](https://onnx.ai) format and run with [ONNX Runtime](https://onnxruntime.ai/) on GPU across devices, including server platforms, Windows, and Linux.
25
+
26
+ ## ONNX Models
27
+
28
+ Here are some of the optimized configurations we have added:
29
+ 1. ONNX model for FP16 CUDA: ONNX model for NVIDIA GPUs.
30
+ 2. ONNX model for INT4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via RTN (round-to-nearest).
31
+
32
+ Note: With the Hugging Face CLI, you can download individual subfolders rather than all of the models if you are limited on disk space. The FP16 model is recommended for larger batch sizes, while the INT4 model is optimized for smaller batch sizes.
33
+
34
+ Example:
35
+ ```
36
+ # Download just the FP16 model
37
+ $ huggingface-cli download microsoft/Phi-3-small-8k-instruct-onnx-cuda --include cuda-fp16/* --local-dir . --local-dir-use-symlinks False
38
+ ```
39
+
40
+ ## How to Get Started with the Model
41
+ To support the Phi-3 models across a range of devices, platforms, and execution provider (EP) backends, we introduce a new API that wraps several aspects of generative AI inferencing. This API makes it easy to drag and drop LLMs straight into your app. To run an early version of these models with ONNX, follow the steps [here](http://aka.ms/generate-tutorial). You can also test the models with this [chat app](https://github.com/microsoft/onnxruntime-genai/tree/main/examples/chat_app).
42
+
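+ As a quick illustration, the minimal Python sketch below runs the FP16 model with the [ONNX Runtime GenAI](https://github.com/microsoft/onnxruntime-genai) package. It assumes `onnxruntime-genai-cuda` is installed and that the `cuda-fp16` folder from this repository has been downloaded locally; the exact API surface may vary slightly between releases, so prefer the tutorial linked above for authoritative steps.
+
+ ```python
+ import onnxruntime_genai as og
+
+ # Load the downloaded model folder; genai_config.json is read from it.
+ model = og.Model("cuda-fp16")
+ tokenizer = og.Tokenizer(model)
+ stream = tokenizer.create_stream()
+
+ # Phi-3 chat format uses <|user|>, <|assistant|>, and <|end|> markers.
+ prompt = "<|user|>\nWhat is an ONNX model?<|end|>\n<|assistant|>\n"
+ params = og.GeneratorParams(model)
+ params.set_search_options(max_length=512)
+ params.input_ids = tokenizer.encode(prompt)
+
+ generator = og.Generator(model, params)
+ while not generator.is_done():
+     generator.compute_logits()
+     generator.generate_next_token()
+     print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
+ ```
+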
43
+ ## Hardware Supported
44
+
45
+ The ONNX models are tested on:
46
+ - 1 A100 GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)
47
+
48
+ Minimum Configuration Required:
49
+ - CUDA: compute capability >= 7.0 (SM70, i.e. V100 or newer)
50
+
51
+ ### Model Description
52
+
53
+ - **Developed by:** Microsoft
54
+ - **Model type:** ONNX
55
+ - **Language(s) (NLP):** Python, C, C++
56
+ - **License:** MIT
57
+ - **Model Description:** This is a conversion of the Phi-3 Small-8K-Instruct model for ONNX Runtime inference.
58
+
59
+ ## Additional Details
60
+ - [**Phi-3 Small, Medium, and Vision Blog**](https://aka.ms/phi3_ONNXBuild24) and [**Phi-3 Mini Blog**](https://aka.ms/phi3-optimizations)
61
+ - [**Phi-3 Model Blog Link**](https://aka.ms/phi3blog-april)
62
+ - [**Phi-3 Model Card**]( https://aka.ms/phi3-Small-8K-instruct)
63
+ - [**Phi-3 Technical Report**](https://aka.ms/phi3-tech-report)
64
+ - [**Phi-3 on Azure AI Studio**](https://aka.ms/phi3-azure-ai)
65
+
66
+ ## Performance Metrics
67
+
68
+ Phi-3 Small-8K-Instruct performs better with ONNX Runtime than with PyTorch for all measured (batch size, prompt length) combinations. For FP16 CUDA, ORT is up to 4.4X faster than PyTorch, while for INT4 CUDA it is up to 10.9X faster.
69
+
70
+ The tables below show the average throughput over the first 256 generated tokens (tokens per second, tps) for the FP16 and INT4 precisions on CUDA, as measured on [1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4](https://learn.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series).
71
+
72
+ | Batch Size, Prompt Length | ORT FP16 CUDA | PyTorch Eager FP16 CUDA | Speed Up ORT/PyTorch |
73
+ |---------------------------|---------------|-------------------------|----------------------|
74
+ | 1, 16 | 74.62 | 16.81 | 4.44 |
75
+ | 4, 16 | 290.36 | 65.56 | 4.43 |
76
+ | 16, 16 | 1036.93 | 267.33 | 3.88 |
77
+
78
+
79
+ | Batch Size, Prompt Length | ORT INT4 CUDA | PyTorch Eager INT4 CUDA | Speed Up ORT/PyTorch |
80
+ |---------------------------|---------------|-------------------------|----------------------|
81
+ | 1, 16 | 140.68 | 12.93 | 10.88 |
82
+ | 4, 16 | 152.90 | 44.04 | 3.47 |
83
+ | 16, 16 | 582.07 | 160.57 | 3.62 |
84
+
85
+
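+ The throughput figures above can be approximated with a simple timing loop around the same generation API; the sketch below is only an illustration and is not the measurement harness used to produce these tables.
+
+ ```python
+ import time
+ import onnxruntime_genai as og
+
+ model = og.Model("cuda-fp16")
+ tokenizer = og.Tokenizer(model)
+
+ prompt_tokens = tokenizer.encode("Hello " * 16)  # roughly a 16-token prompt
+ params = og.GeneratorParams(model)
+ params.set_search_options(max_length=len(prompt_tokens) + 256)
+ params.input_ids = prompt_tokens
+
+ generator = og.Generator(model, params)
+ start, generated = time.perf_counter(), 0
+ while not generator.is_done():
+     generator.compute_logits()
+     generator.generate_next_token()
+     generated += 1
+ print(f"~{generated / (time.perf_counter() - start):.1f} tokens/s")
+ ```
+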
86
+ ### Package Versions
87
+
88
+ | Pip package name | Version |
89
+ |------------------|---------|
90
+ | torch | 2.3.0 |
91
+ | triton | 2.3.0 |
92
+ | onnxruntime-gpu | 1.18.0 |
93
+ | transformers | 4.40.2 |
94
+ | bitsandbytes | 0.43.1 |
95
+
96
+ ## Appendix
97
+
98
+ ## Model Card Contact
99
+ parinitarahi, kvaishnavi, natke
100
+
101
+ ## Contributors
102
+ Kunal Vaishnavi, Sunghoon Choi, Yufeng Li, Tianlei Wu, Sheetal Arun Kadam, Rui Ren, Baiju Meswani, Natalie Kershaw, Parinita Rahi
config.json ADDED
@@ -0,0 +1,47 @@
1
+ {
2
+ "_name_or_path": "Phi-3-small-8k-instruct",
3
+ "architectures": [
4
+ "Phi3SmallForCausalLM"
5
+ ],
6
+ "attention_dropout_prob": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_phi3_small.Phi3SmallConfig",
9
+ "AutoModelForCausalLM": "modeling_phi3_small.Phi3SmallForCausalLM",
10
+ "AutoTokenizer": "tokenization_phi3_small.Phi3SmallTokenizer"
11
+ },
12
+ "blocksparse_block_size": 64,
13
+ "blocksparse_homo_head_pattern": false,
14
+ "blocksparse_num_local_blocks": 16,
15
+ "blocksparse_triton_kernel_block_size": 64,
16
+ "blocksparse_vert_stride": 8,
17
+ "bos_token_id": 100257,
18
+ "dense_attention_every_n_layers": 2,
19
+ "embedding_dropout_prob": 0.1,
20
+ "eos_token_id": 100257,
21
+ "ff_dim_multiplier": null,
22
+ "ff_intermediate_size": 14336,
23
+ "ffn_dropout_prob": 0.1,
24
+ "gegelu_limit": 20.0,
25
+ "gegelu_pad_to_256": true,
26
+ "hidden_act": "gegelu",
27
+ "hidden_size": 4096,
28
+ "initializer_range": 0.02,
29
+ "layer_norm_epsilon": 1e-05,
30
+ "max_position_embeddings": 8192,
31
+ "model_type": "phi3small",
32
+ "mup_attn_multiplier": 1.0,
33
+ "mup_embedding_multiplier": 10.0,
34
+ "mup_use_scaling": true,
35
+ "mup_width_multiplier": 8.0,
36
+ "num_attention_heads": 32,
37
+ "num_hidden_layers": 32,
38
+ "num_key_value_heads": 8,
39
+ "pad_sequence_to_multiple_of_64": true,
40
+ "reorder_and_upcast_attn": false,
41
+ "rope_embedding_base": 1000000,
42
+ "rope_position_scale": 1.0,
43
+ "torch_dtype": "bfloat16",
44
+ "transformers_version": "4.38.1",
45
+ "use_cache": true,
46
+ "vocab_size": 100352
47
+ }
configuration_phi3_small.py ADDED
@@ -0,0 +1,250 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ from typing import Any, Dict, List, Optional, Union
17
+
18
+ from transformers.configuration_utils import PretrainedConfig
19
+ from transformers.utils import logging
20
+
21
+ from functools import cached_property
22
+
23
+ """ Phi3Small model configuration """
24
+ logger = logging.get_logger(__name__)
25
+
26
+
27
+ def next_mult(x, y):
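+ # round x up to the nearest multiple of y, e.g. next_mult(14000, 256) == 14080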
28
+ return (x + y - 1) // y * y
29
+
30
+ class Phi3SmallConfig(PretrainedConfig):
31
+ """
32
+ This is the configuration class to store the configuration of a `Phi3Small` model. It is used to
33
+ instantiate a Phi-3-small model according to the specified arguments, defining the model architecture.
34
+ Instantiating a configuration with the defaults will yield a similar configuration to that of the Phi-3-small
35
+ [phi3](https://arxiv.org/pdf/2404.14219) architecture.
36
+
37
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
38
+ documentation from [`PretrainedConfig`] for more information.
39
+
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 100352):
43
+ Vocabulary size of the Phi3Small model. Defines the number of different tokens that can be represented by the
44
+ `inputs_ids` passed when calling `Phi3Small`.
45
+ max_position_embeddings (`int`, *optional*, defaults to 8192):
46
+ The maximum sequence length that this model might safely be used with.
47
+ rope_embedding_base (`float`, *optional*, defaults to 10^6):
48
+ The base value for the RoPE (Relative Position Encoding) embedding.
49
+ rope_position_scale (`float`, *optional*, defaults to 1.0):
50
+ The scale factor for the RoPE position encoding.
51
+ rope_scaling (`Optional[Dict[str, Union[float, List[float], int]]]`, *optional*, defaults to None):
52
+ The scaling configuration used for LongRoPE.
53
+ hidden_size (`int`, *optional*, defaults to 4096):
54
+ The size of the hidden layers in the model.
55
+ num_hidden_layers (`int`, *optional*, defaults to 32):
56
+ The number of layers in the model.
57
+ num_attention_heads (`int`, *optional*, defaults to 32):
58
+ The number of query heads in the model.
59
+ num_key_value_heads (`int`, *optional*, defaults to 8):
60
+ The number of key-value heads in the model.
61
+ hidden_act (`str`, *optional*, defaults to "gegelu"):
62
+ The activation function used in the model.
63
+ gegelu_limit (`float`, *optional*, defaults to 20.0):
64
+ The limit value for the GELU activation function (for numerical stability).
65
+ gegelu_pad_to_256 (`bool`, *optional*, defaults to True):
66
+ Whether to pad the intermediate size to a multiple of 256 (for faster matmul ops).
67
+ ff_dim_multiplier (`Optional[int]`, *optional*, defaults to None):
68
+ The dimension multiplier for the feed-forward layers.
69
+ ff_intermediate_size (`Optional[int]`, *optional*, defaults to 14336):
70
+ The intermediate size for the feed-forward layers.
71
+ One of `ff_dim_multiplier` or `ff_intermediate_size` must be specified.
72
+ blocksparse_homo_head_pattern (`bool`, *optional*, defaults to False):
73
+ Whether to use a homogeneous head pattern for block-sparse attention.
74
+ blocksparse_block_size (`int`, *optional*, defaults to 64):
75
+ The block size for block-sparse attention.
76
+ blocksparse_num_local_blocks (`int`, *optional*, defaults to 16):
77
+ The number of local blocks for block-sparse attention.
78
+ The local window used in blocksparse equals `blocksparse_num_local_blocks * blocksparse_block_size`
79
+ blocksparse_vert_stride (`int`, *optional*, defaults to 8):
80
+ The vertical stride for block-sparse attention.
81
+ blocksparse_triton_kernel_block_size (`int`, *optional*, defaults to 64):
82
+ The kernel block size for block-sparse attention.
83
+ dense_attention_every_n_layers (`Optional[int]`, *optional*, defaults to 2):
84
+ The frequency of all dense attention layers in the model
85
+ embedding_dropout_prob (`float`, *optional*, defaults to 0.1):
86
+ The dropout probability for the embedding layer.
87
+ attention_dropout_prob (`float`, *optional*, defaults to 0.0):
88
+ The dropout probability for the attention layers.
89
+ ffn_dropout_prob (`float`, *optional*, defaults to 0.1):
90
+ The dropout probability for the feed-forward layers.
91
+ layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
92
+ The epsilon value for layer normalization.
93
+ initializer_range (`float`, *optional*, defaults to 0.02):
94
+ The range for weight initialization.
95
+ mup_use_scaling (`bool`, *optional*, defaults to True):
96
+ Whether to use scaling for MuP parameters (see: https://arxiv.org/abs/2203.03466).
97
+ mup_width_multiplier (`float`, *optional*, defaults to 8.0):
98
+ The width multiplier for MuP.
99
+ mup_embedding_multiplier (`float`, *optional*, defaults to 10.0):
100
+ The embedding multiplier for MuP.
101
+ mup_attn_multiplier (`float`, *optional*, defaults to 1.0):
102
+ The attention multiplier for MuP.
103
+ use_cache (`bool`, *optional*, defaults to True):
104
+ Whether to use cache for the model.
105
+ bos_token_id (`int`, *optional*, defaults to 100257):
106
+ The token ID for the beginning of sentence.
107
+ eos_token_id (`int`, *optional*, defaults to 100257):
108
+ The token ID for the end of sentence.
109
+ reorder_and_upcast_attn (`bool`, *optional*, defaults to False):
110
+ Whether to reorder and upcast attention.
111
+ pad_sequence_to_multiple_of_64 (`bool`, *optional*, defaults to True):
112
+ Whether to pad the sequence length to a multiple of 64.
113
+ **kwargs:
114
+ Additional keyword arguments.
115
+
116
+ Example:
117
+
118
+ ```python
119
+ >>> from transformers import Phi3SmallConfig, Phi3SmallModel
120
+
121
+ >>> # Initializing a Phi3Small configuration
122
+ >>> configuration = Phi3SmallConfig()
123
+
124
+ >>> # Initializing a model (with random weights) from the configuration
125
+ >>> model = Phi3SmallModel(configuration)
126
+
127
+ >>> # Accessing the model configuration
128
+ >>> configuration = model.config
129
+ ```
130
+ """
131
+
132
+ model_type = "phi3small"
133
+ keys_to_ignore_at_inference = ["past_key_values"]
134
+
135
+
136
+ def __init__(
137
+ self,
138
+ # General information about the model
139
+ vocab_size: int = 100352,
140
+ max_position_embeddings: int = 8192,
141
+ # RoPE Related Parameters
142
+ rope_embedding_base: float = 10**6,
143
+ rope_position_scale: float = 1.0,
144
+ rope_scaling: Optional[Dict[str, Union[float, List[float], int]]] = None,
145
+ # General Model Parameters
146
+ hidden_size: int = 4096,
147
+ num_hidden_layers: int = 32,
148
+ # KV Shared Attention Configurations
149
+ num_attention_heads: int = 32,
150
+ num_key_value_heads: int = 8,
151
+ # GEGELU Related Parameters
152
+ hidden_act: str = "gegelu",
153
+ gegelu_limit: float = 20.0,
154
+ gegelu_pad_to_256: bool = True,
155
+ ff_dim_multiplier: Optional[int] = None,
156
+ ff_intermediate_size: Optional[int] = 14336,
157
+ # Block Sparse Attention Parameters
158
+ blocksparse_homo_head_pattern: bool = False,
159
+ blocksparse_block_size: int = 64,
160
+ blocksparse_num_local_blocks: int = 16,
161
+ blocksparse_vert_stride: int = 8,
162
+ blocksparse_triton_kernel_block_size: int = 64,
163
+ # Frequency of block-sparsity
164
+ dense_attention_every_n_layers: Optional[int] = 2,
165
+ # Regularization parameters
166
+ embedding_dropout_prob: float = 0.1,
167
+ attention_dropout_prob: float = 0.0,
168
+ ffn_dropout_prob: float = 0.1,
169
+ layer_norm_epsilon=1e-5,
170
+ initializer_range=0.02,
171
+ # MuP parameters
172
+ mup_use_scaling: bool = True,
173
+ mup_width_multiplier: float = 8.0,
173
+ mup_embedding_multiplier: float = 10.0,
174
+ mup_attn_multiplier: float = 1.0,
176
+ use_cache=True,
177
+ # The model does not have a bos token id
178
+ # However, in order for some of the downstream libraries to not break
179
+ # we set this to be the same as the eos_token_id
180
+ bos_token_id: int = 100257,
181
+ eos_token_id: int = 100257,
182
+ reorder_and_upcast_attn=False,
183
+ # Configuration to pad sequence length to a multiple of 64
184
+ pad_sequence_to_multiple_of_64: bool = True,
185
+ **kwargs,
186
+ ):
187
+ self.vocab_size = vocab_size
188
+ self.max_position_embeddings = max_position_embeddings
189
+ self.rope_embedding_base = rope_embedding_base
190
+ self.rope_position_scale = rope_position_scale
191
+ self.rope_scaling = rope_scaling
192
+ self.hidden_size = hidden_size
193
+ # QK Shared Attention
194
+ self.num_hidden_layers = num_hidden_layers
195
+ self.num_attention_heads = num_attention_heads
196
+ self.num_key_value_heads = num_key_value_heads
197
+ # Block Sparse Attention Pattern
198
+ self.blocksparse_homo_head_pattern = blocksparse_homo_head_pattern
199
+ self.blocksparse_block_size = blocksparse_block_size
200
+ self.blocksparse_num_local_blocks = blocksparse_num_local_blocks
201
+ self.blocksparse_vert_stride = blocksparse_vert_stride
202
+ self.blocksparse_triton_kernel_block_size = blocksparse_triton_kernel_block_size
203
+ # Frequency of block sparsity
204
+ self.dense_attention_every_n_layers = dense_attention_every_n_layers
205
+ # Activation function
206
+ self.hidden_act = hidden_act
207
+ self.gegelu_limit = gegelu_limit
208
+ self.gegelu_pad_to_256 = gegelu_pad_to_256
209
+ self.ff_dim_multiplier = ff_dim_multiplier
210
+ self.ff_intermediate_size = ff_intermediate_size
211
+ if self.ff_dim_multiplier is None and self.ff_intermediate_size is None:
212
+ raise ValueError(f"Cannot have both {self.ff_dim_multiplier} and {self.ff_intermediate_size} as None")
213
+ if self.ff_dim_multiplier is not None and self.ff_intermediate_size is not None:
214
+ raise ValueError(f"Cannot specify both {self.ff_dim_multiplier} and {self.ff_intermediate_size}.")
215
+ # General regularization
216
+ self.embedding_dropout_prob = embedding_dropout_prob
217
+ self.attention_dropout_prob = attention_dropout_prob
218
+ self.ffn_dropout_prob = ffn_dropout_prob
219
+ self.layer_norm_epsilon = layer_norm_epsilon
220
+ self.initializer_range = initializer_range
221
+ # MuP parameters
222
+ self.mup_use_scaling = mup_use_scaling
223
+ self.mup_width_multiplier = mup_width_multiplier
224
+ self.mup_embedding_multiplier = mup_embedding_multiplier
225
+ self.mup_attn_multiplier = mup_attn_multiplier
226
+ self.use_cache = use_cache
227
+
228
+ self.reorder_and_upcast_attn = reorder_and_upcast_attn
229
+ self.pad_sequence_to_multiple_of_64 = pad_sequence_to_multiple_of_64
230
+
231
+ self.bos_token_id = bos_token_id
232
+ self.eos_token_id = eos_token_id
233
+
234
+ super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
235
+
236
+ @cached_property
237
+ def dummy_token_indices(self) -> List[int]:
238
+ # Importing here to avoid circular imports
239
+ from .tokenization_phi3_small import Phi3SmallTokenizer
240
+ tokenizer = Phi3SmallTokenizer()
241
+ return tokenizer.dummy_token_indices
242
+
243
+ @property
244
+ def intermediate_size(self) -> int:
245
+ if self.ff_intermediate_size is not None:
246
+ return self.ff_intermediate_size
247
+ intermediate_size = (self.ff_dim_multiplier) * (self.hidden_size // 3) * 2
248
+ if self.gegelu_pad_to_256:
249
+ intermediate_size = next_mult(intermediate_size, 256)
250
+ return intermediate_size
cuda-fp16/cl100k_base.tiktoken ADDED
The diff for this file is too large to render. See raw diff
 
cuda-fp16/config.json ADDED
@@ -0,0 +1,47 @@
1
+ {
2
+ "_name_or_path": "Phi-3-small-8k-instruct",
3
+ "architectures": [
4
+ "Phi3SmallForCausalLM"
5
+ ],
6
+ "attention_dropout_prob": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_phi3_small.Phi3SmallConfig",
9
+ "AutoModelForCausalLM": "modeling_phi3_small.Phi3SmallForCausalLM",
10
+ "AutoTokenizer": "tokenization_phi3_small.Phi3SmallTokenizer"
11
+ },
12
+ "blocksparse_block_size": 64,
13
+ "blocksparse_homo_head_pattern": false,
14
+ "blocksparse_num_local_blocks": 16,
15
+ "blocksparse_triton_kernel_block_size": 64,
16
+ "blocksparse_vert_stride": 8,
17
+ "bos_token_id": 100257,
18
+ "dense_attention_every_n_layers": 2,
19
+ "embedding_dropout_prob": 0.1,
20
+ "eos_token_id": 100257,
21
+ "ff_dim_multiplier": null,
22
+ "ff_intermediate_size": 14336,
23
+ "ffn_dropout_prob": 0.1,
24
+ "gegelu_limit": 20.0,
25
+ "gegelu_pad_to_256": true,
26
+ "hidden_act": "gegelu",
27
+ "hidden_size": 4096,
28
+ "initializer_range": 0.02,
29
+ "layer_norm_epsilon": 1e-05,
30
+ "max_position_embeddings": 8192,
31
+ "model_type": "phi3small",
32
+ "mup_attn_multiplier": 1.0,
33
+ "mup_embedding_multiplier": 10.0,
34
+ "mup_use_scaling": true,
35
+ "mup_width_multiplier": 8.0,
36
+ "num_attention_heads": 32,
37
+ "num_hidden_layers": 32,
38
+ "num_key_value_heads": 8,
39
+ "pad_sequence_to_multiple_of_64": true,
40
+ "reorder_and_upcast_attn": false,
41
+ "rope_embedding_base": 1000000,
42
+ "rope_position_scale": 1.0,
43
+ "torch_dtype": "bfloat16",
44
+ "transformers_version": "4.38.1",
45
+ "use_cache": true,
46
+ "vocab_size": 100352
47
+ }
cuda-fp16/configuration_phi3_small.py ADDED
@@ -0,0 +1,250 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ from typing import Any, Dict, List, Optional, Union
17
+
18
+ from transformers.configuration_utils import PretrainedConfig
19
+ from transformers.utils import logging
20
+
21
+ from functools import cached_property
22
+
23
+ """ Phi3Small model configuration """
24
+ logger = logging.get_logger(__name__)
25
+
26
+
27
+ def next_mult(x, y):
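+ # round x up to the nearest multiple of y, e.g. next_mult(14000, 256) == 14080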
28
+ return (x + y - 1) // y * y
29
+
30
+ class Phi3SmallConfig(PretrainedConfig):
31
+ """
32
+ This is the configuration class to store the configuration of a `Phi3Small` model. It is used to
33
+ instantiate a Phi-3-small model according to the specified arguments, defining the model architecture.
34
+ Instantiating a configuration with the defaults will yield a similar configuration to that of the Phi-3-small
35
+ [phi3](https://arxiv.org/pdf/2404.14219) architecture.
36
+
37
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
38
+ documentation from [`PretrainedConfig`] for more information.
39
+
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 100352):
43
+ Vocabulary size of the Phi3Small model. Defines the number of different tokens that can be represented by the
44
+ `inputs_ids` passed when calling `Phi3Small`.
45
+ max_position_embeddings (`int`, *optional*, defaults to 8192):
46
+ The maximum sequence length that this model might safely be used with.
47
+ rope_embedding_base (`float`, *optional*, defaults to 10^6):
48
+ The base value for the RoPE (Relative Position Encoding) embedding.
49
+ rope_position_scale (`float`, *optional*, defaults to 1.0):
50
+ The scale factor for the RoPE position encoding.
51
+ rope_scaling (`Optional[Dict[str, Union[float, List[float], int]]]`, *optional*, defaults to None):
52
+ The scaling configuration used for LongRoPE.
53
+ hidden_size (`int`, *optional*, defaults to 4096):
54
+ The size of the hidden layers in the model.
55
+ num_hidden_layers (`int`, *optional*, defaults to 32):
56
+ The number of layers in the model.
57
+ num_attention_heads (`int`, *optional*, defaults to 32):
58
+ The number of query heads in the model.
59
+ num_key_value_heads (`int`, *optional*, defaults to 8):
60
+ The number of key-value heads in the model.
61
+ hidden_act (`str`, *optional*, defaults to "gegelu"):
62
+ The activation function used in the model.
63
+ gegelu_limit (`float`, *optional*, defaults to 20.0):
64
+ The limit value for the GELU activation function (for numerical stability).
65
+ gegelu_pad_to_256 (`bool`, *optional*, defaults to True):
66
+ Whether to pad the intermediate size to a multiple of 256 (for faster matmul ops).
67
+ ff_dim_multiplier (`Optional[int]`, *optional*, defaults to None):
68
+ The dimension multiplier for the feed-forward layers.
69
+ ff_intermediate_size (`Optional[int]`, *optional*, defaults to 14336):
70
+ The intermediate size for the feed-forward layers.
71
+ One of `ff_dim_multiplier` or `ff_intermediate_size` must be specified.
72
+ blocksparse_homo_head_pattern (`bool`, *optional*, defaults to False):
73
+ Whether to use a homogeneous head pattern for block-sparse attention.
74
+ blocksparse_block_size (`int`, *optional*, defaults to 64):
75
+ The block size for block-sparse attention.
76
+ blocksparse_num_local_blocks (`int`, *optional*, defaults to 16):
77
+ The number of local blocks for block-sparse attention.
78
+ The local window used in blocksparse equals `blocksparse_num_local_blocks * blocksparse_block_size`
79
+ blocksparse_vert_stride (`int`, *optional*, defaults to 8):
80
+ The vertical stride for block-sparse attention.
81
+ blocksparse_triton_kernel_block_size (`int`, *optional*, defaults to 64):
82
+ The kernel block size for block-sparse attention.
83
+ dense_attention_every_n_layers (`Optional[int]`, *optional*, defaults to 2):
84
+ The frequency of all dense attention layers in the model
85
+ embedding_dropout_prob (`float`, *optional*, defaults to 0.1):
86
+ The dropout probability for the embedding layer.
87
+ attention_dropout_prob (`float`, *optional*, defaults to 0.0):
88
+ The dropout probability for the attention layers.
89
+ ffn_dropout_prob (`float`, *optional*, defaults to 0.1):
90
+ The dropout probability for the feed-forward layers.
91
+ layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
92
+ The epsilon value for layer normalization.
93
+ initializer_range (`float`, *optional*, defaults to 0.02):
94
+ The range for weight initialization.
95
+ mup_use_scaling (`bool`, *optional*, defaults to True):
96
+ Whether to use scaling for MuP parameters (see: https://arxiv.org/abs/2203.03466).
97
+ mup_width_multiplier (`float`, *optional*, defaults to 8.0):
98
+ The width multiplier for MuP.
99
+ mup_embedding_multiplier (`float`, *optional*, defaults to 10.0):
100
+ The embedding multiplier for MuP.
101
+ mup_attn_multiplier (`float`, *optional*, defaults to 1.0):
102
+ The attention multiplier for MuP.
103
+ use_cache (`bool`, *optional*, defaults to True):
104
+ Whether to use cache for the model.
105
+ bos_token_id (`int`, *optional*, defaults to 100257):
106
+ The token ID for the beginning of sentence.
107
+ eos_token_id (`int`, *optional*, defaults to 100257):
108
+ The token ID for the end of sentence.
109
+ reorder_and_upcast_attn (`bool`, *optional*, defaults to False):
110
+ Whether to reorder and upcast attention.
111
+ pad_sequence_to_multiple_of_64 (`bool`, *optional*, defaults to True):
112
+ Whether to pad the sequence length to a multiple of 64.
113
+ **kwargs:
114
+ Additional keyword arguments.
115
+
116
+ Example:
117
+
118
+ ```python
119
+ >>> from transformers import Phi3SmallConfig, Phi3SmallModel
120
+
121
+ >>> # Initializing a Phi3Small configuration
122
+ >>> configuration = Phi3SmallConfig()
123
+
124
+ >>> # Initializing a model (with random weights) from the configuration
125
+ >>> model = Phi3SmallModel(configuration)
126
+
127
+ >>> # Accessing the model configuration
128
+ >>> configuration = model.config
129
+ ```
130
+ """
131
+
132
+ model_type = "phi3small"
133
+ keys_to_ignore_at_inference = ["past_key_values"]
134
+
135
+
136
+ def __init__(
137
+ self,
138
+ # General information about the model
139
+ vocab_size: int = 100352,
140
+ max_position_embeddings: int = 8192,
141
+ # RoPE Related Parameters
142
+ rope_embedding_base: float = 10**6,
143
+ rope_position_scale: float = 1.0,
144
+ rope_scaling: Optional[Dict[str, Union[float, List[float], int]]] = None,
145
+ # General Model Parameters
146
+ hidden_size: int = 4096,
147
+ num_hidden_layers: int = 32,
148
+ # KV Shared Attention Configurations
149
+ num_attention_heads: int = 32,
150
+ num_key_value_heads: int = 8,
151
+ # GEGELU Related Parameters
152
+ hidden_act: str = "gegelu",
153
+ gegelu_limit: float = 20.0,
154
+ gegelu_pad_to_256: bool = True,
155
+ ff_dim_multiplier: Optional[int] = None,
156
+ ff_intermediate_size: Optional[int] = 14336,
157
+ # Block Sparse Attention Parameters
158
+ blocksparse_homo_head_pattern: bool = False,
159
+ blocksparse_block_size: int = 64,
160
+ blocksparse_num_local_blocks: int = 16,
161
+ blocksparse_vert_stride: int = 8,
162
+ blocksparse_triton_kernel_block_size: int = 64,
163
+ # Frequency of block-sparsity
164
+ dense_attention_every_n_layers: Optional[int] = 2,
165
+ # Regularization parameters
166
+ embedding_dropout_prob: float = 0.1,
167
+ attention_dropout_prob: float = 0.0,
168
+ ffn_dropout_prob: float = 0.1,
169
+ layer_norm_epsilon=1e-5,
170
+ initializer_range=0.02,
171
+ # MuP parameters
172
+ mup_use_scaling: bool = True,
173
+ mup_width_multiplier: float = 8.0,
173
+ mup_embedding_multiplier: float = 10.0,
174
+ mup_attn_multiplier: float = 1.0,
176
+ use_cache=True,
177
+ # The model does not have a bos token id
178
+ # However, in order for some of the downstream libraries to not break
179
+ # we set this to be the same as the eos_token_id
180
+ bos_token_id: int = 100257,
181
+ eos_token_id: int = 100257,
182
+ reorder_and_upcast_attn=False,
183
+ # Configuration to pad sequence length to a multiple of 64
184
+ pad_sequence_to_multiple_of_64: bool = True,
185
+ **kwargs,
186
+ ):
187
+ self.vocab_size = vocab_size
188
+ self.max_position_embeddings = max_position_embeddings
189
+ self.rope_embedding_base = rope_embedding_base
190
+ self.rope_position_scale = rope_position_scale
191
+ self.rope_scaling = rope_scaling
192
+ self.hidden_size = hidden_size
193
+ # QK Shared Attention
194
+ self.num_hidden_layers = num_hidden_layers
195
+ self.num_attention_heads = num_attention_heads
196
+ self.num_key_value_heads = num_key_value_heads
197
+ # Block Sparse Attention Pattern
198
+ self.blocksparse_homo_head_pattern = blocksparse_homo_head_pattern
199
+ self.blocksparse_block_size = blocksparse_block_size
200
+ self.blocksparse_num_local_blocks = blocksparse_num_local_blocks
201
+ self.blocksparse_vert_stride = blocksparse_vert_stride
202
+ self.blocksparse_triton_kernel_block_size = blocksparse_triton_kernel_block_size
203
+ # Frequency of block sparsity
204
+ self.dense_attention_every_n_layers = dense_attention_every_n_layers
205
+ # Activation function
206
+ self.hidden_act = hidden_act
207
+ self.gegelu_limit = gegelu_limit
208
+ self.gegelu_pad_to_256 = gegelu_pad_to_256
209
+ self.ff_dim_multiplier = ff_dim_multiplier
210
+ self.ff_intermediate_size = ff_intermediate_size
211
+ if self.ff_dim_multiplier is None and self.ff_intermediate_size is None:
212
+ raise ValueError(f"Cannot have both {self.ff_dim_multiplier} and {self.ff_intermediate_size} as None")
213
+ if self.ff_dim_multiplier is not None and self.ff_intermediate_size is not None:
214
+ raise ValueError(f"Cannot specify both {self.ff_dim_multiplier} and {self.ff_intermediate_size}.")
215
+ # General regularization
216
+ self.embedding_dropout_prob = embedding_dropout_prob
217
+ self.attention_dropout_prob = attention_dropout_prob
218
+ self.ffn_dropout_prob = ffn_dropout_prob
219
+ self.layer_norm_epsilon = layer_norm_epsilon
220
+ self.initializer_range = initializer_range
221
+ # MuP parameters
222
+ self.mup_use_scaling = mup_use_scaling
223
+ self.mup_width_multiplier = mup_width_multiplier
224
+ self.mup_embedding_multiplier = mup_embedding_multiplier
225
+ self.mup_attn_multiplier = mup_attn_multiplier
226
+ self.use_cache = use_cache
227
+
228
+ self.reorder_and_upcast_attn = reorder_and_upcast_attn
229
+ self.pad_sequence_to_multiple_of_64 = pad_sequence_to_multiple_of_64
230
+
231
+ self.bos_token_id = bos_token_id
232
+ self.eos_token_id = eos_token_id
233
+
234
+ super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
235
+
236
+ @cached_property
237
+ def dummy_token_indices(self) -> List[int]:
238
+ # Importing here to avoid circular imports
239
+ from .tokenization_phi3_small import Phi3SmallTokenizer
240
+ tokenizer = Phi3SmallTokenizer()
241
+ return tokenizer.dummy_token_indices
242
+
243
+ @property
244
+ def intermediate_size(self) -> int:
245
+ if self.ff_intermediate_size is not None:
246
+ return self.ff_intermediate_size
247
+ intermediate_size = (self.ff_dim_multiplier) * (self.hidden_size // 3) * 2
248
+ if self.gegelu_pad_to_256:
249
+ intermediate_size = next_mult(intermediate_size, 256)
250
+ return intermediate_size
cuda-fp16/genai_config.json ADDED
@@ -0,0 +1,58 @@
1
+ {
2
+ "model": {
3
+ "bos_token_id": 100257,
4
+ "context_length": 8192,
5
+ "decoder": {
6
+ "session_options": {
7
+ "log_id": "onnxruntime-genai",
8
+ "provider_options": [
9
+ {
10
+ "cuda": {
11
+ "enable_cuda_graph": "0"
12
+ }
13
+ }
14
+ ]
15
+ },
16
+ "filename": "phi3-small-8k-instruct-cuda-fp16.onnx",
17
+ "head_size": 128,
18
+ "hidden_size": 4096,
19
+ "inputs": {
20
+ "input_ids": "input_ids",
21
+ "attention_mask": "attention_mask",
22
+ "past_key_names": "past_key_values.%d.key",
23
+ "past_value_names": "past_key_values.%d.value"
24
+ },
25
+ "outputs": {
26
+ "logits": "logits",
27
+ "present_key_names": "present.%d.key",
28
+ "present_value_names": "present.%d.value"
29
+ },
30
+ "num_attention_heads": 32,
31
+ "num_hidden_layers": 32,
32
+ "num_key_value_heads": 8
33
+ },
34
+ "eos_token_id": [
35
+ 100257,
36
+ 100266
37
+ ],
38
+ "pad_token_id": 100257,
39
+ "type": "phi3small",
40
+ "vocab_size": 100352
41
+ },
42
+ "search": {
43
+ "diversity_penalty": 0.0,
44
+ "do_sample": false,
45
+ "early_stopping": true,
46
+ "length_penalty": 1.0,
47
+ "max_length": 8192,
48
+ "min_length": 0,
49
+ "no_repeat_ngram_size": 0,
50
+ "num_beams": 1,
51
+ "num_return_sequences": 1,
52
+ "past_present_share_buffer": true,
53
+ "repetition_penalty": 1.0,
54
+ "temperature": 1.0,
55
+ "top_k": 1,
56
+ "top_p": 1.0
57
+ }
58
+ }
cuda-fp16/phi3-small-8k-instruct-cuda-fp16.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7ecb271f1ac3296ce5318e192c3df06151fb0c7d69c15c54f85d1b3e68a4149b
3
+ size 316745
cuda-fp16/phi3-small-8k-instruct-cuda-fp16.onnx.data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b623166c7dcd395ac3f57890b1c81026be0e1487c73af16a850d3d73a77eafe1
3
+ size 15609196672
cuda-fp16/special_tokens_map.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>"
4
+ }
cuda-fp16/tokenization_phi3_small.py ADDED
@@ -0,0 +1,315 @@
1
+ # Adapted from https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/tokenization_qwen.py
2
+ import os
3
+ from typing import Collection, List, Optional, Dict, Set, Tuple, Union
4
+
5
+ from functools import cached_property
6
+
7
+ import base64
8
+
9
+ from transformers import PreTrainedTokenizer, AddedToken, AutoConfig
10
+ from transformers.models.auto.tokenization_auto import get_tokenizer_config
11
+ import tiktoken
12
+
13
+
14
+ """
15
+ This tokenizer is almost identical to tiktoken.get_encoding("cl100k_base")
16
+ with a few additional special tokens to support the ChatML format.
17
+
18
+ TODO(bapatra): Right now, I do not save the special tokens to the vocab file.
19
+ Maybe in the future, that would be useful? Can add that support later.
20
+
21
+ """
22
+
23
+ def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
24
+ with open(tiktoken_bpe_file, "rb") as f:
25
+ contents = f.read()
26
+ return {
27
+ base64.b64decode(token): int(rank)
28
+ for token, rank in (line.split() for line in contents.splitlines() if line)
29
+ }
30
+
31
+ # On the megatron codebase, we pad vocabularies to ensure matrix multiplication is fast.
32
+ # this in turn causes some indices to be empty. We account for these empty indices by adding
33
+ # dummy tokens to the tokenizer.
34
+
35
+ EFFECTIVE_PADDED_VOCAB_SIZE = 100352
36
+ ACTUAL_VOCAB_SIZE = 100276
37
+
38
+
39
+ DUMMY_TOKENS = {
40
+ f"<|dummy_id_{11 + offset}|>": 100276 + offset
41
+ for offset in range(1, EFFECTIVE_PADDED_VOCAB_SIZE - ACTUAL_VOCAB_SIZE)
42
+ }
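+ # i.e. <|dummy_id_12|> ... <|dummy_id_86|>, covering ids 100277-100351 (id 100276 itself is <|endofprompt|> below)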
43
+
44
+ SPECIAL_TOKENS = {
45
+ # tiktoken.get_encoding("cl100k_base")._special_tokens
46
+ '<|endoftext|>': 100257,
47
+ '<|fim_prefix|>': 100258,
48
+ '<|fim_middle|>': 100259,
49
+ '<|fim_suffix|>': 100260,
50
+ # Special tokens for post-training
51
+ "<|system|>": 100261,
52
+ "<|user|>": 100262,
53
+ "<|assistant|>": 100263,
54
+ # Dummy unused tokens
55
+ "<|dummy_id_0|>": 100264,
56
+ "<|dummy_id_1|>": 100265,
57
+ # Special tokens for post-training continued
58
+ "<|end|>": 100266,
59
+ # Some dummy tokens, so that tokenization is contiguous and does not cause issues
60
+ # Note that the 100256th token of tiktoken.get_encoding("cl100k_base") does not
61
+ # actually map to anything. So we use a dummy token here.
62
+ "<|dummy_id_2|>": 100256,
63
+ # Likewise, tokens from 100267 to 100275 are also unused
64
+ "<|dummy_id_3|>": 100267,
65
+ "<|dummy_id_4|>": 100268,
66
+ "<|dummy_id_5|>": 100269,
67
+ "<|dummy_id_6|>": 100270,
68
+ "<|dummy_id_7|>": 100271,
69
+ "<|dummy_id_8|>": 100272,
70
+ "<|dummy_id_9|>": 100273,
71
+ "<|dummy_id_10|>": 100274,
72
+ "<|dummy_id_11|>": 100275,
73
+ # The final end of prompt token
74
+ # (unused, but present as a part of tiktoken.get_encoding("cl100k_base")._special_tokens)
75
+ '<|endofprompt|>': 100276,
76
+ # Dummy tokens to account for padding of the tokenizer
77
+ # We pad to ensure tensor cores are used for vocab multiplication
78
+ **DUMMY_TOKENS
79
+ }
80
+
81
+ class Phi3SmallTokenizer(PreTrainedTokenizer):
82
+ vocab_files_names = {
83
+ "vocab_file": "cl100k_base.tiktoken"
84
+ }
85
+
86
+ model_input_names: List[str] = ["input_ids", "attention_mask"]
87
+ padding_side = "left"
88
+
89
+ def __init__(
90
+ self,
91
+ vocab_file: Optional[str] = None,
92
+ errors: str = "replace",
93
+ **kwargs
94
+ ) -> None:
95
+ # PreTrainedTokenizer's init calls _add_tokens, which in turn checks
96
+ # if the token is present in `self.special_tokens`. Hence instantiating it here.
97
+ # The way Qwen gets around this is by checking against SPECIAL_TOKENS
98
+ # But I think it's better to check against the object's own `special_tokens`
99
+ # in case we eventually want to allow the tokenizer to have special tokens.
100
+ self.special_tokens = SPECIAL_TOKENS
101
+
102
+ super().__init__(**kwargs)
103
+ self.errors = errors
104
+
105
+ base = tiktoken.get_encoding("cl100k_base")
106
+ if vocab_file is None:
107
+ self.mergeable_ranks: Dict[bytes, int] = base._mergeable_ranks
108
+ else:
109
+ self.mergeable_ranks = _load_tiktoken_bpe(vocab_file)
110
+
111
+ self.pat_str = base._pat_str
112
+
113
+ enc = tiktoken.Encoding(
114
+ name="phi3small",
115
+ pat_str=self.pat_str,
116
+ mergeable_ranks=self.mergeable_ranks,
117
+ special_tokens=self.special_tokens,
118
+ )
119
+ self.tokenizer = enc
120
+
121
+ self.decoder: Dict[int, bytes] = {
122
+ v: k for k, v in self.mergeable_ranks.items()
123
+ }
124
+ self.decoder.update({v: k for k, v in self.special_tokens.items()})
125
+
126
+ self.eod_id = self.tokenizer.eot_token
127
+ self._eos_token = self._convert_id_to_token(self.eod_id)
128
+
129
+ # Setting the bos_token to be the same as the eos_token
130
+ # Note that this is **not** the correct thing to do, and is done
131
+ # just so that some of the downstream libraries do not break.
132
+ self._bos_token = self._eos_token
133
+
134
+ # Assign the special tokens to class variables
135
+ self.system_id = self.special_tokens["<|system|>"]
136
+ self.user_id = self.special_tokens["<|user|>"]
137
+ self.assistant_id = self.special_tokens["<|assistant|>"]
138
+ self.end_id = self.special_tokens["<|end|>"]
139
+
140
+ @cached_property
141
+ def dummy_token_indices(self) -> List[int]:
142
+ # There are some additional special tokens in the cl100k_base tokenizer
143
+ # that we do not use. Hence, we also consider them to be dummy tokens.
144
+ additional_tokens = [
145
+ "<|fim_prefix|>",
146
+ "<|fim_middle|>",
147
+ "<|fim_suffix|>",
148
+ "<|endofprompt|>"
149
+ ]
150
+ dummy_token_indices = [index for token, index in self.special_tokens.items() if "dummy_id" in token]
151
+ dummy_token_indices.extend([self.special_tokens[token] for token in additional_tokens])
152
+ return sorted(dummy_token_indices)
153
+
154
+ def __getstate__(self):
155
+ state = self.__dict__.copy()
156
+ del state["tokenizer"]
157
+ return state
158
+
159
+ def __setstate__(self, state):
160
+ self.__dict__ = state
161
+ enc = tiktoken.Encoding(
162
+ name="cl100k_im",
163
+ pat_str=self.pat_str,
164
+ mergeable_ranks=self.mergeable_ranks,
165
+ special_tokens=self.special_tokens,
166
+ )
167
+ self.tokenizer = enc
168
+
169
+ def __len__(self):
170
+ return self.tokenizer.n_vocab
171
+
172
+ @classmethod
173
+ def from_pretrained(
174
+ cls,
175
+ pretrained_model_name_or_path: Union[str, os.PathLike],
176
+ *init_inputs,
177
+ **kwargs,
178
+ ):
179
+ cls_kwargs = kwargs
180
+ # First try to load from the tokenization config if it exists
181
+ tokenization_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
182
+ if tokenization_config:
183
+ cls_kwargs.update(
184
+ dict(
185
+ model_max_length=tokenization_config["model_max_length"],
186
+ chat_template=tokenization_config.get("chat_template", None)
187
+ )
188
+ )
189
+ else:
190
+ config = AutoConfig.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
191
+ cls_kwargs["model_max_length"] = config.max_position_embeddings
192
+ return cls(**cls_kwargs)
193
+
194
+ def get_vocab(self) -> Dict[Union[str, bytes], int]:
195
+ return {**self.mergeable_ranks, **self.special_tokens}
196
+
197
+ def convert_tokens_to_ids(
198
+ self,
199
+ tokens: Union[bytes, str, List[Union[bytes, str]]]
200
+ ) -> Union[int, List[int]]:
201
+ ids = []
202
+ if isinstance(tokens, (str, bytes)):
203
+ if tokens in self.special_tokens:
204
+ return self.special_tokens[tokens]
205
+ else:
206
+ return self.mergeable_ranks.get(tokens)
207
+ ids: List[int] = []
208
+ for token in tokens:
209
+ ids.append(self.convert_tokens_to_ids(token))
210
+ return ids
211
+
212
+ def _add_tokens(
213
+ self,
214
+ new_tokens: Union[List[str], List[AddedToken]],
215
+ special_tokens: bool = False,
216
+ ) -> int:
217
+ if not special_tokens and new_tokens:
218
+ raise ValueError("Only special tokens can be added to this tokenizer")
219
+ for token in new_tokens:
220
+ surface_form = token.content if isinstance(token, AddedToken) else token
221
+ if surface_form not in self.special_tokens:
222
+ raise ValueError(
223
+ "For now, we do not support unknown special tokens\n"
224
+ "In the future, if there is a need for this, we can add special tokens to the tokenizer\n"
225
+ "starting from rank 100261 - 100263 and then 100266 - 100275.\n"
226
+ "And finally, we can re-construct the enc object.\n"
227
+ )
228
+ return 0
229
+
230
+ def save_vocabulary(self, save_directory: str, **kwargs) -> Tuple[str]:
231
+ file_path = os.path.join(save_directory, "cl100k_base.tiktoken")
232
+ with open(file_path, "w") as f:
233
+ for token, rank in self.mergeable_ranks.items():
234
+ line = base64.b64encode(token).decode("utf-8") + " " + str(rank) + "\n"
235
+ f.write(line)
236
+ return (file_path,)
237
+
238
+ def tokenize(
239
+ self,
240
+ text: str,
241
+ allowed_special: Union[Set, str] = "all",
242
+ disallowed_special: Union[Collection, str] = (),
243
+ **kwargs
244
+ ) -> List[Union[bytes, str]]:
245
+ tokens: List[Union[bytes, str]] = []
246
+ for token_id in self.tokenizer.encode(
247
+ text, allowed_special=allowed_special, disallowed_special=disallowed_special
248
+ ):
249
+ tokens.append(self.decoder[token_id])
250
+ return tokens
251
+
252
+ def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
253
+ """
254
+ Converts a sequence of tokens into a single string.
255
+ """
256
+ text = ""
257
+ temp = b""
258
+ for t in tokens:
259
+ if isinstance(t, str):
260
+ if temp:
261
+ text += temp.decode("utf-8", errors=self.errors)
262
+ temp = b""
263
+ text += t
264
+ elif isinstance(t, bytes):
265
+ temp += t
266
+ else:
267
+ raise TypeError("token should only be of type bytes or str")
268
+ if temp:
269
+ text += temp.decode("utf-8", errors=self.errors)
270
+ return text
271
+
272
+ @property
273
+ def vocab_size(self):
274
+ return self.tokenizer.n_vocab
275
+
276
+ @property
277
+ def eos_token_id(self) -> int:
278
+ return self.eod_id
279
+
280
+ def _convert_id_to_token(self, index: int) -> Union[bytes, str]:
281
+ """Converts an id to a token, special tokens included"""
282
+ if index in self.decoder:
283
+ return self.decoder[index]
284
+ raise ValueError("unknown ids")
285
+
286
+ def _convert_token_to_id(self, token: Union[bytes, str]) -> int:
287
+ """Converts a token to an id using the vocab, special tokens included"""
288
+ if token in self.special_tokens:
289
+ return self.special_tokens[token]
290
+ if token in self.mergeable_ranks:
291
+ return self.mergeable_ranks[token]
292
+ raise ValueError("unknown token")
293
+
294
+ def _tokenize(self, text: str, **kwargs):
295
+ """
296
+ Converts a string into a sequence of tokens (string), using the tokenizer. Splits into words for word-based
297
+ vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).
298
+ Do NOT take care of added tokens.
299
+ """
300
+ raise NotImplementedError
301
+
302
+ def _decode(
303
+ self,
304
+ token_ids: Union[int, List[int]],
305
+ skip_special_tokens: bool = False,
306
+ errors: str = None,
307
+ **kwargs,
308
+ ) -> str:
309
+ if isinstance(token_ids, int):
310
+ token_ids = [token_ids]
311
+ if skip_special_tokens:
312
+ token_ids = [i for i in token_ids if i < self.eod_id]
313
+ return self.tokenizer.decode(token_ids, errors=errors or self.errors)
314
+
315
+
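For reference, a minimal usage sketch of the tokenizer defined above. It assumes `transformers` and `tiktoken` are installed and that this folder's files are available locally; the folder path is illustrative, and only methods defined in the file above are used.

```python
from transformers import AutoTokenizer

# trust_remote_code is required so that tokenization_phi3_small.Phi3SmallTokenizer is loaded
# via the auto_map entry in tokenizer_config.json.
tokenizer = AutoTokenizer.from_pretrained("cuda-fp16", trust_remote_code=True)

tokens = tokenizer.tokenize("Hello, world!")        # byte-level BPE pieces from cl100k_base
ids = tokenizer.convert_tokens_to_ids(tokens)       # integer token ids
text = tokenizer.convert_tokens_to_string(tokens)   # round-trips back to "Hello, world!"

print(len(tokenizer))          # 100352 -- the padded vocabulary size
print(tokenizer.eos_token_id)  # 100257 -- <|endoftext|>
```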
cuda-fp16/tokenizer_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "_commit_hash": null,
3
+ "_from_auto": true,
4
+ "added_tokens_decoder": {},
5
+ "auto_map": {
6
+ "AutoTokenizer": [
7
+ "tokenization_phi3_small.Phi3SmallTokenizer",
8
+ null
9
+ ]
10
+ },
11
+ "bos_token": "<|endoftext|>",
12
+ "chat_template": "{{ bos_token }}{% for message in messages %}{{'<|' + message['role'] + '|>' + '\n' + message['content'] + '<|end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% else %}{{ eos_token }}{% endif %}",
13
+ "clean_up_tokenization_spaces": true,
14
+ "eos_token": "<|endoftext|>",
15
+ "model_max_length": 8192,
16
+ "token": true,
17
+ "tokenizer_class": "Phi3SmallTokenizer",
18
+ "trust_remote_code": true
19
+ }
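The `chat_template` above is what `apply_chat_template` uses to build ChatML-style prompts for this model. A small sketch of the rendered output, assuming the tokenizer from this folder is loaded as in the earlier example:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# prompt renders as (bos_token, then one block per message, then the generation prompt):
# <|endoftext|><|system|>
# You are a helpful assistant.<|end|>
# <|user|>
# What is the capital of France?<|end|>
# <|assistant|>
```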
cuda-int4-rtn-block-32/cl100k_base.tiktoken ADDED
The diff for this file is too large to render. See raw diff
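The vocabulary file is too large to render inline, but its format is simple and matches `_load_tiktoken_bpe` and `save_vocabulary` in tokenization_phi3_small.py: one `<base64-encoded token bytes> <rank>` pair per line. A tiny sketch (the rank shown is illustrative):

```python
import base64

# Each line of cl100k_base.tiktoken is "<base64 token bytes> <rank>".
line = base64.b64encode(b"hello").decode("utf-8") + " " + "15339"   # rank is illustrative
token, rank = line.split()
assert base64.b64decode(token) == b"hello"
```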
 
cuda-int4-rtn-block-32/config.json ADDED
@@ -0,0 +1,47 @@
1
+ {
2
+ "_name_or_path": "Phi-3-small-8k-instruct",
3
+ "architectures": [
4
+ "Phi3SmallForCausalLM"
5
+ ],
6
+ "attention_dropout_prob": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_phi3_small.Phi3SmallConfig",
9
+ "AutoModelForCausalLM": "modeling_phi3_small.Phi3SmallForCausalLM",
10
+ "AutoTokenizer": "tokenization_phi3_small.Phi3SmallTokenizer"
11
+ },
12
+ "blocksparse_block_size": 64,
13
+ "blocksparse_homo_head_pattern": false,
14
+ "blocksparse_num_local_blocks": 16,
15
+ "blocksparse_triton_kernel_block_size": 64,
16
+ "blocksparse_vert_stride": 8,
17
+ "bos_token_id": 100257,
18
+ "dense_attention_every_n_layers": 2,
19
+ "embedding_dropout_prob": 0.1,
20
+ "eos_token_id": 100257,
21
+ "ff_dim_multiplier": null,
22
+ "ff_intermediate_size": 14336,
23
+ "ffn_dropout_prob": 0.1,
24
+ "gegelu_limit": 20.0,
25
+ "gegelu_pad_to_256": true,
26
+ "hidden_act": "gegelu",
27
+ "hidden_size": 4096,
28
+ "initializer_range": 0.02,
29
+ "layer_norm_epsilon": 1e-05,
30
+ "max_position_embeddings": 8192,
31
+ "model_type": "phi3small",
32
+ "mup_attn_multiplier": 1.0,
33
+ "mup_embedding_multiplier": 10.0,
34
+ "mup_use_scaling": true,
35
+ "mup_width_multiplier": 8.0,
36
+ "num_attention_heads": 32,
37
+ "num_hidden_layers": 32,
38
+ "num_key_value_heads": 8,
39
+ "pad_sequence_to_multiple_of_64": true,
40
+ "reorder_and_upcast_attn": false,
41
+ "rope_embedding_base": 1000000,
42
+ "rope_position_scale": 1.0,
43
+ "torch_dtype": "bfloat16",
44
+ "transformers_version": "4.38.1",
45
+ "use_cache": true,
46
+ "vocab_size": 100352
47
+ }
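A few values used elsewhere in this repository follow directly from this configuration; a short sketch of the arithmetic (the file path is illustrative):

```python
import json

with open("cuda-int4-rtn-block-32/config.json") as f:
    cfg = json.load(f)

head_size = cfg["hidden_size"] // cfg["num_attention_heads"]          # 4096 // 32 = 128 (see genai_config.json)
kv_groups = cfg["num_attention_heads"] // cfg["num_key_value_heads"]  # 32 // 8 = 4 query heads share each KV head
assert cfg["vocab_size"] % 64 == 0                                    # 100352: vocabulary padded for fast matmuls
```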
cuda-int4-rtn-block-32/configuration_phi3_small.py ADDED
@@ -0,0 +1,250 @@
1
+ # coding=utf-8
2
+ # Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ from typing import Any, Dict, List, Optional, Union
17
+
18
+ from transformers.configuration_utils import PretrainedConfig
19
+ from transformers.utils import logging
20
+
21
+ from functools import cached_property
22
+
23
+ """ Phi3Small model configuration """
24
+ logger = logging.get_logger(__name__)
25
+
26
+
27
+ def next_mult(x, y):
28
+ return (x + y - 1) // y * y
29
+
30
+ class Phi3SmallConfig(PretrainedConfig):
31
+ """
32
+ This is the configuration class to store the configuration of a `Phi3Small` model. It is used to
33
+ instantiate a Phi-3-small model according to the specified arguments, defining the model architecture.
34
+ Instantiating a configuration with the defaults will yield a similar configuration to that of the Phi-3-small
35
+ [phi3](https://arxiv.org/pdf/2404.14219) architecture.
36
+
37
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
38
+ documentation from [`PretrainedConfig`] for more information.
39
+
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 100352):
43
+ Vocabulary size of the Phi3Small model. Defines the number of different tokens that can be represented by the
44
+ `inputs_ids` passed when calling `Phi3Small`.
45
+ max_position_embeddings (`int`, *optional*, defaults to 8192):
46
+ The maximum sequence length that this model might safely be used with.
47
+ rope_embedding_base (`float`, *optional*, defaults to 10^6):
48
+ The base value for the RoPE (Rotary Position Embedding).
49
+ rope_position_scale (`float`, *optional*, defaults to 1.0):
50
+ The scale factor for the RoPE position encoding.
51
+ rope_scaling (`Optional[Dict[str, Union[float, List[float], int]]]`, *optional*, defaults to None):
52
+ The scaling configuration used for LongRoPE.
53
+ hidden_size (`int`, *optional*, defaults to 4096):
54
+ The size of the hidden layers in the model.
55
+ num_hidden_layers (`int`, *optional*, defaults to 32):
56
+ The number of layers in the model.
57
+ num_attention_heads (`int`, *optional*, defaults to 32):
58
+ The number of query heads in the model.
59
+ num_key_value_heads (`int`, *optional*, defaults to 8):
60
+ The number of key-value heads in the model.
61
+ hidden_act (`str`, *optional*, defaults to "gegelu"):
62
+ The activation function used in the model.
63
+ gegelu_limit (`float`, *optional*, defaults to 20.0):
64
+ The limit value for the GELU activation function (for numerical stability).
65
+ gegelu_pad_to_256 (`bool`, *optional*, defaults to True):
66
+ Whether to pad the intermediate size to a multiple of 256 (for faster matmul ops).
67
+ ff_dim_multiplier (`Optional[int]`, *optional*, defaults to None):
68
+ The dimension multiplier for the feed-forward layers.
69
+ ff_intermediate_size (`Optional[int]`, *optional*, defaults to 14336):
70
+ The intermediate size for the feed-forward layers.
71
+ One of `ff_dim_multiplier` or `ff_intermediate_size` must be specified.
72
+ blocksparse_homo_head_pattern (`bool`, *optional*, defaults to False):
73
+ Whether to use a homogeneous head pattern for block-sparse attention.
74
+ blocksparse_block_size (`int`, *optional*, defaults to 64):
75
+ The block size for block-sparse attention.
76
+ blocksparse_num_local_blocks (`int`, *optional*, defaults to 16):
77
+ The number of local blocks for block-sparse attention.
78
+ The local window used in blocksparse equals `blocksparse_num_local_blocks * blocksparse_block_size`
79
+ blocksparse_vert_stride (`int`, *optional*, defaults to 8):
80
+ The vertical stride for block-sparse attention.
81
+ blocksparse_triton_kernel_block_size (`int`, *optional*, defaults to 64):
82
+ The kernel block size for block-sparse attention.
83
+ dense_attention_every_n_layers (`Optional[int]`, *optional*, defaults to 2):
84
+ The frequency of all dense attention layers in the model
85
+ embedding_dropout_prob (`float`, *optional*, defaults to 0.1):
86
+ The dropout probability for the embedding layer.
87
+ attention_dropout_prob (`float`, *optional*, defaults to 0.0):
88
+ The dropout probability for the attention layers.
89
+ ffn_dropout_prob (`float`, *optional*, defaults to 0.1):
90
+ The dropout probability for the feed-forward layers.
91
+ layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
92
+ The epsilon value for layer normalization.
93
+ initializer_range (`float`, *optional*, defaults to 0.02):
94
+ The range for weight initialization.
95
+ mup_use_scaling (`bool`, *optional*, defaults to True):
96
+ Whether to use scaling for MuP parameters (see: https://arxiv.org/abs/2203.03466).
97
+ mup_width_multiplier (`float`, *optional*, defaults to 8.0):
98
+ The width multiplier for MuP.
99
+ mup_embedding_multiplier (`float`, *optional*, defaults to 10.0):
100
+ The embedding multiplier for MuP.
101
+ mup_attn_multiplier (`float`, *optional*, defaults to 1.0):
102
+ The attention multiplier for MuP.
103
+ use_cache (`bool`, *optional*, defaults to True):
104
+ Whether to use cache for the model.
105
+ bos_token_id (`int`, *optional*, defaults to 100257):
106
+ The token ID for the beginning of sentence.
107
+ eos_token_id (`int`, *optional*, defaults to 100257):
108
+ The token ID for the end of sentence.
109
+ reorder_and_upcast_attn (`bool`, *optional*, defaults to False):
110
+ Whether to reorder and upcast attention.
111
+ pad_sequence_to_multiple_of_64 (`bool`, *optional*, defaults to True):
112
+ Whether to pad the sequence length to a multiple of 64.
113
+ **kwargs:
114
+ Additional keyword arguments.
115
+
116
+ Example:
117
+
118
+ ```python
119
+ >>> from transformers import Phi3SmallConfig, Phi3SmallModel
120
+
121
+ >>> # Initializing a Phi3Small configuration
122
+ >>> configuration = Phi3SmallConfig()
123
+
124
+ >>> # Initializing a model (with random weights) from the configuration
125
+ >>> model = Phi3SmallModel(configuration)
126
+
127
+ >>> # Accessing the model configuration
128
+ >>> configuration = model.config
129
+ ```
130
+ """
131
+
132
+ model_type = "phi3small"
133
+ keys_to_ignore_at_inference = ["past_key_values"]
134
+
135
+
136
+ def __init__(
137
+ self,
138
+ # General information about the model
139
+ vocab_size: int = 100352,
140
+ max_position_embeddings: int = 8192,
141
+ # RoPE Related Parameters
142
+ rope_embedding_base: float = 10**6,
143
+ rope_position_scale: float = 1.0,
144
+ rope_scaling: Optional[Dict[str, Union[float, List[float], int]]] = None,
145
+ # General Model Parameters
146
+ hidden_size: int = 4096,
147
+ num_hidden_layers: int = 32,
148
+ # KV Shared Attention Configurations
149
+ num_attention_heads: int = 32,
150
+ num_key_value_heads: int = 8,
151
+ # GEGELU Related Parameters
152
+ hidden_act: str = "gegelu",
153
+ gegelu_limit: float = 20.0,
154
+ gegelu_pad_to_256: bool = True,
155
+ ff_dim_multiplier: Optional[int] = None,
156
+ ff_intermediate_size: Optional[int] = 14336,
157
+ # Block Sparse Attention Parameters
158
+ blocksparse_homo_head_pattern: bool = False,
159
+ blocksparse_block_size: int = 64,
160
+ blocksparse_num_local_blocks: int = 16,
161
+ blocksparse_vert_stride: int = 8,
162
+ blocksparse_triton_kernel_block_size: int = 64,
163
+ # Frequency of block-sparsity
164
+ dense_attention_every_n_layers: Optional[int] = 2,
165
+ # Regularization parameters
166
+ embedding_dropout_prob: float = 0.1,
167
+ attention_dropout_prob: float = 0.0,
168
+ ffn_dropout_prob: float = 0.1,
169
+ layer_norm_epsilon=1e-5,
170
+ initializer_range=0.02,
171
+ # MuP parameters
172
+ mup_use_scaling: bool = True,
173
+ mup_width_multiplier: float = 8.0,
174
+ mup_embedding_multiplier: float = 10.0,
175
+ mup_attn_multiplier: float = 1.0,
176
+ use_cache=True,
177
+ # The model does not have a bos token id
178
+ # However, in order for some of the downstream libraries to not break
179
+ # we set this to be the same as the eos_token_id
180
+ bos_token_id: int = 100257,
181
+ eos_token_id: int = 100257,
182
+ reorder_and_upcast_attn=False,
183
+ # Configuration to pad sequence length to a multiple of 64
184
+ pad_sequence_to_multiple_of_64: bool = True,
185
+ **kwargs,
186
+ ):
187
+ self.vocab_size = vocab_size
188
+ self.max_position_embeddings = max_position_embeddings
189
+ self.rope_embedding_base = rope_embedding_base
190
+ self.rope_position_scale = rope_position_scale
191
+ self.rope_scaling = rope_scaling
192
+ self.hidden_size = hidden_size
193
+ # KV Shared Attention
194
+ self.num_hidden_layers = num_hidden_layers
195
+ self.num_attention_heads = num_attention_heads
196
+ self.num_key_value_heads = num_key_value_heads
197
+ # Block Sparse Attention Pattern
198
+ self.blocksparse_homo_head_pattern = blocksparse_homo_head_pattern
199
+ self.blocksparse_block_size = blocksparse_block_size
200
+ self.blocksparse_num_local_blocks = blocksparse_num_local_blocks
201
+ self.blocksparse_vert_stride = blocksparse_vert_stride
202
+ self.blocksparse_triton_kernel_block_size = blocksparse_triton_kernel_block_size
203
+ # Frequency of block sparsity
204
+ self.dense_attention_every_n_layers = dense_attention_every_n_layers
205
+ # Activation function
206
+ self.hidden_act = hidden_act
207
+ self.gegelu_limit = gegelu_limit
208
+ self.gegelu_pad_to_256 = gegelu_pad_to_256
209
+ self.ff_dim_multiplier = ff_dim_multiplier
210
+ self.ff_intermediate_size = ff_intermediate_size
211
+ if self.ff_dim_multiplier is None and self.ff_intermediate_size is None:
212
+ raise ValueError("At least one of ff_dim_multiplier or ff_intermediate_size must be specified; both are None.")
213
+ if self.ff_dim_multiplier is not None and self.ff_intermediate_size is not None:
214
+ raise ValueError("Cannot specify both ff_dim_multiplier and ff_intermediate_size.")
215
+ # General regularization
216
+ self.embedding_dropout_prob = embedding_dropout_prob
217
+ self.attention_dropout_prob = attention_dropout_prob
218
+ self.ffn_dropout_prob = ffn_dropout_prob
219
+ self.layer_norm_epsilon = layer_norm_epsilon
220
+ self.initializer_range = initializer_range
221
+ # MuP parameters
222
+ self.mup_use_scaling = mup_use_scaling
223
+ self.mup_width_multiplier = mup_width_multiplier
224
+ self.mup_embedding_multiplier = mup_embedding_multiplier
225
+ self.mup_attn_multiplier = mup_attn_multiplier
226
+ self.use_cache = use_cache
227
+
228
+ self.reorder_and_upcast_attn = reorder_and_upcast_attn
229
+ self.pad_sequence_to_multiple_of_64 = pad_sequence_to_multiple_of_64
230
+
231
+ self.bos_token_id = bos_token_id
232
+ self.eos_token_id = eos_token_id
233
+
234
+ super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
235
+
236
+ @cached_property
237
+ def dummy_token_indices(self) -> List[int]:
238
+ # Importing here to avoid circular imports
239
+ from .tokenization_phi3_small import Phi3SmallTokenizer
240
+ tokenizer = Phi3SmallTokenizer()
241
+ return tokenizer.dummy_token_indices
242
+
243
+ @property
244
+ def intermediate_size(self) -> int:
245
+ if self.ff_intermediate_size is not None:
246
+ return self.ff_intermediate_size
247
+ intermediate_size = (self.ff_dim_multiplier) * (self.hidden_size // 3) * 2
248
+ if self.gegelu_pad_to_256:
249
+ intermediate_size = next_mult(intermediate_size, 256)
250
+ return intermediate_size
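A short sketch of how the `intermediate_size` property above resolves, assuming the folder containing `configuration_phi3_small.py` is importable; the values in the comments follow from the defaults:

```python
from configuration_phi3_small import Phi3SmallConfig, next_mult

config = Phi3SmallConfig()           # defaults: ff_intermediate_size=14336, ff_dim_multiplier=None
print(config.intermediate_size)      # 14336, returned directly from ff_intermediate_size

# If ff_intermediate_size were None instead, the size would be derived as
#   ff_dim_multiplier * (hidden_size // 3) * 2
# and, because gegelu_pad_to_256=True, rounded up to the next multiple of 256:
print(next_mult(14000, 256))         # 14080
```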
cuda-int4-rtn-block-32/genai_config.json ADDED
@@ -0,0 +1,58 @@
1
+ {
2
+ "model": {
3
+ "bos_token_id": 100257,
4
+ "context_length": 8192,
5
+ "decoder": {
6
+ "session_options": {
7
+ "log_id": "onnxruntime-genai",
8
+ "provider_options": [
9
+ {
10
+ "cuda": {
11
+ "enable_cuda_graph": "0"
12
+ }
13
+ }
14
+ ]
15
+ },
16
+ "filename": "phi3-small-8k-instruct-cuda-int4-rtn-block-32.onnx",
17
+ "head_size": 128,
18
+ "hidden_size": 4096,
19
+ "inputs": {
20
+ "input_ids": "input_ids",
21
+ "attention_mask": "attention_mask",
22
+ "past_key_names": "past_key_values.%d.key",
23
+ "past_value_names": "past_key_values.%d.value"
24
+ },
25
+ "outputs": {
26
+ "logits": "logits",
27
+ "present_key_names": "present.%d.key",
28
+ "present_value_names": "present.%d.value"
29
+ },
30
+ "num_attention_heads": 32,
31
+ "num_hidden_layers": 32,
32
+ "num_key_value_heads": 8
33
+ },
34
+ "eos_token_id": [
35
+ 100257,
36
+ 100266
37
+ ],
38
+ "pad_token_id": 100257,
39
+ "type": "phi3small",
40
+ "vocab_size": 100352
41
+ },
42
+ "search": {
43
+ "diversity_penalty": 0.0,
44
+ "do_sample": false,
45
+ "early_stopping": true,
46
+ "length_penalty": 1.0,
47
+ "max_length": 8192,
48
+ "min_length": 0,
49
+ "no_repeat_ngram_size": 0,
50
+ "num_beams": 1,
51
+ "num_return_sequences": 1,
52
+ "past_present_share_buffer": true,
53
+ "repetition_penalty": 1.0,
54
+ "temperature": 1.0,
55
+ "top_k": 1,
56
+ "top_p": 1.0
57
+ }
58
+ }
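This `genai_config.json` is what onnxruntime-genai reads to run the ONNX model in this folder. A generation sketch follows; the exact Python API differs between onnxruntime-genai releases, so treat the calls below (from the 0.2/0.3-era API) and the folder path as illustrative:

```python
import onnxruntime_genai as og

model = og.Model("cuda-int4-rtn-block-32")   # folder with genai_config.json and the .onnx/.onnx.data files
tokenizer = og.Tokenizer(model)

# Prompt in the ChatML-style format produced by the chat template in tokenizer_config.json.
prompt = "<|endoftext|><|user|>\nWhat is the capital of France?<|end|>\n<|assistant|>\n"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)

output_ids = model.generate(params)[0]
print(tokenizer.decode(output_ids))
```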
cuda-int4-rtn-block-32/phi3-small-8k-instruct-cuda-int4-rtn-block-32.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d289b9a7537ab9fa1d8ba49238ed7ea9c08150e703dd6984b9303676d35e5b34
3
+ size 361044
cuda-int4-rtn-block-32/phi3-small-8k-instruct-cuda-int4-rtn-block-32.onnx.data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c2ea2d0344a5af8c340d002b2c65710ef537f831079afe0e8f2644285bfa2dfb
3
+ size 4985548928
cuda-int4-rtn-block-32/special_tokens_map.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>"
4
+ }
cuda-int4-rtn-block-32/tokenization_phi3_small.py ADDED
@@ -0,0 +1,315 @@
1
+ # Adapted from https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/tokenization_qwen.py
2
+ import os
3
+ from typing import Collection, List, Optional, Dict, Set, Tuple, Union
4
+
5
+ from functools import cached_property
6
+
7
+ import base64
8
+
9
+ from transformers import PreTrainedTokenizer, AddedToken, AutoConfig
10
+ from transformers.models.auto.tokenization_auto import get_tokenizer_config
11
+ import tiktoken
12
+
13
+
14
+ """
15
+ This tokenizer is almost identical to tiktoken.get_encoding("cl100k_base")
16
+ with a few additional special tokens to support the ChatML format.
17
+
18
+ TODO(bapatra): Right now, I do not save the special tokens to the vocab file.
19
+ Maybe in the future, that would be useful? Can add that support later.
20
+
21
+ """
22
+
23
+ def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
24
+ with open(tiktoken_bpe_file, "rb") as f:
25
+ contents = f.read()
26
+ return {
27
+ base64.b64decode(token): int(rank)
28
+ for token, rank in (line.split() for line in contents.splitlines() if line)
29
+ }
30
+
31
+ # In the Megatron codebase, we pad vocabularies to ensure matrix multiplication is fast.
32
+ # This in turn causes some indices to be empty. We account for these empty indices by adding
33
+ # dummy tokens to the tokenizer.
34
+
35
+ EFFECTIVE_PADDED_VOCAB_SIZE = 100352
36
+ ACTUAL_VOCAB_SIZE = 100276
37
+
38
+
39
+ DUMMY_TOKENS = {
40
+ f"<|dummy_id_{11 + offset}|>": 100276 + offset
41
+ for offset in range(1, EFFECTIVE_PADDED_VOCAB_SIZE - ACTUAL_VOCAB_SIZE)
42
+ }
43
+
44
+ SPECIAL_TOKENS = {
45
+ # tiktoken.get_encoding("cl100k_base")._special_tokens
46
+ '<|endoftext|>': 100257,
47
+ '<|fim_prefix|>': 100258,
48
+ '<|fim_middle|>': 100259,
49
+ '<|fim_suffix|>': 100260,
50
+ # Special tokens for post-training
51
+ "<|system|>": 100261,
52
+ "<|user|>": 100262,
53
+ "<|assistant|>": 100263,
54
+ # Dummy unused tokens
55
+ "<|dummy_id_0|>": 100264,
56
+ "<|dummy_id_1|>": 100265,
57
+ # Special tokens for post-training continued
58
+ "<|end|>": 100266,
59
+ # Some dummy tokens, so that tokenization is contiguous and does not cause issues
60
+ # Note that the 100256th token of tiktoken.get_encoding("cl100k_base") does not
61
+ # actually map to anything. So we use a dummy token here.
62
+ "<|dummy_id_2|>": 100256,
63
+ # Likewise, tokens from 100267 to 100275 are also unused
64
+ "<|dummy_id_3|>": 100267,
65
+ "<|dummy_id_4|>": 100268,
66
+ "<|dummy_id_5|>": 100269,
67
+ "<|dummy_id_6|>": 100270,
68
+ "<|dummy_id_7|>": 100271,
69
+ "<|dummy_id_8|>": 100272,
70
+ "<|dummy_id_9|>": 100273,
71
+ "<|dummy_id_10|>": 100274,
72
+ "<|dummy_id_11|>": 100275,
73
+ # The final end of prompt token
74
+ # (unused, but present as a part of tiktoken.get_encoding("cl100k_base")._special_tokens)
75
+ '<|endofprompt|>': 100276,
76
+ # Dummy tokens to account for padding of the tokenizer
77
+ # We pad to ensure tensor cores are used for vocab multiplication
78
+ **DUMMY_TOKENS
79
+ }
80
+
81
+ class Phi3SmallTokenizer(PreTrainedTokenizer):
82
+ vocab_files_names = {
83
+ "vocab_file": "cl100k_base.tiktoken"
84
+ }
85
+
86
+ model_input_names: List[str] = ["input_ids", "attention_mask"]
87
+ padding_side = "left"
88
+
89
+ def __init__(
90
+ self,
91
+ vocab_file: Optional[str] = None,
92
+ errors: str = "replace",
93
+ **kwargs
94
+ ) -> None:
95
+ # PreTrainedTokenizer's init calls _add_tokens, which in turn checks
96
+ # if the token is present in `self.special_tokens`. Hence instantiating it here.
97
+ # The way Qwen gets around this is by checking against SPECIAL_TOKENS
98
+ # But I think it's better to check against the object's own `special_tokens`
99
+ # in case we eventually want to allow the tokenizer to have special tokens.
100
+ self.special_tokens = SPECIAL_TOKENS
101
+
102
+ super().__init__(**kwargs)
103
+ self.errors = errors
104
+
105
+ base = tiktoken.get_encoding("cl100k_base")
106
+ if vocab_file is None:
107
+ self.mergeable_ranks: Dict[bytes, int] = base._mergeable_ranks
108
+ else:
109
+ self.mergeable_ranks = _load_tiktoken_bpe(vocab_file)
110
+
111
+ self.pat_str = base._pat_str
112
+
113
+ enc = tiktoken.Encoding(
114
+ name="phi3small",
115
+ pat_str=self.pat_str,
116
+ mergeable_ranks=self.mergeable_ranks,
117
+ special_tokens=self.special_tokens,
118
+ )
119
+ self.tokenizer = enc
120
+
121
+ self.decoder: Dict[int, bytes] = {
122
+ v: k for k, v in self.mergeable_ranks.items()
123
+ }
124
+ self.decoder.update({v: k for k, v in self.special_tokens.items()})
125
+
126
+ self.eod_id = self.tokenizer.eot_token
127
+ self._eos_token = self._convert_id_to_token(self.eod_id)
128
+
129
+ # Setting the bos_token to be the same as the eos_token
130
+ # Note that this is **not** the correct thing to do, and is done
131
+ # just so that some of the downstream libraries do not break.
132
+ self._bos_token = self._eos_token
133
+
134
+ # Assign the special tokens to class variables
135
+ self.system_id = self.special_tokens["<|system|>"]
136
+ self.user_id = self.special_tokens["<|user|>"]
137
+ self.assistant_id = self.special_tokens["<|assistant|>"]
138
+ self.end_id = self.special_tokens["<|end|>"]
139
+
140
+ @cached_property
141
+ def dummy_token_indices(self) -> List[int]:
142
+ # There are some additional special tokens in the cl100k_base tokenizer
143
+ # that we do not use. Hence, we also consider them to be dummy tokens.
144
+ additional_tokens = [
145
+ "<|fim_prefix|>",
146
+ "<|fim_middle|>",
147
+ "<|fim_suffix|>",
148
+ "<|endofprompt|>"
149
+ ]
150
+ dummy_token_indices = [index for token, index in self.special_tokens.items() if "dummy_id" in token]
151
+ dummy_token_indices.extend([self.special_tokens[token] for token in additional_tokens])
152
+ return sorted(dummy_token_indices)
153
+
154
+ def __getstate__(self):
155
+ state = self.__dict__.copy()
156
+ del state["tokenizer"]
157
+ return state
158
+
159
+ def __setstate__(self, state):
160
+ self.__dict__ = state
161
+ enc = tiktoken.Encoding(
162
+ name="cl100k_im",
163
+ pat_str=self.pat_str,
164
+ mergeable_ranks=self.mergeable_ranks,
165
+ special_tokens=self.special_tokens,
166
+ )
167
+ self.tokenizer = enc
168
+
169
+ def __len__(self):
170
+ return self.tokenizer.n_vocab
171
+
172
+ @classmethod
173
+ def from_pretrained(
174
+ cls,
175
+ pretrained_model_name_or_path: Union[str, os.PathLike],
176
+ *init_inputs,
177
+ **kwargs,
178
+ ):
179
+ cls_kwargs = kwargs
180
+ # First try to load from the tokenization config if it exists
181
+ tokenization_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
182
+ if tokenization_config:
183
+ cls_kwargs.update(
184
+ dict(
185
+ model_max_length=tokenization_config["model_max_length"],
186
+ chat_template=tokenization_config.get("chat_template", None)
187
+ )
188
+ )
189
+ else:
190
+ config = AutoConfig.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
191
+ cls_kwargs["model_max_length"] = config.max_position_embeddings
192
+ return cls(**cls_kwargs)
193
+
194
+ def get_vocab(self) -> Dict[Union[str, bytes], int]:
195
+ return {**self.mergeable_ranks, **self.special_tokens}
196
+
197
+ def convert_tokens_to_ids(
198
+ self,
199
+ tokens: Union[bytes, str, List[Union[bytes, str]]]
200
+ ) -> Union[int, List[int]]:
201
+ ids = []
202
+ if isinstance(tokens, (str, bytes)):
203
+ if tokens in self.special_tokens:
204
+ return self.special_tokens[tokens]
205
+ else:
206
+ return self.mergeable_ranks.get(tokens)
207
+ ids: List[int] = []
208
+ for token in tokens:
209
+ ids.append(self.convert_tokens_to_ids(token))
210
+ return ids
211
+
212
+ def _add_tokens(
213
+ self,
214
+ new_tokens: Union[List[str], List[AddedToken]],
215
+ special_tokens: bool = False,
216
+ ) -> int:
217
+ if not special_tokens and new_tokens:
218
+ raise ValueError("Only special tokens can be added to this tokenizer")
219
+ for token in new_tokens:
220
+ surface_form = token.content if isinstance(token, AddedToken) else token
221
+ if surface_form not in self.special_tokens:
222
+ raise ValueError(
223
+ "For now, we do not support unknown special tokens\n"
224
+ "In the future, if there is a need for this, we can add special tokens to the tokenizer\n"
225
+ "starting from rank 100261 - 100263 and then 100266 - 100275.\n"
226
+ "And finally, we can re-construct the enc object.\n"
227
+ )
228
+ return 0
229
+
230
+ def save_vocabulary(self, save_directory: str, **kwargs) -> Tuple[str]:
231
+ file_path = os.path.join(save_directory, "cl100k_base.tiktoken")
232
+ with open(file_path, "w") as f:
233
+ for token, rank in self.mergeable_ranks.items():
234
+ line = base64.b64encode(token).decode("utf-8") + " " + str(rank) + "\n"
235
+ f.write(line)
236
+ return (file_path,)
237
+
238
+ def tokenize(
239
+ self,
240
+ text: str,
241
+ allowed_special: Union[Set, str] = "all",
242
+ disallowed_special: Union[Collection, str] = (),
243
+ **kwargs
244
+ ) -> List[Union[bytes, str]]:
245
+ tokens: List[Union[bytes, str]] = []
246
+ for token_id in self.tokenizer.encode(
247
+ text, allowed_special=allowed_special, disallowed_special=disallowed_special
248
+ ):
249
+ tokens.append(self.decoder[token_id])
250
+ return tokens
251
+
252
+ def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
253
+ """
254
+ Converts a sequence of tokens into a single string.
255
+ """
256
+ text = ""
257
+ temp = b""
258
+ for t in tokens:
259
+ if isinstance(t, str):
260
+ if temp:
261
+ text += temp.decode("utf-8", errors=self.errors)
262
+ temp = b""
263
+ text += t
264
+ elif isinstance(t, bytes):
265
+ temp += t
266
+ else:
267
+ raise TypeError("token should only be of type bytes or str")
268
+ if temp:
269
+ text += temp.decode("utf-8", errors=self.errors)
270
+ return text
271
+
272
+ @property
273
+ def vocab_size(self):
274
+ return self.tokenizer.n_vocab
275
+
276
+ @property
277
+ def eos_token_id(self) -> int:
278
+ return self.eod_id
279
+
280
+ def _convert_id_to_token(self, index: int) -> Union[bytes, str]:
281
+ """Converts an id to a token, special tokens included"""
282
+ if index in self.decoder:
283
+ return self.decoder[index]
284
+ raise ValueError("unknown ids")
285
+
286
+ def _convert_token_to_id(self, token: Union[bytes, str]) -> int:
287
+ """Converts a token to an id using the vocab, special tokens included"""
288
+ if token in self.special_tokens:
289
+ return self.special_tokens[token]
290
+ if token in self.mergeable_ranks:
291
+ return self.mergeable_ranks[token]
292
+ raise ValueError("unknown token")
293
+
294
+ def _tokenize(self, text: str, **kwargs):
295
+ """
296
+ Converts a string into a sequence of tokens (string), using the tokenizer. Splits into words for word-based
297
+ vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).
298
+ Do NOT take care of added tokens.
299
+ """
300
+ raise NotImplementedError
301
+
302
+ def _decode(
303
+ self,
304
+ token_ids: Union[int, List[int]],
305
+ skip_special_tokens: bool = False,
306
+ errors: str = None,
307
+ **kwargs,
308
+ ) -> str:
309
+ if isinstance(token_ids, int):
310
+ token_ids = [token_ids]
311
+ if skip_special_tokens:
312
+ token_ids = [i for i in token_ids if i < self.eod_id]
313
+ return self.tokenizer.decode(token_ids, errors=errors or self.errors)
314
+
315
+
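One note on the `__getstate__`/`__setstate__` pair above: the tiktoken `Encoding` wraps a native object that does not pickle cleanly, so it is dropped before pickling and rebuilt afterwards. A quick round-trip sketch, assuming this folder is on the import path:

```python
import pickle

from tokenization_phi3_small import Phi3SmallTokenizer

tok = Phi3SmallTokenizer()
restored = pickle.loads(pickle.dumps(tok))   # __getstate__ drops the Encoding, __setstate__ rebuilds it

assert restored.tokenize("hello world") == tok.tokenize("hello world")
```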
cuda-int4-rtn-block-32/tokenizer_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "_commit_hash": null,
3
+ "_from_auto": true,
4
+ "added_tokens_decoder": {},
5
+ "auto_map": {
6
+ "AutoTokenizer": [
7
+ "tokenization_phi3_small.Phi3SmallTokenizer",
8
+ null
9
+ ]
10
+ },
11
+ "bos_token": "<|endoftext|>",
12
+ "chat_template": "{{ bos_token }}{% for message in messages %}{{'<|' + message['role'] + '|>' + '\n' + message['content'] + '<|end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% else %}{{ eos_token }}{% endif %}",
13
+ "clean_up_tokenization_spaces": true,
14
+ "eos_token": "<|endoftext|>",
15
+ "model_max_length": 8192,
16
+ "token": true,
17
+ "tokenizer_class": "Phi3SmallTokenizer",
18
+ "trust_remote_code": true
19
+ }