DachengZhang committed on
Commit
b94fca7
1 Parent(s): 5ebc2d0

initial commit

LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright (C) 2023 ORION STAR Robotics
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
README.MD ADDED
@@ -0,0 +1 @@
1
+ obsolete
README.md ADDED
@@ -0,0 +1,728 @@
1
+ <!-- markdownlint-disable first-line-h1 -->
2
+ <!-- markdownlint-disable html -->
3
+ <div align="center">
4
+ <img src="./assets/imgs/orion_start.PNG" alt="logo" width="30%" />
5
+ </div>
6
+
7
+ <div align="center">
8
+ <h1>
9
+ Orion-14B
10
+ </h1>
11
+ </div>
12
+
13
+ <div align="center">
14
+
15
+ <div align="center">
16
+ <b>🇨🇳中文</b> | <a href="#table-of-contents">🌐English</a>
17
+ </div>
18
+
19
+ <h4 align="center">
20
+ <p>
21
+ 🤗 <a href="https://huggingface.co/OrionStarAI" target="_blank">HuggingFace主页</a> | 🤖 <a href="https://modelscope.cn/organization/OrionStarAI" target="_blank">ModelScope主页</a><br>🎬 <a href="https://huggingface.co/spaces/OrionStarAI/Orion-14B-App-Demo" target="_blank">HuggingFace在线试用</a> | 🎫 <a href="https://modelscope.cn/studios/OrionStarAI/Orion-14B-App-Demo/summary" target="_blank">ModelScope在线试用</a><br>😺 <a href="https://github.com/OrionStarAI/Orion" target="_blank">GitHub</a><br>📖 <a href="https://github.com/OrionStarAI/Orion/blob/master/doc/Orion14B_v3.pdf" target="_blank">技术报告</a>
22
+ <p>
23
+ </h4>
24
+
25
+ </div>
26
+
27
+
28
+
29
+ # 目录
30
+
31
+ - [📖 模型介绍](#zh_model-introduction)
32
+ - [🔗 下载路径](#zh_model-download)
33
+ - [🔖 评估结果](#zh_model-benchmark)
34
+ - [📊 模型推理](#zh_model-inference) [<img src="./assets/imgs/vllm_1.png" alt="vllm" style="margin: 0;display: initial;" height="20" />](#vllm) [<img src="./assets/imgs/llama_cpp_1.png" alt="llamacpp" style="margin: 0;display: initial;" height="20" />](#llama-cpp)
35
+ - [📜 声明协议](#zh_declarations-license)
36
+ - [🥇 企业介绍](#zh_company-introduction)
37
+
38
+
39
+ <a name="zh_model-introduction"></a><br>
40
+ # 1. 模型介绍
41
+
42
+ - Orion-14B-Base是一个具有140亿参数的多语种大模型,该模型在一个包含2.5万亿token的多样化数据集上进行了训练,涵盖了中文、英语、日语、韩语等多种语言,在多语言环境下的一系列任务中展现出卓越的性能。在主流的公开基准评测中,Orion-14B系列模型表现优异,多项指标显著超越同等参数规模的其他模型。具体技术细节请参考[技术报告](https://github.com/OrionStarAI/Orion/blob/master/doc/Orion14B_v3.pdf)。
43
+
44
+ - Orion-14B系列大模型有以下几个特点:
45
+ - 基座20B参数级别大模型综合评测效果表现优异
46
+ - 多语言能力强,在日语、韩语测试集上显著领先
47
+ - 微调模型适应性强,在人类标注盲测中,表现突出
48
+ - 长上下文版本支持超长文本,在200k token长度上效果优异,最长可支持达320k
49
+ - 量化版本模型大小缩小70%,推理速度提升30%,性能损失小于1%
50
+
51
+ <table style="border-collapse: collapse; width: 100%;">
52
+ <tr>
53
+ <td style="border: none; padding: 10px; box-sizing: border-box;">
54
+ <img src="./assets/imgs/opencompass_zh.png" alt="opencompass" style="width: 100%; height: auto;">
55
+ </td>
56
+ <td style="border: none; padding: 10px; box-sizing: border-box;">
57
+ <img src="./assets/imgs/model_cap_zh.png" alt="modelcap" style="width: 100%; height: auto;">
58
+ </td>
59
+ </tr>
60
+ </table>
61
+
62
+ - 具体而言,Orion-14B系列大语言模型包含:
63
+ - **Orion-14B-Base:** 基于2.5万亿tokens多样化数据集训练出的140亿参数量级的多语言基座模型。
64
+ - **Orion-14B-Chat:** 基于高质量语料库微调的对话类模型,旨在为大模型社区提供更好的用户交互体验。
65
+ - **Orion-14B-LongChat:** 在200k token长度上效果优异,最长可支持达320k,在长文本评估集上性能比肩专有模型。
66
+ - **Orion-14B-Chat-RAG:** 在一个定制的检索增强生成数据集上进行微调的聊天模型,在检索增强生成任务中取得了卓越的性能。
67
+ - **Orion-14B-Chat-Plugin:** 专门针对插件和函数调用任务定制的聊天模型,非常适用于使用代理的相关场景,其中大语言模型充当插件和函数调用系统。
68
+ - **Orion-14B-Base-Int4:** 一个使用int4进行量化的基座模型。它将模型大小显著减小了70%,同时提高了推理速度30%,仅引入了1%的最小性能损失。
69
+ - **Orion-14B-Chat-Int4:** 一个使用int4进行量化的对话模型。
70
+
71
+
72
+ <a name="zh_model-download"></a><br>
73
+ # 2. 下载路径
74
+
75
+ 发布模型和下载链接见下表:
76
+
77
+ | 模型名称 | HuggingFace下载链接 | ModelScope下载链接 |
78
+ |---------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
79
+ | ⚾ 基座模型 | [Orion-14B-Base](https://huggingface.co/OrionStarAI/Orion-14B-Base) | [Orion-14B-Base](https://modelscope.cn/models/OrionStarAI/Orion-14B-Base/summary) |
80
+ | 😛 对话模型 | [Orion-14B-Chat](https://huggingface.co/OrionStarAI/Orion-14B-Chat) | [Orion-14B-Chat](https://modelscope.cn/models/OrionStarAI/Orion-14B-Chat/summary) |
81
+ | 📃 长上下文模型 | [Orion-14B-LongChat](https://huggingface.co/OrionStarAI/Orion-14B-LongChat) | [Orion-14B-LongChat](https://modelscope.cn/models/OrionStarAI/Orion-14B-LongChat/summary) |
82
+ | 🔎 检索增强模型 | [Orion-14B-Chat-RAG](https://huggingface.co/OrionStarAI/Orion-14B-Chat-RAG) | [Orion-14B-Chat-RAG](https://modelscope.cn/models/OrionStarAI/Orion-14B-Chat-RAG/summary) |
83
+ | 🔌 插件模型 | [Orion-14B-Chat-Plugin](https://huggingface.co/OrionStarAI/Orion-14B-Chat-Plugin) | [Orion-14B-Chat-Plugin](https://modelscope.cn/models/OrionStarAI/Orion-14B-Chat-Plugin/summary)|
84
+ | 💼 基座Int4量化模型 | [Orion-14B-Base-Int4](https://huggingface.co/OrionStarAI/Orion-14B-Base-Int4) | [Orion-14B-Base-Int4](https://modelscope.cn/models/OrionStarAI/Orion-14B-Base-Int4/summary) |
85
+ | 📦 对话Int4量化模型 | [Orion-14B-Chat-Int4](https://huggingface.co/OrionStarAI/Orion-14B-Chat-Int4) | [Orion-14B-Chat-Int4](https://modelscope.cn/models/OrionStarAI/Orion-14B-Chat-Int4/summary) |
86
+
87
+
88
+ <a name="zh_model-benchmark"></a><br>
89
+ # 3. 评估结果
90
+
91
+ ## 3.1. 基座模型Orion-14B-Base评估
92
+
93
+ ### 3.1.1. 专业知识与试题评估结果
94
+ | 模型名称 | C-Eval | CMMLU | MMLU | AGIEval | Gaokao | BBH |
95
+ |--------------------|----------|----------|----------|----------|----------|----------|
96
+ | LLaMA2-13B | 41.4 | 38.4 | 55.0 | 30.9 | 18.2 | 45.6 |
97
+ | Skywork-13B | 59.1 | 61.4 | 62.7 | 43.6 | 56.1 | 48.3 |
98
+ | Baichuan2-13B | 59.0 | 61.3 | 59.5 | 37.4 | 45.6 | 49.0 |
99
+ | QWEN-14B | 71.7 | 70.2 | 67.9 | 51.9 | **62.5** | 53.7 |
100
+ | InternLM-20B | 58.8 | 59.0 | 62.1 | 44.6 | 45.5 | 52.5 |
101
+ | **Orion-14B-Base** | **72.9** | **70.6** | **69.9** | **54.7** | 62.1 | **56.5** |
102
+
103
+ ### 3.1.2. 理解与通识评估结果
104
+ | 模型名称 |RACE-middle|RACE-high| HellaSwag| PIQA | Lambada | WSC |
105
+ |--------------------|----------|----------|----------|----------|----------|----------|
106
+ | LLaMA 2-13B | 63.0 | 58.9 | 77.5 | 79.8 | 76.5 | 66.3 |
107
+ | Skywork-13B | 87.6 | 84.1 | 73.7 | 78.3 | 71.8 | 66.3 |
108
+ | Baichuan 2-13B | 68.9 | 67.2 | 70.8 | 78.1 | 74.1 | 66.3 |
109
+ | QWEN-14B | 93.0 | 90.3 | **80.2** | 79.8 | 71.4 | 66.3 |
110
+ | InternLM-20B | 86.4 | 83.3 | 78.1 | **80.3** | 71.8 | 68.3 |
111
+ | **Orion-14B-Base** | **93.2** | **91.3** | 78.5 | 79.5 | **78.8** | **70.2** |
112
+
113
+ ### 3.1.3. OpenCompass评测集评估结果
114
+ | 模型名称 | Average | Examination | Language | Knowledge | Understanding | Reasoning |
115
+ |------------------|----------|----------|----------|----------|----------|----------|
116
+ | LLaMA 2-13B | 47.3 | 45.2 | 47.0 | 58.3 | 50.9 | 43.6 |
117
+ | Skywork-13B | 53.6 | 61.1 | 51.3 | 52.7 | 64.5 | 45.2 |
118
+ | Baichuan 2-13B | 49.4 | 51.8 | 47.5 | 48.9 | 58.1 | 44.2 |
119
+ | QWEN-14B | 62.4 | 71.3 | 52.67 | 56.1 | 68.8 | 60.1 |
120
+ | InternLM-20B | 59.4 | 62.5 | 55.0 | **60.1** | 67.3 | 54.9 |
121
+ |**Orion-14B-Base**| **64.3** | **71.4** | **55.0** | 60.0 | **71.9** | **61.6** |
122
+
123
+ ### 3.1.4. 日语测试集评估结果
124
+ | 模型名称 |**Average**| JCQA | JNLI | MARC | JSQD | JQK | XLS | XWN | MGSM |
125
+ |--------------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
126
+ | PLaMo-13B | 52.3 | 56.7 | 42.8 | 95.8 | 70.6 | 71.0 | 8.70 | 70.5 | 2.40 |
127
+ | WebLab-10B | 50.7 | 66.6 | 53.7 | 82.1 | 62.9 | 56.2 | 10.0 | 72.0 | 2.40 |
128
+ | ELYZA-jp-7B | 48.8 | 71.7 | 25.3 | 86.6 | 70.8 | 64.1 | 2.50 | 62.1 | 7.20 |
129
+ | StableLM-jp-7B | 51.1 | 33.4 | 43.3 | **96.7** | 70.6 | 78.1 | 10.7 | 72.8 | 2.80 |
130
+ | LLaMA 2-13B | 46.3 | 75.0 | 47.6 | 38.8 | 76.1 | 67.7 | 18.1 | 63.2 | 10.4 |
131
+ | Baichuan 2-13B | 57.1 | 73.7 | 31.3 | 91.6 | 80.5 | 63.3 | 18.6 | 72.2 | 25.2 |
132
+ | QWEN-14B | 65.8 | 85.9 | 60.7 | 97.0 | 83.3 | 71.8 | 18.8 | 70.6 | 38.0 |
133
+ | Yi-34B | 67.1 | 83.8 | 61.2 | 95.2 | **86.1** | 78.5 | **27.2** | 69.2 | 35.2 |
134
+ | **Orion-14B-Base** | **69.1** | **88.2** | **75.8** | 94.1 | 75.7 | **85.1** | 17.3 | **78.8** | **38.0** |
135
+
136
+ ### 3.1.5. 韩语测试集n-shot评估结果
137
+ | 模型名称 | **Average**<br>n=0&nbsp;&nbsp;n=5 | HellaSwag<br>n=0&nbsp;&nbsp;n=5 | COPA<br> n=0&nbsp;&nbsp;n=5 | BooIQ<br>n=0&nbsp;&nbsp;n=5 | SentiNeg<br>n=0&nbsp;&nbsp;n=5|
138
+ |------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
139
+ | KoGPT | 53.0 &nbsp;&nbsp; 70.1 | 55.9 &nbsp;&nbsp; 58.3 | 73.5 &nbsp;&nbsp; 72.9 | 45.1 &nbsp;&nbsp; 59.8 | 37.5 &nbsp;&nbsp; 89.4 |
140
+ | Polyglot-ko-13B | 69.6 &nbsp;&nbsp; 73.7 |**59.5** &nbsp;&nbsp; **63.1**|**79.4** &nbsp;&nbsp; **81.1**| 48.2 &nbsp;&nbsp; 60.4 | 91.2 &nbsp;&nbsp; 90.2 |
141
+ | LLaMA 2-13B | 46.7 &nbsp;&nbsp; 63.7 | 41.3 &nbsp;&nbsp; 44.0 | 59.3 &nbsp;&nbsp; 63.8 | 34.9 &nbsp;&nbsp; 73.8 | 51.5 &nbsp;&nbsp; 73.4 |
142
+ | Baichuan 2-13B | 52.1 &nbsp;&nbsp; 58.7 | 39.2 &nbsp;&nbsp; 39.6 | 60.6 &nbsp;&nbsp; 60.6 | 58.4 &nbsp;&nbsp; 61.5 | 50.3 &nbsp;&nbsp; 72.9 |
143
+ | QWEN-14B | 53.8 &nbsp;&nbsp; 73.7 | 45.3 &nbsp;&nbsp; 46.8 | 64.9 &nbsp;&nbsp; 68.9 | 33.4 &nbsp;&nbsp; 83.5 | 71.5 &nbsp;&nbsp; 95.7 |
144
+ | Yi-34B | 54.2 &nbsp;&nbsp; 72.1 | 44.6 &nbsp;&nbsp; 44.7 | 58.0 &nbsp;&nbsp; 60.6 | 65.9 &nbsp;&nbsp; 90.2 | 48.3 &nbsp;&nbsp; 92.9 |
145
+ |**Orion-14B-Base**|**74.5** &nbsp;&nbsp; **79.6**| 47.0 &nbsp;&nbsp; 49.6 | 77.7 &nbsp;&nbsp; 79.4 |**81.6** &nbsp;&nbsp; **90.7**|**92.4** &nbsp;&nbsp; **98.7**|
146
+
147
+ ### 3.1.6. 多语言评估结果
148
+ | 模型名称 | Train Lang | Japanese | Korean | Chinese | English |
149
+ |--------------------|------------|----------|----------|----------|----------|
150
+ | PLaMo-13B | En,Jp | 52.3 | * | * | * |
151
+ | Weblab-10B | En,Jp | 50.7 | * | * | * |
152
+ | ELYZA-jp-7B | En,Jp | 48.8 | * | * | * |
153
+ | StableLM-jp-7B | En,Jp | 51.1 | * | * | * |
154
+ | KoGPT-6B | En,Ko | * | 70.1 | * | * |
155
+ | Polyglot-ko-13B | En,Ko | * | 70.7 | * | * |
156
+ | Baichuan2-13B | Multi | 57.1 | 58.7 | 50.8 | 57.1 |
157
+ | Qwen-14B | Multi | 65.8 | 73.7 | 64.5 | 65.4 |
158
+ | Llama2-13B | Multi | 46.3 | 63.7 | 41.4 | 55.3 |
159
+ | Yi-34B | Multi | 67.1 | 72.2 | 58.7 | **68.8** |
160
+ | **Orion-14B-Base** | Multi | **69.1** | **79.5** | **67.9** | 67.3 |
161
+
162
+ ## 3.2. 对话模型Orion-14B-Chat评估
163
+ ### 3.2.1. 对话模型MTBench主观评估
164
+ | 模型名称 | 第一轮 | 第二轮 | **平均** |
165
+ |----------------------|----------|----------|----------|
166
+ | Baichuan2-13B-Chat | 7.05 | 6.47 | 6.76 |
167
+ | Qwen-14B-Chat | 7.30 | 6.62 | 6.96 |
168
+ | Llama2-13B-Chat | 7.10 | 6.20 | 6.65 |
169
+ | InternLM-20B-Chat | 7.03 | 5.93 | 6.48 |
170
+ | **Orion-14B-Chat** | **7.68** | **7.07** | **7.37** |
171
+
172
+ \*这里评测使用vllm进行推理
173
+
174
+ ### 3.2.2. 对话模型AlignBench主观评估
175
+ | 模型名称 | 数学能力 | 逻辑推理 | 基本能力 | 中文理解 | 综合问答 | 写作能力 | 角色扮演 | 专业知识 | **平均** |
176
+ |--------------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
177
+ | Baichuan2-13B-Chat | 3.76 | 4.07 | 6.22 | 6.05 | 7.11 | 6.97 | 6.75 | 6.43 | 5.25 |
178
+ | Qwen-14B-Chat | **4.91** | **4.71** | **6.90** | 6.36 | 6.74 | 6.64 | 6.59 | 6.56 | **5.72** |
179
+ | Llama2-13B-Chat | 3.05 | 3.79 | 5.43 | 4.40 | 6.76 | 6.63 | 6.99 | 5.65 | 4.70 |
180
+ | InternLM-20B-Chat | 3.39 | 3.92 | 5.96 | 5.50 | **7.18** | 6.19 | 6.49 | 6.22 | 4.96 |
181
+ | **Orion-14B-Chat** | 4.00 | 4.24 | 6.18 | **6.57** | 7.16 | **7.36** | **7.16** | **6.99** | 5.51 |
182
+
183
+ \*这里评测使用vllm进行推理
184
+
185
+ ## 3.3. 长上下文模型Orion-14B-LongChat评估
186
+ ### 3.3.1. 长上下文模型LongBench评估
187
+ | 模型名称 | NarrativeQA| MultiFieldQA-en| MultiFieldQA-zh | DuReader | QMSum | VCSUM | TREC | TriviaQA | LSHT | RepoBench-P |
188
+ |--------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
189
+ | GPT-3.5-Turbo-16k | **23.60** | **52.30** | **61.20** | 28.70 | 23.40 | **16.00** | 68.00 | **91.40** | 29.20 | 53.60 |
190
+ | LongChat-v1.5-7B-32k | 16.90 | 41.40 | 29.10 | 19.50 | 22.70 | 9.90 | 63.50 | 82.30 | 23.20 | 55.30 |
191
+ | Vicuna-v1.5-7B-16k | 19.40 | 38.50 | 43.00 | 19.30 | 22.80 | 15.10 | 71.50 | 86.20 | 28.80 | 43.50 |
192
+ | Yi-6B-200K | 14.11 | 36.74 | 22.68 | 14.01 | 20.44 | 8.08 | 72.00 | 86.61 | 38.00 | **63.29** |
193
+ | Orion-14B-LongChat | 19.47 | 48.11 | 55.84 | **37.02** | **24.87** | 15.44 | **77.00** | 89.12 | **45.50** | 54.31 |
194
+
195
+ ## 3.4. 检索增强模型Orion-14B-Chat-RAG评估
196
+ ### 3.4.1. 自建检索增强测试集评估结果
197
+ |模型名称|回复效果(关键字)|*回复效果(主观打分)|引用能力|兜底能力|*AutoQA|*抽取数据|
198
+ |---------------------|------|------|------|------|------|------|
199
+ | Baichuan2-13B-Chat | 85 | 76 | 1 | 0 | 69 | 51 |
200
+ | Qwen-14B-Chat | 79 | 77 | 75 | 47 | 68 | 72 |
201
+ | Qwen-72B-Chat(Int4) | 87 | 89 | 90 | 32 | 67 | 76 |
202
+ | GPT-4 | 91 | 94 | 96 | 95 | 75 | 86 |
203
+ | Orion-14B-Chat-RAG | 86 | 87 | 91 | 97 | 73 | 71 |
204
+ \* 表示人工评判结果
205
+
206
+ ## 3.5. 插件模型Orion-14B-Chat-Plugin评估
207
+ ### 3.5.1. 自建插件测试集评估结果
208
+ | 模型名称 | 全参数意图识别 | 缺参数意图识别 | 非插件调用识别 |
209
+ |-----------------------|--------|-----------|--------|
210
+ | Baichuan2-13B-Chat | 25 | 0 | 0 |
211
+ | Qwen-14B-Chat | 55 | 0 | 50 |
212
+ | GPT-4 | **95** | 52.38 | 70 |
213
+ | Orion-14B-Chat-Plugin | 92.5 | **60.32** | **90** |
214
+
215
+ ## 3.6. 量化模型Orion-14B-Base-Int4评估
216
+ ### 3.6.1. 量化前后整体对比
217
+ |模型名称|模型大小(GB)|推理速度(令牌数/秒)|C-Eval |CMMLU |MMLU |RACE | HellaSwag|
218
+ |-------------------------|------|-----|------|------|------|------|------|
219
+ | OrionStar-14B-Base | 28.0 | 135 | 72.8 | 70.6 | 70.0 | 93.3 | 78.5 |
220
+ | OrionStar-14B-Base-Int4 | 8.3 | 178 | 71.8 | 69.8 | 69.2 | 93.1 | 78.0 |
221
+
222
+
223
+ <a name="zh_model-inference"></a><br>
224
+ # 4. 模型推理
225
+
226
+ 推理所需的模型权重、源码、配置已发布在 Hugging Face,下载链接见本文档最开始的表格。我们在此示范多种推理方式。程序会自动从
227
+ Hugging Face 下载所需资源。
228
+
229
+ ## 4.1. Python 代码方式
230
+
231
+ ```python
232
+ import torch
233
+ from transformers import AutoModelForCausalLM, AutoTokenizer
234
+ from transformers.generation.utils import GenerationConfig
235
+
236
+ tokenizer = AutoTokenizer.from_pretrained("OrionStarAI/Orion-14B", use_fast=False, trust_remote_code=True)
237
+ model = AutoModelForCausalLM.from_pretrained("OrionStarAI/Orion-14B", device_map="auto",
238
+ torch_dtype=torch.bfloat16, trust_remote_code=True)
239
+
240
+ model.generation_config = GenerationConfig.from_pretrained("OrionStarAI/Orion-14B")
241
+ messages = [{"role": "user", "content": "你好! 你叫什么名字!"}]
242
+ response = model.chat(tokenizer, messages, streaming=False)
243
+ print(response)
244
+
245
+ ```
246
+
247
+ 在上述代码中,模型加载时指定 `device_map='auto'`
248
+ ,会使用所有可用显卡。如需指定使用的设备,可以使用类似 `export CUDA_VISIBLE_DEVICES=0,1`(使用了0、1号显卡)的方式控制。
249
+
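+ 除了在 shell 中导出环境变量,也可以在 Python 进程内设置。以下为示意写法(需在任何 CUDA 初始化之前执行):
+
+ ```python
+ import os
+
+ # 需在导入 torch、初始化 CUDA 之前设置,否则不会生效
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # 仅使用 0、1 号显卡
+
+ import torch
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "OrionStarAI/Orion-14B", device_map="auto",
+     torch_dtype=torch.bfloat16, trust_remote_code=True)
+ ```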
250
+ ## 4.2. 命令行工具方式
251
+
252
+ ```shell
253
+ CUDA_VISIBLE_DEVICES=0 python cli_demo.py
254
+ ```
255
+
256
+ 本命令行工具是为 Chat 场景设计,因此我们不支持使用该工具调用 Base 模型。
257
+
258
+ ## 4.3. 脚本直接推理
259
+
260
+ ```shell
261
+ # base model
262
+ CUDA_VISIBLE_DEVICES=0 python demo/text_generation_base.py --model OrionStarAI/Orion-14B --tokenizer OrionStarAI/Orion-14B --prompt 你好,你叫什么名字
263
+
264
+ # chat model
265
+ CUDA_VISIBLE_DEVICES=0 python demo/text_generation.py --model OrionStarAI/Orion-14B-Chat --tokenizer OrionStarAI/Orion-14B-Chat --prompt 你好,你叫什么名字
266
+
267
+ ```
268
+
269
+ <a name="vllm"></a><br>
270
+ ## 4.4. 使用vllm推理
271
+ - 工程地址<br>
272
+ https://github.com/vllm-project/vllm
273
+
274
+ - 拉取请求<br>
275
+ https://github.com/vllm-project/vllm/pull/2539
276
+
277
+ <a name="llama-cpp"></a><br>
278
+ ## 4.5. 使用llama.cpp推理
279
+
280
+ - 工程地址<br>
281
+ https://github.com/ggerganov/llama.cpp
282
+
283
+ - 拉取请求<br>
284
+ https://github.com/ggerganov/llama.cpp/pull/5118
285
+
286
+ - 如何转换GGUF格式模型
287
+
288
+ ```shell
289
+ python convert-hf-to-gguf.py path/to/Orion-14B-Chat --outfile chat.gguf
290
+ ```
291
+
292
+ - 如何运行GGUF格式模型推理
293
+
294
+ ```shell
295
+ ./main --frequency-penalty 0.5 --top-k 5 --top-p 0.9 -m chat.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
296
+ ```
297
+
298
+
299
+
300
+
301
+ ## 4.6. 示例输出
302
+
303
+ ### 4.6.1. 闲聊
304
+
305
+ `````
306
+ 用户:你好,你叫什么名字
307
+ Orion-14B:你好!我是一个人工智能助手,没有固定的名字。你可以随意称呼我。有什么我可以帮助你的吗?
308
+ 用户:讲个故事
309
+ Orion-14B:从前有一个小男孩叫杰克,他住在一个小村庄里。杰克非常喜欢冒险和探索,总是梦想着能够发现一些神奇的事情。
310
+
311
+ 有一天,杰克听说在附近的森林深处有一座传说中的魔法城堡,里面住着一个邪恶的巨人。这个巨人的名字叫格里姆,他被困在城堡中,无法离开。
312
+
313
+ 杰克决定要去解救这个巨人,因为他相信格里姆一定有自己的原因被困在那里。于是,他带着勇气和决心,踏上了前往魔法城堡的旅程。
314
+
315
+ 经过长时间的跋涉,杰克终于来到了城堡的大门前。他小心翼翼地推开大门,进入了一个充满神秘气息的房间。
316
+
317
+ 房间里弥漫着黑暗的气息,但杰克并没有被吓倒。他勇敢地向城堡深处走去,最终找到了格里姆。
318
+
319
+ 格里姆告诉杰克,他是被一个邪恶的女巫诅咒的,只有通过解开咒语才能获得自由。杰克决定要帮助格里姆,他开始寻找线索和解开咒语的方法。
320
+
321
+ 经过一番努力,杰克终于找到了一条通往女巫的秘密实验室的道路。他在那里发现了女巫留下的魔法书,书中记载了解开咒语的方法。
322
+
323
+ 杰克按照书上所写的步骤进行操作,成功地解开了格里姆身上的诅咒。巨人感激不已,他答应帮助杰克实现他的愿望。
324
+
325
+ 从此以后,杰克和格里姆成为了好朋友,他们一起经历了许多惊险刺激的冒险,并用他们的勇气和智慧战胜了各种困难。
326
+
327
+ 这个故事告诉我们,只要有勇气和决心,我们就能克服一切困难,实现自己的梦想。
328
+ `````
329
+
330
+ ### 4.6.2. 日韩
331
+
332
+ `````
333
+ 用户:自己を紹介してください
334
+ Orion-14B:こんにちは、私の名前はChatMaxで、OrionStarによって開発されたAIアシスタントです。私はユーザーに正確で、客観的で、有用な回答を提供し、彼らが問題を解決し、情報を取得するのを助けることができます。
335
+ 用户:자기소개를 해주세요.
336
+ Orion-14B:안녕하세요, 제 이름은 ChatMax입니다. OrionStar에서 개발한 AI 어시스턴트입니다. 저는 사용자들에게 정확하고 객관적이며 유용한 답변을 제공하여 문제를 해결하고 정보를 얻는 데 도움을 줄 수 있습니다.
337
+ `````
338
+
339
+
340
+ <a name="zh_declarations-license"></a><br>
341
+ # 5. 声明、协议
342
+
343
+ ## 5.1. 声明
344
+
345
+ 我们强烈呼吁所有使用者,不要利用 Orion-14B 模型进行任何危害国家社会安全或违法的活动。另外,我们也要求使用者不要将
346
+ Orion-14B 模型用于未经适当安全审查和备案的互联网服务。
347
+
348
+ 我们希望所有的使用者都能遵守这个原则,确保科技的发展能在规范和合法的环境下进行。
349
+ 我们已经尽我们所能,来确保模型训练过程中使用的数据的合规性。然而,尽管我们已经做出了巨大的努力,但由于模型和数据的复杂性,仍有可能存在一些无法预见的问题。因此,如果由于使用
350
+ Orion-14B 开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。
351
+
352
+ ## 5.2. 协议
353
+
354
+ 社区使用Orion-14B系列模型
355
+ - 代码请遵循 [Apache License Version 2.0](./LICENSE)<br>
356
+ - 模型请遵循 [Orion-14B系列模型社区许可协议](./ModelsCommunityLicenseAgreement)
357
+
358
+
359
+ <a name="zh_company-introduction"></a><br>
360
+ # 6. 企业介绍
361
+
362
+ 猎户星空(OrionStar)是一家全球领先的服务机器人解决方案公司,成立于2016年9月。猎户星空致力于基于人工智能技术打造下一代革命性机器人,使人们能够摆脱重复的体力劳动,使人类的工作和生活更加智能和有趣,通过技术使社会和世界变得更加美好。
363
+
364
+ 猎户星空拥有完全自主开发的全链条人工智能技术,如语音交互和视觉导航。它整合了产品开发能力和技术应用能力。基于Orion机械臂平台,它推出了ORION
365
+ STAR AI Robot Greeting、AI Robot Greeting Mini、Lucki、Coffee
366
+ Master等产品,并建立了Orion机器人的开放平台OrionOS。通过践行 **真正有用的机器人而生** 的理念,猎户星空借助AI技术为更多人赋能。
367
+
368
+ 凭借7年AI经验积累,猎户星空已推出大模型深度应用“聚言”,并陆续面向行业客户提供定制化AI大模型咨询与服务解决方案,真正帮助客户实现企业经营效率领先同行的目标。
369
+
370
+ **猎户星空具备全链条大模型应用能力的核心优势**,包括拥有从海量数据处理、大模型预训练、二次预训练、微调(Fine-tune)、Prompt
371
+ Engineering 、Agent开发的全链条能力和经验积累;拥有完整的端到端模型训练能力,包括系统化的数据处理流程和数百张GPU的并行模型训练能力,现已在大政务、云服务、出海电商、快消等多个行业场景落地。
372
+
373
+ ***欢迎有大模型应用落地需求的企业联系我们进行商务合作***<br>
374
+ **咨询电话:** 400-898-7779<br>
375
+ **电子邮箱:** ai@orionstar.com<br>
376
+ **Discord社区链接:** https://discord.gg/zumjDWgdAs
377
+
378
+ <div align="center">
379
+ <img src="./assets/imgs/wechat_group.jpg" alt="wechat" width="40%" />
380
+ </div>
381
+
382
+
383
+
384
+
385
+ # Table of Contents
386
+
387
+ - [📖 Model Introduction](#model-introduction)
388
+ - [🔗 Model Download](#model-download)
389
+ - [🔖 Model Benchmark](#model-benchmark)
390
+ - [📊 Model Inference](#model-inference)
391
+ - [📜 Declarations & License](#declarations-license)
392
+ - [🥇 Company Introduction](#company-introduction)
393
+
394
+ <a name="model-introduction"></a><br>
395
+ # 1. Model Introduction
396
+
397
+ - Orion-14B series models are open-source multilingual large language models trained from scratch by OrionStarAI. The base model is trained on a 2.5T-token multilingual corpus, including Chinese, English, Japanese, Korean, etc., and exhibits superior performance in these languages. For details, please refer to the [tech report](https://github.com/OrionStarAI/Orion/blob/master/doc/Orion14B_v3.pdf).
398
+
399
+ - The Orion-14B series models exhibit the following features:
400
+ - Among models at the 20B-parameter scale, the Orion-14B-Base model shows outstanding performance in comprehensive evaluations.
401
+ - Strong multilingual capabilities, performing significantly better than comparable models on Japanese and Korean test sets.
402
+ - The fine-tuned models demonstrate strong adaptability, excelling in human-annotated blind tests.
403
+ - The long-chat version supports extremely long texts, performing exceptionally well at a token length of 200k and supporting up to 320k tokens.
404
+ - The quantized versions reduce model size by 70% and improve inference speed by 30%, with a performance loss of less than 1%.
405
+ <table style="border-collapse: collapse; width: 100%;">
406
+ <tr>
407
+ <td style="border: none; padding: 10px; box-sizing: border-box;">
408
+ <img src="./assets/imgs/opencompass_en.png" alt="opencompass" style="width: 100%; height: auto;">
409
+ </td>
410
+ <td style="border: none; padding: 10px; box-sizing: border-box;">
411
+ <img src="./assets/imgs/model_cap_en.png" alt="modelcap" style="width: 100%; height: auto;">
412
+ </td>
413
+ </tr>
414
+ </table>
415
+
416
+ - The Orion-14B series includes the following models:
417
+ - **Orion-14B-Base:** A multilingual large language foundational model with 14 billion parameters, pretrained on a diverse dataset of 2.5 trillion tokens.
418
+ - **Orion-14B-Chat:** A chat model fine-tuned on a high-quality corpus, aiming to provide an excellent interactive experience for users in the large model community.
419
+ - **Orion-14B-LongChat:** The long-context version excels at handling extremely lengthy texts, performing exceptionally well at a token length of 200k and supporting up to 320k tokens.
420
+ - **Orion-14B-Chat-RAG:** A chat-model fine-tuned on a custom retrieval augmented generation dataset, achieving superior performance in retrieval augmented generation tasks.
421
+ - **Orion-14B-Chat-Plugin:** A chat-model specifically tailored for plugin and function calling tasks, ideal for agent-related scenarios where the LLM acts as a plugin and function call system.
422
+ - **Orion-14B-Base-Int4:** A quantized base model utilizing 4-bit integer weights. It significantly reduces the model size by 70% and increases the inference speed by 30% while incurring a minimal performance loss of only 1%.
423
+ - **Orion-14B-Chat-Int4:** A quantized chat model utilizing 4-bit integer weights.
424
+
425
+
426
+ <a name="model-download"></a><br>
427
+ # 2. Model Download
428
+
429
+ Model release and download links are provided in the table below:
430
+
431
+ | Model Name | HuggingFace Download Links | ModelScope Download Links |
432
+ |-------------------------|-----------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
433
+ | ⚾Orion-14B-Base | [Orion-14B-Base](https://huggingface.co/OrionStarAI/Orion-14B-Base) | [Orion-14B-Base](https://modelscope.cn/models/OrionStarAI/Orion-14B-Base/summary) |
434
+ | 😛Orion-14B-Chat | [Orion-14B-Chat](https://huggingface.co/OrionStarAI/Orion-14B-Chat) | [Orion-14B-Chat](https://modelscope.cn/models/OrionStarAI/Orion-14B-Chat/summary) |
435
+ | 📃Orion-14B-LongChat | [Orion-14B-LongChat](https://huggingface.co/OrionStarAI/Orion-14B-LongChat) | [Orion-14B-LongChat](https://modelscope.cn/models/OrionStarAI/Orion-14B-LongChat/summary) |
436
+ | 🔎Orion-14B-Chat-RAG | [Orion-14B-Chat-RAG](https://huggingface.co/OrionStarAI/Orion-14B-Chat-RAG) | [Orion-14B-Chat-RAG](https://modelscope.cn/models/OrionStarAI/Orion-14B-Chat-RAG/summary) |
437
+ | 🔌Orion-14B-Chat-Plugin | [Orion-14B-Chat-Plugin](https://huggingface.co/OrionStarAI/Orion-14B-Chat-Plugin) | [Orion-14B-Chat-Plugin](https://modelscope.cn/models/OrionStarAI/Orion-14B-Chat-Plugin/summary) |
438
+ | 💼Orion-14B-Base-Int4 | [Orion-14B-Base-Int4](https://huggingface.co/OrionStarAI/Orion-14B-Base-Int4) | [Orion-14B-Base-Int4](https://modelscope.cn/models/OrionStarAI/Orion-14B-Base-Int4/summary) |
439
+ | 📦Orion-14B-Chat-Int4 | [Orion-14B-Chat-Int4](https://huggingface.co/OrionStarAI/Orion-14B-Chat-Int4) | [Orion-14B-Chat-Int4](https://modelscope.cn/models/OrionStarAI/Orion-14B-Chat-Int4/summary) |
440
+
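+ As an illustrative sketch, the weights can also be fetched ahead of time with `huggingface_hub` (the repo id below is the base model from the table above; the inference snippets in Section 4 download automatically if this step is skipped):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Downloads the full model repository into the local Hugging Face cache
+ # and returns the local directory path.
+ local_dir = snapshot_download(repo_id="OrionStarAI/Orion-14B-Base")
+ print(local_dir)
+ ```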
441
+ <a name="model-benchmark"></a><br>
442
+ # 3. Model Benchmarks
443
+
444
+ ## 3.1. Base Model Orion-14B-Base Benchmarks
445
+ ### 3.1.1. LLM evaluation results on examination and professional knowledge
446
+ | Model | C-Eval | CMMLU | MMLU | AGIEval | Gaokao | BBH |
447
+ |--------------------|----------|----------|----------|----------|----------|----------|
448
+ | LLaMA2-13B | 41.4 | 38.4 | 55.0 | 30.9 | 18.2 | 45.6 |
449
+ | Skywork-13B | 59.1 | 61.4 | 62.7 | 43.6 | 56.1 | 48.3 |
450
+ | Baichuan2-13B | 59.0 | 61.3 | 59.5 | 37.4 | 45.6 | 49.0 |
451
+ | QWEN-14B | 71.7 | 70.2 | 67.9 | 51.9 | **62.5** | 53.7 |
452
+ | InternLM-20B | 58.8 | 59.0 | 62.1 | 44.6 | 45.5 | 52.5 |
453
+ | **Orion-14B-Base** | **72.9** | **70.6** | **69.9** | **54.7** | 62.1 | **56.5** |
454
+
455
+ ### 3.1.2. LLM evaluation results on language understanding and common knowledge
456
+ | Model |RACE-middle|RACE-high |HellaSwag | PIQA | Lambada | WSC |
457
+ |--------------------|----------|----------|----------|----------|----------|----------|
458
+ | LLaMA 2-13B | 63.0 | 58.9 | 77.5 | 79.8 | 76.5 | 66.3 |
459
+ | Skywork-13B | 87.6 | 84.1 | 73.7 | 78.3 | 71.8 | 66.3 |
460
+ | Baichuan 2-13B | 68.9 | 67.2 | 70.8 | 78.1 | 74.1 | 66.3 |
461
+ | QWEN-14B | 93.0 | 90.3 | **80.2** | 79.8 | 71.4 | 66.3 |
462
+ | InternLM-20B | 86.4 | 83.3 | 78.1 | **80.3** | 71.8 | 68.3 |
463
+ | **Orion-14B-Base** | **93.2** | **91.3** | 78.5 | 79.5 | **78.8** | **70.2** |
464
+
465
+ ### 3.1.3. LLM evaluation results of OpenCompass testsets
466
+ | Model | Average | Examination | Language | Knowledge | Understanding | Reasoning |
467
+ |------------------|----------|----------|----------|----------|----------|----------|
468
+ | LLaMA 2-13B | 47.3 | 45.2 | 47.0 | 58.3 | 50.9 | 43.6 |
469
+ | Skywork-13B | 53.6 | 61.1 | 51.3 | 52.7 | 64.5 | 45.2 |
470
+ | Baichuan 2-13B | 49.4 | 51.8 | 47.5 | 48.9 | 58.1 | 44.2 |
471
+ | QWEN-14B | 62.4 | 71.3 | 52.67 | 56.1 | 68.8 | 60.1 |
472
+ | InternLM-20B | 59.4 | 62.5 | 55.0 | **60.1** | 67.3 | 54.9 |
473
+ |**Orion-14B-Base**| **64.3** | **71.4** | **55.0** | 60.0 | **71.9** | **61.6** |
474
+
475
+ ### 3.1.4. Comparison of LLM performances on Japanese testsets
476
+ | Model |**Average**| JCQA | JNLI | MARC | JSQD | JQK | XLS | XWN | MGSM |
477
+ |--------------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
478
+ | PLaMo-13B | 52.3 | 56.7 | 42.8 | 95.8 | 70.6 | 71.0 | 8.70 | 70.5 | 2.40 |
479
+ | WebLab-10B | 50.7 | 66.6 | 53.7 | 82.1 | 62.9 | 56.2 | 10.0 | 72.0 | 2.40 |
480
+ | ELYZA-jp-7B | 48.8 | 71.7 | 25.3 | 86.6 | 70.8 | 64.1 | 2.50 | 62.1 | 7.20 |
481
+ | StableLM-jp-7B | 51.1 | 33.4 | 43.3 | **96.7** | 70.6 | 78.1 | 10.7 | 72.8 | 2.80 |
482
+ | LLaMA 2-13B | 46.3 | 75.0 | 47.6 | 38.8 | 76.1 | 67.7 | 18.1 | 63.2 | 10.4 |
483
+ | Baichuan 2-13B | 57.1 | 73.7 | 31.3 | 91.6 | 80.5 | 63.3 | 18.6 | 72.2 | 25.2 |
484
+ | QWEN-14B | 65.8 | 85.9 | 60.7 | 97.0 | 83.3 | 71.8 | 18.8 | 70.6 | 38.0 |
485
+ | Yi-34B | 67.1 | 83.8 | 61.2 | 95.2 | **86.1** | 78.5 | **27.2** | 69.2 | 35.2 |
486
+ | **Orion-14B-Base** | **69.1** | **88.2** | **75.8** | 94.1 | 75.7 | **85.1** | 17.3 | **78.8** | **38.0** |
487
+
488
+ ### 3.1.5. Comparison of LLM performances on Korean testsets. n = 0 and n = 5 stand for n-shot prompts used in the evaluation
489
+ |Model | **Average**<br>n=0&nbsp;&nbsp;n=5 | HellaSwag<br>n=0&nbsp;&nbsp;n=5 | COPA<br> n=0&nbsp;&nbsp;n=5 | BooIQ<br>n=0&nbsp;&nbsp;n=5 | SentiNeg<br>n=0&nbsp;&nbsp;n=5|
490
+ |------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
491
+ | KoGPT | 53.0 &nbsp;&nbsp; 70.1 | 55.9 &nbsp;&nbsp; 58.3 | 73.5 &nbsp;&nbsp; 72.9 | 45.1 &nbsp;&nbsp; 59.8 | 37.5 &nbsp;&nbsp; 89.4 |
492
+ | Polyglot-ko-13B | 69.6 &nbsp;&nbsp; 73.7 |**59.5** &nbsp;&nbsp; **63.1**|**79.4** &nbsp;&nbsp; **81.1**| 48.2 &nbsp;&nbsp; 60.4 | 91.2 &nbsp;&nbsp; 90.2 |
493
+ | LLaMA 2-13B | 46.7 &nbsp;&nbsp; 63.7 | 41.3 &nbsp;&nbsp; 44.0 | 59.3 &nbsp;&nbsp; 63.8 | 34.9 &nbsp;&nbsp; 73.8 | 51.5 &nbsp;&nbsp; 73.4 |
494
+ | Baichuan 2-13B | 52.1 &nbsp;&nbsp; 58.7 | 39.2 &nbsp;&nbsp; 39.6 | 60.6 &nbsp;&nbsp; 60.6 | 58.4 &nbsp;&nbsp; 61.5 | 50.3 &nbsp;&nbsp; 72.9 |
495
+ | QWEN-14B | 53.8 &nbsp;&nbsp; 73.7 | 45.3 &nbsp;&nbsp; 46.8 | 64.9 &nbsp;&nbsp; 68.9 | 33.4 &nbsp;&nbsp; 83.5 | 71.5 &nbsp;&nbsp; 95.7 |
496
+ | Yi-34B | 54.2 &nbsp;&nbsp; 72.1 | 44.6 &nbsp;&nbsp; 44.7 | 58.0 &nbsp;&nbsp; 60.6 | 65.9 &nbsp;&nbsp; 90.2 | 48.3 &nbsp;&nbsp; 92.9 |
497
+ |**Orion-14B-Base**|**74.5** &nbsp;&nbsp; **79.6**| 47.0 &nbsp;&nbsp; 49.6 | 77.7 &nbsp;&nbsp; 79.4 |**81.6** &nbsp;&nbsp; **90.7**|**92.4** &nbsp;&nbsp; **98.7**|
498
+
499
+ ### 3.1.6. Multilingual evaluation
500
+ | Model | Train Lang | Japanese | Korean | Chinese | English |
501
+ |--------------------|------------|----------|----------|----------|----------|
502
+ | PLaMo-13B | En,Jp | 52.3 | * | * | * |
503
+ | Weblab-10B | En,Jp | 50.7 | * | * | * |
504
+ | ELYZA-jp-7B | En,Jp | 48.8 | * | * | * |
505
+ | StableLM-jp-7B | En,Jp | 51.1 | * | * | * |
506
+ | KoGPT-6B | En,Ko | * | 70.1 | * | * |
507
+ | Polyglot-ko-13B | En,Ko | * | 70.7 | * | * |
508
+ | Baichuan2-13B | Multi | 57.1 | 58.7 | 50.8 | 57.1 |
509
+ | Qwen-14B | Multi | 65.8 | 73.7 | 64.5 | 65.4 |
510
+ | Llama2-13B | Multi | 46.3 | 63.7 | 41.4 | 55.3 |
511
+ | Yi-34B | Multi | 67.1 | 72.2 | 58.7 | **68.8** |
512
+ | **Orion-14B-Base** | Multi | **69.1** | **79.5** | **67.9** | 67.3 |
513
+
514
+
515
+ ## 3.2. Chat Model Orion-14B-Chat Benchmarks
516
+ ### 3.2.1. Chat model subjective evaluation of MTBench
517
+ | Model | First-Turn | Second-Turn | **Average** |
518
+ |----------------------|----------|----------|----------|
519
+ | Baichuan2-13B-Chat | 7.05 | 6.47 | 6.76 |
520
+ | Qwen-14B-Chat | 7.30 | 6.62 | 6.96 |
521
+ | Llama2-13B-Chat | 7.10 | 6.20 | 6.65 |
522
+ | InternLM-20B-Chat | 7.03 | 5.93 | 6.48 |
523
+ | **Orion-14B-Chat** | **7.68** | **7.07** | **7.37** |
524
+ \* use vllm for inference
525
+
526
+ ### 3.2.2. Chat model subjective evaluation of AlignBench
527
+ | Model | Math. | Logi. | Basic. | Chi. | Comp. | Writ. | Role. | Prof. |**Avg.**|
528
+ |--------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
529
+ | Baichuan2-13B-Chat | 3.76 | 4.07 | 6.22 | 6.05 | 7.11 | 6.97 | 6.75 | 6.43 | 5.25 |
530
+ | Qwen-14B-Chat |**4.91**|**4.71**|**6.90**| 6.36 | 6.74 | 6.64 | 6.59 | 6.56 |**5.72**|
531
+ | Llama2-13B-Chat | 3.05 | 3.79 | 5.43 | 4.40 | 6.76 | 6.63 | 6.99 | 5.65 | 4.70 |
532
+ | InternLM-20B-Chat | 3.39 | 3.92 | 5.96 | 5.50 |**7.18**| 6.19 | 6.49 | 6.22 | 4.96 |
533
+ | **Orion-14B-Chat** | 4.00 | 4.24 | 6.18 |**6.57**| 7.16 |**7.36**|**7.16**|**6.99**| 5.51 |
534
+ \* use vllm for inference
535
+
536
+ ## 3.3. LongChat Model Orion-14B-LongChat Benchmarks
537
+ ### 3.3.1. LongChat evaluation of LongBench
538
+ | Model | NarrativeQA|MultiFieldQA-en|MultiFieldQA-zh| DuReader | QMSum | VCSUM | TREC | TriviaQA | LSHT |RepoBench-P|
539
+ |--------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
540
+ | GPT-3.5-Turbo-16k | **23.60** | **52.30** | **61.20** | 28.70 | 23.40 | **16.00** | 68.00 | **91.40** | 29.20 | 53.60 |
541
+ | LongChat-v1.5-7B-32k | 16.90 | 41.40 | 29.10 | 19.50 | 22.70 | 9.90 | 63.50 | 82.30 | 23.20 | 55.30 |
542
+ | Vicuna-v1.5-7B-16k | 19.40 | 38.50 | 43.00 | 19.30 | 22.80 | 15.10 | 71.50 | 86.20 | 28.80 | 43.50 |
543
+ | Yi-6B-200K | 14.11 | 36.74 | 22.68 | 14.01 | 20.44 | 8.08 | 72.00 | 86.61 | 38.00 | **63.29** |
544
+ | Orion-14B-LongChat | 19.47 | 48.11 | 55.84 | **37.02** | **24.87** | 15.44 | **77.00** | 89.12 | **45.50** | 54.31 |
545
+
546
+
547
+ ## 3.4. Chat RAG Model Benchmarks
548
+ ### 3.4.1. LLM evaluation results of self-built RAG testsets
549
+ |Model|Effectiveness of Response(Keyword)|*Effectiveness of Response(subjective evaluation)|Quoting Ability|Fallback Ability|*AutoQA|*Data Extraction|
550
+ |---------------------|------|------|------|------|------|------|
551
+ | Baichuan2-13B-Chat | 85 | 76 | 1 | 0 | 69 | 51 |
552
+ | Qwen-14B-Chat | 79 | 77 | 75 | 47 | 68 | 72 |
553
+ | Qwen-72B-Chat(Int4) | 87 | 89 | 90 | 32 | 67 | 76 |
554
+ | GPT-4 | 91 | 94 | 96 | 95 | 75 | 86 |
555
+ | Orion-14B-Chat-RAG | 86 | 87 | 91 | 97 | 73 | 71 |
556
+ \* means manual assessment
557
+
558
+ ## 3.5. Chat Plugin Model Orion-14B-Chat-Plugin Benchmarks
559
+ ### 3.5.1. LLM evaluation results of self-built plugin testsets
560
+ |Model |Intent Recognition with Full Params |Intent Recognition with Missing Params |Non-Plugin Invocation Recognition |
561
+ |-----------------------|--------|-----------|--------|
562
+ | Baichuan2-13B-Chat | 25 | 0 | 0 |
563
+ | Qwen-14B-Chat | 55 | 0 | 50 |
564
+ | GPT-4 | **95** | 52.38 | 70 |
565
+ | Orion-14B-Chat-Plugin | 92.5 | **60.32** | **90** |
566
+
567
+ ## 3.6. Quantized Model Orion-14B-Base-Int4 Benchmarks
568
+ ### 3.6.1. Comparison of before and after quantization
569
+ |Model |Size(GB)|Inference Speed(tokens/s)|C-Eval|CMMLU|MMLU|RACE|HellaSwag|
570
+ |-------------------------|-------|-----|------|------|------|------|------|
571
+ | OrionStar-14B-Base | 28.0 | 135 | 72.8 | 70.6 | 70.0 | 93.3 | 78.5 |
572
+ | OrionStar-14B-Base-Int4 | 8.3 | 178 | 71.8 | 69.8 | 69.2 | 93.1 | 78.0 |
573
+
574
+
575
+ <a name="model-inference"></a><br>
576
+ # 4. Model Inference
577
+
578
+ Model weights, source code, and configuration needed for inference are published on Hugging Face, and the download link
579
+ is available in the table at the beginning of this document. We demonstrate various inference methods here, and the
580
+ program will automatically download the necessary resources from Hugging Face.
581
+
582
+ ## 4.1. Python Code
583
+
584
+ ```python
585
+ import torch
586
+ from transformers import AutoModelForCausalLM, AutoTokenizer
587
+ from transformers.generation.utils import GenerationConfig
588
+
589
+ tokenizer = AutoTokenizer.from_pretrained("OrionStarAI/Orion-14B", use_fast=False, trust_remote_code=True)
590
+ model = AutoModelForCausalLM.from_pretrained("OrionStarAI/Orion-14B", device_map="auto",
591
+ torch_dtype=torch.bfloat16, trust_remote_code=True)
592
+
593
+ model.generation_config = GenerationConfig.from_pretrained("OrionStarAI/Orion-14B")
594
+ messages = [{"role": "user", "content": "Hello, what is your name? "}]
595
+ response = model.chat(tokenizer, messages, streaming=False)
596
+ print(response)
597
+
598
+ ```
599
+
600
+ In the above Python code, the model is loaded with `device_map='auto'` to utilize all available GPUs. To specify the
601
+ device, you can use something like `export CUDA_VISIBLE_DEVICES=0,1` (using GPUs 0 and 1).
602
+
603
+ ## 4.2. Command Line Tool
604
+
605
+ ```shell
606
+ CUDA_VISIBLE_DEVICES=0 python cli_demo.py
607
+ ```
608
+
609
+ This command-line tool is designed for chat scenarios, and thus, it does not support calling the base model.
610
+
611
+ ## 4.3. Direct Script Inference
612
+
613
+ ```shell
614
+
615
+ # base model
616
+ CUDA_VISIBLE_DEVICES=0 python demo/text_generation_base.py --model OrionStarAI/Orion-14B --tokenizer OrionStarAI/Orion-14B --prompt hello
617
+
618
+ # chat model
619
+ CUDA_VISIBLE_DEVICES=0 python demo/text_generation.py --model OrionStarAI/Orion-14B-Chat --tokenizer OrionStarAI/Orion-14B-Chat --prompt hi
620
+
621
+ ```
622
+ <a name="vllm"></a><br>
623
+ ## 4.4. Inference with vLLM
624
+ - Project URL<br>
625
+ https://github.com/vllm-project/vllm
626
+
627
+ - Pull Request<br>
628
+ https://github.com/vllm-project/vllm/pull/2539
629
+
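+ - A minimal offline-inference sketch, assuming a vLLM build that already includes the Orion support added in the pull request above (the prompt is passed as plain text; no chat template is applied):
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # trust_remote_code is needed because the model ships custom code on the Hub.
+ llm = LLM(model="OrionStarAI/Orion-14B-Chat", trust_remote_code=True)
+ params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
+
+ outputs = llm.generate(["Hello, what is your name?"], params)
+ print(outputs[0].outputs[0].text)
+ ```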
630
+ <a name="llama-cpp"></a><br>
631
+ ## 4.5. Inference with llama.cpp
632
+
633
+ - Project URL<br>
634
+ https://github.com/ggerganov/llama.cpp
635
+
636
+ - Pull Request<br>
637
+ https://github.com/ggerganov/llama.cpp/pull/5118
638
+
639
+ - How to convert the model to GGUF format
640
+
641
+ ```shell
642
+ python convert-hf-to-gguf.py path/to/Orion-14B-Chat --outfile chat.gguf
643
+ ```
644
+
645
+ - How to run inference with the GGUF model
646
+
647
+ ```shell
648
+ ./main --frequency-penalty 0.5 --top-k 5 --top-p 0.9 -m chat.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
649
+ ```
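+
+ - Alternatively, the converted `chat.gguf` can be loaded from Python via the `llama-cpp-python` bindings. A sketch mirroring the CLI flags above (assumes `pip install llama-cpp-python`; the context size is an illustrative choice):
+
+ ```python
+ from llama_cpp import Llama
+
+ llm = Llama(model_path="chat.gguf", n_ctx=4096)
+ out = llm(
+     "Building a website can be done in 10 simple steps:\nStep 1:",
+     max_tokens=400, top_k=5, top_p=0.9, frequency_penalty=0.5)
+ print(out["choices"][0]["text"])
+ ```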
650
+
651
+ ## 4.6. Example Output
652
+
653
+ ### 4.6.1. Casual Chat
654
+
655
+ `````
656
+ User: Hello
657
+ Orion-14B: Hello! How can I help you today? If you have any questions or need assistance, feel free to ask.
658
+ User: What functions do you have?
659
+ Orion-14B: As an AI language model, I have a wide range of functions and capabilities, including but not limited to:
660
+ Answering general knowledge questions: I can provide information on various topics such as history, science, geography, and more.
661
+ Providing definitions and explanations: I can explain complex concepts and provide definitions for words and phrases.
662
+ Assisting with tasks: I can help with simple tasks like setting reminders, converting units of measurement, and even ordering food from restaurants.
663
+ Generating text: I can write short stories, essays, articles, and other types of content based on prompts given to me.
664
+ Translating languages: I can translate text from one language to another.
665
+ Chatting: I can engage in conversation with users and respond to their queries in a conversational manner.
666
+ Summarizing long texts: I can summarize longer texts like articles, books, and research papers.
667
+ Generating random numbers and names: I can generate random numbers within a specified range and also suggest names based on certain criteria.
668
+ Playing games: I can play simple games like Hangman, Trivia, and more.
669
+ Providing recommendations: I can recommend movies, music, books, and other forms of entertainment based on user preferences.
670
+ These are just some of the many functions that I am capable of performing. If you have a specific task in mind, feel free to ask me about it.
671
+ User: Tell me a joke.
672
+ Orion-14B: Sure, here's a classic one-liner: Why don't scientists trust atoms? Because they make up everything.
673
+ `````
674
+
675
+ ### 4.6.2. Japanese & Korean Chat
676
+
677
+ `````
678
+ User:自己を紹介してください
679
+ Orion-14B:こんにちは、私の名前はChatMaxで、OrionStarによって開発されたAIアシスタントです。私はユーザーに正確で、客観的で、有用な回答を提供し、彼らが問題を解決し、情報を取得するのを助けることができます。
680
+ User:자기소개를 해주세요.
681
+ Orion-14B:안녕하세요, 제 이름은 ChatMax입니다. OrionStar에서 개발한 AI 어시스턴트입니다. 저는 사용자들에게 정확하고 객관적이며 유용한 답변을 제공하여 문제를 해결하고 정보를 얻는 데 도움을 줄 수 있습니다.
682
+ `````
683
+
684
+ <a name="declarations-license"></a><br>
685
+ # 5. Declarations, License
686
+
687
+ ## 5.1. Declarations
688
+
689
+ We strongly urge all users not to use the Orion-14B model for any activities that may harm national or social security or violate the law.
690
+ Additionally, we request users not to use the Orion-14B model for internet services without proper security review and filing.
691
+ We hope all users abide by this principle to ensure that technological development takes place in a regulated and legal environment.
692
+ We have done our best to ensure the compliance of the data used in the model training process. However, despite our
693
+ significant efforts, unforeseen issues may still arise due to the complexity of the model and data. Therefore, if any
694
+ problems arise due to the use of the Orion-14B open-source model, including but not limited to data security
695
+ issues, public opinion risks, or any risks and issues arising from the model being misled, abused, disseminated, or
696
+ improperly utilized, we will not assume any responsibility.
697
+
698
+ ## 5.2. License
699
+
700
+ Community use of the Orion-14B series models
701
+ - For code, please comply with [Apache License Version 2.0](./LICENSE)<br>
702
+ - For model, please comply with [【Orion-14B Series】 Models Community License Agreement](./ModelsCommunityLicenseAgreement)
703
+
704
+
705
+ <a name="company-introduction"></a><br>
706
+ # 6. Company Introduction
707
+
708
+ OrionStar is a leading global service robot solutions company, founded in September 2016. OrionStar is dedicated to
709
+ using artificial intelligence technology to create the next generation of revolutionary robots, allowing people to break
710
+ free from repetitive physical labor and making human work and life more intelligent and enjoyable. Through technology,
711
+ OrionStar aims to make society and the world a better place.
712
+
713
+ OrionStar possesses fully self-developed end-to-end artificial intelligence technologies, such as voice interaction and
714
+ visual navigation. It integrates product development capabilities and technological application capabilities. Based on
715
+ the Orion robotic arm platform, it has launched products such as OrionStar AI Robot Greeting, AI Robot Greeting Mini,
716
+ Lucki, Coffee Master, and established the open platform OrionOS for Orion robots. Following the philosophy of "Born for
717
+ Truly Useful Robots", OrionStar empowers more people through AI technology.
718
+
719
+ **The core strengths of OrionStar lie in its end-to-end AI application capabilities,** including big data preprocessing, large model pretraining, fine-tuning, prompt engineering, agent development, etc. With comprehensive end-to-end model training capabilities, including systematic data processing workflows and parallel model training on hundreds of GPUs, these capabilities have been successfully applied in industry scenarios such as government affairs, cloud services, international e-commerce, and fast-moving consumer goods.
720
+
721
+ Companies looking to deploy large-model applications are welcome to contact us.<br>
722
+ **Enquiry Hotline: 400-898-7779**<br>
723
+ **E-mail: ai@orionstar.com**<br>
724
+ **Discord Link: https://discord.gg/zumjDWgdAs**
725
+
726
+ <div align="center">
727
+ <img src="./assets/imgs/wechat_group.jpg" alt="wechat" width="40%" />
728
+ </div>
README_en.md ADDED
@@ -0,0 +1 @@
1
+ obsolete
README_zh.MD ADDED
@@ -0,0 +1 @@
1
+ obsolete
assets/imgs/llama_cpp_1.png ADDED
assets/imgs/model_cap_en.png ADDED
assets/imgs/model_cap_zh.png ADDED
assets/imgs/opencompass_en.png ADDED
assets/imgs/opencompass_zh.png ADDED
assets/imgs/orion_start.PNG ADDED
assets/imgs/vllm_1.png ADDED
assets/imgs/wechat_group.jpg ADDED
config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "architectures": [
3
+ "OrionForCausalLM"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_orion.OrionConfig",
7
+ "AutoModelForCausalLM": "modeling_orion.OrionForCausalLM"
8
+ },
9
+ "bos_token_id": 1,
10
+ "eos_token_id": 2,
11
+ "hidden_act": "silu",
12
+ "hidden_size": 5120,
13
+ "model_type": "orion",
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 15360,
16
+ "max_position_embeddings": 4096,
17
+ "max_sequence_length": 4096,
18
+ "num_attention_heads": 40,
19
+ "num_hidden_layers": 40,
20
+ "num_key_value_heads": 40,
21
+ "pad_token_id": 0,
22
+ "pretraining_tp": 1,
23
+ "rms_norm_eps": 1e-05,
24
+ "rope_scaling": null,
25
+ "rope_theta": 10000.0,
26
+ "tie_word_embeddings": false,
27
+ "torch_dtype": "bfloat16",
28
+ "transformers_version": "4.34.0",
29
+ "use_cache": true,
30
+ "vocab_size": 84608
31
+ }
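+ A small sketch of how this configuration is typically consumed: with `trust_remote_code=True`, `AutoConfig` resolves the `auto_map` entries above to the custom `OrionConfig` class shipped in this repository (the repo id is the one used in the README):
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("OrionStarAI/Orion-14B", trust_remote_code=True)
+ print(config.model_type, config.hidden_size, config.num_hidden_layers)  # orion 5120 40
+ ```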
configuration.json ADDED
@@ -0,0 +1 @@
1
+ {"framework":"Pytorch","task":"text-generation"}
configuration_orion.py ADDED
@@ -0,0 +1,82 @@
1
+ # Copyright (c) 2024, OrionStar Inc. All rights reserved.
2
+
3
+ from transformers import PretrainedConfig
4
+
5
+ class OrionConfig(PretrainedConfig):
6
+ model_type = "orion"
7
+ keys_to_ignore_at_inference = ["past_key_values"]
8
+
9
+ def __init__(
10
+ self,
11
+ vocab_size=84608,
12
+ hidden_size=4096,
13
+ intermediate_size=15360,
14
+ num_hidden_layers=40,
15
+ num_attention_heads=40,
16
+ num_key_value_heads=40,
17
+ hidden_act="silu",
18
+ max_position_embeddings=4096,
19
+ initializer_range=0.02,
20
+ rms_norm_eps=1e-5,
21
+ use_cache=True,
22
+ pad_token_id=None,
23
+ bos_token_id=1,
24
+ eos_token_id=2,
25
+ pretraining_tp=1,
26
+ tie_word_embeddings=False,
27
+ rope_theta=10000.0,
28
+ rope_scaling=None,
29
+ attention_bias=False,
30
+ **kwargs,
31
+ ):
32
+ self.vocab_size = vocab_size
33
+ self.max_position_embeddings = max_position_embeddings
34
+ self.hidden_size = hidden_size
35
+ self.intermediate_size = intermediate_size
36
+ self.num_hidden_layers = num_hidden_layers
37
+ self.num_attention_heads = num_attention_heads
38
+
39
+ # for backward compatibility
40
+ if num_key_value_heads is None:
41
+ num_key_value_heads = num_attention_heads
42
+
43
+ self.num_key_value_heads = num_key_value_heads
44
+ self.hidden_act = hidden_act
45
+ self.initializer_range = initializer_range
46
+ self.rms_norm_eps = rms_norm_eps
47
+ self.pretraining_tp = pretraining_tp
48
+ self.use_cache = use_cache
49
+ self.rope_theta = rope_theta
50
+ self.rope_scaling = rope_scaling
51
+ self._rope_scaling_validation()
52
+ self.attention_bias = attention_bias
53
+
54
+ super().__init__(
55
+ pad_token_id=pad_token_id,
56
+ bos_token_id=bos_token_id,
57
+ eos_token_id=eos_token_id,
58
+ tie_word_embeddings=tie_word_embeddings,
59
+ **kwargs,
60
+ )
61
+
62
+ def _rope_scaling_validation(self):
63
+ """
64
+ Validate the `rope_scaling` configuration.
65
+ """
66
+ if self.rope_scaling is None:
67
+ return
68
+
69
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
70
+ raise ValueError(
71
+ "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
72
+ f"got {self.rope_scaling}"
73
+ )
74
+ rope_scaling_type = self.rope_scaling.get("type", None)
75
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
76
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
77
+ raise ValueError(
78
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
79
+ )
80
+ if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
81
+ raise ValueError(f"`rope_scaling`'s factor field must be an float > 1, got {rope_scaling_factor}")
82
+
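
`_rope_scaling_validation` above only accepts a two-field dictionary of the form `{"type": "linear" | "dynamic", "factor": <float greater than 1>}`. A standalone sketch of how the config behaves (it assumes `configuration_orion.py` is importable from the working directory and `transformers` is installed):

```python
# Sketch of OrionConfig defaults and rope_scaling validation (import path is an assumption).
from configuration_orion import OrionConfig

cfg = OrionConfig()
print(cfg.rope_scaling, cfg.max_position_embeddings)  # None 4096

# Valid: linear RoPE scaling by 2x; "factor" must be a float strictly greater than 1.
cfg = OrionConfig(rope_scaling={"type": "linear", "factor": 2.0})

# Invalid type -> ValueError raised by _rope_scaling_validation().
try:
    OrionConfig(rope_scaling={"type": "ntk", "factor": 2.0})
except ValueError as err:
    print(err)
```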
modeling_orion.py ADDED
@@ -0,0 +1,1097 @@
1
+ # Copyright 2024 OrionStar Inc. team. All rights reserved.
2
+ # Copied and adapted from https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
3
+
4
+ from transformers import AutoConfig, AutoModel
5
+
6
+ from .configuration_orion import OrionConfig
7
+
8
+ import numbers
9
+ import importlib
10
+ import math
11
+ from typing import List, Optional, Tuple, Union
12
+
13
+ import torch
14
+ import torch.nn.functional as F
15
+ from torch.nn.parameter import Parameter
16
+ import torch.utils.checkpoint
17
+ from torch import nn
18
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
19
+ from torch.nn import init
20
+
21
+ from transformers.activations import ACT2FN
22
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
23
+ from transformers.modeling_utils import PreTrainedModel
24
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
25
+ from transformers.utils import (
26
+ add_start_docstrings,
27
+ add_start_docstrings_to_model_forward,
28
+ is_flash_attn_2_available,
29
+ logging,
30
+ replace_return_docstrings,
31
+ )
32
+
33
+ if is_flash_attn_2_available():
34
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
35
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
36
+
37
+ logger = logging.get_logger(__name__)
38
+
39
+ _CONFIG_FOR_DOC = "OrionConfig"
40
+
41
+ def _get_unpad_data(padding_mask):
42
+ seqlens_in_batch = padding_mask.sum(dim=-1, dtype=torch.int32)
43
+ indices = torch.nonzero(padding_mask.flatten(), as_tuple=False).flatten()
44
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
45
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
46
+ return (
47
+ indices,
48
+ cu_seqlens,
49
+ max_seqlen_in_batch,
50
+ )
51
+
52
+
53
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
54
+ def _make_causal_mask(
55
+ input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
56
+ ):
57
+ """
58
+ Make causal mask used for bi-directional self-attention.
59
+ """
60
+ bsz, tgt_len = input_ids_shape
61
+ mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
62
+ mask_cond = torch.arange(mask.size(-1), device=device)
63
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
64
+ mask = mask.to(dtype)
65
+
66
+ if past_key_values_length > 0:
67
+ mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
68
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
69
+
70
+
71
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
72
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
73
+ """
74
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
75
+ """
76
+ bsz, src_len = mask.size()
77
+ tgt_len = tgt_len if tgt_len is not None else src_len
78
+
79
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
80
+
81
+ inverted_mask = 1.0 - expanded_mask
82
+
83
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
84
+
85
+ class OrionRotaryEmbedding(nn.Module):
86
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
87
+ super().__init__()
88
+
89
+ self.dim = dim
90
+ self.max_position_embeddings = max_position_embeddings
91
+ self.base = base
92
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
93
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
94
+
95
+ # Build here to make `torch.jit.trace` work.
96
+ self._set_cos_sin_cache(
97
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
98
+ )
99
+
100
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
101
+ self.max_seq_len_cached = seq_len
102
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
103
+
104
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
105
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
106
+ emb = torch.cat((freqs, freqs), dim=-1)
107
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
108
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
109
+
110
+ def forward(self, x, seq_len=None):
111
+ # x: [bs, num_attention_heads, seq_len, head_size]
112
+ if seq_len > self.max_seq_len_cached:
113
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
114
+
115
+ return (
116
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
117
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
118
+ )
119
+
120
+
121
+ class OrionLinearScalingRotaryEmbedding(OrionRotaryEmbedding):
122
+ """OrionRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
123
+
124
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
125
+ self.scaling_factor = scaling_factor
126
+ super().__init__(dim, max_position_embeddings, base, device)
127
+
128
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
129
+ self.max_seq_len_cached = seq_len
130
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
131
+ t = t / self.scaling_factor
132
+
133
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
134
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
135
+ emb = torch.cat((freqs, freqs), dim=-1)
136
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
137
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
138
+
139
+
140
+ class OrionDynamicNTKScalingRotaryEmbedding(OrionRotaryEmbedding):
141
+ """OrionRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
142
+
143
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
144
+ self.scaling_factor = scaling_factor
145
+ super().__init__(dim, max_position_embeddings, base, device)
146
+
147
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
148
+ self.max_seq_len_cached = seq_len
149
+
150
+ if seq_len > self.max_position_embeddings:
151
+ base = self.base * (
152
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
153
+ ) ** (self.dim / (self.dim - 2))
154
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
155
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
156
+
157
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
158
+
159
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
160
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
161
+ emb = torch.cat((freqs, freqs), dim=-1)
162
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
163
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
164
+
165
+
166
+ def rotate_half(x):
167
+ """Rotates half the hidden dims of the input."""
168
+ x1 = x[..., : x.shape[-1] // 2]
169
+ x2 = x[..., x.shape[-1] // 2 :]
170
+ return torch.cat((-x2, x1), dim=-1)
171
+
172
+
173
+ # Copied from transformers.models.gpt_neox.modeling_gpt_neox.apply_rotary_pos_emb
174
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
175
+ cos = cos[position_ids].unsqueeze(1) # [seq_len, dim] -> [batch_size, 1, seq_len, head_dim]
176
+ sin = sin[position_ids].unsqueeze(1)
177
+ q_embed = (q * cos) + (rotate_half(q) * sin)
178
+ k_embed = (k * cos) + (rotate_half(k) * sin)
179
+ return q_embed, k_embed
180
+
181
+
182
+ class OrionMLP(nn.Module):
183
+ def __init__(self, config):
184
+ super().__init__()
185
+ self.config = config
186
+ self.hidden_size = config.hidden_size
187
+ self.intermediate_size = config.intermediate_size
188
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
189
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
190
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
191
+ self.act_fn = ACT2FN[config.hidden_act]
192
+
193
+ def forward(self, x):
194
+ if self.config.pretraining_tp > 1:
195
+ slice = self.intermediate_size // self.config.pretraining_tp
196
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
197
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
198
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
199
+
200
+ gate_proj = torch.cat(
201
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
202
+ )
203
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
204
+
205
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
206
+ down_proj = [
207
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
208
+ ]
209
+ down_proj = sum(down_proj)
210
+ else:
211
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
212
+
213
+ return down_proj
214
+
215
+
216
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
217
+ """
218
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
219
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
220
+ """
221
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
222
+ if n_rep == 1:
223
+ return hidden_states
224
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
225
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
226
+
227
+
228
+ class OrionAttention(nn.Module):
229
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
230
+
231
+ def __init__(self, config: OrionConfig):
232
+ super().__init__()
233
+ self.config = config
234
+ self.hidden_size = config.hidden_size
235
+ self.num_heads = config.num_attention_heads
236
+ self.head_dim = self.hidden_size // self.num_heads
237
+ self.num_key_value_heads = config.num_key_value_heads
238
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
239
+ self.max_position_embeddings = config.max_position_embeddings
240
+ self.rope_theta = config.rope_theta
241
+
242
+ if (self.head_dim * self.num_heads) != self.hidden_size:
243
+ raise ValueError(
244
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
245
+ f" and `num_heads`: {self.num_heads})."
246
+ )
247
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
248
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
249
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
250
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)
251
+ self._init_rope()
252
+
253
+ def _init_rope(self):
254
+ if self.config.rope_scaling is None:
255
+ self.rotary_emb = OrionRotaryEmbedding(
256
+ self.head_dim,
257
+ max_position_embeddings=self.max_position_embeddings,
258
+ base=self.rope_theta,
259
+ )
260
+ else:
261
+ scaling_type = self.config.rope_scaling["type"]
262
+ scaling_factor = self.config.rope_scaling["factor"]
263
+ if scaling_type == "linear":
264
+ self.rotary_emb = OrionLinearScalingRotaryEmbedding(
265
+ self.head_dim,
266
+ max_position_embeddings=self.max_position_embeddings,
267
+ scaling_factor=scaling_factor,
268
+ base=self.rope_theta,
269
+ )
270
+ elif scaling_type == "dynamic":
271
+ self.rotary_emb = OrionDynamicNTKScalingRotaryEmbedding(
272
+ self.head_dim,
273
+ max_position_embeddings=self.max_position_embeddings,
274
+ scaling_factor=scaling_factor,
275
+ base=self.rope_theta,
276
+ )
277
+ else:
278
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
279
+
280
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
281
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
282
+
283
+ def forward(
284
+ self,
285
+ hidden_states: torch.Tensor,
286
+ attention_mask: Optional[torch.Tensor] = None,
287
+ position_ids: Optional[torch.LongTensor] = None,
288
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
289
+ output_attentions: bool = False,
290
+ use_cache: bool = False,
291
+ padding_mask: Optional[torch.LongTensor] = None,
292
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
293
+ bsz, q_len, _ = hidden_states.size()
294
+
295
+ if self.config.pretraining_tp > 1:
296
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
297
+ query_slices = self.q_proj.weight.split(
298
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
299
+ )
300
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
301
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
302
+
303
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
304
+ query_states = torch.cat(query_states, dim=-1)
305
+
306
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
307
+ key_states = torch.cat(key_states, dim=-1)
308
+
309
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
310
+ value_states = torch.cat(value_states, dim=-1)
311
+
312
+ else:
313
+ query_states = self.q_proj(hidden_states)
314
+ key_states = self.k_proj(hidden_states)
315
+ value_states = self.v_proj(hidden_states)
316
+
317
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
318
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
319
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
320
+
321
+ kv_seq_len = key_states.shape[-2]
322
+ if past_key_value is not None:
323
+ kv_seq_len += past_key_value[0].shape[-2]
324
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
325
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
326
+
327
+ if past_key_value is not None:
328
+ # reuse k, v, self_attention
329
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
330
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
331
+
332
+ past_key_value = (key_states, value_states) if use_cache else None
333
+
334
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
335
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
336
+
337
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
338
+
339
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
340
+ raise ValueError(
341
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
342
+ f" {attn_weights.size()}"
343
+ )
344
+
345
+ if attention_mask is not None:
346
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
347
+ raise ValueError(
348
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
349
+ )
350
+ attn_weights = attn_weights + attention_mask
351
+
352
+ # upcast attention to fp32
353
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
354
+ attn_output = torch.matmul(attn_weights, value_states)
355
+
356
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
357
+ raise ValueError(
358
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
359
+ f" {attn_output.size()}"
360
+ )
361
+
362
+ attn_output = attn_output.transpose(1, 2).contiguous()
363
+
364
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
365
+
366
+ if self.config.pretraining_tp > 1:
367
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
368
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
369
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
370
+ else:
371
+ attn_output = self.o_proj(attn_output)
372
+
373
+ if not output_attentions:
374
+ attn_weights = None
375
+
376
+ return attn_output, attn_weights, past_key_value
377
+
378
+
379
+ class OrionFlashAttention2(OrionAttention):
380
+ """
381
+ Orion flash attention module. This module inherits from `OrionAttention`, as the weights of the module stay
382
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
383
+ flash attention and deal with padding tokens in case the input contains any of them.
384
+ """
385
+
386
+ def forward(
387
+ self,
388
+ hidden_states: torch.Tensor,
389
+ attention_mask: Optional[torch.Tensor] = None,
390
+ position_ids: Optional[torch.LongTensor] = None,
391
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
392
+ output_attentions: bool = False,
393
+ use_cache: bool = False,
394
+ padding_mask: Optional[torch.LongTensor] = None,
395
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
396
+ # OrionFlashAttention2 attention does not support output_attentions
397
+ output_attentions = False
398
+
399
+ bsz, q_len, _ = hidden_states.size()
400
+
401
+ query_states = self.q_proj(hidden_states)
402
+ key_states = self.k_proj(hidden_states)
403
+ value_states = self.v_proj(hidden_states)
404
+
405
+ # Flash attention requires the input to have the shape
406
+ # batch_size x seq_length x num_heads x head_dim
407
+ # therefore we just need to keep the original shape
408
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
409
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
410
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
411
+
412
+ kv_seq_len = key_states.shape[-2]
413
+ if past_key_value is not None:
414
+ kv_seq_len += past_key_value[0].shape[-2]
415
+
416
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
417
+
418
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
419
+
420
+ if past_key_value is not None:
421
+ # reuse k, v, self_attention
422
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
423
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
424
+
425
+ past_key_value = (key_states, value_states) if use_cache else None
426
+
427
+ query_states = query_states.transpose(1, 2)
428
+ key_states = key_states.transpose(1, 2)
429
+ value_states = value_states.transpose(1, 2)
430
+
431
+ # TODO: llama does not have dropout in the config??
432
+ # It is recommended to use dropout with FA according to the docs
433
+ # when training.
434
+ dropout_rate = 0.0 # if not self.training else self.attn_dropout
435
+
436
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
437
+ # therefore the input hidden states gets silently casted in float32. Hence, we need
438
+ # cast them back in float16 just to be sure everything works as expected.
439
+ # This might slowdown training & inference so it is recommended to not cast the LayerNorms
440
+ # in fp32. (LlamaRMSNorm handles it correctly)
441
+ input_dtype = query_states.dtype
442
+ if input_dtype == torch.float32:
443
+ logger.warning_once(
444
+ "The input hidden states seem to be silently cast to float32; this might be related to"
445
+ " the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
446
+ " float16."
447
+ )
448
+
449
+ query_states = query_states.to(torch.float16)
450
+ key_states = key_states.to(torch.float16)
451
+ value_states = value_states.to(torch.float16)
452
+
453
+ attn_output = self._flash_attention_forward(
454
+ query_states, key_states, value_states, padding_mask, q_len, dropout=dropout_rate
455
+ )
456
+
457
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
458
+ attn_output = self.o_proj(attn_output)
459
+
460
+ if not output_attentions:
461
+ attn_weights = None
462
+
463
+ return attn_output, attn_weights, past_key_value
464
+
465
+ def _flash_attention_forward(
466
+ self, query_states, key_states, value_states, padding_mask, query_length, dropout=0.0, softmax_scale=None
467
+ ):
468
+ """
469
+ Calls the forward method of Flash Attention; if the input hidden states contain at least one padding token, it
470
+ first unpads the input, then computes the attention scores and pads the final attention scores.
471
+
472
+ Args:
473
+ query_states (`torch.Tensor`):
474
+ Input query states to be passed to Flash Attention API
475
+ key_states (`torch.Tensor`):
476
+ Input key states to be passed to Flash Attention API
477
+ value_states (`torch.Tensor`):
478
+ Input value states to be passed to Flash Attention API
479
+ padding_mask (`torch.Tensor`):
480
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
481
+ position of padding tokens and 1 for the position of non-padding tokens.
482
+ dropout (`float`, *optional*):
483
+ Attention dropout
484
+ softmax_scale (`float`, *optional*):
485
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
486
+ """
487
+ # Contains at least one padding token in the sequence
488
+ if padding_mask is not None:
489
+ batch_size = query_states.shape[0]
490
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
491
+ query_states, key_states, value_states, padding_mask, query_length
492
+ )
493
+
494
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
495
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
496
+
497
+ attn_output_unpad = flash_attn_varlen_func(
498
+ query_states,
499
+ key_states,
500
+ value_states,
501
+ cu_seqlens_q=cu_seqlens_q,
502
+ cu_seqlens_k=cu_seqlens_k,
503
+ max_seqlen_q=max_seqlen_in_batch_q,
504
+ max_seqlen_k=max_seqlen_in_batch_k,
505
+ dropout_p=dropout,
506
+ softmax_scale=softmax_scale,
507
+ causal=True,
508
+ )
509
+
510
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
511
+ else:
512
+ attn_output = flash_attn_func(
513
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=True
514
+ )
515
+
516
+ return attn_output
517
+
518
+ def _upad_input(self, query_layer, key_layer, value_layer, padding_mask, query_length):
519
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(padding_mask)
520
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
521
+
522
+ key_layer = index_first_axis(
523
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
524
+ )
525
+ value_layer = index_first_axis(
526
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
527
+ )
528
+ if query_length == kv_seq_len:
529
+ query_layer = index_first_axis(
530
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
531
+ )
532
+ cu_seqlens_q = cu_seqlens_k
533
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
534
+ indices_q = indices_k
535
+ elif query_length == 1:
536
+ max_seqlen_in_batch_q = 1
537
+ cu_seqlens_q = torch.arange(
538
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
539
+ ) # There is a memcpy here, that is very bad.
540
+ indices_q = cu_seqlens_q[:-1]
541
+ query_layer = query_layer.squeeze(1)
542
+ else:
543
+ # The -q_len: slice assumes left padding.
544
+ padding_mask = padding_mask[:, -query_length:]
545
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, padding_mask)
546
+
547
+ return (
548
+ query_layer,
549
+ key_layer,
550
+ value_layer,
551
+ indices_q,
552
+ (cu_seqlens_q, cu_seqlens_k),
553
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
554
+ )
555
+
556
+
557
+ class OrionDecoderLayer(nn.Module):
558
+ def __init__(self, config: OrionConfig):
559
+ super().__init__()
560
+ self.hidden_size = config.hidden_size
561
+ self.self_attn = (
562
+ OrionAttention(config=config)
563
+ if not getattr(config, "_flash_attn_2_enabled", False)
564
+ else OrionFlashAttention2(config=config)
565
+ )
566
+ self.mlp = OrionMLP(config)
567
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
568
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
569
+
570
+ def forward(
571
+ self,
572
+ hidden_states: torch.Tensor,
573
+ attention_mask: Optional[torch.Tensor] = None,
574
+ position_ids: Optional[torch.LongTensor] = None,
575
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
576
+ output_attentions: Optional[bool] = False,
577
+ use_cache: Optional[bool] = False,
578
+ padding_mask: Optional[torch.LongTensor] = None,
579
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
580
+ """
581
+ Args:
582
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
583
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
584
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
585
+ output_attentions (`bool`, *optional*):
586
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
587
+ returned tensors for more detail.
588
+ use_cache (`bool`, *optional*):
589
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
590
+ (see `past_key_values`).
591
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
592
+ """
593
+
594
+ residual = hidden_states
595
+
596
+ hidden_states = self.input_layernorm(hidden_states)
597
+
598
+ # Self Attention
599
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
600
+ hidden_states=hidden_states,
601
+ attention_mask=attention_mask,
602
+ position_ids=position_ids,
603
+ past_key_value=past_key_value,
604
+ output_attentions=output_attentions,
605
+ use_cache=use_cache,
606
+ padding_mask=padding_mask,
607
+ )
608
+ hidden_states = residual + hidden_states
609
+
610
+ # Fully Connected
611
+ residual = hidden_states
612
+ hidden_states = self.post_attention_layernorm(hidden_states)
613
+ hidden_states = self.mlp(hidden_states)
614
+ hidden_states = residual + hidden_states
615
+
616
+ outputs = (hidden_states,)
617
+
618
+ if output_attentions:
619
+ outputs += (self_attn_weights,)
620
+
621
+ if use_cache:
622
+ outputs += (present_key_value,)
623
+
624
+ return outputs
625
+
626
+ class OrionPreTrainedModel(PreTrainedModel):
627
+ config_class = OrionConfig
628
+ base_model_prefix = "model"
629
+ supports_gradient_checkpointing = True
630
+ _no_split_modules = ["OrionDecoderLayer"]
631
+ _skip_keys_device_placement = "past_key_values"
632
+ _supports_flash_attn_2 = True
633
+
634
+ def _init_weights(self, module):
635
+ std = self.config.initializer_range
636
+ if isinstance(module, nn.Linear):
637
+ module.weight.data.normal_(mean=0.0, std=std)
638
+ if module.bias is not None:
639
+ module.bias.data.zero_()
640
+ elif isinstance(module, nn.Embedding):
641
+ module.weight.data.normal_(mean=0.0, std=std)
642
+ if module.padding_idx is not None:
643
+ module.weight.data[module.padding_idx].zero_()
644
+
645
+ def _set_gradient_checkpointing(self, module, value=False):
646
+ if isinstance(module, OrionModel):
647
+ module.gradient_checkpointing = value
648
+
649
+ class OrionModel(OrionPreTrainedModel):
650
+ """
651
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`OrionDecoderLayer`]
652
+
653
+ Args:
654
+ config: OrionConfig
655
+ """
656
+
657
+ def __init__(self, config: OrionConfig):
658
+ super().__init__(config)
659
+ self.padding_idx = config.pad_token_id
660
+ self.vocab_size = config.vocab_size
661
+
662
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
663
+ self.layers = nn.ModuleList([OrionDecoderLayer(config) for _ in range(config.num_hidden_layers)])
664
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
665
+
666
+ self.gradient_checkpointing = False
667
+ # Initialize weights and apply final processing
668
+ self.post_init()
669
+
670
+ def get_input_embeddings(self):
671
+ return self.embed_tokens
672
+
673
+ def set_input_embeddings(self, value):
674
+ self.embed_tokens = value
675
+
676
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
677
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
678
+ # create causal mask
679
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
680
+ combined_attention_mask = None
681
+ if input_shape[-1] > 1:
682
+ combined_attention_mask = _make_causal_mask(
683
+ input_shape,
684
+ inputs_embeds.dtype,
685
+ device=inputs_embeds.device,
686
+ past_key_values_length=past_key_values_length,
687
+ )
688
+
689
+ if attention_mask is not None:
690
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
691
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
692
+ inputs_embeds.device
693
+ )
694
+ combined_attention_mask = (
695
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
696
+ )
697
+
698
+ return combined_attention_mask
699
+
700
+ def forward(
701
+ self,
702
+ input_ids: torch.LongTensor = None,
703
+ attention_mask: Optional[torch.Tensor] = None,
704
+ position_ids: Optional[torch.LongTensor] = None,
705
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
706
+ inputs_embeds: Optional[torch.FloatTensor] = None,
707
+ use_cache: Optional[bool] = None,
708
+ output_attentions: Optional[bool] = None,
709
+ output_hidden_states: Optional[bool] = None,
710
+ return_dict: Optional[bool] = None,
711
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
712
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
713
+ output_hidden_states = (
714
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
715
+ )
716
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
717
+
718
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
719
+
720
+ # retrieve input_ids and inputs_embeds
721
+ if input_ids is not None and inputs_embeds is not None:
722
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
723
+ elif input_ids is not None:
724
+ batch_size, seq_length = input_ids.shape
725
+ elif inputs_embeds is not None:
726
+ batch_size, seq_length, _ = inputs_embeds.shape
727
+ else:
728
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
729
+
730
+ seq_length_with_past = seq_length
731
+ past_key_values_length = 0
732
+
733
+ if past_key_values is not None:
734
+ past_key_values_length = past_key_values[0][0].shape[2]
735
+ seq_length_with_past = seq_length_with_past + past_key_values_length
736
+
737
+ if position_ids is None:
738
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
739
+ position_ids = torch.arange(
740
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
741
+ )
742
+ position_ids = position_ids.unsqueeze(0)
743
+
744
+ if inputs_embeds is None:
745
+ inputs_embeds = self.embed_tokens(input_ids)
746
+ # embed positions
747
+ if attention_mask is None:
748
+ attention_mask = torch.ones(
749
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
750
+ )
751
+ padding_mask = None
752
+ else:
753
+ if 0 in attention_mask:
754
+ padding_mask = attention_mask
755
+ else:
756
+ padding_mask = None
757
+
758
+ attention_mask = self._prepare_decoder_attention_mask(
759
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
760
+ )
761
+
762
+ hidden_states = inputs_embeds
763
+
764
+ if self.gradient_checkpointing and self.training:
765
+ if use_cache:
766
+ logger.warning_once(
767
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
768
+ )
769
+ use_cache = False
770
+
771
+ # decoder layers
772
+ all_hidden_states = () if output_hidden_states else None
773
+ all_self_attns = () if output_attentions else None
774
+ next_decoder_cache = () if use_cache else None
775
+
776
+ for idx, decoder_layer in enumerate(self.layers):
777
+ if output_hidden_states:
778
+ all_hidden_states += (hidden_states,)
779
+
780
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
781
+
782
+ if self.gradient_checkpointing and self.training:
783
+
784
+ def create_custom_forward(module):
785
+ def custom_forward(*inputs):
786
+ # None for past_key_value
787
+ return module(*inputs, past_key_value, output_attentions, padding_mask=padding_mask)
788
+
789
+ return custom_forward
790
+
791
+ layer_outputs = torch.utils.checkpoint.checkpoint(
792
+ create_custom_forward(decoder_layer), hidden_states, attention_mask, position_ids
793
+ )
794
+ else:
795
+ layer_outputs = decoder_layer(
796
+ hidden_states,
797
+ attention_mask=attention_mask,
798
+ position_ids=position_ids,
799
+ past_key_value=past_key_value,
800
+ output_attentions=output_attentions,
801
+ use_cache=use_cache,
802
+ padding_mask=padding_mask,
803
+ )
804
+
805
+ hidden_states = layer_outputs[0]
806
+
807
+ if use_cache:
808
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
809
+
810
+ if output_attentions:
811
+ all_self_attns += (layer_outputs[1],)
812
+
813
+ hidden_states = self.norm(hidden_states)
814
+
815
+ # add hidden states from the last decoder layer
816
+ if output_hidden_states:
817
+ all_hidden_states += (hidden_states,)
818
+
819
+ next_cache = next_decoder_cache if use_cache else None
820
+ if not return_dict:
821
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
822
+ return BaseModelOutputWithPast(
823
+ last_hidden_state=hidden_states,
824
+ past_key_values=next_cache,
825
+ hidden_states=all_hidden_states,
826
+ attentions=all_self_attns,
827
+ )
828
+
829
+
830
+ class OrionForCausalLM(OrionPreTrainedModel):
831
+ model_type = "orion"
832
+ _tied_weights_keys = ["lm_head.weight"]
833
+
834
+ def __init__(self, config):
835
+ super().__init__(config)
836
+ self.model = OrionModel(config)
837
+ self.vocab_size = config.vocab_size
838
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
839
+
840
+ # Initialize weights and apply final processing
841
+ self.post_init()
842
+
843
+ def get_input_embeddings(self):
844
+ return self.model.embed_tokens
845
+
846
+ def set_input_embeddings(self, value):
847
+ self.model.embed_tokens = value
848
+
849
+ def get_output_embeddings(self):
850
+ return self.lm_head
851
+
852
+ def set_output_embeddings(self, new_embeddings):
853
+ self.lm_head = new_embeddings
854
+
855
+ def set_decoder(self, decoder):
856
+ self.model = decoder
857
+
858
+ def get_decoder(self):
859
+ return self.model
860
+
861
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
862
+ def forward(
863
+ self,
864
+ input_ids: torch.LongTensor = None,
865
+ attention_mask: Optional[torch.Tensor] = None,
866
+ position_ids: Optional[torch.LongTensor] = None,
867
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
868
+ inputs_embeds: Optional[torch.FloatTensor] = None,
869
+ labels: Optional[torch.LongTensor] = None,
870
+ use_cache: Optional[bool] = None,
871
+ output_attentions: Optional[bool] = None,
872
+ output_hidden_states: Optional[bool] = None,
873
+ return_dict: Optional[bool] = None,
874
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
875
+ r"""
876
+ Args:
877
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
878
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
879
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
880
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
881
+
882
+ Returns:
883
+
884
+ Example:
885
+
886
+ ```python
887
+ >>> from transformers import AutoTokenizer, OrionForCausalLM
888
+
889
+ >>> model = OrionForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
890
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
891
+
892
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
893
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
894
+
895
+ >>> # Generate
896
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
897
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
898
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
899
+ ```"""
900
+
901
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
902
+ output_hidden_states = (
903
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
904
+ )
905
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
906
+
907
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
908
+ outputs = self.model(
909
+ input_ids=input_ids,
910
+ attention_mask=attention_mask,
911
+ position_ids=position_ids,
912
+ past_key_values=past_key_values,
913
+ inputs_embeds=inputs_embeds,
914
+ use_cache=use_cache,
915
+ output_attentions=output_attentions,
916
+ output_hidden_states=output_hidden_states,
917
+ return_dict=return_dict,
918
+ )
919
+
920
+ hidden_states = outputs[0]
921
+ if self.config.pretraining_tp > 1:
922
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
923
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
924
+ logits = torch.cat(logits, dim=-1)
925
+ else:
926
+ logits = self.lm_head(hidden_states)
927
+ logits = logits.float()
928
+
929
+ loss = None
930
+ if labels is not None:
931
+ # Shift so that tokens < n predict n
932
+ shift_logits = logits[..., :-1, :].contiguous()
933
+ shift_labels = labels[..., 1:].contiguous()
934
+ # Flatten the tokens
935
+ loss_fct = CrossEntropyLoss()
936
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
937
+ shift_labels = shift_labels.view(-1)
938
+ # Enable model parallelism
939
+ shift_labels = shift_labels.to(shift_logits.device)
940
+ loss = loss_fct(shift_logits, shift_labels)
941
+
942
+ if not return_dict:
943
+ output = (logits,) + outputs[1:]
944
+ return (loss,) + output if loss is not None else output
945
+
946
+ return CausalLMOutputWithPast(
947
+ loss=loss,
948
+ logits=logits,
949
+ past_key_values=outputs.past_key_values,
950
+ hidden_states=outputs.hidden_states,
951
+ attentions=outputs.attentions,
952
+ )
953
+
954
+ def prepare_inputs_for_generation(
955
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
956
+ ):
957
+ if past_key_values:
958
+ input_ids = input_ids[:, -1:]
959
+
960
+ position_ids = kwargs.get("position_ids", None)
961
+ if attention_mask is not None and position_ids is None:
962
+ # create position_ids on the fly for batch generation
963
+ position_ids = attention_mask.long().cumsum(-1) - 1
964
+ position_ids.masked_fill_(attention_mask == 0, 1)
965
+ if past_key_values:
966
+ position_ids = position_ids[:, -1].unsqueeze(-1)
967
+
968
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
969
+ if inputs_embeds is not None and past_key_values is None:
970
+ model_inputs = {"inputs_embeds": inputs_embeds}
971
+ else:
972
+ model_inputs = {"input_ids": input_ids}
973
+
974
+ model_inputs.update(
975
+ {
976
+ "position_ids": position_ids,
977
+ "past_key_values": past_key_values,
978
+ "use_cache": kwargs.get("use_cache"),
979
+ "attention_mask": attention_mask,
980
+ }
981
+ )
982
+ return model_inputs
983
+
984
+ @staticmethod
985
+ def _reorder_cache(past_key_values, beam_idx):
986
+ reordered_past = ()
987
+ for layer_past in past_key_values:
988
+ reordered_past += (
989
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
990
+ )
991
+ return reordered_past
992
+
993
+ class OrionForSequenceClassification(OrionPreTrainedModel):
994
+ def __init__(self, config):
995
+ super().__init__(config)
996
+ self.num_labels = config.num_labels
997
+ self.model = OrionModel(config)
998
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
999
+
1000
+ # Initialize weights and apply final processing
1001
+ self.post_init()
1002
+
1003
+ def get_input_embeddings(self):
1004
+ return self.model.embed_tokens
1005
+
1006
+ def set_input_embeddings(self, value):
1007
+ self.model.embed_tokens = value
1008
+
1009
+ def forward(
1010
+ self,
1011
+ input_ids: torch.LongTensor = None,
1012
+ attention_mask: Optional[torch.Tensor] = None,
1013
+ position_ids: Optional[torch.LongTensor] = None,
1014
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1015
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1016
+ labels: Optional[torch.LongTensor] = None,
1017
+ use_cache: Optional[bool] = None,
1018
+ output_attentions: Optional[bool] = None,
1019
+ output_hidden_states: Optional[bool] = None,
1020
+ return_dict: Optional[bool] = None,
1021
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1022
+ r"""
1023
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1024
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1025
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1026
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1027
+ """
1028
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1029
+
1030
+ transformer_outputs = self.model(
1031
+ input_ids,
1032
+ attention_mask=attention_mask,
1033
+ position_ids=position_ids,
1034
+ past_key_values=past_key_values,
1035
+ inputs_embeds=inputs_embeds,
1036
+ use_cache=use_cache,
1037
+ output_attentions=output_attentions,
1038
+ output_hidden_states=output_hidden_states,
1039
+ return_dict=return_dict,
1040
+ )
1041
+ hidden_states = transformer_outputs[0]
1042
+ logits = self.score(hidden_states)
1043
+
1044
+ if input_ids is not None:
1045
+ batch_size = input_ids.shape[0]
1046
+ else:
1047
+ batch_size = inputs_embeds.shape[0]
1048
+
1049
+ if self.config.pad_token_id is None and batch_size != 1:
1050
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1051
+ if self.config.pad_token_id is None:
1052
+ sequence_lengths = -1
1053
+ else:
1054
+ if input_ids is not None:
1055
+ sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).long().argmax(-1) - 1).to(
1056
+ logits.device
1057
+ )
1058
+ else:
1059
+ sequence_lengths = -1
1060
+
1061
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1062
+
1063
+ loss = None
1064
+ if labels is not None:
1065
+ labels = labels.to(logits.device)
1066
+ if self.config.problem_type is None:
1067
+ if self.num_labels == 1:
1068
+ self.config.problem_type = "regression"
1069
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1070
+ self.config.problem_type = "single_label_classification"
1071
+ else:
1072
+ self.config.problem_type = "multi_label_classification"
1073
+
1074
+ if self.config.problem_type == "regression":
1075
+ loss_fct = MSELoss()
1076
+ if self.num_labels == 1:
1077
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1078
+ else:
1079
+ loss = loss_fct(pooled_logits, labels)
1080
+ elif self.config.problem_type == "single_label_classification":
1081
+ loss_fct = CrossEntropyLoss()
1082
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1083
+ elif self.config.problem_type == "multi_label_classification":
1084
+ loss_fct = BCEWithLogitsLoss()
1085
+ loss = loss_fct(pooled_logits, labels)
1086
+ if not return_dict:
1087
+ output = (pooled_logits,) + transformer_outputs[1:]
1088
+ return ((loss,) + output) if loss is not None else output
1089
+
1090
+ return SequenceClassifierOutputWithPast(
1091
+ loss=loss,
1092
+ logits=pooled_logits,
1093
+ past_key_values=transformer_outputs.past_key_values,
1094
+ hidden_states=transformer_outputs.hidden_states,
1095
+ attentions=transformer_outputs.attentions,
1096
+ )
1097
+
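
The attention layers above apply rotary position embeddings via `rotate_half` and `apply_rotary_pos_emb`. The following self-contained sketch mirrors that math and the tensor shapes used in `OrionAttention.forward` (the helper is copied inline so the snippet runs without importing the module, whose relative import expects a package context; the sizes are taken from `config.json`):

```python
# Shape walk-through of the RoPE application used by OrionAttention (illustrative sketch).
import torch

def rotate_half(x):
    # Same helper as in modeling_orion.py: rotate the last dimension by half.
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

bsz, num_heads, seq_len, head_dim = 2, 40, 16, 128  # head_dim = 5120 / 40
base = 10000.0                                      # rope_theta

# Same cache construction as OrionRotaryEmbedding._set_cos_sin_cache.
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
t = torch.arange(seq_len).float()
freqs = torch.einsum("i,j->ij", t, inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)             # [seq_len, head_dim]
cos, sin = emb.cos(), emb.sin()

q = torch.randn(bsz, num_heads, seq_len, head_dim)
position_ids = torch.arange(seq_len).unsqueeze(0).expand(bsz, -1)

# Same math as apply_rotary_pos_emb: index the cache by position, broadcast over heads.
cos_b = cos[position_ids].unsqueeze(1)              # [bsz, 1, seq_len, head_dim]
sin_b = sin[position_ids].unsqueeze(1)
q_embed = (q * cos_b) + (rotate_half(q) * sin_b)
print(q_embed.shape)                                # torch.Size([2, 40, 16, 128])
```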
pytorch_model-00001-of-00002.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d8155f53c91f0e9591a69a5e49ac5deee0e0f32c671f5a869ed6dd85853e16a2
3
+ size 15704709713
pytorch_model-00002-of-00002.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e05566cec048435bc8fda67cc481e46f42df62f4707168d3420cb60bbda13acf
3
+ size 13292827298
pytorch_model.bin.index.json ADDED
@@ -0,0 +1,491 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 18511646976
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "pytorch_model-00002-of-00002.bin",
7
+ "model.embed_tokens.weight": "pytorch_model-00001-of-00002.bin",
8
+ "model.layers.0.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
9
+ "model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
10
+ "model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
11
+ "model.layers.0.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
12
+ "model.layers.0.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
13
+ "model.layers.0.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
14
+ "model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
15
+ "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
16
+ "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
17
+ "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
18
+ "model.layers.0.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
19
+ "model.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
20
+ "model.layers.1.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
21
+ "model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
22
+ "model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
23
+ "model.layers.1.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
24
+ "model.layers.1.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
25
+ "model.layers.1.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
26
+ "model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
27
+ "model.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
28
+ "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
29
+ "model.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
30
+ "model.layers.1.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
31
+ "model.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
32
+ "model.layers.10.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
33
+ "model.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
34
+ "model.layers.10.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
35
+ "model.layers.10.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
36
+ "model.layers.10.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
37
+ "model.layers.10.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
38
+ "model.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
39
+ "model.layers.10.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
40
+ "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
41
+ "model.layers.10.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
42
+ "model.layers.10.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
43
+ "model.layers.10.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
44
+ "model.layers.11.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
45
+ "model.layers.11.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
46
+ "model.layers.11.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
47
+ "model.layers.11.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
48
+ "model.layers.11.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
49
+ "model.layers.11.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
50
+ "model.layers.11.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
51
+ "model.layers.11.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
52
+ "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
53
+ "model.layers.11.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
54
+ "model.layers.11.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
55
+ "model.layers.11.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
56
+ "model.layers.12.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
57
+ "model.layers.12.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
58
+ "model.layers.12.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
59
+ "model.layers.12.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
60
+ "model.layers.12.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
61
+ "model.layers.12.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
62
+ "model.layers.12.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
63
+ "model.layers.12.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
64
+ "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
65
+ "model.layers.12.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
66
+ "model.layers.12.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
67
+ "model.layers.12.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
68
+ "model.layers.13.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
69
+ "model.layers.13.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
70
+ "model.layers.13.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
71
+ "model.layers.13.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
72
+ "model.layers.13.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
73
+ "model.layers.13.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
74
+ "model.layers.13.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
75
+ "model.layers.13.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
76
+ "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
77
+ "model.layers.13.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
78
+ "model.layers.13.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
79
+ "model.layers.13.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
80
+ "model.layers.14.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
81
+ "model.layers.14.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
82
+ "model.layers.14.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
83
+ "model.layers.14.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
84
+ "model.layers.14.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
85
+ "model.layers.14.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
86
+ "model.layers.14.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
87
+ "model.layers.14.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
88
+ "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
89
+ "model.layers.14.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
90
+ "model.layers.14.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
91
+ "model.layers.14.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
92
+ "model.layers.15.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
93
+ "model.layers.15.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
94
+ "model.layers.15.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
95
+ "model.layers.15.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
96
+ "model.layers.15.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
97
+ "model.layers.15.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
98
+ "model.layers.15.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
99
+ "model.layers.15.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
100
+ "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
101
+ "model.layers.15.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
102
+ "model.layers.15.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
103
+ "model.layers.15.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
104
+ "model.layers.16.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
105
+ "model.layers.16.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
106
+ "model.layers.16.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
107
+ "model.layers.16.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
108
+ "model.layers.16.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
109
+ "model.layers.16.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
110
+ "model.layers.16.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
111
+ "model.layers.16.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
112
+ "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
113
+ "model.layers.16.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
114
+ "model.layers.16.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
115
+ "model.layers.16.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
116
+ "model.layers.17.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
117
+ "model.layers.17.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
118
+ "model.layers.17.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
119
+ "model.layers.17.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
120
+ "model.layers.17.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
121
+ "model.layers.17.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
122
+ "model.layers.17.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
123
+ "model.layers.17.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
124
+ "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
125
+ "model.layers.17.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
126
+ "model.layers.17.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
127
+ "model.layers.17.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
128
+ "model.layers.18.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
129
+ "model.layers.18.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
130
+ "model.layers.18.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
131
+ "model.layers.18.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
132
+ "model.layers.18.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
133
+ "model.layers.18.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
134
+ "model.layers.18.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
135
+ "model.layers.18.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
136
+ "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
137
+ "model.layers.18.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
138
+ "model.layers.18.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
139
+ "model.layers.18.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
140
+ "model.layers.19.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
141
+ "model.layers.19.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
142
+ "model.layers.19.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
143
+ "model.layers.19.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
144
+ "model.layers.19.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
145
+ "model.layers.19.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
146
+ "model.layers.19.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
147
+ "model.layers.19.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
148
+ "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
149
+ "model.layers.19.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
150
+ "model.layers.19.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
151
+ "model.layers.19.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
152
+ "model.layers.2.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
153
+ "model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
154
+ "model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
155
+ "model.layers.2.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
156
+ "model.layers.2.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
157
+ "model.layers.2.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
158
+ "model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
159
+ "model.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
160
+ "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
161
+ "model.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
162
+ "model.layers.2.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
163
+ "model.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
164
+ "model.layers.20.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
165
+ "model.layers.20.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
166
+ "model.layers.20.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
167
+ "model.layers.20.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
168
+ "model.layers.20.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
169
+ "model.layers.20.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
170
+ "model.layers.20.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
171
+ "model.layers.20.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
172
+ "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
173
+ "model.layers.20.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
174
+ "model.layers.20.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
175
+ "model.layers.20.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
176
+ "model.layers.21.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
177
+ "model.layers.21.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
178
+ "model.layers.21.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
179
+ "model.layers.21.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
180
+ "model.layers.21.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
181
+ "model.layers.21.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
182
+ "model.layers.21.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
183
+ "model.layers.21.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
184
+ "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
185
+ "model.layers.21.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
186
+ "model.layers.21.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
187
+ "model.layers.21.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
188
+ "model.layers.22.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
189
+ "model.layers.22.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
190
+ "model.layers.22.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
191
+ "model.layers.22.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
192
+ "model.layers.22.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
193
+ "model.layers.22.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
194
+ "model.layers.22.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
195
+ "model.layers.22.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
196
+ "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
197
+ "model.layers.22.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
198
+ "model.layers.22.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
199
+ "model.layers.22.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
200
+ "model.layers.23.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
201
+ "model.layers.23.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
202
+ "model.layers.23.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
203
+ "model.layers.23.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
204
+ "model.layers.23.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
205
+ "model.layers.23.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
206
+ "model.layers.23.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
207
+ "model.layers.23.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
208
+ "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
209
+ "model.layers.23.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
210
+ "model.layers.23.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
211
+ "model.layers.23.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
212
+ "model.layers.24.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
213
+ "model.layers.24.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
214
+ "model.layers.24.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
215
+ "model.layers.24.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
216
+ "model.layers.24.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
217
+ "model.layers.24.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
218
+ "model.layers.24.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
219
+ "model.layers.24.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
220
+ "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
221
+ "model.layers.24.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
222
+ "model.layers.24.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
223
+ "model.layers.24.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
224
+ "model.layers.25.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
225
+ "model.layers.25.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
226
+ "model.layers.25.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
227
+ "model.layers.25.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
228
+ "model.layers.25.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
229
+ "model.layers.25.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
230
+ "model.layers.25.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
231
+ "model.layers.25.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
232
+ "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
233
+ "model.layers.25.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
234
+ "model.layers.25.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
235
+ "model.layers.25.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
236
+ "model.layers.26.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
237
+ "model.layers.26.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
238
+ "model.layers.26.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
239
+ "model.layers.26.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
240
+ "model.layers.26.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
241
+ "model.layers.26.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
242
+ "model.layers.26.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
243
+ "model.layers.26.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
244
+ "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
245
+ "model.layers.26.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
246
+ "model.layers.26.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
247
+ "model.layers.26.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
248
+ "model.layers.27.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
249
+ "model.layers.27.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
250
+ "model.layers.27.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
251
+ "model.layers.27.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
252
+ "model.layers.27.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
253
+ "model.layers.27.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
254
+ "model.layers.27.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
255
+ "model.layers.27.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
256
+ "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
257
+ "model.layers.27.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
258
+ "model.layers.27.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
259
+ "model.layers.27.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
260
+ "model.layers.28.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
261
+ "model.layers.28.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
262
+ "model.layers.28.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
263
+ "model.layers.28.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
264
+ "model.layers.28.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
265
+ "model.layers.28.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
266
+ "model.layers.28.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
267
+ "model.layers.28.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
268
+ "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
269
+ "model.layers.28.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
270
+ "model.layers.28.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
271
+ "model.layers.28.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
272
+ "model.layers.29.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
273
+ "model.layers.29.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
274
+ "model.layers.29.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
275
+ "model.layers.29.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
276
+ "model.layers.29.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
277
+ "model.layers.29.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
278
+ "model.layers.29.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
279
+ "model.layers.29.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
280
+ "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
281
+ "model.layers.29.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
282
+ "model.layers.29.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
283
+ "model.layers.29.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
284
+ "model.layers.3.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
285
+ "model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
286
+ "model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
287
+ "model.layers.3.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
288
+ "model.layers.3.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
289
+ "model.layers.3.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
290
+ "model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
291
+ "model.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
292
+ "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
293
+ "model.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
294
+ "model.layers.3.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
295
+ "model.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
296
+ "model.layers.30.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
297
+ "model.layers.30.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
298
+ "model.layers.30.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
299
+ "model.layers.30.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
300
+ "model.layers.30.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
301
+ "model.layers.30.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
302
+ "model.layers.30.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
303
+ "model.layers.30.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
304
+ "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
305
+ "model.layers.30.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
306
+ "model.layers.30.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
307
+ "model.layers.30.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
308
+ "model.layers.31.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
309
+ "model.layers.31.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
310
+ "model.layers.31.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
311
+ "model.layers.31.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
312
+ "model.layers.31.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
313
+ "model.layers.31.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
314
+ "model.layers.31.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
315
+ "model.layers.31.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
316
+ "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
317
+ "model.layers.31.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
318
+ "model.layers.31.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
319
+ "model.layers.31.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
320
+ "model.layers.32.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
321
+ "model.layers.32.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
322
+ "model.layers.32.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
323
+ "model.layers.32.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
324
+ "model.layers.32.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
325
+ "model.layers.32.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
326
+ "model.layers.32.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
327
+ "model.layers.32.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
328
+ "model.layers.32.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
329
+ "model.layers.32.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
330
+ "model.layers.32.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
331
+ "model.layers.32.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
332
+ "model.layers.33.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
333
+ "model.layers.33.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
334
+ "model.layers.33.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
335
+ "model.layers.33.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
336
+ "model.layers.33.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
337
+ "model.layers.33.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
338
+ "model.layers.33.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
339
+ "model.layers.33.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
340
+ "model.layers.33.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
341
+ "model.layers.33.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
342
+ "model.layers.33.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
343
+ "model.layers.33.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
344
+ "model.layers.34.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
345
+ "model.layers.34.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
346
+ "model.layers.34.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
347
+ "model.layers.34.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
348
+ "model.layers.34.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
349
+ "model.layers.34.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
350
+ "model.layers.34.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
351
+ "model.layers.34.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
352
+ "model.layers.34.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
353
+ "model.layers.34.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
354
+ "model.layers.34.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
355
+ "model.layers.34.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
356
+ "model.layers.35.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
357
+ "model.layers.35.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
358
+ "model.layers.35.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
359
+ "model.layers.35.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
360
+ "model.layers.35.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
361
+ "model.layers.35.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
362
+ "model.layers.35.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
363
+ "model.layers.35.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
364
+ "model.layers.35.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
365
+ "model.layers.35.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
366
+ "model.layers.35.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
367
+ "model.layers.35.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
368
+ "model.layers.36.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
369
+ "model.layers.36.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
370
+ "model.layers.36.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
371
+ "model.layers.36.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
372
+ "model.layers.36.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
373
+ "model.layers.36.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
374
+ "model.layers.36.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
375
+ "model.layers.36.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
376
+ "model.layers.36.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
377
+ "model.layers.36.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
378
+ "model.layers.36.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
379
+ "model.layers.36.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
380
+ "model.layers.37.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
381
+ "model.layers.37.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
382
+ "model.layers.37.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
383
+ "model.layers.37.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
384
+ "model.layers.37.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
385
+ "model.layers.37.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
386
+ "model.layers.37.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
387
+ "model.layers.37.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
388
+ "model.layers.37.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
389
+ "model.layers.37.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
390
+ "model.layers.37.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
391
+ "model.layers.37.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
392
+ "model.layers.38.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
393
+ "model.layers.38.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
394
+ "model.layers.38.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
395
+ "model.layers.38.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
396
+ "model.layers.38.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
397
+ "model.layers.38.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
398
+ "model.layers.38.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
399
+ "model.layers.38.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
400
+ "model.layers.38.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
401
+ "model.layers.38.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
402
+ "model.layers.38.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
403
+ "model.layers.38.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
404
+ "model.layers.39.input_layernorm.bias": "pytorch_model-00002-of-00002.bin",
405
+ "model.layers.39.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
406
+ "model.layers.39.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
407
+ "model.layers.39.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
408
+ "model.layers.39.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
409
+ "model.layers.39.post_attention_layernorm.bias": "pytorch_model-00002-of-00002.bin",
410
+ "model.layers.39.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
411
+ "model.layers.39.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
412
+ "model.layers.39.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
413
+ "model.layers.39.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
414
+ "model.layers.39.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
415
+ "model.layers.39.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
416
+ "model.layers.4.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
417
+ "model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
418
+ "model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
419
+ "model.layers.4.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
420
+ "model.layers.4.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
421
+ "model.layers.4.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
422
+ "model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
423
+ "model.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
424
+ "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
425
+ "model.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
426
+ "model.layers.4.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
427
+ "model.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
428
+ "model.layers.5.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
429
+ "model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
430
+ "model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
431
+ "model.layers.5.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
432
+ "model.layers.5.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
433
+ "model.layers.5.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
434
+ "model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
435
+ "model.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
436
+ "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
437
+ "model.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
438
+ "model.layers.5.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
439
+ "model.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
440
+ "model.layers.6.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
441
+ "model.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
442
+ "model.layers.6.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
443
+ "model.layers.6.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
444
+ "model.layers.6.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
445
+ "model.layers.6.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
446
+ "model.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
447
+ "model.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
448
+ "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
449
+ "model.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
450
+ "model.layers.6.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
451
+ "model.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
452
+ "model.layers.7.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
453
+ "model.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
454
+ "model.layers.7.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
455
+ "model.layers.7.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
456
+ "model.layers.7.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
457
+ "model.layers.7.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
458
+ "model.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
459
+ "model.layers.7.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
460
+ "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
461
+ "model.layers.7.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
462
+ "model.layers.7.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
463
+ "model.layers.7.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
464
+ "model.layers.8.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
465
+ "model.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
466
+ "model.layers.8.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
467
+ "model.layers.8.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
468
+ "model.layers.8.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
469
+ "model.layers.8.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
470
+ "model.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
471
+ "model.layers.8.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
472
+ "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
473
+ "model.layers.8.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
474
+ "model.layers.8.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
475
+ "model.layers.8.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
476
+ "model.layers.9.input_layernorm.bias": "pytorch_model-00001-of-00002.bin",
477
+ "model.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
478
+ "model.layers.9.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
479
+ "model.layers.9.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
480
+ "model.layers.9.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
481
+ "model.layers.9.post_attention_layernorm.bias": "pytorch_model-00001-of-00002.bin",
482
+ "model.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
483
+ "model.layers.9.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
484
+ "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
485
+ "model.layers.9.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
486
+ "model.layers.9.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
487
+ "model.layers.9.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
488
+ "model.norm.bias": "pytorch_model-00002-of-00002.bin",
489
+ "model.norm.weight": "pytorch_model-00002-of-00002.bin"
490
+ }
491
+ }
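The block above is the tail of `pytorch_model.bin.index.json`, the sharded-checkpoint index: every tensor name is mapped to the `.bin` shard that stores it, which is how `from_pretrained` knows which of the two shard files to open for a given weight. The sketch below is illustrative only and assumes the standard index layout with a top-level `weight_map` key (that key sits above this excerpt); in normal use `AutoModelForCausalLM.from_pretrained` reads the index automatically.

```python
import json

# Illustrative only: how a loader resolves a tensor name to its shard file.
# Assumes the usual {"metadata": {...}, "weight_map": {...}} index structure.
with open("pytorch_model.bin.index.json") as f:
    index = json.load(f)

weight_map = index["weight_map"]

# Earlier layers sit almost entirely in shard 1, later layers in shard 2:
print(weight_map["model.layers.1.mlp.up_proj.weight"])   # pytorch_model-00001-of-00002.bin
print(weight_map["model.layers.30.mlp.up_proj.weight"])  # pytorch_model-00002-of-00002.bin
```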
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+ "bos_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<unk>",
+ "unk_token": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenization_orion.py ADDED
@@ -0,0 +1,271 @@
+ # Copyright (c) 2024, OrionStar Inc. All rights reserved.
+
+ import os
+ from shutil import copyfile
+ from typing import Any, Dict, List, Optional, Tuple
+ import re
+
+ import sentencepiece as spm
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
+ from transformers.utils import logging  # added: provides the logger used in save_vocabulary
+
+ logger = logging.get_logger(__name__)  # added: "logger" was referenced below but never defined
+
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
+
+ PRETRAINED_VOCAB_FILES_MAP = {
+     "vocab_file": {},
+     "tokenizer_file": {},
+ }
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
+
+
+ class OrionTokenizer(PreTrainedTokenizer):
+     """
+     Construct an Orion tokenizer. Based on byte-level Byte-Pair-Encoding.
+
+     Args:
+         vocab_file (`str`):
+             Path to the vocabulary file.
+     """
+
+     vocab_files_names = VOCAB_FILES_NAMES
+     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+     model_input_names = ["input_ids", "attention_mask"]
+
+     def __init__(
+         self,
+         vocab_file,
+         unk_token="<unk>",
+         bos_token="<s>",
+         eos_token="</s>",
+         pad_token=None,
+         sp_model_kwargs: Optional[Dict[str, Any]] = None,
+         add_bos_token=True,
+         add_eos_token=False,
+         clean_up_tokenization_spaces=False,
+         **kwargs,
+     ):
+         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+         bos_token = (
+             AddedToken(bos_token, lstrip=False, rstrip=False)
+             if isinstance(bos_token, str)
+             else bos_token
+         )
+         eos_token = (
+             AddedToken(eos_token, lstrip=False, rstrip=False)
+             if isinstance(eos_token, str)
+             else eos_token
+         )
+         unk_token = (
+             AddedToken(unk_token, lstrip=False, rstrip=False)
+             if isinstance(unk_token, str)
+             else unk_token
+         )
+         pad_token = (
+             AddedToken(pad_token, lstrip=False, rstrip=False)
+             if isinstance(pad_token, str)
+             else pad_token
+         )
+         self.vocab_file = vocab_file
+         self.add_bos_token = add_bos_token
+         self.add_eos_token = add_eos_token
+         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+         self.sp_model.Load(vocab_file)
+
+         super().__init__(
+             bos_token=bos_token,
+             eos_token=eos_token,
+             unk_token=unk_token,
+             pad_token=pad_token,
+             add_bos_token=add_bos_token,
+             add_eos_token=add_eos_token,
+             sp_model_kwargs=self.sp_model_kwargs,
+             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+             **kwargs,
+         )
+
+     def __getstate__(self):
+         state = self.__dict__.copy()
+         state["sp_model"] = None
+         return state
+
+     def __setstate__(self, d):
+         self.__dict__ = d
+         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+         self.sp_model.Load(self.vocab_file)
+
+     @property
+     def vocab_size(self):
+         """Returns vocab size"""
+         return self.sp_model.get_piece_size()
+
+     def get_vocab(self):
+         """Returns vocab as a dict"""
+         vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
+         vocab.update(self.added_tokens_encoder)
+         return vocab
+
+     def _tokenize(self, text):
+         """Returns a tokenized string."""
+         return self.sp_model.encode(text, out_type=str)
+
+     def _convert_token_to_id(self, token):
+         """Converts a token (str) in an id using the vocab."""
+         return self.sp_model.piece_to_id(token)
+
+     def _convert_id_to_token(self, index):
+         """Converts an index (integer) in a token (str) using the vocab."""
+         token = self.sp_model.IdToPiece(index)
+         return token
+
+     def convert_tokens_to_string(self, tokens):
+         """Converts a sequence of tokens (string) in a single string."""
+         zhPattern = re.compile(u'[\u4e00-\u9fa5]+')
+         need_convert_punctuation=(",",";","!","?",":","(",")")
+         current_sub_tokens = []
+         out_string = ""
+         prev_is_special = False
+         for i, token in enumerate(tokens):
+             # make sure that special tokens are not decoded using sentencepiece model
+             if token in self.all_special_tokens:
+                 if not prev_is_special and i != 0:
+                     out_string += " "
+                 out_string += self.sp_model.decode(current_sub_tokens) + token
+                 prev_is_special = True
+                 current_sub_tokens = []
+             if any([True if punctuation in token else False for punctuation in need_convert_punctuation]):
+                 out_string += self.sp_model.decode(current_sub_tokens)
+                 token=self.sp_model.decode(token)
+                 if zhPattern.search(out_string[-20:]):
+                     token = self.to_zh_punctuation(token)
+                 out_string += token
+                 current_sub_tokens = []
+             else:
+                 current_sub_tokens.append(token)
+                 prev_is_special = False
+         out_string += self.sp_model.decode(current_sub_tokens)
+         return out_string
+
+     def to_zh_punctuation(self, token):
+         return token.replace(",",",").replace(";",";").replace("!","!").replace("?","?").replace(":",":").replace("(","(").replace(")",")")
+
+     def save_vocabulary(
+         self, save_directory, filename_prefix: Optional[str] = None
+     ) -> Tuple[str]:
+         """
+         Save the vocabulary and special tokens file to a directory.
+
+         Args:
+             save_directory (`str`):
+                 The directory in which to save the vocabulary.
+
+         Returns:
+             `Tuple(str)`: Paths to the files saved.
+         """
+         if not os.path.isdir(save_directory):
+             logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+             return
+         out_vocab_file = os.path.join(
+             save_directory,
+             (filename_prefix + "-" if filename_prefix else "")
+             + VOCAB_FILES_NAMES["vocab_file"],
+         )
+
+         if os.path.abspath(self.vocab_file) != os.path.abspath(
+             out_vocab_file
+         ) and os.path.isfile(self.vocab_file):
+             copyfile(self.vocab_file, out_vocab_file)
+         elif not os.path.isfile(self.vocab_file):
+             with open(out_vocab_file, "wb") as fi:
+                 content_spiece_model = self.sp_model.serialized_model_proto()
+                 fi.write(content_spiece_model)
+
+         return (out_vocab_file,)
+
+     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+         bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+         eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+
+         output = bos_token_id + token_ids_0 + eos_token_id
+
+         if token_ids_1 is not None:
+             output = output + bos_token_id + token_ids_1 + eos_token_id
+
+         return output
+
+     def get_special_tokens_mask(
+         self,
+         token_ids_0: List[int],
+         token_ids_1: Optional[List[int]] = None,
+         already_has_special_tokens: bool = False,
+     ) -> List[int]:
+         """
+         Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+         special tokens using the tokenizer `prepare_for_model` method.
+
+         Args:
+             token_ids_0 (`List[int]`):
+                 List of IDs.
+             token_ids_1 (`List[int]`, *optional*):
+                 Optional second list of IDs for sequence pairs.
+             already_has_special_tokens (`bool`, *optional*, defaults to `False`):
+                 Whether or not the token list is already formatted with special tokens for the model.
+
+         Returns:
+             `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+         """
+         if already_has_special_tokens:
+             return super().get_special_tokens_mask(
+                 token_ids_0=token_ids_0,
+                 token_ids_1=token_ids_1,
+                 already_has_special_tokens=True,
+             )
+
+         bos_token_id = [1] if self.add_bos_token else []
+         eos_token_id = [1] if self.add_eos_token else []
+
+         if token_ids_1 is None:
+             return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
+         return (
+             bos_token_id
+             + ([0] * len(token_ids_0))
+             + eos_token_id
+             + bos_token_id
+             + ([0] * len(token_ids_1))
+             + eos_token_id
+         )
+
+     def create_token_type_ids_from_sequences(
+         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+     ) -> List[int]:
+         """
+         Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
+         sequence pair mask has the following format:
+
+         ```
+         0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
+         | first sequence | second sequence |
+         ```
+
+         if token_ids_1 is None, only returns the first portion of the mask (0s).
+
+         Args:
+             token_ids_0 (`List[int]`):
+                 List of ids.
+             token_ids_1 (`List[int]`, *optional*):
+                 Optional second list of IDs for sequence pairs.
+
+         Returns:
+             `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
+         """
+         bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+         eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+
+         output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
+
+         if token_ids_1 is not None:
+             output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
+
+         return output
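Because `tokenizer_config.json` (added below) registers this class under `auto_map`, the file is normally loaded through `AutoTokenizer` with `trust_remote_code=True` rather than imported directly; note that `convert_tokens_to_string` above also converts ASCII punctuation back to full-width characters when the preceding text is Chinese. A minimal usage sketch, assuming the files from this commit are published under a repo id or available in a local directory (the id below is only a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder: substitute the actual repo id or a local directory containing this commit.
tokenizer = AutoTokenizer.from_pretrained(
    "OrionStarAI/Orion-14B-Base",
    trust_remote_code=True,  # lets auto_map import tokenization_orion.OrionTokenizer
)

text = "Hello, world. 你好,世界。"
ids = tokenizer(text).input_ids
print(ids)
print(tokenizer.decode(ids))  # decoding goes through convert_tokens_to_string above
```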
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ded43118b7418f56db97a4eed08a5c265c03120158229ddd4fbcc9658241d5f0
+ size 1520600
tokenizer_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+ "add_bos_token": false,
+ "add_eos_token": false,
+ "auto_map": {
+ "AutoTokenizer": [
+ "tokenization_orion.OrionTokenizer",
+ null
+ ]
+ },
+ "bos_token": {
+ "__type": "AddedToken",
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": true
+ },
+ "clean_up_tokenization_spaces": false,
+ "eos_token": {
+ "__type": "AddedToken",
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": true
+ },
+ "model_max_length": 4096,
+ "pad_token": {
+ "__type": "AddedToken",
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": true
+ },
+ "sp_model_kwargs": {},
+ "tokenizer_class": "OrionTokenizer",
+ "unk_token": {
+ "__type": "AddedToken",
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": true
+ }
+ }
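Two details of this config are easy to miss: `auto_map` is what lets `AutoTokenizer` resolve the custom `OrionTokenizer` class shipped in this repository, and `add_bos_token`/`add_eos_token` are set to `false` here, overriding the `add_bos_token=True` default in `OrionTokenizer.__init__`, so plain encoding adds no BOS/EOS ids. A short check of the configured values, using the same placeholder repo id as in the sketch above:

```python
from transformers import AutoTokenizer

# Placeholder repo id; see the loading sketch above.
tokenizer = AutoTokenizer.from_pretrained("OrionStarAI/Orion-14B-Base", trust_remote_code=True)

# Values supplied by special_tokens_map.json / tokenizer_config.json in this commit:
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token, tokenizer.pad_token)  # <s> </s> <unk> <unk>
print(tokenizer.model_max_length)   # 4096
print(tokenizer("test").input_ids)  # no BOS/EOS ids appended, since add_bos_token/add_eos_token are false
```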