DeepLangLvcc committed on
Commit
fa4621e
1 Parent(s): 9cc196a

upload model

MODEL_LICENSE.md ADDED
@@ -0,0 +1,40 @@
1
+ # LingoWhale-8B模型许可协议
2
+
3
+ ## 1. 定义
4
+ - “发布方”:指发布源模型的LingoWhale-8B模型团队。
5
+ - “源模型”:指根据本许可提供的LingoWhale-8B模型参数。
6
+ - “使用方”:指根据本协议使用源模型的单位或个人。
7
+
8
+ ## 2. 许可内容
9
+ 根据本许可的条款和条件,发布方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可。
10
+
11
+ 上述版权声明和本许可声明应包含在此源模型的所有副本或重要部分中。
12
+
13
+ ## 3. 限制
14
+ 您不得出于任何军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建此源模型的全部或部分衍生品。
15
+
16
+ 您不得利用此源模型从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
17
+
18
+ ## 4. 免责声明
19
+ 此源模型“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途的适用性和非侵权性的保证。在任何情况下,对于因源模型、源模型的使用或源模型的其他交易而引起的或与之相关的任何索赔、损害或其他责任,无论是在合同诉讼、侵权行为还是其他方面,作者或版权持有人均不承担责任。
20
+
21
+ ## 5. 责任限制
22
+ 除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何发布方均不对您承担任何直接、间接、特殊、偶然、示范性或后果性损害,或任何其他商业损失的责任,即使使用方已被告知此类损害的可能性。
23
+
24
+ ## 6. 争议解决
25
+ 本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议,由发布方住所地人民法院管辖。
26
+
27
+ 请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过license@deeplang.ai与我们联系。
28
+
29
+ ## 7. 附则
30
+ 若您期望基于本协议的许可条件与限制,将此源模型或其衍生品用作商业用途,请您按照如下方式联系发布方,以进行登记并向发布方申请书面授权:
31
+
32
+ 1. 联系邮箱:license@deeplang.ai
33
+ 2. 需提交内容如下:
34
+
35
+ | 选项 | 是否必须 | 说明 |
36
+ | :---- | :----: | :---- |
37
+ | 申请人姓名 | 是 | 请填写申请人真实姓名。 |
38
+ | 申请企业名称 | 是 | 请填写完整的企业名称,自然人不能申请此源模型的商业许可。 |
39
+ | 联系方式 | 是 | 请填写真实邮箱,用于接收授权书文件。 |
40
+ | 使用目的和场景 | 是 | 请填写真实使用目的和场景。 |
README.md CHANGED
@@ -1,3 +1,289 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ <p align="left">
2
+ <a href="README_EN.md">English</a>&nbsp | &nbsp中文
3
+ </p>
4
+ <br>
5
+
6
+ <div align="center">
7
+ <h1>
8
+ LingoWhale-8B
9
+ </h1>
10
+ </div>
11
+
12
+ <p align="center">
13
+ 🤗 <a href="https://huggingface.co/deeplang-ai/LingoWhale-8B" target="_blank">Hugging Face</a> • 🤖 <a href="https://www.modelscope.cn/models/DeepLang/LingoWhale-8B" target="_blank">ModelScope</a> • ⛵ <a href="https://wisemodel.cn/models/%E5%8C%97%E4%BA%AC%E6%B7%B1%E8%A8%80%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E8%B4%A3%E4%BB%BB%E5%85%AC%E5%8F%B8/LingoWhale-8B/" target="_blank">Wisemodel</a>
14
+ </p>
15
+
16
+ <div align="center">
17
+ <strong>
18
+ 深言科技联合清华大学NLP实验室开源语鲸-8B模型 🎉
19
+ </strong>
20
+ </div>
21
+
22
+ # 目录
23
+
24
+ - [目录](#目录)
25
+ - [模型介绍](#模型介绍)
26
+ - [测评结果](#测评结果)
27
+ - [生成样例](#生成样例)
28
+ - [部署和推理](#部署和推理)
29
+ - [微调方法](#微调方法)
30
+ - [开源协议](#开源协议)
31
+
32
+ # 模型介绍
33
+
34
+ LingoWhale-8B是由深言科技推出的语鲸系列大模型中首个开源的中英双语大语言模型。
35
+
36
+ LingoWhale-8B模型在数万亿token的高质量中英数据上进行预训练,具有强大的基础能力,在多个公开评测基准上均达到领先效果。在预训练阶段,模型使用8K的上下文长度进行训练,能够完成更长上下文的理解和生成任务。
37
+
38
+ LingoWhale-8B模型对学术研究完全开放,使用方通过邮件申请并获得官方商用许可后,即可免费商用。
39
+
40
+ 在开源模型权重的同时,我们也提供了符合用户习惯的Huggingface推理接口以及LoRA等参数高效微调示例,便于开发者快速使用LingoWhale-8B模型。
41
+
42
+ 受模型参数量影响,大模型固有的幻觉问题、数学计算能力相对较弱等问题在LingoWhale-8B模型中仍然存在。请大家在使用前了解这些问题,评估可能存在的风险。后续版本的LingoWhale模型将会针对此类问题进行重点优化。
43
+
44
+
45
+ # 测评结果
46
+
47
+ 我们在以下公开评测基准上进行了测试:
48
+
49
+ - [C-Eval](https://arxiv.org/abs/2305.08322)是一个中文基础模型评估基准,包含了13948个多项选择题,涵盖了52个不同的学科和四个难度级别。它旨在评估中文语言模型的能力。我们使用该数据集的dev集作为few-shot的来源,在test集上进行了5-shot测试。
50
+ - [MMLU](https://arxiv.org/abs/2009.03300)是一个英文基础模型评估基准,涵盖了基本数学、美国历史、计算机科学、法律等多个领域,共包含57个任务。它用于评估语言模型在不同领域任务上的表现。我们对模型进行了5-shot测试。
51
+ - [CMMLU](https://arxiv.org/abs/2306.09212)是一个中文评估基准,涵盖了从基础学科到高级专业水平的67个主题。它用于评估中文语言模型在知识和推理能力方面的表现。我们使用该数据集的dev集作为few-shot的来源,在test集上进行了5-shot测试。
52
+ - [Gaokao](https://arxiv.org/abs/2305.12474)是一个以中国高考题目为数据集的评估基准。它旨在评估中文语言模型在语言理解能力和逻辑推理能力方面的表现。我们只保留了其中四选一的选择题,随机划分后对模型进行了5-shot测试。
53
+ - [HumanEval](https://arxiv.org/abs/2107.03374)是一个包含上百个编程问题的英文评估基准。它用于评估语言模型在程序理解与生成能力方面的表现。我们采用了zero-shot计算Pass@1的方法对模型进行了测试。
54
+ - [GSM8K](https://arxiv.org/abs/2110.14168)是一个由高质量、语言多样化的小学数学应用题组成的数据集。它要求根据给定的场景选择最合理的方案,用于评估语言模型在数学应用方面的能力。我们对模型进行了8-shot测试。
55
+ - [BBH](https://arxiv.org/abs/2210.09261)是一个从204项Big-Bench评测基准任务中选择出的表现较差的任务单独形成的评估基准。它用于评估大型语言模型在具有挑战性的任务上的表现。我们对模型进行了3-shot测试。
56
+ - [AGIEval](https://arxiv.org/abs/2304.06364)是一项考察基础模型类人能力的基准测试,专门用于评估基础模型在人类认知和问题解决相关任务中的能力。我们只保留了其中的四选一的选择题,随机划分后对模型进行了5-shot测试。
57
+
58
+
59
+ 这些评估基准提供了标准化的测试和度量,用于评估语言模型在不同任务和领域上的性能和能力。评测方法和测评结果如下表所示:
60
+
61
+ | **Model** |**C-Eval**| **MMLU** |**CMMLU** |**GAOKAO**|**HumanEval**|**GSM8K** | **BBH** |**AGIEval**|
62
+ |:-----------------------|:--------:|:--------:|:--------:|:--------:|:-----------:|:--------:|:--------:|:---------:|
63
+ | | 5-shot | 5-shot | 5-shot | 5-shot | 0-shot | 8-shot | 3-shot | 5-shot |
64
+ | **GPT-4** | 68.4 | 83.9 | 70.3 | 66.2 | 69.5 | 90.0 | 75.1 | 63.3 |
65
+ | **GPT-3.5 Turbo** | 51.1 | 68.5 | 54.1 | 47.1 | 52.4 | 57.8 | 61.6 | 46.1 |
66
+ | **LLaMA2-7B** | 28.9 | 45.7 | 31.4 | 26.0 | 12.8 | 16.2 | 39.2 | 26.5 |
67
+ | **ChatGLM2-6B** | 50.2 | 45.9 | 49.0 | 49.4 | 9.2 | 28.9 | 31.7 | 45.3 |
68
+ | **Baichuan2-7B-Base** | 54.0 | 54.2 | 57.1 | 47.5 | 18.3 | 24.5 | 41.6 | 42.7 |
69
+ | **Qwen-7B v1.1** | 63.5 | 58.2 | 62.2 | - | 29.9 | 51.7 | 45.0 | - |
70
+ | **LingoWhale-8B-base** | 63.6 | 60.2 | 62.8 | 50.3 | 32.9 | 55.0 | 47.5 | 43.8 |
71
+
72
+ <span style="color:gray">对于以上所有对比模型,我们列出了其官方汇报结果并四舍五入保留一位小数。</span>
73
+
74
+ # 生成样例
75
+
76
+ > 生成以下样例的模型是使用开源SFT数据基于LingoWhale-8B模型进行微调后的对话模型
77
+
78
+ <details><summary><b>数学计算</b></summary><p>
79
+
80
+ **问题**:f(x)为偶函数, f(1) = 3, f(2) = 4, 那么f(1) + f(-2) ^2 等于多少? 请一步一步说明
81
+
82
+ **解答**:首先,根据题目中给出的条件,我们可以知道:
83
+
84
+ - f(x) 是一个偶函数,即 f(-x) = f(x)。
85
+ - f(1) = 3。
86
+ - f(2) = 4。
87
+
88
+ 接下来,我们需要求出 f(1) + f(-2)^2 的值。
89
+ 由于 f(x) 是偶函数,所以 f(-2) = f(2),因此 f(-2) = 4。
90
+ 将 f(1) 和 f(-2) 的值代入 f(1) + f(-2)^2 中,得到:
91
+ f(1) + f(-2)^2 = 3 + (4)^2 = 3 + 16 = 19。
92
+ 因此,f(1) + f(-2)^2 的值为 19。
93
+
94
+ </details>
95
+
96
+ <details><summary><b>角色扮演</b></summary><p>
97
+
98
+ **问题**:作为有丰富营销经验的品牌经理,请为我提供一条关于以下情境的品牌营销策略:
99
+ 情境:618期间,提高品牌销售额
100
+ 品牌定位:国产精品眉笔
101
+ 建议要求:提供具有创新性和实效性的品牌营销策略,帮助提升品牌的销售额, 并加入直播带货的方案。
102
+
103
+ **解答**:在618期间,为了提高品牌销售额,我们可以采取以下品牌营销策略:
104
+
105
+ 1. 推出限时优惠活动:在618活动期间,我们可以在品牌官方网站上推出限时优惠活动,例如购买眉笔送赠品、打折促销等,以吸引消费者的注意力并促进销售。
106
+
107
+ 2. 利用社交媒体宣传:我们可以通过社交媒体平台(如微博、微信、抖音等)发布品牌宣传内容,包括产品介绍、使用技巧、用户评价等,以增加品牌曝光率和用户互动性。
108
+
109
+ 3. 与知名博主合作推广:我们可以邀请一些知名的美妆博主或网红来试用我们的产品,并在其社交媒体上发布推荐视频或图文,以吸引更多的目标受众。
110
+
111
+ 4. 直播带货:我们可以邀请一些知名的带货主播或明星来为我们的产品进行直播带货,通过他们的影响力来吸引更多的消费者关注和购买。
112
+
113
+ 5. 参与电商平台活动:我们可以积极参与各大电商平台的促销活动,例如满减、满赠等,以吸引更多的消费者购买我们的产品。
114
+
115
+ 综上所述,以上品牌营销策略可以帮助我们在618期间提高品牌销售额,同时加入直播带货的方案可以更好地吸引目标受众,提高销售效果。
116
+
117
+ </details>
118
+
119
+ <details><summary><b>诗歌创作</b></summary><p>
120
+
121
+ **问题**:写一个七言诗,描述岳飞的诗。诗词希望是宋代诗的风格。
122
+
123
+ **解答**:《岳飞》
124
+
125
+ 铁马金戈壮士魂,
126
+
127
+ 精忠报国志凌云。
128
+
129
+ 千古英雄谁可及,
130
+
131
+ 岳家军威震乾坤。
132
+
133
+ </details>
134
+
135
+ <details><summary><b>文化知识</b></summary><p>
136
+
137
+ **问题**:我要去成都武侯祠参观,能帮我详细介绍下武侯的生平事迹吗?
138
+
139
+ **解答**:诸葛亮(181年-234年),字孔明,号卧龙,徐州琅琊阳都(今山东临沂市沂南县)人,三国时期蜀汉丞相,杰出的政治家、军事家、文学家、书法家、发明家。早年随叔父诸葛玄到荆州,诸葛玄死后,诸葛亮就在隆中隐居。刘备三顾茅庐请出诸葛亮,联合东吴孙权于赤壁之战大败曹军,形成三国鼎足之势,又夺占荆益。刘备入蜀后,诸葛亮镇守荆州。刘备称帝后,诸葛亮被任命为丞相,主持朝政。后主刘禅继位,诸葛亮被封为武乡侯,领益州牧。勤勉谨慎,大小政事必亲自处理,赏罚严明;与东吴联盟,改善和西南各族的关系;实行屯田政策,加强战备。前后五次北伐中原,多以粮尽无功。终因积劳成疾,于234年病逝于五丈原(今陕西岐山县南),享年54岁。刘禅追封其为忠武侯,后世常以武侯尊称诸葛亮。东晋政权因其军事才能特追封他为武兴王。诸葛亮散文代表作有《出师表》、《诫子书》等。曾发明木牛流马、孔明灯等,并改造连弩,叫做诸葛连弩,可一弩十矢俱发。诸葛亮一生“鞠躬尽瘁、死而后已”,是中国传统文化中忠臣与智者的代表人物。
140
+
141
+ </details>
142
+
143
+ <details><summary><b>代码生成</b></summary><p>
144
+
145
+ **问题**:你是一名编程开发者,能写一个正则帮忙判断一个字符串是否是合理的手机号码, 用Python实现。
146
+
147
+ **解答**:
148
+ ```python
149
+ import re
150
+
151
+ def is_valid_phone(phone):
152
+ pattern = r'^1[3-9]\d{9}$'
153
+ return bool(re.match(pattern, phone))
154
+
155
+ print(is_valid_phone('13800138000')) # True
156
+ print(is_valid_phone('12345678901')) # False
157
+ ```
158
+
159
+ </details>
160
+
161
+ # 部署和推理
162
+
163
+ ## 安装依赖
164
+ ```shell
165
+ pip install -r requirements.txt
166
+ ```
167
+
168
+ + 如需使用flash attention需要安装[flash-attn](https://github.com/Dao-AILab/flash-attention)
169
+ + 如需使用LoRA等少参数微调方法需要额外安装[peft](https://github.com/huggingface/peft)
170
+
171
+ ## Python 样例
172
+
173
+ ```python
174
+ >>> from transformers import AutoModelForCausalLM, AutoTokenizer
175
+ >>> tokenizer = AutoTokenizer.from_pretrained("deeplang-ai/LingoWhale-8B", trust_remote_code=True)
176
+ >>> model = AutoModelForCausalLM.from_pretrained("deeplang-ai/LingoWhale-8B", device_map="auto", trust_remote_code=True)
177
+ >>> inputs = tokenizer("陋室铭\n唐 刘禹锡\n", return_tensors="pt")
178
+ >>> inputs = inputs.to("cuda:0")
179
+ >>> pred = model.generate(**inputs, max_new_tokens=100, repetition_penalty=1.1)
180
+ >>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
181
+ ```
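+
+ 如果显存有限,也可以在加载模型时通过`torch_dtype`指定以bfloat16精度加载(与config.json中记录的权重精度一致)。下面是一个最小示例,仅供参考:
+
+ ```python
+ # 以bfloat16精度加载模型的最小示例(仅供参考)
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("deeplang-ai/LingoWhale-8B", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "deeplang-ai/LingoWhale-8B",
+     device_map="auto",
+     torch_dtype=torch.bfloat16,  # 与仓库权重的存储精度一致
+     trust_remote_code=True,
+ )
+ inputs = tokenizer("陋室铭\n唐 刘禹锡\n", return_tensors="pt").to(model.device)
+ pred = model.generate(**inputs, max_new_tokens=100, repetition_penalty=1.1)
+ print(tokenizer.decode(pred[0].cpu(), skip_special_tokens=True))
+ ```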
182
+
183
+ # 微调方法
184
+ 模型微调样例代码以`transformers.Trainer`为基础,其中大部分参数和使用方法都可以参考Huggingface中[`Trainer`](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/trainer#trainer) 的教程和介绍。
185
+
186
+ > 本章节旨在展示微调过程,并不对该微调配置下进行微调后的模型效果进行保证。
187
+
188
+ ## 单机训练
189
+ 下面是一个单机进行微调的例子,使用的数据为从[COIG](https://huggingface.co/datasets/BAAI/COIG)数据集中随机选取的10000条指令微调数据,可以使用自己的数据进行替换。
190
+
191
+ ```shell
192
+ hostfile=""
193
+ deepspeed --hostfile=$hostfile finetune/finetune.py \
194
+ --report_to "none" \
195
+ --data_path "finetune/data/coig_10k.json" \
196
+ --model_name_or_path deeplang-ai/LingoWhale-8B \
197
+ --output_dir "output" \
198
+ --model_max_length 2048 \
199
+ --num_train_epochs 4 \
200
+ --per_device_train_batch_size 4 \
201
+ --gradient_accumulation_steps 1 \
202
+ --save_strategy epoch \
203
+ --learning_rate 2e-5 \
204
+ --lr_scheduler_type constant \
205
+ --adam_beta1 0.9 \
206
+ --adam_beta2 0.98 \
207
+ --adam_epsilon 1e-8 \
208
+ --max_grad_norm 1.0 \
209
+ --weight_decay 1e-4 \
210
+ --warmup_ratio 0.0 \
211
+ --logging_steps 1 \
212
+ --gradient_checkpointing True \
213
+ --deepspeed finetune/ds_config.json \
214
+ --bf16 True \
215
+ --tf32 True
216
+ ```
217
+
218
+ 若要替换为自己的数据,可以使用如下格式的json文件。
219
+ ```json
220
+ [
221
+ {
222
+ "id": 0,
223
+ "conversations": [
224
+ {
225
+ "from": "human",
226
+ "value": "请问什么是“模式年龄”?"
227
+ },
228
+ {
229
+ "from": "model",
230
+ "value": "模式年龄是指利用放射性衰变规律假定地质样品形成时的初始同位素组成计算得到的年龄。"
231
+ },
232
+ ...
233
+ ]
234
+ },
235
+ ...
236
+ ]
237
+ ```
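+
+ 在替换为自己的数据之前,可以先用类似下面的脚本检查数据文件是否符合上述格式(示例中的文件路径请替换为自己的数据路径),仅供参考:
+
+ ```python
+ # 检查指令微调数据格式的简单脚本(仅供参考)
+ import json
+
+ with open("finetune/data/coig_10k.json", "r", encoding="utf-8") as f:
+     samples = json.load(f)
+
+ for sample in samples:
+     assert "id" in sample and "conversations" in sample, f"样本缺少必要字段: {sample}"
+     for turn in sample["conversations"]:
+         assert turn["from"] in ("human", "model"), f"未知的对话角色: {turn['from']}"
+         assert isinstance(turn["value"], str) and turn["value"], "对话内容不能为空"
+
+ print(f"共 {len(samples)} 条样本,格式检查通过")
+ ```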
238
+
239
+ ## 多机训练
240
+
241
+ 多机器训练需要编辑如下格式的`hostfile`文件。其中,每一行表示一个机器,`ip_address-X`为各个机器对应的ip地址,`slots`内容表示机器可用GPU数量。内容格式如下:
242
+
243
+ ```
244
+ ip_address-1 slots=8
245
+ ip_address-2 slots=8
246
+ ip_address-3 slots=8
247
+ ip_address-4 slots=8
248
+ ...
249
+ ```
250
+
251
+ 同时指定hostfile参数为`hostfile`文件路径,然后运行如下命令即可启动多机训练。
252
+
253
+ ```shell
254
+ hostfile="/path/to/hostfile"
255
+ deepspeed --hostfile=$hostfile finetune/finetune.py \
256
+ --report_to "none" \
257
+ --data_path "finetune/data/coig_10k.json" \
258
+ --model_name_or_path deeplang-ai/LingoWhale-8B \
259
+ --output_dir "output" \
260
+ --model_max_length 2048 \
261
+ --num_train_epochs 4 \
262
+ --per_device_train_batch_size 4 \
263
+ --gradient_accumulation_steps 1 \
264
+ --save_strategy epoch \
265
+ --learning_rate 2e-5 \
266
+ --lr_scheduler_type constant \
267
+ --adam_beta1 0.9 \
268
+ --adam_beta2 0.98 \
269
+ --adam_epsilon 1e-8 \
270
+ --max_grad_norm 1.0 \
271
+ --weight_decay 1e-4 \
272
+ --warmup_ratio 0.0 \
273
+ --logging_steps 1 \
274
+ --gradient_checkpointing True \
275
+ --deepspeed finetune/ds_config.json \
276
+ --bf16 True \
277
+ --tf32 True
278
+ ```
279
+ ## 少参数微调
280
+ 通过使用[peft](https://github.com/huggingface/peft),可以轻松调用LoRA、Prefix-Tuning等少参数微调方法。目前代码中已集成LoRA的训练方法,可以通过加入`--use_lora True`启动。
281
+
282
+ 使用LoRA训练的checkpoint可以通过下面的代码读取和调用:
283
+ ```python
284
+ from peft import AutoPeftModelForCausalLM
285
+ model = AutoPeftModelForCausalLM.from_pretrained("output", trust_remote_code=True)
286
+ ```
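+
+ 加载后即可像普通`transformers`模型一样推理;如需部署,也可以借助peft的`merge_and_unload`将LoRA权重合并回基座模型。下面是一个最小示例,仅供参考:
+
+ ```python
+ # 使用LoRA checkpoint推理并合并权重的最小示例(仅供参考)
+ from peft import AutoPeftModelForCausalLM
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("deeplang-ai/LingoWhale-8B", trust_remote_code=True)
+ model = AutoPeftModelForCausalLM.from_pretrained("output", trust_remote_code=True)
+
+ inputs = tokenizer("陋室铭\n唐 刘禹锡\n", return_tensors="pt")
+ pred = model.generate(**inputs, max_new_tokens=100)
+ print(tokenizer.decode(pred[0], skip_special_tokens=True))
+
+ merged_model = model.merge_and_unload()  # 合并LoRA权重,得到普通的基座模型
+ merged_model.save_pretrained("output_merged")
+ ```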
287
+
288
+ # 开源协议
289
+ 社区使用LingoWhale-8B模型需要遵循[Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)和[《LingoWhale-8B模型许可协议》](MODEL_LICENSE.md)。若您期望将此源模型或其衍生品用作商业用途,请参考[《LingoWhale-8B模型许可协议》](MODEL_LICENSE.md)。
README_EN.md ADDED
@@ -0,0 +1,295 @@
1
+ <p align="left">
2
+ English&nbsp; | &nbsp;<a href="README.md">中文</a>
3
+ </p>
4
+ <br>
5
+
6
+ <div align="center">
7
+ <h1>
8
+ LingoWhale-8B
9
+ </h1>
10
+ </div>
11
+
12
+ <p align="center">
13
+ 🤗 <a href="https://huggingface.co/deeplang-ai/LingoWhale-8B" target="_blank">Hugging Face</a> • 🤖 <a href="https://www.modelscope.cn/models/DeepLang/LingoWhale-8B" target="_blank">ModelScope</a> • ⛵ <a href="https://wisemodel.cn/models/%E5%8C%97%E4%BA%AC%E6%B7%B1%E8%A8%80%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E8%B4%A3%E4%BB%BB%E5%85%AC%E5%8F%B8/LingoWhale-8B/" target="_blank">Wisemodel</a>
14
+ </p>
15
+
16
+ <div align="center">
17
+ <strong>
18
+ LingoWhale-8B model open-sourced by DeepLangAI in collaboration with THUNLP Lab 🎉
19
+ </strong>
20
+ </div>
21
+
22
+ # Table of Contents
23
+
24
+ - [Introduction](#introduction)
25
+ - [Evaluation](#evaluation)
26
+ - [Generated Examples](#generated-examples)
27
+ - [Deployment and Inference](#deployment-and-inference)
28
+ - [Fine-tuning](#fine-tuning)
29
+ - [Open Source License](#open-source-license)
30
+
31
+ # Introduction
32
+
33
+ LingoWhale-8B is the first open-source model in the LingoWhale series introduced by DeepLangAI. It's a bilingual (Chinese-English) large language model.
34
+
35
+ LingoWhale-8B has been pre-trained on a large volume of high-quality bilingual data and exhibits powerful capabilities as a foundation model. It has achieved leading results on multiple public benchmarks. During its pre-training phase, the model was trained with a context window of 8K, allowing it to comprehend and generate longer sequences.
36
+
37
+ LingoWhale-8B is fully open for academic research. Users can apply for commercial use by email, and once granted official commercial permission, they can use it for free.
38
+
39
+ Along with open-sourcing the model weights, we also provide a Huggingface inference interface and parameter efficient fine-tuning examples like LoRA, making it easier for developers to use the LingoWhale-8B model.
40
+
41
+ Due to the scale of model parameters, intrinsic issues of large language models like hallucination and relatively weak mathematical computation capabilities persist in LingoWhale-8B. Please understand these issues and evaluate the possible risks before using the model. Future versions of the LingoWhale model will focus on optimizing these areas.
42
+
43
+ # Evaluation
44
+
45
+ We tested on the following public evaluation benchmarks:
46
+
47
+ - [C-Eval](https://arxiv.org/abs/2305.08322) is a Chinese foundation model evaluation benchmark consisting of 13,948 multiple-choice questions, covering 52 different subjects and four difficulty levels. It aims to assess the capability of Chinese language models. We used the dev set of this dataset as a few-shot source and conducted a 5-shot test on the test set.
48
+
49
+ - [MMLU](https://arxiv.org/abs/2009.03300) is an English foundation model evaluation benchmark that spans various domains like basic mathematics, American history, computer science, law, among others, with a total of 57 tasks. It evaluates language models' performance on different domain tasks. We performed a 5-shot test on this benchmark.
50
+
51
+ - [CMMLU](https://arxiv.org/abs/2306.09212) is a Chinese evaluation benchmark that encompasses 67 topics ranging from basic subjects to advanced professional levels. It evaluates Chinese language models' performance in knowledge and reasoning capabilities. We used the dev set of this dataset as a few-shot source and conducted a 5-shot test on the test set.
52
+
53
+ - [Gaokao](https://arxiv.org/abs/2305.12474) is an evaluation benchmark based on the dataset of Chinese college entrance examination questions. It aims to provide an assessment of Chinese language models in terms of language comprehension and logical reasoning capabilities. We retained only the four-option multiple-choice questions from it and conducted a 5-shot test after random partitioning.
54
+
55
+ - [HumanEval](https://arxiv.org/abs/2107.03374) is an English evaluation benchmark containing over one hundred coding problems. It assesses language models' abilities in code comprehension and generation. We adopted a zero-shot setting and the Pass@1 metric for testing the model.
56
+
57
+ - [GSM8K](https://arxiv.org/abs/2110.14168) is a dataset composed of high-quality elementary school math application problems. It requires the models to select the most appropriate solution based on the provided scenario and evaluates the models' capabilities in mathematical application. We conducted an 8-shot test on this benchmark.
58
+
59
+ - [BBH](https://arxiv.org/abs/2210.09261) is an evaluation benchmark formed from a selection of challenging tasks out of 204 Big-Bench benchmark tasks. We performed a 3-shot test on this benchmark.
60
+
61
+ - [AGIEval](https://arxiv.org/abs/2304.06364) is a benchmark to examine foundation models' human-like capabilities, specifically assessing foundational models' abilities in human cognition and problem-solving tasks. We retained only the four-option multiple-choice questions from it and conducted a 5-shot test after random partitioning.
62
+
63
+ These evaluation benchmarks provide standardized tests and metrics to assess language models' performance and capabilities across various tasks and domains. The evaluation results are shown in the table below:
64
+
65
+ | **Model** |**C-Eval**| **MMLU** |**CMMLU** |**GAOKAO**|**HumanEval**|**GSM8K** | **BBH** |**AGIEval**|
66
+ |:-----------------------|:--------:|:--------:|:--------:|:--------:|:-----------:|:--------:|:--------:|:---------:|
67
+ | | 5-shot | 5-shot | 5-shot | 5-shot | 0-shot | 8-shot | 3-shot | 5-shot |
68
+ | **GPT-4** | 68.4 | 83.9 | 70.3 | 66.2 | 69.5 | 90.0 | 75.1 | 63.3 |
69
+ | **GPT-3.5 Turbo** | 51.1 | 68.5 | 54.1 | 47.1 | 52.4 | 57.8 | 61.6 | 46.1 |
70
+ | **LLaMA2-7B** | 28.9 | 45.7 | 31.4 | 26.0 | 12.8 | 16.2 | 39.2 | 26.5 |
71
+ | **ChatGLM2-6B** | 50.2 | 45.9 | 49.0 | 49.4 | 9.2 | 28.9 | 31.7 | 45.3 |
72
+ | **Baichuan2-7B-Base** | 54.0 | 54.2 | 57.1 | 47.5 | 18.3 | 24.5 | 41.6 | 42.7 |
73
+ | **Qwen-7B v1.1** | 63.5 | 58.2 | 62.2 | - | 29.9 | 51.7 | 45.0 | - |
74
+ | **LingoWhale-8B-base** | 63.6 | 60.2 | 62.8 | 50.3 | 32.9 | 55.0 | 47.5 | 43.8 |
75
+
76
+ <span style="color:gray">For all of the above comparison models, we list their official reports and round them to one decimal place.</span>
77
+
78
+ # Generated Examples
79
+
80
+ The following examples are generated by a chat model fine-tuned from the LingoWhale-8B base model using open-source SFT data.
81
+
82
+ <details><summary><b>Mathematical calculations</b></summary><p>
83
+
84
+ **Query**:f(x)为偶函数, f(1) = 3, f(2) = 4, 那么f(1) + f(-2) ^2 等于多少? 请一步一步说明
85
+
86
+ **Answer**:首先,根据题目中给出的条件,我们可以知道:
87
+
88
+ - f(x) 是一个偶函数,即 f(-x) = f(x)。
89
+ - f(1) = 3。
90
+ - f(2) = 4。
91
+
92
+ 接下来,我们需要求出 f(1) + f(-2)^2 的值。
93
+ 由于 f(x) 是偶函数,所以 f(-2) = f(2),因此 f(-2) = 4。
94
+ 将 f(1) 和 f(-2) 的值代入 f(1) + f(-2)^2 中,得到:
95
+ f(1) + f(-2)^2 = 3 + (4)^2 = 3 + 16 = 19。
96
+ 因此,f(1) + f(-2)^2 的值为 19。
97
+
98
+ </details>
99
+
100
+ <details><summary><b>Role playing</b></summary><p>
101
+
102
+ **Query**:作为有丰富营销经验的品牌经理,请为我提供一条关于以下情境的品牌营销策略:
103
+ 情境:618期间,提高品牌销售额
104
+ 品牌定位:国产精品眉笔
105
+ 建议要求:提供具有创新性和实效性的品牌营销策略,帮助提升品牌的销售额, 并加入直播带货的方案。
106
+
107
+ **Answer**:在618期间,为了提高品牌销售额,我们可以采取以下品牌营销策略:
108
+
109
+ 1. 推出限时优惠活动:在618活动期间,我们可以在品牌官方网站上推出限时优惠活动,例如购买眉笔送赠品、打折促销等,以吸引消费者的注意力并促进销售。
110
+
111
+ 2. 利用社交媒体宣传:我们可以通过社交媒体平台(如微博、微信、抖音等)发布品牌宣传内容,包括产品介绍、使用技巧、用户评价等,以增加品牌曝光率和用户互动性。
112
+
113
+ 3. 与知名博主合作推广:我们可以邀请一些知名的美妆博主或网红来试用我们的产品,并在其社交媒体上发布推荐视频或图文,以吸引更多的目标受众。
114
+
115
+ 4. 直播带货:我们可以邀请一些知名的带货主播或明星来为我们的产品进行直播带货,通过他们的影响力来吸引更多的消费者关注和购买。
116
+
117
+ 5. 参与电商平台活动:我们可以积极参与各大电商平台的促销活动,例如满减、满赠等,以吸引更多的消费者购买我们的产品。
118
+
119
+ 综上所述,以上品牌营销策略可以帮助我们在618期间提高品牌销售额,同时加入直播带货的方案可以更好地吸引目标受众,提高销售效果。
120
+
121
+ </details>
122
+
123
+ <details><summary><b>Poetry Writing</b></summary><p>
124
+
125
+ **Query**:写一个七言诗,描述岳飞的诗。诗词希望是宋代诗的风格。
126
+
127
+ **Answer**:《岳飞》
128
+
129
+ 铁马金戈壮士魂,
130
+
131
+ 精忠报国志凌云。
132
+
133
+ 千古英雄谁可及,
134
+
135
+ 岳家军威震乾坤。
136
+
137
+ </details>
138
+
139
+ <details><summary><b>Cultural knowledge</b></summary><p>
140
+
141
+ **Query**:我要去成都武侯祠参观,能帮我详细介绍下武侯的生平事迹吗?
142
+
143
+ **Answer**:诸葛亮(181年-234年),字孔明,号卧龙,徐州琅琊阳都(今山东临沂市沂南县)人,三国时期蜀汉丞相,杰出的政治家、军事家、文学家、书法家、发明家。早年随叔父诸葛玄到荆州,诸葛玄死后,诸葛亮就在隆中隐居。刘备三顾茅庐请出诸葛亮,联合东吴孙权于赤壁之战大败曹军,形成三国鼎足之势,又夺占荆益。刘备入蜀后,诸葛亮镇守荆州。刘备称帝后,诸葛亮被任命为丞相,主持朝政。后主刘禅继位,诸葛亮被封为武乡侯,领益州牧。勤勉谨慎,大小政事必亲自处理,赏罚严明;与东吴联盟,改善和西南各族的关系;实行屯田政策,加强战备。前后五次北伐中原,多以粮尽无功。终因积劳成疾,于234年病逝于五丈原(今陕西岐山县南),享年54岁。刘禅追封其为忠武侯,后世常以武侯尊称诸葛亮。东晋政权因其军事才能特追封他为武兴王。诸葛亮散文代表作有《出师表》、《诫子书》等。曾发明木牛流马、孔明灯等,并改造连弩,叫做诸葛连弩,可一弩十矢俱发。诸葛亮一生“鞠躬尽瘁、死而后已”,是中国传统文化中忠臣与智者的代表人物。
144
+
145
+ </details>
146
+
147
+ <details><summary><b>Code generation</b></summary><p>
148
+
149
+ **Query**:你是一名编程开发者,能写一个正则帮忙判断一个字符串是否是合理的手机号码, 用Python实现。
150
+
151
+ **Answer**:
152
+ ```python
153
+ import re
154
+
155
+ def is_valid_phone(phone):
156
+ pattern = r'^1[3-9]\d{9}$'
157
+ return bool(re.match(pattern, phone))
158
+
159
+ print(is_valid_phone('13800138000')) # True
160
+ print(is_valid_phone('12345678901')) # False
161
+ ```
162
+
163
+ </details>
164
+
165
+ # Deployment and Inference
166
+
167
+ ## Install Dependencies
168
+ ```shell
169
+ pip install -r requirements.txt
170
+ ```
171
+
172
+ + To use flash attention, you need to install [flash-attn](https://github.com/Dao-AILab/flash-attention)
173
+
174
+ + For using LoRA or other parameter efficient fine-tuning methods, please install [peft](https://github.com/huggingface/peft)
175
+
176
+ ## Python Example
177
+
178
+ ```python
179
+ >>> from transformers import AutoModelForCausalLM, AutoTokenizer
180
+ >>> tokenizer = AutoTokenizer.from_pretrained("deeplang-ai/LingoWhale-8B", trust_remote_code=True)
181
+ >>> model = AutoModelForCausalLM.from_pretrained("deeplang-ai/LingoWhale-8B", device_map="auto", trust_remote_code=True)
182
+ >>> inputs = tokenizer("陋室铭\n唐 刘禹锡\n", return_tensors="pt")
183
+ >>> inputs = inputs.to("cuda:0")
184
+ >>> pred = model.generate(**inputs, max_new_tokens=100, repetition_penalty=1.1)
185
+ >>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
186
+ ```
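+
+ If GPU memory is limited, the weights can also be loaded in bfloat16 (the precision recorded in `config.json`) by passing `torch_dtype`; a minimal sketch, for reference only:
+
+ ```python
+ # Minimal sketch: load the model in bfloat16 (for reference only)
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("deeplang-ai/LingoWhale-8B", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "deeplang-ai/LingoWhale-8B",
+     device_map="auto",
+     torch_dtype=torch.bfloat16,  # matches the storage precision of the released weights
+     trust_remote_code=True,
+ )
+ inputs = tokenizer("陋室铭\n唐 刘禹锡\n", return_tensors="pt").to(model.device)
+ pred = model.generate(**inputs, max_new_tokens=100, repetition_penalty=1.1)
+ print(tokenizer.decode(pred[0].cpu(), skip_special_tokens=True))
+ ```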
187
+
188
+ # Fine-tuning
189
+ The fine-tuning example is based on `transformers.Trainer`. For a more detailed guide on the arguments and their usage, please refer to the Huggingface [`Trainer`](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/trainer#trainer) tutorial.
190
+
191
+ The aim of this section is to showcase the fine-tuning process. No guarantees are made about the model performance under this fine-tuning configuration.
192
+
193
+ ## Single-Machine Training
194
+ Below is an example of fine-tuning on a single machine. The data used consists of 10,000 instruction fine-tuning examples randomly selected from the [COIG](https://huggingface.co/datasets/BAAI/COIG) dataset. You can replace it with your own data.
195
+
196
+ ```shell
197
+ hostfile=""
198
+ deepspeed --hostfile=$hostfile finetune/finetune.py \
199
+ --report_to "none" \
200
+ --data_path "finetune/data/coig_10k.json" \
201
+ --model_name_or_path deeplang-ai/LingoWhale-8B \
202
+ --output_dir "output" \
203
+ --model_max_length 2048 \
204
+ --num_train_epochs 4 \
205
+ --per_device_train_batch_size 16 \
206
+ --gradient_accumulation_steps 1 \
207
+ --save_strategy epoch \
208
+ --learning_rate 2e-5 \
209
+ --lr_scheduler_type constant \
210
+ --adam_beta1 0.9 \
211
+ --adam_beta2 0.98 \
212
+ --adam_epsilon 1e-8 \
213
+ --max_grad_norm 1.0 \
214
+ --weight_decay 1e-4 \
215
+ --warmup_ratio 0.0 \
216
+ --logging_steps 1 \
217
+ --gradient_checkpointing True \
218
+ --deepspeed finetune/ds_config.json \
219
+ --bf16 True \
220
+ --tf32 True
221
+ ```
222
+
223
+ To use your own data, please convert it to the JSON format below.
224
+ ```json
225
+ [
226
+ {
227
+ "id": 0,
228
+ "conversations": [
229
+ {
230
+ "from": "human",
231
+ "value": "请问什么是“模式年龄”?"
232
+ },
233
+ {
234
+ "from": "model",
235
+ "value": "模式年龄是指利用放射性衰变规律假定地质样品形成时的初始同位素组成计算得到的年龄。"
236
+ },
237
+ ...
238
+ ]
239
+ },
240
+ ...
241
+ ]
242
+ ```
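+
+ Before switching to your own data, you can sanity-check the file against the format above with a small script like the following (replace the path with your own data file); a minimal sketch, for reference only:
+
+ ```python
+ # Minimal sketch: validate the instruction fine-tuning data format (for reference only)
+ import json
+
+ with open("finetune/data/coig_10k.json", "r", encoding="utf-8") as f:
+     samples = json.load(f)
+
+ for sample in samples:
+     assert "id" in sample and "conversations" in sample, f"sample is missing required fields: {sample}"
+     for turn in sample["conversations"]:
+         assert turn["from"] in ("human", "model"), f"unknown speaker: {turn['from']}"
+         assert isinstance(turn["value"], str) and turn["value"], "conversation value must be a non-empty string"
+
+ print(f"{len(samples)} samples passed the format check")
+ ```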
243
+
244
+ ## Multi-Machine Training
245
+
246
+ For multi-machine training, you need to create a `hostfile` in the following format. Each line represents a machine. `ip_address-X` refers to the IP address of each machine, and the `slots` content indicates the number of available GPUs on the machine. The content format is as follows:
247
+
248
+ ```
249
+ ip_address-1 slots=8
250
+ ip_address-2 slots=8
251
+ ip_address-3 slots=8
252
+ ip_address-4 slots=8
253
+ ...
254
+ ```
255
+
256
+ Next, set the `hostfile` variable to the path of your hostfile, then run the following command to start multi-machine training.
257
+
258
+ ```shell
259
+ hostfile="/path/to/hostfile"
260
+ deepspeed --hostfile=$hostfile finetune/finetune.py \
261
+ --report_to "none" \
262
+ --data_path "finetune/data/coig_10k.json" \
263
+ --model_name_or_path deeplang-ai/LingoWhale-8B \
264
+ --output_dir "output" \
265
+ --model_max_length 2048 \
266
+ --num_train_epochs 4 \
267
+ --per_device_train_batch_size 16 \
268
+ --gradient_accumulation_steps 1 \
269
+ --save_strategy epoch \
270
+ --learning_rate 2e-5 \
271
+ --lr_scheduler_type constant \
272
+ --adam_beta1 0.9 \
273
+ --adam_beta2 0.98 \
274
+ --adam_epsilon 1e-8 \
275
+ --max_grad_norm 1.0 \
276
+ --weight_decay 1e-4 \
277
+ --warmup_ratio 0.0 \
278
+ --logging_steps 1 \
279
+ --gradient_checkpointing True \
280
+ --deepspeed finetune/ds_config.json \
281
+ --bf16 True \
282
+ --tf32 True
283
+ ```
284
+
285
+ ## Parameter-Efficient Fine-Tuning
286
+ By using [peft](https://github.com/huggingface/peft), you can easily apply parameter-efficient fine-tuning methods like LoRA, Prefix-Tuning, etc. The training method for LoRA is currently integrated into the code, which can be activated by adding `--use_lora True`.
287
+
288
+ LoRA checkpoints can be loaded using the following code:
289
+ ```python
290
+ from peft import AutoPeftModelForCausalLM
291
+ model = AutoPeftModelForCausalLM.from_pretrained("output", trust_remote_code=True)
292
+ ```
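+
+ Once loaded, the model can be used for inference like a regular `transformers` model; for deployment, the LoRA weights can also be merged back into the base model with peft's `merge_and_unload`. A minimal sketch, for reference only:
+
+ ```python
+ # Minimal sketch: run inference with a LoRA checkpoint and merge the weights (for reference only)
+ from peft import AutoPeftModelForCausalLM
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("deeplang-ai/LingoWhale-8B", trust_remote_code=True)
+ model = AutoPeftModelForCausalLM.from_pretrained("output", trust_remote_code=True)
+
+ inputs = tokenizer("陋室铭\n唐 刘禹锡\n", return_tensors="pt")
+ pred = model.generate(**inputs, max_new_tokens=100)
+ print(tokenizer.decode(pred[0], skip_special_tokens=True))
+
+ merged_model = model.merge_and_unload()  # fold the LoRA weights into the base model
+ merged_model.save_pretrained("output_merged")
+ ```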
293
+
294
+ # Open Source License
295
+ The community use of the LingoWhale-8B model must adhere to the [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) and the [LingoWhale-8B Model License Agreement](MODEL_LICENSE.md). If you wish to use this source model or its derivatives for commercial purposes, please refer to [LingoWhale-8B Model License Agreement](MODEL_LICENSE.md).
config.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "architectures": [
3
+ "LingoWhaleForCausalLM"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_lingowhale.LingoWhaleConfig",
7
+ "AutoModelForCausalLM": "modeling_lingowhale.LingoWhaleForCausalLM"
8
+ },
9
+ "tokenizer_class": "LingoWhaleTokenizer",
10
+ "bos_token_id": 1,
11
+ "eos_token_id": 2,
12
+ "hidden_act": "silu",
13
+ "hidden_size": 4096,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 11008,
16
+ "max_position_embeddings": 8192,
17
+ "model_max_length": 8192,
18
+ "model_type": "lingowhale",
19
+ "num_attention_heads": 32,
20
+ "num_hidden_layers": 36,
21
+ "pad_token_id": 0,
22
+ "rms_norm_eps": 1e-06,
23
+ "_from_model_config": true,
24
+ "tie_word_embeddings": false,
25
+ "torch_dtype": "bfloat16",
26
+ "transformers_version": "4.31.0",
27
+ "use_cache": true,
28
+ "emb_dropout_prob": 0.0,
29
+ "attn_dropout_prob": 0.0,
30
+ "vocab_size": 96000,
31
+ "rope_theta": 10000.0,
32
+ "use_flash_attention": true
33
+ }
configuration_lingowhale.py ADDED
@@ -0,0 +1,74 @@
1
+ # Copyright 2023 DeepLang AI. All Rights Reserved.
2
+ #
3
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
4
+ # and OPT implementations in this library. It has been modified from its
5
+ # original forms to accommodate minor architectural differences compared
6
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
7
+ #
8
+ # Licensed under the Apache License, Version 2.0 (the "License");
9
+ # you may not use this file except in compliance with the License.
10
+ # You may obtain a copy of the License at
11
+ #
12
+ # http://www.apache.org/licenses/LICENSE-2.0
13
+ #
14
+ # Unless required by applicable law or agreed to in writing, software
15
+ # distributed under the License is distributed on an "AS IS" BASIS,
16
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17
+ # See the License for the specific language governing permissions and
18
+ # limitations under the License.
19
+
20
+ from transformers.configuration_utils import PretrainedConfig
21
+ from transformers.utils import logging
22
+
23
+ logger = logging.get_logger(__name__)
24
+
25
+
26
+ class LingoWhaleConfig(PretrainedConfig):
27
+ model_type = "lingowhale"
28
+ keys_to_ignore_at_inference = ["past_key_values"]
29
+
30
+ def __init__(
31
+ self,
32
+ vocab_size=96000,
33
+ hidden_size=4096,
34
+ intermediate_size=11008,
35
+ num_hidden_layers=36,
36
+ num_attention_heads=32,
37
+ hidden_act="silu",
38
+ max_position_embeddings=8192,
39
+ initializer_range=0.02,
40
+ rms_norm_eps=1e-6,
41
+ emb_dropout_prob=0.0,
42
+ attn_dropout_prob=0.0,
43
+ use_cache=True,
44
+ pad_token_id=0,
45
+ bos_token_id=1,
46
+ eos_token_id=2,
47
+ tie_word_embeddings=False,
48
+ rope_theta=10000.0,
49
+ rope_scaling=None,
50
+ use_flash_attention=True,
51
+ **kwargs,
52
+ ):
53
+ self.vocab_size = vocab_size
54
+ self.max_position_embeddings = max_position_embeddings
55
+ self.hidden_size = hidden_size
56
+ self.intermediate_size = intermediate_size
57
+ self.num_hidden_layers = num_hidden_layers
58
+ self.num_attention_heads = num_attention_heads
59
+ self.hidden_act = hidden_act
60
+ self.initializer_range = initializer_range
61
+ self.rms_norm_eps = rms_norm_eps
62
+ self.emb_dropout_prob = emb_dropout_prob
63
+ self.attn_dropout_prob = attn_dropout_prob
64
+ self.use_cache = use_cache
65
+ self.rope_theta = rope_theta
66
+ self.rope_scaling = rope_scaling
67
+ self.use_flash_attention = use_flash_attention
68
+ super().__init__(
69
+ pad_token_id=pad_token_id,
70
+ bos_token_id=bos_token_id,
71
+ eos_token_id=eos_token_id,
72
+ tie_word_embeddings=tie_word_embeddings,
73
+ **kwargs,
74
+ )
generation_config.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "pad_token_id": 0,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "user_token_id": 3,
6
+ "assistant_token_id": 4,
7
+ "max_new_tokens": 2048,
8
+ "temperature": 0.9,
9
+ "top_k": 1,
10
+ "top_p": 1,
11
+ "repetition_penalty": 1.1,
12
+ "do_sample": false,
13
+ "transformers_version": "4.31.0"
14
+ }
modeling_lingowhale.py ADDED
@@ -0,0 +1,941 @@
1
+ # Copyright 2023 DeepLang AI. All Rights Reserved.
2
+ #
3
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
4
+ # and OPT implementations in this library. It has been modified from its
5
+ # original forms to accommodate minor architectural differences compared
6
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
7
+ #
8
+ # Licensed under the Apache License, Version 2.0 (the "License");
9
+ # you may not use this file except in compliance with the License.
10
+ # You may obtain a copy of the License at
11
+ #
12
+ # http://www.apache.org/licenses/LICENSE-2.0
13
+ #
14
+ # Unless required by applicable law or agreed to in writing, software
15
+ # distributed under the License is distributed on an "AS IS" BASIS,
16
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17
+ # See the License for the specific language governing permissions and
18
+ # limitations under the License.
19
+
20
+ import math
21
+ import os
22
+ from typing import List, Optional, Tuple, Union
23
+
24
+ import torch
25
+ import torch.utils.checkpoint
26
+ from torch import nn
27
+ from torch.nn import CrossEntropyLoss
28
+ from torch.nn import functional as F
29
+ from transformers import PretrainedConfig, PreTrainedModel
30
+ from transformers.activations import ACT2FN
31
+ from transformers.modeling_outputs import (BaseModelOutputWithPast,
32
+ CausalLMOutputWithPast)
33
+ from transformers.utils import logging
34
+
35
+ from .configuration_lingowhale import LingoWhaleConfig
36
+
37
+ logger = logging.get_logger(__name__)
38
+
39
+ try:
40
+ from einops import rearrange
41
+ except ImportError:
42
+ rearrange = None
43
+
44
+ try:
45
+ from flash_attn.flash_attn_interface import flash_attn_unpadded_func
46
+ except ImportError:
47
+ try:
48
+ from flash_attn.flash_attn_interface import \
49
+ flash_attn_varlen_func as flash_attn_unpadded_func
50
+ except ImportError:
51
+ flash_attn_unpadded_func = None
52
+
53
+
54
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
55
+ def _make_causal_mask(
56
+ input_ids_shape: torch.Size,
57
+ dtype: torch.dtype,
58
+ device: torch.device,
59
+ past_key_values_length: int = 0,
60
+ ):
61
+ """
62
+ Make causal mask used for bi-directional self-attention.
63
+ """
64
+ bsz, tgt_len = input_ids_shape
65
+ mask = torch.full(
66
+ (tgt_len, tgt_len),
67
+ torch.tensor(torch.finfo(dtype).min, device=device),
68
+ device=device,
69
+ )
70
+ mask_cond = torch.arange(mask.size(-1), device=device)
71
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
72
+ mask = mask.to(dtype)
73
+
74
+ if past_key_values_length > 0:
75
+ mask = torch.cat(
76
+ [
77
+ torch.zeros(tgt_len,
78
+ past_key_values_length,
79
+ dtype=dtype,
80
+ device=device),
81
+ mask,
82
+ ],
83
+ dim=-1,
84
+ )
85
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len,
86
+ tgt_len + past_key_values_length)
87
+
88
+
89
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
90
+ def _expand_mask(mask: torch.Tensor,
91
+ dtype: torch.dtype,
92
+ tgt_len: Optional[int] = None):
93
+ """
94
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
95
+ """
96
+
97
+ bsz, src_len = mask.size()
98
+ tgt_len = tgt_len if tgt_len is not None else src_len
99
+
100
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len,
101
+ src_len).to(dtype)
102
+
103
+ inverted_mask = 1.0 - expanded_mask
104
+
105
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool),
106
+ torch.finfo(dtype).min)
107
+
108
+
109
+ class LingoWhaleRMSNorm(torch.nn.Module):
110
+
111
+ def __init__(self, hidden_size: int, eps: float = 1e-6):
112
+ super().__init__()
113
+ self.eps = eps
114
+ self.weight = nn.Parameter(torch.ones(hidden_size))
115
+
116
+ def _norm(self, x):
117
+ return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
118
+
119
+ def forward(self, x):
120
+ output = self._norm(x.float()).type_as(x)
121
+ return output * self.weight
122
+
123
+
124
+ class LingoWhaleRotaryEmbedding(torch.nn.Module):
125
+
126
+ def __init__(self,
127
+ dim,
128
+ max_position_embeddings=2048,
129
+ base=10000,
130
+ device=None):
131
+ super().__init__()
132
+ self.inv_freq = 1.0 / (base**(
133
+ torch.arange(0, dim, 2).float().to(device) / dim))
134
+ self.max_seq_len_cached = max_position_embeddings
135
+ t = torch.arange(self.max_seq_len_cached,
136
+ device=self.inv_freq.device,
137
+ dtype=torch.float32)
138
+ freqs = torch.outer(t, self.inv_freq)
139
+ emb = torch.cat((freqs, freqs), dim=-1)
140
+ self.cos_cached = emb.cos()[None, None, :, :].to(torch.float32)
141
+ self.sin_cached = emb.sin()[None, None, :, :].to(torch.float32)
142
+
143
+ def forward(self, x, seq_len=None):
144
+ # x: [bs, num_attention_heads, seq_len, head_size]
145
+ # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
146
+ if seq_len > self.max_seq_len_cached:
147
+ self.max_seq_len_cached = seq_len
148
+ t = torch.arange(
149
+ self.max_seq_len_cached,
150
+ device=self.inv_freq.device,
151
+ dtype=torch.float32,
152
+ )
153
+ freqs = torch.outer(t, self.inv_freq)
154
+ emb = torch.cat((freqs, freqs), dim=-1)
155
+ self.cos_cached = emb.cos()[None, None, :, :].to(torch.float32).to(
156
+ x.device)
157
+ self.sin_cached = emb.sin()[None, None, :, :].to(torch.float32).to(
158
+ x.device)
159
+ elif self.cos_cached.device != x.device:
160
+ self.cos_cached = self.cos_cached.to(x.device)
161
+ self.sin_cached = self.sin_cached.to(x.device)
162
+ return (
163
+ self.cos_cached[:, :, :seq_len, ...],
164
+ self.sin_cached[:, :, :seq_len, ...],
165
+ )
166
+
167
+
168
+ def rotate_half(x):
169
+ """Rotates half the hidden dims of the input."""
170
+ x1 = x[..., :x.shape[-1] // 2]
171
+ x2 = x[..., x.shape[-1] // 2:]
172
+ return torch.cat((-x2, x1), dim=-1)
173
+
174
+
175
+ def apply_rotary_pos_emb(q, k, cos_, sin_, position_ids):
176
+ cos = cos_.squeeze(1).squeeze(0) # [seq_len, dim]
177
+ sin = sin_.squeeze(1).squeeze(0) # [seq_len, dim]
178
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
179
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
180
+ q_embed = (q.float() * cos) + (rotate_half(q.float()) * sin)
181
+ k_embed = (k.float() * cos) + (rotate_half(k.float()) * sin)
182
+ return q_embed.to(q.dtype), k_embed.to(k.dtype)
183
+
184
+
185
+ class LingoWhaleMLP(nn.Module):
186
+
187
+ def __init__(self, config):
188
+ super().__init__()
189
+ self.config = config
190
+ self.hidden_size = config.hidden_size
191
+ self.intermediate_size = config.intermediate_size
192
+ self.gate_and_up_proj = nn.Linear(self.hidden_size,
193
+ self.intermediate_size * 2,
194
+ bias=False)
195
+ self.down_proj = nn.Linear(self.intermediate_size,
196
+ self.hidden_size,
197
+ bias=False)
198
+ self.act_fn = ACT2FN[config.hidden_act]
199
+
200
+ def forward(self, x):
201
+ gate_and_up = self.gate_and_up_proj(x)
202
+ [gate, up] = torch.chunk(gate_and_up, 2, dim=-1)
203
+
204
+ acted = self.act_fn(gate)
205
+ tmp = acted * up
206
+
207
+ result = self.down_proj(tmp)
208
+
209
+ return result
210
+
211
+
212
+ class LingoWhaleAttention(nn.Module):
213
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
214
+
215
+ def __init__(self, config: LingoWhaleConfig):
216
+ super().__init__()
217
+ self.config = config
218
+ self.hidden_size = config.hidden_size
219
+ self.num_heads = config.num_attention_heads
220
+ self.head_dim = self.hidden_size // self.num_heads
221
+ self.max_position_embeddings = config.max_position_embeddings
222
+ self.rope_theta = config.rope_theta
223
+ self.dropout_p = config.attn_dropout_prob
224
+
225
+ if (self.head_dim * self.num_heads) != self.hidden_size:
226
+ raise ValueError(
227
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
228
+ f" and `num_heads`: {self.num_heads}).")
229
+ self.qkv_proj = nn.Linear(self.hidden_size,
230
+ 3 * self.hidden_size,
231
+ bias=False)
232
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim,
233
+ self.hidden_size,
234
+ bias=False)
235
+ self.attention_dropout = torch.nn.Dropout(self.dropout_p)
236
+ self._init_rope()
237
+
238
+ def attention_mask_func(self, attention_scores, attention_mask):
239
+ attention_scores.masked_fill_(attention_mask, -10000.0)
240
+ return attention_scores
241
+
242
+ def forward_torch_softmax(self, input, mask):
243
+ input = input.float()
244
+ mask_output = (self.attention_mask_func(input, mask)
245
+ if mask is not None else input)
246
+ probs = torch.nn.Softmax(dim=-1)(mask_output)
247
+
248
+ probs = probs.bfloat16()
249
+
250
+ return probs
251
+
252
+ def _self_attention(self, query_layer, key_layer, value_layer,
253
+ attention_mask):
254
+ output_size = (
255
+ query_layer.size(1),
256
+ query_layer.size(2),
257
+ query_layer.size(0),
258
+ key_layer.size(0),
259
+ )
260
+
261
+ # [sq, b, np, hn] -> [sq, b * np, hn]
262
+ query_layer = query_layer.reshape(output_size[2],
263
+ output_size[0] * output_size[1], -1)
264
+
265
+ # [sk, b, np, hn] -> [sk, b * np, hn]
266
+ key_layer = key_layer.reshape(output_size[3],
267
+ output_size[0] * output_size[1], -1)
268
+
269
+ matmul_input_buffer = torch.randn(
270
+ (output_size[0] * output_size[1], output_size[2], output_size[3]),
271
+ dtype=query_layer.dtype,
272
+ device=query_layer.device,
273
+ )
274
+ norm_factor = math.sqrt(key_layer.shape[-1])
275
+ # Raw attention scores. [b * np, sq, sk]
276
+ matmul_result = torch.baddbmm(
277
+ matmul_input_buffer,
278
+ query_layer.transpose(0, 1), # [b * np, sq, hn]
279
+ key_layer.transpose(0, 1).transpose(1, 2), # [b * np, hn, sk]
280
+ beta=0.0,
281
+ alpha=(1.0 / norm_factor),
282
+ )
283
+
284
+ # change view to [b, np, sq, sk]
285
+ attention_scores = matmul_result.view(*output_size)
286
+
287
+ # attention scores and attention mask [b, np, sq, sk]
288
+ attention_probs = self.forward_torch_softmax(attention_scores,
289
+ attention_mask)
290
+
291
+ # This is actually dropping out entire tokens to attend to, which might
292
+ # seem a bit unusual, but is taken from the original Transformer paper.
293
+
294
+ attention_probs = self.attention_dropout(attention_probs)
295
+
296
+ # =========================
297
+ # Context layer. [sq, b, hp]
298
+ # =========================
299
+
300
+ # value_layer -> context layer.
301
+ # [sk, b, np, hn] --> [b, np, sq, hn]
302
+
303
+ # context layer shape: [b, np, sq, hn]
304
+ output_size = (
305
+ value_layer.size(1),
306
+ value_layer.size(2),
307
+ query_layer.size(0),
308
+ value_layer.size(3),
309
+ )
310
+
311
+ # change view [sk, b * np, hn]
312
+ value_layer = value_layer.reshape(value_layer.size(0),
313
+ output_size[0] * output_size[1], -1)
314
+
315
+ # change view [b * np, sq, sk]
316
+ attention_probs = attention_probs.view(output_size[0] * output_size[1],
317
+ output_size[2], -1)
318
+
319
+ # matmul: [b * np, sq, hn]
320
+ context_layer = torch.bmm(attention_probs, value_layer.transpose(0, 1))
321
+
322
+ # change view [b, np, sq, hn]
323
+ context_layer = context_layer.view(*output_size)
324
+
325
+ # [b, np, sq, hn] --> [sq, b, np, hn]
326
+ context_layer = context_layer.permute(2, 0, 1, 3).contiguous()
327
+
328
+ # [sq, b, np, hn] --> [sq, b, hp]
329
+ new_context_layer_shape = context_layer.size()[:-2] + (
330
+ self.hidden_size, )
331
+
332
+ context_layer = context_layer.view(*new_context_layer_shape)
333
+
334
+ return context_layer
335
+
336
+ def _self_attention_flash(self, q, k, v):
337
+ batch_size, seqlen_q = q.shape[0], q.shape[1]
338
+ seqlen_k = k.shape[1]
339
+
340
+ q, k, v = [rearrange(x, "b s ... -> (b s) ...") for x in [q, k, v]]
341
+ cu_seqlens_q = torch.arange(
342
+ 0,
343
+ (batch_size + 1) * seqlen_q,
344
+ step=seqlen_q,
345
+ dtype=torch.int32,
346
+ device=q.device,
347
+ )
348
+
349
+ if self.training:
350
+ # during training q,k,v always have same seqlen
351
+ assert seqlen_k == seqlen_q
352
+
353
+ is_causal = True
354
+ cu_seqlens_k = cu_seqlens_q
355
+ dropout_p = self.dropout_p
356
+ else:
357
+ # turn off FA causal mask after first inference autoregressive iteration
358
+ # only on first autoregressive step q,k,v have same seqlen
359
+ is_causal = seqlen_q == seqlen_k
360
+ cu_seqlens_k = torch.arange(
361
+ 0,
362
+ (batch_size + 1) * seqlen_k,
363
+ step=seqlen_k,
364
+ dtype=torch.int32,
365
+ device=q.device,
366
+ )
367
+ dropout_p = 0
368
+
369
+ output = flash_attn_unpadded_func(
370
+ q,
371
+ k,
372
+ v,
373
+ cu_seqlens_q,
374
+ cu_seqlens_k,
375
+ seqlen_q,
376
+ seqlen_k,
377
+ dropout_p,
378
+ causal=is_causal,
379
+ )
380
+
381
+ output = rearrange(output, "(b s) ... -> b s ...", b=batch_size)
382
+ return output
383
+
384
+ def _init_rope(self):
385
+ self.rotary_emb = LingoWhaleRotaryEmbedding(
386
+ self.head_dim,
387
+ max_position_embeddings=self.max_position_embeddings,
388
+ base=self.rope_theta,
389
+ )
390
+
391
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
392
+ return (tensor.view(bsz, seq_len, self.num_heads,
393
+ self.head_dim).transpose(1, 2).contiguous())
394
+
395
+ def forward(
396
+ self,
397
+ hidden_states: torch.Tensor,
398
+ attention_mask: Optional[torch.Tensor] = None,
399
+ position_ids: Optional[torch.LongTensor] = None,
400
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
401
+ output_attentions: bool = False,
402
+ use_cache: bool = False,
403
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor],
404
+ Optional[Tuple[torch.Tensor]]]:
405
+ bsz, q_len, _ = hidden_states.size()
406
+
407
+ proj = self.qkv_proj(hidden_states)
408
+ proj = (proj.unflatten(-1,
409
+ (3, self.hidden_size)).unsqueeze(0).transpose(
410
+ 0, -2).squeeze(-2))
411
+
412
+ query_states = (proj[0].view(bsz, q_len, self.num_heads,
413
+ self.head_dim).transpose(1, 2))
414
+ key_states = (proj[1].view(bsz, q_len, self.num_heads,
415
+ self.head_dim).transpose(1, 2))
416
+ value_states = (proj[2].view(bsz, q_len, self.num_heads,
417
+ self.head_dim).transpose(1, 2))
418
+
419
+ kv_seq_len = key_states.shape[-2]
420
+ if past_key_value is not None:
421
+ kv_seq_len += past_key_value[0].shape[-2]
422
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
423
+ query_states, key_states = apply_rotary_pos_emb(
424
+ query_states, key_states, cos, sin, position_ids)
425
+
426
+ # [bsz, nh, t, hd]
427
+
428
+ if past_key_value is not None:
429
+ # reuse k, v, self_attention
430
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
431
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
432
+
433
+ past_key_value = (key_states, value_states) if use_cache else None
434
+
435
+ query_states = query_states.transpose(1, 2).transpose(0, 1)
436
+ value_states = value_states.transpose(1, 2).transpose(0, 1)
437
+ key_states = key_states.transpose(1, 2).transpose(0, 1)
438
+ attention_mask = attention_mask < -0.5
439
+
440
+ if self.config.use_flash_attention and flash_attn_unpadded_func is not None:
441
+ assert (
442
+ rearrange is not None
443
+ ), "Please install einops first, e.g., with pip install einops"
444
+ q, k, v = [
445
+ rearrange(x, "s b ... -> b s ...").contiguous()
446
+ for x in (query_states, key_states, value_states)
447
+ ]
448
+ attn_output = self._self_attention_flash(q, k, v)
449
+ attn_output = rearrange(attn_output,
450
+ "b s h d -> s b (h d)").contiguous()
451
+ else:
452
+ attn_output = self._self_attention(query_states, key_states,
453
+ value_states, attention_mask)
454
+ attn_output = attn_output.transpose(0, 1)
455
+ attn_output = self.o_proj(attn_output)
456
+
457
+ if not output_attentions:
458
+ attn_weights = None
459
+
460
+ return attn_output, attn_weights, past_key_value
461
+
462
+
463
+ class LingoWhaleDecoderLayer(nn.Module):
464
+
465
+ def __init__(self, config: LingoWhaleConfig):
466
+ super().__init__()
467
+ self.hidden_size = config.hidden_size
468
+ self.self_attn = LingoWhaleAttention(config=config)
469
+ self.mlp = LingoWhaleMLP(config)
470
+ self.input_layernorm = LingoWhaleRMSNorm(config.hidden_size,
471
+ eps=config.rms_norm_eps)
472
+ self.post_attention_layernorm = LingoWhaleRMSNorm(
473
+ config.hidden_size, eps=config.rms_norm_eps)
474
+
475
+ def forward(
476
+ self,
477
+ hidden_states: torch.Tensor,
478
+ attention_mask: Optional[torch.Tensor] = None,
479
+ position_ids: Optional[torch.LongTensor] = None,
480
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
481
+ output_attentions: Optional[bool] = False,
482
+ use_cache: Optional[bool] = False,
483
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor,
484
+ torch.FloatTensor]]]:
485
+ """
486
+ Args:
487
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
488
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
489
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
490
+ output_attentions (`bool`, *optional*):
491
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
492
+ returned tensors for more detail.
493
+ use_cache (`bool`, *optional*):
494
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
495
+ (see `past_key_values`).
496
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
497
+ """
498
+
499
+ residual = hidden_states
500
+
501
+ hidden_states = self.input_layernorm(hidden_states)
502
+
503
+ # Self Attention
504
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
505
+ hidden_states=hidden_states,
506
+ attention_mask=attention_mask,
507
+ position_ids=position_ids,
508
+ past_key_value=past_key_value,
509
+ output_attentions=output_attentions,
510
+ use_cache=use_cache,
511
+ )
512
+
513
+ hidden_states = residual + hidden_states
514
+
515
+ # Fully Connected
516
+ residual = hidden_states
517
+ hidden_states = self.post_attention_layernorm(hidden_states)
518
+ hidden_states = self.mlp(hidden_states)
519
+ hidden_states = residual + hidden_states
520
+ outputs = (hidden_states, )
521
+
522
+ if output_attentions:
523
+ outputs += (self_attn_weights, )
524
+
525
+ if use_cache:
526
+ outputs += (present_key_value, )
527
+
528
+ return outputs
529
+
530
+
531
+ class LingoWhalePreTrainedModel(PreTrainedModel):
532
+ config_class = LingoWhaleConfig
533
+ base_model_prefix = "model"
534
+ supports_gradient_checkpointing = True
535
+ _no_split_modules = ["LingoWhaleDecoderLayer"]
536
+
537
+ def _init_weights(self, module):
538
+ std = self.config.initializer_range
539
+ if isinstance(module, nn.Linear):
540
+ module.weight.data.normal_(mean=0.0, std=std)
541
+ if module.bias is not None:
542
+ module.bias.data.zero_()
543
+ elif isinstance(module, nn.Embedding):
544
+ module.weight.data.normal_(mean=0.0, std=std)
545
+ if module.padding_idx is not None:
546
+ module.weight.data[module.padding_idx].zero_()
547
+
548
+ def _set_gradient_checkpointing(self, module, value=False):
549
+ if isinstance(module, LingoWhaleModel):
550
+ module.gradient_checkpointing = value
551
+
552
+
553
+ class LingoWhaleModel(LingoWhalePreTrainedModel):
554
+
555
+ def __init__(self, config: LingoWhaleConfig):
556
+ super().__init__(config)
557
+ self.padding_idx = config.pad_token_id
558
+ self.vocab_size = config.vocab_size
559
+
560
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size,
561
+ self.padding_idx)
562
+ self.layers = nn.ModuleList([
563
+ LingoWhaleDecoderLayer(config)
564
+ for _ in range(config.num_hidden_layers)
565
+ ])
566
+ self.norm = LingoWhaleRMSNorm(config.hidden_size,
567
+ eps=config.rms_norm_eps)
568
+ self.drop = nn.Dropout(config.emb_dropout_prob)
569
+ self.gradient_checkpointing = False
570
+ # Initialize weights and apply final processing
571
+ self.post_init()
572
+
573
+ def get_input_embeddings(self):
574
+ return self.embed_tokens
575
+
576
+ def set_input_embeddings(self, value):
577
+ self.embed_tokens = value
578
+
579
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
580
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape,
581
+ inputs_embeds, past_key_values_length):
582
+ # create causal mask
583
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
584
+ combined_attention_mask = None
585
+ if input_shape[-1] > 1:
586
+ combined_attention_mask = _make_causal_mask(
587
+ input_shape,
588
+ inputs_embeds.dtype,
589
+ device=inputs_embeds.device,
590
+ past_key_values_length=past_key_values_length,
591
+ )
592
+
593
+ if attention_mask is not None:
594
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
595
+ expanded_attn_mask = _expand_mask(attention_mask,
596
+ inputs_embeds.dtype,
597
+ tgt_len=input_shape[-1]).to(
598
+ inputs_embeds.device)
599
+ combined_attention_mask = (expanded_attn_mask
600
+ if combined_attention_mask is None else
601
+ expanded_attn_mask +
602
+ combined_attention_mask)
603
+
604
+ return combined_attention_mask
605
+
606
+     def forward(
+         self,
+         input_ids: torch.LongTensor = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[torch.FloatTensor]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[Tuple, BaseModelOutputWithPast]:
+         output_attentions = (output_attentions if output_attentions is not None
+                              else self.config.output_attentions)
+         output_hidden_states = (output_hidden_states
+                                 if output_hidden_states is not None else
+                                 self.config.output_hidden_states)
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+         return_dict = (return_dict if return_dict is not None else
+                        self.config.use_return_dict)
+
+         # retrieve input_ids and inputs_embeds
+         if input_ids is not None and inputs_embeds is not None:
+             raise ValueError(
+                 "You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time"
+             )
+         elif input_ids is not None:
+             batch_size, seq_length = input_ids.shape
+         elif inputs_embeds is not None:
+             batch_size, seq_length, _ = inputs_embeds.shape
+         else:
+             raise ValueError(
+                 "You have to specify either decoder_input_ids or decoder_inputs_embeds"
+             )
+
+         seq_length_with_past = seq_length
+         past_key_values_length = 0
+
+         if past_key_values is not None:
+             past_key_values_length = past_key_values[0][0].shape[2]
+             seq_length_with_past = seq_length_with_past + past_key_values_length
+
+         if position_ids is None:
+             device = input_ids.device if input_ids is not None else inputs_embeds.device
+             position_ids = torch.arange(
+                 past_key_values_length,
+                 seq_length + past_key_values_length,
+                 dtype=torch.long,
+                 device=device,
+             )
+             position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
+         else:
+             position_ids = position_ids.view(-1, seq_length).long()
+
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+         # embed positions
+         if attention_mask is None:
+             attention_mask = torch.ones(
+                 (batch_size, seq_length_with_past),
+                 dtype=torch.bool,
+                 device=inputs_embeds.device,
+             )
+         attention_mask = self._prepare_decoder_attention_mask(
+             attention_mask,
+             (batch_size, seq_length),
+             inputs_embeds,
+             past_key_values_length,
+         )
+
+         hidden_states = inputs_embeds
+         hidden_states = self.drop(hidden_states)
+         hidden_states = hidden_states.to(dtype=torch.bfloat16)
+
+         if self.gradient_checkpointing and self.training:
+             if use_cache:
+                 logger.warning_once(
+                     "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                 )
+                 use_cache = False
+
+         # decoder layers
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+         next_decoder_cache = () if use_cache else None
+
+         for idx, decoder_layer in enumerate(self.layers):
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states, )
+
+             past_key_value = (past_key_values[idx]
+                               if past_key_values is not None else None)
+
+             if self.gradient_checkpointing and self.training:
+
+                 def create_custom_forward(module):
+
+                     def custom_forward(*inputs):
+                         # None for past_key_value
+                         return module(*inputs, output_attentions, None)
+
+                     return custom_forward
+
+                 layer_outputs = torch.utils.checkpoint.checkpoint(
+                     create_custom_forward(decoder_layer),
+                     hidden_states,
+                     attention_mask,
+                     position_ids,
+                     None,
+                 )
+             else:
+                 layer_outputs = decoder_layer(
+                     hidden_states,
+                     attention_mask=attention_mask,
+                     position_ids=position_ids,
+                     past_key_value=past_key_value,
+                     output_attentions=output_attentions,
+                     use_cache=use_cache,
+                 )
+
+             hidden_states = layer_outputs[0]
+
+             if use_cache:
+                 next_decoder_cache += (
+                     layer_outputs[2 if output_attentions else 1], )
+
+             if output_attentions:
+                 all_self_attns += (layer_outputs[1], )
+
+         hidden_states = self.norm(hidden_states)
+
+         if output_hidden_states:
+             all_hidden_states += (hidden_states, )
+
+         next_cache = next_decoder_cache if use_cache else None
+         if not return_dict:
+             return tuple(
+                 v for v in
+                 [hidden_states, next_cache, all_hidden_states, all_self_attns]
+                 if v is not None)
+         return BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=next_cache,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+         )
+
+
+ class LingoWhaleForCausalLM(LingoWhalePreTrainedModel):
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = LingoWhaleModel(config)
+         self.vocab_size = config.vocab_size
+         self.lm_head = torch.nn.Linear(config.hidden_size,
+                                        config.vocab_size,
+                                        bias=False)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.model.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.model.embed_tokens = value
+
+     def get_output_embeddings(self):
+         return self.lm_head
+
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head = new_embeddings
+
+     def set_decoder(self, decoder):
+         self.model = decoder
+
+     def get_decoder(self):
+         return self.model
+
+     @classmethod
+     def from_pretrained(
+         cls,
+         pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
+         *model_args,
+         config: Optional[Union[PretrainedConfig, str, os.PathLike]] = None,
+         cache_dir: Optional[Union[str, os.PathLike]] = None,
+         ignore_mismatched_sizes: bool = False,
+         force_download: bool = False,
+         local_files_only: bool = False,
+         token: Optional[Union[str, bool]] = None,
+         revision: str = "main",
+         use_safetensors: bool = None,
+         **kwargs,
+     ):
+         # Load config if we don't provide a configuration
+         if not isinstance(config, PretrainedConfig):
+             config_path = (config if config is not None else
+                            pretrained_model_name_or_path)
+             config, model_kwargs = cls.config_class.from_pretrained(
+                 config_path,
+                 cache_dir=cache_dir,
+                 return_unused_kwargs=True,
+                 force_download=force_download,
+                 resume_download=False,
+                 proxies=None,
+                 local_files_only=local_files_only,
+                 token=token,
+                 revision=revision,
+                 subfolder="",
+                 _from_auto=False,
+                 _from_pipeline=None,
+                 **kwargs,
+             )
+         else:
+             model_kwargs = kwargs
+         if "torch_dtype" not in kwargs:
+             kwargs["torch_dtype"] = config.torch_dtype
+         return super(LingoWhaleForCausalLM, cls).from_pretrained(
+             pretrained_model_name_or_path,
+             *model_args,
+             config=config,
+             cache_dir=cache_dir,
+             ignore_mismatched_sizes=ignore_mismatched_sizes,
+             force_download=force_download,
+             local_files_only=local_files_only,
+             token=token,
+             revision=revision,
+             use_safetensors=use_safetensors,
+             **kwargs,
+         )
+
+     def forward(
+         self,
+         input_ids: torch.LongTensor = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[torch.FloatTensor]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[Tuple, CausalLMOutputWithPast]:
+         output_attentions = (output_attentions if output_attentions is not None
+                              else self.config.output_attentions)
+         output_hidden_states = (output_hidden_states
+                                 if output_hidden_states is not None else
+                                 self.config.output_hidden_states)
+         return_dict = (return_dict if return_dict is not None else
+                        self.config.use_return_dict)
+
+         # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+         outputs = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+
+         hidden_states = outputs[0]
+         logits = self.lm_head(hidden_states)
+
+         loss = None
+         if labels is not None:
+             # Shift so that tokens < n predict n
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             # Flatten the tokens
+             loss_fct = CrossEntropyLoss()
+             shift_logits = shift_logits.view(-1, self.config.vocab_size)
+             shift_labels = shift_labels.view(-1)
+             # NOTE: computed here but not added to the cross-entropy loss below
+             softmax_normalizer = shift_logits.max(-1).values**2
+             # Enable model parallelism
+             shift_labels = shift_labels.to(shift_logits.device)
+             loss = loss_fct(shift_logits, shift_labels)
+
+         if not return_dict:
+             output = (logits, ) + outputs[1:]
+             return (loss, ) + output if loss is not None else output
+
+         return CausalLMOutputWithPast(
+             loss=loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+     def prepare_inputs_for_generation(
+         self,
+         input_ids,
+         past_key_values=None,
+         attention_mask=None,
+         inputs_embeds=None,
+         **kwargs,
+     ):
+         if past_key_values:
+             input_ids = input_ids[:, -1:]
+
+         position_ids = kwargs.get("position_ids", None)
+         if attention_mask is not None and position_ids is None:
+             # create position_ids on the fly for batch generation
+             position_ids = attention_mask.long().cumsum(-1) - 1
+             position_ids.masked_fill_(attention_mask == 0, 1)
+             if past_key_values:
+                 position_ids = position_ids[:, -1].unsqueeze(-1)
+
+         # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+         if inputs_embeds is not None and past_key_values is None:
+             model_inputs = {"inputs_embeds": inputs_embeds}
+         else:
+             model_inputs = {"input_ids": input_ids}
+
+         model_inputs.update({
+             "position_ids": position_ids,
+             "past_key_values": past_key_values,
+             "use_cache": kwargs.get("use_cache"),
+             "attention_mask": attention_mask,
+         })
+         return model_inputs
+
+     @staticmethod
+     def _reorder_cache(past_key_values, beam_idx):
+         reordered_past = ()
+         for layer_past in past_key_values:
+             reordered_past += (tuple(
+                 past_state.index_select(0, beam_idx)
+                 for past_state in layer_past), )
+         return reordered_past
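
The `from_pretrained` override above only resolves the config and forwards `torch_dtype` before handing off to the standard Hugging Face loader, while `prepare_inputs_for_generation` feeds just the newest token once a KV cache exists. Below is a minimal inference sketch; the repo id comes from the links at the top of this card, the generation settings are illustrative only, and `trust_remote_code=True` is needed so these custom `modeling_lingowhale.py` / `tokenization_lingowhale.py` files are actually used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id (taken from the model card links); replace with a local path if needed.
repo_id = "deeplang-ai/LingoWhale-8B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto",
                                             trust_remote_code=True)

inputs = tokenizer("介绍一下大语言模型。", return_tensors="pt").to(model.device)
# generate() calls prepare_inputs_for_generation at every step, so after the
# first forward pass only the newest token id is passed along with the KV cache.
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```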
tokenization_lingowhale.py ADDED
@@ -0,0 +1,264 @@
+ # Copyright 2023 DeepLang AI. All Rights Reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import os
+ from shutil import copyfile
+ from typing import Any, Dict, List, Optional, Tuple
+
+ import sentencepiece as spm
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
+ from transformers.utils import logging
+
+ logger = logging.get_logger(__name__)
+
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
+
+ PRETRAINED_VOCAB_FILES_MAP = {
+     "vocab_file": {},
+     "tokenizer_file": {},
+ }
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
+
+
+ class LingoWhaleTokenizer(PreTrainedTokenizer):
+     """
+     Construct a LingoWhale tokenizer. Based on byte-level Byte-Pair-Encoding.
+
+     Args:
+         vocab_file (`str`):
+             Path to the vocabulary file.
+     """
+
+     vocab_files_names = VOCAB_FILES_NAMES
+     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+     model_input_names = ["input_ids", "attention_mask"]
+
+     def __init__(
+         self,
+         vocab_file,
+         unk_token="<!!UNK!!>",
+         bos_token="<!!BOS!!>",
+         eos_token="<!!EOS!!>",
+         pad_token=None,
+         sp_model_kwargs: Optional[Dict[str, Any]] = None,
+         add_bos_token=True,
+         add_eos_token=False,
+         clean_up_tokenization_spaces=False,
+         **kwargs,
+     ):
+         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+         bos_token = (AddedToken(bos_token, lstrip=False, rstrip=False)
+                      if isinstance(bos_token, str) else bos_token)
+         eos_token = (AddedToken(eos_token, lstrip=False, rstrip=False)
+                      if isinstance(eos_token, str) else eos_token)
+         unk_token = (AddedToken(unk_token, lstrip=False, rstrip=False)
+                      if isinstance(unk_token, str) else unk_token)
+         pad_token = (AddedToken(pad_token, lstrip=False, rstrip=False)
+                      if isinstance(pad_token, str) else pad_token)
+         super().__init__(
+             bos_token=bos_token,
+             eos_token=eos_token,
+             unk_token=unk_token,
+             pad_token=pad_token,
+             add_bos_token=add_bos_token,
+             add_eos_token=add_eos_token,
+             sp_model_kwargs=self.sp_model_kwargs,
+             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+             **kwargs,
+         )
+         self.vocab_file = vocab_file
+         self.add_bos_token = add_bos_token
+         self.add_eos_token = add_eos_token
+         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+         self.sp_model.Load(vocab_file)
+
+     @property
+     def unk_token_length(self):
+         return len(self.sp_model.encode(str(self.unk_token)))
+
+     def __getstate__(self):
+         state = self.__dict__.copy()
+         state["sp_model"] = None
+         return state
+
+     def __setstate__(self, d):
+         self.__dict__ = d
+         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+         self.sp_model.Load(self.vocab_file)
+
+     @property
+     def vocab_size(self):
+         """Returns vocab size"""
+         return self.sp_model.get_piece_size()
+
+     def get_vocab(self):
+         """Returns vocab as a dict"""
+         vocab = {
+             self.convert_ids_to_tokens(i): i
+             for i in range(self.vocab_size)
+         }
+         vocab.update(self.added_tokens_encoder)
+         return vocab
+
+     def _tokenize(self, text):
+         """Returns a tokenized string."""
+         return self.sp_model.encode(text, out_type=str)
+
+     def _convert_token_to_id(self, token):
+         """Converts a token (str) in an id using the vocab."""
+         return self.sp_model.piece_to_id(token)
+
+     def _convert_id_to_token(self, index):
+         """Converts an index (integer) in a token (str) using the vocab."""
+         token = self.sp_model.IdToPiece(index)
+         return token
+
+     def convert_tokens_to_string(self, tokens):
+         """Converts a sequence of tokens (string) in a single string."""
+         current_sub_tokens = []
+         out_string = ""
+         prev_is_special = False
+         for i, token in enumerate(tokens):
+             # make sure that special tokens are not decoded using sentencepiece model
+             if token in self.all_special_tokens:
+                 if not prev_is_special and i != 0:
+                     out_string += " "
+                 out_string += self.sp_model.decode(current_sub_tokens) + token
+                 prev_is_special = True
+                 current_sub_tokens = []
+             else:
+                 current_sub_tokens.append(token)
+                 prev_is_special = False
+         out_string += self.sp_model.decode(current_sub_tokens)
+         return out_string
+
+     def save_vocabulary(self,
+                         save_directory,
+                         filename_prefix: Optional[str] = None) -> Tuple[str]:
+         """
+         Save the vocabulary and special tokens file to a directory.
+
+         Args:
+             save_directory (`str`):
+                 The directory in which to save the vocabulary.
+
+         Returns:
+             `Tuple(str)`: Paths to the files saved.
+         """
+         if not os.path.isdir(save_directory):
+             logger.error(
+                 f"Vocabulary path ({save_directory}) should be a directory")
+             return
+         out_vocab_file = os.path.join(
+             save_directory,
+             (filename_prefix + "-" if filename_prefix else "") +
+             VOCAB_FILES_NAMES["vocab_file"],
+         )
+
+         if os.path.abspath(self.vocab_file) != os.path.abspath(
+                 out_vocab_file) and os.path.isfile(self.vocab_file):
+             copyfile(self.vocab_file, out_vocab_file)
+         elif not os.path.isfile(self.vocab_file):
+             with open(out_vocab_file, "wb") as fi:
+                 content_spiece_model = self.sp_model.serialized_model_proto()
+                 fi.write(content_spiece_model)
+
+         return (out_vocab_file, )
+
+     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+         bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+         eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+
+         output = bos_token_id + token_ids_0 + eos_token_id
+
+         if token_ids_1 is not None:
+             output = output + bos_token_id + token_ids_1 + eos_token_id
+
+         return output
+
+     def get_special_tokens_mask(
+         self,
+         token_ids_0: List[int],
+         token_ids_1: Optional[List[int]] = None,
+         already_has_special_tokens: bool = False,
+     ) -> List[int]:
+         """
+         Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+         special tokens using the tokenizer `prepare_for_model` method.
+
+         Args:
+             token_ids_0 (`List[int]`):
+                 List of IDs.
+             token_ids_1 (`List[int]`, *optional*):
+                 Optional second list of IDs for sequence pairs.
+             already_has_special_tokens (`bool`, *optional*, defaults to `False`):
+                 Whether or not the token list is already formatted with special tokens for the model.
+
+         Returns:
+             `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+         """
+         if already_has_special_tokens:
+             return super().get_special_tokens_mask(
+                 token_ids_0=token_ids_0,
+                 token_ids_1=token_ids_1,
+                 already_has_special_tokens=True,
+             )
+
+         bos_token_id = [1] if self.add_bos_token else []
+         eos_token_id = [1] if self.add_eos_token else []
+
+         if token_ids_1 is None:
+             return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
+         return (bos_token_id + ([0] * len(token_ids_0)) + eos_token_id +
+                 bos_token_id + ([0] * len(token_ids_1)) + eos_token_id)
+
+     def create_token_type_ids_from_sequences(
+             self,
+             token_ids_0: List[int],
+             token_ids_1: Optional[List[int]] = None) -> List[int]:
+         """
+         Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
+         sequence pair mask has the following format:
+
+         ```
+         0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
+         | first sequence    | second sequence |
+         ```
+
+         if token_ids_1 is None, only returns the first portion of the mask (0s).
+
+         Args:
+             token_ids_0 (`List[int]`):
+                 List of ids.
+             token_ids_1 (`List[int]`, *optional*):
+                 Optional second list of IDs for sequence pairs.
+
+         Returns:
+             `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
+         """
+         bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+         eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+
+         output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
+
+         if token_ids_1 is not None:
+             output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
+
+         return output
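
`build_inputs_with_special_tokens` and `get_special_tokens_mask` only involve `<!!BOS!!>` / `<!!EOS!!>` when `add_bos_token` / `add_eos_token` are enabled; the class defaults are `True` / `False`, and the `tokenizer_config.json` below switches both off. A small sketch of that behaviour, assuming `tok` is a `LingoWhaleTokenizer` instance that has already been loaded:

```python
# Hypothetical token ids, purely for illustration.
ids = [101, 102, 103]

# With the class defaults (add_bos_token=True, add_eos_token=False) this
# returns [tok.bos_token_id, 101, 102, 103]; with both flags off it returns
# the ids unchanged.
with_specials = tok.build_inputs_with_special_tokens(ids)

# 1 marks the positions that build_inputs_with_special_tokens would add,
# 0 marks ordinary sequence tokens.
mask = tok.get_special_tokens_mask(ids)
```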
tokenizer_config.json ADDED
@@ -0,0 +1,47 @@
+ {
+   "auto_map": {
+     "AutoTokenizer": [
+       "tokenization_lingowhale.LingoWhaleTokenizer",
+       null
+     ]
+   },
+   "add_bos_token": false,
+   "add_eos_token": false,
+   "use_fast": false,
+   "clean_up_tokenization_spaces": false,
+   "model_max_length": 8192,
+   "sp_model_kwargs": {},
+   "tokenizer_class": "LingoWhaleTokenizer",
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<!!BOS!!>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": true
+   },
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "<!!EOS!!>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": true
+   },
+   "pad_token": {
+     "__type": "AddedToken",
+     "content": "<!!UNK!!>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": true
+   },
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<!!UNK!!>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": true
+   }
+ }
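
The `auto_map` entry is what lets `AutoTokenizer` resolve the custom `LingoWhaleTokenizer` class from `tokenization_lingowhale.py`, and because `add_bos_token` / `add_eos_token` are both `false` here, plain encoding adds no special tokens; `pad_token` falls back to `<!!UNK!!>` since no dedicated padding token is defined. A minimal loading sketch under those assumptions (repo id taken from the links above):

```python
from transformers import AutoTokenizer

# trust_remote_code=True is required for the auto_map redirection above.
tokenizer = AutoTokenizer.from_pretrained("deeplang-ai/LingoWhale-8B",
                                          trust_remote_code=True)

enc = tokenizer("语鲸-8B", return_tensors="pt")
# With this config, enc.input_ids contains no <!!BOS!!>/<!!EOS!!> ids, and
# model_max_length caps encoded inputs at 8192 tokens.
```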