jymcc commited on
Commit
5f8c9fe
1 Parent(s): e513c55
README.md CHANGED
@@ -1,3 +1,160 @@
1
  ---
2
- license: apache-2.0
 
 
 
 
 
3
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ language:
3
+ - en
4
+ - zh
5
+ license: other
6
+ tasks:
7
+ - text-generation
8
  ---
9
+
10
+ <!-- markdownlint-disable first-line-h1 -->
11
+ <!-- markdownlint-disable html -->
12
+ <div align="center">
13
+ <h1>
14
+ Baichuan 2
15
+ </h1>
16
+ </div>
17
+
18
+ <div align="center">
19
+ <a href="https://github.com/baichuan-inc/Baichuan2" target="_blank">🦉GitHub</a> | <a href="https://github.com/baichuan-inc/Baichuan-7B/blob/main/media/wechat.jpeg?raw=true" target="_blank">💬WeChat</a>
20
+ </div>
21
+ <div align="center">
22
+ 🚀 <a href="https://www.baichuan-ai.com/" target="_blank">百川大模型在线对话平台</a> 已正式向公众开放 🎉
23
+ </div>
24
+
25
+ # 目录/Table of Contents
26
+
27
+ - [📖 模型介绍/Introduction](#Introduction)
28
+ - [⚙️ 快速开始/Quick Start](#Start)
29
+ - [📊 Benchmark评估/Benchmark Evaluation](#Benchmark)
30
+ - [📜 声明与协议/Terms and Conditions](#Terms)
31
+
32
+
33
+ # <span id="Introduction">模型介绍/Introduction</span>
34
+
35
+ Baichuan 2 是[百川智能]推出的新一代开源大语言模型,采用 **2.6 万亿** Tokens 的高质量语料训练,在权威的中文和英文 benchmark
36
+ 上均取得同尺寸最好的效果。本次发布包含有 7B、13B 的 Base 和 Chat 版本,并提供了 Chat 版本的 4bits
37
+ 量化,所有版本不仅对学术研究完全开放,开发者也仅需[邮件申请]并获得官方商用许可后,即可以免费商用。具体发布版本和下载见下表:
38
+
39
+ Baichuan 2 is the new generation of large-scale open-source language models launched by [Baichuan Intelligence inc.](https://www.baichuan-ai.com/).
40
+ It is trained on a high-quality corpus with 2.6 trillion tokens and has achieved the best performance in authoritative Chinese and English benchmarks of the same size.
41
+ This release includes 7B and 13B versions for both Base and Chat models, along with a 4bits quantized version for the Chat model.
42
+ All versions are fully open to academic research, and developers can also use them for free in commercial applications after obtaining an official commercial license through [email request](mailto:opensource@baichuan-inc.com).
43
+ The specific release versions and download links are listed in the table below:
44
+
45
+ | | Base Model | Chat Model | 4bits Quantized Chat Model |
46
+ |:---:|:--------------------:|:--------------------:|:--------------------------:|
47
+ | 7B | [Baichuan2-7B-Base](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base) | [Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) | [Baichuan2-7B-Chat-4bits](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base-4bits) |
48
+ | 13B | [Baichuan2-13B-Base](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base) | [Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) | [Baichuan2-13B-Chat-4bits](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits) |
49
+
50
+ # <span id="Start">快速开始/Quick Start</span>
51
+
52
+ 在Baichuan2系列模型中,我们为了加快推理速度使用了Pytorch2.0加入的新功能F.scaled_dot_product_attention,因此模型需要在Pytorch2.0环境下运行。
53
+
54
+ In the Baichuan 2 series models, we have utilized the new feature `F.scaled_dot_product_attention` introduced in PyTorch 2.0 to accelerate inference speed. Therefore, the model needs to be run in a PyTorch 2.0 environment.
55
+
56
+
57
+ ```python
58
+ import torch
59
+ from transformers import AutoModelForCausalLM, AutoTokenizer
60
+ tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-13B-Base", use_fast=False, trust_remote_code=True)
61
+ model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-13B-Base", device_map="auto", trust_remote_code=True)
62
+ inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
63
+ inputs = inputs.to('cuda:0')
64
+ pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
65
+ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
66
+ ```
67
+
68
+ # <span id="Benchmark">Benchmark 结果/Benchmark Evaluation</span>
69
+
70
+ 我们在[通用]、[法律]、[医疗]、[数学]、[代码]和[多语言翻译]六个领域的中英文权威数据集上对模型进行了广泛测试,更多详细测评结果可查看[GitHub]。
71
+
72
+ We have extensively tested the model on authoritative Chinese-English datasets across six domains: [General](https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#general-domain), [Legal](https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#law-and-medicine), [Medical](https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#law-and-medicine), [Mathematics](https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#mathematics-and-code), [Code](https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#mathematics-and-code), and [Multilingual Translation](https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#multilingual-translation). For more detailed evaluation results, please refer to [GitHub](https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md).
73
+
74
+ ### 7B Model Results
75
+
76
+ | | **C-Eval** | **MMLU** | **CMMLU** | **Gaokao** | **AGIEval** | **BBH** |
77
+ |:-----------------------:|:----------:|:--------:|:---------:|:----------:|:-----------:|:-------:|
78
+ | | 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | 3-shot |
79
+ | **GPT-4** | 68.40 | 83.93 | 70.33 | 66.15 | 63.27 | 75.12 |
80
+ | **GPT-3.5 Turbo** | 51.10 | 68.54 | 54.06 | 47.07 | 46.13 | 61.59 |
81
+ | **LLaMA-7B** | 27.10 | 35.10 | 26.75 | 27.81 | 28.17 | 32.38 |
82
+ | **LLaMA2-7B** | 28.90 | 45.73 | 31.38 | 25.97 | 26.53 | 39.16 |
83
+ | **MPT-7B** | 27.15 | 27.93 | 26.00 | 26.54 | 24.83 | 35.20 |
84
+ | **Falcon-7B** | 24.23 | 26.03 | 25.66 | 24.24 | 24.10 | 28.77 |
85
+ | **ChatGLM2-6B** | 50.20 | 45.90 | 49.00 | 49.44 | 45.28 | 31.65 |
86
+ | **[Baichuan-7B]** | 42.80 | 42.30 | 44.02 | 36.34 | 34.44 | 32.48 |
87
+ | **[Baichuan2-7B-Base]** | 54.00 | 54.16 | 57.07 | 47.47 | 42.73 | 41.56 |
88
+
89
+ ### 13B Model Results
90
+
91
+ | | **C-Eval** | **MMLU** | **CMMLU** | **Gaokao** | **AGIEval** | **BBH** |
92
+ |:---------------------------:|:----------:|:--------:|:---------:|:----------:|:-----------:|:-------:|
93
+ | | 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | 3-shot |
94
+ | **GPT-4** | 68.40 | 83.93 | 70.33 | 66.15 | 63.27 | 75.12 |
95
+ | **GPT-3.5 Turbo** | 51.10 | 68.54 | 54.06 | 47.07 | 46.13 | 61.59 |
96
+ | **LLaMA-13B** | 28.50 | 46.30 | 31.15 | 28.23 | 28.22 | 37.89 |
97
+ | **LLaMA2-13B** | 35.80 | 55.09 | 37.99 | 30.83 | 32.29 | 46.98 |
98
+ | **Vicuna-13B** | 32.80 | 52.00 | 36.28 | 30.11 | 31.55 | 43.04 |
99
+ | **Chinese-Alpaca-Plus-13B** | 38.80 | 43.90 | 33.43 | 34.78 | 35.46 | 28.94 |
100
+ | **XVERSE-13B** | 53.70 | 55.21 | 58.44 | 44.69 | 42.54 | 38.06 |
101
+ | **[Baichuan-13B-Base]** | 52.40 | 51.60 | 55.30 | 49.69 | 43.20 | 43.01 |
102
+ | **[Baichuan2-13B-Base]** | 58.10 | 59.17 | 61.97 | 54.33 | 48.17 | 48.78 |
103
+
104
+
105
+ ## 训练过程模型/Training Dynamics
106
+
107
+ 除了训练了 2.6 万亿 Tokens 的 [Baichuan2-7B-Base](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base) 模型,我们还提供了在此之前的另外 11 个中间过程的模型(分别对应训练了约 0.2 ~ 2.4 万亿 Tokens)供社区研究使用
108
+ ([训练过程checkpoint下载](https://huggingface.co/baichuan-inc/Baichuan2-7B-Intermediate-Checkpoints))。下图给出了这些 checkpoints 在 C-Eval、MMLU、CMMLU 三个 benchmark 上的效果变化:
109
+
110
+ In addition to the [Baichuan2-7B-Base](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base) model trained on 2.6 trillion tokens, we also offer 11 additional intermediate-stage models for community research, corresponding to training on approximately 0.2 to 2.4 trillion tokens each ([Intermediate Checkpoints Download](https://huggingface.co/baichuan-inc/Baichuan2-7B-Intermediate-Checkpoints)). The graph below shows the performance changes of these checkpoints on three benchmarks: C-Eval, MMLU, and CMMLU.
111
+
112
+ ![checkpoint](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/checkpoints.jpeg)
113
+
114
+ # <span id="Terms">声明与协议/Terms and Conditions</span>
115
+
116
+ ## 声明
117
+
118
+ 我们在此声明,我们的开发团队并未基于 Baichuan 2 模型开发任何应用,无论是在 iOS、Android、网页或任何其他平台。我们强烈呼吁所有使用者,不要利用
119
+ Baichuan 2 模型进行任何危害国家社会安全或违法的活动。另外,我们也要求使用者不要将 Baichuan 2
120
+ 模型用于未经适当安全审查和备案的互联网服务。我们希望所有的使用者都能遵守这个原则,确保科技的发展能在规范和合法的环境下进行。
121
+
122
+ 我们已经尽我们所能,来确保模型训练过程中使用的数据的合规性。然而,尽管我们已经做出了巨大的努力,但由于模型和数据的复杂性,仍有可能存在一些无法预见的问题。因此,如果由于使用
123
+ Baichuan 2 开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。
124
+
125
+ We hereby declare that our team has not developed any applications based on Baichuan 2 models, not on iOS, Android, the web, or any other platform. We strongly call on all users not to use Baichuan 2 models for any activities that harm national / social security or violate the law. Also, we ask users not to use Baichuan 2 models for Internet services that have not undergone appropriate security reviews and filings. We hope that all users can abide by this principle and ensure that the development of technology proceeds in a regulated and legal environment.
126
+
127
+ We have done our best to ensure the compliance of the data used in the model training process. However, despite our considerable efforts, there may still be some unforeseeable issues due to the complexity of the model and data. Therefore, if any problems arise due to the use of Baichuan 2 open-source models, including but not limited to data security issues, public opinion risks, or any risks and problems brought about by the model being misled, abused, spread or improperly exploited, we will not assume any responsibility.
128
+
129
+ ## 协议
130
+
131
+ Baichuan 2 模型的社区使用需遵循[《Baichuan 2 模型社区许可协议》]。Baichuan 2 支持商用。如果将 Baichuan 2 模型或其衍生品用作商业用途,请您按照如下方式联系许可方,以进行登记并向许可方申请书面授权:联系邮箱 [opensource@baichuan-inc.com]。
132
+
133
+ The use of the source code in this repository follows the open-source license Apache 2.0. Community use of the Baichuan 2 model must adhere to the [Community License for Baichuan 2 Model](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Baichuan%202%E6%A8%A1%E5%9E%8B%E7%A4%BE%E5%8C%BA%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf). Baichuan 2 supports commercial use. If you are using the Baichuan 2 models or their derivatives for commercial purposes, please contact the licensor in the following manner for registration and to apply for written authorization: Email opensource@baichuan-inc.com.
134
+
135
+ [GitHub]:https://github.com/baichuan-inc/Baichuan2
136
+ [Baichuan2]:https://github.com/baichuan-inc/Baichuan2
137
+
138
+ [Baichuan-7B]:https://huggingface.co/baichuan-inc/Baichuan-7B
139
+ [Baichuan2-7B-Base]:https://huggingface.co/baichuan-inc/Baichuan2-7B-Base
140
+ [Baichuan2-7B-Chat]:https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat
141
+ [Baichuan2-7B-Chat-4bits]:https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat-4bits
142
+ [Baichuan-13B-Base]:https://huggingface.co/baichuan-inc/Baichuan-13B-Base
143
+ [Baichuan2-13B-Base]:https://huggingface.co/baichuan-inc/Baichuan2-13B-Base
144
+ [Baichuan2-13B-Chat]:https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat
145
+ [Baichuan2-13B-Chat-4bits]:https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits
146
+
147
+ [通用]:https://github.com/baichuan-inc/Baichuan2#%E9%80%9A%E7%94%A8%E9%A2%86%E5%9F%9F
148
+ [法律]:https://github.com/baichuan-inc/Baichuan2#%E6%B3%95%E5%BE%8B%E5%8C%BB%E7%96%97
149
+ [医疗]:https://github.com/baichuan-inc/Baichuan2#%E6%B3%95%E5%BE%8B%E5%8C%BB%E7%96%97
150
+ [数学]:https://github.com/baichuan-inc/Baichuan2#%E6%95%B0%E5%AD%A6%E4%BB%A3%E7%A0%81
151
+ [代码]:https://github.com/baichuan-inc/Baichuan2#%E6%95%B0%E5%AD%A6%E4%BB%A3%E7%A0%81
152
+ [多语言翻译]:https://github.com/baichuan-inc/Baichuan2#%E5%A4%9A%E8%AF%AD%E8%A8%80%E7%BF%BB%E8%AF%91
153
+
154
+ [《Baichuan 2 模型社区许可协议》]:https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Baichuan%202%E6%A8%A1%E5%9E%8B%E7%A4%BE%E5%8C%BA%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf
155
+
156
+ [邮件申请]: mailto:opensource@baichuan-inc.com
157
+ [Email]: mailto:opensource@baichuan-inc.com
158
+ [opensource@baichuan-inc.com]: mailto:opensource@baichuan-inc.com
159
+ [训练过程heckpoint下载]: https://huggingface.co/baichuan-inc/Baichuan2-7B-Intermediate-Checkpoints
160
+ [百川智能]: https://www.baichuan-ai.com
config.json ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_from_model_config": true,
3
+ "_name_or_path": "/mntcephfs/data/med/zhanghongbo/yaojishi/cjy/ckpts/huatuo2_13B_v3_final/checkpoint-0-8706/tfmr",
4
+ "architectures": [
5
+ "BaichuanForCausalLM"
6
+ ],
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_baichuan.BaichuanConfig",
9
+ "AutoModelForCausalLM": "modeling_baichuan.BaichuanForCausalLM"
10
+ },
11
+ "bos_token_id": 1,
12
+ "eos_token_id": 2,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 5120,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 13696,
17
+ "model_max_length": 4096,
18
+ "model_type": "baichuan",
19
+ "num_attention_heads": 40,
20
+ "num_hidden_layers": 40,
21
+ "pad_token_id": 0,
22
+ "rms_norm_eps": 1e-06,
23
+ "tie_word_embeddings": false,
24
+ "tokenizer_class": "BaichuanTokenizer",
25
+ "torch_dtype": "bfloat16",
26
+ "transformers_version": "4.30.2",
27
+ "use_cache": true,
28
+ "vocab_size": 125696,
29
+ "z_loss_weight": 0
30
+ }
configuration_baichuan.py ADDED
@@ -0,0 +1,48 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) 2023, Baichuan Intelligent Technology. All rights reserved.
2
+
3
+ from transformers.configuration_utils import PretrainedConfig
4
+
5
+
6
+ class BaichuanConfig(PretrainedConfig):
7
+ model_type = "baichuan"
8
+ keys_to_ignore_at_inference = ["past_key_values"]
9
+
10
+ def __init__(
11
+ self,
12
+ vocab_size=64000,
13
+ hidden_size=5120,
14
+ intermediate_size=13696,
15
+ num_hidden_layers=40,
16
+ num_attention_heads=40,
17
+ hidden_act="silu",
18
+ model_max_length=4096,
19
+ initializer_range=0.02,
20
+ rms_norm_eps=1e-6,
21
+ use_cache=True,
22
+ pad_token_id=0,
23
+ bos_token_id=1,
24
+ eos_token_id=2,
25
+ tie_word_embeddings=False,
26
+ gradient_checkpointing=False,
27
+ z_loss_weight=0,
28
+ **kwargs,
29
+ ):
30
+ self.vocab_size = vocab_size
31
+ self.model_max_length = model_max_length
32
+ self.hidden_size = hidden_size
33
+ self.intermediate_size = intermediate_size
34
+ self.num_hidden_layers = num_hidden_layers
35
+ self.num_attention_heads = num_attention_heads
36
+ self.hidden_act = hidden_act
37
+ self.initializer_range = initializer_range
38
+ self.rms_norm_eps = rms_norm_eps
39
+ self.use_cache = use_cache
40
+ self.z_loss_weight = z_loss_weight
41
+ self.gradient_checkpointing = (gradient_checkpointing,)
42
+ super().__init__(
43
+ pad_token_id=pad_token_id,
44
+ bos_token_id=bos_token_id,
45
+ eos_token_id=eos_token_id,
46
+ tie_word_embeddings=tie_word_embeddings,
47
+ **kwargs,
48
+ )
generation_config.json ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token_id": 1,
3
+ "do_sample": true,
4
+ "eos_token_id": 2,
5
+ "max_new_tokens": 2048,
6
+ "pad_token_id": 0,
7
+ "repetition_penalty": 1.1,
8
+ "temperature": 0.3,
9
+ "top_k": 5,
10
+ "top_p": 0.85,
11
+ "transformers_version": "4.33.1"
12
+ }
generation_utils.py ADDED
@@ -0,0 +1,131 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import List
2
+ from queue import Queue
3
+
4
+ import torch
5
+
6
+
7
+ # def build_chat_input(model, tokenizer, messages: List[dict], max_new_tokens: int=0):
8
+ # def _parse_messages(messages, split_role="user"):
9
+ # system, rounds = "", []
10
+ # round = []
11
+ # for i, message in enumerate(messages):
12
+ # if message["role"] == "system":
13
+ # assert i == 0
14
+ # system = message["content"]
15
+ # continue
16
+ # if message["role"] == split_role and round:
17
+ # rounds.append(round)
18
+ # round = []
19
+ # round.append(message)
20
+ # if round:
21
+ # rounds.append(round)
22
+ # return system, rounds
23
+
24
+ # max_new_tokens = max_new_tokens or model.generation_config.max_new_tokens
25
+ # max_input_tokens = model.config.model_max_length - max_new_tokens
26
+ # system, rounds = _parse_messages(messages, split_role="user")
27
+ # system_tokens = tokenizer.encode(system)
28
+ # max_history_tokens = max_input_tokens - len(system_tokens)
29
+
30
+ # history_tokens = []
31
+ # for round in rounds[::-1]:
32
+ # round_tokens = []
33
+ # for message in round:
34
+ # if message["role"] == "user":
35
+ # round_tokens.append(model.generation_config.user_token_id)
36
+ # else:
37
+ # round_tokens.append(model.generation_config.assistant_token_id)
38
+ # round_tokens.extend(tokenizer.encode(message["content"]))
39
+ # if len(history_tokens) == 0 or len(history_tokens) + len(round_tokens) <= max_history_tokens:
40
+ # history_tokens = round_tokens + history_tokens # concat left
41
+ # if len(history_tokens) < max_history_tokens:
42
+ # continue
43
+ # break
44
+
45
+ # input_tokens = system_tokens + history_tokens
46
+ # if messages[-1]["role"] != "assistant":
47
+ # input_tokens.append(model.generation_config.assistant_token_id)
48
+ # input_tokens = input_tokens[-max_input_tokens:] # truncate left
49
+ # return torch.LongTensor([input_tokens]).to(model.device)
50
+
51
+ # for HuatuoGPT2
52
+ def build_chat_input(model, tokenizer, messages: List[dict], max_new_tokens: int=0):
53
+ def _parse_messages(messages, split_role="user"):
54
+ system, rounds = "", []
55
+ round = []
56
+ for i, message in enumerate(messages):
57
+ # if message["role"] == "system":
58
+ # assert i == 0
59
+ # system = message["content"]
60
+ # continue
61
+ if message["role"] == split_role and round:
62
+ rounds.append(round)
63
+ round = []
64
+ round.append(message)
65
+ if round:
66
+ rounds.append(round)
67
+ return system, rounds
68
+
69
+ max_new_tokens = max_new_tokens or model.generation_config.max_new_tokens
70
+ max_input_tokens = model.config.model_max_length - max_new_tokens
71
+ system, rounds = _parse_messages(messages, split_role="user")
72
+ max_history_tokens = max_input_tokens
73
+ roles = ('<问>:','<答>:')
74
+ sep = '\n'
75
+
76
+ history_tokens = []
77
+ for round in rounds[::-1]:
78
+ round_tokens = []
79
+ for message in round:
80
+ message["content"]
81
+ if message["role"] == "user":
82
+ round_tokens.extend(tokenizer.encode(roles[0]+message["content"]+sep))
83
+ else:
84
+ round_tokens.extend(tokenizer.encode(roles[1]+message["content"]+sep))
85
+ if len(history_tokens) == 0 or len(history_tokens) + len(round_tokens) <= max_history_tokens:
86
+ history_tokens = round_tokens + history_tokens # concat left
87
+ if len(history_tokens) < max_history_tokens:
88
+ continue
89
+ break
90
+
91
+ input_tokens = history_tokens
92
+ if messages[-1]["role"] != "assistant":
93
+ input_tokens.extend(tokenizer.encode(roles[1]))
94
+ # debug
95
+ input_tokens = input_tokens[-max_input_tokens:] # truncate left
96
+ # print(tokenizer.decode(input_tokens),flush=True)
97
+ return torch.LongTensor([input_tokens]).to(model.device)
98
+
99
+
100
+ class TextIterStreamer:
101
+ def __init__(self, tokenizer, skip_prompt=False, skip_special_tokens=False):
102
+ self.tokenizer = tokenizer
103
+ self.skip_prompt = skip_prompt
104
+ self.skip_special_tokens = skip_special_tokens
105
+ self.tokens = []
106
+ self.text_queue = Queue()
107
+ self.next_tokens_are_prompt = True
108
+
109
+ def put(self, value):
110
+ if self.skip_prompt and self.next_tokens_are_prompt:
111
+ self.next_tokens_are_prompt = False
112
+ else:
113
+ if len(value.shape) > 1:
114
+ value = value[0]
115
+ self.tokens.extend(value.tolist())
116
+ self.text_queue.put(
117
+ self.tokenizer.decode(self.tokens, skip_special_tokens=self.skip_special_tokens))
118
+
119
+ def end(self):
120
+ self.text_queue.put(None)
121
+
122
+ def __iter__(self):
123
+ return self
124
+
125
+ def __next__(self):
126
+ value = self.text_queue.get()
127
+ if value is None:
128
+ raise StopIteration()
129
+ else:
130
+ return value
131
+
modeling_baichuan.py ADDED
@@ -0,0 +1,719 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2023 Baichuan Inc. All Rights Reserved.
2
+
3
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
4
+ #
5
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
6
+ # and OPT implementations in this library. It has been modified from its
7
+ # original forms to accommodate minor architectural differences compared
8
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
9
+ #
10
+ # Licensed under the Apache License, Version 2.0 (the "License");
11
+ # you may not use this file except in compliance with the License.
12
+ # You may obtain a copy of the License at
13
+ #
14
+ # http://www.apache.org/licenses/LICENSE-2.0
15
+ #
16
+ # Unless required by applicable law or agreed to in writing, software
17
+ # distributed under the License is distributed on an "AS IS" BASIS,
18
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
19
+ # See the License for the specific language governing permissions and
20
+ # limitations under the License.
21
+
22
+
23
+ from .configuration_baichuan import BaichuanConfig
24
+ from .generation_utils import build_chat_input, TextIterStreamer
25
+
26
+ import math
27
+ from typing import List, Optional, Tuple, Union
28
+ from threading import Thread
29
+
30
+ import torch
31
+ import torch.utils.checkpoint
32
+ from torch import nn
33
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
34
+ from torch.nn import functional as F
35
+ from transformers import PreTrainedModel, PretrainedConfig
36
+ from transformers.activations import ACT2FN
37
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
38
+ from transformers.generation.utils import GenerationConfig
39
+ from transformers.utils import logging, ContextManagers
40
+
41
+ import os
42
+ from contextlib import contextmanager
43
+ logger = logging.get_logger(__name__)
44
+
45
+ try:
46
+ from xformers import ops as xops
47
+ except ImportError:
48
+ xops = None
49
+ logger.warning(
50
+ "Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers\npip install xformers."
51
+ )
52
+
53
+
54
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
55
+ def _make_causal_mask(
56
+ input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
57
+ ):
58
+ """
59
+ Make causal mask used for bi-directional self-attention.
60
+ """
61
+ bsz, tgt_len = input_ids_shape
62
+ mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
63
+ mask_cond = torch.arange(mask.size(-1), device=device)
64
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
65
+ mask = mask.to(dtype)
66
+
67
+ if past_key_values_length > 0:
68
+ mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
69
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
70
+
71
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
72
+ """
73
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
74
+ """
75
+ if len(mask.size()) == 3:
76
+ bsz, src_len, _ = mask.size()
77
+ tgt_len = tgt_len if tgt_len is not None else src_len
78
+ expanded_mask = mask[:,None,:,:].expand(bsz, 1, tgt_len, src_len).to(dtype)
79
+ else:
80
+ bsz, src_len = mask.size()
81
+ tgt_len = tgt_len if tgt_len is not None else src_len
82
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
83
+
84
+ inverted_mask = 1.0 - expanded_mask
85
+
86
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
87
+
88
+
89
+ class RMSNorm(nn.Module):
90
+ def __init__(self, hidden_size, eps=1e-6):
91
+ """
92
+ RMSNorm is equivalent to T5LayerNorm
93
+ """
94
+ super().__init__()
95
+ self.weight = nn.Parameter(torch.ones(hidden_size))
96
+ self.variance_epsilon = eps
97
+
98
+ def forward(self, hidden_states):
99
+ variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
100
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
101
+
102
+ # convert into half-precision if necessary
103
+ if self.weight.dtype in [torch.float16, torch.bfloat16]:
104
+ hidden_states = hidden_states.to(self.weight.dtype)
105
+
106
+ return self.weight * hidden_states
107
+
108
+
109
+ class RotaryEmbedding(torch.nn.Module):
110
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
111
+ super().__init__()
112
+ self.inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
113
+ self.max_seq_len_cached = max_position_embeddings
114
+ t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=torch.float32)
115
+ freqs = torch.outer(t, self.inv_freq)
116
+ emb = torch.cat((freqs, freqs), dim=-1)
117
+ self.cos_cached = emb.cos()[None, None, :, :].to(torch.float32)
118
+ self.sin_cached = emb.sin()[None, None, :, :].to(torch.float32)
119
+ def forward(self, x, seq_len=None):
120
+ # x: [bs, num_attention_heads, seq_len, head_size]
121
+ # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
122
+ if seq_len > self.max_seq_len_cached:
123
+ self.max_seq_len_cached = seq_len
124
+ t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=torch.float32)
125
+ freqs = torch.outer(t, self.inv_freq)
126
+ emb = torch.cat((freqs, freqs), dim=-1)
127
+ self.cos_cached = emb.cos()[None, None, :, :].to(torch.float32).to(x.device)
128
+ self.sin_cached = emb.sin()[None, None, :, :].to(torch.float32).to(x.device)
129
+ elif self.cos_cached.device != x.device:
130
+ self.cos_cached = self.cos_cached.to(x.device)
131
+ self.sin_cached = self.sin_cached.to(x.device)
132
+ return (
133
+ self.cos_cached[:, :, :seq_len, ...],
134
+ self.sin_cached[:, :, :seq_len, ...],
135
+ )
136
+
137
+
138
+ def rotate_half(x):
139
+ """Rotates half the hidden dims of the input."""
140
+ x1 = x[..., : x.shape[-1] // 2]
141
+ x2 = x[..., x.shape[-1] // 2:]
142
+ return torch.cat((-x2, x1), dim=-1)
143
+
144
+
145
+ def apply_rotary_pos_emb(q, k, cos_, sin_, position_ids):
146
+ cos = cos_.squeeze(1).squeeze(0) # [seq_len, dim]
147
+ sin = sin_.squeeze(1).squeeze(0) # [seq_len, dim]
148
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
149
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
150
+ q_embed = (q.float() * cos) + (rotate_half(q.float()) * sin)
151
+ k_embed = (k.float() * cos) + (rotate_half(k.float()) * sin)
152
+ return q_embed.to(q.dtype), k_embed.to(k.dtype)
153
+
154
+
155
+ class MLP(nn.Module):
156
+ def __init__(
157
+ self,
158
+ hidden_size: int,
159
+ intermediate_size: int,
160
+ hidden_act: str,
161
+ ):
162
+ super().__init__()
163
+ self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
164
+ self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
165
+ self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
166
+ self.act_fn = ACT2FN[hidden_act]
167
+
168
+ def forward(self, x):
169
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
170
+
171
+
172
+ class Attention(nn.Module):
173
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
174
+ def __init__(self, config: BaichuanConfig):
175
+ super().__init__()
176
+ self.config = config
177
+ self.hidden_size = config.hidden_size
178
+ self.num_heads = config.num_attention_heads
179
+ self.head_dim = self.hidden_size // self.num_heads
180
+ self.max_position_embeddings = config.max_position_embeddings
181
+
182
+ if (self.head_dim * self.num_heads) != self.hidden_size:
183
+ raise ValueError(
184
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
185
+ f" and `num_heads`: {self.num_heads})."
186
+ )
187
+ self.W_pack = nn.Linear(self.hidden_size, 3 * self.hidden_size, bias=False)
188
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
189
+ self.rotary_emb = RotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
190
+
191
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
192
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
193
+
194
+ def forward(
195
+ self,
196
+ hidden_states: torch.Tensor,
197
+ attention_mask: Optional[torch.Tensor] = None,
198
+ position_ids: Optional[torch.LongTensor] = None,
199
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
200
+ output_attentions: bool = False,
201
+ use_cache: bool = False,
202
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
203
+ bsz, q_len, _ = hidden_states.size()
204
+
205
+ proj = self.W_pack(hidden_states)
206
+ proj = proj.unflatten(-1, (3, self.hidden_size)).unsqueeze(0).transpose(0, -2).squeeze(-2)
207
+ query_states = proj[0].view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
208
+ key_states = proj[1].view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
209
+ value_states = proj[2].view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
210
+
211
+ kv_seq_len = key_states.shape[-2]
212
+ if past_key_value is not None:
213
+ kv_seq_len += past_key_value[0].shape[-2]
214
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
215
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
216
+ # [bsz, nh, t, hd]
217
+
218
+ if past_key_value is not None:
219
+ # reuse k, v, self_attention
220
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
221
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
222
+
223
+ past_key_value = (key_states, value_states) if use_cache else None
224
+ if xops is not None and self.training:
225
+ attn_weights = None
226
+ query_states = query_states.transpose(1, 2)
227
+ key_states = key_states.transpose(1, 2)
228
+ value_states = value_states.transpose(1, 2)
229
+ attn_output = xops.memory_efficient_attention(
230
+ query_states, key_states, value_states, attn_bias=xops.LowerTriangularMask()
231
+ )
232
+ else:
233
+ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
234
+ attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = attention_mask)
235
+ attn_output = attn_output.transpose(1, 2)
236
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
237
+ attn_output = self.o_proj(attn_output)
238
+
239
+ if not output_attentions:
240
+ attn_weights = None
241
+
242
+ return attn_output, attn_weights, past_key_value
243
+
244
+
245
+ class DecoderLayer(nn.Module):
246
+ def __init__(self, config: BaichuanConfig):
247
+ super().__init__()
248
+ self.hidden_size = config.hidden_size
249
+ self.self_attn = Attention(config=config)
250
+ self.mlp = MLP(
251
+ hidden_size=self.hidden_size,
252
+ intermediate_size=config.intermediate_size,
253
+ hidden_act=config.hidden_act,
254
+ )
255
+ self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
256
+ self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
257
+
258
+ def forward(
259
+ self,
260
+ hidden_states: torch.Tensor,
261
+ attention_mask: Optional[torch.Tensor] = None,
262
+ position_ids: Optional[torch.LongTensor] = None,
263
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
264
+ output_attentions: Optional[bool] = False,
265
+ use_cache: Optional[bool] = False,
266
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
267
+
268
+ residual = hidden_states
269
+
270
+ hidden_states = self.input_layernorm(hidden_states)
271
+
272
+ # Self Attention
273
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
274
+ hidden_states=hidden_states,
275
+ attention_mask=attention_mask,
276
+ position_ids=position_ids,
277
+ past_key_value=past_key_value,
278
+ output_attentions=output_attentions,
279
+ use_cache=use_cache,
280
+ )
281
+ hidden_states = residual + hidden_states
282
+
283
+ # Fully Connected
284
+ residual = hidden_states
285
+ hidden_states = self.post_attention_layernorm(hidden_states)
286
+ hidden_states = self.mlp(hidden_states)
287
+ hidden_states = residual + hidden_states
288
+
289
+ outputs = (hidden_states,)
290
+
291
+ if output_attentions:
292
+ outputs += (self_attn_weights,)
293
+
294
+ if use_cache:
295
+ outputs += (present_key_value,)
296
+
297
+ return outputs
298
+
299
+
300
+ class BaichuanPreTrainedModel(PreTrainedModel):
301
+ config_class = BaichuanConfig
302
+ base_model_prefix = "model"
303
+ supports_gradient_checkpointing = True
304
+ _no_split_modules = ["DecoderLayer"]
305
+ _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
306
+
307
+ def _init_weights(self, module):
308
+ std = self.config.initializer_range
309
+ if isinstance(module, nn.Linear):
310
+ module.weight.data.normal_(mean=0.0, std=std)
311
+ if module.bias is not None:
312
+ module.bias.data.zero_()
313
+ elif isinstance(module, nn.Embedding):
314
+ module.weight.data.normal_(mean=0.0, std=std)
315
+ if module.padding_idx is not None:
316
+ module.weight.data[module.padding_idx].zero_()
317
+
318
+ def _set_gradient_checkpointing(self, module, value=False):
319
+ if isinstance(module, BaichuanModel):
320
+ module.gradient_checkpointing = value
321
+
322
+
323
+ class BaichuanModel(BaichuanPreTrainedModel):
324
+ def __init__(self, config: BaichuanConfig):
325
+ super().__init__(config)
326
+ self.padding_idx = config.pad_token_id
327
+ self.vocab_size = config.vocab_size
328
+
329
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
330
+ self.layers = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_hidden_layers)])
331
+ self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
332
+
333
+ self.gradient_checkpointing = False
334
+ # Initialize weights and apply final processing
335
+ self.post_init()
336
+
337
+ def get_input_embeddings(self):
338
+ return self.embed_tokens
339
+
340
+ def set_input_embeddings(self, value):
341
+ self.embed_tokens = value
342
+
343
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
344
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
345
+ # create causal mask
346
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
347
+ combined_attention_mask = None
348
+ if input_shape[-1] > 1:
349
+ combined_attention_mask = _make_causal_mask(
350
+ input_shape,
351
+ inputs_embeds.dtype,
352
+ device=inputs_embeds.device,
353
+ past_key_values_length=past_key_values_length,
354
+ )
355
+
356
+ if attention_mask is not None:
357
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
358
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
359
+ inputs_embeds.device
360
+ )
361
+ combined_attention_mask = (
362
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
363
+ )
364
+
365
+ return combined_attention_mask
366
+
367
+ def forward(
368
+ self,
369
+ input_ids: torch.LongTensor = None,
370
+ attention_mask: Optional[torch.Tensor] = None,
371
+ position_ids: Optional[torch.LongTensor] = None,
372
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
373
+ inputs_embeds: Optional[torch.FloatTensor] = None,
374
+ use_cache: Optional[bool] = None,
375
+ output_attentions: Optional[bool] = None,
376
+ output_hidden_states: Optional[bool] = None,
377
+ return_dict: Optional[bool] = None,
378
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
379
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
380
+ output_hidden_states = (
381
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
382
+ )
383
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
384
+
385
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
386
+
387
+ # retrieve input_ids and inputs_embeds
388
+ if input_ids is not None and inputs_embeds is not None:
389
+ raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
390
+ elif input_ids is not None:
391
+ batch_size, seq_length = input_ids.shape
392
+ elif inputs_embeds is not None:
393
+ batch_size, seq_length, _ = inputs_embeds.shape
394
+ else:
395
+ raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
396
+
397
+ seq_length_with_past = seq_length
398
+ past_key_values_length = 0
399
+
400
+ if past_key_values is not None:
401
+ past_key_values_length = past_key_values[0][0].shape[2]
402
+ seq_length_with_past = seq_length_with_past + past_key_values_length
403
+
404
+ if position_ids is None:
405
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
406
+ position_ids = torch.arange(
407
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
408
+ )
409
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
410
+ else:
411
+ position_ids = position_ids.view(-1, seq_length).long()
412
+
413
+ if inputs_embeds is None:
414
+ inputs_embeds = self.embed_tokens(input_ids)
415
+ # embed positions
416
+ if attention_mask is None:
417
+ attention_mask = torch.ones(
418
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
419
+ )
420
+ attention_mask = self._prepare_decoder_attention_mask(
421
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
422
+ )
423
+
424
+ hidden_states = inputs_embeds
425
+
426
+ if self.gradient_checkpointing and self.training:
427
+ if use_cache:
428
+ logger.warning_once(
429
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
430
+ )
431
+ use_cache = False
432
+
433
+ # decoder layers
434
+ all_hidden_states = () if output_hidden_states else None
435
+ all_self_attns = () if output_attentions else None
436
+ next_decoder_cache = () if use_cache else None
437
+
438
+ for idx, decoder_layer in enumerate(self.layers):
439
+ if output_hidden_states:
440
+ all_hidden_states += (hidden_states,)
441
+
442
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
443
+
444
+ if self.gradient_checkpointing and self.training:
445
+
446
+ def create_custom_forward(module):
447
+ def custom_forward(*inputs):
448
+ # None for past_key_value
449
+ return module(*inputs, output_attentions, None)
450
+
451
+ return custom_forward
452
+
453
+ layer_outputs = torch.utils.checkpoint.checkpoint(
454
+ create_custom_forward(decoder_layer),
455
+ hidden_states,
456
+ attention_mask,
457
+ position_ids,
458
+ None,
459
+ )
460
+ else:
461
+ layer_outputs = decoder_layer(
462
+ hidden_states,
463
+ attention_mask=attention_mask,
464
+ position_ids=position_ids,
465
+ past_key_value=past_key_value,
466
+ output_attentions=output_attentions,
467
+ use_cache=use_cache,
468
+ )
469
+
470
+ hidden_states = layer_outputs[0]
471
+
472
+ if use_cache:
473
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
474
+
475
+ if output_attentions:
476
+ all_self_attns += (layer_outputs[1],)
477
+
478
+ hidden_states = self.norm(hidden_states)
479
+
480
+ # add hidden states from the last decoder layer
481
+ if output_hidden_states:
482
+ all_hidden_states += (hidden_states,)
483
+
484
+ next_cache = next_decoder_cache if use_cache else None
485
+ if not return_dict:
486
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
487
+ return BaseModelOutputWithPast(
488
+ last_hidden_state=hidden_states,
489
+ past_key_values=next_cache,
490
+ hidden_states=all_hidden_states,
491
+ attentions=all_self_attns,
492
+ )
493
+
494
+
495
+ class NormHead(nn.Module):
496
+ def __init__(self, hidden_size, vocab_size, bias=False):
497
+ super().__init__()
498
+ self.weight = nn.Parameter(torch.empty((vocab_size, hidden_size)))
499
+ nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
500
+ self.first_flag = True
501
+
502
+ def forward(self, hidden_states):
503
+ if self.training:
504
+ norm_weight = nn.functional.normalize(self.weight)
505
+ self.first_flag = True
506
+ elif self.first_flag:
507
+ self.first_flag = False
508
+ self.weight.data = nn.functional.normalize(self.weight)
509
+ norm_weight = self.weight
510
+ else:
511
+ norm_weight = self.weight
512
+ return nn.functional.linear(hidden_states, norm_weight)
513
+
514
+ _init_weights = True
515
+ @contextmanager
516
+ def no_init_weights(_enable=True):
517
+ global _init_weights
518
+ old_init_weights = _init_weights
519
+ if _enable:
520
+ _init_weights = False
521
+ try:
522
+ yield
523
+ finally:
524
+ _init_weights = old_init_weights
525
+
526
+ class BaichuanForCausalLM(BaichuanPreTrainedModel):
527
+ def __init__(self, config, *model_args, **model_kwargs):
528
+ super().__init__(config, *model_args, **model_kwargs)
529
+ self.model = BaichuanModel(config)
530
+
531
+ self.lm_head = NormHead(config.hidden_size, config.vocab_size, bias=False)
532
+ if hasattr(config, "quantization_config") and isinstance(config.quantization_config, dict) and config.quantization_config.get('load_in_4bit', False):
533
+ try:
534
+ from .quantizer import quantize_offline, init_model_weight_int4
535
+ except ImportError:
536
+ raise ImportError(f"Needs QLinear to run quantize.")
537
+ quantize_offline(self, 4)
538
+ # Initialize weights and apply final processing
539
+ self.post_init()
540
+
541
+ def get_input_embeddings(self):
542
+ return self.model.embed_tokens
543
+
544
+ def set_input_embeddings(self, value):
545
+ self.model.embed_tokens = value
546
+
547
+ def get_output_embeddings(self):
548
+ return self.lm_head
549
+
550
+ def set_output_embeddings(self, new_embeddings):
551
+ self.lm_head = new_embeddings
552
+
553
+ def set_decoder(self, decoder):
554
+ self.model = decoder
555
+
556
+ def get_decoder(self):
557
+ return self.model
558
+
559
+ @classmethod
560
+ def from_pretrained(
561
+ cls,
562
+ pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
563
+ *model_args,
564
+ config: Optional[Union[PretrainedConfig, str, os.PathLike]] = None,
565
+ cache_dir: Optional[Union[str, os.PathLike]] = None,
566
+ ignore_mismatched_sizes: bool = False,
567
+ force_download: bool = False,
568
+ local_files_only: bool = False,
569
+ token: Optional[Union[str, bool]] = None,
570
+ revision: str = "main",
571
+ use_safetensors: bool = None,
572
+ **kwargs,
573
+ ):
574
+ # Load config if we don't provide a configuration
575
+ if not isinstance(config, PretrainedConfig):
576
+ config_path = config if config is not None else pretrained_model_name_or_path
577
+ config, model_kwargs = cls.config_class.from_pretrained(
578
+ config_path,
579
+ cache_dir=cache_dir,
580
+ return_unused_kwargs=True,
581
+ force_download=force_download,
582
+ resume_download=False,
583
+ proxies=None,
584
+ local_files_only=local_files_only,
585
+ token=token,
586
+ revision=revision,
587
+ subfolder="",
588
+ _from_auto=False,
589
+ _from_pipeline=None,
590
+ **kwargs,
591
+ )
592
+ else:
593
+ model_kwargs = kwargs
594
+ return super(BaichuanForCausalLM, cls).from_pretrained(pretrained_model_name_or_path, *model_args,
595
+ config=config, cache_dir=cache_dir, ignore_mismatched_sizes=ignore_mismatched_sizes,
596
+ force_download=force_download, local_files_only=local_files_only, token=token, revision=revision,
597
+ use_safetensors=use_safetensors, **kwargs)
598
+
599
+ def forward(
600
+ self,
601
+ input_ids: torch.LongTensor = None,
602
+ attention_mask: Optional[torch.Tensor] = None,
603
+ position_ids: Optional[torch.LongTensor] = None,
604
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
605
+ inputs_embeds: Optional[torch.FloatTensor] = None,
606
+ labels: Optional[torch.LongTensor] = None,
607
+ use_cache: Optional[bool] = None,
608
+ output_attentions: Optional[bool] = None,
609
+ output_hidden_states: Optional[bool] = None,
610
+ return_dict: Optional[bool] = None,
611
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
612
+
613
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
614
+ output_hidden_states = (
615
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
616
+ )
617
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
618
+
619
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
620
+ outputs = self.model(
621
+ input_ids=input_ids,
622
+ attention_mask=attention_mask,
623
+ position_ids=position_ids,
624
+ past_key_values=past_key_values,
625
+ inputs_embeds=inputs_embeds,
626
+ use_cache=use_cache,
627
+ output_attentions=output_attentions,
628
+ output_hidden_states=output_hidden_states,
629
+ return_dict=return_dict,
630
+ )
631
+
632
+ hidden_states = outputs[0]
633
+ logits = self.lm_head(hidden_states)
634
+ loss = None
635
+ if labels is not None:
636
+ # Shift so that tokens < n predict n
637
+ shift_logits = logits[..., :-1, :].contiguous()
638
+ shift_labels = labels[..., 1:].contiguous()
639
+ # Flatten the tokens
640
+ loss_fct = CrossEntropyLoss()
641
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
642
+ shift_labels = shift_labels.view(-1)
643
+ softmax_normalizer = shift_logits.max(-1).values ** 2
644
+ z_loss = self.config.z_loss_weight * softmax_normalizer.mean()
645
+ # Enable model parallelism
646
+ shift_labels = shift_labels.to(shift_logits.device)
647
+ loss = loss_fct(shift_logits, shift_labels) + z_loss
648
+
649
+ if not return_dict:
650
+ output = (logits,) + outputs[1:]
651
+ return (loss,) + output if loss is not None else output
652
+
653
+ return CausalLMOutputWithPast(
654
+ loss=loss,
655
+ logits=logits,
656
+ past_key_values=outputs.past_key_values,
657
+ hidden_states=outputs.hidden_states,
658
+ attentions=outputs.attentions,
659
+ )
660
+
661
+ def prepare_inputs_for_generation(
662
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
663
+ ):
664
+ if past_key_values:
665
+ input_ids = input_ids[:, -1:]
666
+
667
+ position_ids = kwargs.get("position_ids", None)
668
+ if attention_mask is not None and position_ids is None:
669
+ # create position_ids on the fly for batch generation
670
+ position_ids = attention_mask.long().cumsum(-1) - 1
671
+ position_ids.masked_fill_(attention_mask == 0, 1)
672
+ if past_key_values:
673
+ position_ids = position_ids[:, -1].unsqueeze(-1)
674
+
675
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
676
+ if inputs_embeds is not None and past_key_values is None:
677
+ model_inputs = {"inputs_embeds": inputs_embeds}
678
+ else:
679
+ model_inputs = {"input_ids": input_ids}
680
+
681
+ model_inputs.update(
682
+ {
683
+ "position_ids": position_ids,
684
+ "past_key_values": past_key_values,
685
+ "use_cache": kwargs.get("use_cache"),
686
+ "attention_mask": attention_mask,
687
+ }
688
+ )
689
+ return model_inputs
690
+
691
+ @staticmethod
692
+ def _reorder_cache(past_key_values, beam_idx):
693
+ reordered_past = ()
694
+ for layer_past in past_key_values:
695
+ reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
696
+ return reordered_past
697
+
698
+ def quantize(self, bits: int):
699
+ try:
700
+ from .quantizer import quantize_online
701
+ except ImportError:
702
+ raise ImportError(f"Needs QLinear to run quantize.")
703
+ return quantize_online(self, bits)
704
+
705
+ def chat(self, tokenizer, messages: List[dict], stream=False,
706
+ generation_config: Optional[GenerationConfig]=None):
707
+ generation_config = generation_config or self.generation_config
708
+ input_ids = build_chat_input(self, tokenizer, messages, generation_config.max_new_tokens)
709
+ if stream:
710
+ streamer = TextIterStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
711
+ Thread(target=self.generate, kwargs=dict(
712
+ inputs=input_ids, streamer=streamer,
713
+ generation_config=generation_config,
714
+ )).start()
715
+ return streamer
716
+ else:
717
+ outputs = self.generate(input_ids, generation_config=generation_config)
718
+ response = tokenizer.decode(outputs[0][len(input_ids[0]):], skip_special_tokens=True)
719
+ return response
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f1dd2bce012e02d96f934ba84f0a90a35e51792a9aa397bb5816ed04f92846d9
3
+ size 29080502643
quantizer.py ADDED
@@ -0,0 +1,211 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import bitsandbytes as bnb
2
+ from accelerate import init_empty_weights
3
+ from bitsandbytes.nn.modules import Params4bit, Int8Params
4
+ import torch
5
+
6
+ def Params4bitCuda(self, device):
7
+ self.data = self.data.cuda(device)
8
+ self.quant_state[0] = self.quant_state[0].cuda(device)
9
+ self.quant_state[4][0] = self.quant_state[4][0].cuda(device)
10
+ self.quant_state[4][1][0] = self.quant_state[4][1][0].cuda(device)
11
+ self.quant_state[4][1][1] = self.quant_state[4][1][1].cuda(device)
12
+
13
+ self.quant_state[6] = self.quant_state[6].cuda(device)
14
+ return self
15
+
16
+ class Linear4bitOnline(torch.nn.Module):
17
+ def __init__(self, weight, bias, quant_type):
18
+ super().__init__()
19
+ self.weight = Params4bit(
20
+ weight.data, requires_grad=False, compress_statistics=True, quant_type=quant_type
21
+ )
22
+ self.compute_dtype = None
23
+ #self.weight.cuda(weight.device)
24
+ self.bias = bias
25
+
26
+ def forward(self, x: torch.Tensor):
27
+ # weights are cast automatically as Int8Params, but the bias has to be cast manually
28
+ if self.bias is not None and self.bias.dtype != x.dtype:
29
+ self.bias.data = self.bias.data.to(x.dtype)
30
+
31
+ if getattr(self.weight, "quant_state", None) is None:
32
+ print(
33
+ "FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first."
34
+ )
35
+ inp_dtype = x.dtype
36
+ if self.compute_dtype is not None:
37
+ x = x.to(self.compute_dtype)
38
+
39
+ bias = None if self.bias is None else self.bias.to(self.compute_dtype)
40
+ out = bnb.matmul_4bit(
41
+ x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state
42
+ )
43
+
44
+ out = out.to(inp_dtype)
45
+
46
+ return out
47
+
48
+ class Linear8bitLtOnline(torch.nn.Module):
49
+ def __init__(
50
+ self,
51
+ weight,
52
+ bias,
53
+ has_fp16_weights=True,
54
+ memory_efficient_backward=False,
55
+ threshold=0.0,
56
+ index=None,
57
+ ):
58
+ super().__init__()
59
+ assert (
60
+ not memory_efficient_backward
61
+ ), "memory_efficient_backward is no longer required and the argument is deprecated in 0.37.0 and will be removed in 0.39.0"
62
+ self.state = bnb.MatmulLtState()
63
+ self.index = index
64
+
65
+ # Necessary for stacked layers
66
+ self.state.threshold = threshold
67
+ self.state.has_fp16_weights = has_fp16_weights
68
+ self.state.memory_efficient_backward = memory_efficient_backward
69
+ if threshold > 0.0 and not has_fp16_weights:
70
+ self.state.use_pool = True
71
+
72
+ self.weight = Int8Params(
73
+ weight.data,
74
+ has_fp16_weights=has_fp16_weights,
75
+ requires_grad=has_fp16_weights,
76
+ )
77
+ self.bias = bias
78
+
79
+ def init_8bit_state(self):
80
+ self.state.CB = self.weight.CB
81
+ self.state.SCB = self.weight.SCB
82
+ self.weight.CB = None
83
+ self.weight.SCB = None
84
+
85
+ def forward(self, x: torch.Tensor):
86
+ self.state.is_training = self.training
87
+ if self.weight.CB is not None:
88
+ self.init_8bit_state()
89
+
90
+ # weights are cast automatically as Int8Params, but the bias has to be cast manually
91
+ if self.bias is not None and self.bias.dtype != x.dtype:
92
+ self.bias.data = self.bias.data.to(x.dtype)
93
+
94
+ out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
95
+
96
+ if not self.state.has_fp16_weights:
97
+ if self.state.CB is not None and self.state.CxB is not None:
98
+ # we converted 8-bit row major to turing/ampere format in the first inference pass
99
+ # we no longer need the row-major weight
100
+ del self.state.CB
101
+ self.weight.data = self.state.CxB
102
+ return out
103
+
104
+ def quantize_offline(model, bits: int):
105
+ assert (bits == 4), f'bits: {bits} is not supported'
106
+
107
+ for i, layer in enumerate(model.model.layers):
108
+ layer.self_attn.W_pack = bnb.nn.Linear4bit(
109
+ layer.self_attn.W_pack.weight.shape[1],
110
+ layer.self_attn.W_pack.weight.shape[0],
111
+ False,
112
+ torch.float16,
113
+ compress_statistics=True,
114
+ quant_type="nf4",
115
+ )
116
+ layer.self_attn.o_proj = bnb.nn.Linear4bit(
117
+ layer.self_attn.o_proj.weight.shape[1],
118
+ layer.self_attn.o_proj.weight.shape[0],
119
+ False,
120
+ torch.float16,
121
+ compress_statistics=True,
122
+ quant_type="nf4",
123
+ )
124
+
125
+ layer.mlp.gate_proj = bnb.nn.Linear4bit(
126
+ layer.mlp.gate_proj.weight.shape[1],
127
+ layer.mlp.gate_proj.weight.shape[0],
128
+ False,
129
+ torch.float16,
130
+ compress_statistics=True,
131
+ quant_type="nf4",
132
+ )
133
+ layer.mlp.down_proj = bnb.nn.Linear4bit(
134
+ layer.mlp.down_proj.weight.shape[1],
135
+ layer.mlp.down_proj.weight.shape[0],
136
+ False,
137
+ torch.float16,
138
+ compress_statistics=True,
139
+ quant_type="nf4",
140
+ )
141
+ layer.mlp.up_proj = bnb.nn.Linear4bit(
142
+ layer.mlp.up_proj.weight.shape[1],
143
+ layer.mlp.up_proj.weight.shape[0],
144
+ False,
145
+ torch.float16,
146
+ compress_statistics=True,
147
+ quant_type="nf4",
148
+ )
149
+ return model
150
+
151
+ def quantize_online(model, bits: int):
152
+ def quant(weight, bias=None):
153
+ if bits == 8:
154
+ linear = Linear8bitLtOnline(
155
+ weight,
156
+ bias,
157
+ has_fp16_weights=False,
158
+ threshold=6.0,
159
+ )
160
+ if bias is not None:
161
+ linear.bias = torch.nn.Parameter(bias)
162
+ elif bits == 4:
163
+ linear = Linear4bitOnline(
164
+ weight,
165
+ bias,
166
+ quant_type="nf4", #fp4/nf4
167
+ )
168
+ else:
169
+ raise ValueError("quantize only support 4/8 bit")
170
+ return linear
171
+
172
+ for i, layer in enumerate(model.model.layers):
173
+ layer.self_attn.W_pack = quant(layer.self_attn.W_pack.weight)
174
+ layer.self_attn.o_proj = quant(layer.self_attn.o_proj.weight)
175
+ layer.mlp.gate_proj = quant(layer.mlp.gate_proj.weight)
176
+ layer.mlp.down_proj = quant(layer.mlp.down_proj.weight)
177
+ layer.mlp.up_proj = quant(layer.mlp.up_proj.weight)
178
+ return model
179
+
180
+ def init_model_weight_int4(config, model, state_dict):
181
+ #replace Params4bit.cuda with Params4bitCuda
182
+ Params4bit.cuda = Params4bitCuda
183
+
184
+ for i in range(config.num_hidden_layers):
185
+ weight_data = state_dict[f'model.layers.{i}.self_attn.W_pack.weight.data']
186
+ weight_quant_state = state_dict[f'model.layers.{i}.self_attn.W_pack.weight.quant_state']
187
+ model.model.layers[i].self_attn.W_pack.weight = Params4bit(weight_data, requires_grad=False, quant_state=weight_quant_state)
188
+
189
+ weight_data = state_dict[f'model.layers.{i}.self_attn.o_proj.weight.data']
190
+ weight_quant_state = state_dict[f'model.layers.{i}.self_attn.o_proj.weight.quant_state']
191
+ model.model.layers[i].self_attn.o_proj.weight = Params4bit(weight_data, requires_grad=False, quant_state=weight_quant_state)
192
+
193
+ weight_data = state_dict[f'model.layers.{i}.mlp.gate_proj.weight.data']
194
+ weight_quant_state = state_dict[f'model.layers.{i}.mlp.gate_proj.weight.quant_state']
195
+ model.model.layers[i].mlp.gate_proj.weight = Params4bit(weight_data, requires_grad=False, quant_state=weight_quant_state)
196
+
197
+ weight_data = state_dict[f'model.layers.{i}.mlp.up_proj.weight.data']
198
+ weight_quant_state = state_dict[f'model.layers.{i}.mlp.up_proj.weight.quant_state']
199
+ model.model.layers[i].mlp.up_proj.weight = Params4bit(weight_data, requires_grad=False, quant_state=weight_quant_state)
200
+
201
+ weight_data = state_dict[f'model.layers.{i}.mlp.down_proj.weight.data']
202
+ weight_quant_state = state_dict[f'model.layers.{i}.mlp.down_proj.weight.quant_state']
203
+ model.model.layers[i].mlp.down_proj.weight = Params4bit(weight_data, requires_grad=False, quant_state=weight_quant_state)
204
+
205
+ model.model.layers[i].input_layernorm.weight = state_dict[f'model.layers.{i}.input_layernorm.weight']
206
+ model.model.layers[i].post_attention_layernorm.weight = state_dict[f'model.layers.{i}.post_attention_layernorm.weight']
207
+
208
+ model.model.embed_tokens.weight = state_dict['model.embed_tokens.weight']
209
+ model.model.norm.weight = state_dict['model.norm.weight']
210
+ model.lm_head.weight = state_dict['lm_head.weight']
211
+ return model
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": true
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": true
15
+ },
16
+ "pad_token": {
17
+ "content": "<unk>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": true
22
+ },
23
+ "unk_token": {
24
+ "content": "<unk>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": true
29
+ }
30
+ }
tokenization_baichuan.py ADDED
@@ -0,0 +1,258 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) 2023, Baichuan Intelligent Technology. All rights reserved.
2
+
3
+ import os
4
+ from shutil import copyfile
5
+ from typing import Any, Dict, List, Optional, Tuple
6
+
7
+ import sentencepiece as spm
8
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
9
+ from transformers.utils import logging
10
+
11
+
12
+ logger = logging.get_logger(__name__)
13
+
14
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
15
+
16
+ PRETRAINED_VOCAB_FILES_MAP = {
17
+ "vocab_file": {},
18
+ "tokenizer_file": {},
19
+ }
20
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
21
+
22
+
23
+ class BaichuanTokenizer(PreTrainedTokenizer):
24
+ """
25
+ Construct a Baichuan tokenizer. Based on byte-level Byte-Pair-Encoding.
26
+
27
+ Args:
28
+ vocab_file (`str`):
29
+ Path to the vocabulary file.
30
+ """
31
+
32
+ vocab_files_names = VOCAB_FILES_NAMES
33
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
34
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
35
+ model_input_names = ["input_ids", "attention_mask"]
36
+
37
+ def __init__(
38
+ self,
39
+ vocab_file,
40
+ unk_token="<unk>",
41
+ bos_token="<s>",
42
+ eos_token="</s>",
43
+ pad_token=None,
44
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
45
+ add_bos_token=True,
46
+ add_eos_token=False,
47
+ clean_up_tokenization_spaces=False,
48
+ **kwargs,
49
+ ):
50
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
51
+ bos_token = (
52
+ AddedToken(bos_token, lstrip=False, rstrip=False)
53
+ if isinstance(bos_token, str)
54
+ else bos_token
55
+ )
56
+ eos_token = (
57
+ AddedToken(eos_token, lstrip=False, rstrip=False)
58
+ if isinstance(eos_token, str)
59
+ else eos_token
60
+ )
61
+ unk_token = (
62
+ AddedToken(unk_token, lstrip=False, rstrip=False)
63
+ if isinstance(unk_token, str)
64
+ else unk_token
65
+ )
66
+ pad_token = (
67
+ AddedToken(pad_token, lstrip=False, rstrip=False)
68
+ if isinstance(pad_token, str)
69
+ else pad_token
70
+ )
71
+ super().__init__(
72
+ bos_token=bos_token,
73
+ eos_token=eos_token,
74
+ unk_token=unk_token,
75
+ pad_token=pad_token,
76
+ add_bos_token=add_bos_token,
77
+ add_eos_token=add_eos_token,
78
+ sp_model_kwargs=self.sp_model_kwargs,
79
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
80
+ **kwargs,
81
+ )
82
+ self.vocab_file = vocab_file
83
+ self.add_bos_token = add_bos_token
84
+ self.add_eos_token = add_eos_token
85
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
86
+ self.sp_model.Load(vocab_file)
87
+
88
+ def __getstate__(self):
89
+ state = self.__dict__.copy()
90
+ state["sp_model"] = None
91
+ return state
92
+
93
+ def __setstate__(self, d):
94
+ self.__dict__ = d
95
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
96
+ self.sp_model.Load(self.vocab_file)
97
+
98
+ @property
99
+ def vocab_size(self):
100
+ """Returns vocab size"""
101
+ return self.sp_model.get_piece_size()
102
+
103
+ def get_vocab(self):
104
+ """Returns vocab as a dict"""
105
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
106
+ vocab.update(self.added_tokens_encoder)
107
+ return vocab
108
+
109
+ def _tokenize(self, text):
110
+ """Returns a tokenized string."""
111
+ return self.sp_model.encode(text, out_type=str)
112
+
113
+ def _convert_token_to_id(self, token):
114
+ """Converts a token (str) in an id using the vocab."""
115
+ return self.sp_model.piece_to_id(token)
116
+
117
+ def _convert_id_to_token(self, index):
118
+ """Converts an index (integer) in a token (str) using the vocab."""
119
+ token = self.sp_model.IdToPiece(index)
120
+ return token
121
+
122
+ def convert_tokens_to_string(self, tokens):
123
+ """Converts a sequence of tokens (string) in a single string."""
124
+ current_sub_tokens = []
125
+ out_string = ""
126
+ prev_is_special = False
127
+ for i, token in enumerate(tokens):
128
+ # make sure that special tokens are not decoded using sentencepiece model
129
+ if token in self.all_special_tokens:
130
+ if not prev_is_special and i != 0:
131
+ out_string += " "
132
+ out_string += self.sp_model.decode(current_sub_tokens) + token
133
+ prev_is_special = True
134
+ current_sub_tokens = []
135
+ else:
136
+ current_sub_tokens.append(token)
137
+ prev_is_special = False
138
+ out_string += self.sp_model.decode(current_sub_tokens)
139
+ return out_string
140
+
141
+ def save_vocabulary(
142
+ self, save_directory, filename_prefix: Optional[str] = None
143
+ ) -> Tuple[str]:
144
+ """
145
+ Save the vocabulary and special tokens file to a directory.
146
+
147
+ Args:
148
+ save_directory (`str`):
149
+ The directory in which to save the vocabulary.
150
+
151
+ Returns:
152
+ `Tuple(str)`: Paths to the files saved.
153
+ """
154
+ if not os.path.isdir(save_directory):
155
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
156
+ return
157
+ out_vocab_file = os.path.join(
158
+ save_directory,
159
+ (filename_prefix + "-" if filename_prefix else "")
160
+ + VOCAB_FILES_NAMES["vocab_file"],
161
+ )
162
+
163
+ if os.path.abspath(self.vocab_file) != os.path.abspath(
164
+ out_vocab_file
165
+ ) and os.path.isfile(self.vocab_file):
166
+ copyfile(self.vocab_file, out_vocab_file)
167
+ elif not os.path.isfile(self.vocab_file):
168
+ with open(out_vocab_file, "wb") as fi:
169
+ content_spiece_model = self.sp_model.serialized_model_proto()
170
+ fi.write(content_spiece_model)
171
+
172
+ return (out_vocab_file,)
173
+
174
+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
175
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
176
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
177
+
178
+ output = bos_token_id + token_ids_0 + eos_token_id
179
+
180
+ if token_ids_1 is not None:
181
+ output = output + bos_token_id + token_ids_1 + eos_token_id
182
+
183
+ return output
184
+
185
+ def get_special_tokens_mask(
186
+ self,
187
+ token_ids_0: List[int],
188
+ token_ids_1: Optional[List[int]] = None,
189
+ already_has_special_tokens: bool = False,
190
+ ) -> List[int]:
191
+ """
192
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
193
+ special tokens using the tokenizer `prepare_for_model` method.
194
+
195
+ Args:
196
+ token_ids_0 (`List[int]`):
197
+ List of IDs.
198
+ token_ids_1 (`List[int]`, *optional*):
199
+ Optional second list of IDs for sequence pairs.
200
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
201
+ Whether or not the token list is already formatted with special tokens for the model.
202
+
203
+ Returns:
204
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
205
+ """
206
+ if already_has_special_tokens:
207
+ return super().get_special_tokens_mask(
208
+ token_ids_0=token_ids_0,
209
+ token_ids_1=token_ids_1,
210
+ already_has_special_tokens=True,
211
+ )
212
+
213
+ bos_token_id = [1] if self.add_bos_token else []
214
+ eos_token_id = [1] if self.add_eos_token else []
215
+
216
+ if token_ids_1 is None:
217
+ return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
218
+ return (
219
+ bos_token_id
220
+ + ([0] * len(token_ids_0))
221
+ + eos_token_id
222
+ + bos_token_id
223
+ + ([0] * len(token_ids_1))
224
+ + eos_token_id
225
+ )
226
+
227
+ def create_token_type_ids_from_sequences(
228
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
229
+ ) -> List[int]:
230
+ """
231
+ Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
232
+ sequence pair mask has the following format:
233
+
234
+ ```
235
+ 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
236
+ | first sequence | second sequence |
237
+ ```
238
+
239
+ if token_ids_1 is None, only returns the first portion of the mask (0s).
240
+
241
+ Args:
242
+ token_ids_0 (`List[int]`):
243
+ List of ids.
244
+ token_ids_1 (`List[int]`, *optional*):
245
+ Optional second list of IDs for sequence pairs.
246
+
247
+ Returns:
248
+ `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
249
+ """
250
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
251
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
252
+
253
+ output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
254
+
255
+ if token_ids_1 is not None:
256
+ output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
257
+
258
+ return output
tokenizer.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:79452955be6b419a65984273a9f08af86042e1c2a75ee3ba989cbf620a133cc2
3
+ size 2001107
tokenizer_config.json ADDED
@@ -0,0 +1,47 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "auto_map": {
5
+ "AutoTokenizer": [
6
+ "tokenization_baichuan.BaichuanTokenizer",
7
+ null
8
+ ]
9
+ },
10
+ "bos_token": {
11
+ "__type": "AddedToken",
12
+ "content": "<s>",
13
+ "lstrip": false,
14
+ "normalized": true,
15
+ "rstrip": false,
16
+ "single_word": true
17
+ },
18
+ "clean_up_tokenization_spaces": false,
19
+ "eos_token": {
20
+ "__type": "AddedToken",
21
+ "content": "</s>",
22
+ "lstrip": false,
23
+ "normalized": true,
24
+ "rstrip": false,
25
+ "single_word": true
26
+ },
27
+ "model_max_length": 4096,
28
+ "pad_token": {
29
+ "__type": "AddedToken",
30
+ "content": "<unk>",
31
+ "lstrip": false,
32
+ "normalized": true,
33
+ "rstrip": false,
34
+ "single_word": true
35
+ },
36
+ "padding_side": "left",
37
+ "sp_model_kwargs": {},
38
+ "tokenizer_class": "BaichuanTokenizer",
39
+ "unk_token": {
40
+ "__type": "AddedToken",
41
+ "content": "<unk>",
42
+ "lstrip": false,
43
+ "normalized": true,
44
+ "rstrip": false,
45
+ "single_word": true
46
+ }
47
+ }