bartowski committed on
Commit
1254245
1 Parent(s): 413571f

Quant for 6.5

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -12,69 +12,341 @@ language:
12
  - zh
13
  - ar
14
  license: cc-by-nc-4.0
15
- quantized_by: bartowski
16
- pipeline_tag: text-generation
17
  ---
18
 
19
- ## Exllama v2 Quantizations of c4ai-command-r-v01
20
 
21
- Using <a href="https://github.com/turboderp/exllamav2/releases/tag/v0.0.16">turboderp's ExLlamaV2 v0.0.16</a> for quantization.
22
 
23
- ## The "main" branch only contains the measurement.json, download one of the other branches for the model (see below)
24
 
25
- Each branch contains an individual bits per weight, with the main one containing only the measurement.json for further conversions.
26
 
27
- Conversion was done using the default calibration dataset.
28
 
29
- Default arguments used except when the bits per weight is above 6.0, at that point the lm_head layer is quantized at 8 bits per weight instead of the default 6.
 
 
 
 
30
 
31
- Original model: https://huggingface.co/CohereForAI/c4ai-command-r-v01
32
 
 
 
 
33
 
34
- <a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/8_0">8.0 bits per weight</a>
 
 
35
 
36
- <a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/6_5">6.5 bits per weight</a>
 
 
 
37
 
38
- <a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/5_0">5.0 bits per weight</a>
 
 
 
 
 
39
 
40
- <a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/4_25">4.25 bits per weight</a>
 
 
 
 
41
 
42
- <a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/3_5">3.5 bits per weight</a>
 
 
43
 
 
44
 
45
- ## Download instructions
 
 
46
 
47
- With git:
 
 
 
48
 
49
- ```shell
50
- git clone --single-branch --branch 6_5 https://huggingface.co/bartowski/c4ai-command-r-v01-exl2
 
 
 
 
 
 
 
51
  ```
52
 
53
- With huggingface hub (credit to TheBloke for instructions):
54
 
55
- ```shell
56
- pip3 install huggingface-hub
57
  ```
58
 
59
- To download the `main` (only useful if you only care about measurement.json) branch to a folder called `c4ai-command-r-v01-exl2`:
60
 
61
- ```shell
62
- mkdir c4ai-command-r-v01-exl2
63
- huggingface-cli download bartowski/c4ai-command-r-v01-exl2 --local-dir c4ai-command-r-v01-exl2 --local-dir-use-symlinks False
64
  ```
65
 
66
- To download from a different branch, add the `--revision` parameter:
67
 
68
- Linux:
 
 
69
 
70
- ```shell
71
- mkdir c4ai-command-r-v01-exl2-6_5
72
- huggingface-cli download bartowski/c4ai-command-r-v01-exl2 --revision 6_5 --local-dir c4ai-command-r-v01-exl2-6_5 --local-dir-use-symlinks False
 
73
  ```
74
 
75
- Windows (which apparently doesn't like _ in folders sometimes?):
76
 
77
- ```shell
78
- mkdir c4ai-command-r-v01-exl2-6.5
79
- huggingface-cli download bartowski/c4ai-command-r-v01-exl2 --revision 6_5 --local-dir c4ai-command-r-v01-exl2-6.5 --local-dir-use-symlinks False
80
- ```
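The `huggingface-cli` commands above can also be expressed through the `huggingface_hub` Python API. The following is a minimal sketch, not part of the original instructions: `local_dir_name` and `download_branch` are hypothetical helper names, and the underscore-to-dot substitution simply mirrors the Windows folder naming used above.

```python
def local_dir_name(repo_id: str, revision: str, windows: bool = False) -> str:
    """Build a folder name such as 'c4ai-command-r-v01-exl2-6_5' from repo and branch."""
    model_name = repo_id.split("/")[-1]
    # Mirror the README: on Windows, use '6.5' instead of '6_5' in the folder name.
    suffix = revision.replace("_", ".") if windows else revision
    return f"{model_name}-{suffix}"


def download_branch(repo_id: str, revision: str, windows: bool = False) -> str:
    """Download one quantization branch and return the local path."""
    # Imported lazily so local_dir_name works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        revision=revision,
        local_dir=local_dir_name(repo_id, revision, windows),
    )


# Example (requires network access and `pip3 install huggingface-hub`):
# download_branch("bartowski/c4ai-command-r-v01-exl2", "6_5")
```

Keeping the folder-name logic separate from the download call makes it possible to check the naming without touching the network.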
 
12
  - zh
13
  - ar
14
  license: cc-by-nc-4.0
 
 
15
  ---
16
 
17
+ # Model Card for C4AI Command-R
18
 
19
+ 🚨 **This model is the non-quantized version of C4AI Command-R. You can find the quantized version of C4AI Command-R using bitsandbytes [here](https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit)**.
20
 
21
+ ## Model Summary
22
 
23
+ C4AI Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities.
24
 
25
+ Developed by: Cohere and [Cohere For AI](https://cohere.for.ai)
26
 
27
+ - Point of Contact: Cohere For AI: [cohere.for.ai](https://cohere.for.ai/)
28
+ - License: [CC-BY-NC](https://cohere.com/c4ai-cc-by-nc-license), requires also adhering to [C4AI's Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy)
29
+ - Model: c4ai-command-r-v01
30
+ - Model Size: 35 billion parameters
31
+ - Context length: 128K
32
 
33
+ **Use**
34
 
35
+ ```python
36
+ # pip install transformers
37
+ from transformers import AutoTokenizer, AutoModelForCausalLM
38
 
39
+ model_id = "CohereForAI/c4ai-command-r-v01"
40
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
41
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
42
 
43
+ # Format message with the command-r chat template
44
+ messages = [{"role": "user", "content": "Hello, how are you?"}]
45
+ input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
46
+ ## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
47
 
48
+ gen_tokens = model.generate(
49
+ input_ids,
50
+ max_new_tokens=100,
51
+ do_sample=True,
52
+ temperature=0.3,
53
+ )
54
 
55
+ gen_text = tokenizer.decode(gen_tokens[0])
56
+ print(gen_text)
57
+ ```
58
+
59
+ **Quantized model through bitsandbytes, 8-bit precision**
60
 
61
+ ```python
62
+ # pip install transformers bitsandbytes accelerate
63
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
64
 
65
+ bnb_config = BitsAndBytesConfig(load_in_8bit=True)
66
 
67
+ model_id = "CohereForAI/c4ai-command-r-v01"
68
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
69
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, quantization_config=bnb_config)
70
 
71
+ # Format message with the command-r chat template
72
+ messages = [{"role": "user", "content": "Hello, how are you?"}]
73
+ input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
74
+ ## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
75
 
76
+ gen_tokens = model.generate(
77
+ input_ids,
78
+ max_new_tokens=100,
79
+ do_sample=True,
80
+ temperature=0.3,
81
+ )
82
+
83
+ gen_text = tokenizer.decode(gen_tokens[0])
84
+ print(gen_text)
85
  ```
86
 
87
+ **Quantized model through bitsandbytes, 4-bit precision**
88
+
89
+ ```python
90
+ # pip install transformers bitsandbytes accelerate
91
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
92
+
93
+ bnb_config = BitsAndBytesConfig(load_in_4bit=True)
94
+
95
+ model_id = "CohereForAI/c4ai-command-r-v01"
96
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
97
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, quantization_config=bnb_config)
98
+
99
+ # Format message with the command-r chat template
100
+ messages = [{"role": "user", "content": "Hello, how are you?"}]
101
+ input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
102
+ ## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
103
+
104
+ gen_tokens = model.generate(
105
+ input_ids,
106
+ max_new_tokens=100,
107
+ do_sample=True,
108
+ temperature=0.3,
109
+ )
110
 
111
+ gen_text = tokenizer.decode(gen_tokens[0])
112
+ print(gen_text)
113
  ```
114
 
115
+ ## Model Details
116
+
117
+ **Input**: Models input text only.
118
+
119
+ **Output**: Models generate text only.
120
+
121
+ **Model Architecture**: This is an auto-regressive language model that uses an optimized transformer architecture. After pretraining, this model uses supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety.
122
+
123
+ **Languages covered**: The model is optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic.
124
+
125
+ Pre-training data additionally included the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, Persian.
126
+
127
+ **Context length**: Command-R supports a context length of 128K.
128
+
129
+ ### Tool use capabilities:
130
 
131
+ Command-R has been specifically trained with conversational tool use capabilities. These have been trained into the model via a mixture of supervised fine-tuning and preference fine-tuning, using a specific prompt template. Deviating from this prompt template will likely reduce performance, but we encourage experimentation.
132
+
133
+ Command-R’s tool use functionality takes a conversation as input (with an optional user-system preamble), along with a list of available tools. The model will then generate a json-formatted list of actions to execute on a subset of those tools. Command-R may use one of its supplied tools more than once.
134
+
135
+ The model has been trained to recognise a special `directly_answer` tool, which it uses to indicate that it doesn’t want to use any of its other tools. We recommend including the `directly_answer` tool, but encourage experimentation.
136
+
137
+ Comprehensive documentation and guides on prompting strategies for tool use will be provided shortly.
138
+
139
+ <details>
140
+ <summary><b>Usage: Rendering Tool Use Prompts [CLICK TO EXPAND]</b> </summary>
141
+
142
+ ```python
143
+ from transformers import AutoTokenizer
144
+
145
+ model_id = "CohereForAI/c4ai-command-r-v01"
146
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
147
+
148
+ # define conversation input:
149
+ conversation = [
150
+ {"role": "user", "content": "Whats the biggest penguin in the world?"}
151
+ ]
152
+ # Define tools available for the model to use:
153
+ tools = [
154
+ {
155
+ "name": "internet_search",
156
+ "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
157
+ "parameter_definitions": {
158
+ "query": {
159
+ "description": "Query to search the internet with",
160
+ "type": 'str',
161
+ "required": True
162
+ }
163
+ }
164
+ },
165
+ {
166
+ 'name': "directly_answer",
167
+ "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
168
+ 'parameter_definitions': {}
169
+ }
170
+ ]
171
+
172
+ # render the tool use prompt as a string:
173
+ tool_use_prompt = tokenizer.apply_tool_use_template(
174
+ conversation,
175
+ tools=tools,
176
+ tokenize=False,
177
+ add_generation_prompt=True,
178
+ )
179
+ print(tool_use_prompt)
180
  ```
181
 
182
+ </details>
183
+
184
+ <details>
185
+ <summary><b>Example Rendered Tool Use Prompt [CLICK TO EXPAND]</b></summary>
186
+
187
+ ````
188
+ <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
189
+ The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.
190
+
191
+ # System Preamble
192
+ ## Basic Rules
193
+ You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.
194
+
195
+ # User Preamble
196
+ ## Task and Context
197
+ You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
198
+
199
+ ## Style Guide
200
+ Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.
201
+
202
+ ## Available Tools
203
+ Here is a list of tools that you have available to you:
204
 
205
+ ```python
206
+ def internet_search(query: str) -> List[Dict]:
207
+ """Returns a list of relevant document snippets for a textual query retrieved from the internet
208
 
209
+ Args:
210
+ query (str): Query to search the internet with
211
+ """
212
+ pass
213
  ```
214
 
215
+ ```python
216
+ def directly_answer() -> List[Dict]:
217
+ """Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history
218
+ """
219
+ pass
220
+ ```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:
221
+ ```json
222
+ [
223
+ {
224
+ "tool_name": title of the tool in the specification,
225
+ "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
226
+ }
227
+ ]```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
228
+
229
+ ````
230
+
231
+ </details>
232
+
233
+ <details>
234
+ <summary><b>Example Rendered Tool Use Completion [CLICK TO EXPAND]</b></summary>
235
+
236
+ ````
237
+ Action: ```json
238
+ [
239
+ {
240
+ "tool_name": "internet_search",
241
+ "parameters": {
242
+ "query": "biggest penguin in the world"
243
+ }
244
+ }
245
+ ]
246
+ ```
247
+ ````
248
+ </details>
249
+
250
+ ### Grounded Generation and RAG Capabilities:
251
+
252
+ Command-R has been specifically trained with grounded generation capabilities. This means that it can generate responses based on a list of supplied document snippets, and it will include grounding spans (citations) in its response indicating the source of the information.
253
+ This can be used to enable behaviors such as grounded summarization and the final step of Retrieval Augmented Generation (RAG). This behavior has been trained into the model via a mixture of supervised fine-tuning and preference fine-tuning, using a specific prompt template.
254
+ Deviating from this prompt template may reduce performance, but we encourage experimentation.
255
+
256
+ Command-R’s grounded generation behavior takes a conversation as input (with an optional user-supplied system preamble), along with a list of retrieved document snippets.
257
+ The document snippets should be chunks, rather than long documents, typically around 100-400 words per chunk. Document snippets consist of key-value pairs. The keys should be short descriptive strings, the values can be text or semi-structured.
258
+
259
+ By default, Command-R will generate grounded responses by first predicting which documents are relevant, then predicting which ones it will cite, then generating an answer.
260
+ Finally, it will then insert grounding spans into the answer. See below for an example. This is referred to as `accurate` grounded generation.
261
+
262
+ The model is trained with a number of other answering modes, which can be selected by prompt changes. A `fast` citation mode is supported in the tokenizer, which will directly generate an answer with grounding spans in it, without first writing the answer out in full. This sacrifices some grounding accuracy in favor of generating fewer tokens.
263
+
264
+ The code snippet below shows a minimal working example on how to render a prompt, generate and parse a completion.
265
+
266
+ Comprehensive documentation and guides on prompting strategies on grounded generation will be provided in follow-ups at a later stage.
267
+
268
+ <details>
269
+ <summary> <b>Usage: Rendering Grounded Generation prompts [CLICK TO EXPAND]</b> </summary>
270
+
271
+ ````python
272
+ from transformers import AutoTokenizer
273
+
274
+ model_id = "CohereForAI/c4ai-command-r-v01"
275
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
276
+
277
+ # define conversation input:
278
+ conversation = [
279
+ {"role": "user", "content": "Whats the biggest penguin in the world?"}
280
+ ]
281
+ # define documents to ground on:
282
+ documents = [
283
+ { "title": "Tall penguins", "text": "Emperor penguins are the tallest growing up to 122 cm in height." },
284
+ { "title": "Penguin habitats", "text": "Emperor penguins only live in Antarctica."}
285
+ ]
286
+
287
+ # render the grounded generation prompt as a string:
288
+ grounded_generation_prompt = tokenizer.apply_grounded_generation_template(
289
+ conversation,
290
+ documents=documents,
291
+ citation_mode="accurate", # or "fast"
292
+ tokenize=False,
293
+ add_generation_prompt=True,
294
+ )
295
+ print(grounded_generation_prompt)
296
+ ````
297
+ </details>
298
+
299
+ <details>
300
+ <summary><b>Example Rendered Grounded Generation Prompt [CLICK TO EXPAND]</b></summary>
301
+
302
+ ````<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
303
+ The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.
304
+
305
+ # System Preamble
306
+ ## Basic Rules
307
+ You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.
308
+
309
+ # User Preamble
310
+ ## Task and Context
311
+ You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
312
+
313
+ ## Style Guide
314
+ Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><results>
315
+ Document: 0
316
+ title: Tall penguins
317
+ text: Emperor penguins are the tallest growing up to 122 cm in height.
318
+
319
+ Document: 1
320
+ title: Penguin habitats
321
+ text: Emperor penguins only live in Antarctica.
322
+ </results><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line.
323
+ Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'.
324
+ Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'.
325
+ Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.
326
+ Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
327
+ ````
328
+
329
+ </details>
330
+
331
+ <details>
332
+ <summary><b>Example Rendered Grounded Generation Completion [CLICK TO EXPAND]</b></summary>
333
+
334
+ ````
335
+ Relevant Documents: 0,1
336
+ Cited Documents: 0,1
337
+ Answer: The Emperor Penguin is the tallest or biggest penguin in the world. It is a bird that lives only in Antarctica and grows to a height of around 122 centimetres.
338
+ Grounded answer: The <co: 0>Emperor Penguin</co: 0> is the <co: 0>tallest</co: 0> or biggest penguin in the world. It is a bird that <co: 1>lives only in Antarctica</co: 1> and <co: 0>grows to a height of around 122 centimetres.</co: 0>
339
+ ````
340
+ </details>
341
+
342
+ ### Code Capabilities:
343
+ Command-R has been optimized to interact with your code, by requesting code snippets, code explanations, or code rewrites. It might not perform well out-of-the-box for pure code completion. For better performance, we also recommend using a low temperature (and even greedy decoding) for code-generation related instructions.
344
+
345
+ ### Model Card Contact
346
+ For errors or additional questions about details in this model card, contact [info@for.ai](mailto:info@for.ai).
347
+
348
+ ### Terms of Use:
349
+ We hope that the release of this model will make community-based research efforts more accessible, by releasing the weights of a highly performant 35 billion parameter model to researchers all over the world. This model is governed by a [CC-BY-NC](https://cohere.com/c4ai-cc-by-nc-license) License with an acceptable use addendum, and also requires adhering to [C4AI's Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
350
 
351
+ ### Try Chat:
352
+ You can try Command-R chat in the playground [here](https://dashboard.cohere.com/playground/chat).
 
 
config.json ADDED
@@ -0,0 +1,46 @@
1
+ {
2
+ "_name_or_path": "/home/ahmet_cohere_com/HF_Final_weight_tie",
3
+ "architectures": [
4
+ "CohereForCausalLM"
5
+ ],
6
+ "attention_bias": false,
7
+ "attention_dropout": 0.0,
8
+ "auto_map": {
9
+ "AutoConfig": "configuration_cohere.CohereConfig",
10
+ "AutoModel": "modeling_cohere.CohereModel",
11
+ "AutoModelForCausalLM": "modeling_cohere.CohereForCausalLM"
12
+ },
13
+ "bos_token_id": 5,
14
+ "eos_token_id": 255001,
15
+ "hidden_act": "silu",
16
+ "hidden_size": 8192,
17
+ "initializer_range": 0.02,
18
+ "intermediate_size": 22528,
19
+ "layer_norm_eps": 1e-05,
20
+ "logit_scale": 0.0625,
21
+ "max_position_embeddings": 8192,
22
+ "model_max_length": 131072,
23
+ "model_type": "cohere",
24
+ "num_attention_heads": 64,
25
+ "num_hidden_layers": 40,
26
+ "num_key_value_heads": 64,
27
+ "pad_token_id": 0,
28
+ "pretraining_tp": 1,
29
+ "rope_theta": 8000000.0,
30
+ "torch_dtype": "float16",
31
+ "transformers_version": "4.38.2",
32
+ "use_cache": true,
33
+ "vocab_size": 256000,
34
+ "tie_word_embeddings": true,
35
+ "quantization_config": {
36
+ "quant_method": "exl2",
37
+ "version": "0.0.16",
38
+ "bits": 6.5,
39
+ "head_bits": 8,
40
+ "calibration": {
41
+ "rows": 100,
42
+ "length": 2048,
43
+ "dataset": "(default)"
44
+ }
45
+ }
46
+ }
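The dimensions in this config.json are enough to reproduce the parameter count of the unquantized model. The sketch below is a back-of-the-envelope check under stated assumptions: no biases (`attention_bias` is false), one layer norm weight per decoder layer, a single final norm weight, and an output head tied to the embeddings (`tie_word_embeddings` is true). `count_parameters` is a hypothetical helper, not part of any library.

```python
# Values copied from the config.json in this commit.
config = {
    "hidden_size": 8192,
    "intermediate_size": 22528,
    "num_hidden_layers": 40,
    "vocab_size": 256000,
}


def count_parameters(cfg: dict) -> int:
    """Estimate the number of weights from the architecture dimensions."""
    h = cfg["hidden_size"]
    embed = cfg["vocab_size"] * h            # tied with the output head, counted once
    attention = 4 * h * h                    # q, k, v, o (kv heads == attention heads)
    mlp = 3 * h * cfg["intermediate_size"]   # gate, up, down projections
    layer_norm = h                           # one input_layernorm weight per layer
    final_norm = h                           # assumption: a single final norm weight
    per_layer = attention + mlp + layer_norm
    return embed + cfg["num_hidden_layers"] * per_layer + final_norm


params = count_parameters(config)
fp16_bytes = params * 2  # two bytes per weight at float16
print(params, fp16_bytes)
```

Under these assumptions the count is 34,980,831,232 weights, or 69,961,662,464 bytes at float16, which agrees with the `total_size` recorded in model.safetensors.index.json in this same commit.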
configuration_cohere.py ADDED
@@ -0,0 +1,159 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 Cohere team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ Cohere model configuration"""
21
+
22
+ from transformers import PretrainedConfig, AutoConfig
23
+ from transformers.utils import logging
24
+
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+
29
+ class CohereConfig(PretrainedConfig):
30
+ r"""
31
+ This is the configuration class to store the configuration of a [`CohereModel`]. It is used to instantiate a Cohere
32
+ model according to the specified arguments, defining the model architecture.
33
+
34
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
35
+ documentation from [`PretrainedConfig`] for more information.
36
+
37
+
38
+ Args:
39
+ vocab_size (`int`, *optional*, defaults to 256000):
40
+ Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the
41
+ `inputs_ids` passed when calling [`CohereModel`]
42
+ hidden_size (`int`, *optional*, defaults to 8192):
43
+ Dimension of the hidden representations.
44
+ intermediate_size (`int`, *optional*, defaults to 22528):
45
+ Dimension of the MLP representations.
46
+ num_hidden_layers (`int`, *optional*, defaults to 40):
47
+ Number of hidden layers in the Transformer decoder.
48
+ num_attention_heads (`int`, *optional*, defaults to 64):
49
+ Number of attention heads for each attention layer in the Transformer decoder.
50
+ num_key_value_heads (`int`, *optional*):
51
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
52
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
53
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
54
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
55
+ by meanpooling all the original heads within that group. For more details, check out [this
56
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
57
+ `num_attention_heads`.
58
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
59
+ The non-linear activation function (function or string) in the decoder.
60
+ max_position_embeddings (`int`, *optional*, defaults to 8192):
61
+ The maximum sequence length that this model might ever be used with.
62
+ initializer_range (`float`, *optional*, defaults to 0.02):
63
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
64
+ layer_norm_eps (`float`, *optional*, defaults to 1e-05):
65
+ The epsilon used by the layer normalization.
66
+ use_cache (`bool`, *optional*, defaults to `True`):
67
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
68
+ relevant if `config.is_decoder=True`.
69
+ pad_token_id (`int`, *optional*, defaults to 0):
70
+ Padding token id.
71
+ bos_token_id (`int`, *optional*, defaults to 5):
72
+ Beginning of stream token id.
73
+ eos_token_id (`int`, *optional*, defaults to 255001):
74
+ End of stream token id.
75
+ pretraining_tp (`int`, *optional*, defaults to 1):
76
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
77
+ document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is
78
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
79
+ issue](https://github.com/pytorch/pytorch/issues/76232).
80
+ tie_word_embeddings (`bool`, *optional*, defaults to `True`):
81
+ Whether to tie weight embeddings
82
+ rope_theta (`float`, *optional*, defaults to 10000.0):
83
+ The base period of the RoPE embeddings.
84
+ attention_bias (`bool`, *optional*, defaults to `False`):
85
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
86
+ attention_dropout (`float`, *optional*, defaults to 0.0):
87
+ The dropout ratio for the attention probabilities.
88
+
89
+ ```python
90
+     >>> from transformers import CohereModel, CohereConfig
+
+     >>> # Initializing a Cohere model configuration
+     >>> configuration = CohereConfig()
+
+     >>> # Initializing a model from the Cohere configuration
+     >>> model = CohereModel(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+
+     model_type = "cohere"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=256000,
+         hidden_size=8192,
+         intermediate_size=22528,
+         num_hidden_layers=40,
+         num_attention_heads=64,
+         num_key_value_heads=None,
+         hidden_act="silu",
+         max_position_embeddings=8192,
+         initializer_range=0.02,
+         layer_norm_eps=1e-5,
+         use_cache=True,
+         pad_token_id=0,
+         bos_token_id=5,
+         eos_token_id=255001,
+         pretraining_tp=1,
+         tie_word_embeddings=True,
+         rope_theta=10000.0,
+         attention_bias=False,
+         attention_dropout=0.0,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.layer_norm_eps = layer_norm_eps
+         self.pretraining_tp = pretraining_tp
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.attention_bias = attention_bias
+         self.attention_dropout = attention_dropout
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+
+ # register the model config to AutoConfig
+ AutoConfig.register("cohere", CohereConfig)
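
The backward-compatibility branch in `__init__` means that leaving `num_key_value_heads` unset gives every query head its own key/value head. A minimal sketch of that fallback, independent of `transformers`, is given here. `resolve_kv_heads` and `attention_kind` are hypothetical helper names used only for illustration, and the MHA/MQA/GQA labels follow the usual naming convention rather than anything defined in this file.

```python
from typing import Optional


def resolve_kv_heads(num_attention_heads: int, num_key_value_heads: Optional[int] = None) -> int:
    """Mirror the fallback in CohereConfig.__init__: None means 'same as the query heads'."""
    if num_key_value_heads is None:
        return num_attention_heads
    return num_key_value_heads


def attention_kind(num_attention_heads: int, num_key_value_heads: Optional[int] = None) -> str:
    """Label the attention layout implied by the two head counts (hypothetical helper)."""
    kv_heads = resolve_kv_heads(num_attention_heads, num_key_value_heads)
    if num_attention_heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if kv_heads == num_attention_heads:
        return "MHA"  # one key/value head per query head
    if kv_heads == 1:
        return "MQA"  # all query heads share a single key/value head
    return "GQA"  # query heads share key/value heads in groups


# Defaults from the signature above: 64 attention heads, num_key_value_heads=None.
print(resolve_kv_heads(64))   # 64
print(attention_kind(64))     # MHA
print(attention_kind(64, 8))  # GQA
```

With the defaults in this commit (`num_attention_heads=64`, `num_key_value_heads=None`), the resolved value is 64, so the config describes plain multi-head attention unless a checkpoint's `config.json` overrides it.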
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 5,
+   "eos_token_id": 255001,
+   "pad_token_id": 0,
+   "transformers_version": "4.38.2"
+ }
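
The special-token ids in `generation_config.json` repeat the defaults in the `CohereConfig.__init__` signature. The sketch below is one way to confirm the two agree. It uses only the standard library, and `mismatched_token_ids` is a hypothetical helper name, not part of `transformers`.

```python
import json

# Contents of generation_config.json as added in this commit.
GENERATION_CONFIG = """
{
  "_from_model_config": true,
  "bos_token_id": 5,
  "eos_token_id": 255001,
  "pad_token_id": 0,
  "transformers_version": "4.38.2"
}
"""

# Special-token defaults copied from the CohereConfig.__init__ signature.
CONFIG_DEFAULTS = {"pad_token_id": 0, "bos_token_id": 5, "eos_token_id": 255001}


def mismatched_token_ids(generation_config: dict, defaults: dict) -> dict:
    """Return {key: (generation value, default value)} for ids that disagree."""
    return {
        key: (generation_config.get(key), expected)
        for key, expected in defaults.items()
        if generation_config.get(key) != expected
    }


generation_config = json.loads(GENERATION_CONFIG)
print(mismatched_token_ids(generation_config, CONFIG_DEFAULTS))  # {}
```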
model.safetensors.index.json ADDED
@@ -0,0 +1,329 @@
+ {
+   "metadata": {
+     "total_size": 69961662464
+   },
+   "weight_map": {
+     "model.embed_tokens.weight": "model-00001-of-00015.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00002-of-00015.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00015.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00015.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00015.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00015.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00002-of-00015.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00005-of-00015.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00005-of-00015.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.11.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.11.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.11.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.11.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00006-of-00015.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00006-of-00015.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00006-of-00015.safetensors",
+     "model.layers.14.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.14.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.14.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.14.self_attn.k_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.14.self_attn.o_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.14.self_attn.q_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.14.self_attn.v_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.15.input_layernorm.weight": "model-00007-of-00015.safetensors",
+     "model.layers.15.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.15.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.15.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.15.self_attn.k_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.15.self_attn.o_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.15.self_attn.q_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.15.self_attn.v_proj.weight": "model-00006-of-00015.safetensors",
+     "model.layers.16.input_layernorm.weight": "model-00007-of-00015.safetensors",
+     "model.layers.16.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.16.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.16.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.16.self_attn.k_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.16.self_attn.o_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.16.self_attn.q_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.16.self_attn.v_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.17.input_layernorm.weight": "model-00007-of-00015.safetensors",
+     "model.layers.17.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.17.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.17.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.17.self_attn.k_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.17.self_attn.o_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.17.self_attn.q_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.17.self_attn.v_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.18.input_layernorm.weight": "model-00008-of-00015.safetensors",
+     "model.layers.18.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.18.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.18.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.18.self_attn.k_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.18.self_attn.o_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.18.self_attn.q_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.18.self_attn.v_proj.weight": "model-00007-of-00015.safetensors",
+     "model.layers.19.input_layernorm.weight": "model-00008-of-00015.safetensors",
+     "model.layers.19.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.19.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.19.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.19.self_attn.k_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.19.self_attn.o_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.19.self_attn.q_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.19.self_attn.v_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.2.input_layernorm.weight": "model-00002-of-00015.safetensors",
+     "model.layers.2.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.2.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.2.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.2.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.2.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.2.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.2.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.20.input_layernorm.weight": "model-00008-of-00015.safetensors",
+     "model.layers.20.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.20.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.20.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.20.self_attn.k_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.20.self_attn.o_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.20.self_attn.q_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.20.self_attn.v_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.21.input_layernorm.weight": "model-00009-of-00015.safetensors",
+     "model.layers.21.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.21.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.21.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.21.self_attn.k_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.21.self_attn.o_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.21.self_attn.q_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.21.self_attn.v_proj.weight": "model-00008-of-00015.safetensors",
+     "model.layers.22.input_layernorm.weight": "model-00009-of-00015.safetensors",
+     "model.layers.22.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.22.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.22.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.22.self_attn.k_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.22.self_attn.o_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.22.self_attn.q_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.22.self_attn.v_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.23.input_layernorm.weight": "model-00009-of-00015.safetensors",
+     "model.layers.23.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.23.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.23.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.23.self_attn.k_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.23.self_attn.o_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.23.self_attn.q_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.23.self_attn.v_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.24.input_layernorm.weight": "model-00010-of-00015.safetensors",
+     "model.layers.24.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.24.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.24.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.24.self_attn.k_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.24.self_attn.o_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.24.self_attn.q_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.24.self_attn.v_proj.weight": "model-00009-of-00015.safetensors",
+     "model.layers.25.input_layernorm.weight": "model-00010-of-00015.safetensors",
+     "model.layers.25.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.25.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.25.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.25.self_attn.k_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.25.self_attn.o_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.25.self_attn.q_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.25.self_attn.v_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.26.input_layernorm.weight": "model-00010-of-00015.safetensors",
+     "model.layers.26.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.26.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.26.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.26.self_attn.k_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.26.self_attn.o_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.26.self_attn.q_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.26.self_attn.v_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.27.input_layernorm.weight": "model-00011-of-00015.safetensors",
+     "model.layers.27.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.27.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.27.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.27.self_attn.k_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.27.self_attn.o_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.27.self_attn.q_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.27.self_attn.v_proj.weight": "model-00010-of-00015.safetensors",
+     "model.layers.28.input_layernorm.weight": "model-00011-of-00015.safetensors",
+     "model.layers.28.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.28.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.28.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.28.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.28.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.28.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.28.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.29.input_layernorm.weight": "model-00011-of-00015.safetensors",
+     "model.layers.29.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.29.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.29.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.29.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.29.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.29.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.29.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.3.input_layernorm.weight": "model-00003-of-00015.safetensors",
+     "model.layers.3.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.3.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.3.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.3.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.3.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.3.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.3.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
+     "model.layers.30.input_layernorm.weight": "model-00012-of-00015.safetensors",
+     "model.layers.30.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.30.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.30.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.30.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.30.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.30.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.30.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
+     "model.layers.31.input_layernorm.weight": "model-00012-of-00015.safetensors",
+     "model.layers.31.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.31.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.31.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.31.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.31.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.31.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.31.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.32.input_layernorm.weight": "model-00012-of-00015.safetensors",
+     "model.layers.32.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.32.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.32.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.32.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.32.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.32.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.32.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.33.input_layernorm.weight": "model-00013-of-00015.safetensors",
+     "model.layers.33.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.33.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.33.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.33.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.33.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.33.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.33.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
+     "model.layers.34.input_layernorm.weight": "model-00013-of-00015.safetensors",
+     "model.layers.34.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.34.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.34.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.34.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.34.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.34.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.34.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.35.input_layernorm.weight": "model-00013-of-00015.safetensors",
+     "model.layers.35.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.35.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.35.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.35.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.35.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.35.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.35.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.36.input_layernorm.weight": "model-00014-of-00015.safetensors",
+     "model.layers.36.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.36.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.36.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.36.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.36.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.36.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.36.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
+     "model.layers.37.input_layernorm.weight": "model-00014-of-00015.safetensors",
+     "model.layers.37.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.37.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.37.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.37.self_attn.k_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.37.self_attn.o_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.37.self_attn.q_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.37.self_attn.v_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.38.input_layernorm.weight": "model-00014-of-00015.safetensors",
+     "model.layers.38.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.38.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.38.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.38.self_attn.k_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.38.self_attn.o_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.38.self_attn.q_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.38.self_attn.v_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.39.input_layernorm.weight": "model-00015-of-00015.safetensors",
+     "model.layers.39.mlp.down_proj.weight": "model-00015-of-00015.safetensors",
+     "model.layers.39.mlp.gate_proj.weight": "model-00015-of-00015.safetensors",
+     "model.layers.39.mlp.up_proj.weight": "model-00015-of-00015.safetensors",
+     "model.layers.39.self_attn.k_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.39.self_attn.o_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.39.self_attn.q_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.39.self_attn.v_proj.weight": "model-00014-of-00015.safetensors",
+     "model.layers.4.input_layernorm.weight": "model-00003-of-00015.safetensors",
+     "model.layers.4.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.4.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.4.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.4.self_attn.k_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.4.self_attn.o_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.4.self_attn.q_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.4.self_attn.v_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.5.input_layernorm.weight": "model-00003-of-00015.safetensors",
+     "model.layers.5.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.5.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.5.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.5.self_attn.k_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.5.self_attn.o_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.5.self_attn.q_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.5.self_attn.v_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.6.input_layernorm.weight": "model-00004-of-00015.safetensors",
+     "model.layers.6.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.6.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.6.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.6.self_attn.k_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.6.self_attn.o_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.6.self_attn.q_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.6.self_attn.v_proj.weight": "model-00003-of-00015.safetensors",
+     "model.layers.7.input_layernorm.weight": "model-00004-of-00015.safetensors",
+     "model.layers.7.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.7.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.7.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.7.self_attn.k_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.7.self_attn.o_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.7.self_attn.q_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.7.self_attn.v_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.8.input_layernorm.weight": "model-00004-of-00015.safetensors",
+     "model.layers.8.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.8.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.8.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.8.self_attn.k_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.8.self_attn.o_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.8.self_attn.q_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.8.self_attn.v_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.9.input_layernorm.weight": "model-00005-of-00015.safetensors",
+     "model.layers.9.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.9.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.9.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+     "model.layers.9.self_attn.k_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.9.self_attn.o_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.9.self_attn.q_proj.weight": "model-00004-of-00015.safetensors",
+     "model.layers.9.self_attn.v_proj.weight": "model-00004-of-00015.safetensors",
+     "model.norm.weight": "model-00015-of-00015.safetensors"
+   }
+ }
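
The index maps each tensor name to the shard file that stores it, and one layer's tensors can straddle two shards (layer 0's attention projections are in shard 1 while its MLP and layer norm are in shard 2). A short sketch of inverting that mapping to see what each shard holds follows. It uses only a handful of entries copied from the index, and `tensors_by_shard` is a hypothetical helper name rather than a `transformers` or `safetensors` API.

```python
import json
from collections import defaultdict

# A few entries copied from model.safetensors.index.json in this commit.
INDEX_JSON = """
{
  "metadata": {"total_size": 69961662464},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00015.safetensors",
    "model.layers.0.input_layernorm.weight": "model-00002-of-00015.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00015.safetensors",
    "model.norm.weight": "model-00015-of-00015.safetensors"
  }
}
"""


def tensors_by_shard(index: dict) -> dict:
    """Invert weight_map: shard file name -> sorted list of tensor names stored in it."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in sorted(shards.items())}


index = json.loads(INDEX_JSON)
for shard, names in tensors_by_shard(index).items():
    print(shard, names)

# total_size is a byte count; convert to decimal gigabytes for readability.
print(f"total_size: {index['metadata']['total_size'] / 1e9:.1f} GB")
```

Run against the full index file instead of the inline sample, the same function lists which shards must be downloaded to obtain a given layer.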
modeling_cohere.py ADDED
@@ -0,0 +1,1280 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 Cohere and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+
21
+ # This file is based on the LLama model definition file in transformers
22
+
23
+ """PyTorch Cohere model."""
24
+
25
+ import math
26
+ import warnings
27
+ from typing import List, Optional, Tuple, Union
28
+
29
+ import torch
30
+ import torch.nn.functional as F
31
+ import torch.utils.checkpoint
32
+ from torch import nn
33
+ from torch.nn import CrossEntropyLoss
34
+
35
+ from transformers import AutoModel, AutoModelForCausalLM
36
+ from transformers.activations import ACT2FN
37
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
38
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
39
+ from transformers.modeling_outputs import (
40
+ BaseModelOutputWithPast,
41
+ CausalLMOutputWithPast,
42
+ )
43
+ from transformers.modeling_utils import PreTrainedModel
44
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
45
+ from transformers.utils import (
46
+ add_start_docstrings,
47
+ add_start_docstrings_to_model_forward,
48
+ is_flash_attn_2_available,
49
+ is_flash_attn_greater_or_equal_2_10,
50
+ logging,
51
+ replace_return_docstrings,
52
+ )
53
+ from .configuration_cohere import CohereConfig
54
+
+ if is_flash_attn_2_available():
+     from flash_attn import flash_attn_func, flash_attn_varlen_func
+     from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
+
55
+
56
+ logger = logging.get_logger(__name__)
57
+
58
+ _CONFIG_FOR_DOC = "CohereConfig"
59
+
60
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
61
+ def _get_unpad_data(attention_mask):
62
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
63
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
64
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
65
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
66
+ return (
67
+ indices,
68
+ cu_seqlens,
69
+ max_seqlen_in_batch,
70
+ )
71
+
72
+
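The unpadding metadata computed by `_get_unpad_data` can be illustrated without PyTorch. The sketch below is a pure-Python analogue, not the file's implementation; `unpad_metadata` is a hypothetical helper name and the mask is assumed to be a nested list of 0/1 values.

```python
from itertools import accumulate


def unpad_metadata(attention_mask):
    """Pure-Python analogue of _get_unpad_data for a 0/1 mask given as nested lists."""
    # Number of real (non-padding) tokens in each sequence of the batch.
    seqlens = [sum(row) for row in attention_mask]
    width = len(attention_mask[0])
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [
        r * width + c
        for r, row in enumerate(attention_mask)
        for c, value in enumerate(row)
        if value
    ]
    # Cumulative sequence lengths with a leading 0, as variable-length kernels expect.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
indices, cu_seqlens, max_len = unpad_metadata(mask)
print(indices, cu_seqlens, max_len)
```

Here sequence `i` occupies the slice `cu_seqlens[i]:cu_seqlens[i + 1]` of the packed token list, which is why the cumulative sums start at 0.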
73
+ class LayerNorm(nn.Module):
74
+ def __init__(self, hidden_size, eps=1e-5, bias=False):
75
+ super().__init__()
76
+ self.weight = nn.Parameter(torch.ones(hidden_size))
77
+ self.bias = nn.Parameter(torch.zeros(hidden_size)) if bias else None
78
+ self.variance_epsilon = eps
79
+
80
+ def forward(self, hidden_states):
81
+ input_dtype = hidden_states.dtype
82
+ hidden_states = hidden_states.to(torch.float32)
83
+ mean = hidden_states.mean(-1, keepdim=True)
84
+ variance = (hidden_states - mean).pow(2).mean(-1, keepdim=True)
85
+ hidden_states = (hidden_states - mean) * torch.rsqrt(variance + self.variance_epsilon)
86
+ hidden_states = self.weight.to(torch.float32) * hidden_states
87
+ if self.bias is not None:
88
+ hidden_states = hidden_states + self.bias.to(torch.float32)
89
+ return hidden_states.to(input_dtype)
90
+
91
+
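The normalization above subtracts the mean before scaling, unlike an RMS-style norm. A minimal pure-Python sketch of the same arithmetic on a single vector follows; `layer_norm` is a hypothetical stand-in that omits the dtype casts and the optional bias.

```python
import math


def layer_norm(xs, weight, eps=1e-5):
    """Normalize one vector: subtract the mean, divide by the std, then scale."""
    mean = sum(xs) / len(xs)
    # Biased variance over the last dimension, matching .pow(2).mean(-1).
    variance = sum((x - mean) ** 2 for x in xs) / len(xs)
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [w * (x - mean) * inv_std for w, x in zip(weight, xs)]


normed = layer_norm([1.0, 2.0, 3.0], weight=[1.0, 1.0, 1.0])
print(normed)
```

With unit weights the output is centered on zero, and the middle element (equal to the mean) maps to zero.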
92
+ ALL_LAYERNORM_LAYERS.append(LayerNorm)
93
+
94
+
95
+ class CohereRotaryEmbedding(nn.Module):
96
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
97
+ super().__init__()
98
+ self.scaling_factor = scaling_factor
99
+ self.dim = dim
100
+ self.max_position_embeddings = max_position_embeddings
101
+ self.base = base
102
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
103
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
104
+ # For BC we register cos and sin cached
105
+ self.max_seq_len_cached = max_position_embeddings
106
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
107
+ t = t / self.scaling_factor
108
+ freqs = torch.outer(t, self.inv_freq)
109
+ emb = torch.repeat_interleave(freqs, 2, dim=-1)
110
+ self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False)
111
+ self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False)
112
+
113
+ @property
114
+ def sin_cached(self):
115
+ logger.warning_once(
116
+ "The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use "
117
+ "the forward method of RoPE from now on instead. It is not used in the `CohereAttention` class"
118
+ )
119
+ return self._sin_cached
120
+
121
+ @property
122
+ def cos_cached(self):
123
+ logger.warning_once(
124
+ "The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use "
125
+ "the forward method of RoPE from now on instead. It is not used in the `CohereAttention` class"
126
+ )
127
+ return self._cos_cached
128
+
129
+ @torch.no_grad()
130
+ def forward(self, x, position_ids, seq_len=None):
131
+ if seq_len is not None:
132
+ logger.warning_once("The `seq_len` argument is deprecated and unused. It will be removed in v4.39.")
133
+
134
+ # x: [bs, num_attention_heads, seq_len, head_size]
135
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
136
+ position_ids_expanded = position_ids[:, None, :].float()
137
+ # Force float32 since bfloat16 loses precision on long contexts
138
+ # See https://github.com/huggingface/transformers/pull/29285
139
+ device_type = x.device.type
140
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
141
+ with torch.autocast(device_type=device_type, enabled=False):
142
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
143
+ emb = torch.repeat_interleave(freqs, 2, dim=-1)
144
+ cos = emb.cos()
145
+ sin = emb.sin()
146
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
147
+
148
+
149
+ def rotate_half(x):
150
+ # Split and rotate
151
+ x1 = x[..., ::2]
152
+ x2 = x[..., 1::2]
153
+ rot_x = torch.stack([-x2, x1], dim=-1).flatten(-2)
154
+ return rot_x
155
+
156
+
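`rotate_half` here pairs adjacent elements (even index with the following odd index) rather than splitting the vector into two halves. The sketch below shows that pairing on plain Python lists and how it combines with cosine and sine terms; `rotate_half_interleaved` and `apply_rope` are hypothetical names, and a single 1-D vector is assumed instead of a batched tensor.

```python
import math


def rotate_half_interleaved(x):
    """Map each adjacent pair (x1, x2) to (-x2, x1)."""
    out = []
    for x1, x2 in zip(x[0::2], x[1::2]):
        out.extend([-x2, x1])
    return out


def apply_rope(x, angles):
    """Rotate each adjacent pair of x by the matching angle (one angle per pair)."""
    # Each angle is repeated twice, mirroring repeat_interleave(freqs, 2).
    cos = [math.cos(a) for a in angles for _ in range(2)]
    sin = [math.sin(a) for a in angles for _ in range(2)]
    rotated = rotate_half_interleaved(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


swapped = rotate_half_interleaved([1.0, 2.0, 3.0, 4.0])
quarter_turn = apply_rope([1.0, 0.0], [math.pi / 2])
print(swapped, quarter_turn)
```

Rotating the pair `(1, 0)` by a quarter turn gives approximately `(0, 1)`, which is the 2-D rotation that RoPE applies to every pair at a position-dependent angle.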
157
+ # copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
158
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
159
+ """Applies Rotary Position Embedding to the query and key tensors.
160
+
161
+ Args:
162
+ q (`torch.Tensor`): The query tensor.
163
+ k (`torch.Tensor`): The key tensor.
164
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
165
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
166
+ position_ids (`torch.Tensor`, *optional*):
167
+ Deprecated and unused.
168
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
169
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
170
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
171
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
172
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
173
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
174
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
175
+ Returns:
176
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.

177
+ """
178
+ cos = cos.unsqueeze(unsqueeze_dim)
179
+ sin = sin.unsqueeze(unsqueeze_dim)
180
+ q_embed = (q * cos) + (rotate_half(q) * sin)
181
+ k_embed = (k * cos) + (rotate_half(k) * sin)
182
+ return q_embed, k_embed
183
+
184
+
185
+ # Copied from transformers.models.llama.modeling_llama.LlamaMLP Llama->Cohere
186
+ class CohereMLP(nn.Module):
187
+ def __init__(self, config):
188
+ super().__init__()
189
+ self.config = config
190
+ self.hidden_size = config.hidden_size
191
+ self.intermediate_size = config.intermediate_size
192
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
193
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
194
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
195
+ self.act_fn = ACT2FN[config.hidden_act]
196
+
197
+ def forward(self, x):
198
+ if self.config.pretraining_tp > 1:
199
+ slice = self.intermediate_size // self.config.pretraining_tp
200
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
201
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
202
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
203
+
204
+ gate_proj = torch.cat(
205
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
206
+ )
207
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
208
+
209
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
210
+ down_proj = [
211
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
212
+ ]
213
+ down_proj = sum(down_proj)
214
+ else:
215
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
216
+
217
+ return down_proj
218
+
219
+
220
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
221
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
222
+ """
223
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
224
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
225
+ """
226
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
227
+ if n_rep == 1:
228
+ return hidden_states
229
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
230
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
231
+
232
+
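The head ordering produced by `repeat_kv` matters for grouped-query attention: each key/value head is repeated `n_rep` times in place, not tiled. A small sketch with labels standing in for head tensors, assuming a hypothetical helper `repeat_kv_lists`:

```python
def repeat_kv_lists(kv_heads, n_rep):
    """Repeat each key/value head n_rep times, keeping copies adjacent."""
    if n_rep == 1:
        return kv_heads
    return [head for head in kv_heads for _ in range(n_rep)]


# Two key/value heads shared by six query heads (n_rep = 6 // 2 = 3).
expanded = repeat_kv_lists(["kv0", "kv1"], 3)
print(expanded)
```

Query heads 0 to 2 therefore attend with `kv0` and query heads 3 to 5 with `kv1`, which matches the `expand` followed by `reshape` in the tensor version.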
233
+ class Attention(nn.Module):
234
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
235
+
236
+ def __init__(self, config: CohereConfig, layer_idx: Optional[int] = None):
237
+ super().__init__()
238
+ self.config = config
239
+ self.layer_idx = layer_idx
240
+ if layer_idx is None:
241
+ logger.warning_once(
242
+ f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
243
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
244
+ "when creating this class."
245
+ )
246
+
247
+ self.attention_dropout = config.attention_dropout
248
+ self.hidden_size = config.hidden_size
249
+ self.num_heads = config.num_attention_heads
250
+ self.head_dim = self.hidden_size // self.num_heads
251
+ self.num_key_value_heads = config.num_key_value_heads
252
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
253
+ self.max_position_embeddings = config.max_position_embeddings
254
+ self.rope_theta = config.rope_theta
255
+ self.is_causal = True
256
+
257
+ if (self.head_dim * self.num_heads) != self.hidden_size:
258
+ raise ValueError(
259
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
260
+ f" and `num_heads`: {self.num_heads})."
261
+ )
262
+
263
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
264
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
265
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
266
+ self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=config.attention_bias)
267
+ self.rotary_emb = CohereRotaryEmbedding(
268
+ self.head_dim,
269
+ max_position_embeddings=self.max_position_embeddings,
270
+ base=self.rope_theta,
271
+ )
272
+
273
+ def forward(
274
+ self,
275
+ hidden_states: torch.Tensor,
276
+ attention_mask: Optional[torch.Tensor] = None,
277
+ position_ids: Optional[torch.LongTensor] = None,
278
+ past_key_value: Optional[Cache] = None,
279
+ output_attentions: bool = False,
280
+ use_cache: bool = False,
281
+ cache_position: Optional[torch.LongTensor] = None,
282
+ **kwargs,
283
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
284
+ bsz, q_len, _ = hidden_states.size()
285
+
286
+ if self.config.pretraining_tp > 1:
287
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
288
+ query_slices = self.q_proj.weight.split(
289
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
290
+ )
291
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
292
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
293
+
294
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
295
+ query_states = torch.cat(query_states, dim=-1)
296
+
297
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
298
+ key_states = torch.cat(key_states, dim=-1)
299
+
300
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
301
+ value_states = torch.cat(value_states, dim=-1)
302
+
303
+ else:
304
+ query_states = self.q_proj(hidden_states)
305
+ key_states = self.k_proj(hidden_states)
306
+ value_states = self.v_proj(hidden_states)
307
+
308
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
309
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
310
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
311
+
312
+ past_key_value = getattr(self, "past_key_value", past_key_value)
313
+ cos, sin = self.rotary_emb(value_states, position_ids)
314
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
315
+
316
+ if past_key_value is not None:
317
+ # sin and cos are specific to RoPE models; position_ids needed for the static cache
318
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
319
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
320
+
321
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
322
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
323
+
324
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
325
+
326
+ if attention_mask is not None: # no matter the length, we just slice it
327
+ causal_mask = attention_mask
328
+ if cache_position is not None:
329
+ causal_mask = attention_mask[:, :, cache_position, : key_states.shape[-2]]
330
+ attn_weights = attn_weights + causal_mask
331
+
332
+ # upcast attention to fp32
333
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
334
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
335
+ attn_output = torch.matmul(attn_weights, value_states)
336
+
337
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
338
+ raise ValueError(
339
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
340
+ f" {attn_output.size()}"
341
+ )
342
+
343
+ attn_output = attn_output.transpose(1, 2).contiguous()
344
+
345
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
346
+
347
+ if self.config.pretraining_tp > 1:
348
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
349
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
350
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
351
+ else:
352
+ attn_output = self.o_proj(attn_output)
353
+
354
+ if not output_attentions:
355
+ attn_weights = None
356
+
357
+ return attn_output, attn_weights, past_key_value
358
+
359
+
360
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2 Llama->Cohere
361
+ class CohereFlashAttention2(Attention):
362
+ """
363
+ Cohere flash attention module. This module inherits from `Attention` as the weights of the module stay
364
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
365
+ flash attention and deal with padding tokens in case the input contains any of them.
366
+ """
367
+
368
+ def __init__(self, *args, **kwargs):
369
+ super().__init__(*args, **kwargs)
370
+
371
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
372
+ # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
373
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
374
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
375
+
376
+ def forward(
377
+ self,
378
+ hidden_states: torch.Tensor,
379
+ attention_mask: Optional[torch.LongTensor] = None,
380
+ position_ids: Optional[torch.LongTensor] = None,
381
+ past_key_value: Optional[Cache] = None,
382
+ output_attentions: bool = False,
383
+ use_cache: bool = False,
384
+ cache_position: Optional[torch.LongTensor] = None,
385
+ **kwargs,
386
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
387
+ output_attentions = False
388
+
389
+ bsz, q_len, _ = hidden_states.size()
390
+
391
+ query_states = self.q_proj(hidden_states)
392
+ key_states = self.k_proj(hidden_states)
393
+ value_states = self.v_proj(hidden_states)
394
+
395
+ # Flash attention requires the input to have the shape
396
+ # batch_size x seq_length x num_heads x head_dim
397
+ # therefore we just need to keep the original shape
398
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
399
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
400
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
401
+
402
+ cos, sin = self.rotary_emb(value_states, position_ids)
403
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
404
+
405
+ past_key_value = getattr(self, "past_key_value", past_key_value)
406
+
407
+ if past_key_value is not None:
408
+ # sin and cos are specific to RoPE models; position_ids needed for the static cache
409
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
410
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
411
+
412
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
413
+ # to be able to avoid many of these transpose/reshape/view.
414
+ query_states = query_states.transpose(1, 2)
415
+ key_states = key_states.transpose(1, 2)
416
+ value_states = value_states.transpose(1, 2)
417
+
418
+ dropout_rate = self.attention_dropout if self.training else 0.0
419
+
420
+ # In PEFT, the layer norms are usually cast to float32 for training stability,
421
+ # so the input hidden states get silently cast to float32. Hence, we need to
422
+ # cast them back to the correct dtype to be sure everything works as expected.
423
+ # This might slow down training & inference, so it is recommended not to cast the LayerNorms
424
+ # to fp32.
425
+
426
+ input_dtype = query_states.dtype
427
+ if input_dtype == torch.float32:
428
+ if torch.is_autocast_enabled():
429
+ target_dtype = torch.get_autocast_gpu_dtype()
430
+ # Handle the case where the model is quantized
431
+ elif hasattr(self.config, "_pre_quantization_dtype"):
432
+ target_dtype = self.config._pre_quantization_dtype
433
+ else:
434
+ target_dtype = self.q_proj.weight.dtype
435
+
436
+ logger.warning_once(
437
+ f"The input hidden states seem to have been silently cast to float32; this might be related to"
438
+ f" the fact that you have upcast the embedding or layer norm layers to float32. We will cast the input back to"
439
+ f" {target_dtype}."
440
+ )
441
+
442
+ query_states = query_states.to(target_dtype)
443
+ key_states = key_states.to(target_dtype)
444
+ value_states = value_states.to(target_dtype)
445
+
446
+ attn_output = self._flash_attention_forward(
447
+ query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
448
+ )
449
+
450
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
451
+ attn_output = self.o_proj(attn_output)
452
+
453
+ if not output_attentions:
454
+ attn_weights = None
455
+
456
+ return attn_output, attn_weights, past_key_value
457
+
458
+ def _flash_attention_forward(
459
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
460
+ ):
461
+ """
462
+ Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
463
+ first unpads the input, then computes the attention scores and pads the final attention scores.
464
+
465
+ Args:
466
+ query_states (`torch.Tensor`):
467
+ Input query states to be passed to Flash Attention API
468
+ key_states (`torch.Tensor`):
469
+ Input key states to be passed to Flash Attention API
470
+ value_states (`torch.Tensor`):
471
+ Input value states to be passed to Flash Attention API
472
+ attention_mask (`torch.Tensor`):
473
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
474
+ position of padding tokens and 1 for the position of non-padding tokens.
475
+ dropout (`float`, *optional*):
476
+ Attention dropout
477
+ softmax_scale (`float`, *optional*):
478
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
479
+ """
480
+ if not self._flash_attn_uses_top_left_mask:
481
+ causal = self.is_causal
482
+ else:
483
+ # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in CohereFlashAttention2 __init__.
484
+ causal = self.is_causal and query_length != 1
485
+
486
+ # Contains at least one padding token in the sequence
487
+ if attention_mask is not None:
488
+ batch_size = query_states.shape[0]
489
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
490
+ query_states, key_states, value_states, attention_mask, query_length
491
+ )
492
+
493
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
494
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
495
+
496
+ attn_output_unpad = flash_attn_varlen_func(
497
+ query_states,
498
+ key_states,
499
+ value_states,
500
+ cu_seqlens_q=cu_seqlens_q,
501
+ cu_seqlens_k=cu_seqlens_k,
502
+ max_seqlen_q=max_seqlen_in_batch_q,
503
+ max_seqlen_k=max_seqlen_in_batch_k,
504
+ dropout_p=dropout,
505
+ softmax_scale=softmax_scale,
506
+ causal=causal,
507
+ )
508
+
509
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
510
+ else:
511
+ attn_output = flash_attn_func(
512
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
513
+ )
514
+
515
+ return attn_output
516
+
517
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
518
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
519
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
520
+
521
+ key_layer = index_first_axis(
522
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
523
+ )
524
+ value_layer = index_first_axis(
525
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
526
+ )
527
+ if query_length == kv_seq_len:
528
+ query_layer = index_first_axis(
529
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
530
+ )
531
+ cu_seqlens_q = cu_seqlens_k
532
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
533
+ indices_q = indices_k
534
+ elif query_length == 1:
535
+ max_seqlen_in_batch_q = 1
536
+ cu_seqlens_q = torch.arange(
537
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
538
+ ) # There is a memcpy here, that is very bad.
539
+ indices_q = cu_seqlens_q[:-1]
540
+ query_layer = query_layer.squeeze(1)
541
+ else:
542
+ # The -q_len: slice assumes left padding.
543
+ attention_mask = attention_mask[:, -query_length:]
544
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
545
+
546
+ return (
547
+ query_layer,
548
+ key_layer,
549
+ value_layer,
550
+ indices_q,
551
+ (cu_seqlens_q, cu_seqlens_k),
552
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
553
+ )
554
+
555
+
556
+ class SdpaAttention(Attention):
557
+ """
558
+ Attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
559
+ `Attention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
560
+ SDPA API.
561
+ """
562
+
563
+ # Adapted from Attention.forward
564
+ def forward(
565
+ self,
566
+ hidden_states: torch.Tensor,
567
+ attention_mask: Optional[torch.Tensor] = None,
568
+ position_ids: Optional[torch.LongTensor] = None,
569
+ past_key_value: Optional[Cache] = None,
570
+ output_attentions: bool = False,
571
+ use_cache: bool = False,
572
+ cache_position: Optional[torch.LongTensor] = None,
573
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
574
+ if output_attentions:
575
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
576
+ logger.warning_once(
577
+ "CohereModel is using SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
578
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
579
+ )
580
+ return super().forward(
581
+ hidden_states=hidden_states,
582
+ attention_mask=attention_mask,
583
+ position_ids=position_ids,
584
+ past_key_value=past_key_value,
585
+ output_attentions=output_attentions,
586
+ use_cache=use_cache,
587
+ cache_position=cache_position,
588
+ )
589
+
590
+ bsz, q_len, _ = hidden_states.size()
591
+
592
+ query_states = self.q_proj(hidden_states)
593
+ key_states = self.k_proj(hidden_states)
594
+ value_states = self.v_proj(hidden_states)
595
+
596
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
597
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
598
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
599
+
600
+ cos, sin = self.rotary_emb(value_states, position_ids)
601
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
602
+
603
+ # In case static cache is used, it is an instance attribute.
604
+ past_key_value = getattr(self, "past_key_value", past_key_value)
605
+
606
+ if past_key_value is not None:
607
+ # sin and cos are specific to RoPE models; position_ids needed for the static cache
608
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
609
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
610
+
611
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
612
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
613
+
614
+ causal_mask = attention_mask
615
+ if attention_mask is not None and cache_position is not None:
616
+ causal_mask = causal_mask[:, :, cache_position, : key_states.shape[-2]]
617
+
618
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
619
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
620
+ if query_states.device.type == "cuda" and causal_mask is not None:
621
+ query_states = query_states.contiguous()
622
+ key_states = key_states.contiguous()
623
+ value_states = value_states.contiguous()
624
+
625
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
626
+ query_states,
627
+ key_states,
628
+ value_states,
629
+ attn_mask=causal_mask,
630
+ dropout_p=self.attention_dropout if self.training else 0.0,
631
+ )
632
+
633
+ attn_output = attn_output.transpose(1, 2).contiguous()
634
+ attn_output = attn_output.view(bsz, q_len, self.hidden_size)
635
+
636
+ attn_output = self.o_proj(attn_output)
637
+
638
+ return attn_output, None, past_key_value
639
+
640
+
641
+ COHERE_ATTENTION_CLASSES = {
642
+ "eager": Attention,
643
+ "flash_attention_2": CohereFlashAttention2,
644
+ "sdpa": SdpaAttention,
645
+ }
646
+
647
+
648
+ class CohereDecoderLayer(nn.Module):
649
+ def __init__(self, config: CohereConfig, layer_idx: int):
650
+ super().__init__()
651
+ self.hidden_size = config.hidden_size
652
+
653
+ self.self_attn = COHERE_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
654
+
655
+ self.mlp = CohereMLP(config)
656
+ self.input_layernorm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
657
+
658
+ def forward(
659
+ self,
660
+ hidden_states: torch.Tensor,
661
+ attention_mask: Optional[torch.Tensor] = None,
662
+ position_ids: Optional[torch.LongTensor] = None,
663
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
664
+ output_attentions: Optional[bool] = False,
665
+ use_cache: Optional[bool] = False,
666
+ cache_position: Optional[torch.LongTensor] = None,
667
+ **kwargs,
668
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
669
+ """
670
+ Args:
671
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
672
+ attention_mask (`torch.FloatTensor`, *optional*):
673
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
674
+ query_sequence_length, key_sequence_length)` if default attention is used.
675
+ output_attentions (`bool`, *optional*):
676
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
677
+ returned tensors for more detail.
678
+ use_cache (`bool`, *optional*):
679
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
680
+ (see `past_key_values`).
681
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
682
+ """
683
+ if "padding_mask" in kwargs:
684
+ warnings.warn(
685
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
686
+ )
687
+
688
+ residual = hidden_states
689
+
690
+ hidden_states = self.input_layernorm(hidden_states)
691
+
692
+ # Self Attention
693
+ hidden_states_attention, self_attn_weights, present_key_value = self.self_attn(
694
+ hidden_states=hidden_states,
695
+ attention_mask=attention_mask,
696
+ position_ids=position_ids,
697
+ past_key_value=past_key_value,
698
+ output_attentions=output_attentions,
699
+ use_cache=use_cache,
700
+ cache_position=cache_position,
701
+ **kwargs,
702
+ )
703
+
704
+ # Fully Connected
705
+ hidden_states_mlp = self.mlp(hidden_states)
706
+
707
+ # Add everything together
708
+ hidden_states = residual + hidden_states_attention + hidden_states_mlp
709
+
710
+ outputs = (hidden_states,)
711
+
712
+ if output_attentions:
713
+ outputs += (self_attn_weights,)
714
+
715
+ if use_cache:
716
+ outputs += (present_key_value,)
717
+
718
+ return outputs
719
+
720
+
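The decoder layer above feeds the same normalized input to both attention and the MLP and adds both results to the residual in one step. The sketch below contrasts that parallel form with a sequential pre-norm block; the sequential variant and all helper names are assumptions added for comparison, and toy list functions stand in for the real submodules.

```python
def parallel_block(x, norm, attn, mlp):
    """Parallel residual: attention and MLP both read norm(x)."""
    h = norm(x)
    return [xi + ai + mi for xi, ai, mi in zip(x, attn(h), mlp(h))]


def sequential_block(x, norm_1, attn, norm_2, mlp):
    """Sequential residual (for contrast): the MLP reads the attention output."""
    x = [xi + ai for xi, ai in zip(x, attn(norm_1(x)))]
    return [xi + mi for xi, mi in zip(x, mlp(norm_2(x)))]


# Toy stand-ins for the real submodules.
identity = lambda v: list(v)
double = lambda v: [2 * e for e in v]
negate = lambda v: [-e for e in v]

parallel_out = parallel_block([1.0, 2.0], identity, double, negate)
sequential_out = sequential_block([1.0, 2.0], identity, double, identity, negate)
print(parallel_out, sequential_out)
```

The two forms give different results for the same submodules, so weights trained with one layout are not interchangeable with the other.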
721
+ COHERE_START_DOCSTRING = r"""
722
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
723
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
724
+ etc.)
725
+
726
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
727
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
728
+ and behavior.
729
+
730
+ Parameters:
731
+ config ([`CohereConfig`]):
732
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
733
+ load the weights associated with the model, only the configuration. Check out the
734
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
735
+ """
736
+
737
+
738
+ @add_start_docstrings(
739
+ "The bare Cohere Model outputting raw hidden-states without any specific head on top.",
740
+ COHERE_START_DOCSTRING,
741
+ )
742
+ class CoherePreTrainedModel(PreTrainedModel):
743
+ config_class = CohereConfig
744
+ base_model_prefix = "model"
745
+ supports_gradient_checkpointing = True
746
+ _no_split_modules = ["CohereDecoderLayer"]
747
+ _skip_keys_device_placement = ["past_key_values", "causal_mask"]
748
+ _supports_flash_attn_2 = True
749
+ _supports_sdpa = True
750
+ _supports_cache_class = True
751
+
752
+ def _init_weights(self, module):
753
+ std = self.config.initializer_range
754
+ if isinstance(module, nn.Linear):
755
+ module.weight.data.normal_(mean=0.0, std=std)
756
+ if module.bias is not None:
757
+ module.bias.data.zero_()
758
+ elif isinstance(module, nn.Embedding):
759
+ module.weight.data.normal_(mean=0.0, std=std)
760
+ if module.padding_idx is not None:
761
+ module.weight.data[module.padding_idx].zero_()
762
+
763
+ def _setup_cache(self, cache_cls, max_batch_size, max_cache_len: Optional[int] = None):
764
+ if self.config._attn_implementation == "flash_attention_2" and cache_cls == StaticCache:
765
+ raise ValueError(
766
+ "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` "
767
+ "make sure to use `sdpa` in the meantime, and open an issue at https://github.com/huggingface/transformers"
768
+ )
769
+
770
+ if max_cache_len > self.model.causal_mask.shape[-1] or self.device != self.model.causal_mask.device:
771
+ causal_mask = torch.full(
772
+ (max_cache_len, max_cache_len), fill_value=True, device=self.device, dtype=torch.bool
773
+ )
774
+ self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
775
+
776
+ for layer in self.model.layers:
777
+ device = layer.input_layernorm.weight.device
778
+ if hasattr(self.config, "_pre_quantization_dtype"):
779
+ dtype = self.config._pre_quantization_dtype
780
+ else:
781
+ dtype = layer.self_attn.o_proj.weight.dtype
782
+ layer.self_attn.past_key_value = cache_cls(
783
+ self.config, max_batch_size, max_cache_len, device=device, dtype=dtype
784
+ )
785
+
786
+ def _reset_cache(self):
787
+ for layer in self.model.layers:
788
+ layer.self_attn.past_key_value = None
789
+
790
+
791
+ COHERE_INPUTS_DOCSTRING = r"""
792
+ Args:
793
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
794
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
795
+ it.
796
+
797
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
798
+ [`PreTrainedTokenizer.__call__`] for details.
799
+
800
+ [What are input IDs?](../glossary#input-ids)
801
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
802
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
803
+
804
+ - 1 for tokens that are **not masked**,
805
+ - 0 for tokens that are **masked**.
806
+
807
+ [What are attention masks?](../glossary#attention-mask)
808
+
809
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
810
+ [`PreTrainedTokenizer.__call__`] for details.
811
+
812
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
813
+ `past_key_values`).
814
+
815
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
816
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
817
+ information on the default strategy.
818
+
819
+ - 1 indicates the head is **not masked**,
820
+ - 0 indicates the head is **masked**.
821
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
822
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
823
+ config.n_positions - 1]`.
824
+
825
+ [What are position IDs?](../glossary#position-ids)
826
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
827
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
828
+ blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
829
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
830
+
831
+ Two formats are allowed:
832
+ - a [`~cache_utils.Cache`] instance;
833
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
834
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
835
+ cache format.
836
+
837
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
838
+ legacy cache format will be returned.
839
+
840
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
841
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
842
+ of shape `(batch_size, sequence_length)`.
843
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
844
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
845
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
846
+ model's internal embedding lookup matrix.
847
+ use_cache (`bool`, *optional*):
848
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
849
+ `past_key_values`).
850
+ output_attentions (`bool`, *optional*):
851
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
852
+ tensors for more detail.
853
+ output_hidden_states (`bool`, *optional*):
854
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
855
+ more detail.
856
+ return_dict (`bool`, *optional*):
857
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
858
+ """
859
+
860
+
861
+ @add_start_docstrings(
862
+ "The bare Cohere Model outputting raw hidden-states without any specific head on top.",
863
+ COHERE_START_DOCSTRING,
864
+ )
865
+ class CohereModel(CoherePreTrainedModel):
866
+ """
867
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`CohereDecoderLayer`]
868
+
869
+ Args:
870
+ config: CohereConfig
871
+ """
872
+
873
+ def __init__(self, config: CohereConfig):
874
+ super().__init__(config)
875
+ self.padding_idx = config.pad_token_id
876
+ self.vocab_size = config.vocab_size
877
+
878
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
879
+ self.layers = nn.ModuleList(
880
+ [CohereDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
881
+ )
882
+ self.norm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
883
+ self.gradient_checkpointing = False
884
+
885
+ # Register a causal mask to separate causal and padding mask creation. Merging happens in the attention class.
886
+ # NOTE: This is not friendly with TorchScript, ONNX, ExportedProgram serialization for very large `max_position_embeddings`.
887
+ causal_mask = torch.full(
888
+ (config.max_position_embeddings, config.max_position_embeddings), fill_value=True, dtype=torch.bool
889
+ )
890
+ self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
891
+ # Initialize weights and apply final processing
892
+ self.post_init()
893
+
894
+ def get_input_embeddings(self):
895
+ return self.embed_tokens
896
+
897
+ def set_input_embeddings(self, value):
898
+ self.embed_tokens = value
899
+
900
+ @add_start_docstrings_to_model_forward(COHERE_INPUTS_DOCSTRING)
901
+ def forward(
902
+ self,
903
+ input_ids: torch.LongTensor = None,
904
+ attention_mask: Optional[torch.Tensor] = None,
905
+ position_ids: Optional[torch.LongTensor] = None,
906
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
907
+ inputs_embeds: Optional[torch.FloatTensor] = None,
908
+ use_cache: Optional[bool] = None,
909
+ output_attentions: Optional[bool] = None,
910
+ output_hidden_states: Optional[bool] = None,
911
+ return_dict: Optional[bool] = None,
912
+ cache_position: Optional[torch.LongTensor] = None,
913
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
914
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
915
+ output_hidden_states = (
916
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
917
+ )
918
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
919
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
920
+
921
+ if (input_ids is None) ^ (inputs_embeds is not None):
922
+ raise ValueError(
923
+ "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
924
+ )
925
+
926
+ if self.gradient_checkpointing and self.training and use_cache:
927
+ logger.warning_once(
928
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
929
+ )
930
+ use_cache = False
931
+
932
+ if inputs_embeds is None:
933
+ inputs_embeds = self.embed_tokens(input_ids)
934
+
935
+ past_seen_tokens = 0
936
+ if use_cache: # kept for BC (cache positions)
937
+ if not isinstance(past_key_values, StaticCache):
938
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
939
+ past_seen_tokens = past_key_values.get_seq_length()
940
+
941
+ if cache_position is None:
942
+ if isinstance(past_key_values, StaticCache):
943
+ raise ValueError("cache_position is a required argument when using StaticCache.")
944
+ cache_position = torch.arange(
945
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
946
+ )
947
+
948
+ if position_ids is None:
949
+ position_ids = cache_position.unsqueeze(0)
950
+
951
+ causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
952
+
953
+ # embed positions
954
+ hidden_states = inputs_embeds
955
+
956
+ # decoder layers
957
+ all_hidden_states = () if output_hidden_states else None
958
+ all_self_attns = () if output_attentions else None
959
+ next_decoder_cache = None
960
+
961
+ for decoder_layer in self.layers:
962
+ if output_hidden_states:
963
+ all_hidden_states += (hidden_states,)
964
+
965
+ if self.gradient_checkpointing and self.training:
966
+ layer_outputs = self._gradient_checkpointing_func(
967
+ decoder_layer.__call__,
968
+ hidden_states,
969
+ causal_mask,
970
+ position_ids,
971
+ past_key_values,
972
+ output_attentions,
973
+ use_cache,
974
+ cache_position,
975
+ )
976
+ else:
977
+ layer_outputs = decoder_layer(
978
+ hidden_states,
979
+ attention_mask=causal_mask,
980
+ position_ids=position_ids,
981
+ past_key_value=past_key_values,
982
+ output_attentions=output_attentions,
983
+ use_cache=use_cache,
984
+ cache_position=cache_position,
985
+ )
986
+
987
+ hidden_states = layer_outputs[0]
988
+
989
+ if use_cache:
990
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
991
+
992
+ if output_attentions:
993
+ all_self_attns += (layer_outputs[1],)
994
+
995
+ hidden_states = self.norm(hidden_states)
996
+
997
+ # add hidden states from the last decoder layer
998
+ if output_hidden_states:
999
+ all_hidden_states += (hidden_states,)
1000
+
1001
+ next_cache = None
1002
+ if use_cache:
1003
+ next_cache = (
1004
+ next_decoder_cache.to_legacy_cache() if isinstance(next_decoder_cache, Cache) else next_decoder_cache
1005
+ )
1006
+ if not return_dict:
1007
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1008
+ return BaseModelOutputWithPast(
1009
+ last_hidden_state=hidden_states,
1010
+ past_key_values=next_cache,
1011
+ hidden_states=all_hidden_states,
1012
+ attentions=all_self_attns,
1013
+ )
1014
+
1015
+ # TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length even when the static
1016
+ # KV cache is used. This is an issue for torch.compile which then recaptures cudagraphs at each decode step due to the dynamic shapes.
1017
+ # (`recording cudagraph tree for symint key 13`, etc.), which is VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using
1018
+ # `fullgraph=True`. See more context in https://github.com/huggingface/transformers/pull/29114
1019
+ def _update_causal_mask(self, attention_mask, input_tensor):
1020
+ if self.config._attn_implementation == "flash_attention_2":
1021
+ if attention_mask is not None and 0.0 in attention_mask:
1022
+ return attention_mask
1023
+ return None
1024
+
1025
+ batch_size, seq_length = input_tensor.shape[:2]
1026
+ dtype = input_tensor.dtype
1027
+ device = input_tensor.device
1028
+
1029
+ # support going beyond cached `max_position_embeddings`
1030
+ if seq_length > self.causal_mask.shape[-1]:
1031
+ causal_mask = torch.full((2 * self.causal_mask.shape[-1], 2 * self.causal_mask.shape[-1]), fill_value=1)
1032
+ self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
1033
+
1034
+ # We use the current dtype to avoid any overflows
1035
+ min_dtype = torch.finfo(dtype).min
1036
+ causal_mask = self.causal_mask[None, None, :, :].to(dtype=dtype, device=device) * min_dtype
1037
+ causal_mask = causal_mask.expand(batch_size, 1, -1, -1)
1038
+ if attention_mask is not None and attention_mask.dim() == 2:
1039
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
1040
+ mask_length = attention_mask.shape[-1]
1041
+ padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
1042
+ causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
1043
+
1044
+ if (
1045
+ self.config._attn_implementation == "sdpa"
1046
+ and attention_mask is not None
1047
+ and attention_mask.device.type == "cuda"
1048
+ ):
1049
+ # TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
1050
+ is_tracing = (
1051
+ torch.jit.is_tracing()
1052
+ or isinstance(input_tensor, torch.fx.Proxy)
1053
+ or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
1054
+ )
1055
+ if not is_tracing and torch.any(attention_mask != 1):
1056
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
1057
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1058
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1059
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
1060
+
1061
+ return causal_mask
1062
+
1063
+
1064
+ class CohereForCausalLM(CoherePreTrainedModel):
1065
+ _tied_weights_keys = ["model.embed_tokens.weight", "lm_head.weight"]
1066
+
1067
+ def __init__(self, config):
1068
+ super().__init__(config)
1069
+ self.model = CohereModel(config)
1070
+ self.vocab_size = config.vocab_size
1071
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1072
+ self.logit_scale = config.logit_scale
1073
+ # Initialize weights and apply final processing
1074
+ self.post_init()
1075
+
1076
+ def get_input_embeddings(self):
1077
+ return self.model.embed_tokens
1078
+
1079
+ def set_input_embeddings(self, value):
1080
+ self.model.embed_tokens = value
1081
+
1082
+ def get_output_embeddings(self):
1083
+ return self.lm_head
1084
+
1085
+ def set_output_embeddings(self, new_embeddings):
1086
+ self.lm_head = new_embeddings
1087
+
1088
+ def set_decoder(self, decoder):
1089
+ self.model = decoder
1090
+
1091
+ def get_decoder(self):
1092
+ return self.model
1093
+
1094
+ @add_start_docstrings_to_model_forward(COHERE_INPUTS_DOCSTRING)
1095
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1096
+ def forward(
1097
+ self,
1098
+ input_ids: torch.LongTensor = None,
1099
+ attention_mask: Optional[torch.Tensor] = None,
1100
+ position_ids: Optional[torch.LongTensor] = None,
1101
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1102
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1103
+ labels: Optional[torch.LongTensor] = None,
1104
+ use_cache: Optional[bool] = None,
1105
+ output_attentions: Optional[bool] = None,
1106
+ output_hidden_states: Optional[bool] = None,
1107
+ return_dict: Optional[bool] = None,
1108
+ cache_position: Optional[torch.LongTensor] = None,
1109
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1110
+ r"""
1111
+ Args:
1112
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1113
+ Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
1114
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1115
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1116
+
1117
+ Returns:
1118
+
1119
+ Example:
1120
+
1121
+ ```python
1122
+ >>> from transformers import AutoTokenizer, CohereForCausalLM
1123
+
1124
+ >>> # Checkpoint name taken from original_repo_url.txt in this commit
1125
+ >>> model = CohereForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01")
1126
+ >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
1127
+
1128
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1129
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1130
+
1131
+ >>> # Generate
1132
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1133
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1134
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1135
+ ```"""
1136
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1137
+ output_hidden_states = (
1138
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1139
+ )
1140
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1141
+
1142
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1143
+ outputs = self.model(
1144
+ input_ids=input_ids,
1145
+ attention_mask=attention_mask,
1146
+ position_ids=position_ids,
1147
+ past_key_values=past_key_values,
1148
+ inputs_embeds=inputs_embeds,
1149
+ use_cache=use_cache,
1150
+ output_attentions=output_attentions,
1151
+ output_hidden_states=output_hidden_states,
1152
+ return_dict=return_dict,
1153
+ cache_position=cache_position,
1154
+ )
1155
+
1156
+ hidden_states = outputs[0]
1157
+ if self.config.pretraining_tp > 1:
1158
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1159
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1160
+ logits = torch.cat(logits, dim=-1)
1161
+ else:
1162
+ logits = self.lm_head(hidden_states)
1163
+ logits = logits * self.logit_scale
1164
+ logits = logits.float()
1165
+
1166
+ loss = None
1167
+ if labels is not None:
1168
+ # Shift so that tokens < n predict n
1169
+ shift_logits = logits[..., :-1, :].contiguous()
1170
+ shift_labels = labels[..., 1:].contiguous()
1171
+ # Flatten the tokens
1172
+ loss_fct = CrossEntropyLoss()
1173
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1174
+ shift_labels = shift_labels.view(-1)
1175
+ # Enable model parallelism
1176
+ shift_labels = shift_labels.to(shift_logits.device)
1177
+ loss = loss_fct(shift_logits, shift_labels)
1178
+
1179
+ if not return_dict:
1180
+ output = (logits,) + outputs[1:]
1181
+ return (loss,) + output if loss is not None else output
1182
+
1183
+ return CausalLMOutputWithPast(
1184
+ loss=loss,
1185
+ logits=logits,
1186
+ past_key_values=outputs.past_key_values,
1187
+ hidden_states=outputs.hidden_states,
1188
+ attentions=outputs.attentions,
1189
+ )
1190
+
1191
+ def prepare_inputs_for_generation(
1192
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
1193
+ ):
1194
+ past_length = 0
1195
+ if past_key_values is not None:
1196
+ if isinstance(past_key_values, Cache):
1197
+ cache_length = past_key_values.get_seq_length()
1198
+ past_length = past_key_values.seen_tokens
1199
+ max_cache_length = past_key_values.get_max_length()
1200
+ else:
1201
+ cache_length = past_length = past_key_values[0][0].shape[2]
1202
+ max_cache_length = None
1203
+
1204
+ # Keep only the unprocessed tokens:
1205
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1206
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
1207
+ # input)
1208
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1209
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1210
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1211
+ # input_ids based on the past_length.
1212
+ elif past_length < input_ids.shape[1]:
1213
+ input_ids = input_ids[:, past_length:]
1214
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1215
+
1216
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1217
+ if (
1218
+ max_cache_length is not None
1219
+ and attention_mask is not None
1220
+ and cache_length + input_ids.shape[1] > max_cache_length
1221
+ ):
1222
+ attention_mask = attention_mask[:, -max_cache_length:]
1223
+
1224
+ position_ids = kwargs.get("position_ids", None)
1225
+ if attention_mask is not None and position_ids is None:
1226
+ # create position_ids on the fly for batch generation
1227
+ position_ids = attention_mask.long().cumsum(-1) - 1
1228
+ position_ids.masked_fill_(attention_mask == 0, 1)
1229
+ if past_key_values:
1230
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1231
+
1232
+ if self.generation_config.cache_implementation == "static":
1233
+ # generation with static cache
1234
+ cache_position = kwargs.get("cache_position", None)
1235
+ if cache_position is None:
1236
+ past_length = 0
1237
+ else:
1238
+ past_length = cache_position[-1] + 1
1239
+ input_ids = input_ids[:, past_length:]
1240
+ position_ids = position_ids[:, past_length:]
1241
+
1242
+ # TODO @gante we should only keep a `cache_position` in generate, and do +=1.
1243
+ # same goes for position ids. Could also help with continued generation.
1244
+ input_length = position_ids.shape[-1] if position_ids is not None else input_ids.shape[-1]
1245
+ cache_position = torch.arange(past_length, past_length + input_length, device=input_ids.device)
1246
+ position_ids = position_ids.contiguous() if position_ids is not None else None
1247
+
1248
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1249
+ if inputs_embeds is not None and past_key_values is None:
1250
+ model_inputs = {"inputs_embeds": inputs_embeds}
1251
+ else:
1252
+ # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
1253
+ # recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
1254
+ # TODO: use `next_tokens` directly instead.
1255
+ model_inputs = {"input_ids": input_ids.contiguous()}
1256
+
1257
+ model_inputs.update(
1258
+ {
1259
+ "position_ids": position_ids,
1260
+ "cache_position": cache_position,
1261
+ "past_key_values": past_key_values,
1262
+ "use_cache": kwargs.get("use_cache"),
1263
+ "attention_mask": attention_mask,
1264
+ }
1265
+ )
1266
+ return model_inputs
1267
+
1268
+ @staticmethod
1269
+ def _reorder_cache(past_key_values, beam_idx):
1270
+ reordered_past = ()
1271
+ for layer_past in past_key_values:
1272
+ reordered_past += (
1273
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1274
+ )
1275
+ return reordered_past
1276
+
1277
+
1278
+ # register models as AutoModel and AutoModelForCausalLM
1279
+ AutoModel.register(CohereConfig, CohereModel)
1280
+ AutoModelForCausalLM.register(CohereConfig, CohereForCausalLM)
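The `position_ids` fallback in `prepare_inputs_for_generation` above (`attention_mask.long().cumsum(-1) - 1`, followed by `masked_fill_(attention_mask == 0, 1)`) can be illustrated without torch. What follows is a minimal pure-Python sketch for a single row, not the library's code; the helper name `position_ids_from_mask` is hypothetical and exists only for this illustration.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """Mirror `mask.cumsum(-1) - 1` then `masked_fill_(mask == 0, 1)` for one row.

    `attention_mask` is a list of 0/1 ints (0 = padding, 1 = real token).
    """
    running = accumulate(attention_mask)  # cumulative count of real tokens so far
    # Padding positions get the dummy value 1; real tokens count up from 0.
    return [1 if m == 0 else c - 1 for m, c in zip(attention_mask, running)]


# Left-padded row: two pad tokens followed by three real tokens.
print(position_ids_from_mask([0, 0, 1, 1, 1]))  # → [1, 1, 0, 1, 2]
```

With left padding, the first real token still receives position 0, so the padding does not shift the positions the model sees for the actual prompt.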
original_repo_url.txt ADDED
@@ -0,0 +1 @@
1
+ https://huggingface.co/CohereForAI/c4ai-command-r-v01
output-00001-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:88133c16e799264d13192e376b32de7ae1bfa4fd8f1867f3885d2b26055b16bb
3
+ size 8587925566
output-00002-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bd8bd6d89b19c414e2ed8301d98eb25c3a4306f1f7871d485ccca753dfe48046
3
+ size 8569599568
output-00003-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:18779478f820e7677b7468e191967cda01f8cdf4eee27082120d732f82423a8c
3
+ size 8582850448
output-00004-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1a252c8446fad63ebee87edd843b402920acf88d608f34eed9ce97945c47b692
3
+ size 6871343664
output-00005-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cba5ab17ad5c45e11404996de897e2f740e98727cde2921f781c161b528ba62f
3
+ size 2097152096
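Each `output-0000N-of-00005.safetensors` entry above is a Git LFS pointer rather than the weights themselves: three `key value` lines giving the spec version, the object hash, and the size in bytes. The following is a rough sketch of reading such a pointer, assuming only the three-line layout shown in this commit; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


# Pointer contents copied from the last shard in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:cba5ab17ad5c45e11404996de897e2f740e98727cde2921f781c161b528ba62f
size 2097152096
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

Summing the `size` fields of the five pointers gives the download size of this branch before fetching anything: 34,708,871,342 bytes, roughly 34.7 GB.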
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<BOS_TOKEN>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|END_OF_TURN_TOKEN|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<PAD>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ }
23
+ }
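As a quick check of what `special_tokens_map.json` declares, the sketch below parses the same JSON with the standard library and extracts the literal token strings. It is only an illustration of the file's structure; the variable names are arbitrary, and it does not go through the tokenizer class.

```python
import json

# Contents of special_tokens_map.json as added in this commit.
special_tokens_map = json.loads("""
{
  "bos_token": {"content": "<BOS_TOKEN>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "eos_token": {"content": "<|END_OF_TURN_TOKEN|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "pad_token": {"content": "<PAD>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false}
}
""")

# Map each special-token role to its literal string.
tokens = {name: spec["content"] for name, spec in special_tokens_map.items()}
print(tokens)
```

Note that the end-of-sequence token declared here is `<|END_OF_TURN_TOKEN|>`, which differs from the `<EOS_TOKEN>` constructor default in `tokenization_cohere_fast.py`.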
tokenization_cohere_fast.py ADDED
@@ -0,0 +1,754 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 Cohere and The HuggingFace Inc. team.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ # This file is based on the tokenization_llama_fast.py file in transformers
17
+
18
+
19
+ import os
20
+ from shutil import copyfile
21
+ from typing import Optional, Tuple, Dict, Union, List, Literal
22
+
23
+ from tokenizers import processors
24
+ from transformers import AutoTokenizer
25
+ from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
26
+ from transformers.utils import logging
27
+ from transformers.utils.versions import require_version
28
+ from transformers.tokenization_utils_base import TensorType
29
+ from transformers.pipelines.conversational import Conversation
30
+
31
+ from .configuration_cohere import CohereConfig
32
+
33
+ require_version("tokenizers>=0.13.3")
34
+
35
+ logger = logging.get_logger(__name__)
36
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.json"}
37
+
38
+ PRETRAINED_VOCAB_FILES_MAP = {
39
+ "vocab_file": {
40
+ "cohere-tokenizer": "https://huggingface.co/Cohere/Command-nightly/blob/main/tokenizer.json",
41
+ },
42
+ }
43
+
44
+ # fmt: off
45
+ DEFAULT_SYSTEM_PROMPT = "You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere."
46
+ DEFAULT_RAG_PREAMBLE = """## Task and Context
47
+ You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
48
+
49
+ ## Style Guide
50
+ Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling."""
51
+ # fmt: on
52
+
53
+
54
+ class CohereTokenizerFast(PreTrainedTokenizerFast):
55
+ """
56
+ Construct a Cohere tokenizer. Based on byte-level Byte-Pair-Encoding.
57
+
58
+ This uses notably ByteFallback and NFC normalization.
59
+
60
+ ```python
61
+ >>> from transformers import AutoTokenizer
62
+
63
+ >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
64
+ >>> tokenizer.encode("Hello this is a test")
65
+ [1, 15043, 445, 338, 263, 1243]
66
+ ```
67
+
68
+ If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or
69
+ call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the
70
+ values of the first token and final token of an encoded sequence will not be correct). For more details, check out the
71
+ [post-processors](https://huggingface.co/docs/tokenizers/api/post-processors) documentation.
72
+
73
+
74
+ This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
75
+ refer to this superclass for more information regarding those methods.
76
+
77
+ Args:
78
+ vocab_file (`str`, *optional*):
79
+ [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that
80
+ contains the vocabulary necessary to instantiate a tokenizer.
81
+ tokenizer_file (`str`, *optional*):
82
+ [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
83
+ contains everything needed to load the tokenizer.
84
+ clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
85
+ Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
86
+ extra spaces.
87
+ unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<UNK>"`):
88
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
89
+ token instead.
90
+ bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<BOS_TOKEN>"`):
91
+ The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
92
+ eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<EOS_TOKEN>"`):
93
+ The end of sequence token.
94
+ add_bos_token (`bool`, *optional*, defaults to `True`):
95
+ Whether or not to add a `bos_token` at the start of sequences.
96
+ add_eos_token (`bool`, *optional*, defaults to `False`):
97
+ Whether or not to add an `eos_token` at the end of sequences.
98
+ use_default_system_prompt (`bool`, *optional*, defaults to `False`):
99
+ Whether or not the default system prompt for Cohere tokenizer should be used.
100
+ add_prefix_space (`bool`, *optional*):
101
+ Whether or not the tokenizer should automatically add a prefix space
102
+ """
103
+
104
+ vocab_files_names = VOCAB_FILES_NAMES
105
+ padding_side = "left"
106
+ model_input_names = ["input_ids", "attention_mask"]
107
+
108
+ def __init__(
109
+ self,
110
+ vocab_file=None,
111
+ tokenizer_file=None,
112
+ clean_up_tokenization_spaces=False,
113
+ unk_token="<UNK>",
114
+ bos_token="<BOS_TOKEN>",
115
+ eos_token="<EOS_TOKEN>",
116
+ add_bos_token=True,
117
+ add_eos_token=False,
118
+ use_default_system_prompt=False,
119
+ add_prefix_space=None,
120
+ **kwargs,
121
+ ):
122
+ if add_prefix_space is not None:
123
+ logger.warning_once(
124
+ "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers"
125
+ )
126
+ kwargs["from_slow"] = True
127
+
128
+ super().__init__(
129
+ vocab_file=vocab_file,
130
+ tokenizer_file=tokenizer_file,
131
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
132
+ unk_token=unk_token,
133
+ bos_token=bos_token,
134
+ eos_token=eos_token,
135
+ add_bos_token=add_bos_token,
136
+ add_eos_token=add_eos_token,
137
+ use_default_system_prompt=use_default_system_prompt,
138
+ **kwargs,
139
+ )
140
+ self._add_bos_token = add_bos_token
141
+ self._add_eos_token = add_eos_token
142
+ self.update_post_processor()
143
+ self.use_default_system_prompt = use_default_system_prompt
144
+ self.vocab_file = vocab_file
145
+ self.grounded_generation_template = kwargs.pop("grounded_generation_template", None)
146
+ self.tool_use_template = kwargs.pop("tool_use_template", None)
147
+
148
+ def update_post_processor(self):
149
+ """
150
+ Updates the underlying post processor with the current `bos_token` and `eos_token`.
151
+ """
152
+ bos = self.bos_token
153
+ bos_token_id = self.bos_token_id
154
+ if bos is None and self.add_bos_token:
155
+ raise ValueError("add_bos_token = True but bos_token = None")
156
+
157
+ eos = self.eos_token
158
+ eos_token_id = self.eos_token_id
159
+ if eos is None and self.add_eos_token:
160
+ raise ValueError("add_eos_token = True but eos_token = None")
161
+
162
+ single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
163
+ pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"
164
+
165
+ special_tokens = []
166
+ if self.add_bos_token:
167
+ special_tokens.append((bos, bos_token_id))
168
+ if self.add_eos_token:
169
+ special_tokens.append((eos, eos_token_id))
170
+ self._tokenizer.post_processor = processors.TemplateProcessing(
171
+ single=single, pair=pair, special_tokens=special_tokens
172
+ )
173
+
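As a rough illustration of what `update_post_processor` assembles, the sketch below reproduces only the string formatting of the `single` and `pair` templates in plain Python. The helper name `build_templates` is hypothetical and not part of the tokenizer. The real method additionally passes these strings, together with the special-token ids, to `processors.TemplateProcessing`.

```python
from typing import Tuple


def build_templates(bos: str, eos: str, add_bos: bool, add_eos: bool) -> Tuple[str, str]:
    """Hypothetical helper mirroring the f-strings used in update_post_processor."""
    # "$A" and "$B" stand for the first and second sequence; ":0" and ":1" are type ids.
    single = f"{(bos + ':0 ') if add_bos else ''}$A:0{(' ' + eos + ':0') if add_eos else ''}"
    pair = f"{single}{(' ' + bos + ':1') if add_bos else ''} $B:1{(' ' + eos + ':1') if add_eos else ''}"
    return single, pair


single, pair = build_templates("<BOS_TOKEN>", "<EOS_TOKEN>", add_bos=True, add_eos=False)
print(single)  # <BOS_TOKEN>:0 $A:0
print(pair)    # <BOS_TOKEN>:0 $A:0 <BOS_TOKEN>:1 $B:1
```

With the defaults shown in `__init__` (`add_bos_token=True`, `add_eos_token=False`), only the BOS token is prepended, which is why changing either flag triggers a rebuild of the post processor.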
174
+ @property
175
+ def add_eos_token(self):
176
+ return self._add_eos_token
177
+
178
+ @property
179
+ def add_bos_token(self):
180
+ return self._add_bos_token
181
+
182
+ @add_eos_token.setter
183
+ def add_eos_token(self, value):
184
+ self._add_eos_token = value
185
+ self.update_post_processor()
186
+
187
+ @add_bos_token.setter
188
+ def add_bos_token(self, value):
189
+ self._add_bos_token = value
190
+ self.update_post_processor()
191
+
192
+ @property
193
+ def default_chat_template(self):
194
+ """
195
+ Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat.
196
+ Additionally, to indicate the source of the message, it uses <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|>
197
+ for user, assistant and system messages respectively.
198
+
199
+ The output should look something like:
200
+ <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|>
201
+ <|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|>
202
+ <|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ I am doing well! }}<|END_OF_TURN_TOKEN|>
203
+
204
+ Use add_generation_prompt to add a prompt for the model to generate a response:
205
+
206
+ >>> messages = [{"role": "user", "content": "Hello, how are you?"}]
207
+ >>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
208
+ <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
209
+
210
+ """
211
+ logger.warning_once(
212
+ "\nNo chat template is defined for this tokenizer - using the default template "
213
+ f"for the {self.__class__.__name__} class. If the default is not appropriate for "
214
+ "your model, please set `tokenizer.chat_template` to an appropriate template. "
215
+ "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n"
216
+ )
217
+ template = (
218
+ "{{ bos_token }}"
219
+ "{% if messages[0]['role'] == 'system' %}"
220
+ "{% set loop_messages = messages[1:] %}" # Extract system message if it's present
221
+ "{% set system_message = messages[0]['content'] %}"
222
+ "{% elif USE_DEFAULT_PROMPT == true %}"
223
+ "{% set loop_messages = messages %}" # Or use the default system message if the flag is set
224
+ "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
225
+ "{% else %}"
226
+ "{% set loop_messages = messages %}"
227
+ "{% set system_message = false %}"
228
+ "{% endif %}"
229
+ "{% if system_message != false %}" # Start with system message
230
+ "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}"
231
+ "{% endif %}"
232
+ "{% for message in loop_messages %}" # Loop over all non-system messages
233
+ "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
234
+ "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
235
+ "{% endif %}"
236
+ "{% set content = message['content'] %}"
237
+ "{% if message['role'] == 'user' %}" # After all of that, handle messages/roles in a fairly normal way
238
+ "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
239
+ "{% elif message['role'] == 'assistant' %}"
240
+ "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
241
+ "{% endif %}"
242
+ "{% endfor %}"
243
+ "{% if add_generation_prompt %}"
244
+ "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
245
+ "{% endif %}"
246
+ )
247
+ template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
248
+ default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
249
+ template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
250
+
251
+ return template
252
+
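To make the control flow of the Jinja chat template above easier to follow, here is a minimal plain-Python sketch of the same rendering logic. It is an approximation for illustration only: the function name `render_chat` and the `default_system` parameter are assumptions, and the real tokenizer renders the Jinja template rather than calling anything like this.

```python
from typing import Dict, List, Optional

BOS = "<BOS_TOKEN>"
START, END = "<|START_OF_TURN_TOKEN|>", "<|END_OF_TURN_TOKEN|>"
ROLE_TOKENS = {"user": "<|USER_TOKEN|>", "assistant": "<|CHATBOT_TOKEN|>"}


def render_chat(
    messages: List[Dict[str, str]],
    add_generation_prompt: bool = False,
    default_system: Optional[str] = None,  # hypothetical stand-in for the default prompt flag
) -> str:
    system = default_system
    # A leading system message is pulled out and rendered first.
    if messages and messages[0]["role"] == "system":
        system, messages = messages[0]["content"], messages[1:]
    out = BOS
    if system is not None:
        out += f"{START}<|SYSTEM_TOKEN|>{system}{END}"
    for index, message in enumerate(messages):
        # Roles must alternate, starting with the user.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        token = ROLE_TOKENS.get(message["role"])
        if token is not None:
            out += f"{START}{token}{message['content'].strip()}{END}"
    if add_generation_prompt:
        out += f"{START}<|CHATBOT_TOKEN|>"
    return out


print(render_chat([{"role": "user", "content": "Hello, how are you?"}], add_generation_prompt=True))
```

The printed string matches the example given in the `default_chat_template` docstring.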
253
+ @property
254
+ def default_tool_use_template(self):
255
+ template = (
256
+ "{{ bos_token }}"
257
+ "{% if messages[0]['role'] == 'system' %}"
258
+ "{% set loop_messages = messages[1:] %}" # Extract system message if it's present
259
+ "{% set system_message = messages[0]['content'] %}"
260
+ "{% else %}"
261
+ "{% set loop_messages = messages %}"
262
+ "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
263
+ "{% endif %}"
264
+ "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
265
+ "{{ '# Safety Preamble' }}"
266
+ "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
267
+ "{{ '\n\n# System Preamble' }}"
268
+ "{{ '\n## Basic Rules' }}"
269
+ "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
270
+ "{{ '\n\n# User Preamble' }}"
271
+ "{{ '\n' + system_message }}"
272
+ "{{'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}"
273
+ "{% for tool in tools %}"
274
+ "{% if loop.index0 != 0 %}"
275
+ "{{ '\n\n'}}"
276
+ "{% endif %}"
277
+ "{{'```python\ndef ' + tool.name + '('}}"
278
+ "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
279
+ "{% if loop.index0 != 0 %}"
280
+ "{{ ', '}}"
281
+ "{% endif %}"
282
+ "{{param_name}}: "
283
+ "{% if not param_fields.required %}"
284
+ "{{'Optional[' + param_fields.type + '] = None'}}"
285
+ "{% else %}"
286
+ "{{ param_fields.type }}"
287
+ "{% endif %}"
288
+ "{% endfor %}"
289
+ "{{ ') -> List[Dict]:\n \"\"\"'}}"
290
+ "{{ tool.description }}"
291
+ "{% if tool.parameter_definitions|length != 0 %}"
292
+ "{{ '\n\n Args:\n '}}"
293
+ "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
294
+ "{% if loop.index0 != 0 %}"
295
+ "{{ '\n ' }}"
296
+ "{% endif %}"
297
+ "{{ param_name + ' ('}}"
298
+ "{% if not param_fields.required %}"
299
+ "{{'Optional[' + param_fields.type + ']'}}"
300
+ "{% else %}"
301
+ "{{ param_fields.type }}"
302
+ "{% endif %}"
303
+ "{{ '): ' + param_fields.description }}"
304
+ "{% endfor %}"
305
+ "{% endif %}"
306
+ "{{ '\n \"\"\"\n pass\n```' }}"
307
+ "{% endfor %}"
308
+ "{{ '<|END_OF_TURN_TOKEN|>'}}"
309
+ "{% for message in loop_messages %}"
310
+ "{% set content = message['content'] %}"
311
+ "{% if message['role'] == 'user' %}"
312
+ "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
313
+ "{% elif message['role'] == 'system' %}"
314
+ "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
315
+ "{% elif message['role'] == 'assistant' %}"
316
+ "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
317
+ "{% endif %}"
318
+ "{% endfor %}"
319
+ "{{'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n {\n \"tool_name\": title of the tool in the specification,\n \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n }\n]```<|END_OF_TURN_TOKEN|>'}}"
320
+ "{% if add_generation_prompt %}"
321
+ "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
322
+ "{% endif %}"
323
+ )
324
+ default_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
325
+ template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
326
+ return template
327
+
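The tool-rendering loop in the template above turns each tool dict into a Python function stub. The following sketch mimics that loop in plain Python so the mapping is easier to see. The name `render_tool` is hypothetical, the four-space indentation is an assumption (whitespace is collapsed in this view of the diff), and the surrounding code fences that the template emits around each stub are left out.

```python
from typing import Any, Dict


def render_tool(tool: Dict[str, Any], indent: str = "    ") -> str:
    """Hypothetical helper: render one tool spec as a Python stub."""
    params = tool.get("parameter_definitions", {})

    def type_of(spec: Dict[str, Any]) -> str:
        # Parameters that are not marked required are wrapped in Optional[...].
        return spec["type"] if spec.get("required") else "Optional[" + spec["type"] + "]"

    signature = ", ".join(
        name + ": " + type_of(spec) + ("" if spec.get("required") else " = None")
        for name, spec in params.items()
    )
    lines = [
        "def " + tool["name"] + "(" + signature + ") -> List[Dict]:",
        indent + '"""' + tool["description"],
    ]
    if params:
        lines += ["", indent + "Args:"]
        lines += [
            indent * 2 + name + " (" + type_of(spec) + "): " + spec["description"]
            for name, spec in params.items()
        ]
    lines += [indent + '"""', indent + "pass"]
    return "\n".join(lines)


search_tool = {
    "name": "internet_search",
    "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
    "parameter_definitions": {
        "query": {"description": "Query to search the internet with", "type": "str", "required": True}
    },
}
print(render_tool(search_tool))
```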
328
+ @property
329
+ def default_grounded_generation_template(self):
330
+ template = (
331
+ "{{ bos_token }}"
332
+ "{% if messages[0]['role'] == 'system' %}"
333
+ "{% set loop_messages = messages[1:] %}" # Extract system message if it's present
334
+ "{% set system_message = messages[0]['content'] %}"
335
+ "{% else %}"
336
+ "{% set loop_messages = messages %}"
337
+ "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
338
+ "{% endif %}"
339
+ "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
340
+ "{{ '# Safety Preamble' }}"
341
+ "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
342
+ "{{ '\n\n# System Preamble' }}"
343
+ "{{ '\n## Basic Rules' }}"
344
+ "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
345
+ "{{ '\n\n# User Preamble' }}"
346
+ "{{ '\n' + system_message }}"
347
+ "{{ '<|END_OF_TURN_TOKEN|>'}}"
348
+ "{% for message in loop_messages %}" # Loop over all non-system messages
349
+ "{% set content = message['content'] %}"
350
+ "{% if message['role'] == 'user' %}" # After all of that, handle messages/roles in a fairly normal way
351
+ "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
352
+ "{% elif message['role'] == 'system' %}"
353
+ "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
354
+ "{% elif message['role'] == 'assistant' %}"
355
+ "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
356
+ "{% endif %}"
357
+ "{% endfor %}"
358
+ "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}"
359
+ "{{ '<results>' }}"
360
+ "{% for document in documents %}" # Loop over all retrieved documents
361
+ "{{ '\nDocument: ' }}"
362
+ "{{ loop.index0 }}\n"
363
+ "{% for key, value in document.items() %}"
364
+ "{{ key }}: {{value}}\n"
365
+ "{% endfor %}"
366
+ "{% endfor %}"
367
+ "{{ '</results>'}}"
368
+ "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
369
+ "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}"
370
+ "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}"
371
+ "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}"
372
+ "{% if citation_mode=='accurate' %}"
373
+ "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}"
374
+ "{% endif %}"
375
+ "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}"
376
+ "{{ '<|END_OF_TURN_TOKEN|>' }}"
377
+ "{% if add_generation_prompt %}"
378
+ "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
379
+ "{% endif %}"
380
+ )
381
+ default_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
382
+ template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
383
+ return template
384
+
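The `<results>` section of the grounded generation template can be hard to read as Jinja. As a small sketch of just that part, the plain-Python function below builds the same block from a list of document dicts. The name `render_documents` is hypothetical. The rest of the prompt (preambles, conversation turns and citation instructions) is not reproduced here.

```python
from typing import Dict, List


def render_documents(documents: List[Dict[str, str]]) -> str:
    """Hypothetical helper mirroring the document loop of the template."""
    out = "<results>"
    for index, document in enumerate(documents):
        # Documents are numbered from 0; these numbers are what citations refer to.
        out += f"\nDocument: {index}\n"
        for key, value in document.items():
            out += f"{key}: {value}\n"
    return out + "</results>"


documents = [
    {"title": "Tall penguins", "text": "Emperor penguins are the tallest."},
    {"title": "Penguin habitats", "text": "Emperor penguins only live in Antarctica."},
]
print(render_documents(documents))
```

Because every key of each dict is rendered verbatim as `key: value`, descriptive key names end up directly in the prompt the model sees.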
385
+ def _apply_template_with_arguments(
386
+ self,
387
+ conversation: Union[List[Dict[str, str]], "Conversation"],
388
+ template: Optional[str] = None,
389
+ add_generation_prompt: bool = False,
390
+ tokenize: bool = True,
391
+ padding: bool = False,
392
+ truncation: bool = False,
393
+ max_length: Optional[int] = None,
394
+ return_tensors: Optional[Union[str, TensorType]] = None,
395
+ return_dict: bool = False,
396
+ **kwargs,
397
+ ) -> Union[str, List[int]]:
398
+ """Same as `tokenization_utils_base.apply_chat_template`, but modified so that the Jinja template can take kwargs."""
399
+ if hasattr(conversation, "messages"):
400
+ # Indicates it's a Conversation object
401
+ conversation = conversation.messages
402
+
403
+ # Compilation function uses a cache to avoid recompiling the same template
404
+ compiled_template = self._compile_jinja_template(template)
405
+
406
+ rendered = compiled_template.render(
407
+ messages=conversation,
408
+ add_generation_prompt=add_generation_prompt,
409
+ **kwargs,
410
+ **self.special_tokens_map
411
+ )
412
+
413
+ if padding is True:
414
+ padding = "max_length" # There's only one sequence here, so "longest" makes no sense
415
+ if tokenize:
416
+ if return_dict:
417
+ return self(
418
+ rendered,
419
+ padding=padding,
420
+ truncation=truncation,
421
+ max_length=max_length,
422
+ add_special_tokens=False,
423
+ return_tensors=return_tensors,
424
+ **kwargs,
425
+ )
426
+ else:
427
+ return self.encode(
428
+ rendered,
429
+ padding=padding,
430
+ truncation=truncation,
431
+ max_length=max_length,
432
+ add_special_tokens=False,
433
+ return_tensors=return_tensors,
434
+ **kwargs,
435
+ )
436
+ else:
437
+ return rendered
438
+
439
+ def apply_tool_use_template(
440
+ self,
441
+ conversation: Union[List[Dict[str, str]], "Conversation"],
442
+ tools: List[Dict],
443
+ tool_use_template: Optional[str] = None,
444
+ **kwargs
445
+ ) -> Union[str, List[int]]:
446
+ """Create a Command-R tool-use prompt.
447
+
448
+ Once rendered, the prompt instructs the model to generate a list of actions to perform on a set of user supplied tools
449
+ to help carry out the user's requests.
450
+
451
+ Conceptually, this works in the same way as `apply_chat_template`, but takes an additional `tools` parameter.
452
+
453
+ Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of available
454
+ tools for the model to use into a prompt string, or a list of token ids.
455
+ This method will use the tokenizer's `default_tool_use_template` template specified at the class level.
456
+ You can override the default template using the `tool_use_template` kwarg but the quality of your results may decrease.
457
+
458
+ Args:
459
+ conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts
460
+ with "role" and "content" keys, representing the chat history so far.
461
+ tools (List[Dict]): a list of tools to render into the prompt for the model to choose from.
462
+ See an example at the bottom of the docstring.
463
+ The format should be:
464
+ * name (str): The name of the tool to be called. Valid names contain only the characters a-z,
465
+ A-Z, 0-9, _ and must not begin with a digit.
466
+ * description (str): The description of what the tool does, the model uses the description to
467
+ choose when and how to call the function.
468
+ * parameter_definitions (Dict[str, Dict]): The input parameters of the tool. Accepts a dictionary
469
+ where the key is the name of the parameter and the value is the parameter spec.
470
+ Valid parameter names contain only the characters a-z, A-Z, 0-9, _ and must not begin with a digit.
471
+ Parameter specs are as follows:
472
+ * description (str): The description of the parameter.
473
+ * type (str): the type of the parameter - most effective for python builtin data types, such as 'str', 'bool'
474
+ * required (bool): Denotes whether the parameter is always present (required) or not. Defaults to not required.
475
+ tool_use_template (str, *optional*): A Jinja template to use for this conversion. If
476
+ this is not passed, the tokenizer's default tool use template will be used instead.
477
+ add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
478
+ the start of an assistant message. This is useful when you want to generate a response from the model.
479
+ Note that this argument will be passed to the chat template, and so it must be supported in the
480
+ template for this argument to have any effect.
481
+ tokenize (`bool`, defaults to `True`):
482
+ Whether to tokenize the output. If `False`, the output will be a string.
483
+ padding (`bool`, defaults to `False`):
484
+ Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`.
485
+ truncation (`bool`, defaults to `False`):
486
+ Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`.
487
+ max_length (`int`, *optional*):
488
+ Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If
489
+ not specified, the tokenizer's `max_length` attribute will be used as a default.
490
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
491
+ If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable
492
+ values are:
493
+ - `'tf'`: Return TensorFlow `tf.Tensor` objects.
494
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
495
+ - `'np'`: Return NumPy `np.ndarray` objects.
496
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
497
+ return_dict (`bool`, *optional*, defaults to `False`):
498
+ Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`.
499
+ **tokenizer_kwargs: Additional kwargs to pass to the tokenizer.
500
+
501
+ Returns:
502
+ `str`: A rendered prompt string.
503
+ or if tokenize=True:
504
+ `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. This
505
+ output is ready to pass to the model, either directly or via methods like `generate()`.
506
+
507
+ Examples:
508
+
509
+ ```python
510
+ >>> tokenizer = CohereTokenizerFast.from_pretrained("CohereForAI/c4ai-command-r-v01")
511
+ >>> tools = [
512
+ {
513
+ "name": "internet_search",
514
+ "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
515
+ "parameter_definitions": {
516
+ "query": {
517
+ "description": "Query to search the internet with",
518
+ "type": "str",
519
+ "required": True
520
+ }
521
+ }
522
+ },
523
+ {
524
+ "name": "directly_answer",
525
+ "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
526
+ "parameter_definitions": {}
527
+ }
528
+ ]
529
+ >>> conversation = [
530
+ {"role": "user", "content": "Whats the biggest penguin in the world?"}
531
+ ]
532
+ >>> # render the prompt, ready for user to inspect, or for input into the model:
533
+ >>> prompt = tokenizer.apply_tool_use_template(
534
+ conversation,
535
+ tools=tools,
536
+ tokenize=False,
537
+ add_generation_prompt=True,
538
+ )
539
+ >>> print(prompt)
540
+ <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
541
+ The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.
542
+
543
+ # System Preamble
544
+ ## Basic Rules
545
+ You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.
546
+
547
+ # User Preamble
548
+ ## Task and Context
549
+ You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
550
+
551
+ ## Style Guide
552
+ Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.
553
+
554
+ ## Available Tools
555
+ Here is a list of tools that you have available to you:
556
+
557
+ \`\`\`python
558
+ def internet_search(query: str) -> List[Dict]:
559
+ \"\"\"Returns a list of relevant document snippets for a textual query retrieved from the internet
560
+
561
+ Args:
562
+ query (str): Query to search the internet with
563
+ \"\"\"
564
+ pass
565
+ \`\`\`
566
+
567
+ \`\`\`python
568
+ def directly_answer() -> List[Dict]:
569
+ \"\"\"Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history
570
+ \"\"\"
571
+ pass
572
+ \`\`\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:
573
+ \`\`\`json
574
+ [
575
+ {
576
+ "tool_name": title of the tool in the specification,
577
+ "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
578
+ }
579
+ ]\`\`\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
580
+ ```
581
+ >>> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
582
+ >>> outputs = model.generate(inputs, max_new_tokens=128)
583
+ >>> print(tokenizer.decode(outputs[0]))
584
+ Action: ```json
585
+ [
586
+ {
587
+ "tool_name": "internet_search",
588
+ "parameters": {
589
+ "query": "biggest penguin in the world"
590
+ }
591
+ }
592
+ ]
593
+ ```
594
+ """
595
+ # priority: `tool_use_template` argument > `tokenizer.tool_use_template` > `tokenizer.default_tool_use_template`
596
+ if tool_use_template is None:
597
+ if self.tool_use_template is not None:
598
+ tool_use_template = self.tool_use_template
599
+ else:
600
+ tool_use_template = self.default_tool_use_template
601
+
602
+ return self._apply_template_with_arguments(
603
+ conversation,
604
+ tools=tools,
605
+ template=tool_use_template,
606
+ **kwargs,
607
+ )
608
+
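The docstring example above ends with the model emitting `Action:` followed by a fenced JSON list. As a hedged sketch of how a caller might consume that output, the function below extracts and parses the list with the standard library. The name `parse_actions` is hypothetical and not part of the tokenizer, and it assumes the completion has already been decoded to text in exactly the format shown in the docstring.

```python
import json
import re
from typing import Any, Dict, List

FENCE = "`" * 3  # built programmatically so this example does not nest code fences


def parse_actions(completion: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: pull the JSON action list out of a decoded completion."""
    pattern = re.compile(r"Action:\s*" + FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(completion)
    if match is None:
        raise ValueError("no 'Action:' JSON block found in completion")
    return json.loads(match.group(1))


completion = (
    "Action: " + FENCE + "json\n"
    '[\n  {\n    "tool_name": "internet_search",\n'
    '    "parameters": {"query": "biggest penguin in the world"}\n  }\n]\n' + FENCE
)
for action in parse_actions(completion):
    print(action["tool_name"], action["parameters"])
```

Real completions may contain extra special tokens or malformed JSON, so production code would likely need more defensive handling than this sketch provides.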
609
+ def apply_grounded_generation_template(
610
+ self,
611
+ conversation: Union[List[Dict[str, str]], "Conversation"],
612
+ documents: List[Dict],
613
+ citation_mode: Literal["fast", "accurate"] = "accurate",
614
+ grounded_generation_template: Optional[str] = None,
615
+ **kwargs
616
+ ) -> Union[str, List[int]]:
617
+ """Create a Command-R grounded generation (aka RAG) prompt.
618
+
619
+ Once rendered, the prompt instructs the model to generate a response with citations, based on the supplied documents.
620
+
621
+ Conceptually, this works in the same way as `apply_chat_template`, but takes additional `documents`
622
+ and `citation_mode` parameters.
623
+
624
+ Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of
625
+ documents for the model to ground its response on into a prompt string, or a list of token ids.
626
+ This method will use the tokenizer's `default_grounded_generation_template` template specified at the class level.
627
+ You can override the default template using the `grounded_generation_template` kwarg but the quality of your results may decrease.
628
+
629
+ Args:
630
+ conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts
631
+ with "role" and "content" keys, representing the chat history so far.
632
+ documents (List[Dict[str, str]]): A list of dicts, representing documents or tool outputs to ground your
633
+ generation on. A document is a semistructured dict, with a string to string mapping. Common fields are
634
+ `url`, `title`, `snippet` etc., but keys should be descriptive of their values. They will get rendered into the prompt.
635
+ citation_mode: either "accurate" (prompt the model to generate an answer first, then rewrite it with citation
636
+ spans in) or "fast", where the prompt instructs the model to generate an answer with citations in directly.
637
+ The former has higher quality citations, the latter requires fewer tokens to be generated.
638
+ grounded_generation_template (str, *optional*): A Jinja template to use for this conversion. If
639
+ this is not passed, the tokenizer's default grounded generation template will be used instead.
640
+ add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
641
+ the start of an assistant message. This is useful when you want to generate a response from the model.
642
+ Note that this argument will be passed to the chat template, and so it must be supported in the
643
+ template for this argument to have any effect.
644
+ tokenize (`bool`, defaults to `True`):
645
+ Whether to tokenize the output. If `False`, the output will be a string.
646
+ padding (`bool`, defaults to `False`):
647
+ Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`.
648
+ truncation (`bool`, defaults to `False`):
649
+ Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`.
650
+ max_length (`int`, *optional*):
651
+ Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If
652
+ not specified, the tokenizer's `max_length` attribute will be used as a default.
653
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
654
+ If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable
655
+ values are:
656
+ - `'tf'`: Return TensorFlow `tf.Tensor` objects.
657
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
658
+ - `'np'`: Return NumPy `np.ndarray` objects.
659
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
660
+ return_dict (`bool`, *optional*, defaults to `False`):
661
+ Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`.
662
+ **tokenizer_kwargs: Additional kwargs to pass to the tokenizer.
663
+
664
+ Returns:
665
+ `str`: A rendered prompt string.
666
+ or if tokenize=True:
667
+ `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. This
668
+ output is ready to pass to the model, either directly or via methods like `generate()`.
669
+
670
+ Examples:
671
+
672
+ ```python
673
+ >>> tokenizer = CohereTokenizerFast.from_pretrained('CohereForAI/c4ai-command-r-v01')
674
+
675
+ >>> # define documents:
676
+ >>> documents = [
677
+ { "title": "Tall penguins", "text": "Emperor penguins are the tallest." },
678
+ { "title": "Penguin habitats", "text": "Emperor penguins only live in Antarctica."}
679
+ ]
680
+ >>> # define a conversation:
681
+ >>> conversation = [
682
+ {"role": "user", "content": "Whats the biggest penguin in the world?"}
683
+ ]
684
+ >>> # render the prompt, ready for user to inspect, or for input into the model:
685
+ >>> grounded_generation_prompt = tokenizer.apply_grounded_generation_template(
686
+ conversation,
687
+ documents=documents,
688
+ tokenize=False,
689
+ add_generation_prompt=True,
690
+ )
691
+ >>> print(grounded_generation_prompt)
692
+ <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
693
+ The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.
694
+
695
+ ## Basic Rules
696
+ You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.
697
+
698
+ # User Preamble
699
+ ## Task and Context
700
+ You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
701
+
702
+ ## Style Guide
703
+ Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><results>
704
+ Document: 0
705
+ title: Tall penguins
706
+ text: Emperor penguins are the tallest.
707
+
708
+ Document: 1
709
+ title: Penguin habitats
710
+ text: Emperor penguins only live in Antarctica.
711
+ </results><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line.
712
+ Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'.
713
+ Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'.
714
+ Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.
715
+ Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
716
+ >>> inputs = tokenizer.encode(grounded_generation_prompt, add_special_tokens=False, return_tensors='pt')
717
+ >>> outputs = model.generate(inputs, max_new_tokens=128)  # `model` is assumed to be a Cohere causal LM loaded separately
718
+ >>> print(tokenizer.decode(outputs[0]))
719
+ Relevant Documents: 0,1
720
+ Cited Documents: 0,1
721
+ Answer: The Emperor Penguin is the tallest or biggest penguin in the world. It is a bird that lives only in Antarctica and grows to a height of around 122 centimetres.
722
+ Grounded answer: The <co: 0>Emperor Penguin</co: 0> is the <co: 0>tallest</co: 0> or biggest penguin in the world. It is a bird that <co: 1>lives only in Antarctica</co: 1> and <co: 0>grows to a height of around 122 centimetres.</co: 0>
723
+ ```
724
+ """
725
+         # priority: `grounded_generation_template` argument > `tokenizer.grounded_generation_template` > `tokenizer.default_grounded_generation_template`
726
+         if grounded_generation_template is None:
727
+             if self.grounded_generation_template is not None:
728
+                 grounded_generation_template = self.grounded_generation_template
729
+             else:
730
+                 grounded_generation_template = self.default_grounded_generation_template
731
+
732
+         return self._apply_template_with_arguments(
733
+             conversation,
734
+             documents=documents,
735
+             template=grounded_generation_template,
736
+             citation_mode=citation_mode,
737
+             **kwargs,
738
+         )
739
+
740
+     # TODO ArthurZ let's rely on the template processor instead, refactor all fast tokenizers
741
+     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
742
+         bos_token_id = [self.bos_token_id] if self.add_bos_token else []
743
+         eos_token_id = [self.eos_token_id] if self.add_eos_token else []
744
+
745
+         output = bos_token_id + token_ids_0 + eos_token_id
746
+
747
+         if token_ids_1 is not None:
748
+             output = output + bos_token_id + token_ids_1 + eos_token_id
749
+
750
+         return output
751
+
752
+
753
+ # register the tokenizer to AutoTokenizer
754
+ AutoTokenizer.register(CohereConfig, fast_tokenizer_class=CohereTokenizerFast)
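
The template lookup in `apply_grounded_generation_template` follows a three-level fallback: an explicit argument wins, then the template set on the tokenizer instance, then the class default. A minimal standalone sketch of that precedence is below. The function name `resolve_template` and the sample strings are hypothetical and are not part of the tokenizer's API.

```python
from typing import Optional


def resolve_template(
    explicit: Optional[str],
    instance_level: Optional[str],
    default: str,
) -> str:
    """Return the first template that is set: explicit > instance-level > default."""
    if explicit is not None:
        return explicit
    if instance_level is not None:
        return instance_level
    return default


# Hypothetical template strings, used only to show which one is selected.
chosen_default = resolve_template(None, None, "default-template")
chosen_instance = resolve_template(None, "instance-template", "default-template")
chosen_explicit = resolve_template("explicit-template", "instance-template", "default-template")
```

Checking `is not None` rather than truthiness mirrors the diff: an empty string passed explicitly would still be treated as a deliberate choice.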
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0af6e6fe50ce1bb5611b103482de6bac000c82e06898138d57f35af121aec772
3
+ size 12777406
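
The three lines added for `tokenizer.json` are a Git LFS pointer rather than the tokenizer itself: each line is a key, a space, and a value. As a rough illustration, the pointer can be split into its fields as follows. The helper name `parse_lfs_pointer` is hypothetical, and the sketch only handles the three keys shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0af6e6fe50ce1bb5611b103482de6bac000c82e06898138d57f35af121aec772
size 12777406
"""
info = parse_lfs_pointer(pointer)
```

Here `info["size"]` is the byte size of the real `tokenizer.json` that LFS fetches on checkout, about 12.8 MB.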
tokenizer_config.json ADDED
@@ -0,0 +1,319 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<PAD>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<UNK>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "<CLS>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "3": {
30
+ "content": "<SEP>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "4": {
38
+ "content": "<MASK_TOKEN>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "5": {
46
+ "content": "<BOS_TOKEN>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "6": {
54
+ "content": "<EOS_TOKEN>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "7": {
62
+ "content": "<EOP_TOKEN>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "255000": {
70
+ "content": "<|START_OF_TURN_TOKEN|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": false
76
+ },
77
+ "255001": {
78
+ "content": "<|END_OF_TURN_TOKEN|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": false
84
+ },
85
+ "255002": {
86
+ "content": "<|YES_TOKEN|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": false
92
+ },
93
+ "255003": {
94
+ "content": "<|NO_TOKEN|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": false
100
+ },
101
+ "255004": {
102
+ "content": "<|GOOD_TOKEN|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": false
108
+ },
109
+ "255005": {
110
+ "content": "<|BAD_TOKEN|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": false
116
+ },
117
+ "255006": {
118
+ "content": "<|USER_TOKEN|>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "255007": {
126
+ "content": "<|CHATBOT_TOKEN|>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "255008": {
134
+ "content": "<|SYSTEM_TOKEN|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "255009": {
142
+ "content": "<|USER_0_TOKEN|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "255010": {
150
+ "content": "<|USER_1_TOKEN|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "255011": {
158
+ "content": "<|USER_2_TOKEN|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "255012": {
166
+ "content": "<|USER_3_TOKEN|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "255013": {
174
+ "content": "<|USER_4_TOKEN|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "255014": {
182
+ "content": "<|USER_5_TOKEN|>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "255015": {
190
+ "content": "<|USER_6_TOKEN|>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "255016": {
198
+ "content": "<|USER_7_TOKEN|>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "255017": {
206
+ "content": "<|USER_8_TOKEN|>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ },
213
+ "255018": {
214
+ "content": "<|USER_9_TOKEN|>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": false
220
+ },
221
+ "255019": {
222
+ "content": "<|EXTRA_0_TOKEN|>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": false
228
+ },
229
+ "255020": {
230
+ "content": "<|EXTRA_1_TOKEN|>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": false
236
+ },
237
+ "255021": {
238
+ "content": "<|EXTRA_2_TOKEN|>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": false
244
+ },
245
+ "255022": {
246
+ "content": "<|EXTRA_3_TOKEN|>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": false
252
+ },
253
+ "255023": {
254
+ "content": "<|EXTRA_4_TOKEN|>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": false
260
+ },
261
+ "255024": {
262
+ "content": "<|EXTRA_5_TOKEN|>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": false
268
+ },
269
+ "255025": {
270
+ "content": "<|EXTRA_6_TOKEN|>",
271
+ "lstrip": false,
272
+ "normalized": false,
273
+ "rstrip": false,
274
+ "single_word": false,
275
+ "special": false
276
+ },
277
+ "255026": {
278
+ "content": "<|EXTRA_7_TOKEN|>",
279
+ "lstrip": false,
280
+ "normalized": false,
281
+ "rstrip": false,
282
+ "single_word": false,
283
+ "special": false
284
+ },
285
+ "255027": {
286
+ "content": "<|EXTRA_8_TOKEN|>",
287
+ "lstrip": false,
288
+ "normalized": false,
289
+ "rstrip": false,
290
+ "single_word": false,
291
+ "special": false
292
+ },
293
+ "255028": {
294
+ "content": "<|EXTRA_9_TOKEN|>",
295
+ "lstrip": false,
296
+ "normalized": false,
297
+ "rstrip": false,
298
+ "single_word": false,
299
+ "special": false
300
+ }
301
+ },
302
+ "auto_map": {
303
+ "AutoTokenizer": [
304
+ null,
305
+ "tokenization_cohere_fast.CohereTokenizerFast"
306
+ ]
307
+ },
308
+ "bos_token": "<BOS_TOKEN>",
309
+ "clean_up_tokenization_spaces": false,
310
+ "eos_token": "<|END_OF_TURN_TOKEN|>",
311
+ "legacy": true,
312
+ "model_max_length": 1000000000000000019884624838656,
313
+ "pad_token": "<PAD>",
314
+ "sp_model_kwargs": {},
315
+ "spaces_between_special_tokens": false,
316
+ "tokenizer_class": "CohereTokenizer",
317
+ "unk_token": null,
318
+ "use_default_system_prompt": false
319
+ }
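
To see how the `add_bos_token` and `add_eos_token` settings above interact with `build_inputs_with_special_tokens`, here is a small standalone sketch of the same concatenation logic. The IDs 5 and 255001 come from `added_tokens_decoder` in this config; the function name `build_inputs` and the content token IDs (10, 11, 20) are made up for illustration.

```python
BOS_ID = 5        # "<BOS_TOKEN>" in added_tokens_decoder
EOS_ID = 255001   # "<|END_OF_TURN_TOKEN|>", the configured eos_token


def build_inputs(token_ids_0, token_ids_1=None, add_bos=True, add_eos=False):
    """Wrap one or two token sequences with optional BOS/EOS ids."""
    bos = [BOS_ID] if add_bos else []
    eos = [EOS_ID] if add_eos else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        # The second sequence gets its own BOS/EOS wrapping.
        output = output + bos + token_ids_1 + eos
    return output


# Defaults mirror this config: add_bos_token=true, add_eos_token=false.
single = build_inputs([10, 11])
pair = build_inputs([10], [20])
with_eos = build_inputs([10], add_eos=True)
```

With these defaults a BOS id is prepended to every encoded sequence. That is presumably why the docstring example encodes the rendered prompt with `add_special_tokens=False`: the rendered text already starts with `<BOS_TOKEN>`.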