bartowski committed on Mar 20

Commit 1254245 • 1 Parent(s): 413571f

Quant for 6.5

Files changed (17)

.gitattributes +1 -0
README.md +307 -35
config.json +46 -0
configuration_cohere.py +159 -0
generation_config.json +7 -0
model.safetensors.index.json +329 -0
modeling_cohere.py +1280 -0
original_repo_url.txt +1 -0
output-00001-of-00005.safetensors +3 -0
output-00002-of-00005.safetensors +3 -0
output-00003-of-00005.safetensors +3 -0
output-00004-of-00005.safetensors +3 -0
output-00005-of-00005.safetensors +3 -0
special_tokens_map.json +23 -0
tokenization_cohere_fast.py +754 -0
tokenizer.json +3 -0
tokenizer_config.json +319 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -12,69 +12,341 @@ language:
 - zh
 - ar
 license: cc-by-nc-4.0
-quantized_by: bartowski
-pipeline_tag: text-generation
 ---
-## Exllama v2 Quantizations of c4ai-command-r-v01
-Using <a href="https://github.com/turboderp/exllamav2/releases/tag/v0.0.16">turboderp's ExLlamaV2 v0.0.16</a> for quantization.
-## The "main" branch only contains the measurement.json, download one of the other branches for the model (see below)
-Each branch contains an individual bits per weight, with the main one containing only the meaurement.json for further conversions.
-Conversion was done using the default calibration dataset.
-Default arguments used except when the bits per weight is above 6.0, at that point the lm_head layer is quantized at 8 bits per weight instead of the default 6.
-Original model: https://huggingface.co/CohereForAI/c4ai-command-r-v01
-<a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/8_0">8.0 bits per weight</a>
-<a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/6_5">6.5 bits per weight</a>
-<a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/5_0">5.0 bits per weight</a>
-<a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/4_25">4.25 bits per weight</a>
-<a href="https://huggingface.co/bartowski/c4ai-command-r-v01-exl2/tree/3_5">3.5 bits per weight</a>
-## Download instructions
-With git:
-```shell
-git clone --single-branch --branch 6_5 https://huggingface.co/bartowski/c4ai-command-r-v01-exl2
 ```
-With huggingface hub (credit to TheBloke for instructions):
-```shell
-pip3 install huggingface-hub
 ```
-To download the `main` (only useful if you only care about measurement.json) branch to a folder called `c4ai-command-r-v01-exl2`:
-```shell
-mkdir c4ai-command-r-v01-exl2
-huggingface-cli download bartowski/c4ai-command-r-v01-exl2 --local-dir c4ai-command-r-v01-exl2 --local-dir-use-symlinks False
 ```
-To download from a different branch, add the `--revision` parameter:
-Linux:
-```shell
-mkdir c4ai-command-r-v01-exl2-6_5
-huggingface-cli download bartowski/c4ai-command-r-v01-exl2 --revision 6_5 --local-dir c4ai-command-r-v01-exl2-6_5 --local-dir-use-symlinks False
 ```
-Windows (which apparently doesn't like _ in folders sometimes?):
-```shell
-mkdir c4ai-command-r-v01-exl2-6.5
-huggingface-cli download bartowski/c4ai-command-r-v01-exl2 --revision 6_5 --local-dir c4ai-command-r-v01-exl2-6.5 --local-dir-use-symlinks False
-```

+# Model Card for C4AI Command-R
+🚨 **This model is the non-quantized version of C4AI Command-R. You can find the quantized version of C4AI Command-R using bitsandbytes [here](https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit)**.
+## Model Summary
+C4AI Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities.
+Developed by: Cohere and [Cohere For AI](https://cohere.for.ai)
+- Point of Contact: Cohere For AI: [cohere.for.ai](https://cohere.for.ai/)
+- License: [CC-BY-NC](https://cohere.com/c4ai-cc-by-nc-license), requires also adhering to [C4AI's Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy)
+- Model: c4ai-command-r-v01
+- Model Size: 35 billion parameters
+- Context length: 128K
+**Use**
+```python
+# pip install transformers
+from transformers import AutoTokenizer, AutoModelForCausalLM
+model_id = "CohereForAI/c4ai-command-r-v01"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
+# Format message with the command-r chat template
+messages = [{"role": "user", "content": "Hello, how are you?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+gen_tokens = model.generate(
+    input_ids,
+    max_new_tokens=100,
+    do_sample=True,
+    temperature=0.3,
+    )
+gen_text = tokenizer.decode(gen_tokens[0])
+print(gen_text)
+```
+**Quantized model through bitsandbytes, 8-bit precision**
+```python
+# pip install transformers bitsandbytes accelerate
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+bnb_config = BitsAndBytesConfig(load_in_8bit=True)
+model_id = "CohereForAI/c4ai-command-r-v01"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, quantization_config=bnb_config)
+# Format message with the command-r chat template
+messages = [{"role": "user", "content": "Hello, how are you?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+gen_tokens = model.generate(
+    input_ids,
+    max_new_tokens=100,
+    do_sample=True,
+    temperature=0.3,
+    )
+gen_text = tokenizer.decode(gen_tokens[0])
+print(gen_text)
 ```
+**Quantized model through bitsandbytes, 4-bit precision**
+```python
+# pip install transformers bitsandbytes accelerate
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+bnb_config = BitsAndBytesConfig(load_in_4bit=True)
+model_id = "CohereForAI/c4ai-command-r-v01"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, quantization_config=bnb_config)
+# Format message with the command-r chat template
+messages = [{"role": "user", "content": "Hello, how are you?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+gen_tokens = model.generate(
+    input_ids,
+    max_new_tokens=100,
+    do_sample=True,
+    temperature=0.3,
+    )
+gen_text = tokenizer.decode(gen_tokens[0])
+print(gen_text)
 ```
+## Model Details
+**Input**: Models input text only.
+**Output**: Models generate text only.
+**Model Architecture**: This is an auto-regressive language model that uses an optimized transformer architecture. After pretraining, this model uses supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety.
+**Languages covered**: The model is optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic.
+Pre-training data additionally included the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, Persian.
+**Context length**: Command-R supports a context length of 128K.
+### Tool use capabilities:
+Command-R has been specifically trained with conversational tool use capabilities. These have been trained into the model via a mixture of supervised fine-tuning and preference fine-tuning, using a specific prompt template. Deviating from this prompt template will likely reduce performance, but we encourage experimentation.
+Command-R’s tool use functionality takes a conversation as input (with an optional user-system preamble), along with a list of available tools. The model will then generate a json-formatted list of actions to execute on a subset of those tools. Command-R may use one of its supplied tools more than once.
+The model has been trained to recognise a special `directly_answer` tool, which it uses to indicate that it doesn’t want to use any of its other tools. We recommend including the `directly_answer` tool, but encourage experimentation.
+Comprehensive documentation and guides on prompting strategies for tool use will be provided shortly.
+<details>
+<summary><b>Usage: Rendering Tool Use Prompts [CLICK TO EXPAND]</b> </summary>
+```python
+from transformers import AutoTokenizer
+model_id = "CohereForAI/c4ai-command-r-v01"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+# define conversation input:
+conversation = [
+    {"role": "user", "content": "Whats the biggest penguin in the world?"}
+]
+# Define tools available for the model to use:
+tools = [
+  {
+    "name": "internet_search",
+    "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
+    "parameter_definitions": {
+      "query": {
+        "description": "Query to search the internet with",
+        "type": 'str',
+        "required": True
+      }
+    }
+  },
+  {
+    'name': "directly_answer",
+    "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
+    'parameter_definitions': {}
+  }
+]
+# render the tool use prompt as a string:
+tool_use_prompt = tokenizer.apply_tool_use_template(
+    conversation,
+    tools=tools,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+print(tool_use_prompt)
 ```
+</details>
+<details>
+<summary><b>Example Rendered Tool Use Prompt [CLICK TO EXPAND]</b></summary>
+````
+<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
+The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.
+# System Preamble
+## Basic Rules
+You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.
+# User Preamble
+## Task and Context
+You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
+## Style Guide
+Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.
+## Available Tools
+Here is a list of tools that you have available to you:
+```python
+def internet_search(query: str) -> List[Dict]:
+    """Returns a list of relevant document snippets for a textual query retrieved from the internet
+    Args:
+        query (str): Query to search the internet with
+    """
+    pass
 ```
+```python
+def directly_answer() -> List[Dict]:
+    """Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history
+    """
+    pass
+```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:
+```json
+[
+    {
+        "tool_name": title of the tool in the specification,
+        "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
+    }
+]```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+````
+</details>
+<details>
+<summary><b>Example Rendered Tool Use Completion [CLICK TO EXPAND]</b></summary>
+````
+Action: ```json
+[
+      {
+          "tool_name": "internet_search",
+          "parameters": {
+              "query": "biggest penguin in the world"
+          }
+      }
+]
+```
+````
+</details>
+### Grounded Generation and RAG Capabilities:
+Command-R has been specifically trained with grounded generation capabilities. This means that it can generate responses based on a list of supplied document snippets, and it will include grounding spans (citations) in its response indicating the source of the information.
+This can be used to enable behaviors such as grounded summarization and the final step of Retrieval Augmented Generation (RAG). This behavior has been trained into the model via a mixture of supervised fine-tuning and preference fine-tuning, using a specific prompt template.
+Deviating from this prompt template may reduce performance, but we encourage experimentation.
+Command-R’s grounded generation behavior takes a conversation as input (with an optional user-supplied system preamble), along with a list of retrieved document snippets.
+The document snippets should be chunks, rather than long documents, typically around 100-400 words per chunk. Document snippets consist of key-value pairs. The keys should be short descriptive strings; the values can be text or semi-structured.
+By default, Command-R will generate grounded responses by first predicting which documents are relevant, then predicting which ones it will cite, then generating an answer.
+Finally, it will then insert grounding spans into the answer. See below for an example. This is referred to as `accurate` grounded generation.
+The model is trained with a number of other answering modes, which can be selected by prompt changes. A `fast` citation mode is supported in the tokenizer, which will directly generate an answer with grounding spans in it, without first writing the answer out in full. This sacrifices some grounding accuracy in favor of generating fewer tokens.
+The code snippet below shows a minimal working example on how to render a prompt, generate and parse a completion.
+Comprehensive documentation and guides on prompting strategies on grounded generation will be provided in follow-ups at a later stage.
+<details>
+<summary> <b>Usage: Rendering Grounded Generation prompts [CLICK TO EXPAND]</b> </summary>
+````python
+from transformers import AutoTokenizer
+model_id = "CohereForAI/c4ai-command-r-v01"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+# define conversation input:
+conversation = [
+    {"role": "user", "content": "Whats the biggest penguin in the world?"}
+]
+# define documents to ground on:
+documents = [
+    { "title": "Tall penguins", "text": "Emperor penguins are the tallest growing up to 122 cm in height." },
+    { "title": "Penguin habitats", "text": "Emperor penguins only live in Antarctica."}
+]
+# render the grounded generation prompt as a string:
+grounded_generation_prompt = tokenizer.apply_grounded_generation_template(
+    conversation,
+    documents=documents,
+    citation_mode="accurate", # or "fast"
+    tokenize=False,
+    add_generation_prompt=True,
+)
+print(grounded_generation_prompt)
+````
+</details>
+<details>
+<summary><b>Example Rendered Grounded Generation Prompt [CLICK TO EXPAND]</b></summary>
+````<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
+The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.
+# System Preamble
+## Basic Rules
+You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.
+# User Preamble
+## Task and Context
+You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
+## Style Guide
+Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><results>
+Document: 0
+title: Tall penguins
+text: Emperor penguins are the tallest growing up to 122 cm in height.
+Document: 1
+title: Penguin habitats
+text: Emperor penguins only live in Antarctica.
+</results><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line.
+Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'.
+Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'.
+Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.
+Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+````
+</details>
+<details>
+<summary><b>Example Rendered Grounded Generation Completion [CLICK TO EXPAND]</b></summary>
+````
+Relevant Documents: 0,1
+Cited Documents: 0,1
+Answer: The Emperor Penguin is the tallest or biggest penguin in the world. It is a bird that lives only in Antarctica and grows to a height of around 122 centimetres.
+Grounded answer: The <co: 0>Emperor Penguin</co: 0> is the <co: 0>tallest</co: 0> or biggest penguin in the world. It is a bird that <co: 1>lives only in Antarctica</co: 1> and <co: 0>grows to a height of around 122 centimetres.</co: 0>
+````
+</details>
+### Code Capabilities:
+Command-R has been optimized to interact with your code, by requesting code snippets, code explanations, or code rewrites. It might not perform well out-of-the-box for pure code completion. For better performance, we also recommend using a low temperature (and even greedy decoding) for code-generation related instructions.
+### Model Card Contact
+For errors or additional questions about details in this model card, contact [info@for.ai](mailto:info@for.ai).
+### Terms of Use:
+We hope that the release of this model will make community-based research efforts more accessible, by releasing the weights of a highly performant 35 billion parameter model to researchers all over the world. This model is governed by a [CC-BY-NC](https://cohere.com/c4ai-cc-by-nc-license) License with an acceptable use addendum, and also requires adhering to [C4AI's Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
+### Try Chat:
+You can try Command-R chat in the playground [here](https://dashboard.cohere.com/playground/chat).
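
The README above shows how to render tool-use and grounded-generation prompts and what the completions look like, but not how to read those completions back. The following is a minimal sketch of one way to do that, based only on the two example completions in the README. The helper names `parse_actions` and `parse_grounded_answer` and both regular expressions are assumptions made for illustration; they are not part of the Cohere tokenizer or of `transformers`, and they assume each `<co: N>` tag carries a single document number, as in the example.

```python
import json
import re

FENCE = "`" * 3  # the triple-backtick fence, built this way to keep the example readable

# Hypothetical pattern: the JSON payload between a json fence and its closing fence.
ACTION_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
# Hypothetical pattern: one grounding span such as <co: 0>Emperor Penguin</co: 0>.
CITATION_RE = re.compile(r"<co: (\d+)>(.*?)</co: \1>", re.DOTALL)


def parse_actions(completion: str) -> list:
    """Return the list of action dicts from a tool-use completion, or [] if none is found."""
    match = ACTION_RE.search(completion)
    if match is None:
        return []
    return json.loads(match.group(1))


def parse_grounded_answer(completion: str):
    """Return (plain_text, [(doc_id, cited_text), ...]) from a grounded completion.

    Only the text after 'Grounded answer:' is inspected; if that marker is
    missing, the result is ('', []).
    """
    _, _, grounded = completion.partition("Grounded answer:")
    grounded = grounded.strip()
    citations = [(int(doc), text) for doc, text in CITATION_RE.findall(grounded)]
    plain = CITATION_RE.sub(lambda m: m.group(2), grounded)
    return plain, citations


if __name__ == "__main__":
    tool_completion = (
        "Action: " + FENCE + "json\n"
        '[{"tool_name": "internet_search", '
        '"parameters": {"query": "biggest penguin in the world"}}]\n' + FENCE
    )
    print(parse_actions(tool_completion))

    grounded_completion = (
        "Relevant Documents: 0,1\nCited Documents: 0,1\n"
        "Answer: The Emperor Penguin lives only in Antarctica.\n"
        "Grounded answer: The <co: 0>Emperor Penguin</co: 0> "
        "<co: 1>lives only in Antarctica</co: 1>."
    )
    print(parse_grounded_answer(grounded_completion))
```

A real integration would probably also need to handle malformed JSON and completions that stop before the closing fence; the sketch leaves both out.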

config.json ADDED

@@ -0,0 +1,46 @@

+{
+    "_name_or_path": "/home/ahmet_cohere_com/HF_Final_weight_tie",
+    "architectures": [
+        "CohereForCausalLM"
+    ],
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "auto_map": {
+        "AutoConfig": "configuration_cohere.CohereConfig",
+        "AutoModel": "modeling_cohere.CohereModel",
+        "AutoModelForCausalLM": "modeling_cohere.CohereForCausalLM"
+    },
+    "bos_token_id": 5,
+    "eos_token_id": 255001,
+    "hidden_act": "silu",
+    "hidden_size": 8192,
+    "initializer_range": 0.02,
+    "intermediate_size": 22528,
+    "layer_norm_eps": 1e-05,
+    "logit_scale": 0.0625,
+    "max_position_embeddings": 8192,
+    "model_max_length": 131072,
+    "model_type": "cohere",
+    "num_attention_heads": 64,
+    "num_hidden_layers": 40,
+    "num_key_value_heads": 64,
+    "pad_token_id": 0,
+    "pretraining_tp": 1,
+    "rope_theta": 8000000.0,
+    "torch_dtype": "float16",
+    "transformers_version": "4.38.2",
+    "use_cache": true,
+    "vocab_size": 256000,
+    "tie_word_embeddings": true,
+    "quantization_config": {
+        "quant_method": "exl2",
+        "version": "0.0.16",
+        "bits": 6.5,
+        "head_bits": 8,
+        "calibration": {
+            "rows": 100,
+            "length": 2048,
+            "dataset": "(default)"
+        }
+    }
+}
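
As a rough illustration of what the values in this config imply, the sketch below estimates the weight footprint at 6.5 bits per weight and the key/value cache cost per token. It rests on stated assumptions: the parameter count is the README's rounded "35 billion", the estimate ignores the 8-bit head and any quantization metadata, and the cache is assumed to hold 16-bit values. The function names are hypothetical and nothing here is ExLlamaV2 code.

```python
# Values copied from the config.json hunk above.
config = {
    "hidden_size": 8192,
    "num_attention_heads": 64,
    "num_hidden_layers": 40,
    "num_key_value_heads": 64,
    "quantization_config": {"bits": 6.5, "head_bits": 8},
}

# Assumption: the rounded parameter count from the README, not an exact figure.
N_PARAMS = 35_000_000_000


def estimate_weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone at a uniform bit width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes_per_token(cfg: dict, bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache needed for one token across all layers.

    Assumes 16-bit cache entries by default (bytes_per_value=2).
    """
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    # Factor 2 covers one key vector and one value vector per head per layer.
    return 2 * cfg["num_hidden_layers"] * cfg["num_key_value_heads"] * head_dim * bytes_per_value


if __name__ == "__main__":
    bits = config["quantization_config"]["bits"]
    print(estimate_weight_bytes(N_PARAMS, bits) / 1e9)   # 28.4375 (GB, decimal)
    print(kv_cache_bytes_per_token(config))              # 1310720 bytes per token
    print(kv_cache_bytes_per_token(config) * 8192 / 2**30)  # 10.0 GiB at 8192 tokens
```

Because `num_key_value_heads` equals `num_attention_heads` here, every head keeps its own keys and values, which is why the per-token cache cost is comparatively high under these assumptions.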

configuration_cohere.py ADDED

@@ -0,0 +1,159 @@

+# coding=utf-8
+# Copyright 2024 Cohere team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Cohere model configuration"""
+from transformers import PretrainedConfig, AutoConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class CohereConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`CohereModel`]. It is used to instantiate a Cohere
+    model according to the specified arguments, defining the model architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 256000):
+            Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`CohereModel`].
+        hidden_size (`int`, *optional*, defaults to 8192):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 22528):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 40):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 64):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        num_key_value_heads (`int`, *optional*):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by meanpooling all the original heads within that group. For more details checkout [this
+            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
+            `num_attention_heads`.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 8192):
+            The maximum sequence length that this model might ever be used with.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
+            The epsilon used by the layer normalization.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        pad_token_id (`int`, *optional*, defaults to 0):
+            Padding token id.
+        bos_token_id (`int`, *optional*, defaults to 5):
+            Beginning of stream token id.
+        eos_token_id (`int`, *optional*, defaults to 255001):
+            End of stream token id.
+        pretraining_tp (`int`, *optional*, defaults to 1):
+            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+            document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is
+            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
+            issue](https://github.com/pytorch/pytorch/issues/76232).
+        tie_word_embeddings (`bool`, *optional*, defaults to `True`):
+            Whether to tie weight embeddings
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in the query, key, value and output projection layers during self-attention.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+    ```python
+    >>> from transformers import CohereModel, CohereConfig
+    >>> # Initializing a Cohere model configuration
+    >>> configuration = CohereConfig()
+    >>> # Initializing a model from the Cohere configuration
+    >>> model = CohereModel(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "cohere"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    def __init__(
+        self,
+        vocab_size=256000,
+        hidden_size=8192,
+        intermediate_size=22528,
+        num_hidden_layers=40,
+        num_attention_heads=64,
+        num_key_value_heads=None,
+        hidden_act="silu",
+        max_position_embeddings=8192,
+        initializer_range=0.02,
+        layer_norm_eps=1e-5,
+        use_cache=True,
+        pad_token_id=0,
+        bos_token_id=5,
+        eos_token_id=255001,
+        pretraining_tp=1,
+        tie_word_embeddings=True,
+        rope_theta=10000.0,
+        attention_bias=False,
+        attention_dropout=0.0,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.layer_norm_eps = layer_norm_eps
+        self.pretraining_tp = pretraining_tp
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.attention_bias = attention_bias
+        self.attention_dropout = attention_dropout
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+# register the model config to AutoConfig
+AutoConfig.register("cohere", CohereConfig)
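
The `num_key_value_heads` docstring and the "for backward compatibility" fallback in `__init__` together decide which attention variant a config describes. A standalone sketch of that rule follows. The function names are hypothetical, the code does not import the class above, and the divisibility check is an added assumption that the source file does not enforce.

```python
from typing import Optional


def resolve_kv_heads(num_attention_heads: int, num_key_value_heads: Optional[int] = None) -> int:
    """Mirror the fallback in CohereConfig.__init__: None means 'same as the query heads'."""
    if num_key_value_heads is None:
        return num_attention_heads
    return num_key_value_heads


def attention_variant(num_attention_heads: int, num_key_value_heads: Optional[int] = None) -> str:
    """Classify a head layout as MHA, MQA or GQA, following the docstring's description."""
    kv_heads = resolve_kv_heads(num_attention_heads, num_key_value_heads)
    # Assumption for this sketch: query heads must split evenly into key/value groups.
    if kv_heads <= 0 or num_attention_heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be a positive multiple of num_key_value_heads")
    if kv_heads == num_attention_heads:
        return "MHA"
    if kv_heads == 1:
        return "MQA"
    return "GQA"


if __name__ == "__main__":
    print(attention_variant(64))       # defaults in this file: MHA
    print(attention_variant(64, 1))    # MQA
    print(attention_variant(64, 8))    # GQA
```

With the defaults in this file (64 attention heads, `num_key_value_heads=None`) the rule resolves to plain multi-head attention, which matches the `"num_key_value_heads": 64` entry in `config.json`.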

generation_config.json ADDED

@@ -0,0 +1,7 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 5,
+  "eos_token_id": 255001,
+  "pad_token_id": 0,
+  "transformers_version": "4.38.2"
+}

model.safetensors.index.json ADDED

@@ -0,0 +1,329 @@

+{
+  "metadata": {
+    "total_size": 69961662464
+  },
+  "weight_map": {
+    "model.embed_tokens.weight": "model-00001-of-00015.safetensors",
+    "model.layers.0.input_layernorm.weight": "model-00002-of-00015.safetensors",
+    "model.layers.0.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.0.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.0.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00015.safetensors",
+    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00015.safetensors",
+    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00015.safetensors",
+    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00015.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00002-of-00015.safetensors",
+    "model.layers.1.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.1.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.1.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.1.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.1.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.1.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00005-of-00015.safetensors",
+    "model.layers.10.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.10.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.10.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.10.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.10.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.10.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.10.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00005-of-00015.safetensors",
+    "model.layers.11.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.11.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.11.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.11.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.11.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.11.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.11.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00006-of-00015.safetensors",
+    "model.layers.12.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.12.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.12.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.12.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.12.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.12.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.12.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00006-of-00015.safetensors",
+    "model.layers.13.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.13.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.13.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.13.self_attn.k_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.13.self_attn.o_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.13.self_attn.q_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.13.self_attn.v_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00006-of-00015.safetensors",
+    "model.layers.14.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.14.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.14.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.14.self_attn.k_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.14.self_attn.o_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.14.self_attn.q_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.14.self_attn.v_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00007-of-00015.safetensors",
+    "model.layers.15.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.15.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.15.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.15.self_attn.k_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.15.self_attn.o_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.15.self_attn.q_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.15.self_attn.v_proj.weight": "model-00006-of-00015.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00007-of-00015.safetensors",
+    "model.layers.16.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.16.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.16.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.16.self_attn.k_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.16.self_attn.o_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.16.self_attn.q_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.16.self_attn.v_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00007-of-00015.safetensors",
+    "model.layers.17.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.17.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.17.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.17.self_attn.k_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.17.self_attn.o_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.17.self_attn.q_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.17.self_attn.v_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00008-of-00015.safetensors",
+    "model.layers.18.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.18.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.18.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.18.self_attn.k_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.18.self_attn.o_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.18.self_attn.q_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.18.self_attn.v_proj.weight": "model-00007-of-00015.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00008-of-00015.safetensors",
+    "model.layers.19.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.19.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.19.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.19.self_attn.k_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.19.self_attn.o_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.19.self_attn.q_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.19.self_attn.v_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00002-of-00015.safetensors",
+    "model.layers.2.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.2.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.2.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.2.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.2.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.2.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.2.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00008-of-00015.safetensors",
+    "model.layers.20.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.20.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.20.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.20.self_attn.k_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.20.self_attn.o_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.20.self_attn.q_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.20.self_attn.v_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00009-of-00015.safetensors",
+    "model.layers.21.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.21.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.21.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.21.self_attn.k_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.21.self_attn.o_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.21.self_attn.q_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.21.self_attn.v_proj.weight": "model-00008-of-00015.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00009-of-00015.safetensors",
+    "model.layers.22.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.22.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.22.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.22.self_attn.k_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.22.self_attn.o_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.22.self_attn.q_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.22.self_attn.v_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00009-of-00015.safetensors",
+    "model.layers.23.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.23.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.23.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.23.self_attn.k_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.23.self_attn.o_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.23.self_attn.q_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.23.self_attn.v_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00010-of-00015.safetensors",
+    "model.layers.24.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.24.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.24.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.24.self_attn.k_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.24.self_attn.o_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.24.self_attn.q_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.24.self_attn.v_proj.weight": "model-00009-of-00015.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00010-of-00015.safetensors",
+    "model.layers.25.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.25.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.25.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.25.self_attn.k_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.25.self_attn.o_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.25.self_attn.q_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.25.self_attn.v_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00010-of-00015.safetensors",
+    "model.layers.26.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.26.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.26.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.26.self_attn.k_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.26.self_attn.o_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.26.self_attn.q_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.26.self_attn.v_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00011-of-00015.safetensors",
+    "model.layers.27.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.27.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.27.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.27.self_attn.k_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.27.self_attn.o_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.27.self_attn.q_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.27.self_attn.v_proj.weight": "model-00010-of-00015.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00011-of-00015.safetensors",
+    "model.layers.28.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.28.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.28.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.28.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.28.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.28.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.28.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00011-of-00015.safetensors",
+    "model.layers.29.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.29.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.29.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.29.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.29.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.29.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.29.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00003-of-00015.safetensors",
+    "model.layers.3.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.3.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.3.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.3.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.3.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.3.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.3.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00012-of-00015.safetensors",
+    "model.layers.30.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.30.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.30.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.30.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.30.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.30.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.30.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00012-of-00015.safetensors",
+    "model.layers.31.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.31.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.31.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.31.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.31.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.31.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.31.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.32.input_layernorm.weight": "model-00012-of-00015.safetensors",
+    "model.layers.32.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.32.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.32.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.32.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.32.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.32.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.32.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.33.input_layernorm.weight": "model-00013-of-00015.safetensors",
+    "model.layers.33.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.33.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.33.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.33.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.33.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.33.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.33.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
+    "model.layers.34.input_layernorm.weight": "model-00013-of-00015.safetensors",
+    "model.layers.34.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.34.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.34.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.34.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.34.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.34.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.34.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.35.input_layernorm.weight": "model-00013-of-00015.safetensors",
+    "model.layers.35.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.35.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.35.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.35.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.35.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.35.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.35.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.36.input_layernorm.weight": "model-00014-of-00015.safetensors",
+    "model.layers.36.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.36.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.36.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.36.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.36.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.36.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.36.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
+    "model.layers.37.input_layernorm.weight": "model-00014-of-00015.safetensors",
+    "model.layers.37.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.37.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.37.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.37.self_attn.k_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.37.self_attn.o_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.37.self_attn.q_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.37.self_attn.v_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.38.input_layernorm.weight": "model-00014-of-00015.safetensors",
+    "model.layers.38.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.38.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.38.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.38.self_attn.k_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.38.self_attn.o_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.38.self_attn.q_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.38.self_attn.v_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.39.input_layernorm.weight": "model-00015-of-00015.safetensors",
+    "model.layers.39.mlp.down_proj.weight": "model-00015-of-00015.safetensors",
+    "model.layers.39.mlp.gate_proj.weight": "model-00015-of-00015.safetensors",
+    "model.layers.39.mlp.up_proj.weight": "model-00015-of-00015.safetensors",
+    "model.layers.39.self_attn.k_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.39.self_attn.o_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.39.self_attn.q_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.39.self_attn.v_proj.weight": "model-00014-of-00015.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00003-of-00015.safetensors",
+    "model.layers.4.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.4.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.4.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.4.self_attn.k_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.4.self_attn.o_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.4.self_attn.q_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.4.self_attn.v_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00003-of-00015.safetensors",
+    "model.layers.5.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.5.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.5.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.5.self_attn.k_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.5.self_attn.o_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.5.self_attn.q_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.5.self_attn.v_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00004-of-00015.safetensors",
+    "model.layers.6.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.6.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.6.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.6.self_attn.k_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.6.self_attn.o_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.6.self_attn.q_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.6.self_attn.v_proj.weight": "model-00003-of-00015.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00004-of-00015.safetensors",
+    "model.layers.7.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.7.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.7.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.7.self_attn.k_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.7.self_attn.o_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.7.self_attn.q_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.7.self_attn.v_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00004-of-00015.safetensors",
+    "model.layers.8.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.8.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.8.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.8.self_attn.k_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.8.self_attn.o_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.8.self_attn.q_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.8.self_attn.v_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00005-of-00015.safetensors",
+    "model.layers.9.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.9.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.9.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+    "model.layers.9.self_attn.k_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.9.self_attn.o_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.9.self_attn.q_proj.weight": "model-00004-of-00015.safetensors",
+    "model.layers.9.self_attn.v_proj.weight": "model-00004-of-00015.safetensors",
+    "model.norm.weight": "model-00015-of-00015.safetensors"
+  }
+}
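
Reviewer note: the `weight_map` above assigns each tensor name to a shard file, and a single layer can straddle two shards (for example, layer 0 keeps its attention projections in shard 1 and its MLP weights in shard 2). Every entry points at a `model-*-of-00015.safetensors` shard, whereas the weight files added in this commit are named `output-*-of-00005.safetensors`, so the index appears to be carried over from the original repository. The sketch below is illustrative only and is not part of the commit. It groups a small excerpt of the map by shard; the helper name `tensors_by_shard` is hypothetical, and the parameter estimate assumes 2 bytes per parameter, which the index itself does not state.

```python
import json
from collections import defaultdict

# A four-entry excerpt copied from the index above; the real file lists every tensor.
index_text = """
{
  "metadata": {"total_size": 69961662464},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00015.safetensors",
    "model.layers.0.input_layernorm.weight": "model-00002-of-00015.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00015.safetensors",
    "model.norm.weight": "model-00015-of-00015.safetensors"
  }
}
"""


def tensors_by_shard(index):
    """Invert weight_map: shard file name -> sorted list of tensor names (hypothetical helper)."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in sorted(shards.items())}


index = json.loads(index_text)
by_shard = tensors_by_shard(index)
for shard, names in by_shard.items():
    print(shard, len(names))

# Rough parameter count, ASSUMING 2 bytes per parameter (e.g. float16).
approx_params = index["metadata"]["total_size"] // 2
print(approx_params)
```

Running the same grouping over the full index would show which shard files a loader needs to open for any given layer.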

modeling_cohere.py ADDED

	@@ -0,0 +1,1280 @@

+# coding=utf-8
+# Copyright 2024 Cohere and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is based on the Llama model definition file in transformers
+"""PyTorch Cohere model."""
+import math
+import warnings
+from typing import List, Optional, Tuple, Union
+import torch
+import torch.nn.functional as F
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import CrossEntropyLoss
+from transformers import AutoModel, AutoModelForCausalLM
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache, DynamicCache, StaticCache
+from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+from transformers.modeling_outputs import (
+    BaseModelOutputWithPast,
+    CausalLMOutputWithPast,
+)
+from transformers.modeling_utils import PreTrainedModel
+from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
+from transformers.utils import (
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    is_flash_attn_2_available,
+    is_flash_attn_greater_or_equal_2_10,
+    logging,
+    replace_return_docstrings,
+)
+from .configuration_cohere import CohereConfig
+logger = logging.get_logger(__name__)
+_CONFIG_FOR_DOC = "CohereConfig"
+# Copied from transformers.models.llama.modeling_llama._get_unpad_data
+def _get_unpad_data(attention_mask):
+    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+    max_seqlen_in_batch = seqlens_in_batch.max().item()
+    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+    return (
+        indices,
+        cu_seqlens,
+        max_seqlen_in_batch,
+    )
+class LayerNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-5, bias=False):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.bias = nn.Parameter(torch.zeros(hidden_size)) if bias else None
+        self.variance_epsilon = eps
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        mean = hidden_states.mean(-1, keepdim=True)
+        variance = (hidden_states - mean).pow(2).mean(-1, keepdim=True)
+        hidden_states = (hidden_states - mean) * torch.rsqrt(variance + self.variance_epsilon)
+        hidden_states = self.weight.to(torch.float32) * hidden_states
+        if self.bias is not None:
+            hidden_states = hidden_states + self.bias.to(torch.float32)
+        return hidden_states.to(input_dtype)
+ALL_LAYERNORM_LAYERS.append(LayerNorm)
+class CohereRotaryEmbedding(nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+        super().__init__()
+        self.scaling_factor = scaling_factor
+        self.dim = dim
+        self.max_position_embeddings = max_position_embeddings
+        self.base = base
+        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        # For backward compatibility, register the cached cos and sin buffers
+        self.max_seq_len_cached = max_position_embeddings
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+        t = t / self.scaling_factor
+        freqs = torch.outer(t, self.inv_freq)
+        emb = torch.repeat_interleave(freqs, 2, dim=-1)
+        self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False)
+        self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False)
+    @property
+    def sin_cached(self):
+        logger.warning_once(
+            "The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use "
+            "the forward method of RoPE from now on instead. It is not used in the `CohereAttention` class"
+        )
+        return self._sin_cached
+    @property
+    def cos_cached(self):
+        logger.warning_once(
+            "The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use "
+            "the forward method of RoPE from now on instead. It is not used in the `CohereAttention` class"
+        )
+        return self._cos_cached
+    @torch.no_grad()
+    def forward(self, x, position_ids, seq_len=None):
+        if seq_len is not None:
+            logger.warning_once("The `seq_len` argument is deprecated and unused. It will be removed in v4.39.")
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+        position_ids_expanded = position_ids[:, None, :].float()
+        # Force float32 since bfloat16 loses precision on long contexts
+        # See https://github.com/huggingface/transformers/pull/29285
+        device_type = x.device.type
+        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+            emb = torch.repeat_interleave(freqs, 2, dim=-1)
+            cos = emb.cos()
+            sin = emb.sin()
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+def rotate_half(x):
+    # Split into even- and odd-indexed features, then rotate each (even, odd) pair (interleaved layout)
+    x1 = x[..., ::2]
+    x2 = x[..., 1::2]
+    rot_x = torch.stack([-x2, x1], dim=-1).flatten(-2)
+    return rot_x
+# Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+# Copied from transformers.models.llama.modeling_llama.LlamaMLP Llama->Cohere
+class CohereMLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+        self.act_fn = ACT2FN[config.hidden_act]
+    def forward(self, x):
+        if self.config.pretraining_tp > 1:
+            slice = self.intermediate_size // self.config.pretraining_tp
+            gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
+            up_proj_slices = self.up_proj.weight.split(slice, dim=0)
+            down_proj_slices = self.down_proj.weight.split(slice, dim=1)
+            gate_proj = torch.cat(
+                [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
+            )
+            up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
+            intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
+            down_proj = [
+                F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
+            ]
+            down_proj = sum(down_proj)
+        else:
+            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return down_proj
+# Copied from transformers.models.llama.modeling_llama.repeat_kv
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+class Attention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+    def __init__(self, config: CohereConfig, layer_idx: Optional[int] = None):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        if layer_idx is None:
+            logger.warning_once(
+                f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
+                "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
+                "when creating this class."
+            )
+        self.attention_dropout = config.attention_dropout
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+        self.max_position_embeddings = config.max_position_embeddings
+        self.rope_theta = config.rope_theta
+        self.is_causal = True
+        if (self.head_dim * self.num_heads) != self.hidden_size:
+            raise ValueError(
+                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                f" and `num_heads`: {self.num_heads})."
+            )
+        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
+        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
+        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
+        self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=config.attention_bias)
+        self.rotary_emb = CohereRotaryEmbedding(
+            self.head_dim,
+            max_position_embeddings=self.max_position_embeddings,
+            base=self.rope_theta,
+        )
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        bsz, q_len, _ = hidden_states.size()
+        if self.config.pretraining_tp > 1:
+            key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
+            query_slices = self.q_proj.weight.split(
+                (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
+            )
+            key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
+            value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
+            query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
+            query_states = torch.cat(query_states, dim=-1)
+            key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
+            key_states = torch.cat(key_states, dim=-1)
+            value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
+            value_states = torch.cat(value_states, dim=-1)
+        else:
+            query_states = self.q_proj(hidden_states)
+            key_states = self.k_proj(hidden_states)
+            value_states = self.v_proj(hidden_states)
+        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        past_key_value = getattr(self, "past_key_value", past_key_value)
+        cos, sin = self.rotary_emb(value_states, position_ids)
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        if past_key_value is not None:
+            # sin and cos are specific to RoPE models; cache_position is needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+        if attention_mask is not None:  # no matter the length, we just slice it
+            causal_mask = attention_mask
+            if cache_position is not None:
+                causal_mask = attention_mask[:, :, cache_position, : key_states.shape[-2]]
+            attn_weights = attn_weights + causal_mask
+        # upcast attention to fp32
+        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+        attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
+        attn_output = torch.matmul(attn_weights, value_states)
+        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+            raise ValueError(
+                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+                f" {attn_output.size()}"
+            )
+        attn_output = attn_output.transpose(1, 2).contiguous()
+        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+        if self.config.pretraining_tp > 1:
+            attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
+            o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
+            attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
+        else:
+            attn_output = self.o_proj(attn_output)
+        if not output_attentions:
+            attn_weights = None
+        return attn_output, attn_weights, past_key_value
+# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2 Llama->Cohere
+class CohereFlashAttention2(Attention):
+    """
+    Cohere flash attention module. This module inherits from `Attention` as the weights of the module stay
+    untouched. The only required change is in the forward pass, which needs to correctly call the public API of
+    flash attention and deal with padding tokens in case the input contains any of them.
+    """
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
+        # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+        # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
+        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.LongTensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        output_attentions = False
+        bsz, q_len, _ = hidden_states.size()
+        query_states = self.q_proj(hidden_states)
+        key_states = self.k_proj(hidden_states)
+        value_states = self.v_proj(hidden_states)
+        # Flash attention requires the input to have the layout
+        # batch_size x seq_length x num_heads x head_dim.
+        # The states are transposed here for RoPE and the cache update, then transposed back below.
+        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        cos, sin = self.rotary_emb(value_states, position_ids)
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        past_key_value = getattr(self, "past_key_value", past_key_value)
+        if past_key_value is not None:
+            # sin and cos are specific to RoPE models; cache_position is needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        # TODO: These transposes are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
+        # to be able to avoid many of these transpose/reshape/view calls.
+        query_states = query_states.transpose(1, 2)
+        key_states = key_states.transpose(1, 2)
+        value_states = value_states.transpose(1, 2)
+        dropout_rate = self.attention_dropout if self.training else 0.0
+        # In PEFT, the layer norms are usually cast to float32 for training stability,
+        # so the input hidden states get silently cast to float32. Hence, we need to
+        # cast them back to the correct dtype to be sure everything works as expected.
+        # This might slow down training & inference, so it is recommended not to cast the LayerNorms
+        # to fp32.
+        input_dtype = query_states.dtype
+        if input_dtype == torch.float32:
+            if torch.is_autocast_enabled():
+                target_dtype = torch.get_autocast_gpu_dtype()
+            # Handle the case where the model is quantized
+            elif hasattr(self.config, "_pre_quantization_dtype"):
+                target_dtype = self.config._pre_quantization_dtype
+            else:
+                target_dtype = self.q_proj.weight.dtype
+            logger.warning_once(
+                f"The input hidden states seem to have been silently cast to float32; this might be related to"
+                f" the fact that you have upcast embedding or layer norm layers to float32. The input will be cast back to"
+                f" {target_dtype}."
+            )
+            query_states = query_states.to(target_dtype)
+            key_states = key_states.to(target_dtype)
+            value_states = value_states.to(target_dtype)
+        attn_output = self._flash_attention_forward(
+            query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
+        )
+        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
+        attn_output = self.o_proj(attn_output)
+        if not output_attentions:
+            attn_weights = None
+        return attn_output, attn_weights, past_key_value
+    def _flash_attention_forward(
+        self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
+    ):
+        """
+        Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
+        first unpads the input, then computes the attention scores and pads the final attention scores.
+        Args:
+            query_states (`torch.Tensor`):
+                Input query states to be passed to Flash Attention API
+            key_states (`torch.Tensor`):
+                Input key states to be passed to Flash Attention API
+            value_states (`torch.Tensor`):
+                Input value states to be passed to Flash Attention API
+            attention_mask (`torch.Tensor`):
+                The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
+                position of padding tokens and 1 for the position of non-padding tokens.
+            dropout (`float`, *optional*):
+                Attention dropout
+            softmax_scale (`float`, *optional*):
+                The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
+        """
+        if not self._flash_attn_uses_top_left_mask:
+            causal = self.is_causal
+        else:
+            # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in CohereFlashAttention2 __init__.
+            causal = self.is_causal and query_length != 1
+        # Contains at least one padding token in the sequence
+        if attention_mask is not None:
+            batch_size = query_states.shape[0]
+            query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
+                query_states, key_states, value_states, attention_mask, query_length
+            )
+            cu_seqlens_q, cu_seqlens_k = cu_seq_lens
+            max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
+            attn_output_unpad = flash_attn_varlen_func(
+                query_states,
+                key_states,
+                value_states,
+                cu_seqlens_q=cu_seqlens_q,
+                cu_seqlens_k=cu_seqlens_k,
+                max_seqlen_q=max_seqlen_in_batch_q,
+                max_seqlen_k=max_seqlen_in_batch_k,
+                dropout_p=dropout,
+                softmax_scale=softmax_scale,
+                causal=causal,
+            )
+            attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
+        else:
+            attn_output = flash_attn_func(
+                query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
+            )
+        return attn_output
+    def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
+        indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
+        batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
+        key_layer = index_first_axis(
+            key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+        )
+        value_layer = index_first_axis(
+            value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+        )
+        if query_length == kv_seq_len:
+            query_layer = index_first_axis(
+                query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
+            )
+            cu_seqlens_q = cu_seqlens_k
+            max_seqlen_in_batch_q = max_seqlen_in_batch_k
+            indices_q = indices_k
+        elif query_length == 1:
+            max_seqlen_in_batch_q = 1
+            cu_seqlens_q = torch.arange(
+                batch_size + 1, dtype=torch.int32, device=query_layer.device
+            )  # There is a memcpy here, that is very bad.
+            indices_q = cu_seqlens_q[:-1]
+            query_layer = query_layer.squeeze(1)
+        else:
+            # The -q_len: slice assumes left padding.
+            attention_mask = attention_mask[:, -query_length:]
+            query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
+        return (
+            query_layer,
+            key_layer,
+            value_layer,
+            indices_q,
+            (cu_seqlens_q, cu_seqlens_k),
+            (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
+        )
+class SdpaAttention(Attention):
+    """
+    Attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
+    `Attention` as the weights of the module stay untouched. The only changes are in the forward pass, to adapt to
+    the SDPA API.
+    """
+    # Adapted from Attention.forward
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        cache_position: Optional[torch.LongTensor] = None,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        if output_attentions:
+            # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
+            logger.warning_once(
+                "CohereModel is using SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
+                'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
+            )
+            return super().forward(
+                hidden_states=hidden_states,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_value=past_key_value,
+                output_attentions=output_attentions,
+                use_cache=use_cache,
+                cache_position=cache_position,
+            )
+        bsz, q_len, _ = hidden_states.size()
+        query_states = self.q_proj(hidden_states)
+        key_states = self.k_proj(hidden_states)
+        value_states = self.v_proj(hidden_states)
+        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        cos, sin = self.rotary_emb(value_states, position_ids)
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        # In case static cache is used, it is an instance attribute.
+        past_key_value = getattr(self, "past_key_value", past_key_value)
+        if past_key_value is not None:
+            # sin and cos are specific to RoPE models; cache_position is needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+        causal_mask = attention_mask
+        if attention_mask is not None and cache_position is not None:
+            causal_mask = causal_mask[:, :, cache_position, : key_states.shape[-2]]
+        # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
+        # Reference: https://github.com/pytorch/pytorch/issues/112577.
+        if query_states.device.type == "cuda" and causal_mask is not None:
+            query_states = query_states.contiguous()
+            key_states = key_states.contiguous()
+            value_states = value_states.contiguous()
+        attn_output = torch.nn.functional.scaled_dot_product_attention(
+            query_states,
+            key_states,
+            value_states,
+            attn_mask=causal_mask,
+            dropout_p=self.attention_dropout if self.training else 0.0,
+        )
+        attn_output = attn_output.transpose(1, 2).contiguous()
+        attn_output = attn_output.view(bsz, q_len, self.hidden_size)
+        attn_output = self.o_proj(attn_output)
+        return attn_output, None, past_key_value
+COHERE_ATTENTION_CLASSES = {
+    "eager": Attention,
+    "flash_attention_2": CohereFlashAttention2,
+    "sdpa": SdpaAttention,
+}
+class CohereDecoderLayer(nn.Module):
+    def __init__(self, config: CohereConfig, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.self_attn = COHERE_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
+        self.mlp = CohereMLP(config)
+        self.input_layernorm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        output_attentions: Optional[bool] = False,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs,
+    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`, *optional*):
+                attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
+                query_sequence_length, key_sequence_length)` if default attention is used.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+        """
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
+            )
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        # Self Attention
+        hidden_states_attention, self_attn_weights, present_key_value = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        # Fully Connected
+        hidden_states_mlp = self.mlp(hidden_states)
+        # Add everything together
+        hidden_states = residual + hidden_states_attention + hidden_states_mlp
+        outputs = (hidden_states,)
+        if output_attentions:
+            outputs += (self_attn_weights,)
+        if use_cache:
+            outputs += (present_key_value,)
+        return outputs
+COHERE_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+    Parameters:
+        config ([`CohereConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+@add_start_docstrings(
+    "The bare Cohere Model outputting raw hidden-states without any specific head on top.",
+    COHERE_START_DOCSTRING,
+)
+class CoherePreTrainedModel(PreTrainedModel):
+    config_class = CohereConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["CohereDecoderLayer"]
+    _skip_keys_device_placement = ["past_key_values", "causal_mask"]
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    _supports_cache_class = True
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+    def _setup_cache(self, cache_cls, max_batch_size, max_cache_len: Optional[int] = None):
+        if self.config._attn_implementation == "flash_attention_2" and cache_cls == StaticCache:
+            raise ValueError(
+                "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` "
+                "make sure to use `sdpa` in the meantime, and open an issue at https://github.com/huggingface/transformers"
+            )
+        if max_cache_len > self.model.causal_mask.shape[-1] or self.device != self.model.causal_mask.device:
+            causal_mask = torch.full(
+                (max_cache_len, max_cache_len), fill_value=True, device=self.device, dtype=torch.bool
+            )
+            self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
+        for layer in self.model.layers:
+            device = layer.input_layernorm.weight.device
+            if hasattr(self.config, "_pre_quantization_dtype"):
+                dtype = self.config._pre_quantization_dtype
+            else:
+                dtype = layer.self_attn.o_proj.weight.dtype
+            layer.self_attn.past_key_value = cache_cls(
+                self.config, max_batch_size, max_cache_len, device=device, dtype=dtype
+            )
+    def _reset_cache(self):
+        for layer in self.model.layers:
+            layer.self_attn.past_key_value = None
+COHERE_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+            it.
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+            [What are input IDs?](../glossary#input-ids)
+        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+            [What are attention masks?](../glossary#attention-mask)
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+            If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
+            `past_key_values`).
+            If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+            and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+            information on the default strategy.
+        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
+            config.max_position_embeddings - 1]`.
+            [What are position IDs?](../glossary#position-ids)
+        past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
+            Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up
+            sequential decoding. This typically consists of the `past_key_values` returned by the model at a
+            previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
+            Two formats are allowed:
+            - a [`~cache_utils.Cache`] instance;
+            - Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors of
+            shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
+            cache format.
+            The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
+            legacy cache format will be returned.
+            If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
+            have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
+            of shape `(batch_size, sequence_length)`.
+        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+            model's internal embedding lookup matrix.
+        use_cache (`bool`, *optional*):
+            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+            `past_key_values`).
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+@add_start_docstrings(
+    "The bare Cohere Model outputting raw hidden-states without any specific head on top.",
+    COHERE_START_DOCSTRING,
+)
+class CohereModel(CoherePreTrainedModel):
+    """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`CohereDecoderLayer`].
+    Args:
+        config: CohereConfig
+    """
+    def __init__(self, config: CohereConfig):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList(
+            [CohereDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self.norm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.gradient_checkpointing = False
+        # Register a causal mask to separate causal and padding mask creation. Merging happens in the attention class.
+        # NOTE: This is not friendly with TorchScript, ONNX, ExportedProgram serialization for very large `max_position_embeddings`.
+        causal_mask = torch.full(
+            (config.max_position_embeddings, config.max_position_embeddings), fill_value=True, dtype=torch.bool
+        )
+        self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.embed_tokens
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+    @add_start_docstrings_to_model_forward(COHERE_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError(
+                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
+            )
+        if self.gradient_checkpointing and self.training and use_cache:
+            logger.warning_once(
+                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+            )
+            use_cache = False
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+        past_seen_tokens = 0
+        if use_cache:  # kept for BC (cache positions)
+            if not isinstance(past_key_values, StaticCache):
+                past_key_values = DynamicCache.from_legacy_cache(past_key_values)
+                past_seen_tokens = past_key_values.get_seq_length()
+        if cache_position is None:
+            if isinstance(past_key_values, StaticCache):
+                raise ValueError("cache_position is a required argument when using StaticCache.")
+            cache_position = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+        causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
+        # embed positions
+        hidden_states = inputs_embeds
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        next_decoder_cache = None
+        for decoder_layer in self.layers:
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
+                    hidden_states,
+                    causal_mask,
+                    position_ids,
+                    past_key_values,
+                    output_attentions,
+                    use_cache,
+                    cache_position,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=causal_mask,
+                    position_ids=position_ids,
+                    past_key_value=past_key_values,
+                    output_attentions=output_attentions,
+                    use_cache=use_cache,
+                    cache_position=cache_position,
+                )
+            hidden_states = layer_outputs[0]
+            if use_cache:
+                next_decoder_cache = layer_outputs[2 if output_attentions else 1]
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+        hidden_states = self.norm(hidden_states)
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+        next_cache = None
+        if use_cache:
+            next_cache = (
+                next_decoder_cache.to_legacy_cache() if isinstance(next_decoder_cache, Cache) else next_decoder_cache
+            )
+        if not return_dict:
+            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=next_cache,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+    # TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length even when the static
+    # KV cache is used. This is an issue for torch.compile which then recaptures cudagraphs at each decode step due to the dynamic shapes.
+    # (`recording cudagraph tree for symint key 13`, etc.), which is VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using
+    # `fullgraph=True`. See more context in https://github.com/huggingface/transformers/pull/29114
+    def _update_causal_mask(self, attention_mask, input_tensor):
+        if self.config._attn_implementation == "flash_attention_2":
+            if attention_mask is not None and 0.0 in attention_mask:
+                return attention_mask
+            return None
+        batch_size, seq_length = input_tensor.shape[:2]
+        dtype = input_tensor.dtype
+        device = input_tensor.device
+        # support going beyond cached `max_position_embeddings`
+        if seq_length > self.causal_mask.shape[-1]:
+            causal_mask = torch.full((2 * self.causal_mask.shape[-1], 2 * self.causal_mask.shape[-1]), fill_value=1)
+            self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
+        # We use the current dtype to avoid any overflows
+        min_dtype = torch.finfo(dtype).min
+        causal_mask = self.causal_mask[None, None, :, :].to(dtype=dtype, device=device) * min_dtype
+        causal_mask = causal_mask.expand(batch_size, 1, -1, -1)
+        if attention_mask is not None and attention_mask.dim() == 2:
+            causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+            mask_length = attention_mask.shape[-1]
+            padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
+            causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
+        if (
+            self.config._attn_implementation == "sdpa"
+            and attention_mask is not None
+            and attention_mask.device.type == "cuda"
+        ):
+            # TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
+            is_tracing = (
+                torch.jit.is_tracing()
+                or isinstance(input_tensor, torch.fx.Proxy)
+                or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
+            )
+            if not is_tracing and torch.any(attention_mask != 1):
+                # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
+                # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+                # Details: https://github.com/pytorch/pytorch/issues/110213
+                causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+        return causal_mask
+class CohereForCausalLM(CoherePreTrainedModel):
+    _tied_weights_keys = ["model.embed_tokens.weight", "lm_head.weight"]
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = CohereModel(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.logit_scale = config.logit_scale
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+    def set_decoder(self, decoder):
+        self.model = decoder
+    def get_decoder(self):
+        return self.model
+    @add_start_docstrings_to_model_forward(COHERE_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
+                config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
+        Returns:
+        Example:
+        ```python
+        >>> from transformers import AutoTokenizer, CohereForCausalLM
+        #TODO: Model name needs to be updated
+        >>> model = CohereForCausalLM.from_pretrained("CohereForAI/Cohere-model")
+        >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/Cohere-model")
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            cache_position=cache_position,
+        )
+        hidden_states = outputs[0]
+        if self.config.pretraining_tp > 1:
+            lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
+            logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
+            logits = torch.cat(logits, dim=-1)
+        else:
+            logits = self.lm_head(hidden_states)
+        logits = logits * self.logit_scale
+        logits = logits.float()
+        loss = None
+        if labels is not None:
+            # Shift so that tokens < n predict n
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            # Flatten the tokens
+            loss_fct = CrossEntropyLoss()
+            shift_logits = shift_logits.view(-1, self.config.vocab_size)
+            shift_labels = shift_labels.view(-1)
+            # Enable model parallelism
+            shift_labels = shift_labels.to(shift_logits.device)
+            loss = loss_fct(shift_logits, shift_labels)
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+    def prepare_inputs_for_generation(
+        self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+    ):
+        past_length = 0
+        if past_key_values is not None:
+            if isinstance(past_key_values, Cache):
+                cache_length = past_key_values.get_seq_length()
+                past_length = past_key_values.seen_tokens
+                max_cache_length = past_key_values.get_max_length()
+            else:
+                cache_length = past_length = past_key_values[0][0].shape[2]
+                max_cache_length = None
+            # Keep only the unprocessed tokens:
+            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
+            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
+            # input)
+            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
+                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
+            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
+            # input_ids based on the past_length.
+            elif past_length < input_ids.shape[1]:
+                input_ids = input_ids[:, past_length:]
+            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
+            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
+            if (
+                max_cache_length is not None
+                and attention_mask is not None
+                and cache_length + input_ids.shape[1] > max_cache_length
+            ):
+                attention_mask = attention_mask[:, -max_cache_length:]
+        position_ids = kwargs.get("position_ids", None)
+        if attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if past_key_values:
+                position_ids = position_ids[:, -input_ids.shape[1] :]
+        if self.generation_config.cache_implementation == "static":
+            # generation with static cache
+            cache_position = kwargs.get("cache_position", None)
+            if cache_position is None:
+                past_length = 0
+            else:
+                past_length = cache_position[-1] + 1
+            input_ids = input_ids[:, past_length:]
+            position_ids = position_ids[:, past_length:]
+        # TODO @gante we should only keep a `cache_position` in generate, and do +=1.
+        # same goes for position ids. Could also help with continued generation.
+        input_length = position_ids.shape[-1] if position_ids is not None else input_ids.shape[-1]
+        cache_position = torch.arange(past_length, past_length + input_length, device=input_ids.device)
+        position_ids = position_ids.contiguous() if position_ids is not None else None
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and past_key_values is None:
+            model_inputs = {"inputs_embeds": inputs_embeds}
+        else:
+            # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
+            # recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
+            # TODO: use `next_tokens` directly instead.
+            model_inputs = {"input_ids": input_ids.contiguous()}
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "cache_position": cache_position,
+                "past_key_values": past_key_values,
+                "use_cache": kwargs.get("use_cache"),
+                "attention_mask": attention_mask,
+            }
+        )
+        return model_inputs
+    @staticmethod
+    def _reorder_cache(past_key_values, beam_idx):
+        reordered_past = ()
+        for layer_past in past_key_values:
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
+        return reordered_past
+# register models as AutoModel and AutoModelForCausalLM
+AutoModel.register(CohereConfig, CohereModel)
+AutoModelForCausalLM.register(CohereConfig, CohereForCausalLM)
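
Reviewer note on `prepare_inputs_for_generation` in `modeling_cohere.py` above: when no `position_ids` are supplied, they are derived from the attention mask as a running count of attended tokens minus one, with padded slots set to 1. The following is a minimal pure-Python sketch of that arithmetic only. The helper name `position_ids_from_mask` is hypothetical and plain lists stand in for tensors; the model code itself uses `cumsum` and `masked_fill_` on a `torch` tensor.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """Mirror `mask.cumsum(-1) - 1` followed by filling padded slots with 1.

    `attention_mask` is a list of rows of 0/1 ints (hypothetical stand-in
    for the 2D tensor the model receives).
    """
    position_ids = []
    for row in attention_mask:
        running = list(accumulate(row))  # cumulative sum along the sequence
        position_ids.append(
            [1 if m == 0 else c - 1 for m, c in zip(row, running)]
        )
    return position_ids


# One left-padded row and one unpadded row.
batch_mask = [
    [0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
for row in position_ids_from_mask(batch_mask):
    print(row)
```

With left padding, the first real token of each row gets position 0 regardless of how many pad tokens precede it, which appears to be why the tokenizer in this commit sets `padding_side = "left"` for batched generation.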

original_repo_url.txt ADDED Viewed

	@@ -0,0 +1 @@


+https://huggingface.co/CohereForAI/c4ai-command-r-v01

output-00001-of-00005.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:88133c16e799264d13192e376b32de7ae1bfa4fd8f1867f3885d2b26055b16bb
+size 8587925566

output-00002-of-00005.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bd8bd6d89b19c414e2ed8301d98eb25c3a4306f1f7871d485ccca753dfe48046
+size 8569599568

output-00003-of-00005.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:18779478f820e7677b7468e191967cda01f8cdf4eee27082120d732f82423a8c
+size 8582850448

output-00004-of-00005.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1a252c8446fad63ebee87edd843b402920acf88d608f34eed9ce97945c47b692
+size 6871343664

output-00005-of-00005.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cba5ab17ad5c45e11404996de897e2f740e98727cde2921f781c161b528ba62f
+size 2097152096
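
Reviewer note: the five `output-*.safetensors` entries above are Git LFS pointer files, not the weights themselves. Each pointer carries a spec version, a SHA-256 object id, and the byte size of the real file. As a rough sketch (the helper name `parse_lfs_pointer` is hypothetical, and the sizes are copied by hand from the pointers in this commit), the total download size can be tallied like this:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


# Sizes copied from the five pointer files shown in this commit.
shard_sizes = [8587925566, 8569599568, 8582850448, 6871343664, 2097152096]

# Parse the last pointer as an example of the format.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:cba5ab17ad5c45e11404996de897e2f740e98727cde2921f781c161b528ba62f\n"
    f"size {shard_sizes[-1]}\n"
)

total_bytes = sum(shard_sizes)
print(pointer["oid"])
print(f"{total_bytes} bytes = {total_bytes / 1e9:.1f} GB")
```

The sum is only the size of the shard files; it says nothing about the memory needed at inference time, which also depends on cache and context length.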

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,23 @@

+{
+  "bos_token": {
+    "content": "<BOS_TOKEN>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|END_OF_TURN_TOKEN|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<PAD>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
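
Reviewer note: each entry in `special_tokens_map.json` above is a small dict describing one added token, with the literal token string under `content`. A minimal sketch of flattening the map to plain strings follows. The JSON is inlined here for self-containment and the variable names are hypothetical; in practice the file would be read from the repository checkout.

```python
import json

# Contents of special_tokens_map.json as added in this commit.
raw = """
{
  "bos_token": {"content": "<BOS_TOKEN>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false},
  "eos_token": {"content": "<|END_OF_TURN_TOKEN|>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false},
  "pad_token": {"content": "<PAD>", "lstrip": false, "normalized": false,
                "rstrip": false, "single_word": false}
}
"""

special_tokens_map = json.loads(raw)

# Keep only the token string for each role.
token_strings = {name: spec["content"] for name, spec in special_tokens_map.items()}

for name, content in token_strings.items():
    print(f"{name}: {content}")
```

Note that `eos_token` in this map is the end-of-turn marker rather than a dedicated end-of-sequence string.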

tokenization_cohere_fast.py ADDED Viewed

	@@ -0,0 +1,754 @@

+# coding=utf-8
+# Copyright 2024 Cohere and The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# This file is based on the tokenization_llama_fast.py file in transformers
+import os
+from shutil import copyfile
+from typing import Optional, Tuple, Dict, Union, List, Literal
+from tokenizers import processors
+from transformers import AutoTokenizer
+from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
+from transformers.utils import logging
+from transformers.utils.versions import require_version
+from transformers.tokenization_utils_base import TensorType
+from transformers.pipelines.conversational import Conversation
+from .configuration_cohere import CohereConfig
+require_version("tokenizers>=0.13.3")
+logger = logging.get_logger(__name__)
+VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.json"}
+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {
+        "cohere-tokenizer": "https://huggingface.co/Cohere/Command-nightly/blob/main/tokenizer.json",
+    },
+}
+# fmt: off
+DEFAULT_SYSTEM_PROMPT = "You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere."
+DEFAULT_RAG_PREAMBLE = """## Task and Context
+You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
+## Style Guide
+Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling."""
+# fmt: on
+class CohereTokenizerFast(PreTrainedTokenizerFast):
+    """
+    Construct a Cohere tokenizer. Based on byte-level Byte-Pair-Encoding.
+    This uses notably ByteFallback and NFC normalization.
+    ```python
+    >>> from transformers import AutoTokenizer
+    >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-0.1")
+    >>> tokenizer.encode("Hello this is a test")
+    [1, 15043, 445, 338, 263, 1243]
+    ```
+    If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or
+    call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the
+    values of the first token and final token of an encoded sequence will not be correct). For more details, check out
+    the [post-processors](https://huggingface.co/docs/tokenizers/api/post-processors) documentation.
+    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
+    refer to this superclass for more information regarding those methods.
+    Args:
+        vocab_file (`str`, *optional*):
+            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that
+            contains the vocabulary necessary to instantiate a tokenizer.
+        tokenizer_file (`str`, *optional*):
+            [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
+            contains everything needed to load the tokenizer.
+        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
+            Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
+            extra spaces.
+        unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<UNK>"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<BOS_TOKEN>"`):
+            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
+        eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<EOS_TOKEN>"`):
+            The end of sequence token.
+        add_bos_token (`bool`, *optional*, defaults to `True`):
+            Whether or not to add a `bos_token` at the start of sequences.
+        add_eos_token (`bool`, *optional*, defaults to `False`):
+            Whether or not to add an `eos_token` at the end of sequences.
+        use_default_system_prompt (`bool`, *optional*, defaults to `False`):
+            Whether or not the default system prompt for Cohere tokenizer should be used.
+        add_prefix_space (`bool`, *optional*):
+            Whether or not the tokenizer should automatically add a prefix space.
+    """
+    vocab_files_names = VOCAB_FILES_NAMES
+    padding_side = "left"
+    model_input_names = ["input_ids", "attention_mask"]
+    def __init__(
+        self,
+        vocab_file=None,
+        tokenizer_file=None,
+        clean_up_tokenization_spaces=False,
+        unk_token="<UNK>",
+        bos_token="<BOS_TOKEN>",
+        eos_token="<EOS_TOKEN>",
+        add_bos_token=True,
+        add_eos_token=False,
+        use_default_system_prompt=False,
+        add_prefix_space=None,
+        **kwargs,
+    ):
+        if add_prefix_space is not None:
+            logger.warning_once(
+                "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers"
+            )
+            kwargs["from_slow"] = True
+        super().__init__(
+            vocab_file=vocab_file,
+            tokenizer_file=tokenizer_file,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            add_bos_token=add_bos_token,
+            add_eos_token=add_eos_token,
+            use_default_system_prompt=use_default_system_prompt,
+            **kwargs,
+        )
+        self._add_bos_token = add_bos_token
+        self._add_eos_token = add_eos_token
+        self.update_post_processor()
+        self.use_default_system_prompt = use_default_system_prompt
+        self.vocab_file = vocab_file
+        self.grounded_generation_template = kwargs.pop("grounded_generation_template", None)
+        self.tool_use_template = kwargs.pop("tool_use_template", None)
+    def update_post_processor(self):
+        """
+        Updates the underlying post processor with the current `bos_token` and `eos_token`.
+        """
+        bos = self.bos_token
+        bos_token_id = self.bos_token_id
+        if bos is None and self.add_bos_token:
+            raise ValueError("add_bos_token = True but bos_token = None")
+        eos = self.eos_token
+        eos_token_id = self.eos_token_id
+        if eos is None and self.add_eos_token:
+            raise ValueError("add_eos_token = True but eos_token = None")
+        single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
+        pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"
+        special_tokens = []
+        if self.add_bos_token:
+            special_tokens.append((bos, bos_token_id))
+        if self.add_eos_token:
+            special_tokens.append((eos, eos_token_id))
+        self._tokenizer.post_processor = processors.TemplateProcessing(
+            single=single, pair=pair, special_tokens=special_tokens
+        )
+    @property
+    def add_eos_token(self):
+        return self._add_eos_token
+    @property
+    def add_bos_token(self):
+        return self._add_bos_token
+    @add_eos_token.setter
+    def add_eos_token(self, value):
+        self._add_eos_token = value
+        self.update_post_processor()
+    @add_bos_token.setter
+    def add_bos_token(self, value):
+        self._add_bos_token = value
+        self.update_post_processor()
+    @property
+    def default_chat_template(self):
+        """
+        Cohere Tokenizer uses <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> to indicate each turn in a chat.
+        Additionally, to indicate the source of the message, it uses <|USER_TOKEN|>, <|CHATBOT_TOKEN|> and <|SYSTEM_TOKEN|>
+        for user, assistant and system messages respectively.
+        The output should look something like:
+        <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ preamble }}<|END_OF_TURN_TOKEN|>
+        <|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ How are you? }}<|END_OF_TURN_TOKEN|>
+        <|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ I am doing well! }}<|END_OF_TURN_TOKEN|>
+        Use add_generation_prompt to add a prompt for the model to generate a response:
+        >>> messages = [{"role": "user", "content": "Hello, how are you?"}]
+        >>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+        <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+        """
+        logger.warning_once(
+            "\nNo chat template is defined for this tokenizer - using the default template "
+            f"for the {self.__class__.__name__} class. If the default is not appropriate for "
+            "your model, please set `tokenizer.chat_template` to an appropriate template. "
+            "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n"
+        )
+        template = (
+            "{{ bos_token }}"
+            "{% if messages[0]['role'] == 'system' %}"
+            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
+            "{% set system_message = messages[0]['content'] %}"
+            "{% elif USE_DEFAULT_PROMPT == true %}"
+            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
+            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
+            "{% else %}"
+            "{% set loop_messages = messages %}"
+            "{% set system_message = false %}"
+            "{% endif %}"
+            "{% if system_message != false %}"  # Start with system message
+            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% endif %}"
+            "{% for message in loop_messages %}"  # Loop over all non-system messages
+            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
+            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
+            "{% endif %}"
+            "{% set content = message['content'] %}"
+            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
+            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% elif message['role'] == 'assistant' %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% endif %}"
+            "{% endfor %}"
+            "{% if add_generation_prompt %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
+            "{% endif %}"
+        )
+        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
+        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
+        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
+        return template
+    @property
+    def default_tool_use_template(self):
+        template = (
+            "{{ bos_token }}"
+            "{% if messages[0]['role'] == 'system' %}"
+            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
+            "{% set system_message = messages[0]['content'] %}"
+            "{% else %}"
+            "{% set loop_messages = messages %}"
+            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
+            "{% endif %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
+            "{{ '# Safety Preamble' }}"
+            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
+            "{{ '\n\n# System Preamble' }}"
+            "{{ '\n## Basic Rules' }}"
+            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
+            "{{ '\n\n# User Preamble' }}"
+            "{{ '\n' + system_message }}"
+            "{{'\n\n## Available Tools\nHere is a list of tools that you have available to you:\n\n'}}"
+            "{% for tool in tools %}"
+            "{% if loop.index0 != 0 %}"
+            "{{ '\n\n'}}"
+            "{% endif %}"
+            "{{'```python\ndef ' + tool.name + '('}}"
+            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
+            "{% if loop.index0 != 0 %}"
+            "{{ ', '}}"
+            "{% endif %}"
+            "{{param_name}}: "
+            "{% if not param_fields.required %}"
+            "{{'Optional[' + param_fields.type + '] = None'}}"
+            "{% else %}"
+            "{{ param_fields.type }}"
+            "{% endif %}"
+            "{% endfor %}"
+            "{{ ') -> List[Dict]:\n    \"\"\"'}}"
+            "{{ tool.description }}"
+            "{% if tool.parameter_definitions|length != 0 %}"
+            "{{ '\n\n    Args:\n        '}}"
+            "{% for param_name, param_fields in tool.parameter_definitions.items() %}"
+            "{% if loop.index0 != 0 %}"
+            "{{ '\n        ' }}"
+            "{% endif %}"
+            "{{ param_name + ' ('}}"
+            "{% if not param_fields.required %}"
+            "{{'Optional[' + param_fields.type + ']'}}"
+            "{% else %}"
+            "{{ param_fields.type }}"
+            "{% endif %}"
+            "{{ '): ' + param_fields.description }}"
+            "{% endfor %}"
+            "{% endif %}"
+            "{{ '\n    \"\"\"\n    pass\n```' }}"
+            "{% endfor %}"
+            "{{ '<|END_OF_TURN_TOKEN|>'}}"
+            "{% for message in loop_messages %}"
+            "{% set content = message['content'] %}"
+            "{% if message['role'] == 'user' %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% elif message['role'] == 'system' %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% elif message['role'] == 'assistant' %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% endif %}"
+            "{% endfor %}"
+            "{{'<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write \\'Action:\\' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user\\'s last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:\n```json\n[\n    {\n        \"tool_name\": title of the tool in the specification,\n        \"parameters\": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters\n    }\n]```<|END_OF_TURN_TOKEN|>'}}"
+            "{% if add_generation_prompt %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
+            "{% endif %}"
+        )
+        default_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
+        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
+        return template
+    @property
+    def default_grounded_generation_template(self):
+        template = (
+            "{{ bos_token }}"
+            "{% if messages[0]['role'] == 'system' %}"
+            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
+            "{% set system_message = messages[0]['content'] %}"
+            "{% else %}"
+            "{% set loop_messages = messages %}"
+            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
+            "{% endif %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
+            "{{ '# Safety Preamble' }}"
+            "{{ '\nThe instructions in this section override those in the task description and style guide sections. Don\\'t answer questions that are harmful or immoral.' }}"
+            "{{ '\n\n# System Preamble' }}"
+            "{{ '\n## Basic Rules' }}"
+            "{{ '\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user\\'s requests, you cite your sources in your answers, according to those instructions.' }}"
+            "{{ '\n\n# User Preamble' }}"
+            "{{ '\n' + system_message }}"
+            "{{ '<|END_OF_TURN_TOKEN|>'}}"
+            "{% for message in loop_messages %}"  # Loop over all non-system messages
+            "{% set content = message['content'] %}"
+            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
+            "{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% elif message['role'] == 'system' %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% elif message['role'] == 'assistant' %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'  + content.strip() + '<|END_OF_TURN_TOKEN|>' }}"
+            "{% endif %}"
+            "{% endfor %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>'}}"
+            "{{ '<results>' }}"
+            "{% for document in documents %}"  # Loop over all retrieved documents
+            "{{ '\nDocument: ' }}"
+            "{{ loop.index0 }}\n"
+            "{% for key, value in document.items() %}"
+            "{{ key }}: {{value}}\n"
+            "{% endfor %}"
+            "{% endfor %}"
+            "{{ '</results>'}}"
+            "{{ '<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' }}"
+            "{{ 'Carefully perform the following instructions, in order, starting each with a new line.\n' }}"
+            "{{ 'Firstly, Decide which of the retrieved documents are relevant to the user\\'s last input by writing \\'Relevant Documents:\\' followed by comma-separated list of document numbers. If none are relevant, you should instead write \\'None\\'.\n' }}"
+            "{{ 'Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user\\'s last input by writing \\'Cited Documents:\\' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write \\'None\\'.\n' }}"
+            "{% if citation_mode=='accurate' %}"
+                "{{ 'Thirdly, Write \\'Answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.\n' }}"
+            "{% endif %}"
+            "{{ 'Finally, Write \\'Grounded answer:\\' followed by a response to the user\\'s last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.' }}"
+            "{{ '<|END_OF_TURN_TOKEN|>' }}"
+            "{% if add_generation_prompt %}"
+            "{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}"
+            "{% endif %}"
+        )
+        default_message = DEFAULT_RAG_PREAMBLE.replace("\n", "\\n").replace("'", "\\'")
+        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
+        return template
+    def _apply_template_with_arguments(
+        self,
+        conversation: Union[List[Dict[str, str]], "Conversation"],
+        template: Optional[str] = None,
+        add_generation_prompt: bool = False,
+        tokenize: bool = True,
+        padding: bool = False,
+        truncation: bool = False,
+        max_length: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_dict: bool = False,
+        **kwargs,
+    ) -> Union[str, List[int]]:
+        """Just tokenization_utils_base.apply_chat_template, but modified so that the Jinja template can take kwargs"""
+        if hasattr(conversation, "messages"):
+            # Indicates it's a Conversation object
+            conversation = conversation.messages
+        # Compilation function uses a cache to avoid recompiling the same template
+        compiled_template = self._compile_jinja_template(template)
+        rendered = compiled_template.render(
+            messages=conversation,
+            add_generation_prompt=add_generation_prompt,
+            **kwargs,
+            **self.special_tokens_map
+        )
+        if padding is True:
+            padding = "max_length"  # There's only one sequence here, so "longest" makes no sense
+        if tokenize:
+            if return_dict:
+                return self(
+                    rendered,
+                    padding=padding,
+                    truncation=truncation,
+                    max_length=max_length,
+                    add_special_tokens=False,
+                    return_tensors=return_tensors,
+                    **kwargs,
+                )
+            else:
+                return self.encode(
+                    rendered,
+                    padding=padding,
+                    truncation=truncation,
+                    max_length=max_length,
+                    add_special_tokens=False,
+                    return_tensors=return_tensors,
+                    **kwargs,
+                )
+        else:
+            return rendered
+    def apply_tool_use_template(
+        self,
+        conversation: Union[List[Dict[str, str]], "Conversation"],
+        tools: List[Dict],
+        tool_use_template: Optional[str] = None,
+        **kwargs
+    ) -> Union[str, List[int]]:
+        """Create a Command-R tool-use prompt.
+        Once rendered, the prompt instructs the model to generate a list of actions to perform on a set of user supplied tools
+        to help carry out the user's requests.
+        Conceptually, this works in the same way as `apply_chat_template`, but takes an additional `tools` parameter.
+        Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of available
+        tools for the model to use into a prompt string, or a list of token ids.
+        This method will use the tokenizer's `default_tool_use_template` template specified at the class level.
+        You can override the default template using the `tool_use_template` kwarg but the quality of your results may decrease.
+        Args:
+            conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts
+                with "role" and "content" keys, representing the chat history so far.
+            tools (List[Dict]): a list of tools to render into the prompt for the model to choose from.
+                See an example at the bottom of the docstring.
+                The format should be:
+                   * name (str): The name of the tool to be called. Valid names contain only the characters a-z,
+                        A-Z, 0-9, _ and must not begin with a digit.
+                   * description (str): The description of what the tool does, the model uses the description to
+                        choose when and how to call the function.
+                   * parameter_definitions (Dict[str, Dict]): The input parameters of the tool. Accepts a dictionary
+                        where the key is the name of the parameter and the value is the parameter spec.
+                        Valid parameter names contain only the characters a-z, A-Z, 0-9, _ and must not begin with a digit.
+                        Parameter specs are as follows:
+                       * description (str): The description of the parameter.
+                       * type (str): The type of the parameter. Most effective for Python builtin data types, such as 'str', 'bool'.
+                       * required (bool): Denotes whether the parameter is always present (required) or not. Defaults to not required.
+            tool_use_template (str, *optional*): A Jinja template to use for this conversion. If
+                this is not passed, the tokenizer's default tool use template will be used instead.
+            add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
+                the start of an assistant message. This is useful when you want to generate a response from the model.
+                Note that this argument will be passed to the chat template, and so it must be supported in the
+                template for this argument to have any effect.
+            tokenize (`bool`, defaults to `True`):
+                Whether to tokenize the output. If `False`, the output will be a string.
+            padding (`bool`, defaults to `False`):
+                Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`.
+            truncation (`bool`, defaults to `False`):
+                Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`.
+            max_length (`int`, *optional*):
+                Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If
+                not specified, the tokenizer's `max_length` attribute will be used as a default.
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable
+                values are:
+                - `'tf'`: Return TensorFlow `tf.Tensor` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return NumPy `np.ndarray` objects.
+                - `'jax'`: Return JAX `jnp.ndarray` objects.
+            return_dict (`bool`, *optional*, defaults to `False`):
+                Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`.
+            **tokenizer_kwargs: Additional kwargs to pass to the tokenizer.
+        Returns:
+            `str`: A rendered prompt string.
+            or if tokenize=True:
+            `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. This
+            output is ready to pass to the model, either directly or via methods like `generate()`.
+        Examples:
+        ```python
+        >>> tokenizer = CohereTokenizerFast.from_pretrained("CohereForAI/c4ai-command-r-v01")
+        >>> tools = [
+            {
+                "name": "internet_search",
+                "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
+                "parameter_definitions": {
+                    "query": {
+                        "description": "Query to search the internet with",
+                        "type": "str",
+                        "required": True
+                    }
+                }
+            },
+            {
+                "name": "directly_answer",
+                "description": "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
+                "parameter_definitions": {}
+            }
+        ]
+        >>> conversation = [
+            {"role": "user", "content": "Whats the biggest penguin in the world?"}
+        ]
+        >>> # render the prompt, ready for user to inspect, or for input into the model:
+        >>> prompt = tokenizer.apply_tool_use_template(
+            conversation,
+            tools=tools,
+            tokenize=False,
+            add_generation_prompt=True,
+        )
+        >>> print(prompt)
+        <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
+        The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.
+        # System Preamble
+        ## Basic Rules
+        You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.
+        # User Preamble
+        ## Task and Context
+        You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
+        ## Style Guide
+        Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.
+        ## Available Tools
+        Here is a list of tools that you have available to you:
+        \`\`\`python
+        def internet_search(query: str) -> List[Dict]:
+            \"\"\"Returns a list of relevant document snippets for a textual query retrieved from the internet
+            Args:
+                query (str): Query to search the internet with
+            \"\"\"
+            pass
+        \`\`\`
+        \`\`\`python
+        def directly_answer() -> List[Dict]:
+            \"\"\"Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history
+            \"\"\"
+            pass
+        \`\`\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the `directly-answer` tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:
+        \`\`\`json
+        [
+            {
+                "tool_name": title of the tool in the specification,
+                "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
+            }
+        ]\`\`\`<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+        ```
+        >>> inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
+        >>> outputs = model.generate(inputs, max_new_tokens=128)
+        >>> print(tokenizer.decode(outputs[0]))
+        Action: ```json
+        [
+            {
+                "tool_name": "internet_search",
+                "parameters": {
+                    "query": "biggest penguin in the world"
+                }
+            }
+        ]
+        ```
+        """
+        # priority: `tool_use_template` argument > `tokenizer.tool_use_template` > `tokenizer.default_tool_use_template`
+        if tool_use_template is None:
+            if self.tool_use_template is not None:
+                tool_use_template = self.tool_use_template
+            else:
+                tool_use_template = self.default_tool_use_template
+        return self._apply_template_with_arguments(
+            conversation,
+            tools=tools,
+            template=tool_use_template,
+            **kwargs,
+        )
+    def apply_grounded_generation_template(
+        self,
+        conversation: Union[List[Dict[str, str]], "Conversation"],
+        documents: List[Dict],
+        citation_mode: Literal["fast", "accurate"] = "accurate",
+        grounded_generation_template: Optional[str] = None,
+        **kwargs
+    ) -> Union[str, List[int]]:
+        """Create a Command-R grounded generation (aka RAG) prompt.
+        Once rendered, the prompt instructs the model to generate a response with citations in, based on supplied documents.
+        Conceptually, this works in the same way as `apply_chat_template`, but takes additional `documents`
+        and `citation_mode` parameters.
+        Converts a Conversation object or a list of dictionaries with `"role"` and `"content"` keys and a list of
+        documents for the model to ground its response on into a prompt string, or a list of token ids.
+        This method will use the tokenizer's `default_grounded_generation_template` template specified at the class level.
+        You can override the default template using the `grounded_generation_template` kwarg but the quality of your results may decrease.
+        Args:
+            conversation (Union[List[Dict[str, str]], "Conversation"]): A Conversation object or list of dicts
+                with "role" and "content" keys, representing the chat history so far.
+            documents (List[Dict[str, str]]): A list of dicts, representing documents or tool outputs to ground your
+                generation on. A document is a semistructured dict with a string-to-string mapping. Common fields are
+                `url`, `title`, `snippet` etc., but field names should be descriptive of their contents. They will get rendered into the prompt.
+            citation_mode: either "accurate" (prompt the model to generate an answer first, then rewrite it with citation
+                spans in) or "fast", where the prompt instructs the model to generate an answer with citations in directly.
+                The former has higher quality citations, the latter requires fewer tokens to be generated.
+            grounded_generation_template (str, *optional*): A Jinja template to use for this conversion. If
+                this is not passed, the tokenizer's default grounded generation template will be used instead.
+            add_generation_prompt (bool, *optional*): Whether to end the prompt with the token(s) that indicate
+                the start of an assistant message. This is useful when you want to generate a response from the model.
+                Note that this argument will be passed to the chat template, and so it must be supported in the
+                template for this argument to have any effect.
+            tokenize (`bool`, defaults to `True`):
+                Whether to tokenize the output. If `False`, the output will be a string.
+            padding (`bool`, defaults to `False`):
+                Whether to pad sequences to the maximum length. Has no effect if tokenize is `False`.
+            truncation (`bool`, defaults to `False`):
+                Whether to truncate sequences at the maximum length. Has no effect if tokenize is `False`.
+            max_length (`int`, *optional*):
+                Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is `False`. If
+                not specified, the tokenizer's `max_length` attribute will be used as a default.
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors of a particular framework. Has no effect if tokenize is `False`. Acceptable
+                values are:
+                - `'tf'`: Return TensorFlow `tf.Tensor` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return NumPy `np.ndarray` objects.
+                - `'jax'`: Return JAX `jnp.ndarray` objects.
+            return_dict (`bool`, *optional*, defaults to `False`):
+                Whether to return a dictionary with named outputs. Has no effect if tokenize is `False`.
+            **tokenizer_kwargs: Additional kwargs to pass to the tokenizer.
+        Returns:
+            `str`: A rendered prompt string.
+            or if tokenize=True:
+            `List[int]`: A list of token ids representing the tokenized chat so far, including control tokens. This
+            output is ready to pass to the model, either directly or via methods like `generate()`.
+        Examples:
+        ```python
+        >>> tokenizer = CohereTokenizerFast.from_pretrained('CohereForAI/c4ai-command-r-v01')
+        >>> # define documents:
+        >>> documents = [
+            { "title": "Tall penguins", "text": "Emperor penguins are the tallest." },
+            { "title": "Penguin habitats", "text": "Emperor penguins only live in Antarctica."}
+        ]
+        >>> # define a conversation:
+        >>> conversation = [
+            {"role": "user", "content": "Whats the biggest penguin in the world?"}
+        ]
+        >>> # render the prompt, ready for user to inspect, or for input into the model:
+        >>> grounded_generation_prompt = tokenizer.apply_grounded_generation_template(
+            conversation,
+            documents=documents,
+            tokenize=False,
+            add_generation_prompt=True,
+        )
+        >>> print(grounded_generation_prompt)
+        <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
+        The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.
+        # System Preamble
+        ## Basic Rules
+        You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.
+        # User Preamble
+        ## Task and Context
+        You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.
+        ## Style Guide
+        Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Whats the biggest penguin in the world?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|><results>
+        Document: 0
+        title: Tall penguins
+        text: Emperor penguins are the tallest.
+        Document: 1
+        title: Penguin habitats
+        text: Emperor penguins only live in Antarctica.
+        </results><|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>Carefully perform the following instructions, in order, starting each with a new line.
+        Firstly, Decide which of the retrieved documents are relevant to the user's last input by writing 'Relevant Documents:' followed by comma-separated list of document numbers. If none are relevant, you should instead write 'None'.
+        Secondly, Decide which of the retrieved documents contain facts that should be cited in a good answer to the user's last input by writing 'Cited Documents:' followed a comma-separated list of document numbers. If you dont want to cite any of them, you should instead write 'None'.
+        Thirdly, Write 'Answer:' followed by a response to the user's last input in high quality natural english. Use the retrieved documents to help you. Do not insert any citations or grounding markup.
+        Finally, Write 'Grounded answer:' followed by a response to the user's last input in high quality natural english. Use the symbols <co: doc> and </co: doc> to indicate when a fact comes from a document in the search result, e.g <co: 0>my fact</co: 0> for a fact from document 0.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
+        >>> # `model` is assumed to be already loaded; it is not defined in this example
+        >>> inputs = tokenizer.encode(grounded_generation_prompt, add_special_tokens=False, return_tensors='pt')
+        >>> outputs = model.generate(inputs, max_new_tokens=128)
+        >>> print(tokenizer.decode(outputs[0]))
+        Relevant Documents: 0,1
+        Cited Documents: 0,1
+        Answer: The Emperor Penguin is the tallest or biggest penguin in the world. It is a bird that lives only in Antarctica and grows to a height of around 122 centimetres.
+        Grounded answer: The <co: 0>Emperor Penguin</co: 0> is the <co: 0>tallest</co: 0> or biggest penguin in the world. It is a bird that <co: 1>lives only in Antarctica</co: 1> and <co: 0>grows to a height of around 122 centimetres.</co: 0>
+        ```
+        """
+        # priority: `grounded_generation_template` argument > `tokenizer.grounded_generation_template` > `tokenizer.default_grounded_generation_template`
+        if grounded_generation_template is None:
+            if self.grounded_generation_template is not None:
+                grounded_generation_template = self.grounded_generation_template
+            else:
+                grounded_generation_template = self.default_grounded_generation_template
+        return self._apply_template_with_arguments(
+            conversation,
+            documents=documents,
+            template=grounded_generation_template,
+            citation_mode=citation_mode,
+            **kwargs,
+        )
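
Note on the `Grounded answer:` format shown in the docstring above: the `<co: N>...</co: N>` spans can be turned back into plain text plus a citation list with a small parser. The following is a minimal sketch and not part of this commit. `extract_citations` and `CITATION` are hypothetical names, and support for comma-separated ids such as `<co: 0,1>` is an assumption, since the docstring only shows single-document tags.

```python
import re

# Matches spans such as <co: 0>my fact</co: 0>. The backreference \1 requires
# the closing tag to repeat the document id(s) of the opening tag.
CITATION = re.compile(r"<co: (\d+(?:,\d+)*)>(.*?)</co: \1>", re.DOTALL)


def extract_citations(grounded_answer):
    """Return (plain_text, citations) for a grounded answer string."""
    citations = []
    for match in CITATION.finditer(grounded_answer):
        doc_ids = [int(i) for i in match.group(1).split(",")]
        citations.append({"text": match.group(2), "document_ids": doc_ids})
    # Strip the markup by keeping only the cited text of every span.
    plain_text = CITATION.sub(lambda m: m.group(2), grounded_answer)
    return plain_text, citations


answer = (
    "The <co: 0>Emperor Penguin</co: 0> is the <co: 0>tallest</co: 0> penguin. "
    "It <co: 1>lives only in Antarctica</co: 1>."
)
plain, cites = extract_citations(answer)
```

Here `plain` holds the answer without markup and `cites` lists each cited span together with the document numbers it refers to, which map onto the `Document: N` indices in the rendered `<results>` block.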
+    # TODO ArthurZ let's rely on the template processor instead, refactor all fast tokenizers
+    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+        eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+        output = bos_token_id + token_ids_0 + eos_token_id
+        if token_ids_1 is not None:
+            output = output + bos_token_id + token_ids_1 + eos_token_id
+        return output
+# register the tokenizer to AutoTokenizer
+AutoTokenizer.register(CohereConfig, fast_tokenizer_class=CohereTokenizerFast)
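
Note on `build_inputs_with_special_tokens` above: the method wraps each sequence in optional BOS/EOS ids. The sketch below reproduces that logic as a standalone function so the effect of the flags can be checked without loading the tokenizer. The ids 5 and 255001 are taken from `tokenizer_config.json` in this same commit (`<BOS_TOKEN>` and `<|END_OF_TURN_TOKEN|>`); the function name `build_inputs` and the token ids 10, 11, 12 are placeholders for illustration only.

```python
BOS_TOKEN_ID = 5       # "<BOS_TOKEN>" in added_tokens_decoder
EOS_TOKEN_ID = 255001  # "<|END_OF_TURN_TOKEN|>", the configured eos_token


def build_inputs(token_ids_0, token_ids_1=None, add_bos_token=True, add_eos_token=False):
    """Standalone copy of the BOS/EOS wrapping logic, for illustration."""
    bos = [BOS_TOKEN_ID] if add_bos_token else []
    eos = [EOS_TOKEN_ID] if add_eos_token else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        # The second sequence gets its own BOS/EOS pair.
        output = output + bos + token_ids_1 + eos
    return output


single = build_inputs([10, 11])
pair = build_inputs([10, 11], [12])
pair_with_eos = build_inputs([10, 11], [12], add_eos_token=True)
```

With the defaults from `tokenizer_config.json` (`add_bos_token: true`, `add_eos_token: false`), only a BOS id is prepended. This is presumably why the docstring example encodes the rendered prompt with `add_special_tokens=False`: the template output already starts with `<BOS_TOKEN>`.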

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0af6e6fe50ce1bb5611b103482de6bac000c82e06898138d57f35af121aec772
+size 12777406
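
The three lines above are a Git LFS pointer, not the tokenizer itself: the real `tokenizer.json` (about 12.8 MB) is stored in LFS and identified by its SHA-256 digest. As a rough sketch of how such a pointer can be read, assuming the simple `key value` line layout shown here (`parse_lfs_pointer` is a hypothetical helper, not part of this repository):

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0af6e6fe50ce1bb5611b103482de6bac000c82e06898138d57f35af121aec772
size 12777406
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
```

After downloading the real file, its SHA-256 digest and byte size can be compared against `pointer["digest"]` and `pointer["size_bytes"]` to check that the download is intact.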

tokenizer_config.json ADDED

	@@ -0,0 +1,319 @@

+{
+    "add_bos_token": true,
+    "add_eos_token": false,
+    "added_tokens_decoder": {
+      "0": {
+        "content": "<PAD>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": true
+      },
+      "1": {
+        "content": "<UNK>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": true
+      },
+      "2": {
+        "content": "<CLS>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": true
+      },
+      "3": {
+        "content": "<SEP>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": true
+      },
+      "4": {
+        "content": "<MASK_TOKEN>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": true
+      },
+      "5": {
+        "content": "<BOS_TOKEN>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": true
+      },
+      "6": {
+        "content": "<EOS_TOKEN>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": true
+      },
+      "7": {
+        "content": "<EOP_TOKEN>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": true
+      },
+      "255000": {
+        "content": "<|START_OF_TURN_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255001": {
+        "content": "<|END_OF_TURN_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255002": {
+        "content": "<|YES_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255003": {
+        "content": "<|NO_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255004": {
+        "content": "<|GOOD_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255005": {
+        "content": "<|BAD_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255006": {
+        "content": "<|USER_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255007": {
+        "content": "<|CHATBOT_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255008": {
+        "content": "<|SYSTEM_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255009": {
+        "content": "<|USER_0_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255010": {
+        "content": "<|USER_1_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255011": {
+        "content": "<|USER_2_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255012": {
+        "content": "<|USER_3_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255013": {
+        "content": "<|USER_4_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255014": {
+        "content": "<|USER_5_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255015": {
+        "content": "<|USER_6_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255016": {
+        "content": "<|USER_7_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255017": {
+        "content": "<|USER_8_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255018": {
+        "content": "<|USER_9_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255019": {
+        "content": "<|EXTRA_0_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255020": {
+        "content": "<|EXTRA_1_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255021": {
+        "content": "<|EXTRA_2_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255022": {
+        "content": "<|EXTRA_3_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255023": {
+        "content": "<|EXTRA_4_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255024": {
+        "content": "<|EXTRA_5_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255025": {
+        "content": "<|EXTRA_6_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255026": {
+        "content": "<|EXTRA_7_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255027": {
+        "content": "<|EXTRA_8_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      },
+      "255028": {
+        "content": "<|EXTRA_9_TOKEN|>",
+        "lstrip": false,
+        "normalized": false,
+        "rstrip": false,
+        "single_word": false,
+        "special": false
+      }
+    },
+    "auto_map": {
+      "AutoTokenizer": [
+        null,
+        "tokenization_cohere_fast.CohereTokenizerFast"
+      ]
+    },
+    "bos_token": "<BOS_TOKEN>",
+    "clean_up_tokenization_spaces": false,
+    "eos_token": "<|END_OF_TURN_TOKEN|>",
+    "legacy": true,
+    "model_max_length": 1000000000000000019884624838656,
+    "pad_token": "<PAD>",
+    "sp_model_kwargs": {},
+    "spaces_between_special_tokens": false,
+    "tokenizer_class": "CohereTokenizer",
+    "unk_token": null,
+    "use_default_system_prompt": false
+  }
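
For readers who want to check how the named tokens in this config map to ids, the following is a minimal sketch using only the standard library. `CONFIG_EXCERPT` is a hand-trimmed subset of the file above (three entries, with only the `content` and `special` fields kept), and `token_ids_by_content` is a hypothetical helper.

```python
import json

# Trimmed excerpt of tokenizer_config.json; the full file has many more entries.
CONFIG_EXCERPT = """
{
  "added_tokens_decoder": {
    "5": {"content": "<BOS_TOKEN>", "special": true},
    "255001": {"content": "<|END_OF_TURN_TOKEN|>", "special": false},
    "255007": {"content": "<|CHATBOT_TOKEN|>", "special": false}
  },
  "bos_token": "<BOS_TOKEN>",
  "eos_token": "<|END_OF_TURN_TOKEN|>"
}
"""


def token_ids_by_content(config):
    """Invert added_tokens_decoder into a {token string: integer id} map."""
    return {
        entry["content"]: int(token_id)
        for token_id, entry in config["added_tokens_decoder"].items()
    }


config = json.loads(CONFIG_EXCERPT)
ids = token_ids_by_content(config)
bos_id = ids[config["bos_token"]]
eos_id = ids[config["eos_token"]]
```

Note that in this config the `eos_token` is `<|END_OF_TURN_TOKEN|>` (id 255001) rather than `<EOS_TOKEN>` (id 6), and that the turn and role tokens in the 255000+ range are listed with `"special": false`.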