---
inference: false
license: other
license_name: mnpl
license_link: https://mistral.ai/licenses/MNPL-0.1.md
tags:
- code
language:
- code
base_model: mistralai/Codestral-22B-v0.1
model_creator: Mistral AI
model_name: Codestral-22B-v0.1
model_type: mistral
datasets:
- m-a-p/CodeFeedback-Filtered-Instruction
quantized_by: CISC
---
# Codestral-22B-v0.1 - SOTA GGUF
- Model creator: [Mistral AI](https://huggingface.co/mistralai)
- Original model: [Codestral-22B-v0.1](https://huggingface.co/mistralai/Codestral-22B-v0.1)
<!-- description start -->
## Description
This repo contains State Of The Art quantized GGUF format model files for [Codestral-22B-v0.1](https://huggingface.co/mistralai/Codestral-22B-v0.1).
Quantization was done with an importance matrix that was trained for ~1M tokens (256 batches of 4096 tokens) of answers from the [CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) dataset.
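As a quick sanity check of the "~1M tokens" figure, the batch count and batch size quoted above multiply out as follows (a trivial sketch; the variable names are illustrative only):

```python
# Figures quoted above: 256 batches of 4096 tokens each.
batches = 256
tokens_per_batch = 4096

total_tokens = batches * tokens_per_batch
print(f"{total_tokens:,} tokens")  # 1,048,576 tokens, i.e. 2**20
```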
The embedded chat template has been extended to support function calling via the OpenAI-compatible `tools` parameter, and Fill-in-Middle token metadata has been added; see the [example](#simple-llama-cpp-python-example-fill-in-middle-code). NOTE: Mistral's FIM requires support for [SPM infill mode](https://github.com/abetlen/llama-cpp-python/pull/1492)!
<!-- description end -->
<!-- prompt-template start -->
## Prompt template: Mistral v3
```
[AVAILABLE_TOOLS] [{"name": "function_name", "description": "Description", "parameters": {...}}, ...][/AVAILABLE_TOOLS][INST] {prompt}[/INST]
```
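If you need to assemble a raw prompt by hand (for example for the `-p` argument of `llama.cpp`), the template above can be filled in roughly as follows. This is a minimal sketch: `build_prompt` is a hypothetical helper, not part of any library, and leaving out the `[AVAILABLE_TOOLS]` block when no tools are passed is an assumption.

```python
import json


def build_prompt(prompt, tools=None):
    """Fill in the single-turn template shown above (hypothetical helper)."""
    parts = []
    if tools:
        # Tools are serialized as a JSON list, matching the template layout.
        parts.append(f"[AVAILABLE_TOOLS] {json.dumps(tools)}[/AVAILABLE_TOOLS]")
    # ASSUMPTION: with no tools, the tools block is simply left out.
    parts.append(f"[INST] {prompt}[/INST]")
    return "".join(parts)


example_tools = [{"name": "function_name", "description": "Description", "parameters": {}}]
print(build_prompt("Write a hello world program in C.", example_tools))
```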
<!-- prompt-template end -->
<!-- compatibility_gguf start -->
## Compatibility
These quantised GGUFv3 files are compatible with llama.cpp from February 27th 2024 onwards, as of commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307).
They are also compatible with many third-party UIs and libraries, provided they are built using a recent llama.cpp.
## Explanation of quantisation methods
<details>
<summary>Click to see details</summary>
The new methods available are:
* GGML_TYPE_IQ1_S - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.56 bits per weight (bpw)
* GGML_TYPE_IQ1_M - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.75 bpw
* GGML_TYPE_IQ2_XXS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.06 bpw
* GGML_TYPE_IQ2_XS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.31 bpw
* GGML_TYPE_IQ2_S - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.5 bpw
* GGML_TYPE_IQ2_M - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.7 bpw
* GGML_TYPE_IQ3_XXS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.06 bpw
* GGML_TYPE_IQ3_XS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.3 bpw
* GGML_TYPE_IQ3_S - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.44 bpw
* GGML_TYPE_IQ3_M - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.66 bpw
* GGML_TYPE_IQ4_XS - 4-bit quantization in super-blocks with an importance matrix applied, effectively using 4.25 bpw
* GGML_TYPE_IQ4_NL - 4-bit non-linearly mapped quantization with an importance matrix applied, effectively using 4.5 bpw
Refer to the Provided Files table below to see what files use which methods, and how.
</details>
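
The bits-per-weight figures listed in the details above give a rough way to estimate file size: multiply by the number of weights and divide by 8 bits per byte. The sketch below assumes about 22 billion weights (inferred only from the "22B" in the model name, not an exact count). Real files mix tensor types and carry metadata, so treat the result as a ballpark rather than the actual size.

```python
# Effective bits per weight, copied from the list above.
BPW = {
    "IQ1_S": 1.56, "IQ1_M": 1.75, "IQ2_XXS": 2.06, "IQ2_XS": 2.31,
    "IQ2_S": 2.5, "IQ2_M": 2.7, "IQ3_XXS": 3.06, "IQ3_XS": 3.3,
    "IQ3_S": 3.44, "IQ3_M": 3.66, "IQ4_XS": 4.25, "IQ4_NL": 4.5,
}

# ASSUMPTION: ~22 billion weights, inferred from the model name only.
N_WEIGHTS = 22e9


def estimate_size_gb(quant, n_weights=N_WEIGHTS):
    """Rough size in GB (10**9 bytes) if every weight used `quant`."""
    total_bits = BPW[quant] * n_weights
    return total_bits / 8 / 1e9


for name in ("IQ2_M", "IQ3_M", "IQ4_XS"):
    print(f"{name}: ~{estimate_size_gb(name):.1f} GB")
```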
<!-- compatibility_gguf end -->
<!-- README_GGUF.md-provided-files start -->
## Provided files
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| [Codestral-22B-v0.1.IQ1_S.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ1_S.gguf) | IQ1_S | 1 | 4.3 GB| 5.3 GB | smallest, significant quality loss - **TBD**: Waiting for [this issue](https://github.com/ggerganov/llama.cpp/issues/5996) to be resolved |
| [Codestral-22B-v0.1.IQ1_M.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ1_M.gguf) | IQ1_M | 1 | 4.8 GB| 5.8 GB | very small, significant quality loss |
| [Codestral-22B-v0.1.IQ2_XXS.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ2_XXS.gguf) | IQ2_XXS | 2 | 5.4 GB| 6.4 GB | very small, high quality loss |
| [Codestral-22B-v0.1.IQ2_XS.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ2_XS.gguf) | IQ2_XS | 2 | 6.0 GB| 7.0 GB | very small, high quality loss |
| [Codestral-22B-v0.1.IQ2_S.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ2_S.gguf) | IQ2_S | 2 | 6.4 GB| 7.4 GB | small, substantial quality loss |
| [Codestral-22B-v0.1.IQ2_M.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ2_M.gguf) | IQ2_M | 2 | 6.9 GB| 7.9 GB | small, greater quality loss |
| [Codestral-22B-v0.1.IQ3_XXS.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ3_XXS.gguf) | IQ3_XXS | 3 | 7.9 GB| 8.9 GB | very small, high quality loss |
| [Codestral-22B-v0.1.IQ3_XS.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ3_XS.gguf) | IQ3_XS | 3 | 8.4 GB| 9.4 GB | small, substantial quality loss |
| [Codestral-22B-v0.1.IQ3_S.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ3_S.gguf) | IQ3_S | 3 | 8.9 GB| 9.9 GB | small, greater quality loss |
| [Codestral-22B-v0.1.IQ3_M.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ3_M.gguf) | IQ3_M | 3 | 9.2 GB| 10.2 GB | medium, balanced quality - recommended |
| [Codestral-22B-v0.1.IQ4_XS.gguf](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.IQ4_XS.gguf) | IQ4_XS | 4 | 11.5 GB| 12.5 GB | small, substantial quality loss |
Generated importance matrix file: [Codestral-22B-v0.1.imatrix.dat](https://huggingface.co/CISCai/Codestral-22B-v0.1-SOTA-GGUF/blob/main/Codestral-22B-v0.1.imatrix.dat)
**Note**: the above RAM figures assume no GPU offloading with 4K context. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
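
To pick a file for a given memory budget, the "Max RAM required" column can be turned into a small lookup. This is only a sketch: the figures are copied from the table above (so they assume no GPU offloading and a 4K context), IQ1_S is left out because the table marks it as TBD, and `largest_quant_that_fits` is a hypothetical helper name.

```python
# "Max RAM required" in GB, copied from the table above.
# IQ1_S is omitted because the table marks it as TBD.
MAX_RAM_GB = {
    "IQ1_M": 5.8, "IQ2_XXS": 6.4, "IQ2_XS": 7.0, "IQ2_S": 7.4,
    "IQ2_M": 7.9, "IQ3_XXS": 8.9, "IQ3_XS": 9.4, "IQ3_S": 9.9,
    "IQ3_M": 10.2, "IQ4_XS": 12.5,
}


def largest_quant_that_fits(available_ram_gb):
    """Return the quant with the highest RAM need that still fits, or None."""
    fitting = {q: ram for q, ram in MAX_RAM_GB.items() if ram <= available_ram_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(largest_quant_that_fits(10.0))
```

Longer contexts need more memory than the table assumes, so leave some headroom when choosing.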
<!-- README_GGUF.md-provided-files end -->
<!-- README_GGUF.md-how-to-run start -->
## Example `llama.cpp` command
Make sure you are using `llama.cpp` from commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307) or later.
```shell
./main -ngl 57 -m Codestral-22B-v0.1.IQ4_XS.gguf --color -c 32768 --temp 0 --repeat-penalty 1.1 -p "[AVAILABLE_TOOLS] {tools}[/AVAILABLE_TOOLS][INST] {prompt}[/INST]"
```
Change `-ngl 57` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
Change `-c 32768` to the desired sequence length.
If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
If you are low on V/RAM try quantizing the K-cache with `-ctk q8_0` or even `-ctk q4_0` for big memory savings (depending on context size).
There is a similar option for V-cache (`-ctv`), however that is [not working yet](https://github.com/ggerganov/llama.cpp/issues/4425).
For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md).
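If you launch `llama.cpp` from a script, the adjustments described above (dropping `-ngl` without GPU acceleration, changing `-c`, optionally adding `-ctk`) can be expressed as an argument list. This is a sketch only: `build_main_command` is a hypothetical helper, and it uses just the flags that appear in the command above.

```python
import shlex


def build_main_command(model_path, prompt, n_gpu_layers=None, ctx_size=32768, k_cache_type=None):
    """Build the argument list for ./main using only the flags shown above."""
    args = ["./main", "-m", model_path, "--color", "-c", str(ctx_size),
            "--temp", "0", "--repeat-penalty", "1.1"]
    if n_gpu_layers is not None:
        # Leave -ngl out entirely when there is no GPU acceleration.
        args += ["-ngl", str(n_gpu_layers)]
    if k_cache_type is not None:
        # For example "q8_0" or "q4_0" to quantize the K-cache.
        args += ["-ctk", k_cache_type]
    args += ["-p", prompt]
    return args


cmd = build_main_command(
    "Codestral-22B-v0.1.IQ4_XS.gguf",
    "[INST] Write a bubble sort in Python.[/INST]",
    n_gpu_layers=57,
    k_cache_type="q8_0",
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting problems with prompts that contain brackets or quotes.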
## How to run from Python code
You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) module.
### How to load this model in Python code, using llama-cpp-python
For full documentation, please see: [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/).
#### First install the package
Run one of the following commands, according to your system:
```shell
# Prebuilt wheel with basic CPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
# Prebuilt wheel with NVidia CUDA acceleration (replace cu121 with cu122 etc. to match your CUDA version)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# Prebuilt wheel with Metal GPU acceleration
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
# Build base version with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# Or with Vulkan acceleration
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
# Or with Kompute acceleration
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
# Or with SYCL acceleration
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
# On Windows, to set the CMAKE_ARGS variable in PowerShell, use this format; e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
pip install llama-cpp-python
```
#### Simple llama-cpp-python example code
```python
from llama_cpp import Llama
# Chat Completion API
llm = Llama(model_path="./Codestral-22B-v0.1.IQ4_XS.gguf", n_gpu_layers=57, n_ctx=32768)
print(llm.create_chat_completion(
    repeat_penalty = 1.1,
    messages = [
        {
            "role": "user",
            "content": "Pick a LeetCode challenge and solve it in Python."
        }
    ]
))
```
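
The call above prints the whole response dictionary. Assuming the response follows the OpenAI-style chat completion layout, the assistant's text can be pulled out as sketched here; `extract_reply` is a hypothetical helper, and the sample dictionary is hand-written for illustration rather than real model output.

```python
def extract_reply(completion):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    message = completion["choices"][0]["message"]
    # "content" can be None when the model answers with tool calls instead.
    return message.get("content") or ""


# Hand-written stand-in for a real response, for illustration only.
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Here is a solution to Two Sum..."}}
    ]
}
print(extract_reply(sample))
```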
#### Simple llama-cpp-python example fill-in-middle code
```python
from llama_cpp import Llama
# Completion API
prompt = "def add("
suffix = "\n return sum\n\n"
llm = Llama(model_path="./Codestral-22B-v0.1.IQ4_XS.gguf", n_gpu_layers=57, n_ctx=32768, spm_infill=True)
output = llm.create_completion(
    temperature = 0.0,
    repeat_penalty = 1.0,
    prompt = prompt,
    suffix = suffix
)
# Models sometimes repeat suffix in response, attempt to filter that
response = output["choices"][0]["text"]
response_stripped = response.rstrip()
unwanted_response_suffix = suffix.rstrip()
unwanted_response_length = len(unwanted_response_suffix)
filtered = False
if unwanted_response_suffix and response_stripped[-unwanted_response_length:] == unwanted_response_suffix:
    response = response_stripped[:-unwanted_response_length]
    filtered = True
print(f"Fill-in-Middle completion{' (filtered)' if filtered else ''}:\n\n{prompt}\033[32m{response}\033[0m{suffix}")
```
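
The suffix filtering in the example above can be factored into a small function so it can be reused and tested without loading the model. This is a sketch with the same behaviour as the inline version; `strip_repeated_suffix` is a hypothetical name.

```python
def strip_repeated_suffix(response, suffix):
    """Drop a trailing copy of `suffix` from `response`, ignoring trailing whitespace.

    Returns (text, filtered), where `filtered` says whether anything was removed.
    """
    unwanted = suffix.rstrip()
    stripped = response.rstrip()
    if unwanted and stripped.endswith(unwanted):
        return stripped[:-len(unwanted)], True
    return response, False


text, filtered = strip_repeated_suffix(
    "a, b):\n    sum = a + b\n    return sum\n",
    "\n    return sum\n\n",
)
print(repr(text), filtered)
```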
#### Simple llama-cpp-python example function calling code
```python
from llama_cpp import Llama
# Chat Completion API
llm = Llama(model_path="./Codestral-22B-v0.1.IQ4_XS.gguf", n_gpu_layers=57, n_ctx=32768)
print(llm.create_chat_completion(
    temperature = 0.0,
    repeat_penalty = 1.1,
    messages = [
        {
            "role": "user",
            "content": "In a physics experiment, you are given an object with a mass of 50 kilograms and a volume of 10 cubic meters. Can you use the 'calculate_density' function to determine the density of this object?"
        },
        { # The tool_calls is from the response to the above with tool_choice active
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call__0_calculate_density_cmpl-...",
                    "type": "function",
                    "function": {
                        "name": "calculate_density",
                        "arguments": '{"mass": "50", "volume": "10"}'
                    }
                }
            ]
        },
        { # The tool_call_id is from tool_calls and content is the result from the function call you made
            "role": "tool",
            "content": "5.0",
            "tool_call_id": "call__0_calculate_density_cmpl-..."
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculate_density",
            "description": "Calculates the density of an object.",
            "parameters": {
                "type": "object",
                "properties": {
                    "mass": {
                        "type": "integer",
                        "description": "The mass of the object."
                    },
                    "volume": {
                        "type": "integer",
                        "description": "The volume of the object."
                    }
                },
                "required": [ "mass", "volume" ]
            }
        }
    }],
    #tool_choice={
    #    "type": "function",
    #    "function": {
    #        "name": "calculate_density"
    #    }
    #}
))
```
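
The example above hard-codes the `tool` message. In practice you would run the requested function yourself and build that message from the assistant's `tool_calls`. The sketch below shows one way to do this; `calculate_density`, `TOOLS`, and `run_tool_calls` are hypothetical local names, and converting every argument with `float()` is an assumption based on the string values seen in the example.

```python
import json


def calculate_density(mass, volume):
    """Local implementation of the tool advertised to the model."""
    return mass / volume


# Maps tool names to local callables (hypothetical registry).
TOOLS = {"calculate_density": calculate_density}


def run_tool_calls(assistant_message):
    """Execute each tool call and return the matching "tool" role messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        func = TOOLS[call["function"]["name"]]
        raw_args = json.loads(call["function"]["arguments"])
        # ASSUMPTION: all arguments are numeric; the model may emit them as strings.
        args = {key: float(value) for key, value in raw_args.items()}
        results.append({
            "role": "tool",
            "content": str(func(**args)),
            "tool_call_id": call["id"],
        })
    return results


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call__0_calculate_density_cmpl-...",
        "type": "function",
        "function": {
            "name": "calculate_density",
            "arguments": '{"mass": "50", "volume": "10"}',
        },
    }],
}
print(run_tool_calls(assistant_message))
```

The returned messages can be appended to `messages` after the assistant message before calling `create_chat_completion` again.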
<!-- README_GGUF.md-how-to-run end -->
<!-- original-model-card start -->
# Model Card for Codestral-22B-v0.1
Codestral-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash (more details in the [Blogpost](https://mistral.ai/news/codestral/)). The model can be queried:
- As instruct, for instance to answer any questions about a code snippet (write documentation, explain, factorize) or to generate code following specific indications
- As Fill in the Middle (FIM), to predict the middle tokens between a prefix and a suffix (very useful for software development add-ons like in VS Code)
## Installation
It is recommended to use `mistralai/Codestral-22B-v0.1` with [mistral-inference](https://github.com/mistralai/mistral-inference).
```
pip install mistral_inference
```
## Download
```py
from huggingface_hub import snapshot_download
from pathlib import Path
mistral_models_path = Path.home().joinpath('mistral_models', 'Codestral-22B-v0.1')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id="mistralai/Codestral-22B-v0.1", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)
```
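
Before running inference, it can help to confirm that the three files requested in `allow_patterns` actually arrived. A minimal sketch, where `missing_files` is a hypothetical helper and the file names are taken from the download call above:

```python
from pathlib import Path

# File names taken from the allow_patterns list in the download call above.
EXPECTED_FILES = ["params.json", "consolidated.safetensors", "tokenizer.model.v3"]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


models_path = Path.home().joinpath("mistral_models", "Codestral-22B-v0.1")
print("Missing files:", missing_files(models_path))
```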
### Chat
After installing `mistral_inference`, a `mistral-chat` CLI command should be available in your environment.
```
mistral-chat $HOME/mistral_models/Codestral-22B-v0.1 --instruct --max_tokens 256
```
Entering the prompt "Write me a function that computes fibonacci in Rust" should give something along the following lines:
```
Sure, here's a simple implementation of a function that computes the Fibonacci sequence in Rust. This function takes an integer `n` as an argument and returns the `n`th Fibonacci number.
fn fibonacci(n: u32) -> u32 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}
fn main() {
    let n = 10;
    println!("The {}th Fibonacci number is: {}", n, fibonacci(n));
}
This function uses recursion to calculate the Fibonacci number. However, it's not the most efficient solution because it performs a lot of redundant calculations. A more efficient solution would use a loop to iteratively calculate the Fibonacci numbers.
```
### Fill-in-the-middle (FIM)
After installing `mistral_inference`, run `pip install --upgrade mistral_common` to make sure you have `mistral_common>=1.2` installed:
```py
from mistral_inference.model import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.request import FIMRequest
tokenizer = MistralTokenizer.v3()
# Adjust this path to the folder that actually contains the downloaded model files
model = Transformer.from_folder("~/codestral-22B-240529")
prefix = """def add("""
suffix = """ return sum"""
request = FIMRequest(prompt=prefix, suffix=suffix)
tokens = tokenizer.encode_fim(request).tokens
out_tokens, _ = generate([tokens], model, max_tokens=256, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])
middle = result.split(suffix)[0].strip()
print(middle)
```
Should give something along the following lines:
```
num1, num2):
    # Add two numbers
    sum = num1 + num2
    # return the sum
```
## Limitations
Codestral-22B-v0.1 does not have any moderation mechanisms. We're looking forward to engaging with the community on ways to make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.
## License
Codestral-22B-v0.1 is released under the `MNPL-0.1` license.
## The Mistral AI Team
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Antoine Roux, Arthur Mensch, Audrey Herblin-Stoop, Baptiste Bout, Baudouin de Monicault, Blanche Savary, Bam4d, Caroline Feldman, Devendra Singh Chaplot, Diego de las Casas, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona, Henri Roussez, Jean-Malo Delignon, Jia Li, Justus Murke, Kartik Khandelwal, Lawrence Stewart, Louis Martin, Louis Ternon, Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat, Marie Torelli, Marie-Anne Lachaux, Marjorie Janiewicz, Mickael Seznec, Nicolas Schuhl, Patrick von Platen, Romain Sauvestre, Pierre Stock, Sandeep Subramanian, Saurabh Garg, Sophia Yang, Szymon Antoniak, Teven Le Scao, Thibaut Lavril, Thibault Schueller, Timothée Lacroix, Théophile Gervet, Thomas Wang, Valera Nemychnikova, Wendy Shang, William El Sayed, William Marshall