
Codestral-22B-v0.1 - SOTA GGUF

Description

This repo contains State Of The Art quantized GGUF format model files for Codestral-22B-v0.1.

Quantization was done with an importance matrix that was trained for ~1M tokens (256 batches of 4096 tokens) of answers from the CodeFeedback-Filtered-Instruction dataset.

The embedded chat template has been extended to support function calling via the OpenAI-compatible tools parameter, and Fill-in-Middle (FIM) token metadata has been added; see the function calling and fill-in-middle examples below. NOTE: Mistral's FIM requires support for SPM infill mode!
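To make the SPM note concrete, here is a minimal sketch of the segment ordering. It assumes the Suffix-Prefix-Middle layout used by Mistral's tokenizer (suffix marker and suffix first, then prefix marker and prefix). `fim_layout` is a hypothetical helper that only shows ordering as plain text; real inference encodes the markers as control tokens, and the PSM branch is included purely for contrast.

```python
def fim_layout(prefix, suffix, spm=True):
    """Show FIM segment ordering as text (illustration only, not tokenization)."""
    if spm:
        # Suffix-Prefix-Middle: generation continues directly after the prefix.
        return f"[SUFFIX]{suffix}[PREFIX]{prefix}"
    # Prefix-Suffix-Middle ordering, shown only for comparison (assumed layout).
    return f"[PREFIX]{prefix}[SUFFIX]{suffix}"

print(fim_layout("def add(", "    return sum"))
```

A runtime that only supports the PSM ordering would feed the segments in the wrong order for this model, which is why SPM infill support matters.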

Prompt template: Mistral v3

[AVAILABLE_TOOLS] [{"name": "function_name", "description": "Description", "parameters": {...}}, ...][/AVAILABLE_TOOLS][INST] {prompt}[/INST]
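The template above can also be assembled in code. The following is a minimal sketch for a single-turn prompt; `build_prompt` is a hypothetical helper (not part of llama.cpp or llama-cpp-python), and the exact whitespace produced by the embedded chat template may differ.

```python
import json

def build_prompt(prompt, tools=None):
    """Format a single-turn Mistral v3 style prompt (hypothetical helper)."""
    parts = []
    if tools:
        # Tool definitions come first, wrapped in the AVAILABLE_TOOLS markers.
        parts.append(f"[AVAILABLE_TOOLS] {json.dumps(tools)}[/AVAILABLE_TOOLS]")
    parts.append(f"[INST] {prompt}[/INST]")
    return "".join(parts)

# Placeholder tool definition mirroring the template above.
tools = [{"name": "function_name", "description": "Description", "parameters": {}}]
print(build_prompt("Hello", tools))
```

When using the chat completion API, the embedded template does this formatting for you; building the string by hand is mainly useful with the raw -p option.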

Compatibility

These quantised GGUFv3 files are compatible with llama.cpp from February 27th 2024 onwards, as of commit 0becb22.

They are also compatible with many third party UIs and libraries provided they are built using a recent llama.cpp.

Explanation of quantisation methods


The new methods available are:

  • GGML_TYPE_IQ1_S - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.56 bits per weight (bpw)
  • GGML_TYPE_IQ1_M - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.75 bpw
  • GGML_TYPE_IQ2_XXS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.06 bpw
  • GGML_TYPE_IQ2_XS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.31 bpw
  • GGML_TYPE_IQ2_S - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.5 bpw
  • GGML_TYPE_IQ2_M - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.7 bpw
  • GGML_TYPE_IQ3_XXS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.06 bpw
  • GGML_TYPE_IQ3_XS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.3 bpw
  • GGML_TYPE_IQ3_S - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.44 bpw
  • GGML_TYPE_IQ3_M - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.66 bpw
  • GGML_TYPE_IQ4_XS - 4-bit quantization in super-blocks with an importance matrix applied, effectively using 4.25 bpw
  • GGML_TYPE_IQ4_NL - 4-bit non-linearly mapped quantization with an importance matrix applied, effectively using 4.5 bpw

Refer to the Provided Files table below to see what files use which methods, and how.

Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ---- |
| Codestral-22B-v0.1.IQ1_S.gguf | IQ1_S | 1 | 4.3 GB | 5.3 GB | smallest, significant quality loss - TBD: Waiting for this issue to be resolved |
| Codestral-22B-v0.1.IQ1_M.gguf | IQ1_M | 1 | 4.8 GB | 5.8 GB | very small, significant quality loss |
| Codestral-22B-v0.1.IQ2_XXS.gguf | IQ2_XXS | 2 | 5.4 GB | 6.4 GB | very small, high quality loss |
| Codestral-22B-v0.1.IQ2_XS.gguf | IQ2_XS | 2 | 6.0 GB | 7.0 GB | very small, high quality loss |
| Codestral-22B-v0.1.IQ2_S.gguf | IQ2_S | 2 | 6.4 GB | 7.4 GB | small, substantial quality loss |
| Codestral-22B-v0.1.IQ2_M.gguf | IQ2_M | 2 | 6.9 GB | 7.9 GB | small, greater quality loss |
| Codestral-22B-v0.1.IQ3_XXS.gguf | IQ3_XXS | 3 | 7.9 GB | 8.9 GB | very small, high quality loss |
| Codestral-22B-v0.1.IQ3_XS.gguf | IQ3_XS | 3 | 8.4 GB | 9.4 GB | small, substantial quality loss |
| Codestral-22B-v0.1.IQ3_S.gguf | IQ3_S | 3 | 8.9 GB | 9.9 GB | small, greater quality loss |
| Codestral-22B-v0.1.IQ3_M.gguf | IQ3_M | 3 | 9.2 GB | 10.2 GB | medium, balanced quality - recommended |
| Codestral-22B-v0.1.IQ4_XS.gguf | IQ4_XS | 4 | 11.5 GB | 12.5 GB | small, substantial quality loss |

Generated importance matrix file: Codestral-22B-v0.1.imatrix.dat

Note: the above RAM figures assume no GPU offloading with 4K context. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
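As a rough aid for choosing a file, the sketch below picks the largest quant whose listed RAM requirement fits a given budget. The sizes are copied from the table above; the 1.0 GB overhead is simply the gap between the table's "Size" and "Max RAM required" columns (4K context, no GPU offloading), and `largest_quant_that_fits` is a hypothetical helper.

```python
# (quant method, file size in GB), copied from the Provided files table.
QUANTS = [
    ("IQ1_S", 4.3), ("IQ1_M", 4.8), ("IQ2_XXS", 5.4), ("IQ2_XS", 6.0),
    ("IQ2_S", 6.4), ("IQ2_M", 6.9), ("IQ3_XXS", 7.9), ("IQ3_XS", 8.4),
    ("IQ3_S", 8.9), ("IQ3_M", 9.2), ("IQ4_XS", 11.5),
]

# Difference between "Max RAM required" and "Size" in the table.
OVERHEAD_GB = 1.0

def largest_quant_that_fits(available_ram_gb):
    """Return the biggest quant that fits in the RAM budget, or None."""
    fitting = [(name, size) for name, size in QUANTS
               if size + OVERHEAD_GB <= available_ram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]

print(largest_quant_that_fits(8.0))  # IQ2_M
```

Larger contexts need more memory than the table assumes, so treat the result as an upper bound rather than a guarantee.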

Example llama.cpp command

Make sure you are using llama.cpp from commit 0becb22 or later.

./main -ngl 57 -m Codestral-22B-v0.1.IQ4_XS.gguf --color -c 32768 --temp 0 --repeat-penalty 1.1 -p "[AVAILABLE_TOOLS] {tools}[/AVAILABLE_TOOLS][INST] {prompt}[/INST]"

Change -ngl 57 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

Change -c 32768 to the desired sequence length.

If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins

If you are low on V/RAM try quantizing the K-cache with -ctk q8_0 or even -ctk q4_0 for big memory savings (depending on context size). There is a similar option for V-cache (-ctv), however that is not working yet.
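To get a feel for the savings, the sketch below estimates the K-cache size at a given context length. The architecture values (56 layers, 8 KV heads, head dimension 128) are assumptions about Codestral-22B and should be checked against the GGUF metadata; the block sizes are those of ggml's q8_0 and q4_0 formats (34 and 18 bytes per 32 elements), with f16 expressed as 64 bytes per 32 elements for comparison.

```python
# Bytes needed to store 32 cache elements in each format.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}

def k_cache_bytes(n_ctx, cache_type, n_layers=56, n_kv_heads=8, head_dim=128):
    """Estimate K-cache size in bytes.

    The architecture defaults are assumptions, not values read from the model.
    """
    elements = n_layers * n_kv_heads * head_dim * n_ctx
    return elements // 32 * BLOCK_BYTES[cache_type]

for cache_type in BLOCK_BYTES:
    gib = k_cache_bytes(32768, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
```

Under these assumptions the K-cache at 32768 tokens shrinks from about 3.5 GiB (f16) to roughly 1.9 GiB with q8_0 and 1.0 GiB with q4_0; the V-cache is the same size as the f16 K-cache and is not reduced by -ctk.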

For other parameters and how to use them, please refer to the llama.cpp documentation

How to run from Python code

You can use GGUF models from Python using the llama-cpp-python module.

How to load this model in Python code, using llama-cpp-python

For full documentation, please see: llama-cpp-python docs.

First install the package

Run one of the following commands, according to your system:

# Prebuilt wheel with basic CPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
# Prebuilt wheel with NVidia CUDA acceleration (replace cu121 with cu122 etc. to match your CUDA version)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# Prebuilt wheel with Metal GPU acceleration
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
# Build base version with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# Or with Vulkan acceleration
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
# Or with Kompute acceleration
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
# Or with SYCL acceleration
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

# On Windows, set CMAKE_ARGS in PowerShell using this format, e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
pip install llama-cpp-python

Simple llama-cpp-python example code

from llama_cpp import Llama

# Chat Completion API

llm = Llama(model_path="./Codestral-22B-v0.1.IQ4_XS.gguf", n_gpu_layers=57, n_ctx=32768)
print(llm.create_chat_completion(
    repeat_penalty = 1.1,
    messages = [
        {
            "role": "user",
            "content": "Pick a LeetCode challenge and solve it in Python."
        }
    ]
))
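The example above prints the whole response dictionary. A minimal sketch for pulling out just the assistant text follows, assuming the OpenAI-style response shape that create_chat_completion returns; `extract_reply` is a hypothetical helper, and the stand-in dictionary lets the snippet run without loading a model.

```python
def extract_reply(response):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"]

# Stand-in response with the same shape as a real chat completion.
example_response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Hello"},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(example_response))  # Hello
```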

Simple llama-cpp-python example fill-in-middle code

from llama_cpp import Llama

# Completion API

prompt = "def add("
suffix = "\n    return sum\n\n"

llm = Llama(model_path="./Codestral-22B-v0.1.IQ4_XS.gguf", n_gpu_layers=57, n_ctx=32768, spm_infill=True)
output = llm.create_completion(
    temperature = 0.0,
    repeat_penalty = 1.0,
    prompt = prompt,
    suffix = suffix
)

# Models sometimes repeat suffix in response, attempt to filter that
response = output["choices"][0]["text"]
response_stripped = response.rstrip()
unwanted_response_suffix = suffix.rstrip()
unwanted_response_length = len(unwanted_response_suffix)

filtered = False
if unwanted_response_suffix and response_stripped[-unwanted_response_length:] == unwanted_response_suffix:
    response = response_stripped[:-unwanted_response_length]
    filtered = True

print(f"Fill-in-Middle completion{' (filtered)' if filtered else ''}:\n\n{prompt}\033[32m{response}\033[0m{suffix}")

Simple llama-cpp-python example function calling code

from llama_cpp import Llama

# Chat Completion API

llm = Llama(model_path="./Codestral-22B-v0.1.IQ4_XS.gguf", n_gpu_layers=57, n_ctx=32768)
print(llm.create_chat_completion(
      temperature = 0.0,
      repeat_penalty = 1.1,
      messages = [
        {
          "role": "user",
          "content": "In a physics experiment, you are given an object with a mass of 50 kilograms and a volume of 10 cubic meters. Can you use the 'calculate_density' function to determine the density of this object?"
        },
        { # The tool_calls is from the response to the above with tool_choice active
          "role": "assistant",
          "content": None,
          "tool_calls": [
            {
              "id": "call__0_calculate_density_cmpl-...",
              "type": "function",
              "function": {
                "name": "calculate_density",
                "arguments": '{"mass": "50", "volume": "10"}'
              }
            }
          ]
        },
        { # The tool_call_id is from tool_calls and content is the result from the function call you made
          "role": "tool",
          "content": "5.0",
          "tool_call_id": "call__0_calculate_density_cmpl-..."
        }
      ],
      tools=[{
        "type": "function",
        "function": {
          "name": "calculate_density",
          "description": "Calculates the density of an object.",
          "parameters": {
            "type": "object",
            "properties": {
              "mass": {
                "type": "integer",
                "description": "The mass of the object."
              },
              "volume": {
                "type": "integer",
                "description": "The volume of the object."
              }
            },
            "required": [ "mass", "volume" ]
          }
        }
      }],
      #tool_choice={
      #  "type": "function",
      #  "function": {
      #    "name": "calculate_density"
      #  }
      #}
))
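The tool message in the example above has to be produced by your own code. The following is a minimal sketch of that step: it parses a tool call in the shape shown above, runs a local function, and builds the matching tool message. `calculate_density`, `TOOLS`, `run_tool_call`, and the call id are hypothetical names for illustration; the float coercion is there because the model may emit numbers as strings, as in the example arguments.

```python
import json

def calculate_density(mass, volume):
    """Local implementation of the tool described to the model."""
    return mass / volume

# Maps tool names the model may call to local Python functions.
TOOLS = {"calculate_density": calculate_density}

def run_tool_call(tool_call):
    """Execute one tool call and return the message to send back."""
    function = tool_call["function"]
    arguments = json.loads(function["arguments"])
    # Coerce every argument to float, since values may arrive as strings.
    result = TOOLS[function["name"]](**{k: float(v) for k, v in arguments.items()})
    return {
        "role": "tool",
        "content": str(result),
        "tool_call_id": tool_call["id"],
    }

example_call = {
    "id": "call_example",  # hypothetical id; use the one from the response
    "type": "function",
    "function": {
        "name": "calculate_density",
        "arguments": '{"mass": "50", "volume": "10"}',
    },
}
print(run_tool_call(example_call))
```

The returned dictionary is appended to the messages list after the assistant message that contained the tool call, and the conversation is then sent to the model again.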

Model Card for Codestral-22B-v0.1

Codestral-22B-v0.1 is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash (more details in the Blogpost). The model can be queried:

  • As instruct, for instance to answer any questions about a code snippet (write documentation, explain, factorize) or to generate code following specific indications
  • As Fill in the Middle (FIM), to predict the middle tokens between a prefix and a suffix (very useful for software development add-ons like in VS Code)

Installation

It is recommended to use mistralai/Codestral-22B-v0.1 with mistral-inference.

pip install mistral_inference

Download

from huggingface_hub import snapshot_download
from pathlib import Path

mistral_models_path = Path.home().joinpath('mistral_models', 'Codestral-22B-v0.1')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistralai/Codestral-22B-v0.1", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)

Chat

After installing mistral_inference, a mistral-chat CLI command should be available in your environment.

mistral-chat $HOME/mistral_models/Codestral-22B-v0.1 --instruct --max_tokens 256

Prompting it with "Write me a function that computes fibonacci in Rust" should give something along the following lines:

Sure, here's a simple implementation of a function that computes the Fibonacci sequence in Rust. This function takes an integer `n` as an argument and returns the `n`th Fibonacci number.

fn fibonacci(n: u32) -> u32 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main() {
    let n = 10;
    println!("The {}th Fibonacci number is: {}", n, fibonacci(n));
}

This function uses recursion to calculate the Fibonacci number. However, it's not the most efficient solution because it performs a lot of redundant calculations. A more efficient solution would use a loop to iteratively calculate the Fibonacci numbers.

Fill-in-the-middle (FIM)

Install mistral_inference, then run pip install --upgrade mistral_common to make sure mistral_common>=1.2 is installed:

from pathlib import Path

from mistral_inference.model import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.request import FIMRequest

tokenizer = MistralTokenizer.v3()
# Load from the folder used in the Download section; Python does not expand a literal "~".
model = Transformer.from_folder(str(Path.home() / "mistral_models" / "Codestral-22B-v0.1"))

prefix = """def add("""
suffix = """    return sum"""

request = FIMRequest(prompt=prefix, suffix=suffix)

tokens = tokenizer.encode_fim(request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=256, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])

middle = result.split(suffix)[0].strip()
print(middle)

This should give something along the following lines:

num1, num2):

    # Add two numbers
    sum = num1 + num2

    # return the sum

Limitations

Codestral-22B-v0.1 does not have any moderation mechanisms. We're looking forward to engaging with the community on ways to make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.

License

Codestral-22B-v0.1 is released under the MNPL-0.1 license.

The Mistral AI Team

Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Antoine Roux, Arthur Mensch, Audrey Herblin-Stoop, Baptiste Bout, Baudouin de Monicault, Blanche Savary, Bam4d, Caroline Feldman, Devendra Singh Chaplot, Diego de las Casas, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona, Henri Roussez, Jean-Malo Delignon, Jia Li, Justus Murke, Kartik Khandelwal, Lawrence Stewart, Louis Martin, Louis Ternon, Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat, Marie Torelli, Marie-Anne Lachaux, Marjorie Janiewicz, Mickael Seznec, Nicolas Schuhl, Patrick von Platen, Romain Sauvestre, Pierre Stock, Sandeep Subramanian, Saurabh Garg, Sophia Yang, Szymon Antoniak, Teven Le Scao, Thibaut Lavril, Thibault Schueller, Timothée Lacroix, Théophile Gervet, Thomas Wang, Valera Nemychnikova, Wendy Shang, William El Sayed, William Marshall
