CISCai

Updated offical chat template (tool_calls fixed when None), removed silly tool_call_id length restriction and added padding token.

0e014b7 verified 5 months ago

preview code

raw

history blame

13.5 kB

	---
	base_model: mistralai/Mistral-Nemo-Instruct-2407
	language:
	- en
	pipeline_tag: text-generation
	license: apache-2.0
	model_creator: Mistral AI
	model_name: Mistral-Nemo-Instruct-2407
	model_type: mistral
	quantized_by: CISC
	---

	# Mistral-Nemo-Instruct-2407 - SOTA GGUF
	- Model creator: [Mistral AI](https://huggingface.co/mistralai)
	- Original model: [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)

	<!-- description start -->
	## Description

	This repo contains State Of The Art quantized GGUF format model files for [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407).

	August 16th update: Updated offical chat template (`tool_calls` fixed when `None`), removed silly `tool_call_id` length restriction and added `padding` token.

	Quantization was done with an importance matrix that was trained for ~1M tokens (256 batches of 4096 tokens) of [groups_merged-enhancedV3.txt](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384) and [wiki.train.raw](https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/train.txt) concatenated.

	The embedded chat template is the updated one with correct Tekken tokenization and function calling support via OpenAI-compatible `tools` parameter, see [example](#simple-llama-cpp-python-example-function-calling-code).

	<!-- description end -->


	<!-- prompt-template start -->
	## Prompt template: Mistral Tekken

	```
	[AVAILABLE_TOOLS][{"name": "function_name", "description": "Description", "parameters": {...}}, ...][/AVAILABLE_TOOLS][INST]{prompt}[/INST]
	```

	<!-- prompt-template end -->


	<!-- compatibility_gguf start -->
	## Compatibility

	These quantised GGUFv3 files are compatible with llama.cpp from July 22nd 2024 onwards, as of commit [50e0535](https://github.com/ggerganov/llama.cpp/commit/50e05353e88d50b644688caa91f5955e8bdb9eb9)

	They are also compatible with many third party UIs and libraries provided they are built using a recent llama.cpp.

	## Explanation of quantisation methods

	<details>
	<summary>Click to see details</summary>

	The new methods available are:

	* GGML_TYPE_IQ1_S - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.56 bits per weight (bpw)
	* GGML_TYPE_IQ1_M - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.75 bpw
	* GGML_TYPE_IQ2_XXS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.06 bpw
	* GGML_TYPE_IQ2_XS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.31 bpw
	* GGML_TYPE_IQ2_S - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.5 bpw
	* GGML_TYPE_IQ2_M - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.7 bpw
	* GGML_TYPE_IQ3_XXS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.06 bpw
	* GGML_TYPE_IQ3_XS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.3 bpw
	* GGML_TYPE_IQ3_S - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.44 bpw
	* GGML_TYPE_IQ3_M - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.66 bpw
	* GGML_TYPE_IQ4_XS - 4-bit quantization in super-blocks with an importance matrix applied, effectively using 4.25 bpw
	* GGML_TYPE_IQ4_NL - 4-bit non-linearly mapped quantization with an importance matrix applied, effectively using 4.5 bpw

	Refer to the Provided Files table below to see what files use which methods, and how.
	</details>
	<!-- compatibility_gguf end -->

	<!-- README_GGUF.md-provided-files start -->
	## Provided files

	\| Name \| Quant method \| Bits \| Size \| Max RAM required \| Use case \|
	\| ---- \| ---- \| ---- \| ---- \| ---- \| ----- \|
	\| [Mistral-Nemo-Instruct-2407.IQ1_S.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ1_S.gguf) \| IQ1_S \| 1 \| 2.8 GB\| 3.4 GB \| smallest, significant quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ1_M.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ1_M.gguf) \| IQ1_M \| 1 \| 3.0 GB\| 3.6 GB \| very small, significant quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ2_XXS.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ2_XXS.gguf) \| IQ2_XXS \| 2 \| 3.3 GB\| 3.9 GB \| very small, high quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ2_XS.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ2_XS.gguf) \| IQ2_XS \| 2 \| 3.6 GB\| 4.2 GB \| very small, high quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ2_S.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ2_S.gguf) \| IQ2_S \| 2 \| 3.9 GB\| 4.4 GB \| small, substantial quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ2_M.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ2_M.gguf) \| IQ2_M \| 2 \| 4.1 GB\| 4.7 GB \| small, greater quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ3_XXS.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ3_XXS.gguf) \| IQ3_XXS \| 3 \| 4.6 GB\| 5.2 GB \| very small, high quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ3_XS.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ3_XS.gguf) \| IQ3_XS \| 3 \| 4.9 GB\| 5.5 GB \| small, substantial quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ3_S.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ3_S.gguf) \| IQ3_S \| 3 \| 5.2 GB\| 5.8 GB \| small, greater quality loss \|
	\| [Mistral-Nemo-Instruct-2407.IQ3_M.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ3_M.gguf) \| IQ3_M \| 3 \| 5.3 GB\| 5.9 GB \| medium, balanced quality - recommended \|
	\| [Mistral-Nemo-Instruct-2407.IQ4_XS.gguf](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.IQ4_XS.gguf) \| IQ4_XS \| 4 \| 6.3 GB\| 6.9 GB \| small, substantial quality loss \|

	Generated importance matrix file: [Mistral-Nemo-Instruct-2407.imatrix.dat](https://huggingface.co/CISCai/Mistral-Nemo-Instruct-2407-SOTA-GGUF/blob/main/Mistral-Nemo-Instruct-2407.imatrix.dat)

	Note: the above RAM figures assume no GPU offloading with 4K context. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

	<!-- README_GGUF.md-provided-files end -->

	<!-- README_GGUF.md-how-to-run start -->
	## Example `llama.cpp` command

	Make sure you are using `llama.cpp` from commit [50e0535](https://github.com/ggerganov/llama.cpp/commit/50e05353e88d50b644688caa91f5955e8bdb9eb9) or later.

	```shell
	./llama-cli -ngl 41 -m Mistral-Nemo-Instruct-2407.IQ4_XS.gguf --color -c 131072 --temp 0.3 --repeat-penalty 1.1 -p "[AVAILABLE_TOOLS]{tools}[/AVAILABLE_TOOLS][INST]{prompt}[/INST]"
	```

	This model is very temperature sensitive, keep it between 0.3 and 0.4 for best results! Also note the lack of spaces between special tokens and input in the prompt; this model is not using the regular Mistral chat template.

	Change `-ngl 41` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

	Change `-c 131072` to the desired sequence length.

	If you are low on V/RAM try quantizing the K-cache with `-ctk q8_0` or even `-ctk q4_0` for big memory savings (depending on context size).
	There is a similar option for V-cache (`-ctv`), however that is [not working yet](https://github.com/ggerganov/llama.cpp/issues/4425) unless you enable Flash Attention (`-fa`) too.

	For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)

	## How to run from Python code

	You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) module.

	### How to load this model in Python code, using llama-cpp-python

	For full documentation, please see: [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/).

	#### First install the package

	Run one of the following commands, according to your system:

	```shell
	# Prebuilt wheel with basic CPU support
	pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
	# Prebuilt wheel with NVidia CUDA acceleration
	pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121 (or cu122 etc.)
	# Prebuilt wheel with Metal GPU acceleration
	pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
	# Build base version with no GPU acceleration
	pip install llama-cpp-python
	# With NVidia CUDA acceleration
	CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
	# Or with OpenBLAS acceleration
	CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
	# Or with AMD ROCm GPU acceleration (Linux only)
	CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python
	# Or with Metal GPU acceleration for macOS systems only
	CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
	# Or with Vulkan acceleration
	CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
	# Or with SYCL acceleration
	CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

	# In windows, to set the variables CMAKE_ARGS in PowerShell, follow this format; eg for NVidia CUDA:
	$env:CMAKE_ARGS = "-DGGML_CUDA=on"
	pip install llama-cpp-python
	```

	#### Simple llama-cpp-python example code

	```python
	from llama_cpp import Llama

	# Chat Completion API

	llm = Llama(model_path="./Mistral-Nemo-Instruct-2407.IQ4_XS.gguf", n_gpu_layers=41, n_ctx=131072)
	print(llm.create_chat_completion(
	messages = [
	{
	"role": "user",
	"content": "Pick a LeetCode challenge and solve it in Python."
	}
	]
	))
	```

	#### Simple llama-cpp-python example function calling code

	```python
	from llama_cpp import Llama

	# Chat Completion API

	grammar = LlamaGrammar.from_json_schema(json.dumps({
	"type": "array",
	"items": {
	"type": "object",
	"required": [ "name", "arguments" ],
	"properties": {
	"name": {
	"type": "string"
	},
	"arguments": {
	"type": "object"
	}
	}
	}
	}))

	llm = Llama(model_path="./Mistral-Nemo-Instruct-2407.IQ4_XS.gguf", n_gpu_layers=41, n_ctx=131072)
	response = llm.create_chat_completion(
	temperature = 0.0,
	repeat_penalty = 1.1,
	messages = [
	{
	"role": "user",
	"content": "What's the weather like in Oslo and Stockholm?"
	}
	],
	tools=[{
	"type": "function",
	"function": {
	"name": "get_current_weather",
	"description": "Get the current weather in a given location",
	"parameters": {
	"type": "object",
	"properties": {
	"location": {
	"type": "string",
	"description": "The city and state, e.g. San Francisco, CA"
	},
	"unit": {
	"type": "string",
	"enum": [ "celsius", "fahrenheit" ]
	}
	},
	"required": [ "location" ]
	}
	}
	}],
	grammar = grammar
	)
	print(json.loads(response["choices"][0]["text"]))

	print(llm.create_chat_completion(
	temperature = 0.0,
	repeat_penalty = 1.1,
	messages = [
	{
	"role": "user",
	"content": "What's the weather like in Oslo?"
	},
	{ # The tool_calls is from the response to the above with tool_choice active
	"role": "assistant",
	"content": None,
	"tool_calls": [
	{
	"id": "call__0_get_current_weather_cmpl-...",
	"type": "function",
	"function": {
	"name": "get_current_weather",
	"arguments": '{ "location": "Oslo, NO" ,"unit": "celsius"} '
	}
	}
	]
	},
	{ # The tool_call_id is from tool_calls and content is the result from the function call you made
	"role": "tool",
	"content": "20",
	"tool_call_id": "call__0_get_current_weather_cmpl-..."
	}
	],
	tools=[{
	"type": "function",
	"function": {
	"name": "get_current_weather",
	"description": "Get the current weather in a given location",
	"parameters": {
	"type": "object",
	"properties": {
	"location": {
	"type": "string",
	"description": "The city and state, e.g. San Francisco, CA"
	},
	"unit": {
	"type": "string",
	"enum": [ "celsius", "fahrenheit" ]
	}
	},
	"required": [ "location" ]
	}
	}
	}],
	#tool_choice={
	# "type": "function",
	# "function": {
	# "name": "get_current_weather"
	# }
	#}
	))
	```

	<!-- README_GGUF.md-how-to-run end -->