Commit 7487169 by CISCai
Parent: 2d9a51a

Upload 14 files
.gitattributes CHANGED
@@ -33,3 +33,16 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.fp16.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.imatrix.dat filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ1_M.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ1_S.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ2_M.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ2_S.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ2_XS.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ2_XXS.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ3_M.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ3_S.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ3_XS.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ3_XXS.gguf filter=lfs diff=lfs merge=lfs -text
+ Mistral-7B-Instruct-v0.3.IQ4_XS.gguf filter=lfs diff=lfs merge=lfs -text
Mistral-7B-Instruct-v0.3.IQ1_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a4c2e132197ce0f48c2f4f305ed7c73b576484f1c4274cd5dcc7cb3d1464157
+ size 1757663808
Mistral-7B-Instruct-v0.3.IQ1_S.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:010412320006c0cfbec504bf49b521cd082fa3e7baa703c2521664b92d76bff4
+ size 1615319616
Mistral-7B-Instruct-v0.3.IQ2_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d772726da0d89582af1879e3bce1f206eecee4dabbf43373d20c3e3c463e033a
+ size 2504249920
Mistral-7B-Instruct-v0.3.IQ2_S.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c74ffe534be1845b3de096c4bee6258eabee8f4590662a04e7cf11325dcf4df
+ size 2314457664
Mistral-7B-Instruct-v0.3.IQ2_XS.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:714090e00e6d3774a09e5fcb97f208d8c9caa82182001e510b9ba6748f85f9ad
+ size 2201473600
Mistral-7B-Instruct-v0.3.IQ2_XXS.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e261e54d7e79d3a2829d946f012a07463435b0ea2674a3ad9ef11065eddac34
+ size 1994904128
Mistral-7B-Instruct-v0.3.IQ3_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7dcc6670467d7a5bfb40428843a6c874b9e919bfa156061eeee5372815a177ab
+ size 3288846912
Mistral-7B-Instruct-v0.3.IQ3_S.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:33bd7373ab791bed06cc5ee9f6e4993e0a4edc308a78a6440732060056dd705b
+ size 3186348608
Mistral-7B-Instruct-v0.3.IQ3_XS.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c38302b457361cb0e4daa4a5629d36ba7c232ad57928cc7b867c31ca9647c79b
+ size 3022770752
Mistral-7B-Instruct-v0.3.IQ3_XXS.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:61212d266155c5525b68791db73fcbbab7ac4f43e8357cf3b99e62f832c56c5f
+ size 2830881344
Mistral-7B-Instruct-v0.3.IQ4_XS.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b217dadd62f9799bc8dcb2a06d24b8bc6c77e38027d93185dd57f392cfc2fb4c
+ size 3911963200
Mistral-7B-Instruct-v0.3.fp16.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7128d0c16ca917c0e5ffcb23c257d7f51bc44412900cbdeb136981bc9fb237f1
+ size 14497337696
Mistral-7B-Instruct-v0.3.imatrix.dat ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e44d933bda5232dd9081c20e0ae2ef2df5aab5c4388fc308b237d1454e694e81
+ size 4988162
README.md CHANGED
@@ -1,3 +1,238 @@
- ---
- license: apache-2.0
- ---
+ ---
+ base_model: mistralai/Mistral-7B-Instruct-v0.3
+ language:
+ - en
+ pipeline_tag: text-generation
+ license: apache-2.0
+ model_creator: Mistral AI
+ model_name: Mistral-7B-Instruct-v0.3
+ model_type: mistral
+ quantized_by: CISC
+ ---
+
+ # Mistral-7B-Instruct-v0.3 - SOTA GGUF
+ - Model creator: [Mistral AI](https://huggingface.co/mistralai)
+ - Original model: [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
+
+ <!-- description start -->
+ ## Description
+
+ This repo contains state-of-the-art quantized GGUF format model files for [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3).
+
+ Quantization was done with an importance matrix that was trained on ~1M tokens (256 batches of 4096 tokens) of [groups_merged.txt](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384) and [wiki.train.raw](https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/train.txt) concatenated.
+
+ The embedded chat template has been extended to support function calling via the OpenAI-compatible `tools` parameter; see the [example](#simple-llama-cpp-python-example-function-calling-code).
+
+ <!-- description end -->
+
+
+ <!-- prompt-template start -->
+ ## Prompt template: Mistral v3
+
+ ```
+ [AVAILABLE_TOOLS][{"name": "function_name", "description": "Description", "parameters": {...}}, ...][/AVAILABLE_TOOLS][INST] {prompt} [/INST]
+ ```
+
+ <!-- prompt-template end -->
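+
+ If you use the plain completion API instead of the chat API, the template above can be filled in by hand. The following is only a minimal sketch with llama-cpp-python and a hypothetical `get_current_weather` tool definition (the same shape as in the function calling example further down):
+
+ ```python
+ import json
+ from llama_cpp import Llama
+
+ # Hypothetical tool definition, serialized into the [AVAILABLE_TOOLS] block of the template above
+ tools = [{
+     "name": "get_current_weather",
+     "description": "Get the current weather in a given location",
+     "parameters": {
+         "type": "object",
+         "properties": {"location": {"type": "string"}},
+         "required": ["location"]
+     }
+ }]
+ prompt = f"[AVAILABLE_TOOLS]{json.dumps(tools)}[/AVAILABLE_TOOLS][INST] What's the weather like in Oslo? [/INST]"
+
+ llm = Llama(model_path="./Mistral-7B-Instruct-v0.3.IQ4_XS.gguf", n_ctx=32768)
+ print(llm(prompt, max_tokens=256)["choices"][0]["text"])
+ ```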
+
+
+ <!-- compatibility_gguf start -->
+ ## Compatibility
+
+ These quantised GGUFv3 files are compatible with llama.cpp from February 27th 2024 onwards, as of commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307).
+
+ They are also compatible with many third-party UIs and libraries, provided they are built against a recent llama.cpp.
+
+ ## Explanation of quantisation methods
+
+ <details>
+ <summary>Click to see details</summary>
+
+ The new methods available are:
+
+ * GGML_TYPE_IQ1_S - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.56 bits per weight (bpw)
+ * GGML_TYPE_IQ1_M - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.75 bpw
+ * GGML_TYPE_IQ2_XXS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.06 bpw
+ * GGML_TYPE_IQ2_XS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.31 bpw
+ * GGML_TYPE_IQ2_S - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.5 bpw
+ * GGML_TYPE_IQ2_M - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.7 bpw
+ * GGML_TYPE_IQ3_XXS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.06 bpw
+ * GGML_TYPE_IQ3_XS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.3 bpw
+ * GGML_TYPE_IQ3_S - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.44 bpw
+ * GGML_TYPE_IQ3_M - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.66 bpw
+ * GGML_TYPE_IQ4_XS - 4-bit quantization in super-blocks with an importance matrix applied, effectively using 4.25 bpw
+ * GGML_TYPE_IQ4_NL - 4-bit non-linearly mapped quantization with an importance matrix applied, effectively using 4.5 bpw
+
+ Refer to the Provided Files table below to see which files use which methods, and how.
+ </details>
+ <!-- compatibility_gguf end -->
+
+ <!-- README_GGUF.md-provided-files start -->
+ ## Provided files
+
+ | Name | Quant method | Bits | Size | Max RAM required | Use case |
+ | ---- | ---- | ---- | ---- | ---- | ----- |
+ | [Mistral-7B-Instruct-v0.3.IQ1_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ1_S.gguf) | IQ1_S | 1 | 1.5 GB | 2.5 GB | smallest, significant quality loss - **TBD**: Waiting for [this issue](https://github.com/ggerganov/llama.cpp/issues/5996) to be resolved |
+ | [Mistral-7B-Instruct-v0.3.IQ1_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ1_M.gguf) | IQ1_M | 1 | 1.6 GB | 2.6 GB | very small, significant quality loss |
+ | [Mistral-7B-Instruct-v0.3.IQ2_XXS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_XXS.gguf) | IQ2_XXS | 2 | 1.8 GB | 2.8 GB | very small, high quality loss |
+ | [Mistral-7B-Instruct-v0.3.IQ2_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_XS.gguf) | IQ2_XS | 2 | 1.9 GB | 2.9 GB | very small, high quality loss |
+ | [Mistral-7B-Instruct-v0.3.IQ2_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_S.gguf) | IQ2_S | 2 | 2.1 GB | 3.1 GB | small, substantial quality loss |
+ | [Mistral-7B-Instruct-v0.3.IQ2_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_M.gguf) | IQ2_M | 2 | 2.2 GB | 3.2 GB | small, greater quality loss |
+ | [Mistral-7B-Instruct-v0.3.IQ3_XXS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_XXS.gguf) | IQ3_XXS | 3 | 2.5 GB | 3.5 GB | very small, high quality loss |
+ | [Mistral-7B-Instruct-v0.3.IQ3_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_XS.gguf) | IQ3_XS | 3 | 2.7 GB | 3.7 GB | small, substantial quality loss |
+ | [Mistral-7B-Instruct-v0.3.IQ3_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_S.gguf) | IQ3_S | 3 | 2.8 GB | 3.8 GB | small, greater quality loss |
+ | [Mistral-7B-Instruct-v0.3.IQ3_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_M.gguf) | IQ3_M | 3 | 3.0 GB | 4.0 GB | medium, balanced quality - recommended |
+ | [Mistral-7B-Instruct-v0.3.IQ4_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ4_XS.gguf) | IQ4_XS | 4 | 3.4 GB | 4.4 GB | small, substantial quality loss |
+
+ Generated importance matrix file: [Mistral-7B-Instruct-v0.3.imatrix.dat](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.imatrix.dat)
+
+ **Note**: the above RAM figures assume no GPU offloading with 4K context. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
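+
+ If you want to script the download instead of using the links above, individual files can be fetched with `huggingface_hub`. A minimal sketch (set `filename` to whichever quant you want from the table):
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Downloads a single GGUF file into the local Hugging Face cache and returns its path
+ model_path = hf_hub_download(
+     repo_id="CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF",
+     filename="Mistral-7B-Instruct-v0.3.IQ4_XS.gguf",
+ )
+ print(model_path)
+ ```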
+
+ <!-- README_GGUF.md-provided-files end -->
+
+ <!-- README_GGUF.md-how-to-run start -->
+ ## Example `llama.cpp` command
+
+ Make sure you are using `llama.cpp` from commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307) or later.
+
+ ```shell
+ ./main -ngl 33 -m Mistral-7B-Instruct-v0.3.IQ4_XS.gguf --color -c 32768 --temp 0 --repeat-penalty 1.1 -p "[AVAILABLE_TOOLS]{tools}[/AVAILABLE_TOOLS][INST] {prompt} [/INST]"
+ ```
+
+ Change `-ngl 33` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
+
+ Change `-c 32768` to the desired sequence length.
+
+ If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
+
+ If you are low on V/RAM, try quantizing the K-cache with `-ctk q8_0` or even `-ctk q4_0` for big memory savings (depending on context size).
+ There is a similar option for the V-cache (`-ctv`), however that is [not working yet](https://github.com/ggerganov/llama.cpp/issues/4425).
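+
+ The K-cache quantization above also has a counterpart in llama-cpp-python. This is only a hedged sketch: it assumes your installed version exposes the `type_k`/`type_v` constructor parameters and the `GGML_TYPE_Q8_0` constant; if it does not, stick to the CLI flag.
+
+ ```python
+ import llama_cpp
+
+ # Assumption: type_k takes a ggml type enum value; GGML_TYPE_Q8_0 here mirrors -ctk q8_0
+ llm = llama_cpp.Llama(
+     model_path="./Mistral-7B-Instruct-v0.3.IQ4_XS.gguf",
+     n_gpu_layers=33,
+     n_ctx=32768,
+     type_k=llama_cpp.GGML_TYPE_Q8_0,
+ )
+ ```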
+
+ For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md).
+
+ ## How to run from Python code
+
+ You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) module.
+
+ ### How to load this model in Python code, using llama-cpp-python
+
+ For full documentation, please see: [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/).
+
+ #### First install the package
+
+ Run one of the following commands, according to your system:
+
+ ```shell
+ # Prebuilt wheel with basic CPU support
+ pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
+ # Prebuilt wheel with NVidia CUDA acceleration (cu121 shown; use cu122 etc. to match your CUDA version)
+ pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
+ # Prebuilt wheel with Metal GPU acceleration
+ pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
+ # Build base version with no GPU acceleration
+ pip install llama-cpp-python
+ # With NVidia CUDA acceleration
+ CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
+ # Or with OpenBLAS acceleration
+ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
+ # Or with CLBLast acceleration
+ CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
+ # Or with AMD ROCm GPU acceleration (Linux only)
+ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
+ # Or with Metal GPU acceleration for macOS systems only
+ CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
+ # Or with Vulkan acceleration
+ CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
+ # Or with Kompute acceleration
+ CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
+ # Or with SYCL acceleration
+ CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
+
+ # On Windows, to set the CMAKE_ARGS variable in PowerShell, use this format; e.g. for NVidia CUDA:
+ $env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
+ pip install llama-cpp-python
+ ```
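+
+ After installing, a quick sanity check (a sketch; it assumes your llama-cpp-python version exposes `llama_supports_gpu_offload`):
+
+ ```python
+ import llama_cpp
+
+ # Print the installed binding version and whether the underlying build can offload layers to a GPU
+ print(llama_cpp.__version__)
+ print(llama_cpp.llama_supports_gpu_offload())
+ ```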
+
+ #### Simple llama-cpp-python example code
+
+ ```python
+ from llama_cpp import Llama
+
+ # Chat Completion API
+
+ llm = Llama(model_path="./Mistral-7B-Instruct-v0.3.IQ4_XS.gguf", n_gpu_layers=33, n_ctx=32768)
+ print(llm.create_chat_completion(
+     messages = [
+         {
+             "role": "user",
+             "content": "Pick a LeetCode challenge and solve it in Python."
+         }
+     ]
+ ))
+ ```
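+
+ The same request can also be streamed token by token (a sketch reusing the `llm` instance created above):
+
+ ```python
+ # Streaming variant: iterate over partial deltas instead of waiting for the full answer
+ for chunk in llm.create_chat_completion(
+     messages = [
+         {
+             "role": "user",
+             "content": "Pick a LeetCode challenge and solve it in Python."
+         }
+     ],
+     stream=True
+ ):
+     delta = chunk["choices"][0]["delta"]
+     if "content" in delta:
+         print(delta["content"], end="", flush=True)
+ ```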
+
+ #### Simple llama-cpp-python example function calling code
+
+ ```python
+ from llama_cpp import Llama
+
+ # Chat Completion API
+
+ llm = Llama(model_path="./Mistral-7B-Instruct-v0.3.IQ4_XS.gguf", n_gpu_layers=33, n_ctx=32768)
+ print(llm.create_chat_completion(
+     temperature = 0.0,
+     repeat_penalty = 1.1,
+     messages = [
+         {
+             "role": "user",
+             "content": "What's the weather like in Oslo?"
+         },
+         { # The tool_calls are from the response to the above with tool_choice active
+             "role": "assistant",
+             "content": None,
+             "tool_calls": [
+                 {
+                     "id": "call__0_get_current_weather_cmpl-...",
+                     "type": "function",
+                     "function": {
+                         "name": "get_current_weather",
+                         "arguments": '{ "location": "Oslo, NO" ,"unit": "celsius"} '
+                     }
+                 }
+             ]
+         },
+         { # The tool_call_id is from tool_calls and content is the result of the function call you made
+             "role": "tool",
+             "content": "20",
+             "tool_call_id": "call__0_get_current_weather_cmpl-..."
+         }
+     ],
+     tools=[{
+         "type": "function",
+         "function": {
+             "name": "get_current_weather",
+             "description": "Get the current weather in a given location",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "location": {
+                         "type": "string",
+                         "description": "The city and state, e.g. San Francisco, CA"
+                     },
+                     "unit": {
+                         "type": "string",
+                         "enum": [ "celsius", "fahrenheit" ]
+                     }
+                 },
+                 "required": [ "location" ]
+             }
+         }
+     }],
+     #tool_choice={
+     #    "type": "function",
+     #    "function": {
+     #        "name": "get_current_weather"
+     #    }
+     #}
+ ))
+ ```
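+
+ In practice you do not write the assistant and tool messages by hand: take `tool_calls` from the model's first response, run the function yourself, and append the result. A minimal sketch, assuming `messages` and `tools` are the lists from the example above and `get_current_weather` is your own implementation (not part of this repo):
+
+ ```python
+ import json
+
+ # First call: the model decides to call the tool and returns tool_calls instead of content
+ response = llm.create_chat_completion(messages=messages, tools=tools)
+ assistant_message = response["choices"][0]["message"]
+ messages.append(assistant_message)
+
+ for tool_call in assistant_message.get("tool_calls") or []:
+     arguments = json.loads(tool_call["function"]["arguments"])
+     result = get_current_weather(**arguments)  # your own code, not part of this repo
+     messages.append({
+         "role": "tool",
+         "content": json.dumps(result),
+         "tool_call_id": tool_call["id"]
+     })
+
+ # Second call: the model answers in natural language, grounded in the tool result
+ print(llm.create_chat_completion(messages=messages, tools=tools))
+ ```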
+
+ <!-- README_GGUF.md-how-to-run end -->