TheBloke committed
Commit 3b0bfcf
1 Parent(s): bce565c

Upload README.md

Files changed (1):
  1. README.md +78 -13
README.md CHANGED
@@ -55,14 +55,16 @@ This repo contains GGUF format model files for [OpenOrca's Mixtral SlimOrca 8X7B

  GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

- **MIXTRAL GGUF SUPPORT**
+ ### Mixtral GGUF
+
+ Support for Mixtral was merged into Llama.cpp on December 13th.
+
+ These Mixtral GGUFs are known to work in:

- Known to work in:
  * llama.cpp as of December 13th
  * KoboldCpp 1.52 and later
  * LM Studio 0.2.9 and later
-
- Support for Mixtral was merged into Llama.cpp on December 13th.
+ * llama-cpp-python 0.2.23 and later

  Other clients/libraries, not listed above, may not yet work.

@@ -70,7 +72,6 @@ Other clients/libraries, not listed above, may not yet work.
  <!-- repositories-available start -->
  ## Repositories available

- * AWQ coming soon
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Mixtral-SlimOrca-8x7B-GPTQ)
  * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Mixtral-SlimOrca-8x7B-GGUF)
  * [OpenOrca's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/Open-Orca/Mixtral-SlimOrca-8x7B)
@@ -85,6 +86,7 @@ Other clients/libraries, not listed above, may not yet work.
  <|im_start|>user
  {prompt}<|im_end|>
  <|im_start|>assistant
+
  ```

  <!-- prompt-template end -->
@@ -93,9 +95,7 @@ Other clients/libraries, not listed above, may not yet work.
  <!-- compatibility_gguf start -->
  ## Compatibility

- These quantised GGUFv2 files are compatible with llama.cpp from December 13th onwards.
-
- They are also compatible with many third party UIs and libraries - please see the list at the top of this README.
+ These Mixtral GGUFs are compatible with llama.cpp from December 13th onwards. Other clients/libraries may not work yet.

  ## Explanation of quantisation methods

@@ -198,12 +198,12 @@ Windows Command Line users: You can set the environment variable by running `set
  Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.

  ```shell
- ./main -ngl 35 -m mixtral-slimorca-8x7b.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
+ ./main -ngl 35 -m mixtral-slimorca-8x7b.Q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
  ```

  Change `-ngl 35` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

- Change `-c 32768` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.
+ Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.

  If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`

@@ -211,16 +211,81 @@ For other parameters and how to use them, please refer to [the llama.cpp documen

  ## How to run in `text-generation-webui`

- Not yet supported
+ Note that text-generation-webui may not yet be compatible with Mixtral GGUFs. Please check compatibility first.
+
+ Further instructions can be found in the text-generation-webui documentation, here: [text-generation-webui/docs/04 ‐ Model Tab.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20%E2%80%90%20Model%20Tab.md#llamacpp).

  ## How to run from Python code

- Not yet supported
+ You can use GGUF models from Python using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) version 0.2.23 and later.

  ### How to load this model in Python code, using llama-cpp-python

- Not yet supported
+ For full documentation, please see: [llama-cpp-python docs](https://abetlen.github.io/llama-cpp-python/).
+
+ #### First install the package
+
+ Run one of the following commands, according to your system:
+
+ ```shell
+ # Base llama-cpp-python with no GPU acceleration
+ pip install llama-cpp-python
+ # With NVidia CUDA acceleration
+ CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
+ # Or with OpenBLAS acceleration
+ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
+ # Or with CLBLast acceleration
+ CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
+ # Or with AMD ROCm GPU acceleration (Linux only)
+ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
+ # Or with Metal GPU acceleration for macOS systems only
+ CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
+
+ # On Windows, to set the variable CMAKE_ARGS in PowerShell, follow this format; e.g. for NVidia CUDA:
+ $env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
+ pip install llama-cpp-python
+ ```
+
+ #### Simple llama-cpp-python example code
+
+ ```python
+ from llama_cpp import Llama
+
+ # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
+ llm = Llama(
+   model_path="./mixtral-slimorca-8x7b.Q4_K_M.gguf",  # Download the model file first
+   n_ctx=2048,       # The max sequence length to use - note that longer sequence lengths require much more resources
+   n_threads=8,      # The number of CPU threads to use, tailor to your system and the resulting performance
+   n_gpu_layers=35   # The number of layers to offload to GPU, if you have GPU acceleration available
+ )
+
+ # Simple inference example
+ output = llm(
+   "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant",  # Prompt
+   max_tokens=512,  # Generate up to 512 tokens
+   stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
+   echo=True        # Whether to echo the prompt
+ )
+
+ # Chat Completion API
+
+ llm = Llama(model_path="./mixtral-slimorca-8x7b.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using (this model uses the ChatML format)
+ llm.create_chat_completion(
+     messages = [
+         {"role": "system", "content": "You are a story writing assistant."},
+         {
+             "role": "user",
+             "content": "Write a story about llamas."
+         }
+     ]
+ )
+ ```
+
+ ## How to use with LangChain
+
+ Here are guides on using llama-cpp-python and ctransformers with LangChain:

+ * [LangChain + llama-cpp-python](https://python.langchain.com/docs/integrations/llms/llamacpp)

  <!-- README_GGUF.md-how-to-run end -->

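The LangChain section added above links only to the llama-cpp-python integration guide. As a rough, hedged sketch of that route (the `langchain_community` import path, parameter values, stop token and prompt below are illustrative assumptions, not taken from the committed README; check the linked guide for the current API), loading this GGUF through LangChain's `LlamaCpp` wrapper might look like:

```python
# Illustrative sketch only. Assumes `pip install llama-cpp-python langchain-community`
# and that mixtral-slimorca-8x7b.Q4_K_M.gguf has already been downloaded locally.
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./mixtral-slimorca-8x7b.Q4_K_M.gguf",
    n_ctx=2048,           # max sequence length; longer contexts need more memory
    n_gpu_layers=35,      # set to 0 if you have no GPU acceleration
    temperature=0.7,
    max_tokens=512,
    stop=["<|im_end|>"],  # assumed ChatML stop token, matching the prompt template in the README
)

# Build a prompt in the ChatML format shown in the README's prompt template.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a short poem about llamas.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(llm.invoke(prompt))
```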