TheBloke committed
Commit 6c7d755
1 Parent(s): 6bb6f2b

Initial GGML model commit

Files changed (1):
  1. README.md +12 -14
README.md CHANGED
@@ -38,13 +38,11 @@ quantized_by: TheBloke
 
 This repo contains GGML format model files for [Stability AI's StableBeluga 2](https://huggingface.co/stabilityai/StableBeluga2).
 
-GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
-* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI. Supports NVidia CUDA GPU acceleration.
-* [KoboldCpp](https://github.com/LostRuins/koboldcpp), a powerful GGML web UI with GPU acceleration on all platforms (CUDA and OpenCL). Especially good for storytelling.
-* [LM Studio](https://lmstudio.ai/), a fully featured local GUI with GPU acceleration on both Windows (NVidia and AMD) and macOS.
-* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with CUDA GPU acceleration via the c_transformers backend.
-* [ctransformers](https://github.com/marella/ctransformers), a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
-* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
+These 70B Llama 2 GGML files currently only support CPU inference. They are known to work with:
+* [llama.cpp](https://github.com/ggerganov/llama.cpp)
+* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI.
+* [KoboldCpp](https://github.com/LostRuins/koboldcpp), version 1.37 and later. A powerful GGML web UI with GPU acceleration on all platforms (CUDA and OpenCL). Especially good for storytelling.
+* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), version 0.1.77 and later. A Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
 
 ## Repositories available
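For the llama-cpp-python route listed above, the 70B requirement surfaces as the `n_gqa` loader parameter. A minimal sketch, not part of this commit, assuming llama-cpp-python 0.1.77+ and the `q4_K_M` file from the example command further down (prompt and sampling values are borrowed from there; `max_tokens` is an assumption):

```
# Minimal sketch: CPU-only loading of these 70B GGML files with llama-cpp-python >= 0.1.77.
from llama_cpp import Llama

llm = Llama(
    model_path="stablebeluga2.ggmlv3.q4_K_M.bin",  # any of the provided GGML files
    n_gqa=8,      # grouped-query attention factor; required for Llama 2 70B
    n_ctx=4096,   # context size, matching -c 4096 in the example command
    n_threads=8,  # set to your physical core count, like -t in llama.cpp
)

out = llm(
    "### System:\nThis is a system prompt, please behave and help the user.\n\n"
    "### User:\nWrite a story about llamas\n\n### Assistant:",
    max_tokens=512,   # assumption; the CLI example uses -n -1 (unlimited)
    temperature=0.7,
    repeat_penalty=1.1,
)
print(out["choices"][0]["text"])
```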
 
@@ -67,15 +65,15 @@ This is a system prompt, please behave and help the user.
 <!-- compatibility_ggml start -->
 ## Compatibility
 
-### Original llama.cpp quant methods: `q4_0, q4_1, q5_0, q5_1, q8_0`
+### Requires llama.cpp [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later
 
-These are guaranteed to be compatible with any UIs, tools and libraries released since late May. They may be phased out soon, as they are largely superseded by the new k-quant methods.
+Alternatively, use one of the other tools and libraries listed above.
 
-### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`
+There is currently no GPU acceleration; only the CPU can be used.
 
-These new quantisation methods are compatible with llama.cpp as of June 6th, commit `2d43387`.
+To use these files in llama.cpp, you must add the `-gqa 8` argument.
 
-They are now also compatible with recent releases of text-generation-webui, KoboldCpp, llama-cpp-python, ctransformers, rustformers and most others. For compatibility with other tools and libraries, please check their documentation.
+For other UIs and libraries, please check their docs.
 
 ## Explanation of the new k-quant methods
 <details>
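Background on the value, a gloss rather than something the README states: Llama 2 70B uses grouped-query attention, and GGML v3 files do not record the grouping, so llama.cpp must be told explicitly. Assuming the standard 70B attention configuration:

```
# Where -gqa 8 comes from (assumed Llama 2 70B configuration, not stated in the README).
n_head = 64      # query heads in Llama 2 70B
n_head_kv = 8    # key/value heads shared across query-head groups
gqa = n_head // n_head_kv
assert gqa == 8  # the factor passed to llama.cpp as -gqa 8
```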
@@ -115,11 +113,11 @@ Refer to the Provided Files table below to see what files use which methods, and
 I use the following command line; adjust for your tastes and needs:
 
 ```
-./main -t 10 -ngl 32 -m stablebeluga2.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
+./main -t 10 -gqa 8 -m stablebeluga2.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### System:\nThis is a system prompt, please behave and help the user.\n\n### User:\nWrite a story about llamas\n\n### Assistant:"
 ```
 Change `-t 10` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
 
-Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
+Remember the `-gqa 8` argument; it is required for Llama 2 70B models.
 
 If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
 
 
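To sketch what `-i -ins` does in Python terms, again via llama-cpp-python and not part of this commit; the stop string and token budget are assumptions:

```
# Hypothetical chat-style loop mirroring llama.cpp's -i -ins interactive mode.
from llama_cpp import Llama

llm = Llama(model_path="stablebeluga2.ggmlv3.q4_K_M.bin", n_gqa=8, n_ctx=4096)

transcript = "### System:\nThis is a system prompt, please behave and help the user.\n\n"
while True:
    user = input("You: ")
    transcript += f"### User:\n{user}\n\n### Assistant:\n"
    out = llm(transcript, max_tokens=512, temperature=0.7, repeat_penalty=1.1,
              stop=["### User:"])  # stop before the model writes the next user turn
    reply = out["choices"][0]["text"].strip()
    print(reply)
    transcript += reply + "\n\n"
```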