TheBloke committed
Commit ff5611b
Parent: 1bb123b

Initial GGUF model commit

Files changed (1)
  1. README.md (+12, -42)
README.md CHANGED
@@ -47,11 +47,14 @@ GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is
 
 The key benefit of GGUF is that it is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenization code, including for the first time full support for special tokens. This should improve performance, especially with models that use new special tokens and implement custom prompt templates.
 
- As of August 23rd 2023, only llama.cpp supports GGUF. However, third-party clients and libraries are expected to add support very soon.
+ As of August 24th 2023, llama.cpp and KoboldCpp support GGUF. Other third-party clients and libraries are expected to add support very soon.
+
+ Here is a list of clients and libraries that are known to support GGUF:
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp)
+ * [KoboldCpp](https://github.com/LostRuins/koboldcpp), now supports GGUF as of release 1.41!
 
 Here is a list of clients and libraries, along with their expected timeline for GGUF support. Where possible a link to the relevant issue or PR is provided:
 * [text-generation-webui](https://github.com/oobabooga/text-generation-webui), awaiting llama-cpp-python support.
- * [KoboldCpp](https://github.com/LostRuins/koboldcpp), [in active development](https://github.com/LostRuins/koboldcpp/issues/387). Test builds are working, but GPU acceleration remains to be tested.
 * [LM Studio](https://lmstudio.ai/), in active development - hoped to be ready by August 25th-26th.
 * [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), will work as soon as ctransformers or llama-cpp-python is updated.
 * [ctransformers](https://github.com/marella/ctransformers), [development will start soon](https://github.com/marella/ctransformers/issues/102).
@@ -84,7 +87,9 @@ Here is a list of clients and libraries, along with their expected timeline for
 
 These quantised GGUF files are compatible with llama.cpp from August 21st 2023 onwards, as of commit [6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9](https://github.com/ggerganov/llama.cpp/commit/6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9)
 
- As of August 23rd 2023 they are not yet compatible with any third-party UIs, libraries or utilities but this is expected to change very soon.
+ As of August 24th 2023 they are now compatible with KoboldCpp, release 1.41 and later.
+
+ They are not yet compatible with any other third-party UIs, libraries or utilities but this is expected to change very soon.
 
 ## Explanation of quantisation methods
 <details>
@@ -96,7 +101,6 @@ The new methods available are:
 * GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
 * GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
 * GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw
- * GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
 
 Refer to the Provided Files table below to see what files use which methods, and how.
 </details>
@@ -107,54 +111,20 @@ Refer to the Provided Files table below to see what files use which methods, and
 
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
+ | [nous-hermes-llama2-70b.Q6_K.gguf-split-b](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q6_K.gguf-split-b) | Q6_K | 6 | 20.13 GB| 22.63 GB | very large, extremely low quality loss |
 | [nous-hermes-llama2-70b.Q2_K.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q2_K.gguf) | Q2_K | 2 | 29.48 GB| 31.98 GB | smallest, significant quality loss - not recommended for most purposes |
 | [nous-hermes-llama2-70b.Q3_K_S.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q3_K_S.gguf) | Q3_K_S | 3 | 30.09 GB| 32.59 GB | very small, high quality loss |
 | [nous-hermes-llama2-70b.Q3_K_M.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q3_K_M.gguf) | Q3_K_M | 3 | 33.45 GB| 35.95 GB | very small, high quality loss |
 | [nous-hermes-llama2-70b.Q3_K_L.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q3_K_L.gguf) | Q3_K_L | 3 | 36.49 GB| 38.99 GB | small, substantial quality loss |
+ | [nous-hermes-llama2-70b.Q8_0.gguf-split-b](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q8_0.gguf-split-b) | Q8_0 | 8 | 36.59 GB| 39.09 GB | very large, extremely low quality loss - not recommended |
+ | [nous-hermes-llama2-70b.Q6_K.gguf-split-a](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q6_K.gguf-split-a) | Q6_K | 6 | 36.70 GB| 39.20 GB | very large, extremely low quality loss |
+ | [nous-hermes-llama2-70b.Q8_0.gguf-split-a](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q8_0.gguf-split-a) | Q8_0 | 8 | 36.70 GB| 39.20 GB | very large, extremely low quality loss - not recommended |
 | [nous-hermes-llama2-70b.Q4_K_S.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q4_K_S.gguf) | Q4_K_S | 4 | 39.30 GB| 41.80 GB | small, greater quality loss |
 | [nous-hermes-llama2-70b.Q4_K_M.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q4_K_M.gguf) | Q4_K_M | 4 | 41.69 GB| 44.19 GB | medium, balanced quality - recommended |
 | [nous-hermes-llama2-70b.Q5_K_S.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q5_K_S.gguf) | Q5_K_S | 5 | 47.74 GB| 50.24 GB | large, low quality loss - recommended |
 | [nous-hermes-llama2-70b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q5_K_M.gguf) | Q5_K_M | 5 | 49.03 GB| 51.53 GB | large, very low quality loss - recommended |
- | nous-hermes-llama2-70b.Q6_K.bin | q6_K | 6 | 56.82 GB | 59.32 GB | very large, extremely low quality loss |
- | nous-hermes-llama2-70b.Q8_0.bin | q8_0 | 8 | 73.29 GB | 75.79 GB | very large, extremely low quality loss - not recommended |
 
 **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
-
- ### Q6_K and Q8_0 files are split and require joining
-
- **Note:** HF does not support uploading files larger than 50GB. Therefore I have uploaded the Q6_K and Q8_0 files as split files.
-
- <details>
- <summary>Click for instructions regarding Q6_K and Q8_0 files</summary>
-
- ### q6_K
- Please download:
- * `nous-hermes-llama2-70b.Q6_K.gguf-split-a`
- * `nous-hermes-llama2-70b.Q6_K.gguf-split-b`
-
- ### q8_0
- Please download:
- * `nous-hermes-llama2-70b.Q8_0.gguf-split-a`
- * `nous-hermes-llama2-70b.Q8_0.gguf-split-b`
-
- To join the files, do the following:
-
- Linux:
- ```
- cat nous-hermes-llama2-70b.Q6_K.gguf-split-* > nous-hermes-llama2-70b.Q6_K.gguf && rm nous-hermes-llama2-70b.Q6_K.gguf-split-*
- cat nous-hermes-llama2-70b.Q8_0.gguf-split-* > nous-hermes-llama2-70b.Q8_0.gguf && rm nous-hermes-llama2-70b.Q8_0.gguf-split-*
- ```
- Windows command line:
- ```
- COPY /B nous-hermes-llama2-70b.Q6_K.gguf-split-a + nous-hermes-llama2-70b.Q6_K.gguf-split-b nous-hermes-llama2-70b.Q6_K.gguf
- del nous-hermes-llama2-70b.Q6_K.gguf-split-a nous-hermes-llama2-70b.Q6_K.gguf-split-b
-
- COPY /B nous-hermes-llama2-70b.Q8_0.gguf-split-a + nous-hermes-llama2-70b.Q8_0.gguf-split-b nous-hermes-llama2-70b.Q8_0.gguf
- del nous-hermes-llama2-70b.Q8_0.gguf-split-a nous-hermes-llama2-70b.Q8_0.gguf-split-b
- ```
-
- </details>
-
 <!-- README_GGUF.md-provided-files end -->
 
 <!-- README_GGUF.md-how-to-run start -->
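For illustration, here is one way to fetch a single file from the provided-files table in the diff above. This is only a sketch: it assumes the standard Hugging Face `resolve/main` download path, the direct-download counterpart of the `blob/main` links in the table, and the Q4_K_M file is just one example filename.

```
# Download one of the quantised files listed in the table above.
# Swap in any other filename from the table; resolve/main serves the raw file.
wget https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/resolve/main/nous-hermes-llama2-70b.Q4_K_M.gguf
```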
 
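The compatibility note in the diff pins a minimum llama.cpp commit, and the RAM note mentions GPU offloading. Below is a minimal sketch of building and running at that commit; the build flag, layer count, context size, token count and prompt are illustrative assumptions rather than settings taken from the model card.

```
# Build llama.cpp at (or after) the commit referenced in the compatibility note
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9   # any later commit should also work
make LLAMA_CUBLAS=1   # assumed NVIDIA build; plain `make` gives a CPU-only binary

# Run a quantised GGUF file: -ngl offloads layers to the GPU (trading RAM for VRAM),
# -c sets the context size, -n the number of tokens to generate. Values are examples only.
./main -m /path/to/nous-hermes-llama2-70b.Q4_K_M.gguf \
  -c 4096 -ngl 40 -n 256 \
  -p "Write a short poem about llamas."
```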
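As a back-of-envelope check on the bits-per-weight figures quoted in the quantisation explanation, the sketch below multiplies an assumed round parameter count of 70 billion by each quoted bpw value. This calculation is illustrative only; real files also carry metadata and a mix of quantisation types, so the sizes in the table differ somewhat.

```
# size ≈ parameters * bits-per-weight / 8
awk 'BEGIN {
  params = 70e9                           # assumed round parameter count for a 70B model
  n = split("4.5 5.5 6.5625", bpw, " ")   # bpw values quoted above for Q4_K, Q5_K, Q6_K
  for (i = 1; i <= n; i++)
    printf "%.4f bpw -> roughly %.1f GB\n", bpw[i], params * bpw[i] / 8 / 1e9
}'
```

This gives roughly 39 GB, 48 GB and 57 GB, the same ballpark as the Q4_K, Q5_K and (joined) Q6_K sizes listed in the table.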