TheBloke committed
Commit ff5611b
Parent: 1bb123b

Initial GGUF model commit

Files changed (1)
  1. README.md (+12, -42)
README.md CHANGED
@@ -47,11 +47,14 @@ GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is
 
 The key benefit of GGUF is that it is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenization code, including for the first time full support for special tokens. This should improve performance, especially with models that use new special tokens and implement custom prompt templates.
 
- As of August 23rd 2023, only llama.cpp supports GGUF. However, third-party clients and libraries are expected to add support very soon.
+ As of August 24th 2023, llama.cpp and KoboldCpp support GGUF. Other third-party clients and libraries are expected to add support very soon.
+
+ Here is a list of clients and libraries that are known to support GGUF:
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp)
+ * [KoboldCpp](https://github.com/LostRuins/koboldcpp), now supports GGUF as of release 1.41!
 
 Here is a list of clients and libraries, along with their expected timeline for GGUF support. Where possible a link to the relevant issue or PR is provided:
 * [text-generation-webui](https://github.com/oobabooga/text-generation-webui), awaiting llama-cpp-python support.
- * [KoboldCpp](https://github.com/LostRuins/koboldcpp), [in active development](https://github.com/LostRuins/koboldcpp/issues/387). Test builds are working, but GPU acceleration remains to be tested.
 * [LM Studio](https://lmstudio.ai/), in active development - hoped to be ready by August 25th-26th.
 * [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), will work as soon as ctransformers or llama-cpp-python is updated.
 * [ctransformers](https://github.com/marella/ctransformers), [development will start soon](https://github.com/marella/ctransformers/issues/102).
@@ -84,7 +87,9 @@ Here is a list of clients and libraries, along with their expected timeline for
 
 These quantised GGUF files are compatible with llama.cpp from August 21st 2023 onwards, as of commit [6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9](https://github.com/ggerganov/llama.cpp/commit/6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9)
 
- As of August 23rd 2023 they are not yet compatible with any third-party UIs, libraries or utilities but this is expected to change very soon.
+ As of August 24th 2023 they are now compatible with KoboldCpp, release 1.41 and later.
+
+ They are not yet compatible with any other third-party UIs, libraries or utilities but this is expected to change very soon.
 
 ## Explanation of quantisation methods
 <details>
@@ -96,7 +101,6 @@ The new methods available are:
 * GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
 * GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
 * GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw
- * GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
 
 Refer to the Provided Files table below to see what files use which methods, and how.
 </details>
@@ -107,54 +111,20 @@ Refer to the Provided Files table below to see what files use which methods, and
 
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
+ | [nous-hermes-llama2-70b.Q6_K.gguf-split-b](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q6_K.gguf-split-b) | Q6_K | 6 | 20.13 GB| 22.63 GB | very large, extremely low quality loss |
 | [nous-hermes-llama2-70b.Q2_K.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q2_K.gguf) | Q2_K | 2 | 29.48 GB| 31.98 GB | smallest, significant quality loss - not recommended for most purposes |
 | [nous-hermes-llama2-70b.Q3_K_S.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q3_K_S.gguf) | Q3_K_S | 3 | 30.09 GB| 32.59 GB | very small, high quality loss |
 | [nous-hermes-llama2-70b.Q3_K_M.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q3_K_M.gguf) | Q3_K_M | 3 | 33.45 GB| 35.95 GB | very small, high quality loss |
 | [nous-hermes-llama2-70b.Q3_K_L.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q3_K_L.gguf) | Q3_K_L | 3 | 36.49 GB| 38.99 GB | small, substantial quality loss |
+ | [nous-hermes-llama2-70b.Q8_0.gguf-split-b](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q8_0.gguf-split-b) | Q8_0 | 8 | 36.59 GB| 39.09 GB | very large, extremely low quality loss - not recommended |
+ | [nous-hermes-llama2-70b.Q6_K.gguf-split-a](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q6_K.gguf-split-a) | Q6_K | 6 | 36.70 GB| 39.20 GB | very large, extremely low quality loss |
+ | [nous-hermes-llama2-70b.Q8_0.gguf-split-a](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q8_0.gguf-split-a) | Q8_0 | 8 | 36.70 GB| 39.20 GB | very large, extremely low quality loss - not recommended |
 | [nous-hermes-llama2-70b.Q4_K_S.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q4_K_S.gguf) | Q4_K_S | 4 | 39.30 GB| 41.80 GB | small, greater quality loss |
 | [nous-hermes-llama2-70b.Q4_K_M.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q4_K_M.gguf) | Q4_K_M | 4 | 41.69 GB| 44.19 GB | medium, balanced quality - recommended |
 | [nous-hermes-llama2-70b.Q5_K_S.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q5_K_S.gguf) | Q5_K_S | 5 | 47.74 GB| 50.24 GB | large, low quality loss - recommended |
 | [nous-hermes-llama2-70b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/blob/main/nous-hermes-llama2-70b.Q5_K_M.gguf) | Q5_K_M | 5 | 49.03 GB| 51.53 GB | large, very low quality loss - recommended |
- | nous-hermes-llama2-70b.Q6_K.bin | q6_K | 6 | 56.82 GB | 59.32 GB | very large, extremely low quality loss |
- | nous-hermes-llama2-70b.Q8_0.bin | q8_0 | 8 | 73.29 GB | 75.79 GB | very large, extremely low quality loss - not recommended |
 
 **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
-
- ### Q6_K and Q8_0 files are split and require joining
-
- **Note:** HF does not support uploading files larger than 50GB. Therefore I have uploaded the Q6_K and Q8_0 files as split files.
-
- <details>
- <summary>Click for instructions regarding Q6_K and Q8_0 files</summary>
-
- ### q6_K
- Please download:
- * `nous-hermes-llama2-70b.Q6_K.gguf-split-a`
- * `nous-hermes-llama2-70b.Q6_K.gguf-split-b`
-
- ### q8_0
- Please download:
- * `nous-hermes-llama2-70b.Q8_0.gguf-split-a`
- * `nous-hermes-llama2-70b.Q8_0.gguf-split-b`
-
- To join the files, do the following:
-
- Linux:
- ```
- cat nous-hermes-llama2-70b.Q6_K.gguf-split-* > nous-hermes-llama2-70b.Q6_K.gguf && rm nous-hermes-llama2-70b.Q6_K.gguf-split-*
- cat nous-hermes-llama2-70b.Q8_0.gguf-split-* > nous-hermes-llama2-70b.Q8_0.gguf && rm nous-hermes-llama2-70b.Q8_0.gguf-split-*
- ```
- Windows command line:
- ```
- COPY /B nous-hermes-llama2-70b.Q6_K.gguf-split-a + nous-hermes-llama2-70b.Q6_K.gguf-split-b nous-hermes-llama2-70b.Q6_K.gguf
- del nous-hermes-llama2-70b.Q6_K.gguf-split-a nous-hermes-llama2-70b.Q6_K.gguf-split-b
-
- COPY /B nous-hermes-llama2-70b.Q8_0.gguf-split-a + nous-hermes-llama2-70b.Q8_0.gguf-split-b nous-hermes-llama2-70b.Q8_0.gguf
- del nous-hermes-llama2-70b.Q8_0.gguf-split-a nous-hermes-llama2-70b.Q8_0.gguf-split-b
- ```
-
- </details>
-
 <!-- README_GGUF.md-provided-files end -->
 
 <!-- README_GGUF.md-how-to-run start -->
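For illustration, here is one way to fetch a single file from the provided-files table in the diff above. This is only a sketch: it assumes the standard Hugging Face `resolve/main` download path, the direct-download counterpart of the `blob/main` links in the table, and the Q4_K_M file is just one example filename.

```
# Download one of the quantised files listed in the table above.
# Swap in any other filename from the table; resolve/main serves the raw file.
wget https://huggingface.co/TheBloke/Nous-Hermes-Llama2-70B-GGUF/resolve/main/nous-hermes-llama2-70b.Q4_K_M.gguf
```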
 
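The compatibility note in the diff pins a minimum llama.cpp commit, and the RAM note mentions GPU offloading. Below is a minimal sketch of building and running at that commit; the build flag, layer count, context size, token count and prompt are illustrative assumptions rather than settings taken from the model card.

```
# Build llama.cpp at (or after) the commit referenced in the compatibility note
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9   # any later commit should also work
make LLAMA_CUBLAS=1   # assumed NVIDIA build; plain `make` gives a CPU-only binary

# Run a quantised GGUF file: -ngl offloads layers to the GPU (trading RAM for VRAM),
# -c sets the context size, -n the number of tokens to generate. Values are examples only.
./main -m /path/to/nous-hermes-llama2-70b.Q4_K_M.gguf \
  -c 4096 -ngl 40 -n 256 \
  -p "Write a short poem about llamas."
```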
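As a back-of-envelope check on the bits-per-weight figures quoted in the quantisation explanation, the sketch below multiplies an assumed round parameter count of 70 billion by each quoted bpw value. This calculation is illustrative only; real files also carry metadata and a mix of quantisation types, so the sizes in the table differ somewhat.

```
# size ≈ parameters * bits-per-weight / 8
awk 'BEGIN {
  params = 70e9                           # assumed round parameter count for a 70B model
  n = split("4.5 5.5 6.5625", bpw, " ")   # bpw values quoted above for Q4_K, Q5_K, Q6_K
  for (i = 1; i <= n; i++)
    printf "%.4f bpw -> roughly %.1f GB\n", bpw[i], params * bpw[i] / 8 / 1e9
}'
```

This gives roughly 39 GB, 48 GB and 57 GB, the same ballpark as the Q4_K, Q5_K and (joined) Q6_K sizes listed in the table.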