New GGMLv3 format for breaking llama.cpp change May 19th commit 2d5db48
README.md CHANGED
GGML files are for CPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp).

* [4bit and 5bit GGML models for CPU inference](https://huggingface.co/TheBloke/h2ogpt-oasst1-512-30B-GGML).
* [float16 HF format unquantised model for GPU inference and further conversions](https://huggingface.co/TheBloke/h2ogpt-oasst1-512-30B-HF)

## THE FILES IN MAIN BRANCH REQUIRE THE LATEST LLAMA.CPP (May 19th 2023 - commit 2d5db48)!

llama.cpp recently made another breaking change to its quantisation methods - https://github.com/ggerganov/llama.cpp/pull/1508

I have quantised the GGML files in this repo with the latest version. Therefore you will require llama.cpp compiled on May 19th or later (commit `2d5db48` or later) to use them.

For files compatible with the previous version of llama.cpp, please see branch `previous_llama_ggmlv2`.
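If you need to update, here is a minimal sketch of getting a new enough build (the clone location and plain `make` build are assumptions; use whatever build options you normally use):

```
# Fresh clone and build; if you already have a checkout, just `git pull` inside it
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Optional sanity check: commit 2d5db48 (GGMLv3) should be in your history
git merge-base --is-ancestor 2d5db48 HEAD && echo "GGMLv3-capable checkout"
make
```
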
## Provided files

| Name | Quant method | Bits | Size | RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| `h2ogptq-oasst1-512-30B.ggmlv3.q4_0.bin` | q4_0 | 4bit | 20.3GB | 25GB | 4-bit. |
| `h2ogptq-oasst1-512-30B.ggmlv3.q4_1.bin` | q4_1 | 4bit | 24.4GB | 26GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than the q5 models. |
| `h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin` | q5_0 | 5bit | 22.4GB | 25GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
| `h2ogptq-oasst1-512-30B.ggmlv3.q5_1.bin` | q5_1 | 5bit | 24.4GB | 26GB | 5-bit. Even higher accuracy, at the cost of higher resource usage and slower inference. |
| `h2ogptq-oasst1-512-30B.ggmlv3.q8_0.bin` | q8_0 | 8bit | 36.6GB | 39GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |
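If you only want one of the files above, a sketch of downloading it directly (the URL follows Hugging Face's usual `resolve` pattern; the q5_0 file is just an example):

```
# Download a single quantised file from the main branch of this repo
wget https://huggingface.co/TheBloke/h2ogpt-oasst1-512-30B-GGML/resolve/main/h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin
# Files for the previous llama.cpp format are on the previous_llama_ggmlv2 branch,
# i.e. .../resolve/previous_llama_ggmlv2/<filename>
```
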
## How to run in `llama.cpp`

I use the following command line; adjust for your tastes and needs:

```
./main -t 8 -m h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a story about llamas
### Response:"
```
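
A small variation on the above, as a sketch: `-t` can follow your machine's core count, and the prompt can be read from a file with `-f` (this assumes you have saved the prompt template to a file called `prompt.txt`):

```
# Same settings, but thread count comes from the machine and the prompt is read from a file
./main -t $(nproc) -m h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -f prompt.txt
```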

Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).

Note: at this time text-generation-webui may not support the new May 19th llama.cpp quantisation methods for q4_0, q4_1 and q8_0 files.
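For reference, a minimal sketch of loading one of the q5 files in text-generation-webui (the `models/` folder layout and `server.py` invocation are assumptions and vary between versions; the linked doc above is authoritative):

```
# Put the GGML file where text-generation-webui looks for models, then start the UI
cp h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin text-generation-webui/models/
cd text-generation-webui
python server.py --model h2ogptq-oasst1-512-30B.ggmlv3.q5_0.bin
```
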
# Original h2oGPT Model Card
## Summary
- Reporting Issues: If you encounter any biased, offensive, or otherwise inappropriate content generated by the large language model, please report it to the repository maintainers through the provided channels. Your feedback will help improve the model and mitigate potential issues.
- Changes to this Disclaimer: The developers of this repository reserve the right to modify or update this disclaimer at any time without prior notice. It is the user's responsibility to periodically review the disclaimer to stay informed about any changes.

By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions outlined in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.