Initial GGCC model commit

README.md CHANGED
@@ -17,9 +17,9 @@ license: other

</div>
<!-- header end -->

# TII's Falcon 7B GGCC GGML

These files are GGML format model files for [TII's Falcon 7B GGCC](https://huggingface.co/tiiuae/falcon-7b-instruct).

These files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.

@@ -27,19 +27,18 @@ GGCC is a new format created in a new fork of llama.cpp that introduced this new

Currently these files will also not work with code that previously supported Falcon, such as LoLLMs Web UI and ctransformers. But support should be added soon.

## Repositories available

* [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/falcon-7B-instruct-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/falcon-7B-instruct-GGML)
* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-7b-instruct)

## Prompt template: Falcon

```
User: prompt
Assistant:

```
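
As a quick illustration (not part of the original README), a concrete request simply replaces the `prompt` placeholder, so the text sent to the model would look like:

```
User: Write a story about llamas
Assistant:
```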

<!-- compatibility_ggml start -->

@@ -57,7 +56,7 @@ Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VS

Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
```
bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon-40b-sft-mix-1226.ggccv1.q4_0.bin -enc -p "write a story about llamas"
```

You can specify `-ngl 100` regardless of your VRAM, as it will automatically detect how much VRAM is available to be used.
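
The example command above happens to reference a 40B file name; a minimal sketch (an assumption, not from the original README) of the same flags pointed at one of the 7B files listed under "Provided files" below would be:

```
# Same flags as the documented example, but with -m set to one of the
# falcon-7b-instruct GGCC quant files from this repository (file choice is illustrative).
bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon-7b-instruct.ggccv1.q5_0.bin -enc -p "write a story about llamas"
```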

@@ -73,11 +72,11 @@ Please see https://github.com/cmp-nct/ggllm.cpp for further details and instruct

## Provided files
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| falcon-7b-instruct.ggccv1.q4_0.bin | q4_0 | 4 | 4.06 GB | 6.56 GB | Original quant method, 4-bit. |
| falcon-7b-instruct.ggccv1.q4_1.bin | q4_1 | 4 | 4.51 GB | 7.01 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
| falcon-7b-instruct.ggccv1.q5_0.bin | q5_0 | 5 | 4.96 GB | 7.46 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
| falcon-7b-instruct.ggccv1.q5_1.bin | q5_1 | 5 | 5.42 GB | 7.92 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
| falcon-7b-instruct.ggccv1.q8_0.bin | q8_0 | 8 | 7.67 GB | 10.17 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
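
If you want to grab one of these files directly, here is a minimal sketch (not part of the original README) using Hugging Face's standard `resolve` download URL, assuming the GGCC files sit on the repository's `main` branch:

```
# Download the q4_0 quant listed above (file choice is illustrative; the URL follows
# the usual huggingface.co/<repo>/resolve/<branch>/<filename> pattern).
wget https://huggingface.co/TheBloke/falcon-7B-instruct-GGML/resolve/main/falcon-7b-instruct.ggccv1.q4_0.bin
```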

@@ -109,7 +108,7 @@ Thank you to all my generous patrons and donaters!

<!-- footer end -->

# Original model card: TII's Falcon 7B GGCC


# ✨ Falcon-7B-Instruct