TheBloke committed on
Commit 5436273
1 Parent(s): b6c7623

Initial GGCC model commit

Files changed (1)
README.md +11 -12
README.md CHANGED
@@ -17,9 +17,9 @@ license: other
  </div>
  <!-- header end -->

- # TII's Falcon 7B Instruct GGCC
+ # TII's Falcon 7B GGCC GGML

- These files are GGML format model files for [TII's Falcon 7B Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct).
+ These files are GGML format model files for [TII's Falcon 7B GGCC](https://huggingface.co/tiiuae/falcon-7b-instruct).

  These files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.

@@ -27,19 +27,18 @@ GGCC is a new format created in a new fork of llama.cpp that introduced this new

  Currently these files will also not work with code that previously supported Falcon, such as LoLLMs Web UI and ctransformers. But support should be added soon.

- For the previous ggmlv3 files, please see branch `ggmlv3`.
-
  ## Repositories available

  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/falcon-7B-instruct-GPTQ)
  * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/falcon-7B-instruct-GGML)
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-7b-instruct)

- ## Prompt template
+ ## Prompt template: Falcon

  ```
  User: prompt
  Assistant:
+
  ```

  <!-- compatibility_ggml start -->
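As an illustration of the template above, the prompt can also be written out by hand and passed to `bin/falcon_main` with `-p`; the example command later in this README instead passes a plain prompt together with `-enc`. A minimal sketch, assuming the q4_0 file from the Provided files table is in the current directory:

```
# Sketch: hand-written "User:/Assistant:" prompt passed directly to falcon_main.
# The model filename is the 4-bit GGCC file from the Provided files table;
# the flags (-t, -ngl, -b, -m, -p) are the ones used in this README's own example.
bin/falcon_main -t 8 -ngl 100 -b 1 \
  -m falcon-7b-instruct.ggccv1.q4_0.bin \
  -p "User: Write a short story about llamas
Assistant:"
```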
@@ -57,7 +56,7 @@ Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VS

  Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
  ```
- bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon-40b-sft-mix-1226.ggccv1.q4_K.bin -p "<|prompter|>write a story about llamas<|endoftext|><|assistant|>"
+ bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon-40b-sft-mix-1226.ggccv1.q4_0.bin -enc -p "write a story about llamas"
  ```

  You can specify `-ngl 100` regardless of your VRAM, as it will automatically detect how much VRAM is available to be used.
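For readers who have not built the fork yet, a typical CMake build of ggllm.cpp looks like the sketch below. This mirrors the usual llama.cpp build flow; the `LLAMA_CUBLAS` option name is an assumption carried over from upstream llama.cpp and may differ in the fork, so treat the ggllm.cpp README as authoritative (it also covers the Visual Studio / CMake route mentioned above for Windows):

```
# Sketch of a Linux build of the ggllm.cpp fork (produces falcon_main).
# The CUDA option name follows upstream llama.cpp and may differ in this fork.
git clone https://github.com/cmp-nct/ggllm.cpp
cd ggllm.cpp
mkdir build && cd build
cmake -DLLAMA_CUBLAS=1 ..          # omit this flag for a CPU-only build
cmake --build . --config Release   # binaries typically land in build/bin/
```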
@@ -73,11 +72,11 @@ Please see https://github.com/cmp-nct/ggllm.cpp for further details and instruct
  ## Provided files
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
  | ---- | ---- | ---- | ---- | ---- | ----- |
- | falcon-7b-instruct.ggmlv3.q4_0.bin | q4_0 | 4 | 4.06 GB | 6.56 GB | Original quant method, 4-bit. |
- | falcon-7b-instruct.ggmlv3.q4_1.bin | q4_1 | 4 | 4.51 GB | 7.01 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
- | falcon-7b-instruct.ggmlv3.q5_0.bin | q5_0 | 5 | 4.96 GB | 7.46 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
- | falcon-7b-instruct.ggmlv3.q5_1.bin | q5_1 | 5 | 5.41 GB | 7.91 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
- | falcon-7b-instruct.ggmlv3.q8_0.bin | q8_0 | 8 | 7.67 GB | 10.17 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
+ | falcon-7b-instruct.ggccv1.q4_0.bin | q4_0 | 4 | 4.06 GB | 6.56 GB | Original quant method, 4-bit. |
+ | falcon-7b-instruct.ggccv1.q4_1.bin | q4_1 | 4 | 4.51 GB | 7.01 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
+ | falcon-7b-instruct.ggccv1.q5_0.bin | q5_0 | 5 | 4.96 GB | 7.46 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
+ | falcon-7b-instruct.ggccv1.q5_1.bin | q5_1 | 5 | 5.42 GB | 7.92 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
+ | falcon-7b-instruct.ggccv1.q8_0.bin | q8_0 | 8 | 7.67 GB | 10.17 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
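To grab just one of the quantised files rather than cloning the whole repo, the standard Hugging Face `resolve` URL works; the sketch below assumes the GGCC files sit on the `main` branch of the GGML repo linked above and uses the q4_0 filename from the table:

```
# Sketch: download a single quant file via the standard Hugging Face resolve URL.
# Swap the filename for any other row in the Provided files table.
wget https://huggingface.co/TheBloke/falcon-7B-instruct-GGML/resolve/main/falcon-7b-instruct.ggccv1.q4_0.bin
```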
 
@@ -109,7 +108,7 @@ Thank you to all my generous patrons and donators!

  <!-- footer end -->

- # Original model card: TII's Falcon 7B Instruct
+ # Original model card: TII's Falcon 7B GGCC


  # ✨ Falcon-7B-Instruct