TheBloke committed on
Commit 5436273
1 Parent(s): b6c7623

Initial GGCC model commit

Files changed (1)
README.md +11 -12
README.md CHANGED
@@ -17,9 +17,9 @@ license: other
  </div>
  <!-- header end -->

- # TII's Falcon 7B Instruct GGCC
+ # TII's Falcon 7B GGCC GGML

- These files are GGML format model files for [TII's Falcon 7B Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct).
+ These files are GGML format model files for [TII's Falcon 7B GGCC](https://huggingface.co/tiiuae/falcon-7b-instruct).

  These files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.

@@ -27,19 +27,18 @@ GGCC is a new format created in a new fork of llama.cpp that introduced this new

  Currently these files will also not work with code that previously supported Falcon, such as LoLLMs Web UI and ctransformers. But support should be added soon.

- For the previous ggmlv3 files, please see branch `ggmlv3`.
-
  ## Repositories available

  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/falcon-7B-instruct-GPTQ)
  * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/falcon-7B-instruct-GGML)
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-7b-instruct)

- ## Prompt template
+ ## Prompt template: Falcon

  ```
  User: prompt
  Assistant:
+
  ```

  <!-- compatibility_ggml start -->
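As an illustration of the template above, the prompt can also be written out by hand and passed to `bin/falcon_main` with `-p`; the example command later in this README instead passes a plain prompt together with `-enc`. A minimal sketch, assuming the q4_0 file from the Provided files table is in the current directory:

```
# Sketch: hand-written "User:/Assistant:" prompt passed directly to falcon_main.
# The model filename is the 4-bit GGCC file from the Provided files table;
# the flags (-t, -ngl, -b, -m, -p) are the ones used in this README's own example.
bin/falcon_main -t 8 -ngl 100 -b 1 \
  -m falcon-7b-instruct.ggccv1.q4_0.bin \
  -p "User: Write a short story about llamas
Assistant:"
```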
@@ -57,7 +56,7 @@ Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VS

  Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
  ```
- bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon-40b-sft-mix-1226.ggccv1.q4_K.bin -p "<|prompter|>write a story about llamas<|endoftext|><|assistant|>"
+ bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon-40b-sft-mix-1226.ggccv1.q4_0.bin -enc -p "write a story about llamas"
  ```

  You can specify `-ngl 100` regardless of your VRAM, as it will automatically detect how much VRAM is available to be used.
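For readers who have not built the fork yet, a typical CMake build of ggllm.cpp looks like the sketch below. This mirrors the usual llama.cpp build flow; the `LLAMA_CUBLAS` option name is an assumption carried over from upstream llama.cpp and may differ in the fork, so treat the ggllm.cpp README as authoritative (it also covers the Visual Studio / CMake route mentioned above for Windows):

```
# Sketch of a Linux build of the ggllm.cpp fork (produces falcon_main).
# The CUDA option name follows upstream llama.cpp and may differ in this fork.
git clone https://github.com/cmp-nct/ggllm.cpp
cd ggllm.cpp
mkdir build && cd build
cmake -DLLAMA_CUBLAS=1 ..          # omit this flag for a CPU-only build
cmake --build . --config Release   # binaries typically land in build/bin/
```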
@@ -73,11 +72,11 @@ Please see https://github.com/cmp-nct/ggllm.cpp for further details and instruct
  ## Provided files
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
  | ---- | ---- | ---- | ---- | ---- | ----- |
- | falcon-7b-instruct.ggmlv3.q4_0.bin | q4_0 | 4 | 4.06 GB | 6.56 GB | Original quant method, 4-bit. |
- | falcon-7b-instruct.ggmlv3.q4_1.bin | q4_1 | 4 | 4.51 GB | 7.01 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
- | falcon-7b-instruct.ggmlv3.q5_0.bin | q5_0 | 5 | 4.96 GB | 7.46 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
- | falcon-7b-instruct.ggmlv3.q5_1.bin | q5_1 | 5 | 5.41 GB | 7.91 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
- | falcon-7b-instruct.ggmlv3.q8_0.bin | q8_0 | 8 | 7.67 GB | 10.17 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
+ | falcon-7b-instruct.ggccv1.q4_0.bin | q4_0 | 4 | 4.06 GB | 6.56 GB | Original quant method, 4-bit. |
+ | falcon-7b-instruct.ggccv1.q4_1.bin | q4_1 | 4 | 4.51 GB | 7.01 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
+ | falcon-7b-instruct.ggccv1.q5_0.bin | q5_0 | 5 | 4.96 GB | 7.46 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
+ | falcon-7b-instruct.ggccv1.q5_1.bin | q5_1 | 5 | 5.42 GB | 7.92 GB | Original quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
+ | falcon-7b-instruct.ggccv1.q8_0.bin | q8_0 | 8 | 7.67 GB | 10.17 GB | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
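To grab just one of the quantised files rather than cloning the whole repo, the standard Hugging Face `resolve` URL works; the sketch below assumes the GGCC files sit on the `main` branch of the GGML repo linked above and uses the q4_0 filename from the table:

```
# Sketch: download a single quant file via the standard Hugging Face resolve URL.
# Swap the filename for any other row in the Provided files table.
wget https://huggingface.co/TheBloke/falcon-7B-instruct-GGML/resolve/main/falcon-7b-instruct.ggccv1.q4_0.bin
```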
 
@@ -109,7 +108,7 @@ Thank you to all my generous patrons and donators!

  <!-- footer end -->

- # Original model card: TII's Falcon 7B Instruct
+ # Original model card: TII's Falcon 7B GGCC


  # ✨ Falcon-7B-Instruct