TheBloke/falcon-40b-instruct-GPTQ

@@ -21,11 +21,11 @@ inference: false
 </div>
 <!-- header end -->
-# Falcon-40B-Instruct GPTQ
-This repo contains an experimental GPTQ 4bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
-It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
 ## EXPERIMENTAL
@@ -33,6 +33,10 @@ Please note this is an experimental GPTQ model. Support for it is currently quit
 It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.
 ## AutoGPTQ
 AutoGPTQ is required: `pip install auto-gptq`
@@ -61,11 +65,11 @@ So please first update text-generation-webui to the latest version.
 1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
 2. Click the **Model tab**.
-3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-GPTQ`.
 4. Click **Download**.
 5. Wait until it says it's finished downloading.
 6. Click the **Refresh** icon next to **Model** in the top left.
-7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-GPTQ`.
 8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
 ## About `trust_remote_code`
@@ -91,7 +95,7 @@ from transformers import AutoTokenizer
 from auto_gptq import AutoGPTQForCausalLM
 # Download the model from HF and store it locally, then reference its location here:
-quantized_model_dir = "/path/to/falcon40b-instruct-gptq"
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
@@ -108,13 +112,13 @@ print(tokenizer.decode(output[0]))
 ## Provided files
-**gptq_model-4bit--1g.safetensors**
 This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)
 It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.
-* `gptq_model-4bit--1g.safetensors`
   * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
     * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
   * Works with text-generation-webui using `--autogptq --trust-remote-code`
@@ -326,3 +330,4 @@ Falcon-40B-Instruct is made available under the [TII Falcon LLM License](https:/
 ## Contact
 falconllm@tii.ae

 </div>
 <!-- header end -->
+# Falcon-40B-Instruct 3bit GPTQ
+This repo contains an experimental GPTQ 3bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
+It is the result of quantising to 3bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
 ## EXPERIMENTAL
 It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.
+This is a 3bit model intended to be loadable in 24GB of VRAM. In my testing so far it has not exceeded 24GB of VRAM when returning up to 512 tokens. It may exceed 24GB beyond that.
+Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.
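As a rough sanity check on the two figures above, the arithmetic can be sketched as follows. The parameter count of 40e9 is an assumption taken from the "40B" in the model name, and the size estimate covers the packed weights only, ignoring quantisation metadata, activations and the KV cache, so real VRAM usage will be higher.

```python
# Back-of-envelope estimates only; these are not measurements.
PARAMS = 40e9            # assumed parameter count, taken from the "40B" in the model name
TOKENS_PER_SECOND = 0.7  # generation speed quoted above


def weight_size_gb(bits: int, params: float = PARAMS) -> float:
    """Size of the packed weights in GB (10^9 bytes), ignoring all overhead."""
    return params * bits / 8 / 1e9


def generation_minutes(new_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Minutes needed to return `new_tokens` tokens at a constant speed."""
    return new_tokens / tokens_per_second / 60


print(f"3bit weights: ~{weight_size_gb(3):.0f} GB")
print(f"4bit weights: ~{weight_size_gb(4):.0f} GB")
print(f"512 tokens:   ~{generation_minutes(512):.0f} minutes")
```

Under these assumptions the 3bit weights come to about 15 GB against about 20 GB at 4bit, and a 512-token response takes roughly 12 minutes at 0.7 tokens/s.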
 ## AutoGPTQ
 AutoGPTQ is required: `pip install auto-gptq`
 1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
 2. Click the **Model tab**.
+3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-3bit-GPTQ`.
 4. Click **Download**.
 5. Wait until it says it's finished downloading.
 6. Click the **Refresh** icon next to **Model** in the top left.
+7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-3bit-GPTQ`.
 8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
 ## About `trust_remote_code`
 from auto_gptq import AutoGPTQForCausalLM
 # Download the model from HF and store it locally, then reference its location here:
+quantized_model_dir = "/path/to/falcon40b-instruct-3bit-gptq"
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
 ## Provided files
+**gptq_model-3bit--1g.safetensors**
 This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)
 It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.
+* `gptq_model-3bit--1g.safetensors`
   * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
     * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
   * Works with text-generation-webui using `--autogptq --trust-remote-code`
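
The file name appears to encode the quantisation settings: `3bit` for the bit width and `-1g` for a group size of -1, which matches "created without groupsize" above. A minimal sketch of reading those values back, assuming the `gptq_model-<bits>bit-<group_size>g` pattern seen in the file names here. The helper `parse_gptq_basename` is hypothetical and not part of AutoGPTQ.

```python
import re

# Hypothetical helper; assumes the "gptq_model-<bits>bit-<group_size>g" naming pattern.
_PATTERN = re.compile(r"^gptq_model-(\d+)bit-(-?\d+)g(?:\.safetensors)?$")


def parse_gptq_basename(filename: str) -> dict:
    """Return the bit width and group size encoded in a GPTQ file name."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognised GPTQ file name: {filename!r}")
    bits, group_size = int(match.group(1)), int(match.group(2))
    # A group size of -1 is taken here to mean "no grouping".
    return {"bits": bits, "group_size": group_size, "grouped": group_size != -1}


print(parse_gptq_basename("gptq_model-3bit--1g.safetensors"))
# → {'bits': 3, 'group_size': -1, 'grouped': False}
```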
 ## Contact
 falconllm@tii.ae