TheBloke committed on
Commit
ce48895
1 Parent(s): 3c9d0da

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +13 -8
README.md CHANGED
@@ -21,11 +21,11 @@ inference: false
 </div>
 <!-- header end -->
 
-# Falcon-40B-Instruct GPTQ
+# Falcon-40B-Instruct 3bit GPTQ
 
-This repo contains an experimental GPTQ 4bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
+This repo contains an experimental GPTQ 3bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
 
-It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
+It is the result of quantising to 3bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
 
 ## EXPERIMENTAL
 
@@ -33,6 +33,10 @@ Please note this is an experimental GPTQ model. Support for it is currently quit
 
 It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.
 
+This is a 3bit model, with the aim of being loadable on a 24GB VRAM card. In my testing so far it has not exceeded 24GB VRAM, at least up to 512 tokens returned. It may exceed 24GB beyond that.
+
+Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.
+
 ## AutoGPTQ
 
 AutoGPTQ is required: `pip install auto-gptq`
@@ -61,11 +65,11 @@ So please first update text-generation-webui to the latest version.
 
 1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
 2. Click the **Model tab**.
-3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-GPTQ`.
+3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-3bit-GPTQ`.
 4. Click **Download**.
 5. Wait until it says it's finished downloading.
 6. Click the **Refresh** icon next to **Model** in the top left.
-7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-GPTQ`.
+7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-3bit-GPTQ`.
 8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
 
 ## About `trust_remote_code`
@@ -91,7 +95,7 @@ from transformers import AutoTokenizer
 from auto_gptq import AutoGPTQForCausalLM
 
 # Download the model from HF and store it locally, then reference its location here:
-quantized_model_dir = "/path/to/falcon40b-instruct-gptq"
+quantized_model_dir = "/path/to/falcon40b-instruct-3bit-gptq"
 
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
@@ -108,13 +112,13 @@ print(tokenizer.decode(output[0]))
 
 ## Provided files
 
-**gptq_model-4bit--1g.safetensors**
+**gptq_model-3bit--1g.safetensors**
 
 This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)
 
 It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.
 
-* `gptq_model-4bit--1g.safetensors`
+* `gptq_model-3bit--1g.safetensors`
   * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
   * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
   * Works with text-generation-webui using `--autogptq --trust_remote_code`
@@ -326,3 +330,4 @@ Falcon-40B-Instruct is made available under the [TII Falcon LLM License](https:/
 
 ## Contact
 falconllm@tii.ae
+
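
The diff above only shows fragments of the README's AutoGPTQ example (the renamed `quantized_model_dir` and the tokenizer line). For context, here is a minimal sketch of how the renamed 3bit checkpoint might be loaded end-to-end with AutoGPTQ. The exact `from_quantized` arguments, the prompt, and the generation parameters are illustrative assumptions, not taken from this commit; the local path is a placeholder.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Placeholder: local directory where the 3bit GPTQ checkpoint was downloaded.
quantized_model_dir = "/path/to/falcon40b-instruct-3bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

# trust_remote_code is required because Falcon ships custom modelling code;
# use_safetensors matches the provided gptq_model-3bit--1g.safetensors file;
# use_triton=False because the README notes the Triton backend does not work yet.
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_triton=False,
    use_safetensors=True,
    trust_remote_code=True,
)

# Illustrative prompt; expect roughly 0.7 tokens/s on this 40B GPTQ model.
prompt = "Write a short poem about open source AI."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0]))
```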