Upload folder using huggingface_hub
README.md
CHANGED
@@ -21,11 +21,11 @@ inference: false
 </div>
 <!-- header end -->
 
-# Falcon-40B-Instruct GPTQ
+# Falcon-40B-Instruct 3bit GPTQ
 
-This repo contains an experimantal GPTQ
+This repo contains an experimantal GPTQ 3bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
 
-It is the result of quantising to
+It is the result of quantising to 3bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
 
 ## EXPERIMENTAL
 
@@ -33,6 +33,10 @@ Please note this is an experimental GPTQ model. Support for it is currently quit
 
 It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.
 
+This is a 3bit model with the aim of being loadable on a 24GB VRAM. In my testing so far it will not exceed 24GB VRAM at least up to 512 tokens returned. It may exceed 24GB beyond that.
+
+Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.
+
 ## AutoGPTQ
 
 AutoGPTQ is required: `pip install auto-gptq`
@@ -61,11 +65,11 @@ So please first update text-genration-webui to the latest version.
 
 1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
 2. Click the **Model tab**.
-3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-GPTQ`.
+3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-3bit-GPTQ`.
 4. Click **Download**.
 5. Wait until it says it's finished downloading.
 6. Click the **Refresh** icon next to **Model** in the top left.
-7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-GPTQ`.
+7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-3bit-GPTQ`.
 8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
 
 ## About `trust_remote_code`
@@ -91,7 +95,7 @@ from transformers import AutoTokenizer
 from auto_gptq import AutoGPTQForCausalLM
 
 # Download the model from HF and store it locally, then reference its location here:
-quantized_model_dir = "/path/to/falcon40b-instruct-gptq"
+quantized_model_dir = "/path/to/falcon40b-instruct-3bit-gptq"
 
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
@@ -108,13 +112,13 @@ print(tokenizer.decode(output[0]))
 
 ## Provided files
 
-**gptq_model-
+**gptq_model-3bit--1g.safetensors**
 
 This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)
 
 It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.
 
-* `gptq_model-
+* `gptq_model-3bit--1g.safetensors`
 * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
 * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
 * Works with text-generation-webui using `--autogptq --trust_remote_code`
@@ -326,3 +330,4 @@ Falcon-40B-Instruct is made available under the [TII Falcon LLM License](https:/
 
 ## Contact
 falconllm@tii.ae
+
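
Review note: the added README text claims the 3bit model fits on a 24GB card and generates at around 0.7 tokens/s. The sketch below is a rough sanity check of those two figures and not a measurement from this repo. It assumes a nominal parameter count of 40e9 (taken from the "40B" name) and counts only the quantised weights, ignoring activations, the KV cache, quantisation scales, and framework overhead. The function names are illustrative and not part of AutoGPTQ.

```python
# Back-of-envelope estimates only; not measurements from this repo.


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory for the quantised weights alone, in GiB.

    Ignores activations, KV cache, quantisation scales, and runtime overhead.
    """
    return n_params * bits_per_weight / 8 / 2**30


def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate n_tokens at a steady rate."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    N_PARAMS = 40e9  # assumption: nominal "40B" parameter count
    for bits in (3, 4):
        print(f"{bits}-bit weights: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
    # 512 returned tokens at the README's quoted ~0.7 tokens/s
    print(f"512 tokens: ~{generation_seconds(512, 0.7) / 60:.1f} min")
```

Under these assumptions the 3-bit weights come to roughly 14 GiB, which leaves headroom on a 24GB card for everything the estimate ignores. That is in line with the README's caveat that longer outputs may exceed 24GB.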