Commit 2f30b2f · TheBloke · 1 Parent(s): 31893ec

Update README.md

  1. README.md +42 -43
README.md CHANGED
@@ -43,6 +43,21 @@ This repo contains GPTQ model files for [Technology Innovation Institute's Falco
 
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
 
  <!-- description end -->
  <!-- repositories-available start -->
  ## Repositories available
@@ -53,11 +68,10 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
  <!-- repositories-available end -->
 
  <!-- prompt-template start -->
- ## Prompt template: None
 
  ```
  {prompt}
-
  ```
 
  <!-- prompt-template end -->
@@ -86,9 +100,9 @@ All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches
 
  | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
- | main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 10.00 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
- | gptq-3bit-128g-actorder_True | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 9.98 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False. |
- | gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 9.93 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
  | gptq-4bit--1g-actorder_True | 4 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 92.74 GB | No | 4-bit, with Act Order. No group size, to lower VRAM requirements. |
 
  <!-- README_GPTQ.md-provided-files end -->
@@ -106,22 +120,25 @@ git clone --single-branch --branch main https://huggingface.co/TheBloke/Falcon-1
  <!-- README_GPTQ.md-text-generation-webui start -->
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
  It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
 
  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/Falcon-180B-GPTQ`.
- - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:main`
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
  4. The model will start downloading. Once it's finished it will say "Done".
- 5. In the top left, click the refresh icon next to **Model**.
- 6. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`
- 7. The model will automatically load, and is now ready for use!
- 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
  * Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
- 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
  <!-- README_GPTQ.md-text-generation-webui end -->
 
  <!-- README_GPTQ.md-use-from-python start -->
@@ -129,54 +146,35 @@ It is strongly recommended to use the text-generation-webui one-click-installers
 
  ### Install the necessary packages
 
- Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
 
  ```shell
- pip3 install transformers>=4.32.0 optimum>=1.12.0
  pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
  ```
 
- If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
-
- ```shell
- pip3 uninstall -y auto-gptq
- git clone https://github.com/PanQiWei/AutoGPTQ
- cd AutoGPTQ
- pip3 install .
- ```
-
- ### For CodeLlama models only: you must use Transformers 4.33.0 or later.
-
- If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
- ```shell
- pip3 uninstall -y transformers
- pip3 install git+https://github.com/huggingface/transformers.git
- ```
-
- ### You can then use the following code
 
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 
  model_name_or_path = "TheBloke/Falcon-180B-GPTQ"
  # To use a different branch, change revision
- # For example: revision="main"
  model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
  device_map="auto",
- trust_remote_code=False,
  revision="main")
 
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
  prompt = "Tell me about AI"
- prompt_template=f'''{prompt}
-
- '''
 
  print("\n\n*** Generate:")
 
  input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
- output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
  print(tokenizer.decode(output[0]))
 
  # Inference can also be done using transformers' pipeline
@@ -187,11 +185,10 @@ pipe = pipeline(
  model=model,
  tokenizer=tokenizer,
  max_new_tokens=512,
- do_sample=True,
  temperature=0.7,
  top_p=0.95,
- top_k=40,
- repetition_penalty=1.1
  )
 
  print(pipe(prompt_template)[0]['generated_text'])
@@ -201,11 +198,13 @@ print(pipe(prompt_template)[0]['generated_text'])
  <!-- README_GPTQ.md-compatibility start -->
  ## Compatibility
 
- The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
 
- [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
 
- [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
  <!-- README_GPTQ.md-compatibility end -->
 
  <!-- footer start -->
 
 
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
 
+ ## Requirements
+
+ Transformers version 4.33.0 is required.
+
+ Due to the huge size of the model, the GPTQ has been sharded. This will break compatibility with AutoGPTQ, and therefore any clients/libraries that use AutoGPTQ directly.
+
+ But they work great loaded directly through Transformers - and can be served using Text Generation Inference!
+
+ ## Compatibility
+
+ Currently these GPTQs are known to work with:
+ - Transformers 4.33.0
+ - [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) version 1.0.4
+ - Docker container: `ghcr.io/huggingface/text-generation-inference:latest`
+
  <!-- description end -->
  <!-- repositories-available start -->
  ## Repositories available
 
  <!-- repositories-available end -->
 
  <!-- prompt-template start -->
+ ## Prompt template: None (base model)
 
  ```
  {prompt}
  ```
 
  <!-- prompt-template end -->
 
 
  | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
+ | main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 94.5 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
+ | gptq-3bit-128g-actorder_True | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 73.81 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False. |
+ | gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 70.54 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
  | gptq-4bit--1g-actorder_True | 4 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 92.74 GB | No | 4-bit, with Act Order. No group size, to lower VRAM requirements. |
 
  <!-- README_GPTQ.md-provided-files end -->
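
As a rough sanity check on the updated Size column, the effective bits per weight of each branch can be estimated from its file size. The sketch below assumes roughly 180 billion parameters (inferred from the model name, not stated in this diff) and 1 GB = 10^9 bytes; the helper name `bits_per_weight` is hypothetical.

```python
# Sizes (GB) copied from the updated Provided Files table.
BRANCH_SIZES_GB = {
    "main": 94.5,
    "gptq-3bit-128g-actorder_True": 73.81,
    "gptq-3bit--1g-actorder_True": 70.54,
    "gptq-4bit--1g-actorder_True": 92.74,
}

# Assumption: ~180e9 parameters, inferred from the name "Falcon-180B".
ASSUMED_PARAMS = 180e9


def bits_per_weight(size_gb: float, n_params: float = ASSUMED_PARAMS) -> float:
    """Estimate the average stored bits per parameter (assumes 1 GB = 1e9 bytes)."""
    return size_gb * 1e9 * 8 / n_params


if __name__ == "__main__":
    for branch, size_gb in BRANCH_SIZES_GB.items():
        print(f"{branch}: {bits_per_weight(size_gb):.2f} bits/weight")
```

Under these assumptions each estimate lands slightly above the branch's nominal bit width, which is plausible given per-group metadata and any layers left unquantized. A file of about 10 GB would imply well under one bit per weight, so the earlier size figures were probably not real measurements.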
 
  <!-- README_GPTQ.md-text-generation-webui start -->
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
+ **NOTE**: I have not tested this model with Text Generation Webui. It *should* work through the Transformers Loader. It will *not* work through the AutoGPTQ loader, due to the files being sharded.
+
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
  It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
 
  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/Falcon-180B-GPTQ`.
+ - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:gptq-3bit-128g-actorder_True`
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
  4. The model will start downloading. Once it's finished it will say "Done".
+ 5. Choose Loader: Transformers
+ 6. In the top left, click the refresh icon next to **Model**.
+ 7. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`
+ 8. The model will automatically load, and is now ready for use!
+ 9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
  * Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
+ 10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
  <!-- README_GPTQ.md-text-generation-webui end -->
 
  <!-- README_GPTQ.md-use-from-python start -->
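
The `repo:branch` form used in step 2 corresponds to the `revision` argument that Transformers' `from_pretrained` accepts. A minimal sketch of that mapping, using a hypothetical helper `split_repo_spec` that is not part of text-generation-webui or Transformers:

```python
from typing import Tuple


def split_repo_spec(spec: str, default_revision: str = "main") -> Tuple[str, str]:
    """Split 'owner/name:branch' into (repo_id, revision).

    Hypothetical helper for illustration only; falls back to
    default_revision when no branch is given after the colon.
    """
    repo_id, sep, revision = spec.partition(":")
    if not repo_id:
        raise ValueError(f"missing repository id in {spec!r}")
    return repo_id, (revision if sep and revision else default_revision)


if __name__ == "__main__":
    repo_id, revision = split_repo_spec(
        "TheBloke/Falcon-180B-GPTQ:gptq-3bit-128g-actorder_True"
    )
    print(repo_id, revision)
```

The resulting pair could then be passed on as `AutoModelForCausalLM.from_pretrained(repo_id, revision=revision, device_map="auto")`.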
 
 
  ### Install the necessary packages
 
+ Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ.
 
  ```shell
+ pip3 install "transformers>=4.33.0" "optimum>=1.12.0"
  pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
  ```
 
+ ### Transformers sample code
 
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 
  model_name_or_path = "TheBloke/Falcon-180B-GPTQ"
+
  # To use a different branch, change revision
+ # For example: revision="gptq-3bit-128g-actorder_True"
  model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
  device_map="auto",
  revision="main")
 
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
  prompt = "Tell me about AI"
+ prompt_template=f'''{prompt}'''
 
  print("\n\n*** Generate:")
 
  input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
+ output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
  print(tokenizer.decode(output[0]))
 
  # Inference can also be done using transformers' pipeline
 
  model=model,
  tokenizer=tokenizer,
  max_new_tokens=512,
  temperature=0.7,
+ do_sample=True,
  top_p=0.95,
+ repetition_penalty=1.15
  )
 
  print(pipe(prompt_template)[0]['generated_text'])
 
  <!-- README_GPTQ.md-compatibility start -->
  ## Compatibility
 
+ The provided files have been tested with Transformers 4.33.0 and TGI 1.0.4.
+
+ Because they are sharded, they will not yet work via AutoGPTQ. It is hoped support will be added soon.
 
+ Note: lack of support for AutoGPTQ doesn't affect your ability to load these models from Python code. It only affects third-party clients that might use AutoGPTQ.
 
+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is confirmed working as of version 1.0.4.
  <!-- README_GPTQ.md-compatibility end -->
 
  <!-- footer start -->
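
For serving through TGI, a request to a running server might look like the sketch below. It assumes a TGI instance is already serving this model at the hypothetical address `http://127.0.0.1:8080`, and it reuses the sampling values from the Transformers pipeline sample. `build_generate_payload` and `query_tgi` are illustrative names, not part of TGI.

```python
import json
import urllib.request


def build_generate_payload(prompt: str, max_new_tokens: int = 512) -> dict:
    """Build a JSON body for TGI's /generate endpoint."""
    return {
        "inputs": prompt,  # base model: the prompt is sent without a template
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "temperature": 0.7,
            "top_p": 0.95,
            "repetition_penalty": 1.15,
        },
    }


def query_tgi(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """POST the payload to a running TGI server (base_url is an assumption)."""
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]


if __name__ == "__main__":
    # Only prints the request body; call query_tgi() once a server is running.
    print(json.dumps(build_generate_payload("Tell me about AI"), indent=2))
```

Keeping the payload builder separate from the network call makes the request body easy to inspect before a server is available.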