TheBloke committed
Commit 57a67e1
Parent: 6b55f48

Initial GPTQ model commit

Files changed (1): README.md (+70, -58)
README.md CHANGED
@@ -8,7 +8,7 @@ language:
  license: llama2
  model_creator: OpenAssistant
  model_link: https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10
- model_name: CodeLlama 13B OASST SFT v10
+ model_name: CodeLlama 13B SFT v10
  model_type: llama
  quantized_by: TheBloke
  ---
@@ -30,23 +30,28 @@ quantized_by: TheBloke
  <hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
  <!-- header end -->

- # CodeLlama 13B OASST SFT v10 - GPTQ
+ # CodeLlama 13B SFT v10 - GPTQ
  - Model creator: [OpenAssistant](https://huggingface.co/OpenAssistant)
- - Original model: [CodeLlama 13B OASST SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
+ - Original model: [CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)

+ <!-- description start -->
  ## Description

- This repo contains GPTQ model files for [OpenAssistant's CodeLlama 13B OASST SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10).
+ This repo contains GPTQ model files for [OpenAssistant's CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10).

  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

+ <!-- description end -->
+ <!-- repositories-available start -->
  ## Repositories available

- * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ)
- * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGUF)
- * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GGML)
+ * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ)
+ * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGUF)
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGML)
  * [OpenAssistant's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
+ <!-- repositories-available end -->

+ <!-- prompt-template start -->
  ## Prompt template: ChatML

  ```
@@ -58,6 +63,9 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for

  ```

+ <!-- prompt-template end -->
+
+ <!-- README_GPTQ.md-provided-files start -->
  ## Provided files and GPTQ parameters

  Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
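
For reference, the ChatML template named above wraps every turn in `<|im_start|>` / `<|im_end|>` markers. A minimal sketch of assembling such a prompt in Python (the system and user messages here are placeholders, not part of the README):

```python
# Illustrative only: build a ChatML-style prompt string by hand.
system_message = "You are a helpful coding assistant."             # placeholder
user_message = "Write a Python function that reverses a string."   # placeholder

prompt = (
    f"<|im_start|>system\n{system_message}<|im_end|>\n"
    f"<|im_start|>user\n{user_message}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```

The same structure appears later in the README's own example code as `prompt_template`.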
@@ -71,7 +79,7 @@ All GPTQ files are made with AutoGPTQ.

  - Bits: The bit size of the quantised model.
  - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have issues with models that use Act Order plus Group Size.
+ - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
  - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
  - GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
  - Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
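
To make the table columns below concrete: Bits, GS, Act Order and Damp % correspond roughly to the fields of AutoGPTQ's `BaseQuantizeConfig`. A sketch, assuming AutoGPTQ is installed (values mirror the `main` branch row):

```python
from auto_gptq import BaseQuantizeConfig

# Rough mapping from the table columns to a quantisation config.
quantize_config = BaseQuantizeConfig(
    bits=4,            # "Bits" column
    group_size=128,    # "GS" column; -1 disables grouping ("None")
    desc_act=False,    # "Act Order" column (True on the actorder branches)
    damp_percent=0.1,  # "Damp %" column
)
```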
@@ -81,87 +89,89 @@ All GPTQ files are made with AutoGPTQ.

  | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
- | [main](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ/tree/main) | 4 | 128 | No | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.26 GB | Yes | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
- | [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
- | [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
- | [gptq-4bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ/tree/gptq-4bit-128g-actorder_True) | 4 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
- | [gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
- | [gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
+ | main | 4 | 128 | No | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.26 GB | Yes | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
+ | gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
+ | gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
+
+ <!-- README_GPTQ.md-provided-files end -->

+ <!-- README_GPTQ.md-download-from-branches start -->
  ## How to download from branches

- - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ:gptq-4bit-32g-actorder_True`
+ - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ:gptq-4bit-32g-actorder_True`
  - With Git, you can clone a branch with:
  ```
- git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ
+ git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ
  ```
  - In Python Transformers code, the branch is the `revision` parameter; see below.
-
+ <!-- README_GPTQ.md-download-from-branches end -->
+ <!-- README_GPTQ.md-text-generation-webui start -->
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

- It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
+ It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

  1. Click the **Model tab**.
- 2. Under **Download custom model or LoRA**, enter `TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ`.
- - To download from a specific branch, enter for example `TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ:gptq-4bit-32g-actorder_True`
+ 2. Under **Download custom model or LoRA**, enter `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ`.
+ - To download from a specific branch, enter for example `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ:gptq-4bit-32g-actorder_True`
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
- 4. The model will start downloading. Once it's finished it will say "Done"
+ 4. The model will start downloading. Once it's finished it will say "Done".
  5. In the top left, click the refresh icon next to **Model**.
- 6. In the **Model** dropdown, choose the model you just downloaded: `CodeLlama-13B-oasst-sft-v10-GPTQ`
+ 6. In the **Model** dropdown, choose the model you just downloaded: `CodeLlama-13B-OASST-SFT-v10-GPTQ`
  7. The model will automatically load, and is now ready for use!
  8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
  * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
  9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
+ <!-- README_GPTQ.md-text-generation-webui end -->

+ <!-- README_GPTQ.md-use-from-python start -->
  ## How to use this GPTQ model from Python code

- First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) 0.3.1 or later installed:
+ ### Install the necessary packages

- ```
- pip3 install auto-gptq
- ```
+ Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

- If you have problems installing AutoGPTQ, please build from source instead:
+ ```shell
+ pip3 install transformers>=4.32.0 optimum>=1.12.0
+ pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
  ```
+
+ If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
+
+ ```shell
  pip3 uninstall -y auto-gptq
  git clone https://github.com/PanQiWei/AutoGPTQ
  cd AutoGPTQ
  pip3 install .
  ```

- Then try the following example code:
-
- ```python
- from transformers import AutoTokenizer, pipeline, logging
- from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+ ### For CodeLlama models only: you must use Transformers 4.33.0 or later.

- model_name_or_path = "TheBloke/CodeLlama-13B-oasst-sft-v10-GPTQ"
+ If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
+ ```shell
+ pip3 uninstall -y transformers
+ pip3 install git+https://github.com/huggingface/transformers.git
+ ```

- use_triton = False
+ ### You can then use the following code

- tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-         use_safetensors=True,
-         trust_remote_code=False,
-         device="cuda:0",
-         use_triton=use_triton,
-         quantize_config=None)
+ model_name_or_path = "TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ"
+ # To use a different branch, change revision
+ # For example: revision="gptq-4bit-32g-actorder_True"
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
+         torch_dtype=torch.bfloat16,
+         device_map="auto",
+         revision="main")

- """
- # To download from a specific branch, use the revision parameter, as in this example:
- # Note that `revision` requires AutoGPTQ 0.3.1 or later!
-
- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-         revision="gptq-4bit-32g-actorder_True",
-         use_safetensors=True,
-         trust_remote_code=False,
-         device="cuda:0",
-         quantize_config=None)
- """
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

  prompt = "Tell me about AI"
  prompt_template=f'''<|im_start|>system
@@ -180,9 +190,6 @@ print(tokenizer.decode(output[0]))

  # Inference can also be done using transformers' pipeline

- # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
- logging.set_verbosity(logging.CRITICAL)
-
  print("*** Pipeline:")
  pipe = pipeline(
      "text-generation",
@@ -196,12 +203,17 @@ pipe = pipeline(

  print(pipe(prompt_template)[0]['generated_text'])
  ```
+ <!-- README_GPTQ.md-use-from-python end -->

+ <!-- README_GPTQ.md-compatibility start -->
  ## Compatibility

- The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
+ The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
+
+ [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

- ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
+ <!-- README_GPTQ.md-compatibility end -->

  <!-- footer start -->
  <!-- 200823 -->
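
The Compatibility text added above mentions Text Generation Inference (TGI). As a hedged sketch only: once a TGI server is already serving this model, it could be queried with the `text-generation` Python client (the local URL and generation parameters are assumptions):

```python
# Sketch: query a TGI server that is already hosting the GPTQ model.
from text_generation import Client

client = Client("http://127.0.0.1:8080")  # assumed local TGI endpoint
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nTell me about AI<|im_end|>\n"
    "<|im_start|>assistant\n"
)
response = client.generate(prompt, max_new_tokens=256)
print(response.generated_text)
```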
@@ -235,7 +247,7 @@ And thank you again to a16z for their generous grant.

  <!-- footer end -->

- # Original model card: OpenAssistant's CodeLlama 13B OASST SFT v10
+ # Original model card: OpenAssistant's CodeLlama 13B SFT v10

  # Open-Assistant CodeLlama 13B SFT v10
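
Finally, as a supplement to the README's "How to download from branches" section: a minimal sketch of fetching a single quantisation branch with `huggingface_hub` rather than Git (the destination directory is an assumption):

```python
# Sketch: download one GPTQ branch without cloning every branch of the repo.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ",
    revision="gptq-4bit-32g-actorder_True",        # branch from the Provided Files table
    local_dir="CodeLlama-13B-OASST-SFT-v10-GPTQ",  # assumed destination folder
)
print(local_path)
```

The resulting local folder can then be passed to `from_pretrained` in place of the repo id.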