Initial GPTQ model commit
README.md CHANGED

license: llama2
model_creator: OpenAssistant
model_link: https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10
model_name: CodeLlama 13B SFT v10
model_type: llama
quantized_by: TheBloke
---

<hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
<!-- header end -->

# CodeLlama 13B SFT v10 - GPTQ
- Model creator: [OpenAssistant](https://huggingface.co/OpenAssistant)
- Original model: [CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)

<!-- description start -->
## Description

This repo contains GPTQ model files for [OpenAssistant's CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10).

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
<!-- description end -->

<!-- repositories-available start -->
## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGUF)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGML)
* [OpenAssistant's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
<!-- repositories-available end -->

<!-- prompt-template start -->
## Prompt template: ChatML

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

<!-- prompt-template end -->
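
As a quick illustration, a prompt in this format can be assembled with plain string formatting before the text is sent to the model. The sketch below is only an example; the system message and user prompt values are arbitrary placeholders:

```python
# Minimal sketch: build a ChatML prompt string for this model.
# The system message and user prompt below are illustrative placeholders.
def build_chatml_prompt(system_message: str, user_prompt: str) -> str:
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt_text = build_chatml_prompt(
    "You are a helpful coding assistant.",
    "Write a Python function that reverses a string.",
)
print(prompt_text)
```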

<!-- README_GPTQ.md-provided-files start -->
## Provided files and GPTQ parameters

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

All GPTQ files are made with AutoGPTQ.

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
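
For illustration only (this is not the exact script used to produce the files in this repo), the sketch below shows how Bits, GS, Act Order and Damp % map onto AutoGPTQ's `BaseQuantizeConfig`, and where the GPTQ dataset and sequence length enter as tokenised calibration samples:

```python
# Illustrative sketch of how the parameters above map onto an AutoGPTQ quantisation run.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "OpenAssistant/codellama-13b-oasst-sft-v10"
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,            # "Bits" column
    group_size=128,    # "GS" column; -1 means no grouping ("None")
    desc_act=True,     # "Act Order" column
    damp_percent=0.1,  # "Damp %" column
)

# "GPTQ dataset" / "Seq Len": calibration samples tokenised to the chosen sequence length.
calibration_texts = ["def reverse(s):\n    return s[::-1]\n"]  # placeholder; a real run uses many samples
examples = [tokenizer(text, truncation=True, max_length=8192) for text in calibration_texts]

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)
model.save_quantized("codellama-13b-oasst-sft-v10-gptq-4bit-128g")
```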

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | No | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.26 GB | Yes | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |

<!-- README_GPTQ.md-provided-files end -->
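
If you prefer to enumerate these quantisation branches programmatically, a small sketch using `huggingface_hub` (installed as a dependency of Transformers) looks like this:

```python
# Sketch: list the quantisation branches of this repo via the Hugging Face Hub API.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ")
for branch in refs.branches:
    print(branch.name)  # e.g. main, gptq-4bit-32g-actorder_True, ...
```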

<!-- README_GPTQ.md-download-from-branches start -->
## How to download from branches

- In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ:gptq-4bit-32g-actorder_True`
- With Git, you can clone a branch with:
```
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ
```
- In Python Transformers code, the branch is the `revision` parameter; see below.
<!-- README_GPTQ.md-download-from-branches end -->
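
As an alternative to Git, a branch can also be fetched from Python with `huggingface_hub`'s `snapshot_download`. A minimal sketch, where the local directory name is an arbitrary choice:

```python
# Sketch: download a specific quantisation branch without Git.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ",
    revision="gptq-4bit-32g-actorder_True",   # branch name from the table above
    local_dir="CodeLlama-13B-OASST-SFT-v10-GPTQ",  # arbitrary local path
)
```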

<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ`.
    - To download from a specific branch, enter for example `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ:gptq-4bit-32g-actorder_True`
    - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `CodeLlama-13B-OASST-SFT-v10-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
    * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
<!-- README_GPTQ.md-text-generation-webui end -->

<!-- README_GPTQ.md-use-from-python start -->
## How to use this GPTQ model from Python code

### Install the necessary packages

Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

```shell
pip3 install transformers>=4.32.0 optimum>=1.12.0
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```

If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
```

### For CodeLlama models only: you must use Transformers 4.33.0 or later

If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:

```shell
pip3 uninstall -y transformers
pip3 install git+https://github.com/huggingface/transformers.git
```
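
To confirm that the installed versions meet the requirements above, a quick check along these lines can help:

```python
# Sketch: print the installed versions of the packages required above.
from importlib.metadata import version

for package, minimum in [("transformers", "4.33.0"), ("optimum", "1.12.0"), ("auto-gptq", "0.4.2")]:
    print(f"{package}: {version(package)} (need >= {minimum})")
```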

### You can then use the following code

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto",
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Define the system message used in the ChatML template below
system_message = "You are a helpful coding assistant."
prompt = "Tell me about AI"
prompt_template=f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, max_new_tokens=512)  # add sampling settings (do_sample, temperature, top_p, ...) as desired
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512  # sampling settings can be added here too
)

print(pipe(prompt_template)[0]['generated_text'])
```
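
If you would rather see tokens printed as they are generated, Transformers' `TextStreamer` can be combined with the `model`, `tokenizer` and `prompt_template` defined above. A small sketch:

```python
# Sketch: stream tokens to stdout as they are generated by the model loaded above.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    inputs=tokenizer(prompt_template, return_tensors="pt").input_ids.cuda(),
    streamer=streamer,
    max_new_tokens=512,
)
```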
<!-- README_GPTQ.md-use-from-python end -->

<!-- README_GPTQ.md-compatibility start -->
## Compatibility

The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).

[ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

[Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
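
As a sketch of how that can look in practice: once a TGI server is serving this model, generation requests go to its standard `/generate` endpoint. The address below (`http://localhost:8080`) is just an assumed local deployment:

```python
# Sketch: query a running Text Generation Inference server hosting this model.
import requests

chatml_prompt = (
    "<|im_start|>user\nTell me about AI<|im_end|>\n<|im_start|>assistant\n"
)
response = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": chatml_prompt, "parameters": {"max_new_tokens": 512}},
    timeout=300,
)
print(response.json()["generated_text"])
```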
<!-- README_GPTQ.md-compatibility end -->

<!-- footer start -->
<!-- 200823 -->

<!-- footer end -->

# Original model card: OpenAssistant's CodeLlama 13B SFT v10

# Open-Assistant CodeLlama 13B SFT v10