Initial GPTQ model upload
README.md CHANGED
@@ -17,16 +17,11 @@ license: other
</div>
<!-- header end -->

-# Camel AI's CAMEL 33B Combined Data

-These files are GGML format model files for [Camel AI's CAMEL 33B Combined Data](https://huggingface.co/camel-ai/CAMEL-33B-Combined-Data), for use with clients and libraries such as:
-* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
-* [KoboldCpp](https://github.com/LostRuins/koboldcpp)
-* [ParisNeo/GPT4All-UI](https://github.com/ParisNeo/gpt4all-ui)
-* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
-* [ctransformers](https://github.com/marella/ctransformers)
## Repositories available
@@ -42,71 +37,92 @@ USER: prompt
ASSISTANT:
```

-<!-- compatibility_ggml start -->
-## Compatibility
-* GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
-* GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
-* GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
-* GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
-* GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
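
As a quick sanity check, those bits-per-weight figures follow from the super-block structure just described. A sketch of the arithmetic (the count of fp16 super-block scale values per type is an assumption based on the k-quant layout; each super-block holds 256 weights):

```python
# Reproduce the bpw figures quoted above from the block structure.
# Assumption: each 256-weight super-block carries its per-block
# scales (and mins) plus one or two fp16 super-block scale values.

def bpw(weight_bits: int, scale_bits: int, fp16_scales: int) -> float:
    """Bits per weight for one 256-weight super-block."""
    total_bits = 256 * weight_bits + scale_bits + 16 * fp16_scales
    return total_bits / 256

print(bpw(3, 16 * 6, 1))   # Q3_K: 16 blocks x 6-bit scales      -> 3.4375
print(bpw(4, 8 * 12, 2))   # Q4_K: 8 blocks x 6-bit scale + min  -> 4.5
print(bpw(6, 16 * 8, 1))   # Q6_K: 16 blocks x 8-bit scales      -> 6.5625
```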

-<!-- compatibility_ggml end -->

-## Provided files
-| Name | Quant method | Bits | Size | Max RAM required | Use case |
-| ---- | ---- | ---- | ---- | ---- | ----- |
-| camel-33B-combined-data.ggmlv3.q2_K.bin | q2_K | 2 | 13.71 GB | 16.21 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
-| camel-33B-combined-data.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 17.28 GB | 19.78 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| camel-33B-combined-data.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 15.72 GB | 18.22 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| camel-33B-combined-data.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 14.06 GB | 16.56 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
-| camel-33B-combined-data.ggmlv3.q4_0.bin | q4_0 | 4 | 18.30 GB | 20.80 GB | Original llama.cpp quant method, 4-bit. |
-| camel-33B-combined-data.ggmlv3.q4_1.bin | q4_1 | 4 | 20.33 GB | 22.83 GB | Original llama.cpp quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models. |
-| camel-33B-combined-data.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 19.62 GB | 22.12 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
-| camel-33B-combined-data.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 18.36 GB | 20.86 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
-| camel-33B-combined-data.ggmlv3.q5_0.bin | q5_0 | 5 | 22.37 GB | 24.87 GB | Original llama.cpp quant method, 5-bit. Higher accuracy, higher resource usage, and slower inference. |
-| camel-33B-combined-data.ggmlv3.q5_1.bin | q5_1 | 5 | 24.40 GB | 26.90 GB | Original llama.cpp quant method, 5-bit. Even higher accuracy and resource usage, and slower inference. |
-| camel-33B-combined-data.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 23.05 GB | 25.55 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
-| camel-33B-combined-data.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 22.40 GB | 24.90 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
-| camel-33B-combined-data.ggmlv3.q6_K.bin | q6_K | 6 | 26.69 GB | 29.19 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
-| camel-33B-combined-data.ggmlv3.q8_0.bin | q8_0 | 8 | 34.56 GB | 37.06 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

-**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
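
One pattern worth noting: every Max RAM figure in that table is simply the file size plus roughly 2.5 GB of working overhead, so requirements for any quant can be estimated directly:

```python
# Max RAM in the table above = file size + ~2.5 GB overhead (no GPU offload).
for size_gb in (13.71, 22.37, 34.56):   # q2_K, q5_0, q8_0 file sizes
    print(f"{size_gb:.2f} GB file -> ~{size_gb + 2.50:.2f} GB RAM")
# 13.71 GB file -> ~16.21 GB RAM
# 22.37 GB file -> ~24.87 GB RAM
# 34.56 GB file -> ~37.06 GB RAM
```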

-## How to run in `llama.cpp`

-I use the following command line; adjust for your tastes and needs:

-```
-./main -t 10 -ngl 32 -m camel-33B-combined-data.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
-```
-Change `-t 10` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
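
The same settings carry over to [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), one of the libraries listed at the top of this card. A minimal sketch, assuming a GGML-era release of llama-cpp-python (the `Llama` constructor and its parameters are that library's API, not something defined by this repo):

```python
from llama_cpp import Llama

# Mirror the ./main flags: -c 2048, -t 10, -ngl 32
llm = Llama(
    model_path="camel-33B-combined-data.ggmlv3.q5_0.bin",
    n_ctx=2048,       # context length (-c)
    n_threads=10,     # physical CPU cores (-t)
    n_gpu_layers=32,  # layers offloaded to VRAM (-ngl); 0 for CPU-only
)

output = llm(
    "### Instruction: Write a story about llamas\n### Response:",
    max_tokens=256,
    temperature=0.7,     # --temp
    repeat_penalty=1.1,  # --repeat_penalty
)
print(output["choices"][0]["text"])
```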
<!-- footer start -->
## Discord
</div>
<!-- header end -->

+# Camel AI's CAMEL 33B Combined Data GPTQ
+
+These files are GPTQ 4bit model files for [Camel AI's CAMEL 33B Combined Data](https://huggingface.co/camel-ai/CAMEL-33B-Combined-Data).
+
+It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
## Repositories available
ASSISTANT:
```

+## How to easily download and use this model in text-generation-webui
+
+Please make sure you're using the latest version of text-generation-webui.
+
+1. Click the **Model tab**.
+2. Under **Download custom model or LoRA**, enter `TheBloke/CAMEL-33B-Combined-Data-GPTQ`.
+3. Click **Download**.
+4. The model will start downloading. Once it's finished it will say "Done".
+5. In the top left, click the refresh icon next to **Model**.
+6. In the **Model** dropdown, choose the model you just downloaded: `CAMEL-33B-Combined-Data-GPTQ`.
+7. The model will automatically load, and is now ready for use!
+8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
+   * Note that you do not need to, and should not, set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
+9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
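
If you'd rather script the download than use step 2's UI field, the same files can be fetched with `huggingface_hub` (an assumption: that library is installed; this card itself only documents the in-UI route). A minimal sketch:

```python
from huggingface_hub import snapshot_download

# Fetch the whole GPTQ repo into text-generation-webui's models folder.
# The local_dir path assumes you run this from the text-generation-webui root.
snapshot_download(
    repo_id="TheBloke/CAMEL-33B-Combined-Data-GPTQ",
    local_dir="models/CAMEL-33B-Combined-Data-GPTQ",
)
```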

+## How to use this GPTQ model from Python code
+
+First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:
+
+`pip install auto-gptq`
+
+Then try the following example code:
+
+```python
+from transformers import AutoTokenizer, pipeline, logging
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+
+model_name_or_path = "TheBloke/CAMEL-33B-Combined-Data-GPTQ"
+model_basename = "camel-33B-combined-data-GPTQ-4bit--1g.act.order"
+
+use_triton = False
+
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+
+model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+        model_basename=model_basename,
+        use_safetensors=True,
+        trust_remote_code=False,
+        device="cuda:0",
+        use_triton=use_triton,
+        quantize_config=None)
+
+# Note: check that the prompt template is correct for this model.
+prompt = "Tell me about AI"
+prompt_template = f'''USER: {prompt}
+ASSISTANT:'''
+
+print("\n\n*** Generate:")
+
+input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
+output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
+print(tokenizer.decode(output[0]))
+
+# Inference can also be done using transformers' pipeline
+
+# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
+logging.set_verbosity(logging.CRITICAL)
+
+print("*** Pipeline:")
+pipe = pipeline(
+    "text-generation",
+    model=model,
+    tokenizer=tokenizer,
+    max_new_tokens=512,
+    temperature=0.7,
+    top_p=0.95,
+    repetition_penalty=1.15
+)
+
+print(pipe(prompt_template)[0]['generated_text'])
+```

+## Provided files
+
+**camel-33B-combined-data-GPTQ-4bit--1g.act.order.safetensors**
+
+This will work with AutoGPTQ and with CUDA versions of GPTQ-for-LLaMa. There are reports of issues with the Triton mode of recent GPTQ-for-LLaMa. If you have issues, please use AutoGPTQ instead.
+
+It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.
+
+* `camel-33B-combined-data-GPTQ-4bit--1g.act.order.safetensors`
+  * Works with AutoGPTQ in CUDA or Triton modes.
+  * Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
+  * Works with text-generation-webui, including one-click-installers.
+  * Parameters: Groupsize = -1. Act Order / desc_act = True. (See the sketch below.)
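
For reference, the Groupsize and Act Order values in that list are the same settings that `quantize_config.json` supplies to AutoGPTQ. A sketch of the equivalent `BaseQuantizeConfig` (the JSON's exact contents are not reproduced in this card, so treat the mapping as an assumption):

```python
from auto_gptq import BaseQuantizeConfig

# Assumed equivalent of this repo's quantize_config.json:
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantisation
    group_size=-1,   # no grouping, to lower VRAM requirements
    desc_act=True,   # act-order, to boost inference accuracy
)
```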
<!-- footer start -->
## Discord