TheBloke committed on
Commit c95d222
1 Parent(s): 1c87be6

Initial GPTQ model upload

Files changed (1)
  1. README.md +73 -57
README.md CHANGED
@@ -17,16 +17,11 @@ license: other
  </div>
  <!-- header end -->

- # Camel AI's CAMEL 33B Combined Data GGML

- These files are GGML format model files for [Camel AI's CAMEL 33B Combined Data](https://huggingface.co/camel-ai/CAMEL-33B-Combined-Data).

- GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp) and libraries and UIs which support this format, such as:
- * [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- * [KoboldCpp](https://github.com/LostRuins/koboldcpp)
- * [ParisNeo/GPT4All-UI](https://github.com/ParisNeo/gpt4all-ui)
- * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- * [ctransformers](https://github.com/marella/ctransformers)

  ## Repositories available

@@ -42,71 +37,92 @@ USER: prompt
  ASSISTANT:
  ```

- <!-- compatibility_ggml start -->
- ## Compatibility

- ### Original llama.cpp quant methods: `q4_0, q4_1, q5_0, q5_1, q8_0`

- I have quantized these 'original' quantisation methods using an older version of llama.cpp so that they remain compatible with llama.cpp as of May 19th, commit `2d5db48`.

- These are guaranteed to be compatible with any UIs, tools and libraries released since late May.

- ### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`

- These new quantisation methods are compatible with llama.cpp as of June 6th, commit `2d43387`.

- They are now also compatible with recent releases of text-generation-webui, KoboldCpp, llama-cpp-python and ctransformers. Other tools and libraries may or may not be compatible - check their documentation if in doubt.

- ## Explanation of the new k-quant methods

- The new methods available are:
- * GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- * GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- * GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- * GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- * GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
- * GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.

- Refer to the Provided Files table below to see what files use which methods, and how.
- <!-- compatibility_ggml end -->

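For readers curious where the bpw figures above come from, here is a rough arithmetic sketch. It assumes each super-block holds 256 weights and carries an fp16 super-block scale, plus an fp16 super-block min for the "type-1" formats; those layout details are assumptions based on the descriptions above rather than the ggml source, but the results match the figures listed.

```python
# Rough bits-per-weight check for the k-quant formats described above.
# Assumed layout (not from this README): 256 weights per super-block, one fp16
# super-block scale, plus an fp16 super-block min for the "type-1" formats.

def bpw(weight_bits, blocks, scale_bits, type1):
    bits = 256 * weight_bits            # the quantized weights themselves
    bits += blocks * scale_bits         # per-block scales
    if type1:
        bits += blocks * scale_bits     # per-block mins ("type-1" only)
    bits += 16                          # fp16 super-block scale
    if type1:
        bits += 16                      # fp16 super-block min
    return bits / 256

print(bpw(3, blocks=16, scale_bits=6, type1=False))  # q3_K -> 3.4375
print(bpw(4, blocks=8,  scale_bits=6, type1=True))   # q4_K -> 4.5
print(bpw(5, blocks=8,  scale_bits=6, type1=True))   # q5_K -> 5.5
print(bpw(6, blocks=16, scale_bits=8, type1=False))  # q6_K -> 6.5625
```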
- ## Provided files
- | Name | Quant method | Bits | Size | Max RAM required | Use case |
- | ---- | ---- | ---- | ---- | ---- | ----- |
- | camel-33B-combined-data.ggmlv3.q2_K.bin | q2_K | 2 | 13.71 GB | 16.21 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
- | camel-33B-combined-data.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 17.28 GB | 19.78 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
- | camel-33B-combined-data.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 15.72 GB | 18.22 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
- | camel-33B-combined-data.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 14.06 GB | 16.56 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
- | camel-33B-combined-data.ggmlv3.q4_0.bin | q4_0 | 4 | 18.30 GB | 20.80 GB | Original llama.cpp quant method, 4-bit. |
- | camel-33B-combined-data.ggmlv3.q4_1.bin | q4_1 | 4 | 20.33 GB | 22.83 GB | Original llama.cpp quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than the q5 models. |
- | camel-33B-combined-data.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 19.62 GB | 22.12 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
- | camel-33B-combined-data.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 18.36 GB | 20.86 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
- | camel-33B-combined-data.ggmlv3.q5_0.bin | q5_0 | 5 | 22.37 GB | 24.87 GB | Original llama.cpp quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
- | camel-33B-combined-data.ggmlv3.q5_1.bin | q5_1 | 5 | 24.40 GB | 26.90 GB | Original llama.cpp quant method, 5-bit. Even higher accuracy, higher resource usage and slower inference. |
- | camel-33B-combined-data.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 23.05 GB | 25.55 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
- | camel-33B-combined-data.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 22.40 GB | 24.90 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
- | camel-33B-combined-data.ggmlv3.q6_K.bin | q6_K | 6 | 26.69 GB | 29.19 GB | New k-quant method. Uses GGML_TYPE_Q6_K - 6-bit quantization - for all tensors |
- | camel-33B-combined-data.ggmlv3.q8_0.bin | q8_0 | 8 | 34.56 GB | 37.06 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
-
- **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
-
- ## How to run in `llama.cpp`
-
- I use the following command line; adjust for your tastes and needs:
-
- ```
- ./main -t 10 -ngl 32 -m camel-33B-combined-data.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
- ```
- Change `-t 10` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
-
- Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
-
- If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
-
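If you would rather drive the model from Python than from the `./main` binary, llama-cpp-python (listed above as a compatible library) exposes equivalent settings. The sketch below is illustrative rather than part of this repo: it assumes the q5_0 file has been downloaded to the current directory, and `n_threads`/`n_gpu_layers` mirror the `-t`/`-ngl` flags above.

```python
# Illustrative sketch using llama-cpp-python; mirrors the ./main flags above.
# Assumes the q5_0 GGML file has been downloaded to the current directory.
from llama_cpp import Llama

llm = Llama(
    model_path="camel-33B-combined-data.ggmlv3.q5_0.bin",
    n_ctx=2048,        # -c 2048
    n_threads=10,      # -t 10: set to your number of physical CPU cores
    n_gpu_layers=32,   # -ngl 32: set to 0 if you have no GPU acceleration
)

output = llm(
    "### Instruction: Write a story about llamas\n### Response:",
    max_tokens=512,
    temperature=0.7,      # --temp 0.7
    repeat_penalty=1.1,   # --repeat_penalty 1.1
)
print(output["choices"][0]["text"])
```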
- ## How to run in `text-generation-webui`
-
- Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).
-
  <!-- footer start -->
  ## Discord

  </div>
  <!-- header end -->

+ # Camel AI's CAMEL 33B Combined Data GPTQ

+ These files are GPTQ 4bit model files for [Camel AI's CAMEL 33B Combined Data](https://huggingface.co/camel-ai/CAMEL-33B-Combined-Data).

+ It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).

  ## Repositories available

  ASSISTANT:
  ```

+ ## How to easily download and use this model in text-generation-webui

+ Please make sure you're using the latest version of text-generation-webui.

+ 1. Click the **Model tab**.
+ 2. Under **Download custom model or LoRA**, enter `TheBloke/CAMEL-33B-Combined-Data-GPTQ`.
+ 3. Click **Download**.
+ 4. The model will start downloading. Once it's finished it will say "Done".
+ 5. In the top left, click the refresh icon next to **Model**.
+ 6. In the **Model** dropdown, choose the model you just downloaded: `CAMEL-33B-Combined-Data-GPTQ`.
+ 7. The model will automatically load, and is now ready for use!
+ 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
+   * Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
+ 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!

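If you prefer to script the download rather than use the UI steps above, something like the following sketch with the `huggingface_hub` library should work. The `local_dir` value is only an example; point it at text-generation-webui's `models/` directory if you want the UI to find the model.

```python
# Sketch: download the GPTQ repo with huggingface_hub instead of the webui.
# The local_dir shown here is an example path, not something this repo requires.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/CAMEL-33B-Combined-Data-GPTQ",
    local_dir="models/TheBloke_CAMEL-33B-Combined-Data-GPTQ",
)
```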
+ ## How to use this GPTQ model from Python code

+ First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:

+ `pip install auto-gptq`

+ Then try the following example code:

+ ```python
+ from transformers import AutoTokenizer, pipeline, logging
+ from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+ import argparse
+
+ model_name_or_path = "TheBloke/CAMEL-33B-Combined-Data-GPTQ"
+ model_basename = "camel-33B-combined-data-GPTQ-4bit--1g.act.order"
+
+ use_triton = False
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+         model_basename=model_basename,
+         use_safetensors=True,
+         trust_remote_code=False,
+         device="cuda:0",
+         use_triton=use_triton,
+         quantize_config=None)
+
+ # Note: check the prompt template is correct for this model.
+ prompt = "Tell me about AI"
+ prompt_template = f'''### Human: {prompt}
+ ### Assistant:'''
+
+ print("\n\n*** Generate:")
+
+ input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
+ output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
+ print(tokenizer.decode(output[0]))
+
+ # Inference can also be done using transformers' pipeline
+
+ # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
+ logging.set_verbosity(logging.CRITICAL)
+
+ print("*** Pipeline:")
+ pipe = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     max_new_tokens=512,
+     temperature=0.7,
+     top_p=0.95,
+     repetition_penalty=1.15
+ )
+
+ print(pipe(prompt_template)[0]['generated_text'])
  ```

+ ## Provided files
+
+ **camel-33B-combined-data-GPTQ-4bit--1g.act.order.safetensors**

+ This will work with AutoGPTQ and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with Triton mode of recent GPTQ-for-LLaMa. If you have issues, please use AutoGPTQ instead.

+ It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.

+ * `camel-33B-combined-data-GPTQ-4bit--1g.act.order.safetensors`
+   * Works with AutoGPTQ in CUDA or Triton modes.
+   * Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
+   * Works with text-generation-webui, including one-click-installers.
+   * Parameters: Groupsize = -1. Act Order / desc_act = True.

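For reference, those parameters correspond roughly to the following AutoGPTQ quantisation config. This is a sketch for illustration only; as noted above, the values are normally read automatically from `quantize_config.json`, so you should not need to construct this by hand.

```python
# Illustration only: the quantisation parameters above expressed as an
# AutoGPTQ BaseQuantizeConfig. Normally these come from quantize_config.json.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,         # 4-bit GPTQ
    group_size=-1,  # no grouping, to lower VRAM requirements
    desc_act=True,  # --act-order, for better inference accuracy
)
```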
  <!-- footer start -->
  ## Discord