TheBloke committed on
Commit cf55091
1 Parent(s): 253614e

Upload new GPTQs with varied parameters

Files changed (1):
  1. README.md +50 -18
README.md CHANGED
@@ -1,6 +1,7 @@
  ---
  inference: false
  license: other
+ model_type: llama
  ---
 
  <!-- header start -->
@@ -19,13 +20,15 @@ license: other
 
  # Henk717's Airochronos 33B GPTQ
 
- These files are GPTQ 4bit model files for [Henk717's Airochronos 33B](https://huggingface.co/Henk717/airochronos-33B).
+ These files are GPTQ model files for [Henk717's Airochronos 33B](https://huggingface.co/Henk717/airochronos-33B).
 
- It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
+ Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
+
+ These models were quantised using hardware kindly provided by [Latitude.sh](https://www.latitude.sh/accelerate).
 
  ## Repositories available
 
- * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/airochronos-33B-GPTQ)
+ * [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/airochronos-33B-GPTQ)
  * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/airochronos-33B-GGML)
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/Henk717/airochronos-33B)
 
@@ -39,6 +42,32 @@ Below is an instruction that describes a task. Write a response that appropriate
  ### Response:
  ```
 
+ ## Provided files
+
+ Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
+
+ Each separate quant is in a different branch. See below for instructions on fetching from different branches.
+
+ | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
+ | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
+ | main | 4 | None | True | 16.94 GB | True | GPTQ-for-LLaMa | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
+ | gptq-4bit-32g-actorder_True | 4 | 32 | True | 19.44 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-64g-actorder_True | 4 | 64 | True | 18.18 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-128g-actorder_True | 4 | 128 | True | 17.55 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-8bit--1g-actorder_True | 8 | None | True | 32.99 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
+ | gptq-8bit-128g-actorder_False | 8 | 128 | False | 33.73 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
+ | gptq-3bit--1g-actorder_True | 3 | None | True | 12.92 GB | False | AutoGPTQ | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
+ | gptq-3bit-128g-actorder_False | 3 | 128 | False | 13.51 GB | False | AutoGPTQ | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
+
+ ## How to download from branches
+
+ - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/airochronos-33B-GPTQ:gptq-4bit-32g-actorder_True`
+ - With Git, you can clone a branch with:
+ ```
+ git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/airochronos-33B-GPTQ
+ ```
+ - In Python Transformers code, the branch is the `revision` parameter; see below.
+
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
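The new "How to download from branches" section above covers text-generation-webui and `git clone`, and points at the `revision` parameter for Python further down. As a minimal sketch of fetching a single branch up front (an editor's illustration, not part of the README: the `huggingface_hub` call, branch name and target directory are assumptions):

```python
# Sketch only: fetch one quantisation branch with huggingface_hub rather than git.
# The branch and local_dir values are arbitrary examples.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/airochronos-33B-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # any branch from the Provided Files table
    local_dir="airochronos-33B-GPTQ",
)
```

If `local_dir` is omitted, the files land in the local Hugging Face cache and the function returns that cache path.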
@@ -47,6 +76,8 @@ It is strongly recommended to use the text-generation-webui one-click-installers
 
  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/airochronos-33B-GPTQ`.
+   - To download from a specific branch, enter for example `TheBloke/airochronos-33B-GPTQ:gptq-4bit-32g-actorder_True`
+   - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
  4. The model will start downloading. Once it's finished it will say "Done"
  5. In the top left, click the refresh icon next to **Model**.
@@ -76,20 +107,31 @@ use_triton = False
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
  model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-         model_basename=model_basename,
+         model_basename=model_basename,
          use_safetensors=True,
          trust_remote_code=False,
          device="cuda:0",
          use_triton=use_triton,
          quantize_config=None)
 
+ """
+ To download from a specific branch, use the revision parameter, as in this example:
+
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+         revision="gptq-4bit-32g-actorder_True",
+         model_basename=model_basename,
+         use_safetensors=True,
+         trust_remote_code=False,
+         device="cuda:0",
+         quantize_config=None)
+ """
+
  prompt = "Tell me about AI"
  prompt_template=f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.
 
  ### Instruction: {prompt}
 
  ### Response:
-
  '''
 
  print("\n\n*** Generate:")
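A side note on the `revision` example added in this hunk (again an editor's sketch, not text from the README): `from_pretrained` in `transformers` accepts the same `revision` keyword, so the tokenizer can be pinned to the chosen branch as well; the branch name below is only an example.

```python
# Sketch only: load the tokenizer from the same branch as the quantised model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/airochronos-33B-GPTQ",
    use_fast=True,
    revision="gptq-4bit-32g-actorder_True",  # example branch; "main" is the default
)
```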
@@ -117,21 +159,11 @@ pipe = pipeline(
  print(pipe(prompt_template)[0]['generated_text'])
  ```
 
- ## Provided files
-
- **airochronos-33b-GPTQ-4bit--1g.act.order.safetensors**
-
- This will work with ExLlama, AutoGPTQ, Occ4m's fork of GPTQ-for-LLaMa, and GPTQ-for-LLaMa. There are reports of issues with Triton mode of recent GPTQ-for-LLaMa but this is untested.
+ ## Compatibility
 
- It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.
+ The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
- * `airochronos-33b-GPTQ-4bit--1g.act.order.safetensors`
-   * Works with [ExLlama](https://github.com/turboderp/exllama), providing the best performance and lowest VRAM usage. Recommended.
-   * Works with AutoGPTQ in CUDA or Triton modes.
-   * Works with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/GPTQ-for-LLaMa).
-   * Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
-   * Works with text-generation-webui, including one-click-installers.
-   * Parameters: Groupsize = -1. Act Order / desc_act = True.
+ ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
 
  <!-- footer start -->
  ## Discord
 