TheBloke committed on
Commit
9719fa8
1 Parent(s): e9a9b29

Initial GPTQ model commit

Files changed (1):
  1. README.md +15 -17
README.md CHANGED
@@ -1,14 +1,4 @@
 ---
-extra_gated_button_content: Submit
-extra_gated_description: This is a form to enable access to Llama 2 on Hugging Face
-  after you have been granted access from Meta. Please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads)
-  and accept our license terms and acceptable use policy before submitting this form.
-  Requests will be processed in 1-2 days.
-extra_gated_fields:
-  ? I agree to share my name, email address and username with Meta and confirm that
-    I have already been granted download access on the Meta website
-  : checkbox
-extra_gated_heading: Access Llama 2 on Hugging Face
 inference: false
 language:
 - en
@@ -39,7 +29,7 @@ tags:
 
 # Meta's Llama 2 13B GPTQ
 
-These files are GPTQ model files for [Meta's Llama 2 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf).
+These files are GPTQ model files for [Meta's Llama 2 13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf).
 
 Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
 
@@ -47,12 +37,14 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
 
 * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ)
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-13B-GGML)
-* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-13b-hf)
+* [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-13B-hf)
 
-## Prompt template: None
+## Prompt template: Llama-2-Chat
 
 ```
-{prompt}
+SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
+USER: {prompt}
+ASSISTANT:
 ```
 
 ## Provided files
@@ -67,6 +59,10 @@ Each separate quant is in a different branch. See below for instructions on fet
 | gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
 | gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
 | gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-128g-actorder_True | 8 | 128 | True | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-64g-actorder_True | 8 | 64 | True | 13.95 GB | False | AutoGPTQ | 8-bit, with group size 64g and Act Order for maximum inference quality. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
+| gptq-8bit--1g-actorder_True | 8 | None | True | 13.36 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
 
 ## How to download from branches
 
@@ -116,7 +112,7 @@ use_triton = False
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-        model_basename=model_basename
+        model_basename=model_basename,
         use_safetensors=True,
         trust_remote_code=True,
         device="cuda:0",
@@ -136,7 +132,9 @@ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
 """
 
 prompt = "Tell me about AI"
-prompt_template=f'''{prompt}
+prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
+USER: {prompt}
+ASSISTANT:
 '''
 
 print("\n\n*** Generate:")
@@ -201,7 +199,7 @@ Thank you to all my generous patrons and donaters!
 # Original model card: Meta's Llama 2 13B
 
 # **Llama 2**
-Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 13B pretrained model, converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom.
+Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 13B fine-tuned model, optimized for dialogue use cases and converted for the Hugging Face Transformers format. Links to other models can be found in the index at the bottom.
 
 ## Model Details
 *Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the [website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and accept our License before requesting access here.*
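
The Bits / GS / AO columns in the Provided Files table correspond to AutoGPTQ quantisation parameters. A minimal sketch of how one row (4-bit, group size 128, Act Order) maps onto `auto_gptq`'s `BaseQuantizeConfig`; the variable name is illustrative and not taken from the commit:

```python
from auto_gptq import BaseQuantizeConfig

# One table row expressed as AutoGPTQ quantisation parameters:
# Bits = 4, GS (group size) = 128, Act Order = True.
quantize_config = BaseQuantizeConfig(
    bits=4,          # quantisation bit width (4 or 8 in this repo)
    group_size=128,  # -1 means "no group size", as in the "-1g" branches
    desc_act=True,   # Act Order (descending activation order)
)
```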
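Each quant listed in the table lives in its own branch of the repo; the "How to download from branches" section itself is outside the changed hunks. A minimal sketch of fetching one of the branches added in this commit with `huggingface_hub`, assuming the local directory name, which is illustrative:

```python
from huggingface_hub import snapshot_download

# Pass the branch name as `revision` to pull a specific quantisation.
snapshot_download(
    repo_id="TheBloke/Llama-2-13B-GPTQ",
    revision="gptq-8bit-128g-actorder_True",  # branch holding the desired quant
    local_dir="Llama-2-13B-GPTQ-8bit-128g",   # illustrative local path
)
```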
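The Python example in the diff stops at `print("\n\n*** Generate:")`; the generation call is not part of the changed hunks. A minimal sketch of how the loaded `model`, `tokenizer`, and the new `prompt_template` might be used, with sampling settings that are illustrative rather than taken from the commit:

```python
# Tokenize the filled-in Llama-2-Chat template and move it to the GPU the model was loaded on.
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.to("cuda:0")

# Generate a completion; temperature and max_new_tokens are example values.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```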