---
license: cc
datasets:
- VMware/open-instruct-v1-oasst-dolly-hhrlhf
language:
- en
pipeline_tag: text-generation
---

# SearchUnify-ML/xgen-7b-8k-open-instruct-gptq

These are GPTQ 4-bit model files for [VMware's XGen 7B 8K Open Instruct](https://huggingface.co/VMware/xgen-7b-8k-open-instruct).

They are the result of quantizing the model to 4-bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
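
The quantization itself was done with GPTQ-for-LLaMa, but as a rough illustration of what an equivalent 4-bit, group-size-128 quantization looks like, here is a minimal sketch using AutoGPTQ's quantization API instead. The calibration text and output directory below are placeholders; a real run would use a proper calibration corpus.

```
# Minimal sketch, assuming AutoGPTQ; the original files were produced with GPTQ-for-LLaMa.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "VMware/xgen-7b-8k-open-instruct"
output_dir = "xgen-7b-8k-open-instruct-gptq"  # placeholder output path

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False, trust_remote_code=True)

# A single calibration example for brevity; use a larger calibration set in practice.
examples = [tokenizer("Field hockey is a team sport played with sticks and a small hard ball.")]

# 4-bit weights with group size 128, matching the gptq_model-4bit-128g checkpoint name.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config, trust_remote_code=True)
model.quantize(examples)
model.save_quantized(output_dir)
```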



# How to use this GPTQ model from Python code

First, make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:

```
pip install auto-gptq
```

Second, install tiktoken, which the XGen tokenizer requires:

```
pip install tiktoken
```

```
from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

use_triton = False

# The XGen tokenizer is loaded from the model repo and needs tiktoken, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)

# Load the 4-bit GPTQ checkpoint onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=False,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=use_triton)

# Note: check the prompt template is correct for this model.
prompt = "Explain the rules of field hockey to a novice."
prompt_template = f'''### Instruction: {prompt}
### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.3, max_new_tokens=512)
print(f"\n\n {tokenizer.decode(output[0]).split('### Response:')[1]}")
```
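
Optionally, the model and tokenizer loaded above can also be reused through a standard transformers text-generation pipeline. This is only a sketch: the sampling parameters below are illustrative rather than tuned values, and transformers may print a warning that the AutoGPTQ model class is not natively recognised for text-generation, which does not prevent generation.

```
# Optional: wrap the quantized model and tokenizer in a transformers pipeline.
# Sampling parameters here are illustrative defaults, not values from this card.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.3,
    repetition_penalty=1.15,
)

print(pipe(prompt_template)[0]["generated_text"])
```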