Suparious committed
Commit 6a276d6
1 Parent(s): 459a6cb

Provide usage example

Files changed (1)
  1. README.md +61 -0
README.md CHANGED
@@ -1,4 +1,19 @@
  ---
+ tags:
+ - finetuned
+ - quantized
+ - 4-bit
+ - AWQ
+ - transformers
+ - pytorch
+ - mistral
+ - text-generation
+ - conversational
+ - license:apache-2.0
+ - autotrain_compatible
+ - endpoints_compatible
+ - text-generation-inference
+ - region:us
  base_model: senseable/WestLake-7B-v2
  license: apache-2.0
  language:
@@ -37,6 +52,52 @@ This repo contains AWQ model files for [Common Sense's WestLake 7B v2](https://h

  These files were quantised using hardware kindly provided by [SolidRusT Networks](https://solidrust.net/).

+ ## How to use
+
+ ### Install the necessary packages
+
+ ```bash
+ pip install --upgrade autoawq autoawq-kernels
+ ```
+
+ ### Example Python code
+
+ ```python
+ from awq import AutoAWQForCausalLM
+ from transformers import AutoTokenizer, TextStreamer
+
+ # Local path to the quantised model (or its Hugging Face repo id)
+ quant_path = "/srv/home/shaun/repos/WestLake-7B-v2-AWQ"
+
+ # Load the quantised model and its tokenizer
+ model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
+ tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
+
+ # Stream decoded tokens to stdout as they are generated
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+
+ # Convert the prompt to tokens
+ prompt_template = """\
+ <|system|>
+ </s>
+ <|user|>
+ {prompt}</s>
+ <|assistant|>"""
+
+ prompt = "You're standing on the surface of the Earth. " \
+     "You walk one mile south, one mile west and one mile north. " \
+     "You end up exactly where you started. Where are you?"
+
+ tokens = tokenizer(prompt_template.format(prompt=prompt),
+                    return_tensors='pt').input_ids.cuda()
+
+ # Generate output; the streamer prints the response as it is produced
+ generation_output = model.generate(tokens,
+                                    streamer=streamer,
+                                    max_new_tokens=512)
+ ```
+
  ### About AWQ

  AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method that currently supports 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with quality that matches or exceeds the most commonly used GPTQ settings.
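
For background, AWQ files like these are typically produced with AutoAWQ's quantisation API. The sketch below is illustrative, not taken from this commit: the output directory name is hypothetical, and the quant_config values are AutoAWQ's commonly used 4-bit defaults.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Base (unquantised) model named in this model card
model_path = "senseable/WestLake-7B-v2"
# Hypothetical output directory for the quantised files
quant_path = "WestLake-7B-v2-AWQ"

# Common 4-bit AWQ settings (assumed; this commit does not record them)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantise the weights to 4 bits
model.quantize(tokenizer, quant_config=quant_config)

# Save the 4-bit weights and tokenizer for later use with from_quantized()
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Quantisation needs the full-precision weights and a calibration pass, so it is a one-off, GPU-heavy step; inference afterwards only loads the saved 4-bit files, which is what the usage example above does.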