---
license: llama2
---

## Quantized Deployment

To reduce the cost of running XuanYuan locally and to lower its GPU memory requirements, we provide pre-quantized 8-bit and 4-bit versions of the XuanYuan-13B-Chat model.

### 8-bit model:

For 8-bit quantization we use the bitsandbytes library, which is widely adopted in the community.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name_or_path = "/your/model/path"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True)
# The checkpoint is published pre-quantized; its saved quantization config
# lets from_pretrained load it in 8-bit without extra flags.
model = LlamaForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float16, device_map="auto")
# Prompt: "Question: Which dynasty was Li Shizhen from? Answer:"
inputs = tokenizer("问题:李时珍是哪一个朝代的人?回答:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
# Decode only the newly generated tokens, skipping the prompt
outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(outputs)
```
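
As a quick sanity check, you can print how much memory the loaded model occupies; `get_memory_footprint()` is a standard transformers model method. A minimal sketch, meant to be appended to the script above:

```python
# Report the loaded model's memory footprint in GiB, to confirm the
# 8-bit weights fit your GPU budget.
print(f"memory footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")
```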


### 4-bit model:

For 4-bit quantization we use the [auto-gptq](https://github.com/PanQiWei/AutoGPTQ) toolkit.

```python
import torch
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "/your/model/path"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True)
# Load the GPTQ-quantized 4-bit checkpoint
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path, torch_dtype=torch.float16, device_map="auto")
# Prompt: "Question: Which dynasty was Li Shizhen from? Answer:"
inputs = tokenizer("问题:李时珍是哪一个朝代的人?回答:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
# Decode only the newly generated tokens, skipping the prompt
outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(outputs)
```
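
For reference, this is roughly how a GPTQ checkpoint of this kind is produced with auto-gptq. The following is a minimal sketch, not our actual quantization recipe: the `BaseQuantizeConfig` values (bits=4, group_size=128) and the single calibration text are illustrative assumptions, and a real run needs a few hundred representative calibration examples.

```python
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "/your/model/path"  # the full-precision model
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True)

# Illustrative settings; the actual recipe for this checkpoint may differ.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_name_or_path, quantize_config)

# auto-gptq expects tokenized calibration examples; use a few hundred
# representative texts in practice, not just one.
examples = [tokenizer("李时珍是明代著名的医药学家。", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("/your/output/path")
```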


### Using the 4-bit model with vLLM:

Running the GPTQ 4-bit model through an ordinary Hugging Face inference script is too slow to be practical. Recent versions of vLLM, however, can load GPTQ and several other quantized model formats; with its optimized quantization kernels, PagedAttention, continuous batching, and request scheduling, vLLM can deliver at least a 10x improvement in inference throughput.

You can install the latest vLLM and use the following script to run our 4-bit quantized model:

```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
# quantization="gptq" tells vLLM to load the GPTQ 4-bit weights
llm = LLM(model="/your/model/path", quantization="gptq", dtype="float16")

# Prompt: "Question: Which era was Li Shizhen from? Answer:"
prompts = "问题:李时珍是哪一个时代的人?回答:"
result = llm.generate(prompts, sampling_params)
# Collect the generated text and token ids for each prompt
result_output = [[output.outputs[0].text, output.outputs[0].token_ids] for output in result]

print('generated_result', result_output[0])
```
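
The throughput gain quoted above comes largely from continuous batching, which kicks in when many prompts are in flight at once. As a sketch of that usage (the second prompt is an arbitrary illustration), pass `llm.generate` a list of prompts and vLLM will schedule them concurrently:

```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
llm = LLM(model="/your/model/path", quantization="gptq", dtype="float16")

# With a list of prompts, vLLM batches and schedules all requests
# concurrently (continuous batching), which drives the throughput gains.
prompts = [
    "问题:李时珍是哪一个时代的人?回答:",  # "Which era was Li Shizhen from?"
    "问题:本草纲目的作者是谁?回答:",      # illustrative second prompt
]
results = llm.generate(prompts, sampling_params)
for r in results:
    print(r.outputs[0].text)
```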