chienweichang committed e00e655 (1 parent: 25ea831)

Create README.md

---
license: apache-2.0
language:
- zh
library_name: transformers
quantized_by: chienweichang
---

# Breeze-7B-Instruct-64k-v0_1-AWQ

- Model creator: [MediaTek Research](https://huggingface.co/MediaTek-Research)
- Original model: [Breeze-7B-Instruct-64k-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1)

## Description

This repo contains AWQ model files for MediaTek Research's [Breeze-7B-Instruct-64k-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1).

### About AWQ

AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with quality equivalent to or better than the most commonly used GPTQ settings.

AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users: please use GGUF models instead.

AWQ models are supported by:

- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
- [vLLM](https://github.com/vllm-project/vllm) - version 0.2.2 or later, which supports all model types
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
- [Transformers](https://huggingface.co/docs/transformers) version 4.35.0 and later, from any code or client that supports Transformers
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code (see the sketch after this list)

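The original card does not include a standalone AutoAWQ example, so here is a minimal sketch of loading this repo directly with AutoAWQ. The `from_quantized` arguments and the `[INST]` prompt wrapping are assumptions based on AutoAWQ's usual API and the template used elsewhere in this card, not instructions from the model creator.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ"

# Load the AWQ-quantized weights and the matching tokenizer.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Wrap the question in the same [INST] ... [/INST] template used elsewhere in this card.
prompt = "[INST] 告訴我AI是什麼 [/INST]\n"  # "Tell me what AI is"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Stream the generated tokens to stdout.
model.generate(tokens, streamer=streamer, max_new_tokens=512)
```
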
<!-- description end -->
<!-- repositories-available start -->
## Repositories available

* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/audreyt/Breeze-7B-Instruct-64k-v0.1-GGUF)
<!-- repositories-available end -->

<!-- README_AWQ.md-use-from-vllm start -->
## Multi-user inference server: vLLM

Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).

- Please ensure you are using vLLM version 0.2 or later.
- When using vLLM as a server, pass the `--quantization awq` parameter.

For example:

```shell
python3 -m vllm.entrypoints.api_server --model chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ --quantization awq --dtype auto
```
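Once the server is up, you can send it requests over HTTP. The sketch below is not part of the original card; it assumes the demo `api_server`'s `/generate` endpoint on the default port 8000 and uses the `requests` package.

```python
# Minimal client sketch for the server started above (assumption: the demo
# api_server exposes POST /generate on port 8000 and returns a "text" field).
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "[INST] 台灣最高的山是哪座? [/INST]\n",  # "Which is the highest mountain in Taiwan?"
        "max_tokens": 128,
        "temperature": 0.8,
    },
)
print(response.json()["text"])
```
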
- When using vLLM from Python code, again set `quantization=awq`.

For example:

```python
from vllm import LLM, SamplingParams

prompts = [
    "告訴我AI是什麼",        # "Tell me what AI is"
    "(291 - 150) 是多少?",   # "What is (291 - 150)?"
    "台灣最高的山是哪座?",    # "Which is the highest mountain in Taiwan?"
]
prompt_template = '''[INST] {prompt} [/INST]
'''
prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ", quantization="awq", dtype="half", max_model_len=8196)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

<!-- README_AWQ.md-use-from-python start -->
## Inference from Python code using Transformers

### Install the necessary packages

- Requires: [Transformers](https://huggingface.co/docs/transformers) 4.37.0 or later.
- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.8 or later.

```shell
pip3 install --upgrade "autoawq>=0.1.8" "transformers>=4.37.0"
```

If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```

### Transformers example code (requires Transformers 4.37.0 and later)

```python
from transformers import AutoTokenizer, pipeline, TextStreamer, AutoModelForCausalLM

checkpoint = "chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ"
model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

streamer = TextStreamer(tokenizer, skip_prompt=True)

# Create a pipeline for text generation.
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=32768,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

# Inference is also possible via Transformers' pipeline.
print("pipeline output: ", text_generation_pipeline.predict("請問台灣最高的山是?"))  # "What is the highest mountain in Taiwan?"
```
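The pipeline above is called with a bare question. As a small follow-up sketch (not part of the original card), you could reuse it with the `[INST] ... [/INST]` template from the vLLM example; the variable names below are illustrative only.

```python
# Follow-up sketch: wrap the question in the instruct template before calling the
# pipeline (assumption: the template matches the one used in the vLLM example above).
prompt_template = "[INST] {prompt} [/INST]\n"
question = "請問台灣最高的山是?"  # "What is the highest mountain in Taiwan?"

result = text_generation_pipeline(prompt_template.format(prompt=question))
print(result[0]["generated_text"])
```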