Breeze-7B-Instruct-v1_0-AWQ
- Model creator: MediaTek Research
- Original model: Breeze-7B-Instruct-v1_0
Description
This repo contains AWQ model files for MediaTek Research's Breeze-7B-Instruct-v1_0.
About AWQ
AWQ is an efficient and accurate low-bit weight quantization method that currently supports 4-bit quantization. It offers faster Transformers-based inference than GPTQ, with quality equivalent to or better than the most commonly used GPTQ settings.
AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users: please use GGUF models instead.
It is supported by:
- Text Generation Webui - using Loader: AutoAWQ
- vLLM - version 0.2.2 or later is needed to support all model types.
- Hugging Face Text Generation Inference (TGI)
- Transformers version 4.35.0 and later, from any code or client that supports Transformers
- AutoAWQ - for use from Python code
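To make "4-bit weight quantization" concrete, the sketch below quantizes one small group of weights to integers in the range 0 to 15 and reconstructs them. It is plain round-to-nearest quantization, not the AWQ algorithm itself: AWQ additionally uses activation statistics to decide how weights are scaled before rounding, and that step is omitted here. The function names and the sample weights are illustrative assumptions.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights.

    Illustrative only: this is NOT the AWQ algorithm, just the basic
    low-bit storage scheme that such methods build on.
    """
    qmax = (1 << bits) - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0       # fall back to 1.0 for constant groups
    zero = round(-w_min / scale)                # integer that represents 0.0
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the stored integers back to approximate floating-point weights."""
    return [(v - zero) * scale for v in q]


weights = [-0.5, -0.1, 0.0, 0.2, 0.7]           # hypothetical sample values
q, scale, zero = quantize_group(weights)
restored = dequantize_group(q, scale, zero)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 4), zero, round(max_error, 4))
```

Each weight is stored as a 4-bit integer plus one shared scale and zero point per group, which is where the memory saving comes from.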
Multi-user inference server: vLLM
Installation and usage instructions are available in the vLLM documentation.
- Please ensure you are using vLLM version 0.2 or later.
- When using vLLM as a server, pass the --quantization awq parameter.
For example:
python3 -m vllm.entrypoints.api_server \
    --model chienweichang/Breeze-7B-Instruct-v1_0-AWQ \
    --quantization awq \
    --max-model-len 2048 \
    --dtype auto
- When using vLLM from Python code, again set quantization="awq".
For example:
from vllm import LLM, SamplingParams

# Sample prompts in Traditional Chinese.
prompts = [
    "告訴我AI是什麼",  # "Tell me what AI is"
    "(291 - 150) 是多少?",  # "What is (291 - 150)?"
    "台灣最高的山是哪座?",  # "Which is the tallest mountain in Taiwan?"
]

# Wrap each prompt in the instruction template.
prompt_template = '''[INST] {prompt} [/INST]
'''
prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="chienweichang/Breeze-7B-Instruct-v1_0-AWQ", quantization="awq", dtype="half", max_model_len=2048)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
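Once the server from the command above is running, it can be queried over HTTP. The following is a minimal client sketch using only the standard library. It assumes the server listens on http://localhost:8000 and exposes a POST /generate endpoint that accepts a JSON body with a prompt plus sampling parameters and returns a JSON object with a "text" field; check these details against the vLLM version you have installed. The helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, host="http://localhost:8000",
                           max_tokens=256, temperature=0.8, top_p=0.95):
    """Build the HTTP request without sending it.

    The host, the /generate path, and the payload field names are
    assumptions about the demo API server, not confirmed by this README.
    """
    payload = {
        # Same instruction template as the Python example above.
        "prompt": f"[INST] {prompt} [/INST]\n",
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{host}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the "text" field of the JSON reply."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["text"]
```

With the server up, `generate("台灣最高的山是哪座?")` should return the generated text. Building the request in a separate function makes it possible to inspect the payload without a running server.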
Inference from Python code using Transformers
Install the necessary packages
- Requires: Transformers 4.37.0 or later.
- Requires: AutoAWQ 0.1.8 or later.
pip3 install --upgrade "autoawq>=0.1.8" "transformers>=4.37.0"
If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
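After installing, it can be useful to confirm that the environment meets the minimum versions listed above. The sketch below reads the installed versions with `importlib.metadata` and compares them against those minimums. The parsing is deliberately simple (it keeps only the leading numeric components, so a pre-release such as 4.37.0.dev0 is treated as 4.37.0), and the function names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the requirements listed above.
MINIMUMS = {"autoawq": (0, 1, 8), "transformers": (4, 37, 0)}


def parse_version(text):
    """Turn '4.37.0.dev0' or '0.1.8+cu118' into a tuple such as (4, 37, 0)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:                # pad '4.37' to (4, 37, 0)
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return a list of human-readable problems; empty if all is well."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = parse_version(version(name))
        except PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed < minimum:
            wanted = ".".join(map(str, minimum))
            problems.append(f"{name} {version(name)} is older than {wanted}")
    return problems


for problem in check_requirements():
    print(problem)
```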
Transformers example code (requires Transformers 4.37.0 or later)
from transformers import AutoTokenizer, pipeline, TextStreamer, AutoModelForCausalLM

checkpoint = "chienweichang/Breeze-7B-Instruct-v1_0-AWQ"
model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# Stream generated tokens to stdout, omitting the prompt itself.
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Create a pipeline for text generation.
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=32768,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

# Run inference through the pipeline; the streamer prints tokens as they are generated.
# The prompt asks: "What is the tallest mountain in Taiwan?"
print("pipeline output: ", text_generation_pipeline("請問台灣最高的山是?"))
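The pipeline is a convenience wrapper; the same model can also be driven through `generate()` directly. The sketch below is one way to do that. It reuses the `[INST] ... [/INST]` template from the vLLM example, which is an assumption here since this section does not state a template for Transformers. The function names and the `max_new_tokens` default are illustrative, and the heavy imports are placed inside the function so the prompt helper can be used on its own.

```python
PROMPT_TEMPLATE = "[INST] {prompt} [/INST]\n"   # assumed: same template as the vLLM example


def build_prompt(user_message):
    """Wrap a user message in the instruction template."""
    return PROMPT_TEMPLATE.format(prompt=user_message)


def generate_reply(user_message,
                   checkpoint="chienweichang/Breeze-7B-Instruct-v1_0-AWQ",
                   max_new_tokens=256):
    """Load the model, generate a reply, and return only the new text."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

    inputs = tokenizer(build_prompt(user_message), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=5,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens so only the generated continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("請問台灣最高的山是?")` loads the model on each call, so for repeated use it would make sense to load the tokenizer and model once and pass them in.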