
Breeze-7B-Instruct-64k-v0_1-AWQ

Description

This repo contains AWQ model files for MediaTek Research's Breeze-7B-Instruct-64k-v0.1.

About AWQ

AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. It offers faster Transformers-based inference than GPTQ, with quality equivalent to or better than the most commonly used GPTQ settings.

AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users: please use GGUF models instead.
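For reference, AWQ repos such as this one are typically produced with AutoAWQ. Below is a minimal sketch of quantizing the base model yourself, assuming the base repo id MediaTek-Research/Breeze-7B-Instruct-64k-v0_1 and AutoAWQ's common 4-bit GEMM settings; the exact quantization config used for this repo is not stated here.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "MediaTek-Research/Breeze-7B-Instruct-64k-v0_1"  # assumed base-model repo id
quant_path = "Breeze-7B-Instruct-64k-v0_1-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}  # assumed settings

# Load the FP16 base model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantize the weights to 4-bit.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)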

It is supported by:

  • vLLM (multi-user inference server)
  • Transformers with AutoAWQ (inference from Python code)

Multi-user inference server: vLLM

Documentation on installing and using vLLM can be found here.

  • Please ensure you are using vLLM version 0.2 or later.
  • When using vLLM as a server, pass the --quantization awq parameter.

For example:

python3 -m vllm.entrypoints.api_server \
    --model chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ \
    --quantization awq \
    --max-model-len 2048 \
    --dtype auto
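Once the server is running you can query it over HTTP. A minimal sketch, assuming the demo api_server above exposes its default /generate endpoint on port 8000:

import requests

# Query the running vLLM demo api_server (assumed default endpoint: http://localhost:8000/generate).
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "[INST] 告訴我AI是什麼 [/INST]",  # "Tell me what AI is"
        "max_tokens": 256,
        "temperature": 0.8,
        "top_p": 0.95,
    },
)
print(response.json())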
  • When using vLLM from Python code, again set quantization=awq.

For example:

from vllm import LLM, SamplingParams
prompts = [
    "告訴我AI是什麼",        # "Tell me what AI is"
    "(291 - 150) 是多少?",   # "What is (291 - 150)?"
    "台灣最高的山是哪座?",    # "Which is Taiwan's highest mountain?"
]
prompt_template='''[INST] {prompt} [/INST]
'''
prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ", quantization="awq", dtype="half", max_model_len=2048)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Inference from Python code using Transformers

Install the necessary packages

pip3 install --upgrade "autoawq>=0.1.8" "transformers>=4.37.0"

If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
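After installation, you can verify that a CUDA-capable GPU is visible, since (as noted above) AWQ models require NVIDIA GPUs. A minimal sketch using PyTorch:

import torch

# AWQ kernels require an NVIDIA GPU; confirm one is visible before loading the model.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))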

Transformers example code (requires Transformers 4.37.0 or later)

from transformers import AutoTokenizer, pipeline, TextStreamer, AutoModelForCausalLM

checkpoint = "chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ"
model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

streamer = TextStreamer(tokenizer, skip_prompt=True)

# Create a pipeline for text generation.
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=32768,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)
# Inference is also possible via Transformers' pipeline
output = text_generation_pipeline("請問台灣最高的山是?")  # "What is Taiwan's highest mountain?"
print("pipeline output: ", output[0]["generated_text"])