---
license: apache-2.0
language:
- zh
library_name: transformers
quantized_by: chienweichang
---

# Breeze-7B-Instruct-64k-v0_1-AWQ

- Model creator: [MediaTek Research](https://huggingface.co/MediaTek-Research)
- Original model: [Breeze-7B-Instruct-64k-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1)

## Description

This repo contains AWQ model files for MediaTek Research's [Breeze-7B-Instruct-64k-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1).

### About AWQ

AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. It offers faster Transformers-based inference with equivalent or better quality compared to the most commonly used GPTQ settings.

AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users: please use GGUF models instead.

It is supported by:

- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
- [vLLM](https://github.com/vllm-project/vllm) - version 0.2.2 or later for support of all model types
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) - see the serving sketch at the end of this card
- [Transformers](https://huggingface.co/docs/transformers) - version 4.35.0 or later, from any code or client that supports Transformers
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code

## Repositories available

* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/audreyt/Breeze-7B-Instruct-64k-v0.1-GGUF)

## Multi-user inference server: vLLM

Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).

- Please ensure you are using vLLM version 0.2.2 or later.
- When using vLLM as a server, pass the `--quantization awq` parameter. For example:

```shell
python3 -m vllm.entrypoints.api_server \
    --model chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ \
    --quantization awq \
    --max-model-len 2048 \
    --dtype auto
```

- When using vLLM from Python code, again set `quantization=awq`. For example:

```python
from vllm import LLM, SamplingParams

prompts = [
    "告訴我AI是什麼",
    "(291 - 150) 是多少?",
    "台灣最高的山是哪座?",
]
prompt_template = '''[INST] {prompt} [/INST] '''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ",
    quantization="awq",
    dtype="half",
    max_model_len=2048,
)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## Inference from Python code using Transformers

### Install the necessary packages

- Requires: [Transformers](https://huggingface.co/docs/transformers) 4.37.0 or later.
- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.8 or later.

```shell
pip3 install --upgrade "autoawq>=0.1.8" "transformers>=4.37.0"
```

If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```
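### AutoAWQ example code

AutoAWQ itself can also load and run this model from Python, without going through the Transformers `pipeline`. The snippet below is a minimal sketch based on typical AutoAWQ usage (0.1.8 or later); the prompt reuses the `[INST] ... [/INST]` template from the vLLM example above, and `max_new_tokens=512` is an illustrative value, not a recommendation.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

checkpoint = "chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ"

# Load the AWQ-quantized weights; fuse_layers enables fused modules for faster inference.
model = AutoAWQForCausalLM.from_quantized(checkpoint, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "[INST] 告訴我AI是什麼 [/INST] "
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Stream the completion to stdout as it is generated.
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512,
)
```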
### Transformers example code (requires Transformers 4.37.0 and later)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline

checkpoint = "chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ"

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Create a text-generation pipeline that streams tokens as they are generated.
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=32768,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

# Run inference through the pipeline.
print("pipeline output: ", text_generation_pipeline("請問台灣最高的山是?"))
```
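## Multi-user inference server: Hugging Face Text Generation Inference (TGI)

TGI is listed above as a supported server for AWQ models. The example below is a minimal sketch: the Docker image tag, port mapping, volume path, and token limits are placeholder values to adapt to your environment. TGI loads AWQ checkpoints via its `--quantize awq` option.

```shell
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /path/to/hf-cache:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ \
    --quantize awq \
    --max-input-length 3072 \
    --max-total-tokens 4096
```

Example Python client, again using the `[INST] ... [/INST]` prompt template:

```python
from huggingface_hub import InferenceClient

# Point the client at the locally running TGI server (port mapping from the command above).
client = InferenceClient("http://127.0.0.1:8080")

prompt = "[INST] 台灣最高的山是哪座? [/INST] "
response = client.text_generation(
    prompt,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.95,
)
print(response)
```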