---
license: apache-2.0
language:
- zh
- en
library_name: transformers
quantized_by: chienweichang
---
# Breeze-7B-32k-Instruct-v1_0-AWQ
- Model creator: [MediaTek Research](https://huggingface.co/MediaTek-Research)
- Original model: [Breeze-7B-32k-Instruct-v1_0](https://huggingface.co/MediaTek-Research/Breeze-7B-32k-Instruct-v1_0)
## Description
This repo contains AWQ model files for MediaTek Research's [Breeze-7B-32k-Instruct-v1_0](https://huggingface.co/MediaTek-Research/Breeze-7B-32k-Instruct-v1_0).
### About AWQ
AWQ is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. It offers faster Transformers-based inference than GPTQ, with equivalent or better quality than the most commonly used GPTQ settings.
AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users: please use GGUF models instead.
It is supported by:
- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
- [vLLM](https://github.com/vllm-project/vllm) - version 0.2.2 or later, which supports all model types.
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
- [Transformers](https://huggingface.co/docs/transformers) version 4.35.0 and later, from any code or client that supports Transformers
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code
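As a rough back-of-envelope illustration (my own arithmetic, not from the original model card), 4-bit quantization shrinks the weight footprint of a 7B-parameter model to about a quarter of fp16; this ignores quantization scales/zero-points, activations, and the KV cache:

```python
# Rough weight-memory estimate for a 7B-parameter model (ignores
# quantization scales/zero-points, activations, and the KV cache).
params = 7_000_000_000

fp16_gb = params * 16 / 8 / 1e9   # 16 bits per weight
awq4_gb = params * 4 / 8 / 1e9    # 4 bits per weight

print(f"fp16 weights: ~{fp16_gb:.1f} GB")    # ~14.0 GB
print(f"4-bit AWQ weights: ~{awq4_gb:.1f} GB")  # ~3.5 GB
```

The real on-disk and in-memory sizes will be somewhat larger due to quantization metadata and non-quantized layers.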
## Multi-user inference server: vLLM
Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
- Please ensure you are using vLLM version 0.2.2 or later.
- When using vLLM as a server, pass the `--quantization awq` parameter.
For example:
```shell
python3 -m vllm.entrypoints.api_server \
--model chienweichang/Breeze-7B-32k-Instruct-v1_0-AWQ \
--quantization awq \
--max-model-len 2048 \
--dtype auto
```
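Once the server is running, it accepts POST requests to its `/generate` endpoint. Below is a minimal client-side sketch of building the request body; the `build_generate_payload` helper and the default port 8000 are illustrative assumptions, and no request is actually sent here:

```python
import json

# Hypothetical helper: wrap a user prompt in the Breeze [INST] template and
# build the JSON body expected by vLLM's /generate endpoint.
def build_generate_payload(prompt: str, max_tokens: int = 512) -> str:
    body = {
        "prompt": f"[INST] {prompt} [/INST]\n",
        "max_tokens": max_tokens,
        "temperature": 0.8,
    }
    return json.dumps(body, ensure_ascii=False)

payload = build_generate_payload("台灣最高的山是哪座?")  # "What is the tallest mountain in Taiwan?"
# e.g. POST this to http://localhost:8000/generate with Content-Type: application/json
```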
- When using vLLM from Python code, again set `quantization=awq`.
For example:
```python
from vllm import LLM, SamplingParams

prompts = [
    "告訴我AI是什麼",        # "Tell me what AI is"
    "(291 - 150) 是多少?",   # "What is (291 - 150)?"
    "台灣最高的山是哪座?",    # "What is the tallest mountain in Taiwan?"
]
prompt_template = '''[INST] {prompt} [/INST]
'''
prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16384)

llm = LLM(
    model="chienweichang/Breeze-7B-32k-Instruct-v1_0-AWQ",
    quantization="awq",
    dtype="half",
    max_model_len=16384,
)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
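The `[INST] … [/INST]` wrapping used above follows the instruct format of the original Breeze model. A tiny standalone helper (the name `format_breeze_prompt` is mine, not part of any library) makes the transformation explicit:

```python
# Wrap a raw user query in the instruction template used in the example above.
def format_breeze_prompt(query: str) -> str:
    return f"[INST] {query} [/INST]\n"

formatted = format_breeze_prompt("告訴我AI是什麼")  # "Tell me what AI is"
print(repr(formatted))
```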
## Inference from Python code using Transformers
### Install the necessary packages
- Requires: [Transformers](https://huggingface.co/docs/transformers) 4.37.0 or later.
- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.8 or later.
```shell
pip3 install --upgrade "autoawq>=0.1.8" "transformers>=4.37.0"
```
If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:
```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```
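To confirm your environment meets the version minimums above, you can compare dotted version strings numerically. This stdlib-only helper (`meets_minimum` is a hypothetical name, not part of either library) handles simple X.Y.Z releases; for pre-release suffixes like `4.37.0.dev0`, use `packaging.version` instead:

```python
# Compare dotted version strings numerically (simple X.Y.Z releases only;
# pre-release suffixes such as "4.37.0.dev0" are not handled).
def meets_minimum(installed: str, required: str) -> bool:
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(required)

print(meets_minimum("4.37.2", "4.37.0"))  # True
print(meets_minimum("4.35.0", "4.37.0"))  # False
```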
### Transformers example code (requires Transformers 4.37.0 or later)
```python
from transformers import AutoTokenizer, pipeline, TextStreamer, AutoModelForCausalLM

checkpoint = "chienweichang/Breeze-7B-32k-Instruct-v1_0-AWQ"

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Create a pipeline for text generation.
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=32768,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

# Inference via the Transformers pipeline ("What is the tallest mountain in Taiwan?")
print("pipeline output: ", text_generation_pipeline("請問台灣最高的山是?"))
```
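A text-generation pipeline returns a list of dicts with a `generated_text` key. The snippet below shows extracting the answer from such a result without loading the model; the sample content is illustrative only, since actual output depends on sampling:

```python
# Illustrative shape of a text-generation pipeline result
# (the Chinese text reads "What is the tallest mountain in Taiwan? ... Yushan").
result = [{"generated_text": "請問台灣最高的山是? 台灣最高的山是玉山。"}]

# Extract the generated string the same way you would from the real pipeline.
answer = result[0]["generated_text"]
print(answer)
```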