|
---
datasets:
- tctsung/chat_restaurant_recommendation
pipeline_tag: text-generation
---
|
This model was quantized with the [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) package, using `tctsung/chat_restaurant_recommendation` as the calibration dataset.
|
|
|
Base model: [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
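
A quantization run along these lines would look roughly like the sketch below. The 4-bit, group-size-128 GEMM config and the `"text"` column name for the calibration data are illustrative assumptions, not settings confirmed by this card; see the GitHub repo linked below for the actual script.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quant_path = "TinyLlama-1.1B-chat-v1.0-awq"

# Common AWQ settings (assumed here, not confirmed by this card)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Build calibration texts; the "text" column name is an assumption about the dataset schema
calib = load_dataset("tctsung/chat_restaurant_recommendation", split="train")
calib_texts = [row["text"] for row in calib]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```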
|
|
|
## Key results
|
|
|
1. AWQ quantization delivered a **1.62x improvement** in inference speed, generating **140.47 new tokens per second** (a rough throughput-measurement sketch follows this list).
|
2. The model size was compressed from 4.4 GB to 0.78 GB, shrinking the memory footprint to only **17.57%** of the original model.
|
3. I evaluated the quantized model on 6 different LLM tasks to show that it maintains comparable accuracy, with a maximum accuracy degradation of only ~1%.
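
A rough way to reproduce the throughput number in item 1 is to time batched generation with vLLM and divide the count of generated tokens by the elapsed time. The snippet below is a minimal sketch with an illustrative prompt and batch size, not the benchmark script from the repo.

```python
import time
from vllm import LLM, SamplingParams

model = LLM(model="tctsung/TinyLlama-1.1B-chat-v1.0-awq", dtype='half', quantization='awq')
sampling_params = SamplingParams(temperature=1.0, max_tokens=256)

# Illustrative batch of identical prompts; a real benchmark should use varied inputs
prompts = ["Recommend a restaurant for a quiet dinner."] * 8

start = time.perf_counter()
outputs = model.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count only newly generated tokens across the batch
new_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{new_tokens / elapsed:.2f} new tokens/sec")
```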
|
|
|
For more details, see the GitHub repo [tctsung/LLM_quantize](https://github.com/tctsung/LLM_quantize.git).
|
|
|
## Inference tutorial |
|
|
|
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load the AWQ-quantized model with vLLM, plus the matching tokenizer:
model_id = "tctsung/TinyLlama-1.1B-chat-v1.0-awq"
model = LLM(model=model_id, dtype='half',
            quantization='awq', gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=1.0,
                                 max_tokens=1024,
                                 min_p=0.5,
                                 top_p=0.85)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define your own system & user messages:
sys_msg = "..."
user_msg = "..."
chat_msg = [
    {"role": "system", "content": sys_msg},
    {"role": "user", "content": user_msg}
]

# Build the prompt with the chat template; add_generation_prompt=True appends
# the assistant turn marker so the model generates a reply
input_text = tokenizer.apply_chat_template(chat_msg, tokenize=False, add_generation_prompt=True)
output = model.generate(input_text, sampling_params)
output_text = output[0].outputs[0].text
print(output_text)  # show the model output
```
|
|