---
datasets:
- tctsung/chat_restaurant_recommendation
pipeline_tag: text-generation
---

This model was quantized with the AutoAWQ package, using `tctsung/chat_restaurant_recommendation` as the calibration dataset (a rough sketch of the quantization step is shown at the end of this card).

Reference model: [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)

## Key results

1. AWQ quantization resulted in a **1.62x improvement** in inference speed, generating **140.47 new tokens per second**.
2. The model was compressed from 4.4 GB to 0.78 GB, shrinking the memory footprint to only **17.57%** of the original model.
3. I evaluated the quantized model on 6 different LLM tasks to show that it maintains comparable accuracy, with a maximum accuracy degradation of only ~1%.

For more details, see the GitHub repo [tctsung/LLM_quantize](https://github.com/tctsung/LLM_quantize.git).

## Inference tutorial

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# load model & tokenizer:
model_id = "tctsung/TinyLlama-1.1B-chat-v1.0-awq"
model = LLM(model=model_id, dtype='half', quantization='awq', gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=1.0, max_tokens=1024, min_p=0.5, top_p=0.85)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# define your own sys & user msg:
sys_msg = "..."
user_msg = "..."
chat_msg = [
    {"role": "system", "content": sys_msg},
    {"role": "user", "content": user_msg}
]
# add_generation_prompt=True appends the assistant turn marker so the model
# generates a reply instead of continuing the user message
input_text = tokenizer.apply_chat_template(chat_msg, tokenize=False, add_generation_prompt=True)
output = model.generate(input_text, sampling_params)
output_text = output[0].outputs[0].text
print(output_text)  # show the model output
```
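
## Quantization sketch

For reference, the quantization step itself looks roughly like the sketch below. This is a minimal sketch of the AutoAWQ workflow, not the exact script used to produce this checkpoint; the quantization config values, the output directory name, and the assumption that the calibration dataset exposes a `text` column are all illustrative.

```python
# Minimal sketch of AWQ quantization with AutoAWQ (assumed workflow, not the exact script)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quant_dir = "TinyLlama-1.1B-chat-v1.0-awq"   # illustrative output path

# Typical 4-bit AWQ settings; the exact config used for this model is an assumption
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Calibration texts from the restaurant-recommendation chat dataset;
# the "text" column name is an assumption about the dataset schema
calib = load_dataset("tctsung/chat_restaurant_recommendation", split="train")
calib_texts = [row["text"] for row in calib]

# Run activation-aware weight quantization against the calibration texts
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)

# Save the quantized weights and tokenizer so they can be loaded by vLLM / transformers
model.save_quantized(quant_dir)
tokenizer.save_pretrained(quant_dir)
```

The saved directory can then be loaded exactly as in the inference tutorial above (vLLM with `quantization='awq'`).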