|
---
datasets:
- tctsung/chat_restaurant_recommendation
pipeline_tag: text-generation
---
|
This model was quantized with the [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) package, using `tctsung/chat_restaurant_recommendation` as the calibration dataset.
|
|
|
Base model: [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
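
A quantization run along these lines would look roughly like the sketch below. The 4-bit, group-size-128 GEMM config and the `"text"` column name for the calibration data are illustrative assumptions, not settings confirmed by this card; see the GitHub repo linked below for the actual script.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quant_path = "TinyLlama-1.1B-chat-v1.0-awq"

# Common AWQ settings (assumed here, not confirmed by this card)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Build calibration texts; the "text" column name is an assumption about the dataset schema
calib = load_dataset("tctsung/chat_restaurant_recommendation", split="train")
calib_texts = [row["text"] for row in calib]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```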
|
|
|
## Key results
|
|
|
1. AWQ quantization delivered a **1.62x improvement** in inference speed, generating **140.47 new tokens per second** (a rough throughput-measurement sketch follows this list).
|
2. The model size was compressed from 4.4 GB to 0.78 GB, shrinking the memory footprint to only **17.57%** of the original model.
|
3. I evaluated the quantized model on 6 different LLM tasks to show that it maintains comparable accuracy, with a maximum accuracy degradation of only ~1%.
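
A rough way to reproduce the throughput number in item 1 is to time batched generation with vLLM and divide the count of generated tokens by the elapsed time. The snippet below is a minimal sketch with an illustrative prompt and batch size, not the benchmark script from the repo.

```python
import time
from vllm import LLM, SamplingParams

model = LLM(model="tctsung/TinyLlama-1.1B-chat-v1.0-awq", dtype='half', quantization='awq')
sampling_params = SamplingParams(temperature=1.0, max_tokens=256)

# Illustrative batch of identical prompts; a real benchmark should use varied inputs
prompts = ["Recommend a restaurant for a quiet dinner."] * 8

start = time.perf_counter()
outputs = model.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count only newly generated tokens across the batch
new_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{new_tokens / elapsed:.2f} new tokens/sec")
```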
|
|
|
For more details, see the GitHub repo [tctsung/LLM_quantize](https://github.com/tctsung/LLM_quantize.git).
|
|
|
## Inference tutorial |
|
|
|
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load the AWQ-quantized model with vLLM, plus the matching tokenizer:
model_id = "tctsung/TinyLlama-1.1B-chat-v1.0-awq"
model = LLM(model=model_id, dtype='half',
            quantization='awq', gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=1.0,
                                 max_tokens=1024,
                                 min_p=0.5,
                                 top_p=0.85)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define your own system & user messages:
sys_msg = "..."
user_msg = "..."
chat_msg = [
    {"role": "system", "content": sys_msg},
    {"role": "user", "content": user_msg}
]

# Build the prompt with the chat template; add_generation_prompt=True appends
# the assistant turn marker so the model generates a reply
input_text = tokenizer.apply_chat_template(chat_msg, tokenize=False, add_generation_prompt=True)
output = model.generate(input_text, sampling_params)
output_text = output[0].outputs[0].text
print(output_text)  # show the model output
```
|
|