Update README.md
README.md (CHANGED)
```diff
@@ -7,15 +7,13 @@ This model is quantized by autoawq package using `tctsung/chat_restaurant_recomm
 
 Reference model: [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
 
-For more details, see github repo [tctsung/LLM_quantize](https://github.com/tctsung/LLM_quantize.git)
-
 ## Key results:
 
 1. AWQ quantization resulted in a **1.62x improvement** in inference speed, generating **140.47 new tokens per second**.
 2. The model size was compressed from 4.4GB to 0.78GB, representing a reduction in memory footprint to only **17.57%** of the original model.
 3. I used 6 different LLM tasks to demonstrate that the quantized model maintains similar accuracy, with a maximum accuracy degradation of only ~1%.
 
-
+For more details, see the GitHub repo [tctsung/LLM_quantize](https://github.com/tctsung/LLM_quantize.git)
 
 ## Inference tutorial
```
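The hunk header shows the model was quantized with the autoawq package on a custom calibration dataset whose id is truncated above (`tctsung/chat_restaurant_recomm...`). As a non-authoritative sketch of how such a model is produced, a typical AutoAWQ run looks like the following; the 4-bit GEMM `quant_config` values and the output directory are illustrative assumptions, not values taken from the repo:

```python
# Sketch of AWQ quantization with the autoawq package.
# The quant_config values and output directory are assumptions;
# the calibration dataset id is truncated in the diff, so it is
# noted in a comment rather than guessed.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # reference model named in the README
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# quantize() also accepts calib_data=<dataset id or list of strings>;
# substitute the full chat_restaurant dataset id there to reproduce
# the author's setup.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("tinyllama-chat-awq")  # hypothetical output path
tokenizer.save_pretrained("tinyllama-chat-awq")
```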
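Key result 1 reports 140.47 new tokens per second, a 1.62x speedup over the FP16 baseline; the actual benchmark lives in the linked repo, but a rough wall-clock measurement of that kind of figure can be sketched as follows (the prompt and generation settings are assumptions):

```python
# Rough tokens-per-second measurement; prompt and generation
# settings are illustrative, not the repo's benchmark setup.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "tinyllama-chat-awq"  # hypothetical local dir from the sketch above
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

inputs = tokenizer("Recommend a restaurant for a quiet dinner.",
                   return_tensors="pt").to(model.device)
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} new tokens/s")
```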
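The `## Inference tutorial` section is empty in this hunk. Under the assumption that the quantized weights load through the standard transformers AWQ integration (the hub id below is a placeholder, not the model card's actual repo id), a minimal usage sketch would be:

```python
# Minimal inference sketch; the hub id is a placeholder assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tctsung/TinyLlama-1.1B-Chat-awq"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# TinyLlama-Chat ships a chat template, so format the prompt with it.
messages = [{"role": "user", "content": "Suggest a cozy Italian restaurant."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```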