Model Card for TinyLlama-1.1B-Chat-v1.0 (Quantized)

This is a 4-bit GGUF quantization (Q4_K_M) of TinyLlama-1.1B-Chat-v1.0, produced with llama.cpp.

Performance Evaluation

The quantized model was evaluated on the HellaSwag benchmark with the following results:

| Metric | Base Model | Quantized Model | Change |
| --- | --- | --- | --- |
| hellaswag accuracy | 0.456 | 0.462 | +0.006 |
| hellaswag normalized accuracy | 0.64 | 0.64 | unchanged |
| eval time on GPU (seconds) | 219.67 | 209.34 | 4.70% decrease |

The quantized version of TinyLlama-1.1B-Chat-v1.0 maintains comparable accuracy while cutting evaluation time by 4.7%. The evaluation was run on a GPU over a 100-sample subset of HellaSwag for expediency; a full evaluation is recommended before production use.
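The card does not state which harness produced these numbers. A minimal sketch of how such a run could be reproduced with EleutherAI's lm-evaluation-harness, using its `--limit` flag to match the 100-sample subset (the harness choice and flags are assumptions, not taken from the original card):

    # hypothetical reproduction of the base-model numbers with lm-evaluation-harness
    # (pip install lm-eval); evaluating the quantized GGUF would additionally need
    # a llama.cpp-compatible backend rather than the `hf` one used here
    lm_eval --model hf \
        --model_args pretrained=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
        --tasks hellaswag \
        --device cuda:0 \
        --limit 100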

Quantization Approach
The model was quantized to 4 bits with llama.cpp using the Q4_K_M method, a medium k-quant that trades a small amount of quality for a large reduction in size and memory bandwidth. The following steps were used:

  1. Convert the original model to GGUF format:

    # convert the HF checkpoint to a 16-bit GGUF; --outfile fixes the name used in step 2
    python ./llama.cpp/convert_hf_to_gguf.py \
        ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ \
        --outtype f16 \
        --outfile ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-f16.gguf
    
  2. Quantize the f16 GGUF to 4-bit Q4_K_M:

    # input: the f16 GGUF from step 1; output: the quantized 4-bit model
    ./llama.cpp/build/bin/llama-quantize \
        ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-f16.gguf \
        ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-Q4_K_M.gguf \
        q4_k_m
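After quantization, a quick smoke test can confirm the file loads and generates text. This is a sketch, not taken from the original card; the prompt is illustrative and `-ngl 99` offloads all layers to the GPU:

    # illustrative smoke test of the quantized model
    ./llama.cpp/build/bin/llama-cli \
        -m ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-Q4_K_M.gguf \
        -p "Explain quantization in one sentence." \
        -n 128 -ngl 99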

Model Details

- Format: GGUF
- Model size: 1.1B params
- Architecture: llama
- Quantization: 4-bit (Q4_K_M)
- Repository: pavel-tolstyko/ggml-model-Q4_K_M
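To fetch the quantized file from the Hugging Face Hub, something like the following should work (a sketch; the filename is assumed to match the one produced in step 2 above):

    # hypothetical download command; repo id as listed above
    huggingface-cli download pavel-tolstyko/ggml-model-Q4_K_M \
        ggml-model-Q4_K_M.gguf --local-dir ./models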
