
# h2o-danube3-500m-chat-GGUF

## Description

This repo contains GGUF-format model files for h2o-danube3-500m-chat, quantized using the llama.cpp framework.
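As a quick start, here is a minimal sketch that downloads one of these files and runs a chat completion with the llama-cpp-python bindings. The repo id and filename come from the table below; `n_ctx` and `max_tokens` are illustrative, and the chat call assumes the chat template embedded in the GGUF metadata is used.

```python
# Minimal sketch using the llama-cpp-python bindings
# (pip install llama-cpp-python huggingface-hub).
from llama_cpp import Llama

# Download a quantized file from this repo and load it;
# the filename can be any entry from the table below.
llm = Llama.from_pretrained(
    repo_id="h2oai/h2o-danube3-500m-chat-GGUF",
    filename="h2o-danube3-500m-chat-Q4_K_M.gguf",
    n_ctx=2048,  # illustrative context window
)

# Chat-style generation; relies on the chat template stored in the GGUF metadata.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Why is drinking water so healthy?"}],
    max_tokens=256,  # illustrative
)
print(out["choices"][0]["message"]["content"])
```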

The table below summarizes the different quantized versions of h2o-danube3-500m-chat and shows the trade-off between size, speed, and quality.

| Name | Quant method | Model size | MT-Bench AVG | Perplexity | Tokens per second |
|------|--------------|------------|--------------|------------|-------------------|
| h2o-danube3-500m-chat-F16.gguf | F16 | 1.03 GB | 3.34 | 9.46 | 1870 |
| h2o-danube3-500m-chat-Q8_0.gguf | Q8_0 | 0.55 GB | 3.76 | 9.46 | 2144 |
| h2o-danube3-500m-chat-Q6_K.gguf | Q6_K | 0.42 GB | 3.77 | 9.46 | 2418 |
| h2o-danube3-500m-chat-Q5_K_M.gguf | Q5_K_M | 0.37 GB | 3.20 | 9.55 | 2430 |
| h2o-danube3-500m-chat-Q4_K_M.gguf | Q4_K_M | 0.32 GB | 3.16 | 9.96 | 2427 |

Columns in the table are:

- **Name** -- model file name and link
- **Quant method** -- quantization method
- **Model size** -- size of the model in gigabytes
- **MT-Bench AVG** -- average MT-Bench benchmark score, on a scale from 1 to 10; higher is better
- **Perplexity** -- perplexity on the WikiText-2 dataset, as reported by llama.cpp's perplexity test; lower is better (see the sketch after this list for how the metric is computed)
- **Tokens per second** -- generation speed in tokens per second, as reported by llama.cpp's perplexity test; higher is better. Speed tests were run on a single H100 GPU
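For reference, the perplexity reported above is the exponential of the mean negative log-likelihood per token. A generic sketch of that computation (not llama.cpp's exact implementation, which processes the evaluation text in context-sized chunks):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Example: four tokens, each assigned probability 0.1 by the model -> PPL = 10.0
print(perplexity([math.log(0.1)] * 4))
```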

## Prompt template

```
<|prompt|>Why is drinking water so healthy?</s><|answer|>
```
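If you drive the model through raw completions instead of the chat API, fill the template in manually. A sketch with llama-cpp-python (the filename and generation parameters are illustrative):

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="h2oai/h2o-danube3-500m-chat-GGUF",
    filename="h2o-danube3-500m-chat-Q4_K_M.gguf",
)

question = "Why is drinking water so healthy?"
prompt = f"<|prompt|>{question}</s><|answer|>"  # template from this card

# Stop at the end-of-sequence marker so generation ends with the answer.
out = llm(prompt, max_tokens=256, stop=["</s>"])
print(out["choices"][0]["text"])
```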