BramVanroy committed · Commit 71efc4c · Parent(s): fc00a49

Update README.md

README.md CHANGED
@@ -6,18 +6,19 @@ tags:
 - gguf
 ---
 
-This repository contains quantized versions of [BramVanroy/fietje-2b-chat](https://huggingface.co/BramVanroy/fietje-2b-chat)
-
-- `-q8_0` (3.0GB): minimal quality loss, smaller
-- `-q5_k_m` (2.0GB): users have reported considerable quality loss in the chat `q5_k_m` version so you may want to avoid it
-
-```
-ollama run bramvanroy/fietje-2b-chat:f16
-ollama run bramvanroy/fietje-2b-chat:q8_0
-ollama run bramvanroy/fietje-2b-chat:q5_k_m
-```
+This repository contains quantized versions of [BramVanroy/fietje-2b-chat](https://huggingface.co/BramVanroy/fietje-2b-chat).
+
+Available quantization types and their expected performance difference compared to the base `f16` model, as measured by llama.cpp (higher perplexity = worse):
+
+```
+Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
+Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
+Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
+Q6_K   : 5.15G, +0.0008 ppl @ LLaMA-v1-7B
+Q8_0   : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
+F16    : 13.00G @ 7B
+```
+
+Also available on [ollama](https://ollama.com/bramvanroy/fietje-2b-chat).
+
+Quants were made with release [`b2777`](https://github.com/ggerganov/llama.cpp/releases/tag/b2777) of llama.cpp.
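
A minimal sketch of running one of these quants locally with the same llama.cpp release. The repository id and `.gguf` file name below are assumptions based on the naming in the table; substitute the actual file listed in this repository:

```
# Fetch one quant from the Hub (repo id and file name are assumed; check the repo's file list)
huggingface-cli download BramVanroy/fietje-2b-chat-gguf fietje-2b-chat-Q5_K_M.gguf --local-dir .

# Run it with the llama.cpp CLI; at release b2777 the binary was still called `main`
./main -m fietje-2b-chat-Q5_K_M.gguf -p "Schrijf een kort verhaal over een fietje." -n 128
```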
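The ollama tags were spelled out in the previous revision of this README (removed in this commit); those invocations still show how each build is pulled:

```
ollama run bramvanroy/fietje-2b-chat:f16
ollama run bramvanroy/fietje-2b-chat:q8_0
ollama run bramvanroy/fietje-2b-chat:q5_k_m
```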
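As a sketch of how quants like these are produced with the pinned release (Fietje is Phi-2-based, so the HF-to-GGUF converter applies; paths are placeholders and the exact flags at b2777 may differ):

```
# Build llama.cpp at the pinned release
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b2777
make

# Convert the HF checkpoint to an f16 GGUF (input path is a placeholder)
python convert-hf-to-gguf.py /path/to/fietje-2b-chat --outtype f16 --outfile fietje-2b-chat-f16.gguf

# Quantize the f16 file to one of the types listed above
./quantize fietje-2b-chat-f16.gguf fietje-2b-chat-Q5_K_M.gguf Q5_K_M
```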