metadata
language:
  - en
library_name: transformers
license: apache-2.0
tags:
  - gpt
  - llm
  - large language model
  - h2o-llmstudio
thumbnail: >-
  https://h2o.ai/etc.clientlibs/h2o/clientlibs/clientlib-site/resources/images/favicon.ico
pipeline_tag: text-generation
quantized_by: h2oai

h2o-danube3-500m-chat-GGUF

Description

This repo contains GGUF format model files for h2o-danube3-500m-chat, quantized using the llama.cpp framework.

The table below summarizes the quantized versions of h2o-danube3-500m-chat and shows the trade-off between model size, speed, and quality.

| Name | Quant method | Model size | MT-Bench AVG | Perplexity | Tokens per second |
|------|--------------|------------|--------------|------------|-------------------|
| h2o-danube3-500m-chat-F16.gguf | F16 | 1.03 GB | 3.34 | 9.46 | 1870 |
| h2o-danube3-500m-chat-Q8_0.gguf | Q8_0 | 0.55 GB | 3.76 | 9.46 | 2144 |
| h2o-danube3-500m-chat-Q6_K.gguf | Q6_K | 0.42 GB | 3.77 | 9.46 | 2418 |
| h2o-danube3-500m-chat-Q5_K_M.gguf | Q5_K_M | 0.37 GB | 3.20 | 9.55 | 2430 |
| h2o-danube3-500m-chat-Q4_K_M.gguf | Q4_K_M | 0.32 GB | 3.16 | 9.96 | 2427 |

The columns in the table are:

  • Name -- model name and link
  • Quant method -- quantization method
  • Model size -- size of the model in gigabytes
  • MT-Bench AVG -- MT-Bench benchmark score. Scores range from 1 to 10; higher is better
  • Perplexity -- perplexity on the WikiText-2 dataset, as reported by the llama.cpp perplexity test. Lower is better
  • Tokens per second -- generation speed in tokens per second, as reported by the llama.cpp perplexity test. Higher is better. Speed tests were run on a single H100 GPU
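
To make the trade-off easier to read, the following minimal sketch expresses each row of the table relative to the F16 baseline. The numbers are copied from the table above; the names `TABLE` and `relative_to_baseline` are hypothetical helpers introduced only for this illustration and are not part of llama.cpp or this repo.

```python
# Values copied from the table above:
# (quant method, model size in GB, MT-Bench AVG, perplexity, tokens per second)
TABLE = [
    ("F16", 1.03, 3.34, 9.46, 1870),
    ("Q8_0", 0.55, 3.76, 9.46, 2144),
    ("Q6_K", 0.42, 3.77, 9.46, 2418),
    ("Q5_K_M", 0.37, 3.20, 9.55, 2430),
    ("Q4_K_M", 0.32, 3.16, 9.96, 2427),
]


def relative_to_baseline(table, baseline="F16"):
    """Return size ratio, perplexity change, and speedup of each row vs. the baseline row."""
    base = next(row for row in table if row[0] == baseline)
    _, base_size, _, base_ppl, base_tps = base

    result = {}
    for name, size, _mt_bench, ppl, tps in table:
        result[name] = {
            "size_ratio": size / base_size,  # below 1.0 means a smaller file
            "ppl_delta": ppl - base_ppl,     # above 0.0 means worse perplexity
            "speedup": tps / base_tps,       # above 1.0 means faster
        }
    return result


if __name__ == "__main__":
    for name, stats in relative_to_baseline(TABLE).items():
        print(
            f"{name:8s} size x{stats['size_ratio']:.2f}  "
            f"ppl {stats['ppl_delta']:+.2f}  speed x{stats['speedup']:.2f}"
        )
```

The speed figures were measured on a single H100 GPU, so the speedup ratios may not carry over to other hardware.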

Prompt template

<|prompt|>Why is drinking water so healthy?</s><|answer|>
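
As a minimal sketch, a single-turn prompt in this format can be assembled with plain string formatting. The function name `build_prompt` is a hypothetical helper for illustration; only the `<|prompt|>`, `</s>`, and `<|answer|>` markers come from the template above, and multi-turn formatting is not covered here.

```python
PROMPT_TOKEN = "<|prompt|>"
ANSWER_TOKEN = "<|answer|>"
END_OF_TURN = "</s>"


def build_prompt(question: str) -> str:
    """Wrap a single user question in the prompt template shown above."""
    return f"{PROMPT_TOKEN}{question}{END_OF_TURN}{ANSWER_TOKEN}"


if __name__ == "__main__":
    print(build_prompt("Why is drinking water so healthy?"))
```

The resulting string is what would be passed to the model as the raw prompt; the model's reply is expected to follow the `<|answer|>` marker.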