license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- deepseek
- gguf
- bf16
metrics:
- accuracy
language:
- en
- zh
DeepSeek-V2-Chat-GGUF
Quantized from https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
Using llama.cpp b3026 for quantization. Given the rapid release of llama.cpp builds, this will likely change over time.
If you are using an older quant, please set the metadata KV overrides below.
Usage:
Downloading the bf16:
- Find the relevant directory
- Download all files
- Run merge.py
- Merged GGUF should appear
Downloading the quantizations:
- Find the relevant directory
- Download all files
- Point to the first split (most programs should load all the splits automatically now)
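As a rough sketch of the last step, the helper below locates the first split in a download directory so it can be passed to -m. It assumes the splits follow llama.cpp's gguf-split naming (for example DeepSeek-V2-Chat.Q2_K-00001-of-00005.gguf); that pattern and the function name first_split are assumptions, not something confirmed for every directory in this repo.

```shell
#!/bin/sh
# Print the path of the first split of a quant inside a directory.
# Assumption: files are named like <name>-00001-of-<total>.gguf
# (the gguf-split convention); adjust the glob if this repo differs.
first_split() {
    dir=$1
    for f in "$dir"/*-00001-of-*.gguf; do
        # If the glob matched nothing, $f is the literal pattern and does not exist.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    echo "no first split found in $dir" >&2
    return 1
}
```

For example, main -m "$(first_split ./Q2_K)" would load the quant from a hypothetical ./Q2_K directory.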
Running in llama.cpp:
To start in command line chat mode (chat completion):
main -m DeepSeek-V2-Chat.{quant}.gguf -c {context_length} --color (-i)
To use llama.cpp's OpenAI compatible server:
server \
-m DeepSeek-V2-Chat.{quant}.gguf \
-c {context_length} \
(--color [recommended: colored output in supported terminals]) \
(-i [note: interactive mode]) \
(--mlock [note: avoid using swap]) \
(--verbose) \
(--log-disable [note: disable logging to file, may be useful for prod]) \
(--metrics [note: prometheus compatible monitoring endpoint]) \
(--api-key [string]) \
(--port [int]) \
(--flash-attn [note: must be fully offloaded to supported GPU])
Making an importance matrix:
imatrix \
-m DeepSeek-V2-Chat.{quant}.gguf \
-f groups_merged.txt \
--verbosity [0, 1, 2] \
-ngl {GPU offloading; must build with CUDA} \
--ofreq {recommended: 1}
Making a quant:
quantize \
DeepSeek-V2-Chat.bf16.gguf \
DeepSeek-V2-Chat.{quant}.gguf \
{quant} \
(--imatrix [file])
Note: Use iMatrix quants only if you can fully offload to GPU; otherwise, speed will suffer.
Quants:
Quant | Status | Size | Description | KV Metadata | Weighted | Notes |
---|---|---|---|---|---|---|
BF16 | Available | 439 GB | Lossless :) | Old | No | Q8_0 is sufficient for most cases |
Q8_0 | Uploading | 233.27 GB | High quality, recommended | Updated | Yes | |
Q4_K_M | Available | 132 GB | Medium quality, recommended | Old | No | |
Q3_K_M | Uploading | 92.6 GB | Medium-low quality | Updated | Yes | |
IQ3_XS | Available | 89.6 GB | Better than Q3_K_M | Old | Yes | |
Q2_K | Available | 80.0 GB | Low quality, not recommended | Old | No | |
IQ2_XXS | Available | 61.5 GB | Lower quality, not recommended | Old | Yes | |
IQ1_M | Uploading | 27.3 GB | Extremely low quality, not recommended | Old | Yes | Testing purposes; use IQ2 at least |
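To make the table easier to act on, here is a minimal sketch that picks the largest quant whose file size fits a given memory budget in GB. The sizes are copied from the table; the function name largest_fitting_quant is hypothetical, and using file size as the only criterion is an assumption: it ignores the KV cache, context length, and runtime overhead, so pass a budget with some headroom.

```shell
#!/bin/sh
# Pick the largest quant from the table whose file size fits a memory budget (GB).
# Assumption: file size is only a rough lower bound on required memory;
# KV cache and runtime overhead are NOT accounted for.
largest_fitting_quant() {
    budget_gb=$1
    # Name and size (GB) pairs copied from the table, largest first.
    printf '%s\n' \
        'BF16 439' 'Q8_0 233.27' 'Q4_K_M 132' 'Q3_K_M 92.6' \
        'IQ3_XS 89.6' 'Q2_K 80.0' 'IQ2_XXS 61.5' 'IQ1_M 27.3' |
    awk -v budget="$budget_gb" '
        $2 + 0 <= budget + 0 { print $1; found = 1; exit }
        END { if (!found) exit 1 }'
}
```

Calling largest_fitting_quant with a budget smaller than every listed size prints nothing and returns a non-zero status.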
Planned Quants (weighted/iMatrix):
Planned Quant | Notes |
---|---|
Q5_K_M | |
Q6_K | |
IQ4_XS | |
IQ2_XS | |
IQ2_S | |
IQ2_M | |
Metadata KV overrides (pass them using --override-kv; can be specified multiple times):
deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
deepseek2.expert_shared_count=int:2
deepseek2.expert_feed_forward_length=int:1536
deepseek2.expert_weights_scale=float:16
deepseek2.leading_dense_block_count=int:1
deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
The Q8_0 quant contains these parameters, along with future ones, so as long as you're running a supported build of llama.cpp, no --override-kv parameters are required.
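For older quants, the overrides listed above can be turned into repeated --override-kv flags. The sketch below only prints the resulting command so it can be inspected before running; the helper name override_kv_args and the Q2_K filename are placeholders chosen for illustration.

```shell
#!/bin/sh
# Emit one "--override-kv KEY=TYPE:VALUE" pair per metadata entry from the list above.
override_kv_args() {
    for kv in \
        deepseek2.attention.q_lora_rank=int:1536 \
        deepseek2.attention.kv_lora_rank=int:512 \
        deepseek2.expert_shared_count=int:2 \
        deepseek2.expert_feed_forward_length=int:1536 \
        deepseek2.expert_weights_scale=float:16 \
        deepseek2.leading_dense_block_count=int:1 \
        deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
    do
        printf '%s %s ' --override-kv "$kv"
    done
}

# Print (rather than run) the command. The model filename is a placeholder;
# substitute the quant you actually downloaded.
echo "main -m DeepSeek-V2-Chat.Q2_K.gguf $(override_kv_args)"
```

Printing first is a deliberate choice here: a typo in a key or type is easier to spot in the echoed command than in a failed model load.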
A precompiled AVX2 version is available at llama.cpp-039896407afd40e54321d47c5063c46a52da3e01.zip in the root of this repo.
License:
- DeepSeek license for model weights, which can be found in the LICENSE file in the root of this repo
- MIT license for any repo code
Performance:
~1.5 t/s with Ryzen 7 3700X (96 GB RAM, 3200 MHz) [Q2_K]
iMatrix:
Find imatrix.dat in the root of this repo, made with a Q2_K quant (see here for info: https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693)
Using groups_merged.txt, find it here: https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384
Censorship:
This model is somewhat censored; fine-tuning on a toxic DPO dataset might help.