---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- deepseek
- gguf
- bf16
metrics:
- accuracy
language:
- en
- zh
---
# DeepSeek-V2-Chat-GGUF
Quantized from [https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat)
Quantization was done with llama.cpp [b3026](https://github.com/ggerganov/llama.cpp/releases/tag/b3026). Given how quickly new llama.cpp builds are released, the build used will likely change over time.
# Warning: These GGUFs will not work unless you set the metadata KV overrides listed below, nor will they load in LM Studio or similar wrapper apps unless the app ships a supported llama.cpp build (see below)!
# How to use:
**Downloading the bf16:**
- Find the relevant directory
- Download all files
- Run merge.py (a download-and-merge sketch follows this list)
- The merged GGUF should appear
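A minimal download-and-merge sketch, assuming a recent `huggingface-cli`; the repo id and path pattern are placeholders, and `merge.py`'s arguments (if any) are defined by the script shipped in this repo:
```
# Grab every bf16 split into a local directory (repo id and path pattern are placeholders)
huggingface-cli download {repo_id} --include "bf16/*" --local-dir ./DeepSeek-V2-Chat-GGUF
cd ./DeepSeek-V2-Chat-GGUF/bf16
# Reassemble the splits into a single GGUF; check the script itself for any expected arguments
python merge.py
```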
**Downloading the quantizations:**
- Find the relevant directory
- Download all files
- Point your program at the first split (most programs now load the remaining splits automatically); a manual merge fallback is sketched below
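If a program cannot handle split GGUFs, the splits can be merged manually with llama.cpp's `gguf-split` tool; a sketch, assuming the standard `-00001-of-0000N` split naming used by recent llama.cpp builds:
```
# Merge all splits into a single file (the split filename is illustrative)
gguf-split --merge DeepSeek-V2-Chat.{quant}-00001-of-00005.gguf DeepSeek-V2-Chat.{quant}.gguf
```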
**Running in llama.cpp:**
To start in command line chat mode (chat completion):
```
main -m DeepSeek-V2-Chat.{quant}.gguf -c {context_length} --color -i
```
To use llama.cpp's OpenAI compatible server:
```
server \
-m DeepSeek-V2-Chat.{quant}.gguf \
-c {context_length} \
(--color [recommended: colored output in supported terminals]) \
(-i [note: interactive mode]) \
(--mlock [note: avoid using swap]) \
(--verbose) \
(--log-disable [note: disable logging to file, may be useful for prod]) \
(--metrics [note: prometheus compatible monitoring endpoint]) \
(--api-key [string]) \
(--port [int]) \
(--flash-attn [note: must be fully offloaded to supported GPU])
```
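Once the server is running it exposes an OpenAI-compatible chat endpoint; a quick smoke test with curl, assuming the default host and port (127.0.0.1:8080) and no API key:
```
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-V2-Chat",
        "messages": [{"role": "user", "content": "Write hello world in Python."}],
        "max_tokens": 128
      }'
```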
Making an importance matrix:
```
imatrix \
-m DeepSeek-V2-Chat.{quant}.gguf \
-f groups_merged.txt \
--verbosity [0, 1, 2] \
-ngl {layers to offload to GPU; requires a CUDA build} \
--ofreq {recommended: 1}
```
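For example, the `imatrix.dat` in this repo was computed from the Q2_K quant (see the iMatrix section below); a comparable invocation, assuming full GPU offload, might look like this (output defaults to `imatrix.dat`):
```
imatrix \
  -m DeepSeek-V2-Chat.Q2_K.gguf \
  -f groups_merged.txt \
  --verbosity 1 \
  -ngl 99 \
  --ofreq 1
```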
Making a quant:
```
quantize \
(--imatrix [file]) \
DeepSeek-V2-Chat.bf16.gguf \
DeepSeek-V2-Chat.{quant}.gguf \
{quant}
```
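A concrete sketch for an imatrix-weighted quant, assuming the `imatrix.dat` from this repo is in the working directory:
```
quantize \
  --imatrix imatrix.dat \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.IQ2_XXS.gguf \
  IQ2_XXS
```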
# Quants:
```
- bf16 [size: 439 GB]
- q8_0 (uploading) [size: 233.27 GB]
- q4_k_m [size: 132 GB]
- q2_k [size: 80 GB]
- iq2_xxs [size: 61.5 GB]
- iq3_xs [size: 89.6 GB]
- iq1_m (uploading) [size: 27.3 GB]
- q3_k_m (uploading) [size: 92.6 GB]
```
Note: use the iMatrix (IQ) quants only if you can fully offload the model to GPU; on CPU they run noticeably slower than the regular K-quants.
# Planned Quants (weighted/imatrix):
```
- q5_k_m
- q5_k_s
- q6_k
- iq4_xs
- iq2_xs
- iq2_s
- iq2_m
- iq1_s (note: for fun only, this quant is likely useless)
```
Use these metadata KV overrides (pass each one with `--override-kv`; the flag can be repeated):
```
deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
deepseek2.expert_shared_count=int:2
deepseek2.expert_feed_forward_length=int:1536
deepseek2.expert_weights_scale=float:16
deepseek2.leading_dense_block_count=int:1
deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
```
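Put together, a full command with every override applied might look like the following (quant, context length, and port are illustrative):
```
server \
  -m DeepSeek-V2-Chat.Q2_K.gguf \
  -c 4096 \
  --port 8080 \
  --override-kv deepseek2.attention.q_lora_rank=int:1536 \
  --override-kv deepseek2.attention.kv_lora_rank=int:512 \
  --override-kv deepseek2.expert_shared_count=int:2 \
  --override-kv deepseek2.expert_feed_forward_length=int:1536 \
  --override-kv deepseek2.expert_weights_scale=float:16 \
  --override-kv deepseek2.leading_dense_block_count=int:1 \
  --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
```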
The Q8_0 quant (and all quants uploaded after it) already embeds these parameters, so no `--override-kv` flags are required as long as you are running a supported build of llama.cpp.
A precompiled AVX2 build of llama.cpp is available as `llama.cpp-039896407afd40e54321d47c5063c46a52da3e01.zip` in the root of this repo.
# License:
- DeepSeek license for model weights, which can be found in the `LICENSE` file in the root of this repo
- MIT license for any repo code
# Performance:
~1.5 t/s on a Ryzen 7 3700X (96 GB @ 3200 MHz) [Q2_K]
# iMatrix:
Find `imatrix.dat` in the root of this repo; it was computed from a Q2_K quant (more info here: [https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693](https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693)).
The calibration data is `groups_merged.txt`, available here: [https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384)
# Censorship:
This model is quite censored; fine-tuning on a toxic-DPO dataset might help.