File size: 3,805 Bytes
c47b800
306744b
c47b800
 
306744b
 
 
 
 
 
 
 
 
c47b800
 
 
 
 
 
 
 
42e3696
 
f83b696
c47b800
 
 
0ceb327
 
c47b800
 
 
 
 
0ceb327
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c47b800
f83b696
306744b
440314c
306744b
440314c
3ef8d77
 
 
f83b696
 
3ef8d77
 
f83b696
 
 
 
 
 
 
 
 
 
 
3ef8d77
f83b696
29a06ae
306744b
29a06ae
f83b696
29a06ae
 
 
 
 
130ddce
29a06ae
130ddce
f83b696
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- deepseek
- gguf
- bf16
- chinese
- english
metrics:
- accuracy
---

# Deepseek-V2-Chat-GGUF

Quantizised from [https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat)

Using llama.cpp fork: [https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2](https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2)

TODO: Make llamafile for Q2_K and Q4_K_M

# Warning: This will not work unless you compile llama.cpp from the repo provided (and set metadata KV overrides)!

# How to use:

**Downloading the bf16:**

- Find the relevant directory
- Download all files
- Run merge.py
- Merged GGUF should appear

**Downloading the quantizations:**
- Find the relevant directory
- Download all files
- Point to the first split (most programs should load all the splits automatically now)

**Running in llama.cpp:**

To start in command line interactive mode (text completion):
```
main -m DeepSeek-V2-Chat.{quant}.gguf -c {context length} --color -i
```
To use llama.cpp OpenAI compatible server:
```
server \
  -m DeepSeek-V2-Chat.{quant}.gguf \
  -c {context_length} \
  (--color [recommended: colored output in supported terminals]) \
  (-i [note: interactive mode]) \
  (--mlock [note: avoid using swap]) \
  (--verbose) \
  (--log-disable [note: disable logging to file, may be useful for prod]) \
  (--metrics [note: prometheus compatible monitoring endpoint]) \
  (--api-key [string]) \
  (--port [int]) \
  (--flash-attn [note: must be fully offloaded to supported GPU])
```
Making an importance matrix:
```
imatrix \
  -m DeepSeek-V2-Chat.{quant}.gguf \
  -f groups_merged.txt \
  --verbosity [0, 1, 2] \
  -ngl {GPU offloading; must build with CUDA} \
 --ofreq {recommended: 1}
```
Making a quant:
```
quantize \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.{quant}.gguf \
  {quant} \
  (--imatrix [file])
```

# Quants:
```
- bf16 [size: 439gb]
- q8_0 (later, please use q4_k_m for now) [estimated size: 233.27gb]
- q4_k_m [size: 132gb]
- q2_k [size: 80gb]
- iq2_xxs [size: 61.5gb]
- iq3_xs (uploading) [size: 89.6gb]
- iq1_m [size: 27.3gb]
```

Note: Use iMatrix quants only if you can fully offload to GPU, otherwise speed will be affected a lot.

# Planned Quants (using importance matrix):
```
- q5_k_m
- q5_k_s
- q3_k_m
- q6_k
- iq4_nl
- iq4_xs
- iq2_xs
- iq2_s
- iq2_m
- iq1_s (note: for fun only, this quant is likely useless)
```

Note: the model files do not have some DeepSeek v2 specific parameters, will look into adding them

Please use commit `039896407afd40e54321d47c5063c46a52da3e01`, otherwise use these metadata KV overrides:
```
deepseek2.attention.q_lora_rank=int:1536
deepseek2.attention.kv_lora_rank=int:512
deepseek2.expert_shared_count=int:2
deepseek2.expert_feed_forward_length=int:1536
deepseek2.experts_weight_scale=int:16
deepseek2.leading_dense_block_count=int:1
rope.scaling.yarn_log_multiplier=float:0.0707
```

A precompiled AVX2 version is avaliable at `llama.cpp-039896407afd40e54321d47c5063c46a52da3e01.zip` in the root of this repo.

# License:
- DeepSeek license for model weights
- MIT license for any repo code

# Performance:
~1.5t/s with Ryzen 3 3700x (96gb 3200mhz) [Q2_K]

# iMatrix:
Find imatrix.dat in the root of this repo, made with a Q2_K quant (see here for info: [https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693](https://github.com/ggerganov/llama.cpp/issues/5153#issuecomment-1913185693))

Using groups_merged.txt, find it here: [https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384)

# Censorship:

This model is quite censored, finetuning on toxic DPO might help.