danielhanchen committed (verified)
Commit 8330ded · Parent(s): 944bb5f

Update README.md

Files changed (1): README.md +7 -7
README.md CHANGED
@@ -31,10 +31,10 @@ tags:
  <h1 style="margin-top: 0rem;">Instructions to run this model in llama.cpp:</h2>
  </div>

- Prompt format: `"<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"`
-
  Or you can view more detailed instructions here: [unsloth.ai/blog/deepseekr1-dynamic](https://unsloth.ai/blog/deepseekr1-dynamic)
- 1. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter
+ 1. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter. Also
+ do not forget about `<think>\n`!
+ Prompt format: `"<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"`
  2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
  ```bash
  apt-get update
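For reference, the raw prompt above can be assembled in Python as in the minimal sketch below; `build_prompt` is a hypothetical helper, and the user message is just the example from the format string.

```python
# Minimal sketch: build the raw prompt by hand instead of using a chat template
# formatter. The special tokens and trailing "<think>\n" follow the format above.
def build_prompt(user_message: str) -> str:  # hypothetical helper name
    return f"<|User|>{user_message}<|Assistant|><think>\n"

prompt = build_prompt("Create a Flappy Bird game in Python.")
# -> "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
```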
@@ -56,13 +56,13 @@ from huggingface_hub import snapshot_download
  snapshot_download(
  repo_id = "unsloth/r1-1776-GGUF",
  local_dir = "r1-1776-GGUF",
- allow_patterns = ["*Q4_K_M*"], # Select quant type Q4_K_M for 4.5bit
+ allow_patterns = ["*Q2_K_XL*"], # Select quant type Q2_K_XL for dynamic 2bit
  )
  ```
  5. Example with Q4_0 K quantized cache **Notice -no-cnv disables auto conversation mode**
  ```bash
  ./llama.cpp/llama-cli \
- --model r1-1776-GGUF/Q4_K_M/r1-1776-Q4_K_M-00001-of-00009.gguf \
+ --model r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf \
  --cache-type-k q4_0 \
  --threads 12 -no-cnv --prio 2 \
  --temp 0.6 \
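Because the weights are sharded, the path passed to `--model` is the first split. If that path is not known in advance, a small sketch (assuming the `local_dir` layout produced by the download step above) can locate it:

```python
# Minimal sketch (assumes the shards were downloaded into r1-1776-GGUF as above):
# find the first Q2_K_XL split, which is the file to pass to llama-cli's --model.
from pathlib import Path

shards = sorted(Path("r1-1776-GGUF").rglob("*Q2_K_XL*.gguf"))
print(shards[0])  # e.g. r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf
```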
@@ -83,7 +83,7 @@ snapshot_download(
  6. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
  ```bash
  ./llama.cpp/llama-cli \
- --model r1-1776-GGUF/Q4_K_M/r1-1776-Q4_K_M-00001-of-00009.gguf \
+ --model r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf \
  --cache-type-k q4_0 \
  --threads 12 -no-cnv --prio 2 \
  --n-gpu-layers 7 \
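The same run can also be scripted; a minimal sketch follows, where the subprocess wrapper and passing the prompt via `-p` are assumptions on top of the command above, and `--n-gpu-layers` should be tuned to the available VRAM.

```python
# Minimal sketch: launch llama-cli with GPU offload from Python. Flags mirror the
# command above; passing the prompt via "-p" is an assumption (see llama-cli --help).
import subprocess

prompt = "<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"
subprocess.run([
    "./llama.cpp/llama-cli",
    "--model", "r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf",
    "--cache-type-k", "q4_0",
    "--threads", "12", "-no-cnv", "--prio", "2",
    "--n-gpu-layers", "7",
    "--temp", "0.6",
    "-p", prompt,
], check=True)
```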
@@ -95,7 +95,7 @@ snapshot_download(
  7. If you want to merge the weights together, use this script:
  ```
  ./llama.cpp/llama-gguf-split --merge \
- r1-1776-GGUF/Q4_K_M/r1-1776-Q4_K_M-00001-of-00009.gguf \
+ r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf \
  merged_file.gguf
  ```
 
 