Update README.md
README.md CHANGED
@@ -31,10 +31,10 @@ tags:
 <h1 style="margin-top: 0rem;">Instructions to run this model in llama.cpp:</h1>
 </div>
 
-Prompt format: `"<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"`
-
 Or you can view more detailed instructions here: [unsloth.ai/blog/deepseekr1-dynamic](https://unsloth.ai/blog/deepseekr1-dynamic)
-1. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter
+1. Do not forget about `<|User|>` and `<|Assistant|>` tokens! - Or use a chat template formatter. Also
+do not forget about `<think>\n`!
+Prompt format: `"<|User|>Create a Flappy Bird game in Python.<|Assistant|><think>\n"`
 2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
 ```bash
 apt-get update
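A note on the prompt-format line moved in this hunk: it is easy to drop a token when assembling the string by hand. As a minimal sketch (the `build_prompt` helper below is hypothetical, not part of the repo), the raw prompt can be built like this:

```python
# Hypothetical helper (not part of the repo): assembles the raw prompt so the
# <|User|>/<|Assistant|> tokens and the trailing <think>\n are not forgotten.
def build_prompt(user_message: str) -> str:
    return f"<|User|>{user_message}<|Assistant|><think>\n"

print(build_prompt("Create a Flappy Bird game in Python."))
# -> <|User|>Create a Flappy Bird game in Python.<|Assistant|><think>
```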
@@ -56,13 +56,13 @@ from huggingface_hub import snapshot_download
 snapshot_download(
     repo_id = "unsloth/r1-1776-GGUF",
     local_dir = "r1-1776-GGUF",
-    allow_patterns = ["*
+    allow_patterns = ["*Q2_K_XL*"], # Select quant type Q2_K_XL for dynamic 2bit
 )
 ```
 5. Example with Q4_0 K quantized cache. **Notice: `-no-cnv` disables auto conversation mode**
 ```bash
 ./llama.cpp/llama-cli \
-    --model r1-1776-GGUF/
+    --model r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf \
     --cache-type-k q4_0 \
     --threads 12 -no-cnv --prio 2 \
     --temp 0.6 \
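On the new `--model` path above: llama.cpp loads a split GGUF when pointed at the first shard and picks up the rest automatically. A minimal sketch for locating that shard after the download, assuming the folder layout shown in this hunk:

```python
# Minimal sketch: find the first GGUF shard to pass to llama-cli's --model flag.
# llama.cpp resolves the remaining shards automatically from shard 00001.
from pathlib import Path

shards = sorted(Path("r1-1776-GGUF/Q2_K_XL").glob("*-00001-of-*.gguf"))
assert shards, "Run the snapshot_download step above first."
print(shards[0])  # e.g. r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf
```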
@@ -83,7 +83,7 @@ snapshot_download(
 6. If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
 ```bash
 ./llama.cpp/llama-cli \
-    --model r1-1776-GGUF/
+    --model r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf \
     --cache-type-k q4_0 \
     --threads 12 -no-cnv --prio 2 \
     --n-gpu-layers 7 \
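To see roughly where `--n-gpu-layers 7` comes from, here is a back-of-envelope sketch. All the sizes below are assumptions for illustration, not measured values:

```python
# Rough estimate of how many layers fit in VRAM. Every number is an assumption:
model_size_gb = 211   # assumed on-disk size of the Q2_K_XL quant
n_layers = 61         # assumed transformer layer count for this model
vram_gb = 24          # e.g. a single RTX 4090
overhead_gb = 2       # assumed headroom for KV cache and scratch buffers

per_layer_gb = model_size_gb / n_layers
print(int((vram_gb - overhead_gb) / per_layer_gb))  # ~6, close to the 7 used above
```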
@@ -95,7 +95,7 @@ snapshot_download(
 7. If you want to merge the weights together, use this script:
 ```
 ./llama.cpp/llama-gguf-split --merge \
-    r1-1776-GGUF/
+    r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf \
     merged_file.gguf
 ```
 
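For scripted setups, the same merge can be driven from Python. A sketch mirroring the command in the hunk above:

```python
# Sketch: run the llama-gguf-split merge shown above via subprocess.
import subprocess

subprocess.run(
    [
        "./llama.cpp/llama-gguf-split", "--merge",
        "r1-1776-GGUF/Q2_K_XL/r1-1776-Q2_K_XL-00001-of-00005.gguf",
        "merged_file.gguf",
    ],
    check=True,  # raise CalledProcessError if the merge fails
)
```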