Upload README.md
README.md CHANGED
@@ -204,12 +204,12 @@ Windows Command Line users: You can set the environment variable by running `set
 Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
 
 ```shell
-./main -ngl 35 -m capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf --color -c
+./main -ngl 35 -m capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf --color -c 200000 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "SYSTEM: {system_message}\nUSER: {prompt}\nASSISTANT:"
 ```
 
 Change `-ngl 35` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
 
-Change `-c
+Change `-c 200000` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.
 
 If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
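The chat-mode note above ("replace the `-p <PROMPT>` argument with `-i -ins`") is easy to get wrong in practice, so here is a sketch of the resulting command. It assumes the same model file as the diff; `-c` is deliberately lowered to 8192 (not a value from the README), since the model's full 200K context needs far more memory than most machines have:

```shell
# Interactive instruct-style chat: -i -ins takes the place of -p "<PROMPT>".
# -c 8192 requests a smaller context than the model's 200K maximum to keep
# RAM/VRAM usage manageable; raise it if your hardware allows.
./main -ngl 35 -m capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf --color \
  -c 8192 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
```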
@@ -258,7 +258,7 @@ from llama_cpp import Llama
 # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
 llm = Llama(
   model_path="./capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf",  # Download the model file first
-  n_ctx=
+  n_ctx=200000,  # The max sequence length to use - note that longer sequence lengths require much more resources
   n_threads=8,  # The number of CPU threads to use, tailor to your system and the resulting performance
   n_gpu_layers=35  # The number of layers to offload to GPU, if you have GPU acceleration available
 )
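For completeness, here is a minimal generation call using the object the diff constructs. This is a sketch assuming the standard llama-cpp-python completion API; `n_ctx` is lowered to 8192 so the example runs on ordinary hardware, and the `max_tokens` and `stop` values are illustrative choices, not values from the README:

```python
from llama_cpp import Llama

# n_ctx=200000 (as in the diff) needs an enormous amount of RAM; 8192 is
# assumed here purely so the example is runnable on modest hardware.
llm = Llama(
    model_path="./capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf",
    n_ctx=8192,
    n_threads=8,
    n_gpu_layers=35,
)

# Simple completion using this model's SYSTEM/USER/ASSISTANT prompt format.
output = llm(
    "SYSTEM: You are a helpful assistant.\n"
    "USER: Summarise what RoPE scaling does in one sentence.\n"
    "ASSISTANT:",
    max_tokens=256,   # Illustrative cap on generated tokens
    stop=["USER:"],   # Stop before the template's next turn marker
    temperature=0.7,
)
print(output["choices"][0]["text"])
```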