Upload README.md
README.md CHANGED
@@ -204,12 +204,12 @@ Windows Command Line users: You can set the environment variable by running `set
 Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
 
 ```shell
-./main -ngl 35 -m capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf --color -c
+./main -ngl 35 -m capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf --color -c 200000 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "SYSTEM: {system_message}\nUSER: {prompt}\nASSISTANT:"
 ```
 
 Change `-ngl 35` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
 
-Change `-c
+Change `-c 200000` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.
 
 If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
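The chat-mode note above ("replace the `-p <PROMPT>` argument with `-i -ins`") is easy to get wrong in practice, so here is a sketch of the resulting command. It assumes the same model file as the diff; `-c` is deliberately lowered to 8192 (not a value from the README), since the model's full 200K context needs far more memory than most machines have:

```shell
# Interactive instruct-style chat: -i -ins takes the place of -p "<PROMPT>".
# -c 8192 requests a smaller context than the model's 200K maximum to keep
# RAM/VRAM usage manageable; raise it if your hardware allows.
./main -ngl 35 -m capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf --color \
  -c 8192 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
```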
@@ -258,7 +258,7 @@ from llama_cpp import Llama
 # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
 llm = Llama(
   model_path="./capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf",  # Download the model file first
-  n_ctx=
+  n_ctx=200000,  # The max sequence length to use - note that longer sequence lengths require much more resources
   n_threads=8,  # The number of CPU threads to use, tailor to your system and the resulting performance
   n_gpu_layers=35  # The number of layers to offload to GPU, if you have GPU acceleration available
 )
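For completeness, here is a minimal generation call using the object the diff constructs. This is a sketch assuming the standard llama-cpp-python completion API; `n_ctx` is lowered to 8192 so the example runs on ordinary hardware, and the `max_tokens` and `stop` values are illustrative choices, not values from the README:

```python
from llama_cpp import Llama

# n_ctx=200000 (as in the diff) needs an enormous amount of RAM; 8192 is
# assumed here purely so the example is runnable on modest hardware.
llm = Llama(
    model_path="./capytessborosyi-34b-200k-dare-ties.Q4_K_M.gguf",
    n_ctx=8192,
    n_threads=8,
    n_gpu_layers=35,
)

# Simple completion using this model's SYSTEM/USER/ASSISTANT prompt format.
output = llm(
    "SYSTEM: You are a helpful assistant.\n"
    "USER: Summarise what RoPE scaling does in one sentence.\n"
    "ASSISTANT:",
    max_tokens=256,   # Illustrative cap on generated tokens
    stop=["USER:"],   # Stop before the template's next turn marker
    temperature=0.7,
)
print(output["choices"][0]["text"])
```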