CISCai committed
Commit dd8e76d
1 Parent(s): 4c1ab44

Update README.md

Files changed (1): README.md (+5 -1)
README.md CHANGED
@@ -47,7 +47,7 @@ You are an AI programming assistant, utilizing the Deepseek Coder model, develop
 <!-- compatibility_gguf start -->
 ## Compatibility
 
- These quantised GGUFv3 files are compatible with llama.cpp from February 26th 2024 onwards, as of commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307)
+ These quantised GGUFv3 files are compatible with llama.cpp from February 27th 2024 onwards, as of commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307)
 
 They are also compatible with many third party UIs and libraries provided they are built using a recent llama.cpp.
 
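The hunk above only corrects the stated compatibility date for commit 0becb22. As a minimal sketch of obtaining a compatible build (assuming the plain `make` workflow llama.cpp supported at the time):

```shell
# Build llama.cpp at (or after) the commit the README references
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 0becb22ac05b6542bd9d5f2235691aa1d3d4d307
make
```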
@@ -90,6 +90,7 @@ Refer to the Provided Files table below to see what files use which methods, and
 | [OpenCodeInterpreter-DS-6.7B.IQ4_XS.gguf](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/OpenCodeInterpreter-DS-6.7B.IQ4_XS.gguf) | IQ4_XS | 4 | 3.4 GB| 5.4 GB | small, substantial quality loss |
 
 Generated importance matrix file: [OpenCodeInterpreter-DS-6.7B.imatrix.dat](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/OpenCodeInterpreter-DS-6.7B.imatrix.dat)
+
 Generated importance matrix file (4K context): [OpenCodeInterpreter-DS-6.7B.imatrix-4096.dat](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/OpenCodeInterpreter-DS-6.7B.imatrix-4096.dat)
 
 **Note**: the above RAM figures assume no GPU offloading with 4K context. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
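The note above describes GPU offloading. A hedged example of what that looks like with llama.cpp's `main` binary; `-ngl` (number of GPU layers) is not part of this diff, and the layer count is illustrative:

```shell
# Offload 33 layers to the GPU: their weights move from RAM to VRAM,
# reducing the RAM figures quoted in the table above
./main -m OpenCodeInterpreter-DS-6.7B.IQ4_XS.gguf -c 4096 -ngl 33 -p "<PROMPT>"
```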
@@ -111,6 +112,9 @@ Change `-c 16384` to the desired sequence length.
 
 If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
 
+ If you are low on V/RAM try quantizing the K-cache with `-ctk q8_0` or even `-ctk q4_0` for big memory savings (depending on context size).
+ There is a similar option for V-cache (`-ctv`), however that is [not working yet](https://github.com/ggerganov/llama.cpp/issues/4425).
+
 For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)
 
 <!-- README_GGUF.md-how-to-run end -->
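Putting the flags this hunk adds or references together, one possible invocation (a sketch, not from the README itself; the model filename is reused from the table above):

```shell
# Chat-style session at 16K context with an 8-bit quantised K-cache;
# -ctk q4_0 saves even more memory at some quality cost
./main -m OpenCodeInterpreter-DS-6.7B.IQ4_XS.gguf -c 16384 -ctk q8_0 -i -ins
```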
 