CISCai committed on
Commit
4d4eda2
1 Parent(s): 4753e72

Updated with llama-cpp-python example

Files changed (1)
  1. README.md +63 -1
README.md CHANGED
@@ -93,7 +93,7 @@ Generated importance matrix file: [Cerebrum-1.0-8x7b.imatrix.dat](https://huggin
  Make sure you are using `llama.cpp` from commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307) or later.
 
  ```shell
- ./main -ngl 33 -m Cerebrum-1.0-8x7b.IQ2_XS.gguf --override-kv llama.expert_used_count=int:3 --color -c 16384 --temp 0.7 --repeat_penalty 1.0 -n -1 -p "<s>A chat between a user and a thinking artificial intelligence assistant. The assistant describes its thought process and gives helpful and detailed answers to the user's questions.\nUser: {prompt}\nAI:"
+ ./main -ngl 33 -m Cerebrum-1.0-8x7b.IQ2_XS.gguf --override-kv llama.expert_used_count=int:3 --color -c 16384 --temp 0.7 --repeat-penalty 1.0 -n -1 -p "<s>A chat between a user and a thinking artificial intelligence assistant. The assistant describes its thought process and gives helpful and detailed answers to the user's questions.\nUser: {prompt}\nAI:"
  ```
 
  Change `-ngl 33` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
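
For example, a CPU-only run is the same command with the offload flag removed (a minimal sketch based on the invocation above):

```shell
./main -m Cerebrum-1.0-8x7b.IQ2_XS.gguf --override-kv llama.expert_used_count=int:3 --color -c 16384 --temp 0.7 --repeat-penalty 1.0 -n -1 -p "<s>A chat between a user and a thinking artificial intelligence assistant. The assistant describes its thought process and gives helpful and detailed answers to the user's questions.\nUser: {prompt}\nAI:"
```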
@@ -107,6 +107,68 @@ There is a similar option for V-cache (`-ctv`), however that is [not working yet
 
  For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)
 
+ ## How to run from Python code
+ 
+ You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) module.
+ 
+ ### How to load this model in Python code, using llama-cpp-python
+ 
+ For full documentation, please see: [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/).
+ 
+ #### First install the package
+ 
+ Run one of the following commands, according to your system:
+ 
+ ```shell
+ # Prebuilt wheel with basic CPU support
+ pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
+ # Prebuilt wheel with NVidia CUDA acceleration (use cu121, cu122, etc. to match your CUDA version)
+ pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
+ # Prebuilt wheel with Metal GPU acceleration
+ pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
+ # Build base version with no GPU acceleration
+ pip install llama-cpp-python
+ # With NVidia CUDA acceleration
+ CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
+ # Or with OpenBLAS acceleration
+ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
+ # Or with CLBlast acceleration
+ CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
+ # Or with AMD ROCm GPU acceleration (Linux only)
+ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
+ # Or with Metal GPU acceleration for macOS systems only
+ CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
+ # Or with Vulkan acceleration
+ CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
+ # Or with Kompute acceleration
+ CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
+ # Or with SYCL acceleration
+ CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
+ 
+ # On Windows, set the CMAKE_ARGS variable in PowerShell before installing; e.g. for NVidia CUDA:
+ $env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
+ pip install llama-cpp-python
+ ```
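
Once installed, a quick way to confirm the package is importable and to see which version you got (an optional check, not part of the install steps above) is:

```shell
python -c "import llama_cpp; print(llama_cpp.__version__)"
```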
+ 
+ #### Simple llama-cpp-python example code
+ 
+ ```python
+ from llama_cpp import Llama
+ 
+ # Chat Completion API
+ 
+ llm = Llama(model_path="./Cerebrum-1.0-8x7b.IQ3_M.gguf", n_gpu_layers=33, n_ctx=16384)
+ print(llm.create_chat_completion(
+     messages = [
+         {"role": "system", "content": "You are a story writing assistant."},
+         {
+             "role": "user",
+             "content": "Write a story about llamas."
+         }
+     ]
+ ))
+ ```
+ 
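The example above uses the model's built-in chat template via `create_chat_completion`. If you instead want to reproduce the prompt format from the `./main` command with the raw completion API, a minimal sketch could look like the following; the file name, `n_gpu_layers`, `n_ctx`, temperature and repeat penalty are taken from the examples above, while `max_tokens=512` and the `stop` string are illustrative choices, and the `--override-kv` expert setting from the CLI example is not reproduced here:

```python
from llama_cpp import Llama

# Raw completion API, mirroring the prompt format from the ./main example.
# llama-cpp-python adds the BOS token automatically, so the leading <s> is omitted.
llm = Llama(model_path="./Cerebrum-1.0-8x7b.IQ3_M.gguf", n_gpu_layers=33, n_ctx=16384)

prompt = (
    "A chat between a user and a thinking artificial intelligence assistant. "
    "The assistant describes its thought process and gives helpful and detailed "
    "answers to the user's questions.\n"
    "User: Write a story about llamas.\n"
    "AI:"
)

output = llm(
    prompt,
    max_tokens=512,      # illustrative cap; -n -1 in the CLI means generate until EOS
    temperature=0.7,
    repeat_penalty=1.0,
    stop=["User:"],      # stop before the model starts a new user turn
)
print(output["choices"][0]["text"])
```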
  <!-- README_GGUF.md-how-to-run end -->
 
  <!-- original-model-card start -->