Updated with llama-cpp-python example
README.md CHANGED
@@ -93,7 +93,7 @@ Generated importance matrix file: [Cerebrum-1.0-8x7b.imatrix.dat](https://huggin

Make sure you are using `llama.cpp` from commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307) or later.

```shell
-./main -ngl 33 -m Cerebrum-1.0-8x7b.IQ2_XS.gguf --override-kv llama.expert_used_count=int:3 --color -c 16384 --temp 0.7 --
+./main -ngl 33 -m Cerebrum-1.0-8x7b.IQ2_XS.gguf --override-kv llama.expert_used_count=int:3 --color -c 16384 --temp 0.7 --repeat-penalty 1.0 -n -1 -p "<s>A chat between a user and a thinking artificial intelligence assistant. The assistant describes its thought process and gives helpful and detailed answers to the user's questions.\nUser: {prompt}\nAI:"
```

Change `-ngl 33` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

@@ -107,6 +107,68 @@ There is a similar option for V-cache (`-ctv`), however that is [not working yet

For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)

## How to run from Python code

You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) module.

### How to load this model in Python code, using llama-cpp-python

For full documentation, please see: [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/).

#### First install the package

Run one of the following commands, according to your system:

```shell
# Prebuilt wheel with basic CPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
# Prebuilt wheel with NVidia CUDA acceleration (change cu121 to cu122 etc. to match your CUDA version)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# Prebuilt wheel with Metal GPU acceleration
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
# Build base version with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# Or with Vulkan acceleration
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
# Or with Kompute acceleration
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
# Or with SYCL acceleration
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

# On Windows, to set the CMAKE_ARGS variable in PowerShell, use this format; e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
pip install llama-cpp-python
```
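
If you don't already have the GGUF file locally, one way to fetch it from the Hugging Face Hub is sketched below. The repo id is a placeholder for wherever these GGUF files are hosted, and `huggingface_hub` (`pip install huggingface_hub`) is an extra dependency not mentioned in this README.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id: substitute the actual repository that hosts these GGUF files.
model_path = hf_hub_download(
    repo_id="your-namespace/Cerebrum-1.0-8x7b-GGUF",
    filename="Cerebrum-1.0-8x7b.IQ3_M.gguf",
)
print(model_path)  # local path you can pass to Llama(model_path=...) below
```
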
#### Simple llama-cpp-python example code

```python
from llama_cpp import Llama

# Chat Completion API

llm = Llama(model_path="./Cerebrum-1.0-8x7b.IQ3_M.gguf", n_gpu_layers=33, n_ctx=16384)
print(llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            "content": "Write a story about llamas."
        }
    ]
))
```

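The example above uses llama-cpp-python's chat completion helper with its default chat formatting. To reproduce the prompt format from the `./main` command earlier, a minimal sketch using the plain completion API is shown below; the stop string, `max_tokens`, and the `kv_overrides` argument (mirroring `--override-kv llama.expert_used_count=int:3`) are assumptions, not part of this README.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Cerebrum-1.0-8x7b.IQ3_M.gguf",
    n_gpu_layers=33,
    n_ctx=16384,
    # Assumed equivalent of --override-kv llama.expert_used_count=int:3;
    # requires a llama-cpp-python version that supports kv_overrides.
    kv_overrides={"llama.expert_used_count": 3},
)

# Prompt template taken from the ./main example above.
prompt = (
    "<s>A chat between a user and a thinking artificial intelligence assistant. "
    "The assistant describes its thought process and gives helpful and detailed "
    "answers to the user's questions.\n"
    "User: Write a story about llamas.\n"
    "AI:"
)

# Plain completion call; stopping on the next "User:" turn is an assumption.
output = llm(prompt, max_tokens=512, temperature=0.7, repeat_penalty=1.0, stop=["User:"])
print(output["choices"][0]["text"])
```
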
<!-- README_GGUF.md-how-to-run end -->

<!-- original-model-card start -->