Best open source model for coding (August 2023)

#1
by ronenzyroff - opened

See my results here (obtained with KoboldCpp-NoCuda, release koboldcpp-1.41 (beta)):
https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GGUF/discussions/1#64ec09c4c68ddc867b897078

Below are instructions for running the model with https://github.com/oobabooga/text-generation-webui on Windows 11, on the CPU via llama.cpp, with GPU acceleration enabled only to speed up prompt ingestion (about 26x).
On my laptop the model generates at 1.0 token per second, but because I'm using BLAS GPU acceleration, my prompt gets processed by llama.cpp at 26 tokens per second, so I don't have to wait long until the AI starts responding.
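To put that in perspective (my own back-of-the-envelope numbers, not measurements from the post above): a completely full 4096-token context takes about 4096 / 26 ≈ 160 seconds (roughly 2.5 minutes) to ingest at 26 tokens per second, versus more than an hour if the prompt were processed at the 1 token per second generation speed.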

Prerequisites

  1. At least one of:
    a. An integrated GPU with at least 500 megabytes of dedicated VRAM.
    b. A dedicated AMD GPU with at least 4 gigabytes of dedicated VRAM, and AMD ROCm drivers installed globally on the system.
    c. A dedicated Nvidia GPU with at least 4 gigabytes of dedicated VRAM, and Nvidia CUDA drivers installed globally on the system.

  2. Windows 11 with Git installed

  3. Minimum RAM (regular system RAM, not VRAM): 64 gigabytes for the 8-bit quantized model, or 32 gigabytes for the 4-bit quantized model.
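As a rough sanity check on those numbers (my own back-of-the-envelope estimate, not an official requirement): a ~34B-parameter model at 8 bits per weight is about 34e9 x 1 byte ≈ 34 GB on disk, and llama.cpp keeps essentially all of it in RAM plus a ~0.75 GB KV cache at 4096 context (the load log further down reports "mem required = 34133.87 MB (+ 768.00 MB per state)"), so 64 GB leaves headroom for the OS and browser. At roughly 5 bits per weight (Q5_K_M) the file shrinks to around 34e9 x 5/8 ≈ 21-24 GB, which is why it still fits in 32 GB.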

Installation

  1. Download the zip file of the one-click installer for Windows from: https://github.com/oobabooga/text-generation-webui
  2. Extract the zip file to C:/ai/oobabooga_windows
  3. Double-click the file C:/ai/oobabooga_windows/start_windows.bat (no need to run as admin).
  4. In the opened CMD window, wait half a minute and then choose the option matching the GPU in your system. If you're not sure, choose "Nvidia GPU" (yes, even with an AMD integrated GPU the Nvidia option will probably still work).
  5. Wait 20 minutes until start_windows.bat finishes installing oobabooga.

Configuration

  1. Edit the empty file C:/ai/oobabooga_windows/CMD_FLAGS.txt and put this text in it: --listen --model airoboros-c34b-2.1.Q8_0.gguf --loader llamacpp --threads 8 --n_ctx 4096. Adjust --threads to the number of physical cores on your CPU to get its full power, and adjust the model filename to the quantization level you're planning to use (e.g. airoboros-c34b-2.1.Q5_K_M.gguf).
  2. Copy the file: C:/ai/oobabooga_windows/text-generation-webui/settings-template.yaml so that you have a new file with identical contents named: C:/ai/oobabooga_windows/text-generation-webui/settings.yaml. This allows you to customize the default oobabooga settings.
  3. There are many settings in the file C:/ai/oobabooga_windows/text-generation-webui/settings.yaml. Make sure that these very specific options are set to the correct values:
max_new_tokens: 4096
truncation_length: 4096
ban_eos_token: false
add_bos_token: false

Edit: I'm not actually sure about that. The correct settings might include the BOS token:

max_new_tokens: 4096
truncation_length: 4096
ban_eos_token: false
add_bos_token: true

See my confusion regarding the prompt template in this discussion: https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GGUF/discussions/2#64ee670981e61fdcf76117eb
4. Change the contents of the file: C:/ai/oobabooga_windows/text-generation-webui/prompts/QA.txt from the default value of:

"""Common sense questions and answers

Question: 
Factual answer:
"""

to the value fitting the airoboros-c34b-2.1 model prompt template:

"""A chat.
USER: Can you explain the difference between vector and dequeue in C++?
ASSISTANT: """

Make sure to use a text editor that doesn't add a problematic trailing newline at the end of the QA.txt file. Use Notepad++ for the edit instead of regular Notepad, so the prompt template stays exactly as written.
5. Create a folder called: C:/ai/oobabooga_windows/text-generation-webui/models/airoboros-c34b-2.1.Q8_0.gguf. Adjust according to the model quantization level that you chose.
6. Download the model from the huggingface GUI from this repo: https://huggingface.co/TheBloke/Airoboros-c34B-2.1-GGUF and place it directly inside the folder that you created in the previous step. You only need that one *.gguf file, so you can download it in the browser (or use the small download sketch right after this list). I recommend airoboros-c34b-2.1.Q5_K_M.gguf if you have only 32 gigabytes of RAM.
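If you'd rather script the download than pull a ~24-36 GB file through the browser, here is a minimal Python sketch (my own addition, assuming a reasonably recent huggingface_hub; the repo name and filename are the ones from the step above):

from huggingface_hub import hf_hub_download

# Download the single GGUF file into the folder created in the previous step.
hf_hub_download(
    repo_id="TheBloke/Airoboros-c34B-2.1-GGUF",
    filename="airoboros-c34b-2.1.Q5_K_M.gguf",
    local_dir="C:/ai/oobabooga_windows/text-generation-webui/models/airoboros-c34b-2.1.Q5_K_M.gguf",
    local_dir_use_symlinks=False,  # store the real file there, not a symlink into the HF cache
)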

Running the model

  1. Double-click: C:/ai/oobabooga_windows/cmd_windows.bat so that a CMD window opens.
  2. Run the command: python webui.py. If everything works correctly, you should see output like:
llm_load_tensors: mem required  = 34133.87 MB (+  768.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/51 layers to GPU
llm_load_tensors: VRAM used: 0 MB
....................................................................................................
llama_new_context_with_model: kv self size  =  768.00 MB
llama_new_context_with_model: compute buffer total size =  561.41 MB
llama_new_context_with_model: VRAM scratch buffer: 560.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
2023-08-29 14:48:12 INFO:Loaded the model in 3.86 seconds.

2023-08-29 14:48:12 INFO:Loading the extension "gallery"...
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
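If you want to sanity-check the model outside the webui, here is a minimal llama-cpp-python sketch (my own addition; llama-cpp-python is the library behind the llamacpp loader, and the path and values below simply mirror the configuration above):

from llama_cpp import Llama

llm = Llama(
    model_path="C:/ai/oobabooga_windows/text-generation-webui/models/"
               "airoboros-c34b-2.1.Q8_0.gguf/airoboros-c34b-2.1.Q8_0.gguf",
    n_ctx=4096,      # same as --n_ctx 4096
    n_threads=8,     # same as --threads 8
    n_gpu_layers=0,  # CPU-only, matching the "offloaded 0/51 layers to GPU" line above
)
out = llm("A chat.\nUSER: Tell me a joke.\nASSISTANT: ", max_tokens=128)
print(out["choices"][0]["text"])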

Using the model

  1. Open the Firefox web browser, either on your local Windows computer or on another computer on the same network (use this machine's IP address instead of localhost in that case), enter the address "localhost:7860", and press ENTER.
  2. Paste the prompt template in the "Default" tab in the webui.
    The prompt template is super sensitive to spaces / newlines.
    The correct prompt template that I saw working is:
"""A chat.
USER: Tell me a joke.
ASSISTANT: """

Of course, the prompt doesn't include those triple quotation marks.
Note the lack of a newline at the beginning of the prompt template, note the newline after each line, and note that the last line (ASSISTANT:) doesn't end with a newline but with a single space instead (see the small string-building sketch after the multi-turn example below).
Multi-turn conversations do work; here is the prompt template:

"""A chat.
USER: Tell me a joke.
ASSISTANT: Why don't secrets ever get lost?

They always pop up when you least expect them!
USER: Explain the joke
ASSISTANT: """

Of course, the prompt doesn't include those triple quotation marks.
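Because the whitespace is so easy to get wrong, here is a small Python sketch (my own addition, the function name is mine) that builds the template with the exact newlines and trailing space described above, and writes QA.txt without the trailing newline that plain Notepad tends to add:

def build_prompt(system, turns):
    # turns is a list of (user_message, assistant_reply) pairs;
    # pass None as the reply for the turn the model should complete.
    parts = [system]
    for user_msg, assistant_reply in turns:
        parts.append(f"USER: {user_msg}")
        if assistant_reply is None:
            parts.append("ASSISTANT: ")  # trailing space, no trailing newline
        else:
            parts.append(f"ASSISTANT: {assistant_reply}")
    return "\n".join(parts)

prompt = build_prompt("A chat.", [("Tell me a joke.", None)])
# -> "A chat.\nUSER: Tell me a joke.\nASSISTANT: "

# Write it to QA.txt exactly as-is (newline="" prevents \n -> \r\n translation on Windows):
with open("C:/ai/oobabooga_windows/text-generation-webui/prompts/QA.txt", "w", newline="") as f:
    f.write(prompt)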

Yes, I know that LLMs don't know how to tell jokes yet, but it tries its best to explain that nonsensical joke 😜

For reference, this is what happened when I ran the same model directly with llama.cpp's main binary (it crashed with a segmentation fault):

/dl/Projects/Neural/LLM/llama.cpp/./main -m ./airoboros-c34b-2.1.Q5_K_M.gguf -f ./testprompt.txt
main: build = 1083 (c1ac54b)
main: seed = 1693546812
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
Segmentation fault
