Using llama.cpp in the web UI

Setting up the models

Pre-converted

Place the model in the models folder, making sure that its name contains ggml somewhere and ends in .bin.
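As a rough illustration of the naming rule above, the following Python sketch lists the files in a models folder whose names would match. The helper names `looks_like_ggml_model` and `find_ggml_models` are hypothetical, and the check is a literal, case-sensitive reading of the rule; the web UI's actual matching logic may differ.

```python
from pathlib import Path


def looks_like_ggml_model(filename: str) -> bool:
    """Literal reading of the rule: name contains 'ggml' and ends in '.bin'."""
    return "ggml" in filename and filename.endswith(".bin")


def find_ggml_models(models_dir: str = "models") -> list[str]:
    """Return matching file names in models_dir (empty list if it is missing)."""
    folder = Path(models_dir)
    if not folder.is_dir():
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and looks_like_ggml_model(p.name)
    )


if __name__ == "__main__":
    # Print every file in ./models that satisfies the naming rule.
    for name in find_ggml_models():
        print(name)
```

A file named `llama-7b.bin` would be skipped by this check because its name lacks `ggml`; renaming it to something like `ggml-llama-7b.bin` would make it match.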

Convert LLaMA yourself

Follow the instructions in the llama.cpp README to generate the ggml-model.bin file: https://github.com/ggerganov/llama.cpp#usage

GPU acceleration

GPU acceleration is enabled with the --n-gpu-layers parameter, which sets how many layers of the model are offloaded to the GPU.

  • If you have enough VRAM, use a high number like --n-gpu-layers 200000 to offload all layers to the GPU.
  • Otherwise, start with a low number like --n-gpu-layers 10 and gradually increase it until you run out of memory, then go back to the last value that worked.
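The trial-and-error procedure in the second bullet can be sketched as a simple loop. This is only an illustration: `loads_ok` is a hypothetical callback standing in for "start the web UI with this --n-gpu-layers value and see whether it loads without an out-of-memory error", and the `step` and `max_layers` defaults are arbitrary assumptions.

```python
from typing import Callable, Optional


def find_max_gpu_layers(
    loads_ok: Callable[[int], bool],
    start: int = 10,
    step: int = 5,
    max_layers: int = 200,
) -> Optional[int]:
    """Increase the layer count step by step; return the last value that loaded.

    Returns None if even the starting value does not fit.
    """
    best = None
    n = start
    while n <= max_layers and loads_ok(n):
        best = n  # remember the last value that worked
        n += step
    return best


if __name__ == "__main__":
    # Simulated GPU that runs out of memory above 23 layers (made-up number).
    print(find_max_gpu_layers(lambda n: n <= 23))  # prints 20
```

A smaller `step` gets closer to the real limit at the cost of more attempts.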

To use this feature, you need to manually compile and install llama-cpp-python with GPU support.

Linux

pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir

Windows

pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir

macOS

pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir

Here you can find the different compilation options for OpenBLAS / cuBLAS / CLBlast: https://pypi.org/project/llama-cpp-python/

Performance

This was the performance of llama-7b int4 on my i5-12400F (CPU only):

Output generated in 33.07 seconds (6.05 tokens/s, 200 tokens, context 17)
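The tokens/s figure in that log line is consistent with the number of generated tokens divided by the elapsed time. A minimal sketch of the arithmetic, using a hypothetical helper name, makes it easy to compare your own runs:

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Generation throughput: tokens produced divided by elapsed seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


if __name__ == "__main__":
    # Values taken from the log line above: 200 tokens in 33.07 seconds.
    rate = tokens_per_second(200, 33.07)
    print(f"{rate:.2f} tokens/s")  # prints 6.05 tokens/s
```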

You can change the number of threads with --threads N.
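To show how the flags in this document fit together, here is a small sketch that assembles a launch command. The `--threads` and `--n-gpu-layers` flags come from the text above; the launcher script name `server.py`, the `--model` flag, and the helper name `build_launch_command` are assumptions for illustration and may not match your installation.

```python
import shlex
from typing import Optional


def build_launch_command(
    model: str,
    threads: Optional[int] = None,
    n_gpu_layers: Optional[int] = None,
    entry_point: str = "server.py",  # assumed launcher script name
) -> list[str]:
    """Assemble a command line from the flags described in this document."""
    cmd = ["python", entry_point, "--model", model]  # --model is an assumption
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be at least 1")
        cmd += ["--threads", str(threads)]
    if n_gpu_layers is not None:
        if n_gpu_layers < 0:
            raise ValueError("n_gpu_layers must not be negative")
        cmd += ["--n-gpu-layers", str(n_gpu_layers)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_launch_command("ggml-model.bin", threads=6, n_gpu_layers=10)))
```

Printing the command rather than executing it keeps the sketch safe to run; paste the output into a terminal once it matches your setup.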