GGUF usage with llama.cpp

Llama.cpp allows you to download and run inference on a GGUF file simply by providing the Hugging Face repo name and the GGUF file name. llama.cpp downloads the model checkpoint and caches it automatically; the location of the cache is defined by the LLAMA_CACHE environment variable (read more about it here). For example:

./llama-cli \
  --hf-repo lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3-8B-Instruct-Q8_0.gguf \
  -p "You are a helpful assistant" -cnv

Note: You can remove -cnv to run the CLI in plain completion mode instead of conversation (chat) mode.
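
If you want the downloaded checkpoint cached in a specific directory, you can point LLAMA_CACHE at it when invoking the CLI. A minimal sketch, where ./my-gguf-cache is just a placeholder path:

LLAMA_CACHE=./my-gguf-cache ./llama-cli \
  --hf-repo lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3-8B-Instruct-Q8_0.gguf \
  -p "You are a helpful assistant" -cnv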

Additionally, you can serve an OpenAI-compatible chat completions endpoint directly using the llama.cpp server:

./llama-server \
  --hf-repo lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3-8B-Instruct-Q8_0.gguf

After starting the server, you can query the endpoint as follows:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant. Your top priority is achieving user fulfilment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about Python exceptions"
      }
    ]
  }'
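
Since the endpoint follows the OpenAI chat completions schema, you can extract the assistant's reply with any JSON tool. A minimal sketch using jq (assuming jq is installed locally):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      { "role": "user", "content": "Write a limerick about Python exceptions" }
    ]
  }' | jq -r '.choices[0].message.content'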

Replace --hf-repo with any valid Hugging Face Hub repo name and --hf-file with the GGUF file name in that repo, and off you go! 🦙
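
For instance, you could point the CLI at a different GGUF repo and quantization. The repo and file names below are purely illustrative; check that they exist on the Hub before running:

./llama-cli \
  --hf-repo TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  --hf-file mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -p "You are a helpful assistant" -cnv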

Note: Remember to build llama.cpp with LLAMA_CURL=1 :)
