---
license: apache-2.0
library_name: transformers
---
# GGUF Models: Conversion and Upload to Hugging Face
This guide explains what GGUF models are, how to convert models to GGUF format, and how to upload them to the Hugging Face Hub.
## What is GGUF?

GGUF is a binary file format for storing large language models, developed by the llama.cpp/GGML project and optimized for efficient inference on consumer hardware. Key features of GGUF models include:
- Successor to the GGML format
- Designed for efficient quantization and inference
- Supports a wide range of model architectures
- Commonly used with libraries like llama.cpp for running LLMs on consumer hardware
- Allows for reduced model size while maintaining good performance
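To give a feel for the format, the sketch below parses the fixed-size GGUF file header (per the published GGUF spec: a 4-byte `GGUF` magic, then a little-endian `uint32` version, `uint64` tensor count, and `uint64` metadata key/value count). It is a minimal illustration against a synthetic header, not a full reader:

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes):
    """Parse the fixed-size GGUF header from a byte buffer."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Build a synthetic header for demonstration: version 3, 2 tensors, 5 metadata entries
header = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(header))  # → (3, 2, 5)
```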
## Why and How to Convert to GGUF Format
Converting models to GGUF format offers several advantages:
- Reduced file size: GGUF models can be quantized to lower precision (e.g., 4-bit or 8-bit), significantly reducing model size.
- Faster inference: The format is optimized for quick loading and efficient inference on CPUs and consumer GPUs.
- Cross-platform compatibility: GGUF models can be used with libraries like llama.cpp, enabling deployment on various platforms.
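As a back-of-the-envelope illustration of the size reduction, file size is roughly parameters times bits per weight. The bits-per-weight figures below are approximate averages (real GGUF files add metadata and mix tensor types):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model size: parameters x bits per weight, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n = 8e9  # an 8B-parameter model
for label, bits in [("f16", 16), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(f"{label}: ~{approx_size_gb(n, bits):.1f} GB")
```

So an 8B model drops from roughly 16 GB at `f16` to under 5 GB at 4-bit quantization.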
To convert a model to GGUF format, we'll use the `convert-hf-to-gguf.py` script from the llama.cpp repository.
## Steps to Convert a Model to GGUF

1. Clone the llama.cpp repository:

   ```shell
   git clone https://github.com/ggerganov/llama.cpp.git
   ```

2. Install the required Python libraries:

   ```shell
   pip install -r llama.cpp/requirements.txt
   ```

3. Review the conversion script's options:

   ```shell
   python llama.cpp/convert-hf-to-gguf.py -h
   ```

4. Convert the Hugging Face model to GGUF:

   ```shell
   python llama.cpp/convert-hf-to-gguf.py ./models/8B/Meta-Llama-3-8B-Instruct --outfile Llama3-8B-Instruct-Q8_0.gguf --outtype q8_0
   ```
This command produces an 8-bit quantized model (`q8_0`). The conversion script can also emit full- or half-precision output (`f32`, `f16`, `bf16`); for lower-bit quantization such as 4-bit variants, convert to `f16` first and then run llama.cpp's `llama-quantize` tool on the result.
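If you script conversions for several models or output types, it can help to assemble the command line programmatically. The helper below is a sketch; its supported-type set reflects the common `--outtype` values, but check the script's `-h` output for the authoritative list:

```python
def build_convert_cmd(model_dir: str, outfile: str, outtype: str = "q8_0") -> list:
    """Assemble a convert-hf-to-gguf.py invocation as an argv list."""
    # Types the conversion script emits directly (sketch; verify with -h).
    supported = {"f32", "f16", "bf16", "q8_0"}
    if outtype not in supported:
        raise ValueError(
            f"convert script only emits {sorted(supported)}; "
            "use llama-quantize for lower-bit types"
        )
    return [
        "python", "llama.cpp/convert-hf-to-gguf.py",
        model_dir, "--outfile", outfile, "--outtype", outtype,
    ]

cmd = build_convert_cmd("./models/8B/Meta-Llama-3-8B-Instruct",
                        "Llama3-8B-Instruct-Q8_0.gguf")
print(" ".join(cmd))
```

The argv list can then be passed to `subprocess.run` to drive batch conversions.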
## Uploading GGUF Models to Hugging Face
Once you have your GGUF model, you can upload it to Hugging Face for easy sharing and versioning.
### Prerequisites

- Python 3.6+
- The `huggingface_hub` library (`pip install huggingface_hub`)
- A Hugging Face account and API token
### Upload Script

Save the following script as `upload_gguf_model.py`:
```python
import os

from huggingface_hub import HfApi

def push_to_hub(hf_token, local_path, model_id):
    api = HfApi(token=hf_token)
    api.create_repo(model_id, exist_ok=True, repo_type="model")
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=model_id,
    )
    print(f"Model successfully pushed to {model_id}")

# Example usage
hf_token = "your_huggingface_token_here"
local_path = "/path/to/your/model.gguf"
model_id = "your-username/your-model-name"
push_to_hub(hf_token, local_path, model_id)
```
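A small pre-flight check can catch common mistakes before the upload starts. `validate_upload_args` below is an illustrative helper, not part of `huggingface_hub`:

```python
import os

def validate_upload_args(local_path: str, model_id: str) -> str:
    """Return the filename to use in the repo, raising on obvious mistakes."""
    if not local_path.endswith(".gguf"):
        raise ValueError("expected a .gguf file")
    if model_id.count("/") != 1:
        raise ValueError("model_id must look like 'username/model-name'")
    return os.path.basename(local_path)

print(validate_upload_args("/tmp/Llama3-8B-Instruct-Q8_0.gguf", "alice/llama3-gguf"))
# → Llama3-8B-Instruct-Q8_0.gguf
```

Calling it at the top of `push_to_hub` fails fast instead of creating a repo and then erroring mid-upload.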
### Usage

Replace the placeholder values in the script:

- `your_huggingface_token_here`: your Hugging Face API token
- `/path/to/your/model.gguf`: the local path to your GGUF model file
- `your-username/your-model-name`: your desired model ID on Hugging Face
Then run the script:

```shell
python upload_gguf_model.py
```
## Best Practices

- Include a `README.md` file with your model, detailing its architecture, quantization, and usage instructions.
- Add a `config.json` file with model configuration details.
- Include any necessary tokenizer files.
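As a starting point for that `README.md`, here is a minimal model-card sketch; all frontmatter fields, names, and descriptions below are example values to adapt:

```markdown
---
license: apache-2.0
library_name: transformers
tags:
  - gguf
---

# Llama3-8B-Instruct-Q8_0 (GGUF)

8-bit (q8_0) GGUF quantization of Meta-Llama-3-8B-Instruct, converted with
llama.cpp's convert-hf-to-gguf.py.

## Usage

Load the file with llama.cpp or any GGUF-compatible runtime.
```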
## References
For more detailed information and updates, please refer to the official documentation of llama.cpp and Hugging Face.