# Simple autogenerated Python bindings for ggml
This folder contains:
- Scripts to generate full Python bindings from ggml headers (+ stubs for autocompletion in IDEs)
- Some barebones utils (see `ggml/utils.py`):
  - `ggml.utils.init` builds a context that's freed automatically when the pointer gets GC'd
  - `ggml.utils.copy` copies between same-shaped tensors (numpy or ggml), w/ automatic (de/re)quantization
  - `ggml.utils.numpy` returns a numpy view over a ggml tensor; if it's quantized, it returns a copy (requires `allow_copy=True`)
- Very basic examples (anyone want to port [llama2.c](https://github.com/karpathy/llama2.c)?)
Provided you set `GGML_LIBRARY=.../path/to/libggml_shared.so` (see instructions below), it's trivial to do some operations on quantized tensors:

```python
# Make sure libllama.so is in your [DY]LD_LIBRARY_PATH, or set GGML_LIBRARY=.../libggml_shared.so
from ggml import lib, ffi
from ggml.utils import init, copy, numpy
import numpy as np
ctx = init(mem_size=12*1024*1024)
n = 256
n_threads = 4
a = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_Q5_K, n)
b = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_F32, n) # Can't both be quantized
sum = lib.ggml_add(ctx, a, b) # all zeroes for now. Will be quantized too!
gf = ffi.new('struct ggml_cgraph*')
lib.ggml_build_forward_expand(gf, sum)
copy(np.array([i for i in range(n)], np.float32), a)
copy(np.array([i*100 for i in range(n)], np.float32), b)
lib.ggml_graph_compute_with_ctx(ctx, gf, n_threads)
print(numpy(a, allow_copy=True))
# 0. 1.0439453 2.0878906 3.131836 4.1757812 5.2197266 ...
print(numpy(b))
# 0. 100. 200. 300. 400. 500. ...
print(numpy(sum, allow_copy=True))
# 0. 105.4375 210.875 316.3125 421.75 527.1875 ...
```
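
Note that `a` only approximates the values written into it: Q5_K storage is lossy. As a quick follow-up (a sketch continuing the Python session above, not part of the utils), you can measure the round-trip quantization error:

```python
# Sketch, continuing the session above: quantify the error introduced by
# round-tripping float32 data through the Q5_K tensor `a`.
expected = np.arange(n, dtype=np.float32)
dequantized = numpy(a, allow_copy=True)  # dequantized copy of `a`
print(np.abs(dequantized - expected).max())  # small but non-zero
```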
## Prerequisites
You'll need a shared library of ggml to use the bindings.
### Build `libggml_shared.so` or `libllama.so`
As of this writing, the best option is to use the `libggml_shared.so` or `libllama.so` generated by [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp), which you can build as follows:

```bash
git clone https://github.com/ggerganov/llama.cpp
# On a CUDA-enabled system add -DLLAMA_CUBLAS=1
# On a Mac add -DLLAMA_METAL=1
cmake llama.cpp \
  -B llama_build \
  -DCMAKE_C_FLAGS=-Ofast \
  -DLLAMA_NATIVE=1 \
  -DLLAMA_LTO=1 \
  -DBUILD_SHARED_LIBS=1 \
  -DLLAMA_MPI=1 \
  -DLLAMA_BUILD_TESTS=0 \
  -DLLAMA_BUILD_EXAMPLES=0
( cd llama_build && make -j )

# On Mac, this will be libggml_shared.dylib instead
export GGML_LIBRARY=$PWD/llama_build/libggml_shared.so

# Alternatively, you can just copy it to your system's lib dir, e.g. /usr/local/lib
```
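
To confirm the bindings can locate and load the library, you can call any plain ggml function; this quick sanity check (a sketch, not part of the bindings) uses `ggml_type_name` from the public ggml API:

```python
# Sanity-check sketch: verify the shared library loads and responds.
# Assumes GGML_LIBRARY points at the library built above.
import os
print(os.environ.get("GGML_LIBRARY"))

from ggml import lib, ffi
# ggml_type_name returns a C string; decode it for display (e.g. "q5_K")
print(ffi.string(lib.ggml_type_name(lib.GGML_TYPE_Q5_K)).decode())
```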
### (Optional) Regenerate the bindings and stubs
If you added or changed any signatures of the C API, you'll want to regenerate the bindings (`ggml/cffi.py`) and stubs (`ggml/__init__.pyi`). Luckily it's a one-liner using `regenerate.py`:

```bash
pip install -q cffi

python regenerate.py
```
By default it assumes `llama.cpp` was cloned in `../../../llama.cpp` (alongside the `ggml` folder). You can override this with:

```bash
C_INCLUDE_DIR=$LLAMA_CPP_DIR python regenerate.py
```
You can also edit `api.h` to control which files should be included in the generated bindings (defaults to `llama.cpp/ggml*.h`).
In fact, if you wanted to only generate bindings for the current version of the `ggml` repo itself (instead of `llama.cpp`; you'd lose support for k-quants), you could run:

```bash
API=../../include/ggml/ggml.h python regenerate.py
```
## Develop
Run tests:

```bash
pytest
```
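
If you're adding helpers of your own, a minimal test might look like the hypothetical sketch below (it assumes the `init` / `copy` / `numpy` utils described above; it is not part of the existing suite):

```python
# test_roundtrip_sketch.py -- hypothetical example test, not from the repo.
import numpy as np
from ggml import lib
from ggml.utils import init, copy
from ggml.utils import numpy as to_numpy

def test_f32_roundtrip():
    # Build a small context and a 1-D f32 tensor.
    ctx = init(mem_size=1024*1024)
    t = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_F32, 16)
    src = np.arange(16, dtype=np.float32)
    copy(src, t)  # numpy -> ggml
    # f32 isn't quantized, so to_numpy returns a zero-copy view.
    assert np.array_equal(to_numpy(t), src)
```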
## Alternatives
This example's goal is to showcase `cffi`-generated bindings that are trivial to use and update, but there are already alternatives in the wild:
- https://github.com/abetlen/ggml-python: these bindings seem to be hand-written and use `ctypes`. It has high-quality API reference docs that can be used with these bindings too, but it doesn't expose Metal, CUDA, MPI or OpenCL calls, doesn't support transparent (de/re)quantization like this example does (see the `ggml.utils` module), and won't pick up your local changes.
- https://github.com/abetlen/llama-cpp-python: these expose the C++ `llama.cpp` interface, which this example cannot easily be extended to support (`cffi` only generates bindings for C libraries).
- `pybind11` and `nanobind` are two alternatives to `cffi` that support binding C++ libraries, but neither of them seems to have an automatic generator (writing bindings is rather time-consuming).