# Simple autogenerated Python bindings for ggml

This folder contains:

- Scripts to generate full Python bindings from ggml headers (+ stubs for autocompletion in IDEs)
- Some barebones utils (see [ggml/utils.py](./ggml/utils.py)):
  - `ggml.utils.init` builds a context that's freed automatically when the pointer gets GC'd
  - `ggml.utils.copy` **copies between same-shaped tensors (numpy or ggml), w/ automatic (de/re)quantization**
  - `ggml.utils.numpy` returns a numpy view over a ggml tensor; if it's quantized, it returns a copy (requires `allow_copy=True`)
- Very basic examples (anyone want to port [llama2.c](https://github.com/karpathy/llama2.c)?)

Provided you set `GGML_LIBRARY=.../path/to/libggml_shared.so` (see instructions below), it's trivial to do some operations on quantized tensors:

```python
# Make sure libllama.so is in your [DY]LD_LIBRARY_PATH, or set GGML_LIBRARY=.../libggml_shared.so

from ggml import lib, ffi
from ggml.utils import init, copy, numpy
import numpy as np

ctx = init(mem_size=12*1024*1024)
n = 256
n_threads = 4

a = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_Q5_K, n)
b = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_F32, n)  # Can't both be quantized
sum = lib.ggml_add(ctx, a, b)  # all zeroes for now. Will be quantized too!

gf = ffi.new('struct ggml_cgraph*')
lib.ggml_build_forward_expand(gf, sum)

copy(np.array([i for i in range(n)], np.float32), a)
copy(np.array([i*100 for i in range(n)], np.float32), b)
lib.ggml_graph_compute_with_ctx(ctx, gf, n_threads)

print(numpy(a, allow_copy=True))
# 0. 1.0439453 2.0878906 3.131836 4.1757812 5.2197266 ...
print(numpy(b))
# 0. 100. 200. 300. 400. 500. ...
print(numpy(sum, allow_copy=True))
# 0. 105.4375 210.875 316.3125 421.75 527.1875 ...
```

### Prerequisites

You'll need a shared library of ggml to use the bindings.

#### Build libggml_shared.so or libllama.so

As of this writing, the easiest option is to use [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)'s generated `libggml_shared.so` or `libllama.so`, which you can build as follows:

```bash
git clone https://github.com/ggerganov/llama.cpp
# On a CUDA-enabled system add -DLLAMA_CUBLAS=1
# On a Mac add -DLLAMA_METAL=1
cmake llama.cpp \
  -B llama_build \
  -DCMAKE_C_FLAGS=-Ofast \
  -DLLAMA_NATIVE=1 \
  -DLLAMA_LTO=1 \
  -DBUILD_SHARED_LIBS=1 \
  -DLLAMA_MPI=1 \
  -DLLAMA_BUILD_TESTS=0 \
  -DLLAMA_BUILD_EXAMPLES=0
( cd llama_build && make -j )

# On Mac, this will be libggml_shared.dylib instead
export GGML_LIBRARY=$PWD/llama_build/libggml_shared.so

# Alternatively, you can just copy it to your system's lib dir, e.g. /usr/local/lib
```

#### (Optional) Regenerate the bindings and stubs

If you added or changed any signatures of the C API, you'll want to regenerate the bindings ([ggml/cffi.py](./ggml/cffi.py)) and stubs ([ggml/__init__.pyi](./ggml/__init__.pyi)).

Luckily it's a one-liner using [regenerate.py](./regenerate.py):

```bash
pip install -q cffi

python regenerate.py
```
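After regenerating, a quick way to confirm the new bindings load is to resolve a symbol through them. This is a minimal sketch, assuming `GGML_LIBRARY` points at your freshly built shared library; `ggml_type_name` is a function from the ggml C API:

```python
from ggml import lib, ffi

# If the regenerated bindings import cleanly and this prints a type name
# (e.g. q5_K), symbol resolution against the shared library is working.
print(ffi.string(lib.ggml_type_name(lib.GGML_TYPE_Q5_K)).decode())
```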
By default, `regenerate.py` assumes `llama.cpp` was cloned in `../../../llama.cpp` (alongside the `ggml` folder). You can override this with:

```bash
C_INCLUDE_DIR=$LLAMA_CPP_DIR python regenerate.py
```

You can also edit [api.h](./api.h) to control which files should be included in the generated bindings (defaults to `llama.cpp/ggml*.h`).

In fact, if you wanted to generate bindings only for the current version of the `ggml` repo itself (instead of `llama.cpp`; you'd lose support for k-quants), you could run:

```bash
API=../../include/ggml/ggml.h python regenerate.py
```

## Develop

Run tests:

```bash
pytest
```

### Alternatives

This example's goal is to showcase [cffi](https://cffi.readthedocs.io/)-generated bindings that are trivial to use and update, but there are already alternatives in the wild:

- https://github.com/abetlen/ggml-python: these bindings seem to be hand-written and use [ctypes](https://docs.python.org/3/library/ctypes.html). They come with [high-quality API reference docs](https://ggml-python.readthedocs.io/en/latest/api-reference/#ggml.ggml) that can also be used with the bindings in this folder, but they don't expose Metal, CUDA, MPI or OpenCL calls, don't support transparent (de/re)quantization like this example does (see the [ggml.utils](./ggml/utils.py) module and the round-trip sketch after this list), and won't pick up your local changes.
- https://github.com/abetlen/llama-cpp-python: these expose the C++ `llama.cpp` interface, which this example cannot easily be extended to support (`cffi` only generates bindings for C libraries).
- [pybind11](https://github.com/pybind/pybind11) and [nanobind](https://github.com/wjakob/nanobind) are two alternatives to cffi that support binding C++ libraries, but neither of them seems to have an automatic generator (writing bindings by hand is rather time-consuming).
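For completeness, here's a short sketch of the transparent (de/re)quantization round-trip mentioned in the first bullet above. It uses only the `ggml.utils` helpers and `lib` calls already shown in this README; the exact error magnitude depends on the data and the quantization type:

```python
from ggml import lib
from ggml.utils import init, copy, numpy
import numpy as np

ctx = init(mem_size=1*1024*1024)
n = 256  # Q5_K works on 256-element super-blocks, so keep n a multiple of 256
q = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_Q5_K, n)

original = np.arange(n, dtype=np.float32)
copy(original, q)                      # quantizes on the way in
roundtrip = numpy(q, allow_copy=True)  # dequantizes on the way out

# Quantization is lossy, so expect a small but non-zero round-trip error.
print(np.max(np.abs(roundtrip - original)))
```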