
Simple autogenerated Python bindings for ggml

This folder contains:

  • Scripts to generate full Python bindings from ggml headers (+ stubs for autocompletion in IDEs)
  • Some barebones utils (see ggml/utils.py):
    • ggml.utils.init builds a context that's freed automatically when the pointer gets GC'd
    • ggml.utils.copy copies between same-shaped tensors (numpy or ggml), w/ automatic (de/re)quantization
    • ggml.utils.numpy returns a numpy view over a ggml tensor; if it's quantized, it returns a copy (requires allow_copy=True)
  • Very basic examples (anyone wants to port llama2.c?)

Provided you set GGML_LIBRARY=.../path/to/libggml_shared.so (see instructions below), it's trivial to do some operations on quantized tensors:

# Make sure libllama.so is in your [DY]LD_LIBRARY_PATH, or set GGML_LIBRARY=.../libggml_shared.so

from ggml import lib, ffi
from ggml.utils import init, copy, numpy
import numpy as np

ctx = init(mem_size=12*1024*1024)
n = 256
n_threads = 4

a = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_Q5_K, n)
b = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_F32, n) # Can't both be quantized
sum = lib.ggml_add(ctx, a, b) # all zeroes for now. Will be quantized too!

gf = ffi.new('struct ggml_cgraph*')
lib.ggml_build_forward_expand(gf, sum)

copy(np.array([i for i in range(n)], np.float32), a)
copy(np.array([i*100 for i in range(n)], np.float32), b)

lib.ggml_graph_compute_with_ctx(ctx, gf, n_threads)

print(numpy(a, allow_copy=True))
#  0.    1.0439453   2.0878906   3.131836    4.1757812   5.2197266   ...
print(numpy(b))
#  0.  100.        200.        300.        400.        500.         ...
print(numpy(sum, allow_copy=True))
#  0.  105.4375    210.875     316.3125    421.75      527.1875     ...
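Note that the values read back from a and sum deviate slightly from the exact inputs: Q5_K storage is lossy, so copying float32 data in and out round-trips through quantization. As a rough illustration of that effect in plain NumPy (this is a crude uniform quantizer, not ggml's actual block-wise K-quant scheme):

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int = 5) -> np.ndarray:
    """Round-trip x through a naive uniform quantizer.

    Illustration only: ggml's Q5_K uses per-block scales and minima,
    so its error is smaller than this toy scheme's."""
    levels = 2 ** bits - 1
    max_abs = np.abs(x).max()
    scale = max_abs / levels if max_abs > 0 else 1.0
    return (np.round(x / scale) * scale).astype(x.dtype)

x = np.arange(256, dtype=np.float32)
xq = fake_quantize(x)
print(np.abs(xq - x).max())  # small but non-zero reconstruction error
```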

Prerequisites

You'll need a shared library of ggml to use the bindings.

Build libggml_shared.so or libllama.so

As of this writing, the simplest option is to use ggerganov/llama.cpp's generated libggml_shared.so or libllama.so, which you can build as follows:

git clone https://github.com/ggerganov/llama.cpp
# On a CUDA-enabled system add -DLLAMA_CUBLAS=1
# On a Mac add -DLLAMA_METAL=1
cmake llama.cpp \
  -B llama_build \
  -DCMAKE_C_FLAGS=-Ofast \
  -DLLAMA_NATIVE=1 \
  -DLLAMA_LTO=1 \
  -DBUILD_SHARED_LIBS=1 \
  -DLLAMA_MPI=1 \
  -DLLAMA_BUILD_TESTS=0 \
  -DLLAMA_BUILD_EXAMPLES=0
( cd llama_build && make -j )

# On Mac, this will be libggml_shared.dylib instead
export GGML_LIBRARY=$PWD/llama_build/libggml_shared.so
# Alternatively, you can just copy it to your system's lib dir, e.g /usr/local/lib
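In other words, the library is resolved from the GGML_LIBRARY environment variable first, falling back to the system loader's search path ([DY]LD_LIBRARY_PATH, ldconfig, etc.). A hypothetical helper sketching that resolution order (the names and exact fallback logic here are illustrative, not the bindings' actual internals):

```python
import ctypes.util
import os
from typing import Optional

def find_ggml_library() -> Optional[str]:
    """Resolve the ggml shared library path.

    An explicit GGML_LIBRARY setting wins; otherwise fall back to the
    system loader's search path via ctypes.util.find_library."""
    explicit = os.environ.get("GGML_LIBRARY")
    if explicit:
        if not os.path.isfile(explicit):
            raise FileNotFoundError(
                f"GGML_LIBRARY points to {explicit}, which does not exist")
        return explicit
    # find_library consults [DY]LD_LIBRARY_PATH / ldconfig depending on the OS;
    # returns None if neither library can be found.
    return (ctypes.util.find_library("ggml_shared")
            or ctypes.util.find_library("llama"))
```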

(Optional) Regenerate the bindings and stubs

If you added or changed any signatures of the C API, you'll want to regenerate the bindings (ggml/cffi.py) and stubs (ggml/__init__.pyi).

Luckily it's a one-liner using regenerate.py:

pip install -q cffi

python regenerate.py

By default it assumes llama.cpp was cloned in ../../../llama.cpp (alongside the ggml folder). You can override this with:

C_INCLUDE_DIR=$LLAMA_CPP_DIR python regenerate.py

You can also edit api.h to control which files should be included in the generated bindings (defaults to llama.cpp/ggml*.h)
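Conceptually, api.h is just a list of #include directives that regenerate.py preprocesses and feeds to cffi. A minimal version restricted to the core header might look like the following (a sketch; check the shipped api.h for the exact defaults and any extra headers such as the Metal or CUDA ones):

```c
// api.h -- headers to expose in the generated bindings (illustrative sketch).
// regenerate.py runs the preprocessor over this file and hands the
// resulting declarations to cffi.
#include "ggml.h"
```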

In fact, if you only wanted to generate bindings for the current version of the ggml repo itself (rather than llama.cpp's; you'd lose support for k-quants), you could run:

API=../../include/ggml/ggml.h python regenerate.py

Develop

Run tests:

pytest

Alternatives

This example's goal is to showcase cffi-generated bindings that are trivial to use and update, but there are already alternatives in the wild:

  • https://github.com/abetlen/ggml-python: these bindings appear to be hand-written and use ctypes. They come with high-quality API reference docs that can also be used alongside this example's bindings, but they don't expose the Metal, CUDA, MPI or OpenCL calls, don't support transparent (de/re)quantization the way this example does (see the ggml.utils module), and won't pick up your local changes.

  • https://github.com/abetlen/llama-cpp-python: these expose the C++ llama.cpp interface, which this example cannot easily be extended to support (cffi only generates bindings for C libraries).

  • pybind11 and nanobind are two alternatives to cffi that support binding C++ libraries, but neither of them seems to have an automatic generator (writing bindings by hand is rather time-consuming).