# Simple autogenerated Python bindings for ggml
This folder contains:
- Scripts to generate full Python bindings from ggml headers (+ stubs for autocompletion in IDEs)
- Some barebones utils (see [ggml/utils.py](./ggml/utils.py)):
  - `ggml.utils.init` builds a context that's freed automatically when the pointer gets GC'd (see the sketch just after this list)
  - `ggml.utils.copy` **copies between same-shaped tensors (numpy or ggml), w/ automatic (de/re)quantization**
  - `ggml.utils.numpy` returns a numpy view over a ggml tensor; if it's quantized, it returns a copy (requires `allow_copy=True`)
- Very basic examples (anyone want to port [llama2.c](https://github.com/karpathy/llama2.c)?)
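Under the hood, `ggml.utils.init` can lean on cffi's `ffi.gc` to tie the context's lifetime to Python's garbage collector. Here's a minimal sketch of the idea (the actual implementation in [ggml/utils.py](./ggml/utils.py) may differ in details):
```python
# Minimal sketch, not the actual ggml/utils.py code: ffi.gc attaches
# lib.ggml_free as a destructor that runs when the handle gets GC'd.
from ggml import lib, ffi

def init(mem_size: int):
    params = ffi.new('struct ggml_init_params*')
    params.mem_size = mem_size
    params.mem_buffer = ffi.NULL
    params.no_alloc = False
    # ggml_init takes the params struct by value, hence params[0]
    return ffi.gc(lib.ggml_init(params[0]), lib.ggml_free)
```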
Provided you set `GGML_LIBRARY=.../path/to/libggml_shared.so` (see instructions below), it's trivial to do some operations on quantized tensors:
```python
# Make sure libggml_shared.so (or libllama.so) is in your [DY]LD_LIBRARY_PATH,
# or set GGML_LIBRARY to its full path
from ggml import lib, ffi
from ggml.utils import init, copy, numpy
import numpy as np
ctx = init(mem_size=12*1024*1024)
n = 256
n_threads = 4
a = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_Q5_K, n)
b = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_F32, n) # Can't both be quantized
sum = lib.ggml_add(ctx, a, b) # all zeroes for now. Will be quantized too!
gf = ffi.new('struct ggml_cgraph*')
lib.ggml_build_forward_expand(gf, sum)
copy(np.arange(n, dtype=np.float32), a)        # quantized on the fly to Q5_K
copy(np.arange(n, dtype=np.float32) * 100, b)  # stays f32
lib.ggml_graph_compute_with_ctx(ctx, gf, n_threads)
print(numpy(a, allow_copy=True))
# 0. 1.0439453 2.0878906 3.131836 4.1757812 5.2197266. ...
print(numpy(b))
# 0. 100. 200. 300. 400. 500. ...
print(numpy(sum, allow_copy=True))
# 0. 105.4375 210.875 316.3125 421.75 527.1875 ...
```
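Note that `a` reads back only approximately as 0, 1, 2, … because Q5_K is a lossy, block-wise quantization, while `numpy(b)` is a zero-copy view over the f32 tensor, so in-place numpy writes are visible from ggml. Here's a short, hypothetical continuation of the example above (same variables in scope) illustrating both points:
```python
# Continuation of the example above (reuses ctx, a, b, n and the imports).
expected = np.arange(n, dtype=np.float32)
actual = numpy(a, allow_copy=True)       # dequantized copy of a
print(np.abs(actual - expected).max())   # small but non-zero: Q5_K is lossy

view = numpy(b)                          # f32 tensor -> zero-copy view
view[0] = 42.0                           # writes through to b's data
print(numpy(b)[0])                       # 42.0
```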
### Prerequisites
You'll need a shared library of ggml to use the bindings.
#### Build libggml_shared.so or libllama.so
As of this writing, the best option is to use [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)'s generated `libggml_shared.so` or `libllama.so`, which you can build as follows:
```bash
git clone https://github.com/ggerganov/llama.cpp
# On a CUDA-enabled system add -DLLAMA_CUBLAS=1
# On a Mac add -DLLAMA_METAL=1
cmake llama.cpp \
    -B llama_build \
    -DCMAKE_C_FLAGS=-Ofast \
    -DLLAMA_NATIVE=1 \
    -DLLAMA_LTO=1 \
    -DBUILD_SHARED_LIBS=1 \
    -DLLAMA_MPI=1 \
    -DLLAMA_BUILD_TESTS=0 \
    -DLLAMA_BUILD_EXAMPLES=0
( cd llama_build && make -j )
# On Mac, this will be libggml_shared.dylib instead
export GGML_LIBRARY=$PWD/llama_build/libggml_shared.so
# Alternatively, you can just copy it to your system's lib dir, e.g. /usr/local/lib
```
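To quickly confirm the library can be found and loaded, here's a hypothetical sanity check (any exported symbol would do; `ggml_type_name` is just a convenient one):
```python
# Sanity check: if GGML_LIBRARY (or [DY]LD_LIBRARY_PATH) is set correctly,
# importing the bindings and calling into the shared library should work.
from ggml import lib, ffi
print(ffi.string(lib.ggml_type_name(lib.GGML_TYPE_Q5_K)))  # b'q5_K'
```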
#### (Optional) Regenerate the bindings and stubs
If you added or changed any signatures of the C API, you'll want to regenerate the bindings ([ggml/cffi.py](./ggml/cffi.py)) and stubs ([ggml/__init__.pyi](./ggml/__init__.pyi)).
Luckily it's a one-liner using [regenerate.py](./regenerate.py):
```bash
pip install -q cffi
python regenerate.py
```
By default it assumes `llama.cpp` was cloned in ../../../llama.cpp (alongside the ggml folder). You can override this with:
```bash
C_INCLUDE_DIR=$LLAMA_CPP_DIR python regenerate.py
```
You can also edit [api.h](./api.h) to control which files are included in the generated bindings (defaults to `llama.cpp/ggml*.h`).
If you wanted to generate bindings only for the current version of the `ggml` repo itself (instead of `llama.cpp`; you'd lose support for k-quants), you could run:
```bash
API=../../include/ggml/ggml.h python regenerate.py
```
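Since cffi's `lib` object raises `AttributeError` for symbols that weren't generated, a hypothetical quick way to check whether k-quants made it into your bindings:
```python
# With bindings generated from the base ggml headers, k-quant types such as
# Q5_K won't be present; hasattr on the cffi lib object is a quick check.
from ggml import lib
print(hasattr(lib, 'GGML_TYPE_Q5_K'))  # False if k-quants weren't included
```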
## Develop
Run tests:
```bash
pytest
```
### Alternatives
This example's goal is to showcase [cffi](https://cffi.readthedocs.io/)-generated bindings that are trivial to use and update, but there are already alternatives in the wild:
- https://github.com/abetlen/ggml-python: these bindings seem to be hand-written and use [ctypes](https://docs.python.org/3/library/ctypes.html). It has [high-quality API reference docs](https://ggml-python.readthedocs.io/en/latest/api-reference/#ggml.ggml) that can be used with these bindings too, but it doesn't expose Metal, CUDA, MPI or OpenCL calls, doesn't support transparent (de/re)quantization like this example does (see [ggml.utils](./ggml/utils.py) module), and won't pick up your local changes.
- https://github.com/abetlen/llama-cpp-python: these expose the C++ `llama.cpp` interface, which this example cannot easily be extended to support (`cffi` only generates bindings for C libraries).
- [pybind11](https://github.com/pybind/pybind11) and [nanobind](https://github.com/wjakob/nanobind) are two alternatives to cffi that support binding C++ libraries, but neither of them seems to have an automatic generator (writing bindings by hand is rather time-consuming).