Chimera.APE — v0.1.0-alpha (single-file build)
Three queries, one binary, zero regrets. Runs on anything, answers to no one.
One Actually Portable Executable. One download. Everything inside.
chimera-full.ape (~7.5 GB) bundles, in a single self-contained file that
runs unmodified on Linux / macOS / Windows / BSD:
- the orchestrator (C++/Cosmopolitan),
- a llamafile + Gemma 4 12B QAT q4_0 (embeddings and chat from one server),
- the multimodal projector (image + audio understanding),
- QLever (SPARQL knowledge graph + BM25 text index), and
- TurboVec (quantized approximate-nearest-neighbor vector search).
Point it at a directory of files (text, code, images, audio) and it digests everything into a hybrid graph-vector database. Ask a question and it returns a synthesized answer with citations whose provenance is checksum-verified. No network, no sidecar downloads, no runtime dependencies.
GitHub (source, smaller organ-only build, full docs): https://github.com/SEBK4C/Chimera.APE
Quick start
# Download this one file (no weights to fetch separately — they're inside):
hf download SEBK4C/Chimera.APE chimera-full.ape --local-dir .
chmod +x chimera-full.ape
# Ingest a directory. First run unpacks the embedded organs + weights into
# <dir>/.chimera/runtime/ (one-time, a few GB):
./chimera-full.ape ingest ~/notes
# Ask:
./chimera-full.ape --search "what did we decide about the billing rewrite?" \
--db ~/notes/.chimera
Maria Chen leads Project Phoenix [1]. It is a rewrite of the billing system [1].
Sources:
[1] phoenix.md#1 ✓ verified
✓ verified means the cited file is byte-identical to what was ingested;
⚠ drifted / ⚠ missing tell you when it isn't. Citations are promises the
checksum keeps.
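One way to picture the three citation states is as a comparison between a digest recorded at ingest and the file's digest at answer time. The sketch below is illustrative only: the choice of SHA-256, the function names `digest_of` and `citation_status`, and the returned strings are assumptions made for this example, not Chimera's actual implementation or on-disk format.

```python
import hashlib
from pathlib import Path


def digest_of(path: Path) -> str:
    """Return the hex SHA-256 of a file, read in 1 MiB chunks to bound memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def citation_status(path: Path, recorded_digest: str) -> str:
    """Classify a cited file against the digest recorded at ingest.

    Hypothetical helper: returns "missing" if the file is gone, "verified" if
    its bytes still hash to the recorded digest, and "drifted" otherwise.
    """
    if not path.is_file():
        return "missing"
    return "verified" if digest_of(path) == recorded_digest else "drifted"
```

Under these assumptions, editing a single byte of phoenix.md after ingest would flip its citation from verified to drifted, and deleting the file would report it as missing.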
GPU (NVIDIA / Metal) — interactive ingest & search
CPU works everywhere but is slow (~7 tok/s — minutes per document). On a GPU, ingest and search become interactive. The orchestrator passes offload flags straight through to the embedded llamafile:
./chimera-full.ape ingest ~/notes --gpu auto # offload all layers (the default; falls back to CPU if no GPU)
./chimera-full.ape ingest ~/notes --gpu nvidia # pin the CUDA backend
./chimera-full.ape ingest ~/notes --gpu 24 # partial offload, N layers (small VRAM)
./chimera-full.ape ingest ~/notes --gpu off # force CPU
./chimera-full.ape --search "..." --db ... --gpu auto
| --gpu | llamafile flags | meaning |
|---|---|---|
| auto (default) | -ngl 999 | offload all layers; falls back to CPU if no GPU |
| off / disable | --gpu disable | force CPU |
| integer N | -ngl N | offload N layers (VRAM-limited cards) |
| nvidia/amd/apple | --gpu <vendor> -ngl 999 | pin the backend vendor |
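The mapping in the table above can be read as a small pure function. This is a sketch of the table only, not the orchestrator's code (which is C++): the name `llamafile_gpu_flags` and the error behavior for unrecognized values are assumptions.

```python
def llamafile_gpu_flags(value: str = "auto") -> list[str]:
    """Translate a --gpu value into llamafile arguments, per the table above.

    Hypothetical helper for illustration; how the real orchestrator handles
    unknown values is not documented here, so this sketch simply raises.
    """
    v = value.strip().lower()
    if v == "auto":
        return ["-ngl", "999"]            # offload all layers
    if v in ("off", "disable"):
        return ["--gpu", "disable"]       # force CPU
    if v in ("nvidia", "amd", "apple"):
        return ["--gpu", v, "-ngl", "999"]  # pin the backend vendor
    if v.isascii() and v.isdigit():
        return ["-ngl", str(int(v))]      # partial offload of N layers
    raise ValueError(f"unsupported --gpu value: {value!r}")
```

For example, `llamafile_gpu_flags("24")` yields `["-ngl", "24"]`, matching the partial-offload row.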
CUDA prereqs: a working NVIDIA driver is enough (llamafile ships a prebuilt
tinyBLAS path); with the CUDA toolkit (nvcc on PATH) it JITs an optimized
ggml-cuda module once and caches it under ~/.llamafile/. The first GPU run
logs the device(s) and throughput to <db>/.chimera/logs/llamafile.log.
Verified on this build: 2× NVIDIA RTX 4090 (driver 580 / CUDA 12.8) —
--gpu auto offloads Gemma 4 12B across both cards and runs ingest + search
end-to-end with ✓ verified citations at ~90 tok/s generation (vs ~7 tok/s on
CPU). Multimodal embeddings run on GPU too: image and audio embed natively
as the model's end hidden state over the projector+interleave forward pass
(LAST pooling), in the same 3840-d space as text — so --search-file
(image→image, audio→audio) works on GPU. See docs/GEMMA4-EMBEDDINGS.md and
docs/GPU.md.
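To make the pooling term concrete, here is a toy comparison of LAST pooling with mean pooling over a sequence of per-position hidden states. It is a sketch under stated assumptions: the 4-d vectors stand in for the real 3840-d states, the function names are hypothetical, and nothing here reflects how the embedded llamafile computes them internally.

```python
from typing import Sequence


def last_pool(hidden_states: Sequence[Sequence[float]]) -> list[float]:
    """LAST pooling: the embedding is the final position's hidden state."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    return list(hidden_states[-1])


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> list[float]:
    """Mean pooling: average each dimension over all positions (for contrast)."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


# Three positions, four dimensions (toy stand-in for 3840).
states = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [5.0, 4.0, 1.0, 2.0],
]
print(last_pool(states))  # [5.0, 4.0, 1.0, 2.0]
print(mean_pool(states))  # [3.0, 2.0, 1.0, 2.0]
```

Both results have the same dimensionality but are different vectors, so embeddings produced under one pooling scheme cannot be compared meaningfully with embeddings produced under the other.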
Images and audio
PNG/JPEG/WAV/MP3 are first-class documents. At ingest the model transcribes legible text or describes the scene/sound, indexes that derived text, and stores the raw media embedding for query-by-example:
./chimera-full.ape --search "the budget figure on the banner" --db ~/notes/.chimera
./chimera-full.ape --search-file query.png --db ~/notes/.chimera
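Query-by-example amounts to embedding the query file and ranking stored media embeddings by similarity. The sketch below uses exact brute-force cosine similarity over a tiny in-memory index. That is an assumption made for clarity: TurboVec performs quantized approximate search, and the names `cosine` and `nearest`, the similarity metric, and the toy 3-d vectors are all hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: Sequence[float],
            index: dict[str, Sequence[float]],
            k: int = 3) -> list[tuple[str, float]]:
    """Return the k document ids most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: document id -> stored media embedding.
index = {
    "banner.png": [1.0, 0.0, 0.0],
    "meeting.wav": [0.0, 1.0, 0.0],
    "chart.png": [0.9, 0.1, 0.0],
}
hits = nearest([1.0, 0.0, 0.0], index, k=2)
print([doc_id for doc_id, _ in hits])  # ['banner.png', 'chart.png']
```

Because text, image, and audio share one embedding space, the same ranking step would apply whichever modality produced the query vector.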
Other commands
./chimera-full.ape status --db DIR/.chimera # counts, dims, index staleness
./chimera-full.ape verify --db ... [--paranoid] # re-checksum the corpus
./chimera-full.ape vacuum --db ... # purge superseded data, rebuild text index
./chimera-full.ape sparql "SELECT ..." --db ... # raw SPARQL into the live graph
Hardware
Runs CPU-only (slow — minutes per document at ingest, ~7 tok/s on a fast
CPU) or on a GPU (--gpu auto, interactive — see above). Needs ≥16 GB RAM
(the model maps ~8 GB) and ~8 GB free disk for the one-time runtime extraction.
Two flavors
| File | Size | Use |
|---|---|---|
| chimera-full.ape (here) | ~7.5 GB | true single file; weights embedded |
| chimera.ape (on GitHub releases) | ~315 MB | organs embedded, weights sidecar via --model |
Known alpha limitations
- Sequential ingest (CPU-bound on CPU hosts); §5 bounded-queue concurrency is designed, not yet wired.
- Incremental ingests don't extend the BM25 text index (vector + graph search unaffected); vacuum rebuilds it.
- Linux x86_64 is the tested platform; turbovec-server carries Linux ABI assumptions inside its APE shell, so other OSes are expected-but-unverified.
- Dense rendered-text OCR has a known upstream vision-pipeline bug; photos/scenes describe well.
- Embeddings use LAST pooling, the end hidden state of Gemma 4 12B's projector+interleave forward pass, for text, image, and audio alike. All three share one 3840-d space, which is what makes native multimodal embedding work on GPU. The embedded llamafile carries the patch that makes this GPU-safe. If you indexed with an earlier (mean-pooled) build, re-ingest; dimensionality (3840) is unchanged.
Built with Cosmopolitan Libc. Gemma 4 weights © Google, Apache 2.0.