locate-anything.cpp - GGUF

GGUF builds of nvidia/LocateAnything-3B for locate-anything.cpp - a C++/ggml inference engine for open-vocabulary detection / visual grounding, no Python at inference time.

Brought to you by the LocalAI team.

Detections match the official PyTorch implementation (the engine is parity-gated against it), and inference is faster on both CPU and GPU.

Files

File	Bits (LM)	Size	Notes
`locate-anything-f16.gguf`	f16	~9.2 GB	LM matmuls in f16, everything else f32
`locate-anything-q8_0.gguf`	q8_0	~6.3 GB	near-lossless; box-identical to f32 - recommended
`locate-anything-q6_k.gguf`	q6_k	~5.5 GB	box-identical to f32
`locate-anything-q5_k.gguf`	q5_k	~5.1 GB	sub-pixel box drift
`locate-anything-q4_k.gguf`	q4_k	~4.7 GB	smallest; sub-pixel box drift

The full-precision f32 GGUF (~15 GB) is reproducible from the HF weights with scripts/convert_locateanything_to_gguf.py in the repo.
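
As a rough aid for choosing among the files above, the sketch below picks the highest-precision build whose on-disk size fits a given budget. It assumes file size is the constraint that matters; runtime memory use will differ and is not stated here. The function name `pick_build` is hypothetical, and the sizes are the approximate values from the table.

```python
# Approximate on-disk sizes in GB (from the Files table),
# ordered from highest to lowest LM precision.
BUILDS = [
    ("locate-anything-f16.gguf", 9.2),
    ("locate-anything-q8_0.gguf", 6.3),
    ("locate-anything-q6_k.gguf", 5.5),
    ("locate-anything-q5_k.gguf", 5.1),
    ("locate-anything-q4_k.gguf", 4.7),
]


def pick_build(budget_gb, builds=BUILDS):
    """Return the first (highest-precision) file that fits the budget, or None."""
    for filename, size_gb in builds:
        if size_gb <= budget_gb:
            return filename
    return None


if __name__ == "__main__":
    print(pick_build(8.0))
```

With no tight size constraint, the table's own recommendation (q8_0) still applies; this helper only answers "what is the most precise file that fits".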

Performance

These builds produce the same detections as the official model in less time. Full methodology, the warm/median setup, parity checks, and more images are in the repo's benchmarks/BENCHMARK.md.

Quantization (CPU, Ryzen 9 9950X3D)

Slow-mode inference on the 448 fixture. The "vs official f32" column is the official PyTorch f32 time (23.65 s) divided by each build's inference time. Only the Qwen2 LM matmuls are quantized, so box parity is preserved through q6_k:

dtype	size	infer	vs official f32	boxes
f16	9.15 GB	13.68 s	1.7×	identical
q8_0	6.26 GB	6.07 s	3.9×	identical
q6_k	5.51 GB	5.77 s	4.1×	identical
q5_k	5.10 GB	5.11 s	4.6×	sub-pixel
q4_k	4.72 GB	4.29 s	5.5×	sub-pixel
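
The speedup column can be reproduced from the timings in the table. The snippet below is only a worked version of that arithmetic: the numbers are copied from the table, and the names `OFFICIAL_F32_S`, `INFER_S`, and `speedup` are illustrative.

```python
# Official PyTorch f32 time on the same fixture, in seconds (from the text above).
OFFICIAL_F32_S = 23.65

# Per-build inference times in seconds (from the table above).
INFER_S = {
    "f16": 13.68,
    "q8_0": 6.07,
    "q6_k": 5.77,
    "q5_k": 5.11,
    "q4_k": 4.29,
}


def speedup(infer_s, baseline_s=OFFICIAL_F32_S):
    """Speedup factor: baseline time divided by this build's time."""
    return baseline_s / infer_s


if __name__ == "__main__":
    for dtype, seconds in INFER_S.items():
        print(f"{dtype}: {speedup(seconds):.1f}x")
```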

GPU (NVIDIA GB10, vs the official bf16 model)

Benchmarked against the official model, run exactly as its model card documents (bf16, greedy decoding), on one GB10 GPU. Precision-matched (our f16 vs its bf16), ours is ~1.7× faster; the recommended q8_0 build (box-identical) is ~1.9-2.1× faster.

Quantization policy

Only the Qwen2 language-model matmuls (attn_{q,k,v,o}, ffn_{gate,up,down}, lm.output) are quantized. The MoonViT vision tower, the projector, all norms and biases, and the two host-read f32 tensors (lm.tok_embd, vit.pos_emb) stay f32 - so the parity-sensitive vision path is untouched. q8_0/q6_k are box-identical; lower bit-widths trade a little box precision for size.
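
The policy above amounts to a per-tensor yes/no decision. The sketch below illustrates that decision only; it is not the converter's code. It assumes a hypothetical dotted naming scheme (`lm.blk.<n>.attn_q.weight`, `lm.output.weight`, `vit....`), which may not match the names actually stored in the GGUF files.

```python
# Tensor kinds the policy quantizes. Names follow the list in the text above;
# the full dotted layout used below is an assumption for illustration.
LM_MATMULS = {
    "attn_q", "attn_k", "attn_v", "attn_o",
    "ffn_gate", "ffn_up", "ffn_down",
    "output",
}


def is_quantized(name):
    """True if a tensor would be quantized under the stated policy.

    Assumed layout: "<tower>.<...>.<kind>.<weight|bias>".
    Only language-model ("lm") matmul weights qualify; the vision tower,
    projector, norms, biases, and embeddings stay f32.
    """
    parts = name.split(".")
    if len(parts) < 3:
        return False  # e.g. "vit.pos_emb" stays f32
    if parts[0] != "lm" or parts[-1] != "weight":
        return False  # non-LM tensors and all biases stay f32
    return parts[-2] in LM_MATMULS


if __name__ == "__main__":
    for n in ("lm.blk.0.attn_q.weight", "lm.blk.0.attn_q.bias",
              "vit.blk.0.attn_q.weight", "lm.tok_embd.weight"):
        print(n, is_quantized(n))
```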

Usage

# build the CLI (see the repo README), then:
locate-anything-cli detect \
    --model locate-anything-q8_0.gguf \
    --input image.jpg \
    --prompt "Locate all the instances that matches the following description: person</c>car." \
    --annotated out.png
# -> {"detections":[{"label":"person","box":[...]}, ...]}  + an annotated PNG

Decode modes: --mode hybrid (default), slow, fast. GPU: build with -DLA_GGML_CUDA=ON and run with LA_DEVICE= (auto-GPU). Separate categories in the prompt with </c>.
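
For scripting around the CLI, a small wrapper can assemble the prompt and arguments and tally the results. This is a sketch under assumptions: the prompt wording and flags are copied from the usage example above, the helper names are hypothetical, and it assumes the JSON shown in the example is what the CLI prints to standard output. The contents of `box` are not interpreted, since their format is not described here.

```python
import json
from collections import Counter

# Prompt prefix copied verbatim from the usage example above.
PROMPT_PREFIX = "Locate all the instances that matches the following description: "


def build_prompt(categories):
    """Join category names with the </c> separator and close with a period."""
    return PROMPT_PREFIX + "</c>".join(categories) + "."


def build_command(model, image, categories, annotated=None, mode=None):
    """Build an argv list using only the flags shown in the usage section."""
    cmd = [
        "locate-anything-cli", "detect",
        "--model", model,
        "--input", image,
        "--prompt", build_prompt(categories),
    ]
    if annotated is not None:
        cmd += ["--annotated", annotated]
    if mode is not None:
        cmd += ["--mode", mode]  # hybrid (default), slow, or fast
    return cmd


def count_labels(output_json):
    """Count detections per label in the CLI's JSON output (assumed schema)."""
    data = json.loads(output_json)
    return Counter(d["label"] for d in data["detections"])


if __name__ == "__main__":
    print(build_command("locate-anything-q8_0.gguf", "image.jpg",
                        ["person", "car"], annotated="out.png"))
    # Placeholder output; the box values here are made up for illustration.
    sample = '{"detections":[{"label":"person","box":[0,0,1,1]}]}'
    print(count_labels(sample))
```

The argv list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`; passing a list rather than a shell string avoids quoting problems with the `</c>` separator.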

License

The model weights are NVIDIA's, distributed under NVIDIA's license; this repository redistributes them in GGUF form for use with locate-anything.cpp (MIT).