llama.cpp Support & LocateAnything-3B-GGUF (unofficial)

#12
by yuuko-eth - opened

First of all, big thanks to NVlabs for bringing such a banger release. Screen grounding ain't something easy!

To use it on my devices, though, I needed to quantise it, so I thought I'd share the llama.cpp fork that adds support for this model, along with the quant ladder uploaded here under the NVIDIA license. Working from Issue 24020, which hinted at the Kimi 2.5 ref, I sketched and tested support for this model in llama.cpp with Claude, and ran layer-by-layer verification as well (inspired by richiejp). Only the LLM side is quantised; the mmproj stays in BF16. Please keep in mind that this patch tries to be minimally invasive to the codebase, so to preserve the coords tokens the server should be launched with --special. No PBD yet.
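For anyone wanting a starting point, here is a minimal launch sketch with --special set. The file names in the usage comment are placeholders (not the actual names in the upload), and the LLAMA_SERVER override is just a convenience I made up so the binary path can be swapped; the port and -ngl value match the ones used elsewhere in this post.

```shell
# start_server MODEL_GGUF MMPROJ_GGUF PORT
# Starts llama-server with the quantised LLM plus the BF16 mmproj.
# --special keeps the coords tokens in the generated output.
# LLAMA_SERVER is a hypothetical override for the binary path (defaults to llama-server on PATH).
start_server() {
  "${LLAMA_SERVER:-llama-server}" -m "$1" --mmproj "$2" --port "$3" -ngl 99 --special
}

# Usage (placeholder file names, adjust to the quant you downloaded):
#   start_server ./model-Q4_K_M.gguf ./mmproj-BF16.gguf 8095
```

Drop -ngl 99 or lower it if the model does not fit fully on your GPU.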

Tested against my scripts for grounding Windows and macOS screens, and the setup is working beautifully. E2E testing compared the original PyTorch implementation against mtmd-cli, with the Q4_K_M variant still within ±1px. macOS sample image and accuracy testing results below:

macos
I also downscaled the input to see whether that affects accuracy. A useful matrix of the divergence vs. BF16:
Screenshot from 2026-06-04 15-41-27
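If you want to repeat the ±1px comparison on your own screenshots, a check like the one below is enough. It is only a sketch: within_tol is a hypothetical helper (not part of the fork), and it assumes you have already pulled integer pixel coordinates out of both the reference and the quantised output.

```shell
# within_tol X_REF Y_REF X_TEST Y_TEST [TOL]
# Succeeds (exit 0) when both axes differ by at most TOL pixels (default 1).
# Runs in a subshell so the temporary variables do not leak into the caller.
within_tol() (
  tol=${5:-1}
  dx=$(( $1 - $3 ))
  dy=$(( $2 - $4 ))
  # Take absolute values of the per-axis differences.
  [ "$dx" -lt 0 ] && dx=$(( 0 - dx ))
  [ "$dy" -lt 0 ] && dy=$(( 0 - dy ))
  [ "$dx" -le "$tol" ] && [ "$dy" -le "$tol" ]
)

# Usage:
#   within_tol 512 300 513 299 && echo "within 1px"
```

The per-axis comparison is a design choice; swap in a Euclidean distance if that is what your accuracy metric uses.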

Since I have yet to implement the PBD path, speed is that of a normal autoregressive VLM of this size (CUDA, -ngl 99):

Screenshot from 2026-06-04 15-41-05

Test environments for the E2E run, from building llama.cpp to grounding:

  • Ubuntu 24.04
    • CUDA 13.0
    • RTX 3090
  • Apple Silicon M1~M3

Grounding call template, assuming ./screenshot.jpg:

PORT=8095
PROMPT="Locate the Apple logo."

# The media marker tells llama-server where to splice the image into the prompt.
# Use the LLAMA_MEDIA_MARKER env var if you need to customise it.
# For now, fetch the (randomized) media marker and base64-encode the image to a file.
MARKER=$(curl -s "http://127.0.0.1:$PORT/props" | jq -r .media_marker)
# tr strips line wraps, so this works with both GNU and BSD base64.
base64 < ./screenshot.jpg | tr -d '\n' > /tmp/screenshot.b64

# Assemble request payload
jq -n --arg p "<|im_start|>user
${MARKER}${PROMPT}<|im_end|>
<|im_start|>assistant
" --rawfile img /tmp/screenshot.b64 \
  '{prompt:{prompt_string:$p,multimodal_data:[$img]},n_predict:64,temperature:0}' > /tmp/la_payload.json

# POST to llama-server
curl -s http://127.0.0.1:$PORT/completion -H "Content-Type: application/json" -d @/tmp/la_payload.json
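The /completion endpoint replies with a JSON object whose generated text sits in the content field, so piping the response through jq gives you just the model output. A small sketch; extract_content is a hypothetical helper name, and it deliberately does not try to parse the coords tokens themselves.

```shell
# extract_content: read a /completion JSON response on stdin
# and print only the generated text.
extract_content() {
  jq -r '.content'
}

# Usage with the payload written to /tmp/la_payload.json:
#   curl -s "http://127.0.0.1:$PORT/completion" \
#     -H "Content-Type: application/json" \
#     -d @/tmp/la_payload.json | extract_content
```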

Source repo for port: llama.cpp @ mtmd-grounders/
Followed the TheBloke template to produce the GGUFs.
Ideas and suggestions on how to do the fast mode are welcome! 👍
