llama.cpp Support & LocateAnything-3B-GGUF (unofficial)
First of all, big thanks to NVlabs for bringing such a banger release. Screen grounding ain't something easy!
To use it on my devices, though, I needed to quantise it, so I thought I'd share the llama.cpp fork that adds support for this model, as well as the quant ladder uploaded here under the NVIDIA license. Going by the discussion in Issue 24020, which hinted at a Kimi 2.5 reference, I sketched and tested support for this model in llama.cpp with Claude. I also ran layer-by-layer verification (inspired by richiejp). Only the LLM side is quantised; the mmproj stays in BF16. Please keep in mind that this patch tries to be minimally invasive to the codebase, so to preserve the coords tokens the server should be launched with --special. No PBD yet.
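For anyone scripting the launch, here is a rough sketch of the server command as an argv list. The two GGUF file names are placeholders I made up for illustration (use whatever files you downloaded); the port and -ngl value are just the ones used elsewhere in this post.

```python
import shlex


def llama_server_argv(model_path, mmproj_path, port=8095, gpu_layers=99):
    """Build an argv list for llama-server. Paths are caller-supplied placeholders."""
    return [
        "llama-server",
        "-m", model_path,          # quantised LLM weights (GGUF)
        "--mmproj", mmproj_path,   # vision projector, kept in BF16
        "--special",               # print special tokens so the coords tokens survive
        "--port", str(port),
        "-ngl", str(gpu_layers),   # number of layers to offload to the GPU
    ]


# Hypothetical file names, for illustration only.
argv = llama_server_argv("LocateAnything-3B-Q4_K_M.gguf", "mmproj-BF16.gguf")
print(shlex.join(argv))
```

Pass the list to subprocess.Popen(argv) if you want to start the server from Python rather than a shell.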
I tested it against my scripts for grounding Windows and macOS screens, and the setup is working beautifully. E2E testing compared the original PyTorch model against mtmd-cli, with the Q4_K_M variant still within ±1px. macOS sample image and accuracy testing results below:

I also downscaled the input to see whether that affects accuracy. A useful matrix for the divergence vs. BF16:
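To make the "within ±1px" check concrete, this is roughly how such a divergence can be computed. It is only a sketch: it assumes the coordinates have already been parsed into (x, y) pixel pairs in the same resolution, and the helper name and sample points are made up, not measured results.

```python
def max_pixel_divergence(reference, candidate):
    """Largest per-axis absolute difference between paired (x, y) points."""
    if len(reference) != len(candidate):
        raise ValueError("point lists must have the same length")
    worst = 0
    for (rx, ry), (cx, cy) in zip(reference, candidate):
        worst = max(worst, abs(rx - cx), abs(ry - cy))
    return worst


# Made-up points purely to show the call; these are not test results.
bf16_points = [(120, 48), (900, 615)]
quant_points = [(121, 48), (900, 614)]
print(max_pixel_divergence(bf16_points, quant_points) <= 1)
```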
Since I have yet to implement the PBD path, speed is that of a normal autoregressive VLM at this size (CUDA, -ngl 99):
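If you want to reproduce a decode-speed number yourself, a small helper like the one below can pull it out of a /completion response. This assumes the response carries a timings object with predicted_per_second (or predicted_n and predicted_ms), which is what llama-server builds I have seen return; treat the field names as an assumption and check your own response.

```python
def tokens_per_second(response):
    """Return decode speed from a /completion response dict, or None if absent."""
    timings = response.get("timings", {})
    rate = timings.get("predicted_per_second")
    if rate is not None:
        return float(rate)
    # Fall back to computing the rate from token count and elapsed milliseconds.
    n = timings.get("predicted_n")
    ms = timings.get("predicted_ms")
    if n and ms:
        return n / (ms / 1000.0)
    return None


# Made-up numbers purely to show the call; not a benchmark result.
print(tokens_per_second({"timings": {"predicted_n": 64, "predicted_ms": 2000}}))
```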
Test environments for the E2E run, from building llama.cpp to grounding:
- Ubuntu 24.04
- CUDA 13.0
- RTX 3090
- Apple Silicon M1~M3
Grounding call template, assuming ./screenshot.jpg:
PORT=8095
PROMPT="Locate the Apple logo."
# The media marker tells llama-server where to splice the image into the prompt.
# Set the LLAMA_MEDIA_MARKER env var if you want a custom marker.
# For now, fetch the (randomized) marker from the server and base64-encode the image to a file.
MARKER=$(curl -s http://127.0.0.1:$PORT/props | jq -r .media_marker)
# -w0 (GNU coreutils) disables line wrapping; other base64 builds may need a different flag.
base64 -w0 ./screenshot.jpg > /tmp/screenshot.b64
# Assemble request payload
jq -n --arg p "<|im_start|>user
${MARKER}${PROMPT}<|im_end|>
<|im_start|>assistant
" --rawfile img /tmp/screenshot.b64 \
'{prompt:{prompt_string:$p,multimodal_data:[$img]},n_predict:64,temperature:0}' > /tmp/la_payload.json
# POST to llama-server
curl -s http://127.0.0.1:$PORT/completion -H "Content-Type: application/json" -d @/tmp/la_payload.json
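For those who would rather call it from Python, here is a rough stdlib-only equivalent of the template above. It mirrors the same /props and /completion calls and the same payload fields; the function names are my own, and reading the result from a content field is an assumption about the response shape.

```python
import base64
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8095"  # same port as the shell template


def build_payload(marker, prompt, image_bytes, n_predict=64):
    """Assemble the /completion payload with the same fields as the jq command."""
    prompt_string = (
        "<|im_start|>user\n"
        f"{marker}{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    return {
        "prompt": {
            "prompt_string": prompt_string,
            "multimodal_data": [base64.b64encode(image_bytes).decode("ascii")],
        },
        "n_predict": n_predict,
        "temperature": 0,
    }


def fetch_marker(base_url=BASE_URL):
    """Ask the running server for its (randomized) media marker."""
    with urllib.request.urlopen(f"{base_url}/props") as resp:
        return json.load(resp)["media_marker"]


def ground(image_path, prompt, base_url=BASE_URL):
    """POST one grounding request and return the decoded JSON response."""
    with open(image_path, "rb") as f:
        payload = build_payload(fetch_marker(base_url), prompt, f.read())
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, ground("./screenshot.jpg", "Locate the Apple logo.").get("content") should give the generated text, coords tokens included if the server was started with --special.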
Source repo for port: llama.cpp @ mtmd-grounders/
The GGUFs were produced following TheBloke's template.
Ideas and suggestions on how to do the fast mode are welcome!
