Benchmark: locateanything-batch(MTP) vs the llama.cpp & vLLM AR ports
Benchmarking the LocateAnything-3B runtimes: native MTP vs the llama.cpp & vLLM AR ports
There are now three ways to run nvidia/LocateAnything-3B outside the reference snippet:
- llama.cpp: the `yuuko-eth/llama.cpp` `mtmd-grounders` fork + `yuuko-eth/LocateAnything-3B-GGUF` (discussion #12)
- vLLM: `WuNein/LocateAnything-vLLM` (discussion #13)
- native MTP, batched: `locateanything-batch` (discussion #10), which keeps the model's own multi-token-prediction decode but lifts the `batch == 1` lock
Both community ports are excellent for getting the model running, but neither implements MTP — they decode plain autoregressively (speculative_config=None in vLLM; one token/step in llama.cpp). I was curious how much that costs, so I benchmarked all three head-to-head. Sharing the numbers and (more usefully) why they come out the way they do.
TL;DR
End-to-end grounding throughput, one RTX 5070 Ti (16 GB), single clean detection per image, greedy, batch 8, precision-matched (bf16/fp16), each run serially on an otherwise-idle GPU:
| Runtime | decode | precision | img/s | ms/img |
|---|---|---|---|---|
| native MTP (locateanything-batch) | fast-MTP + batched | bf16 | 4.53 | 221 |
| llama.cpp (yuuko-eth fork) | autoregressive | BF16 | 2.61 | 383 |
| vLLM (WuNein) | autoregressive | fp16 | 1.02 | 977 |
MTP is ~1.7× faster than llama.cpp and ~4.4× faster than vLLM end-to-end. All three decode the identical box (<ref>rectangle</ref><box><117><181><460><678></box>), so this is purely a runtime comparison.
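To check that several runtimes really decode the same box, it helps to compare parsed values rather than raw strings. The following is a minimal sketch that assumes the output always follows the `<ref>…</ref><box><a><b><c><d></box>` shape of the single sample quoted above; `parse_boxes` is a hypothetical helper, and the order and scale of the four integers are not stated in this post, so they are kept as raw numbers.

```python
import re

# Pattern inferred from the one sample output quoted in this post (an assumption,
# not a documented grammar): a label in <ref> tags followed by four integers.
_BOX_RE = re.compile(
    r"<ref>(?P<label>.*?)</ref>\s*<box>"
    r"<(?P<c0>\d+)><(?P<c1>\d+)><(?P<c2>\d+)><(?P<c3>\d+)>"
    r"</box>"
)


def parse_boxes(text):
    """Return a list of (label, (c0, c1, c2, c3)) tuples found in model output."""
    detections = []
    for match in _BOX_RE.finditer(text):
        coords = tuple(int(match.group(k)) for k in ("c0", "c1", "c2", "c3"))
        detections.append((match.group("label"), coords))
    return detections


sample = "<ref>rectangle</ref><box><117><181><460><678></box>"
print(parse_boxes(sample))
```

Comparing the parsed lists from two runtimes with `==` then ignores whitespace differences between their raw outputs.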
The surprising part isn't that MTP wins — it's where the time actually goes, which flips some intuitions about these runtimes.
Method (and the traps I hit, in case you benchmark this too)
A few things make a naive comparison misleading; each is worth knowing:
- Precision. The GGUF repo ships Q4_K_M by default, but vLLM runs fp16 and the native path is bf16. Quantization gives llama.cpp a real single-stream decode edge that has nothing to do with the runtime. There's a prebuilt `LocateAnything-3B-BF16.gguf` in the GGUF repo; use it for an apples-to-apples comparison.
- vLLM hides the vision cost. The WuNein port computes the MoonViT vision embeddings client-side and ships `prompt_embeds` to the server, so the server-side latency excludes vision entirely. For a true end-to-end number you must time the vision encode too.
- Output length must match. The model's default `"Locate all the instances that matches the following description: …"` prompt makes the MTP path emit a whole stream of boxes (and, on out-of-domain inputs, degenerate into repeats of hundreds of tokens), while plain AR stops early. That's not a fair latency comparison. Switching to a single-detection prompt, `"Detect the <obj>."`, makes every runtime emit exactly one box and stop at EOS (~9–10 tokens), so the decode work is identical.
Final protocol: 8 single-object images (1024×768, one colored rectangle each), prompt `Detect the rectangle.`, greedy, batch 8 (batch 32 strained the 16 GB card and regressed the MTP path; see caveats).
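For anyone reproducing this, an end-to-end timer might look like the sketch below. It is not the script used for the numbers in this post: `measure_throughput` and `run_batch` are hypothetical names, and taking the best of a few repeats after a warm-up is an assumption about methodology. The important part is that the timed callable covers vision encode, prefill, and decode, and blocks until the GPU is done (for PyTorch, by calling `torch.cuda.synchronize()` before returning).

```python
import time


def measure_throughput(run_batch, images, prompt, warmup=1, repeats=3):
    """Time run_batch(images, prompt) end to end; report img/s and ms/img.

    run_batch is a hypothetical callable supplied by the caller. It must
    include vision encode + prefill + decode and must not return before
    the results are actually ready.
    """
    for _ in range(warmup):
        run_batch(images, prompt)  # discard warm-up runs (caches, lazy init)

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_batch(images, prompt)
        best = min(best, time.perf_counter() - start)

    img_per_s = len(images) / best
    return {"img_per_s": img_per_s, "ms_per_img": 1000.0 / img_per_s}


def _stand_in(images, prompt):
    """Placeholder workload so the sketch runs without a model."""
    time.sleep(0.01)


result = measure_throughput(_stand_in, list(range(8)), "Detect the rectangle.")
print(result)
```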
Where the end-to-end time goes
This is the key finding. For a short grounding output, E2E is dominated by vision-encode + prefill, not decode:
| Runtime | vision encode | prefill + decode (server) | bottleneck |
|---|---|---|---|
| vLLM | ~82% | ~18% | client-side, serial-per-image vision encode |
| llama.cpp | prefill (~1k image tokens) dominates | decode ~9 tok is negligible | prefill |
| native MTP | batched (all images in one flash pass) + batched prefill | — | best-amortized → fastest |
So the runtime that wins is the one that batches the front of the pipeline. The native path packs every image into one MoonViT `extract_feature` call (flash varlen, block-diagonal; bit-identical to per-image but 2.6–3× faster) and does one batched shared-prefix prefill (~3.6× over batch-1). The vLLM port does the opposite: vision runs one image at a time on the client, which is why its E2E is the slowest despite having by far the highest aggregate decode throughput at batch 32 (see next).
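To make "flash varlen, block-diagonal" concrete: the images' patch tokens are concatenated into one long sequence, and attention is restricted so tokens only see other tokens from the same image. Varlen attention kernels generally describe this with cumulative sequence offsets rather than a dense mask. The toy sketch below builds both views in plain Python; the function names are illustrative and this is not the `locateanything-batch` code.

```python
from itertools import accumulate


def cu_seqlens(lengths):
    """Cumulative offsets [0, l0, l0+l1, ...] marking each image's token span."""
    return [0] + list(accumulate(lengths))


def block_diagonal_mask(lengths):
    """mask[i][j] is True only when packed tokens i and j belong to the same image."""
    offsets = cu_seqlens(lengths)
    total = offsets[-1]
    owner = [0] * total
    for image_idx, (lo, hi) in enumerate(zip(offsets, offsets[1:])):
        for token in range(lo, hi):
            owner[token] = image_idx
    return [[owner[i] == owner[j] for j in range(total)] for i in range(total)]


lengths = [2, 3]  # two images with 2 and 3 patch tokens (tiny, for illustration)
print(cu_seqlens(lengths))  # [0, 2, 5]
for row in block_diagonal_mask(lengths):
    print("".join("x" if allowed else "." for allowed in row))
```

Because no token attends across an image boundary, the packed pass computes the same attention as running each image separately, which is consistent with the bit-identical result reported above.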
Raw LM decode capacity (forced 128 tokens, vision excluded)
If you only look at decode tokens/s, you get a very different — and misleading — ranking:
| Runtime | precision | decode tok/s @ batch 1 | aggregate @ batch 32 |
|---|---|---|---|
| llama.cpp | Q4_K_M | 253 | 526 |
| llama.cpp | BF16 | 120 | 445 |
| vLLM | fp16 | 100 | 1333 |
- Quantization buys llama.cpp a 2.1× single-stream decode edge (253 vs 120) — pure memory bandwidth (4.94 vs 16 bits/weight). It collapses to 1.18× at batch 32 (decode becomes compute-bound once weights amortize over the batch).
- At precision parity, llama.cpp single-stream (120) ≈ vLLM (100). vLLM's continuous batching is far ahead at batch 32 (1333 tok/s) — but for short grounding outputs that capacity is wasted, because vision dominates E2E.
- One more quantization gotcha: Q4_K_M degraded output quality on this model — on several images it produced spurious repeated boxes/labels where BF16 stayed clean. So Q4's speed advantage is doubly moot for real grounding.
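The ratios quoted in the bullets above follow directly from the decode table. The snippet recomputes them, along with the 16 / 4.94 bits-per-weight ratio, which is a rough ceiling on the Q4_K_M edge if decode were limited only by reading weights (an idealization that ignores KV-cache and activation traffic).

```python
# Decode throughput (tok/s) copied from the table above.
decode_tok_s = {
    ("llama.cpp", "Q4_K_M"): {"b1": 253, "b32": 526},
    ("llama.cpp", "BF16"): {"b1": 120, "b32": 445},
    ("vLLM", "fp16"): {"b1": 100, "b32": 1333},
}

q4 = decode_tok_s[("llama.cpp", "Q4_K_M")]
bf16 = decode_tok_s[("llama.cpp", "BF16")]
vllm = decode_tok_s[("vLLM", "fp16")]

q4_edge_b1 = q4["b1"] / bf16["b1"]      # single-stream edge from quantization
q4_edge_b32 = q4["b32"] / bf16["b32"]   # same edge at batch 32
bits_ratio = 16 / 4.94                  # idealized bandwidth-bound ceiling
vllm_batch_scaling = vllm["b32"] / vllm["b1"]

print(f"Q4_K_M vs BF16 @ batch 1:  {q4_edge_b1:.2f}x")
print(f"Q4_K_M vs BF16 @ batch 32: {q4_edge_b32:.2f}x")
print(f"bits/weight ratio:         {bits_ratio:.2f}x")
print(f"vLLM batch 32 vs batch 1:  {vllm_batch_scaling:.1f}x")
```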
Takeaways
- If you want throughput and you're OK running PyTorch, the native MTP path is the one to use — it's the only runtime that actually uses the model as designed, and the batched vision/prefill is what wins on real (short) grounding outputs.
- llama.cpp is the easiest deploy (one native binary, runs on Windows, CPU-offload friendly) and is a solid single-stream option. Use the BF16 GGUF unless you've verified Q4 quality on your data.
- The vLLM port is clever (extract the Qwen2 backbone, inject `prompt_embeds`), but on a single 16 GB card the split design can't host the vision tower and a full-GPU vLLM together, and the client-side serial vision encode becomes ~82% of the wall clock. Its batched-decode strength would pay off with a separate GPU for vision, longer outputs, or very high concurrency.
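A back-of-the-envelope Amdahl-style estimate shows how much the vLLM port's end-to-end time depends on the vision stage. It uses only the ~977 ms/img and ~82% figures reported above; the 3× vision speedup is a purely hypothetical input (borrowed from the native path's 2.6–3× packing gain) and was not measured for vLLM.

```python
def split_latency(ms_per_img, vision_fraction):
    """Split end-to-end latency into (vision_ms, everything_else_ms)."""
    vision_ms = ms_per_img * vision_fraction
    return vision_ms, ms_per_img - vision_ms


def e2e_with_faster_vision(ms_per_img, vision_fraction, vision_speedup):
    """Latency if only the vision stage ran vision_speedup times faster."""
    vision_ms, rest_ms = split_latency(ms_per_img, vision_fraction)
    return rest_ms + vision_ms / vision_speedup


vision_ms, rest_ms = split_latency(977, 0.82)
print(f"vision ~{vision_ms:.0f} ms, prefill + decode ~{rest_ms:.0f} ms")
print(f"speedup ceiling if vision were free: {977 / rest_ms:.1f}x")

# Hypothetical scenario, not a measurement.
estimate = e2e_with_faster_vision(977, 0.82, vision_speedup=3.0)
print(f"with a hypothetical 3x faster vision stage: ~{estimate:.0f} ms/img")
```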
Huge thanks to @yuuko-eth (llama.cpp + GGUF) and @WuNein (vLLM) — having multiple working runtimes for this model is great, and none of this comparison would exist without them.
Setup: RTX 5070 Ti (sm_120, 16 GB) · CUDA 12.8/13.0 · llama.cpp yuuko-eth mtmd-grounders fork (mainline llama.cpp does not load this model's projector) · vLLM 0.22.0 (WSL) · locateanything-batch bf16 + flash-attn. Greedy throughout. Scripts + raw JSON: the locateanything-batch repo under examples/ (benchmark.py, bench_compare_llamacpp.py, _bench_results/COMPARISON.md).
Oh boy, thank you for the thorough benchmark! In practice anything lower than Q4 would be visibly unusable at first glance, so that's the floor I included.
So maybe Q6 or Q8 as usable ones? Pairing grammar constraints with Q4_K_M does maintain the output quality in my test cases, where it's mostly about locating a single element, but that's just my own parameters 😂
Fair enough, but I still see room for optimization: for instance, using Qwen2 with FP8 precision on the RTX 4090 and explicitly enabling Flash Attention 2 for the ViT implementation.
https://huggingface.co/shigureui/LocateAnything-Qwen2-FP8
Working on it, still.