Benchmark: locateanything-batch (MTP) vs the llama.cpp & vLLM AR ports

#14

by Liuwang971 - opened 1 day ago

Discussion

Liuwang971

1 day ago

Benchmarking the LocateAnything-3B runtimes: native MTP vs the llama.cpp & vLLM AR ports

There are now three ways to run nvidia/LocateAnything-3B outside the reference snippet:

- llama.cpp: the yuuko-eth/llama.cpp mtmd-grounders fork + yuuko-eth/LocateAnything-3B-GGUF (discussion #12)
- vLLM: WuNein/LocateAnything-vLLM (discussion #13)
- native MTP, batched: locateanything-batch (discussion #10), which keeps the model's own multi-token-prediction decode but lifts the batch == 1 lock

Both community ports are excellent for getting the model running, but neither implements MTP — they decode plain autoregressively (speculative_config=None in vLLM; one token/step in llama.cpp). I was curious how much that costs, so I benchmarked all three head-to-head. Sharing the numbers and (more usefully) why they come out the way they do.

TL;DR

End-to-end grounding throughput, one RTX 5070 Ti (16 GB), single clean detection per image, greedy, batch 8, precision-matched (bf16/fp16), each run serially on an otherwise-idle GPU:

| Runtime | Decode | Precision | img/s | ms/img |
|---|---|---|---|---|
| native MTP (`locateanything-batch`) | fast-MTP + batched | bf16 | 4.53 | 221 |
| llama.cpp (yuuko-eth fork) | autoregressive | BF16 | 2.61 | 383 |
| vLLM (WuNein) | autoregressive | fp16 | 1.02 | 977 |

MTP is ~1.7× faster than llama.cpp and ~4.4× faster than vLLM end-to-end. All three decode the identical box (<ref>rectangle</ref><box><117><181><460><678></box>), so this is purely a runtime comparison.
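For anyone who wants to check the ratios, here is a minimal sketch that recomputes them from the img/s column. The numbers are copied from the table; the helper names are made up for illustration. The vLLM row only reproduces approximately (about 980 ms against the reported 977), presumably because the img/s figure is rounded.

```python
# Throughput figures copied from the table above (images per second).
throughput = {
    "native MTP": 4.53,
    "llama.cpp BF16": 2.61,
    "vLLM fp16": 1.02,
}


def ms_per_image(img_per_s: float) -> float:
    """Convert throughput (img/s) into average time per image (ms)."""
    return 1000.0 / img_per_s


def speedup(fast: float, slow: float) -> float:
    """How many times higher the `fast` throughput is than `slow`."""
    return fast / slow


mtp = throughput["native MTP"]
for name, rate in throughput.items():
    print(
        f"{name:16s} {ms_per_image(rate):6.0f} ms/img  "
        f"MTP speedup {speedup(mtp, rate):.2f}x"
    )
```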

The surprising part isn't that MTP wins — it's where the time actually goes, which flips some intuitions about these runtimes.

Method (and the traps I hit, in case you benchmark this too)

A few things make a naive comparison misleading; each is worth knowing:

- Precision. The GGUF repo ships Q4_K_M by default, but vLLM runs fp16 and the native path is bf16. Quantization gives llama.cpp a real single-stream decode edge that has nothing to do with the runtime. There's a prebuilt LocateAnything-3B-BF16.gguf in the GGUF repo; use it for an apples-to-apples comparison.
- vLLM hides the vision cost. The WuNein port computes the MoonViT vision embeddings client-side and ships prompt_embeds to the server, so the server-side latency excludes vision entirely. For a true end-to-end number you must time the vision encode too.
- Output length must match. The model's default "Locate all the instances that matches the following description: …" prompt makes the MTP path emit a whole stream of boxes (and, on out-of-domain inputs, degenerate into repeats running to hundreds of tokens), while plain AR stops early. That's not a fair latency comparison. Switching to a single-detection prompt, "Detect the <obj>.", makes every runtime emit exactly one box and stop at EOS (~9–10 tokens), so the decode work is identical.
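The second trap is easy to reproduce in a harness. Below is a minimal sketch of a timer that keeps vision encode inside the measured window and reports its share of the total. `encode_vision` and `generate` are hypothetical callables standing in for whatever a given runtime exposes; they are not functions from any of the three projects. For GPU work both callables must block until the result is ready (for example by synchronizing the device), otherwise the clock only measures kernel launch time.

```python
import time
from typing import Callable, Sequence


def time_end_to_end(
    images: Sequence[object],
    encode_vision: Callable[[Sequence[object]], object],  # hypothetical hook
    generate: Callable[[object], object],                 # hypothetical hook
    batch_size: int = 8,
) -> dict:
    """Time vision encode and generation separately, batch by batch."""
    vision_s = 0.0
    generate_s = 0.0
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        t0 = time.perf_counter()
        embeds = encode_vision(batch)   # must block until finished
        t1 = time.perf_counter()
        generate(embeds)                # must block until finished
        t2 = time.perf_counter()
        vision_s += t1 - t0
        generate_s += t2 - t1
    total_s = vision_s + generate_s
    return {
        "img_per_s": len(images) / total_s,
        "ms_per_img": 1000.0 * total_s / len(images),
        "vision_share": vision_s / total_s,
    }
```

Timing only the second callable is what a server-side latency number amounts to when the embeddings are computed on the client.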

Final protocol: 8 single-object images (1024×768, one colored rectangle each), prompt "Detect the rectangle.", greedy, batch 8 (batch 32 strained the 16 GB card and regressed the MTP path; see caveats).
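To confirm that a run really produced "a single clean detection", the output string can be checked mechanically. This is a minimal sketch that assumes the output format matches the sample quoted in the TL;DR, with four integers per `<box>`. What the four integers mean (corner order, pixel or normalized units) is not stated here, so the code leaves them uninterpreted. The function names are illustrative.

```python
import re
from collections import Counter

# Matches one <box><a><b><c><d></box> group, as in the sample output above.
BOX_RE = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")


def parse_boxes(text: str) -> list:
    """Return every box in the output as a tuple of four ints."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_RE.finditer(text)]


def is_single_clean_detection(text: str) -> bool:
    """True when the output contains exactly one box."""
    return len(parse_boxes(text)) == 1


def repeated_boxes(text: str) -> dict:
    """Boxes that occur more than once, a cheap sign of degenerate repeats."""
    counts = Counter(parse_boxes(text))
    return {box: n for box, n in counts.items() if n > 1}


sample = "<ref>rectangle</ref><box><117><181><460><678></box>"
print(parse_boxes(sample))  # [(117, 181, 460, 678)]
```

Comparing `parse_boxes` results across runtimes is one way to verify that all of them decode the identical box before comparing speed.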

Where the end-to-end time goes

This is the key finding. For a short grounding output, E2E is dominated by vision-encode + prefill, not decode:

| Runtime | Vision encode | Prefill + decode (server) | Bottleneck |
|---|---|---|---|
| vLLM | ~82% | ~18% | client-side, serial-per-image vision encode |
| llama.cpp | prefill (~1k image tokens) dominates | decode ~9 tok is negligible | prefill |
| native MTP | batched (all images in one flash pass) + batched prefill | - | best-amortized → fastest |

So the runtime that wins is the one that batches the front of the pipeline. The native path packs every image into one MoonViT extract_feature call (flash varlen, block-diagonal; bit-identical to per-image encoding but 2.6–3× faster) and does one batched shared-prefix prefill (~3.6× over batch 1). The vLLM port does the opposite: vision runs one image at a time on the client, which is why its E2E is the slowest despite having by far the highest aggregate raw decode throughput (see next).
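For readers unfamiliar with the "varlen, block-diagonal" idea: the tokens of all images are concatenated into one long sequence, and a list of cumulative sequence lengths tells the attention kernel where each image starts and ends, so tokens only attend within their own image. The sketch below is a generic pure-Python illustration of that bookkeeping, not code from locateanything-batch; the function names and the tiny token counts are made up.

```python
from bisect import bisect_right
from itertools import accumulate


def cu_seqlens(token_counts: list) -> list:
    """Cumulative boundaries of each image's tokens in the packed sequence."""
    return [0, *accumulate(token_counts)]


def may_attend(query_idx: int, key_idx: int, cu: list) -> bool:
    """True if both packed positions belong to the same image segment."""
    return bisect_right(cu, query_idx) == bisect_right(cu, key_idx)


def block_diagonal_mask(token_counts: list) -> list:
    """Dense boolean mask equivalent to attending per image separately."""
    cu = cu_seqlens(token_counts)
    total = cu[-1]
    return [[may_attend(q, k, cu) for k in range(total)] for q in range(total)]


# Two toy "images" with 3 and 2 tokens packed into one sequence of 5.
print(cu_seqlens([3, 2]))  # [0, 3, 5]
for row in block_diagonal_mask([3, 2]):
    print("".join("#" if allowed else "." for allowed in row))
```

Because no token ever sees another image's tokens, the packed pass can give the same result as encoding each image alone while paying the launch and scheduling overhead only once.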

Raw LM decode capacity (forced 128 tokens, vision excluded)

If you only look at decode tokens/s, you get a very different — and misleading — ranking:

| Runtime | Precision | Decode tok/s @ batch 1 | Aggregate tok/s @ batch 32 |
|---|---|---|---|
| llama.cpp | Q4_K_M | 253 | 526 |
| llama.cpp | BF16 | 120 | 445 |
| vLLM | fp16 | 100 | 1333 |

- Quantization buys llama.cpp a 2.1× single-stream decode edge (253 vs 120), which is pure memory bandwidth (4.94 vs 16 bits/weight). It collapses to 1.18× at batch 32 (decode becomes compute-bound once weights amortize over the batch).
- At precision parity, llama.cpp single-stream (120) ≈ vLLM (100). vLLM's continuous batching is far ahead at batch 32 (1333 tok/s), but for short grounding outputs that capacity is wasted, because vision dominates E2E.
- One more quantization gotcha: Q4_K_M degraded output quality on this model. On several images it produced spurious repeated boxes/labels where BF16 stayed clean. So Q4's speed advantage is doubly moot for real grounding.
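The ratios in these bullets can be recomputed from the table, along with two derived figures that are not in the original post: the ceiling a purely weight-bandwidth-bound decode would allow (16 / 4.94), and how much each runtime gains going from batch 1 to batch 32. A minimal sketch, with the table values copied in and illustrative variable names:

```python
# (tok/s at batch 1, aggregate tok/s at batch 32), copied from the table above.
decode = {
    "llama.cpp Q4_K_M": (253, 526),
    "llama.cpp BF16": (120, 445),
    "vLLM fp16": (100, 1333),
}

# Bits per weight quoted in the text above.
BITS_Q4_K_M = 4.94
BITS_BF16 = 16.0

# Upper bound on the Q4 speedup if decode were limited only by reading weights.
bandwidth_ceiling = BITS_BF16 / BITS_Q4_K_M

# Observed Q4_K_M over BF16 edge at each batch size.
q4_edge_b1 = decode["llama.cpp Q4_K_M"][0] / decode["llama.cpp BF16"][0]
q4_edge_b32 = decode["llama.cpp Q4_K_M"][1] / decode["llama.cpp BF16"][1]

# Aggregate throughput gain from batch 1 to batch 32, per runtime.
batch_scaling = {name: b32 / b1 for name, (b1, b32) in decode.items()}

print(f"bandwidth ceiling  {bandwidth_ceiling:.2f}x")
print(f"Q4 edge @ batch 1  {q4_edge_b1:.2f}x")
print(f"Q4 edge @ batch 32 {q4_edge_b32:.2f}x")
for name, gain in batch_scaling.items():
    print(f"{name:18s} batch-32 gain {gain:.1f}x")
```

The observed 2.1× sits below the roughly 3.2× ceiling, which is what one would expect if part of each decode step (KV-cache reads, activations, dequantization) does not shrink with the weights; that explanation is an assumption, not something measured here.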

Takeaways

- If you want throughput and you're OK running PyTorch, the native MTP path is the one to use. It's the only runtime that actually uses the model as designed, and the batched vision/prefill is what wins on real (short) grounding outputs.
- llama.cpp is the easiest deploy (one native binary, runs on Windows, CPU-offload friendly) and is a solid single-stream option. Use the BF16 GGUF unless you've verified Q4 quality on your data.
- The vLLM port is clever (extract the Qwen2 backbone, inject prompt_embeds), but on a single 16 GB card the split design can't host the vision tower and a full-GPU vLLM together, and the client-side serial vision encode becomes ~82% of the wall clock. Its batched-decode strength would pay off with a separate GPU for vision, longer outputs, or very high concurrency.

Huge thanks to @yuuko-eth (llama.cpp + GGUF) and @WuNein (vLLM) — having multiple working runtimes for this model is great, and none of this comparison would exist without them.

Setup: RTX 5070 Ti (sm_120, 16 GB) · CUDA 12.8/13.0 · llama.cpp yuuko-eth mtmd-grounders fork (mainline llama.cpp does not load this model's projector) · vLLM 0.22.0 (WSL) · locateanything-batch bf16 + flash-attn. Greedy throughout. Scripts + raw JSON: the locateanything-batch repo under examples/ (benchmark.py, bench_compare_llamacpp.py, _bench_results/COMPARISON.md).

yuuko-eth

about 24 hours ago

Oh boy, thank you for the thorough benchmark! In practice anything lower than Q4 would be visibly unusable at first glance, so that's the floor I included.
So maybe Q6 or Q8 as the usable ones? Pairing grammar constraints with Q4_K_M does maintain the output quality in my test cases, which are mostly about locating a single element, but that's just my own parameters 😂

shigureui

about 4 hours ago

Fair enough, but I still see room for optimization. For instance, running Qwen2 at FP8 precision on the RTX 4090 and explicitly enabling Flash Attention 2 for the ViT implementation.

https://huggingface.co/shigureui/LocateAnything-Qwen2-FP8

Working on it, still.
