Benchmark: locateanything-batch (MTP) vs the llama.cpp & vLLM AR ports

#14
by Liuwang971 - opened

Benchmarking the LocateAnything-3B runtimes: native MTP vs the llama.cpp & vLLM AR ports

There are now three ways to run nvidia/LocateAnything-3B outside the reference snippet:

  • llama.cpp: the yuuko-eth/llama.cpp mtmd-grounders fork + yuuko-eth/LocateAnything-3B-GGUF (discussion #12)
  • vLLM: WuNein/LocateAnything-vLLM (discussion #13)
  • native MTP, batched: locateanything-batch (discussion #10), which keeps the model's own multi-token-prediction decode but lifts the batch == 1 lock

Both community ports are excellent for getting the model running, but neither implements MTP — they decode plain autoregressively (speculative_config=None in vLLM; one token/step in llama.cpp). I was curious how much that costs, so I benchmarked all three head-to-head. Sharing the numbers and (more usefully) why they come out the way they do.

TL;DR

End-to-end grounding throughput, one RTX 5070 Ti (16 GB), single clean detection per image, greedy, batch 8, precision-matched (bf16/fp16), each run serially on an otherwise-idle GPU:

| Runtime | Decode | Precision | img/s | ms/img |
|---|---|---|---|---|
| native MTP (locateanything-batch) | fast-MTP + batched | bf16 | 4.53 | 221 |
| llama.cpp (yuuko-eth fork) | autoregressive | BF16 | 2.61 | 383 |
| vLLM (WuNein) | autoregressive | fp16 | 1.02 | 977 |

MTP is ~1.7× faster than llama.cpp and ~4.4× faster than vLLM end-to-end. All three decode the identical box (<ref>rectangle</ref><box><117><181><460><678></box>), so this is purely a runtime comparison.
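
For anyone who wants to re-derive the headline ratios, here is a minimal sketch that only re-uses the img/s figures from the table above. The dictionary keys and helper names are made up for illustration. The vLLM row comes out at ~980 ms rather than 977 ms because the script starts from the rounded 1.02 img/s.

```python
# Throughput figures copied from the table above (images per second).
img_per_s = {
    "native MTP": 4.53,
    "llama.cpp BF16": 2.61,
    "vLLM fp16": 1.02,
}


def ms_per_image(rate: float) -> float:
    """Convert images/second to milliseconds/image."""
    return 1000.0 / rate


def speedup(fast: float, slow: float) -> float:
    """How many times more images per second `fast` delivers than `slow`."""
    return fast / slow


for name, rate in img_per_s.items():
    print(f"{name:15s} {rate:5.2f} img/s  {ms_per_image(rate):6.1f} ms/img")

mtp = img_per_s["native MTP"]
print(f"MTP vs llama.cpp: {speedup(mtp, img_per_s['llama.cpp BF16']):.2f}x")
print(f"MTP vs vLLM:      {speedup(mtp, img_per_s['vLLM fp16']):.2f}x")
```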

The surprising part isn't that MTP wins — it's where the time actually goes, which flips some intuitions about these runtimes.

Method (and the traps I hit, in case you benchmark this too)

A few things make a naive comparison misleading; each is worth knowing:

  1. Precision. The GGUF repo ships Q4_K_M by default, but vLLM runs fp16 and the native path is bf16. Quantization gives llama.cpp a real single-stream decode edge that has nothing to do with the runtime. There's a prebuilt LocateAnything-3B-BF16.gguf in the GGUF repo — use it for an apples-to-apples comparison.

  2. vLLM hides the vision cost. The WuNein port computes the MoonViT vision embeddings client-side and ships prompt_embeds to the server, so the server-side latency excludes vision entirely. For a true end-to-end number you must time the vision encode too.

  3. Output length must match. The model's default "Locate all the instances that matches the following description: …" prompt makes the MTP path emit a whole stream of boxes (and, on out-of-domain inputs, degenerate into repeats — hundreds of tokens), while plain AR stops early. That's not a fair latency comparison. Switching to a single-detection prompt — "Detect the <obj>." — makes every runtime emit exactly one box and stop at EOS (~9–10 tokens), so the decode work is identical.
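
Point 3 is easy to check mechanically. The sketch below parses the box format shown in the TL;DR and verifies that an output contains exactly one box and nothing else. The regex and the function names are my own; the post does not say what the four numbers mean, so they are returned as a plain 4-tuple without assuming a coordinate order.

```python
import re

# Matches one grounded box in the form shown in the TL;DR:
# <ref>label</ref><box><n><n><n><n></box>
BOX_RE = re.compile(
    r"<ref>(?P<label>[^<]*)</ref>"
    r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
)


def parse_boxes(text: str) -> list[tuple[str, tuple[int, ...]]]:
    """Return every (label, four integers) pair found in a decoded string."""
    out = []
    for m in BOX_RE.finditer(text):
        # Group 1 is the named label group; groups 2..5 are the numbers.
        coords = tuple(int(m.group(i)) for i in range(2, 6))
        out.append((m.group("label"), coords))
    return out


def is_single_clean_detection(text: str) -> bool:
    """True when the output is exactly one box with no leftover text."""
    boxes = parse_boxes(text)
    leftover = BOX_RE.sub("", text).strip()
    return len(boxes) == 1 and leftover == ""


sample = "<ref>rectangle</ref><box><117><181><460><678></box>"
print(parse_boxes(sample))
print(is_single_clean_detection(sample))      # one box, nothing else
print(is_single_clean_detection(sample * 3))  # repeated boxes are rejected
```

Running the same check on every runtime's output before timing anything makes it obvious when one of them has drifted into a longer (or degenerate) generation.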

Final protocol: 8 single-object images (1024×768, one colored rectangle each), prompt "Detect the rectangle.", greedy, batch 8 (batch 32 strained the 16 GB card and regressed the MTP path; see caveats).
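
To avoid the trap in point 2, the vision stage has to be inside the timed region. Below is a minimal sketch of a harness that times the two stages separately and reports per-image numbers. `encode_vision` and `generate` are hypothetical callables, not the API of any of the three runtimes, and the `time.sleep` stand-ins only exist so the sketch runs without a GPU. With real GPU code each callable must block until the work has finished, otherwise the timings are meaningless.

```python
import time
from typing import Callable, Sequence


def time_end_to_end(
    encode_vision: Callable[[Sequence], object],
    generate: Callable[[object], object],
    images: Sequence,
    warmup: int = 1,
    repeats: int = 3,
) -> dict:
    """Time vision encode and generation separately; report per-image means."""
    for _ in range(warmup):  # untimed passes to absorb one-off startup costs
        generate(encode_vision(images))

    vision_s = 0.0
    generate_s = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        features = encode_vision(images)
        t1 = time.perf_counter()
        generate(features)
        t2 = time.perf_counter()
        vision_s += t1 - t0
        generate_s += t2 - t1

    n = repeats * len(images)
    total_s = vision_s + generate_s
    return {
        "vision_ms_per_img": 1000.0 * vision_s / n,
        "generate_ms_per_img": 1000.0 * generate_s / n,
        "e2e_ms_per_img": 1000.0 * total_s / n,
        "img_per_s": n / total_s,
        "vision_share": vision_s / total_s,
    }


# Stand-ins so the sketch runs anywhere: sleep instead of real GPU work.
def fake_encode(images):
    time.sleep(0.05)   # pretend vision encode
    return list(images)


def fake_generate(features):
    time.sleep(0.001)  # pretend prefill + decode
    return ["<box>"] * len(features)


result = time_end_to_end(fake_encode, fake_generate, images=range(8))
for key, value in result.items():
    print(f"{key:20s} {value:8.3f}")
```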

Where the end-to-end time goes

This is the key finding. For a short grounding output, E2E is dominated by vision-encode + prefill, not decode:

| Runtime | Vision encode | Prefill + decode (server) | Bottleneck |
|---|---|---|---|
| vLLM | ~82% | ~18% | client-side, serial-per-image vision encode |
| llama.cpp | prefill (~1k image tokens) dominates | decode (~9 tok) is negligible | prefill |
| native MTP | batched (all images in one flash pass) | + batched prefill | best-amortized → fastest |

So the runtime that wins is the one that batches the front of the pipeline. The native path packs every image into one MoonViT extract_feature call (flash varlen, block-diagonal; bit-identical to per-image encoding but 2.6–3× faster) and does one batched shared-prefix prefill (~3.6× faster than batch-1). The vLLM port does the opposite: vision runs one image at a time on the client, which is why its E2E is the slowest despite having by far the highest raw decode throughput at batch 32 (see next).
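
For readers unfamiliar with "varlen, block-diagonal" packing, here is a toy illustration in plain Python. It is not the locateanything-batch code. It only shows the general idea: tokens from several images are concatenated into one buffer, the cumulative lengths mark where each image starts (varlen attention kernels typically take such offsets, often named `cu_seqlens`), and attention is allowed only inside each image's own block, which is why the result can match per-image encoding.

```python
from itertools import accumulate


def cumulative_seqlens(lengths: list[int]) -> list[int]:
    """Offsets of each sequence in the packed buffer: [0, l0, l0+l1, ...]."""
    return [0, *accumulate(lengths)]


def block_diagonal_mask(lengths: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when packed tokens i and j belong to the same image."""
    owner = [img for img, n in enumerate(lengths) for _ in range(n)]
    return [[a == b for b in owner] for a in owner]


# Three images with 2, 3 and 1 patch tokens (tiny numbers for readability).
lengths = [2, 3, 1]
print(cumulative_seqlens(lengths))  # [0, 2, 5, 6]
for row in block_diagonal_mask(lengths):
    print("".join("#" if ok else "." for ok in row))
# ##....
# ##....
# ..###.
# ..###.
# ..###.
# .....#
```

A fused varlen kernel never materializes this mask; it just skips the off-diagonal blocks, so one launch covers all images without padding.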

Raw LM decode capacity (forced 128 tokens, vision excluded)

If you only look at decode tokens/s, you get a very different — and misleading — ranking:

| Runtime | Precision | Decode tok/s @ batch 1 | Aggregate tok/s @ batch 32 |
|---|---|---|---|
| llama.cpp | Q4_K_M | 253 | 526 |
| llama.cpp | BF16 | 120 | 445 |
| vLLM | fp16 | 100 | 1333 |

  • Quantization buys llama.cpp a 2.1× single-stream decode edge (253 vs 120) — pure memory bandwidth (4.94 vs 16 bits/weight). It collapses to 1.18× at batch 32 (decode becomes compute-bound once weights amortize over the batch).
  • At precision parity, llama.cpp single-stream (120) ≈ vLLM (100). vLLM's continuous batching is far ahead at batch 32 (1333 tok/s) — but for short grounding outputs that capacity is wasted, because vision dominates E2E.
  • One more quantization gotcha: Q4_K_M degraded output quality on this model — on several images it produced spurious repeated boxes/labels where BF16 stayed clean. So Q4's speed advantage is doubly moot for real grounding.
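
The ratios quoted in the bullets follow directly from the table; a short sketch to reproduce them, plus two derived views (batch scaling and per-stream rate) that are not in the table itself. The bits/weight ratio is only an upper bound under the assumption that decode is limited purely by weight reads; the measured 2.1× sits below it, presumably because other costs do not shrink with quantization.

```python
# Decode throughput from the table above:
# (tok/s at batch 1, aggregate tok/s at batch 32).
decode = {
    "llama.cpp Q4_K_M": (253, 526),
    "llama.cpp BF16": (120, 445),
    "vLLM fp16": (100, 1333),
}
BATCH = 32

for name, (single, aggregate) in decode.items():
    scaling = aggregate / single     # gain from batching 32 streams
    per_stream = aggregate / BATCH   # what each stream sees at batch 32
    print(f"{name:17s} scaling {scaling:5.2f}x  per-stream {per_stream:5.1f} tok/s")

q4_b1, q4_b32 = decode["llama.cpp Q4_K_M"]
bf_b1, bf_b32 = decode["llama.cpp BF16"]
print(f"Q4 edge at batch 1:  {q4_b1 / bf_b1:.2f}x")
print(f"Q4 edge at batch 32: {q4_b32 / bf_b32:.2f}x")
# Upper bound on the batch-1 edge if only weight traffic mattered.
print(f"bits/weight ratio:   {16 / 4.94:.2f}x")
```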

Takeaways

  • If you want throughput and you're OK running PyTorch, the native MTP path is the one to use — it's the only runtime that actually uses the model as designed, and the batched vision/prefill is what wins on real (short) grounding outputs.
  • llama.cpp is the easiest deploy (one native binary, runs on Windows, CPU-offload friendly) and is a solid single-stream option. Use the BF16 GGUF unless you've verified Q4 quality on your data.
  • The vLLM port is clever (extract the Qwen2 backbone, inject prompt_embeds), but on a single 16 GB card the split design can't host the vision tower and a full-GPU vLLM together, and the client-side serial vision encode becomes ~82% of the wall clock. Its batched-decode strength would pay off with a separate GPU for vision, longer outputs, or very high concurrency.
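
The last bullet is essentially Amdahl's law. A minimal sketch, assuming the ~82% vision share holds for the whole wall clock: it compares making the server side arbitrarily fast against speeding up only the vision encode. The 3× figure is a hypothetical input, borrowed from the batched-vision speedup reported for the native path; it is not a measurement of the vLLM port.

```python
def e2e_speedup(vision_share: float, vision_speedup: float,
                rest_speedup: float = 1.0) -> float:
    """Amdahl's law for a two-stage pipeline: vision encode, then the rest."""
    rest_share = 1.0 - vision_share
    return 1.0 / (vision_share / vision_speedup + rest_share / rest_speedup)


VISION_SHARE = 0.82  # share of wall clock reported for the vLLM port

# An infinitely fast server (prefill + decode) barely moves the total...
print(f"{e2e_speedup(VISION_SHARE, 1.0, float('inf')):.2f}x")  # 1.22x
# ...while a hypothetical 3x faster vision encode roughly doubles throughput.
print(f"{e2e_speedup(VISION_SHARE, 3.0):.2f}x")                # 2.21x
```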

Huge thanks to @yuuko-eth (llama.cpp + GGUF) and @WuNein (vLLM) — having multiple working runtimes for this model is great, and none of this comparison would exist without them.


Setup: RTX 5070 Ti (sm_120, 16 GB) · CUDA 12.8/13.0 · llama.cpp yuuko-eth mtmd-grounders fork (mainline llama.cpp does not load this model's projector) · vLLM 0.22.0 (WSL) · locateanything-batch bf16 + flash-attn. Greedy throughout. Scripts + raw JSON: the locateanything-batch repo under examples/ (benchmark.py, bench_compare_llamacpp.py, _bench_results/COMPARISON.md).

Oh boy, thank you for the thorough benchmark! In practice anything lower than Q4 would be visibly unusable at first glance, so that's the floor I included.
So maybe Q6 or Q8 as usable ones? Pairing grammar constraints with Q4_K_M does maintain the output quality in my test cases, where it's mostly about locating a single element, but that's just my own parameters 😂

Fair enough, but I still see room for optimization. For instance, running Qwen2 in FP8 precision on the RTX 4090 and explicitly enabling Flash Attention 2 for the ViT implementation.

https://huggingface.co/shigureui/LocateAnything-Qwen2-FP8

Working on it, still.
