Instructions for using nvidia/LocateAnything-3B with libraries and local apps.

Libraries

How to use nvidia/LocateAnything-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="nvidia/LocateAnything-3B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/LocateAnything-3B", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use nvidia/LocateAnything-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/LocateAnything-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/LocateAnything-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
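
For scripted use, the same request can be sent from Python. The sketch below uses only the standard library and assumes a server is already listening at the base URL you pass in (port 8000 for the vLLM command above). The helper names build_chat_request and chat are illustrative and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body as the curl example: one user turn with text and an image URL."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload to the OpenAI-compatible chat completions route and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call (requires a running server, so it is left commented out):
# payload = build_chat_request(
#     "nvidia/LocateAnything-3B",
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# reply = chat("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```

Because the base URL is a parameter, the same helper should work against any server exposing this route, for example one listening on port 30000.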

Use Docker

docker model run hf.co/nvidia/LocateAnything-3B

SGLang

How to use nvidia/LocateAnything-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/LocateAnything-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/LocateAnything-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/LocateAnything-3B" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use nvidia/LocateAnything-3B with Docker Model Runner:
```
docker model run hf.co/nvidia/LocateAnything-3B
```

Inference support for vLLM and SGLang OpenAI endpoints

by Vishva007 - opened 4 days ago

Discussion

Vishva007

4 days ago

Hi NVIDIA Team,

I'm interested in deploying LocateAnything-3B using high-throughput inference engines like vLLM or SGLang.

Are there any specific configuration flags required to handle the Parallel Box Decoding (PBD) architecture when serving via an OpenAI-compatible endpoint?
Does the current implementation in these engines support the custom MLP projector and MoonViT encoder natively, or is a specific trust-remote-code setup required?
If not currently supported, are there plans for an official integration or a recommended Docker container for scalable production serving?

Thanks for this impressive grounding model!

ShihaoW

NVIDIA org about 24 hours ago

Hi @Vishva007 ,

Thanks for your interest in LocateAnything-3B and for the kind words about our grounding model!

Regarding your deployment questions, to be completely transparent, our team currently lacks extensive experience in deploying models on high-throughput inference engines like vLLM or SGLang. Here is the current status regarding your questions:

Our Recent Attempts: I actually tried using GPT-5.5 xhigh to do some "vibe coding" to hack together a vLLM-compatible version recently, but I ran into a ton of issues and roadblocks.

Potential Reference: During my exploration, one resource that seemed somewhat relevant as a reference point is the vllm-project/dllm-plugin (vLLM plugin for block-based diffusion language model support).

Future Plans: Honestly, adapting our complex multi-modal architecture to fit perfectly into these engines feels like a very difficult path right now. Because of this, we don't have immediate plans for an official integration or a recommended Docker container for scalable production serving at this exact moment.

Since we are still exploring this space ourselves, if you decide to dive into it or make any progress, we would absolutely love to hear your insights or welcome a community PR! We are very open to collaborating with the community to figure out the best deployment path.

Thanks again!

Vishva007

about 21 hours ago

Hi Shihao,

Thank you for the incredibly candid response! I really appreciate the transparency regarding your recent experiments and the current state of vLLM/SGLang support.

To be honest, modifying core engine architectures to support your custom MLP and MoonViT setup is likely a bit out of my depth as well! That is definitely a massive undertaking.

I'll check out the dllm-plugin reference you mentioned. If I manage to hack together a workaround or make any breakthroughs, I’ll gladly share them here.

Thanks again to you and the team for an amazing grounding model!

Best regards,
Vishva

Columbus688

about 2 hours ago

You can reuse most of Kimi-VL's code for the vLLM adaptation, but there is still a lot to modify: 1) not compatible with transformers v5; 2) the processor may be a big problem. I'm stuck at _get_prompt_updates because the processor can't correctly handle the <image-1> to <img><IMG_CONTEXT></img> conversion, whereas Kimi-VL's processor handles this correctly.
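
To make the placeholder problem concrete, here is a rough sketch of the text-side conversion in plain Python. It is not the model's processor and not vLLM code. It assumes each 1-based <image-N> placeholder expands to <img>, a run of <IMG_CONTEXT> tokens, and </img>, and that the caller supplies the number of context tokens per image. The function name and the repeat-per-image behaviour are assumptions for illustration only.

```python
import re

# Token strings copied from the placeholder names quoted above.
IMG_START = "<img>"
IMG_CONTEXT = "<IMG_CONTEXT>"
IMG_END = "</img>"

# Matches <image-1>, <image-2>, ... and captures the 1-based index.
PLACEHOLDER = re.compile(r"<image-(\d+)>")


def expand_image_placeholders(prompt: str, tokens_per_image: list[int]) -> str:
    """Replace every <image-N> with <img> + repeated <IMG_CONTEXT> + </img>.

    tokens_per_image[N - 1] is the (assumed) number of context tokens for image N.
    """

    def _expand(match: re.Match) -> str:
        index = int(match.group(1)) - 1
        if not 0 <= index < len(tokens_per_image):
            raise ValueError(f"no image supplied for placeholder {match.group(0)}")
        return IMG_START + IMG_CONTEXT * tokens_per_image[index] + IMG_END

    return PLACEHOLDER.sub(_expand, prompt)
```

For example, expand_image_placeholders("<image-1> find the cat", [2]) gives "<img><IMG_CONTEXT><IMG_CONTEXT></img> find the cat". In a real integration the per-image token count would have to come from the vision encoder's patch layout, which this sketch does not model.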

It seems OK to reuse vLLM's version of MoonViT, and the MLP can be handled with an hf_to_vllm_mapper defined in a custom LocateAnythingForConditionalGeneration.
For PBD MTP, I haven't yet tried to work out how to adapt it to vLLM's framework (only the AR path for now, and I'm stuck on that processor).
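
The weight-name remapping idea can be illustrated without vLLM. The sketch below only shows prefix renaming on checkpoint parameter names, which is the kind of job a mapper such as hf_to_vllm_mapper would do. The function name and every prefix used in the example (vision_model., mlp1., vision_tower., multi_modal_projector.) are hypothetical and are not taken from the LocateAnything checkpoint.

```python
def remap_weight_names(names: list[str], prefix_map: dict[str, str]) -> list[str]:
    """Rename parameters by prefix; names with no matching prefix are kept unchanged.

    Longer prefixes are tried first so that a more specific rule wins over a shorter one.
    """
    ordered = sorted(prefix_map.items(), key=lambda item: len(item[0]), reverse=True)
    remapped = []
    for name in names:
        for old_prefix, new_prefix in ordered:
            if name.startswith(old_prefix):
                name = new_prefix + name[len(old_prefix):]
                break  # apply at most one rule per parameter name
        remapped.append(name)
    return remapped


# Hypothetical checkpoint names and prefixes, for illustration only.
example = remap_weight_names(
    ["vision_model.blocks.0.weight", "mlp1.0.bias", "language_model.embed.weight"],
    {"vision_model.": "vision_tower.", "mlp1.": "multi_modal_projector."},
)
```

Here example holds the first two names with their prefixes rewritten and the third name untouched. The real mapping would need to be read off the actual checkpoint keys and the target module names.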
