Inference support for vLLM and SGLang OpenAI endpoints
Hi NVIDIA Team,
I'm interested in deploying LocateAnything-3B using high-throughput inference engines like vLLM or SGLang.
- Are there any specific configuration flags required to handle the Parallel Box Decoding (PBD) architecture when serving via an OpenAI-compatible endpoint?
- Does the current implementation in these engines support the custom MLP projector and MoonViT encoder natively, or is a specific trust-remote-code setup required?
- If not currently supported, are there plans for an official integration or a recommended Docker container for scalable production serving?
Thanks for this impressive grounding model!
Hi @Vishva007,
Thanks for your interest in LocateAnything-3B and for the kind words about our grounding model!
Regarding your deployment questions, to be completely transparent, our team currently lacks extensive experience in deploying models on high-throughput inference engines like vLLM or SGLang. Here is the current status regarding your questions:
Our Recent Attempts: I actually tried using GPT-5.5 xhigh to do some "vibe coding" to hack together a vLLM-compatible version recently, but I ran into a ton of issues and roadblocks.
Potential Reference: During my exploration, one resource that seemed somewhat relevant as a reference point is the vllm-project/dllm-plugin (vLLM plugin for block-based diffusion language model support).
Future Plans: Honestly, adapting our complex multi-modal architecture to fit perfectly into these engines feels like a very difficult path right now. Because of this, we don't have immediate plans for an official integration or a recommended Docker container for scalable production serving at this exact moment.
Since we are still exploring this space ourselves, if you decide to dive into it or make any progress, we would absolutely love to hear your insights or welcome a community PR! We are very open to collaborating with the community to figure out the best deployment path.
Thanks again!
Hi Shihao,
Thank you for the incredibly candid response! I really appreciate the transparency regarding your recent experiments and the current state of vLLM/SGLang support.
To be honest, modifying core engine architectures to support your custom MLP and MoonViT setup is likely a bit out of my depth as well! That is definitely a massive undertaking.
I'll check out the dllm-plugin reference you mentioned. If I manage to hack together a workaround or make any breakthroughs, I'll gladly share them here.
Thanks again to you and the team for an amazing grounding model!
Best regards,
Vishva
You can refer to Kimi-VL's code for most of the vLLM adaptation, but there is still a lot to modify: 1) it is not compatible with transformers v5; 2) the processor may be a big problem. I'm stuck at _get_prompt_updates because the processor can't correctly handle the <image-1> to <img><IMG_CONTEXT></img> conversion, while Kimi-VL's processor handles it correctly.
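For anyone following along, here is a rough, standalone sketch of the placeholder conversion mentioned above. It is not LocateAnything's or vLLM's actual processor code: the function name `expand_image_placeholders` and the `tokens_per_image` argument are hypothetical, and repeating `<IMG_CONTEXT>` once per visual token is an assumption (with a count of 1 it reproduces exactly the `<img><IMG_CONTEXT></img>` string from the comment).

```python
import re

# Placeholders are assumed to look like <image-1>, <image-2>, ... (1-based).
IMAGE_PLACEHOLDER = re.compile(r"<image-(\d+)>")


def expand_image_placeholders(prompt: str, tokens_per_image: list[int]) -> str:
    """Replace each <image-N> with <img> + k * <IMG_CONTEXT> + </img>.

    tokens_per_image[N - 1] is the (hypothetical) number of context tokens
    reserved for image N; the real count would come from the image processor.
    """

    def _expand(match: re.Match) -> str:
        index = int(match.group(1)) - 1  # convert the 1-based id to a list index
        if not 0 <= index < len(tokens_per_image):
            raise ValueError(f"no image provided for placeholder {match.group(0)}")
        return "<img>" + "<IMG_CONTEXT>" * tokens_per_image[index] + "</img>"

    return IMAGE_PLACEHOLDER.sub(_expand, prompt)


# Example: one image that is assumed to occupy three context tokens.
expanded = expand_image_placeholders("<image-1>\nLocate the dog.", [3])
```

Inside vLLM, the equivalent logic would have to be expressed through the prompt updates returned by `_get_prompt_updates` rather than by string substitution, so treat this only as a reference for the expected output.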
It seems OK to reuse vLLM's version of MoonViT, and the MLP can be handled with a hf_to_vllm_mapper defined in a custom LocateAnythingForConditionalGeneration.
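To illustrate what such a mapper does, here is a plain-Python stand-in for prefix-based weight renaming. In vLLM itself this is normally declared with a `WeightsMapper` on the model class; the code below does not use vLLM, and every prefix and parameter name in it is made up for illustration, since the real checkpoint key names are not given in this thread.

```python
def remap_weight_names(names, prefix_map):
    """Rename checkpoint keys by replacing the first matching prefix.

    Keys that match no prefix are passed through unchanged.
    """
    remapped = []
    for name in names:
        for old_prefix, new_prefix in prefix_map.items():
            if name.startswith(old_prefix):
                name = new_prefix + name[len(old_prefix):]
                break  # apply at most one rename per key
        remapped.append(name)
    return remapped


# Hypothetical prefixes: HF checkpoint layout -> layout expected by the vLLM model.
PREFIX_MAP = {
    "model.vision_tower.": "vision_tower.",
    "model.mlp_projector.": "multi_modal_projector.",
}

hf_names = [
    "model.vision_tower.patch_embed.proj.weight",
    "model.mlp_projector.0.weight",
    "lm_head.weight",
]
vllm_names = remap_weight_names(hf_names, PREFIX_MAP)
```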
As for PBD MTP, I haven't yet worked out how to adapt it to vLLM's framework (only the AR path for now, and I'm stuck at that f*cking processor).