Inference support for vLLM and SGLang OpenAI endpoints

#3
by Vishva007 - opened

Hi NVIDIA Team,

I'm interested in deploying LocateAnything-3B using high-throughput inference engines like vLLM or SGLang.

  1. Are there any specific configuration flags required to handle the Parallel Box Decoding (PBD) architecture when serving via an OpenAI-compatible endpoint?
  2. Does the current implementation in these engines support the custom MLP projector and MoonViT encoder natively, or is a specific trust-remote-code setup required?
  3. If not currently supported, are there plans for an official integration or a recommended Docker container for scalable production serving?

Thanks for this impressive grounding model!

Hi @Vishva007,

Thanks for your interest in LocateAnything-3B and for the kind words about our grounding model!

Regarding your deployment questions, to be completely transparent, our team currently lacks extensive experience in deploying models on high-throughput inference engines like vLLM or SGLang. Here is the current status regarding your questions:

Our Recent Attempts: I actually tried using GPT-5.5 xhigh to do some "vibe coding" to hack together a vLLM-compatible version recently, but I ran into a ton of issues and roadblocks.

Potential Reference: During my exploration, one resource that seemed somewhat relevant as a reference point is the vllm-project/dllm-plugin (vLLM plugin for block-based diffusion language model support).

Future Plans: Honestly, adapting our complex multi-modal architecture to fit perfectly into these engines feels like a very difficult path right now. Because of this, we don't have immediate plans for an official integration or a recommended Docker container for scalable production serving at this exact moment.

Since we are still exploring this space ourselves, if you decide to dive into it or make any progress, we would absolutely love to hear your insights or welcome a community PR! We are very open to collaborating with the community to figure out the best deployment path.

Thanks again!

Hi Shihao,

Thank you for the incredibly candid response! I really appreciate the transparency regarding your recent experiments and the current state of vLLM/SGLang support.

To be honest, modifying core engine architectures to support your custom MLP and MoonViT setup is likely a bit out of my depth as well! That is definitely a massive undertaking.

I'll check out the dllm-plugin reference you mentioned. If I manage to hack together a workaround or make any breakthroughs, I’ll gladly share them here.

Thanks again to you and the team for an amazing grounding model!

Best regards,
Vishva

You can refer to Kimi-VL's code for most of the vLLM adaptation, but there is still a lot to modify: 1) it is not compatible with transformers v5; 2) the processor may be a big problem. I'm stuck at _get_prompt_updates because the processor does not correctly handle the <image-1> to <img><IMG_CONTEXT></img> conversion, whereas Kimi-VL's processor handles it correctly.
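
To make the placeholder conversion mentioned above concrete, here is a minimal plain-Python sketch of what that step is supposed to produce. It is not the model's actual processor and does not use vLLM. The function name, the 1-based reading of <image-N>, and the idea that each image expands to a caller-supplied number of <IMG_CONTEXT> tokens are all assumptions for illustration.

```python
import re

# Token strings taken from the conversion described above.
IMG_START, IMG_CONTEXT, IMG_END = "<img>", "<IMG_CONTEXT>", "</img>"
PLACEHOLDER = re.compile(r"<image-(\d+)>")


def expand_image_placeholders(prompt, tokens_per_image):
    """Replace each <image-N> with <img>, N-th image's context tokens, </img>.

    tokens_per_image[i] is an ASSUMED per-image token count; a real
    processor would derive it from the image size and patch layout.
    """
    def _sub(match):
        index = int(match.group(1)) - 1  # assumption: <image-1> is the first image
        if not 0 <= index < len(tokens_per_image):
            raise ValueError(f"no image supplied for {match.group(0)}")
        return IMG_START + IMG_CONTEXT * tokens_per_image[index] + IMG_END

    return PLACEHOLDER.sub(_sub, prompt)


expanded = expand_image_placeholders("<image-1> Locate the cat.", [3])
print(expanded)
```

If the processor's own expansion and the engine's expected replacement disagree on the token count or the surrounding tags, the prompt-update step has nothing to match against, which may be one thing worth checking here.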

It seems OK to reuse vLLM's version of MoonViT, and the MLP can be handled with an hf_to_vllm_mapper defined in a custom LocateAnythingForConditionalGeneration.
For PBD MTP, I haven't yet worked out how to adapt it to vLLM's framework (only the AR path for now, and I'm stuck on that f*cking processor).
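
For anyone following along, the weight-mapping idea boils down to renaming checkpoint keys by prefix before loading. The sketch below shows that idea in plain Python only. It does not call vLLM, and every key name in it is hypothetical, not taken from the real LocateAnything-3B checkpoint.

```python
def remap_prefixes(names, prefix_map):
    """Rename keys by their longest matching prefix; leave other keys unchanged."""
    # Try longer prefixes first so a more specific rule wins over a shorter one.
    ordered = sorted(prefix_map.items(), key=lambda kv: -len(kv[0]))
    remapped = []
    for name in names:
        for old, new in ordered:
            if name.startswith(old):
                name = new + name[len(old):]
                break
        remapped.append(name)
    return remapped


# HYPOTHETICAL key names, for illustration only.
PREFIX_MAP = {
    "mlp1.": "multi_modal_projector.",
    "vision_model.": "vision_tower.",
}

renamed = remap_prefixes(
    ["mlp1.0.weight", "vision_model.patch_embed.proj.weight", "lm_head.weight"],
    PREFIX_MAP,
)
print(renamed)
```

The real prefixes would have to be read off the checkpoint's state dict and the module names of the target model class.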
