Confused about tags - Is this model text-only or multimodal?

#2
by bandageshi - opened

Hi! I'm a beginner and a bit confused about this model's capabilities.

The tags on the page say image-text-to-text, but in your serving instructions, the vLLM command uses --language-model-only. Also, I noticed there's no preprocessor_config.json in the files.

Could you clarify if this model supports image inputs, or if it's purely for text?

Thanks for the cool model!

Hi @bandageshi. All of these models are trained for mxfp4 text + reasoning. Some of the serving instructions are test commands that my agents auto-uploaded after evals. I did not touch any of the vision components, but vision should still work.

Let me know if I can help out.

I just checked the model.safetensors.index.json and noticed that the vision layers are indeed still there. So theoretically, could I just copy over the preprocessor_config.json from the official Gemma 4 model to get the vision features working? Sounds a bit risky though lol.
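For anyone who wants to repeat that check, here is a minimal sketch of how the index could be scanned. It relies only on the `weight_map` field that `model.safetensors.index.json` uses to map tensor names to shard files. The marker substrings (`"vision"`, `"visual"`) and the helper names are my own assumptions, not something confirmed for this repo, so adjust them to whatever the tensor names actually look like.

```python
import json
from collections import Counter
from pathlib import Path


def load_weight_map(index_path):
    """Read model.safetensors.index.json and return its tensor-name -> shard mapping."""
    index = json.loads(Path(index_path).read_text())
    return index["weight_map"]


def summarize_vision_tensors(weight_map, markers=("vision", "visual")):
    """Count tensors whose names contain any marker substring.

    The markers are a guess at how vision weights are named (an assumption);
    returns (matched_count, total_count, {shard_file: matched_count}).
    """
    matched = {
        name: shard
        for name, shard in weight_map.items()
        if any(marker in name.lower() for marker in markers)
    }
    per_shard = Counter(matched.values())
    return len(matched), len(weight_map), dict(per_shard)


if __name__ == "__main__":
    # Assumes the index file has been downloaded into the current directory.
    weight_map = load_weight_map("model.safetensors.index.json")
    matched, total, per_shard = summarize_vision_tensors(weight_map)
    print(f"{matched} of {total} tensors look vision-related")
    for shard, count in sorted(per_shard.items()):
        print(f"  {shard}: {count}")
```

A non-zero count only shows that the weights are present in the checkpoint. It does not show that they still produce sensible outputs after fine-tuning and quantization, so image inputs would still need to be tested separately.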

By the way, I think this model is really cool! I've been hoping Google would release a mid-sized MoE model like gpt-oss-120b, but it's pretty clear they don't want open-source models eating Gemini's lunch.

I'll definitely play around with it and give it a try. An NVFP4 quantized version would be awesome if you ever plan on making one!
