MechVL-4B-SFT
The SFT checkpoint of MechVL, a domain-specialized multimodal model for mechanical engineering drawing understanding, introduced in:
MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding (ICML 2026)
Model description
MechVL-4B-SFT is initialized from Qwen3-VL-4B-Instruct and trained with full-parameter SFT on the LLM module (vision encoder and projection frozen) over the MechVQA training split. It serves as the reference policy (π_ref) for the subsequent RL stage.
| Field | Value |
|---|---|
| Base model | Qwen3-VL-4B-Instruct |
| Architecture | Qwen3VLForConditionalGeneration |
| Stage | 1 of 2: SFT (followed by RL) |
| MechVQA Total | 76.36 |
| RL checkpoint | MonteXiaofeng/MechVL-4B-RL |
Usage (transformers)
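import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the checkpoint in bfloat16 and place it on the available device(s).
model = AutoModelForImageTextToText.from_pretrained(
    "MonteXiaofeng/MechVL-4B-SFT", dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("MonteXiaofeng/MechVL-4B-SFT")

# One user turn: a drawing plus a question about it.
# The question reads: "What is the total length of the part as dimensioned in the drawing?"
messages = [{"role": "user", "content": [
    {"type": "image", "url": "path/to/drawing.png"},
    {"type": "text", "text": "图纸中标注的零件总长度是多少?"},
]}]

# Render the chat template and tokenize text and image in one call.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, dropping the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))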
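When asking several questions about the same drawing, the message structure from the usage snippet can be wrapped in a small helper. The sketch below is illustrative only: `build_messages` is a hypothetical name, not part of the model repository or of transformers, and the second question is made up for the example. It builds one single-turn conversation per question, mirroring the `messages` layout shown above.

```python
def build_messages(image, questions):
    """Return one single-turn conversation per question, all for the same image.

    `image` is a path or URL string, used exactly like the "url" field
    in the usage snippet. This helper is a hypothetical convenience.
    """
    return [
        [{"role": "user", "content": [
            {"type": "image", "url": image},
            {"type": "text", "text": question},
        ]}]
        for question in questions
    ]


conversations = build_messages(
    "path/to/drawing.png",
    [
        "What is the total length of the part as dimensioned in the drawing?",
        "Which views are shown in the drawing?",  # made-up example question
    ],
)
```

Each element of `conversations` can then be passed to `processor.apply_chat_template` in turn, in place of `messages` in the usage snippet.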
For batch vLLM inference (SFT/RL dual-mode), see scripts/batch_infer.py.
Training
Full-parameter SFT on the LLM module (vision tower frozen) over the MechVQA training split, with a unified response schema (rationale + concise final answer). See §4.1 of the paper.
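The frozen/trainable split described above can be sketched as a filter over parameter names. This is a minimal illustration, not the authors' training code: `freeze_by_prefix` is a hypothetical helper, and the prefix `model.visual.` and the toy parameter names are assumptions about how the vision tower and projection are named, so they should be checked against `model.named_parameters()` on the real checkpoint. Stand-in objects replace real tensors so the sketch runs without PyTorch.

```python
from types import SimpleNamespace


def freeze_by_prefix(named_params, frozen_prefixes):
    """Disable gradients for parameters whose name starts with a frozen prefix.

    `named_params` is any iterable of (name, param) pairs where `param` has a
    `requires_grad` attribute, e.g. the output of a PyTorch module's
    `named_parameters()`. Returns (trainable_names, frozen_names).
    """
    prefixes = tuple(frozen_prefixes)
    trainable, frozen = [], []
    for name, param in named_params:
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen.append(name)
        else:
            param.requires_grad = True
            trainable.append(name)
    return trainable, frozen


# Toy stand-ins for parameters; the names are assumed, not read from the checkpoint.
toy_params = [
    ("model.visual.blocks.0.attn.qkv.weight", SimpleNamespace(requires_grad=True)),
    ("model.visual.merger.linear_fc1.weight", SimpleNamespace(requires_grad=True)),
    ("model.language_model.layers.0.mlp.up_proj.weight", SimpleNamespace(requires_grad=True)),
    ("lm_head.weight", SimpleNamespace(requires_grad=True)),
]

trainable, frozen = freeze_by_prefix(toy_params, ["model.visual."])
```

With a loaded model, the same call would take `model.named_parameters()` as its first argument; only the parameters left trainable would then be handed to the optimizer.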
Citation
@misc{kou2026mechvqabenchmarkingenhancingmultimodal,
title={MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding},
author={Qian Kou and Xiaofeng Shi and Yulin Li and Xiaosong Qiu and Xinyang Wang and Hua Zhou and Cao Dongxing},
year={2026},
eprint={2605.30794},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.30794}
}
License
Apache-2.0.