Instructions to use ASD9987/MOFT-Lingshu-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ASD9987/MOFT-Lingshu-7B with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("visual-question-answering", model="ASD9987/MOFT-Lingshu-7B")# Load model directly from transformers import AutoProcessor, AutoModelForMultimodalLM processor = AutoProcessor.from_pretrained("ASD9987/MOFT-Lingshu-7B") model = AutoModelForMultimodalLM.from_pretrained("ASD9987/MOFT-Lingshu-7B") - Notebooks
- Google Colab
- Kaggle
MOFT-Lingshu-7B
MOFT-Lingshu-7B is a medical vision–language model (VLM) obtained by applying Multi-Objective Fine-Tuning (MOFT) on top of the open-source backbone Lingshu-7B.
MOFT splits linguistic fluency and medical semantic accuracy into two separate supervised fine-tuning trajectories, then merges them in parameter space via Linear Mode Connectivity (LMC) using linear interpolation. This mitigates gradient dilution from frequent syntactic tokens and reduces gradient interference between the two objectives. Training uses the curated dataset PSV2026 (~52K samples), including standard QA pairs and critique-based augmentation (CR).
- Project page: https://lycus99.github.io/MOFT/
- Code & resources: https://github.com/Lycus99/MOFT
Model summary
| Item | Details |
|---|---|
| Backbone | lingshu-medical-mllm/Lingshu-7B (generalist multimodal medical understanding & reasoning) |
| Architecture | Same as Lingshu-7B (medical VLM built on the Qwen2.5-VL family) |
| Fine-tuning | MOFT: trajectory decoupling ((c=1.0) fluency / (c=2.5) medical entities) + linear parameter fusion (best (λ \approx 0.5) in the paper) |
| Training data | PSV2026 from PathVQA, SLAKE, VQA-RAD, etc., cleaned and restructured with commercial VLMs; PSV2026-QA + PSV2026-CR (~26K each) |
| Medical entity weighting | GPT-5 labels medical-entity tokens for higher loss weights; ~11.64% of tokens per response (paper) |
| Framework | MS-Swift, full-parameter fine-tuning, bf16, DeepSpeed ZeRO-2 |
| Hardware | NVIDIA H20 |
| Hyperparameters (paper) | Global batch size 64, LR 5×10⁻⁶, warmup ratio 0.05, 3 epochs |
| Inference (paper) | vLLM + greedy decoding; evaluation with rubric-guided LLM-as-a-judge (e.g., DeepSeek-V3.2) |
Evaluation (PSV2026, paper Table)
Protocol: five-dimensional binary rubric + LLM judge on the PSV2026 test set (9,264 instances); accuracy (%) per question type.
Zero-shot backbone vs. MOFT (Medical / Proposed rows in the table)
| Model | Visual Perception | Factual Verification | Comparative Analysis | Mechanistic Reasoning | Diagnostic Inference | Clinical Management | Avg. |
|---|---|---|---|---|---|---|---|
| Lingshu-7B (zero-shot) | 61.5 | 78.9 | 67.2 | 54.4 | 43.4 | 63.1 | 54.3 |
| +SFT | 71.1 | 84.2 | 74.8 | 64.1 | 51.7 | 72.8 | 63.0 |
| MOFT-Lingshu-7B | 73.1 | 86.1 | 76.9 | 66.1 | 54.9 | 72.5 | 65.5 |
Intended use & limitations
Research scope. This work primarily proposes MOFT as an adaptation strategy for medical VLMs that targets better accuracy than standard supervised fine-tuning (SFT) under a controlled benchmark setting. The released checkpoint is intended only for fundamental research—for example, studying multi-objective fine-tuning, parameter-space fusion, or rubric-based evaluation of medical VLMs—not as a finished clinical product or validated medical device.
Not for clinical use. The model must not be used as the sole basis for diagnosis, treatment, or any clinical decision. It may hallucinate, reflect biases, or overfit to training distributions; generalization to unseen diseases, populations, or imaging protocols is limited. Any exploratory use of outputs should be reviewed by qualified clinicians and must comply with local regulations governing medical AI and patient data.
How to use
Follow Lingshu-7B’s model card for chat templates, image preprocessing, and resolution limits. Load this repo as a drop-in replacement for the backbone path:
from transformers import AutoModelForCausalLM, AutoProcessor
model_id = "ASD9987/MOFT-Lingshu-7B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype="bfloat16",
device_map="auto",
trust_remote_code=True,
)
The paper uses vLLM with greedy decoding for batched inference; validate latency, memory, and compliance before any deployment.
Training data & privacy
- PSV2026 is built from public benchmarks (PathVQA, SLAKE, VQA-RAD) with automated cleaning and augmentation; see the paper Dataset section.
- The release is not intended to contain identifiable patient data; users remain responsible for applicable medical-AI and data-governance laws.
Citation
If you use this model, cite Lingshu-7B and the MOFT / PSV2026 work (replace the MOFT entry with the final venue when published).
Lingshu (backbone)
@article{xu2025lingshu,
title={Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning},
author={Xu, Weiwen and Chan, Hou Pong and Li, Long and others},
journal={arXiv preprint arXiv:2506.07044},
year={2025}
}
MOFT (update venue when available)
@article{li2026moft,
title={{MOFT}: Multi-Objective Fine-Tuning for Medical Vision-Language Models},
author={Li, Yuchong and Zeng, Xiaojun and You, Caizhen and Wu, Pengbo and Guo, Zixian and Yang, Jian and Jia, Fucang and Zhang, Lei},
note={Manuscript},
year={2026}
}
Acknowledgments
The Lingshu backbone is credited to the original authors and the Hugging Face release. MOFT and PSV2026 follow collaborative work affiliated with The Hong Kong Polytechnic University and partners. Thanks to the open-source community and benchmark maintainers.
- Downloads last month
- 3
Model tree for ASD9987/MOFT-Lingshu-7B
Base model
lingshu-medical-mllm/Lingshu-7B