MOFT-Lingshu-7B

MOFT-Lingshu-7B is a medical vision–language model (VLM) obtained by applying Multi-Objective Fine-Tuning (MOFT) on top of the open-source backbone Lingshu-7B.

MOFT splits linguistic fluency and medical semantic accuracy into two separate supervised fine-tuning trajectories, then merges them in parameter space via Linear Mode Connectivity (LMC) using linear interpolation. This mitigates gradient dilution from frequent syntactic tokens and reduces gradient interference between the two objectives. Training uses the curated dataset PSV2026 (~52K samples), including standard QA pairs and critique-based augmentation (CR).


Model summary

Item Details
Backbone lingshu-medical-mllm/Lingshu-7B (generalist multimodal medical understanding & reasoning)
Architecture Same as Lingshu-7B (medical VLM built on the Qwen2.5-VL family)
Fine-tuning MOFT: trajectory decoupling ((c=1.0) fluency / (c=2.5) medical entities) + linear parameter fusion (best (λ \approx 0.5) in the paper)
Training data PSV2026 from PathVQA, SLAKE, VQA-RAD, etc., cleaned and restructured with commercial VLMs; PSV2026-QA + PSV2026-CR (~26K each)
Medical entity weighting GPT-5 labels medical-entity tokens for higher loss weights; ~11.64% of tokens per response (paper)
Framework MS-Swift, full-parameter fine-tuning, bf16, DeepSpeed ZeRO-2
Hardware NVIDIA H20
Hyperparameters (paper) Global batch size 64, LR 5×10⁻⁶, warmup ratio 0.05, 3 epochs
Inference (paper) vLLM + greedy decoding; evaluation with rubric-guided LLM-as-a-judge (e.g., DeepSeek-V3.2)

Evaluation (PSV2026, paper Table)

Protocol: five-dimensional binary rubric + LLM judge on the PSV2026 test set (9,264 instances); accuracy (%) per question type.

Zero-shot backbone vs. MOFT (Medical / Proposed rows in the table)

Model Visual Perception Factual Verification Comparative Analysis Mechanistic Reasoning Diagnostic Inference Clinical Management Avg.
Lingshu-7B (zero-shot) 61.5 78.9 67.2 54.4 43.4 63.1 54.3
+SFT 71.1 84.2 74.8 64.1 51.7 72.8 63.0
MOFT-Lingshu-7B 73.1 86.1 76.9 66.1 54.9 72.5 65.5

Intended use & limitations

Research scope. This work primarily proposes MOFT as an adaptation strategy for medical VLMs that targets better accuracy than standard supervised fine-tuning (SFT) under a controlled benchmark setting. The released checkpoint is intended only for fundamental research—for example, studying multi-objective fine-tuning, parameter-space fusion, or rubric-based evaluation of medical VLMs—not as a finished clinical product or validated medical device.

Not for clinical use. The model must not be used as the sole basis for diagnosis, treatment, or any clinical decision. It may hallucinate, reflect biases, or overfit to training distributions; generalization to unseen diseases, populations, or imaging protocols is limited. Any exploratory use of outputs should be reviewed by qualified clinicians and must comply with local regulations governing medical AI and patient data.


How to use

Follow Lingshu-7B’s model card for chat templates, image preprocessing, and resolution limits. Load this repo as a drop-in replacement for the backbone path:

from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "ASD9987/MOFT-Lingshu-7B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)

The paper uses vLLM with greedy decoding for batched inference; validate latency, memory, and compliance before any deployment.


Training data & privacy

  • PSV2026 is built from public benchmarks (PathVQA, SLAKE, VQA-RAD) with automated cleaning and augmentation; see the paper Dataset section.
  • The release is not intended to contain identifiable patient data; users remain responsible for applicable medical-AI and data-governance laws.

Citation

If you use this model, cite Lingshu-7B and the MOFT / PSV2026 work (replace the MOFT entry with the final venue when published).

Lingshu (backbone)

@article{xu2025lingshu,
  title={Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning},
  author={Xu, Weiwen and Chan, Hou Pong and Li, Long and others},
  journal={arXiv preprint arXiv:2506.07044},
  year={2025}
}

MOFT (update venue when available)

@article{li2026moft,
  title={{MOFT}: Multi-Objective Fine-Tuning for Medical Vision-Language Models},
  author={Li, Yuchong and Zeng, Xiaojun and You, Caizhen and Wu, Pengbo and Guo, Zixian and Yang, Jian and Jia, Fucang and Zhang, Lei},
  note={Manuscript},
  year={2026}
}

Acknowledgments

The Lingshu backbone is credited to the original authors and the Hugging Face release. MOFT and PSV2026 follow collaborative work affiliated with The Hong Kong Polytechnic University and partners. Thanks to the open-source community and benchmark maintainers.

Downloads last month
3
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ASD9987/MOFT-Lingshu-7B

Finetuned
(4)
this model

Paper for ASD9987/MOFT-Lingshu-7B