On the Compositional Generalization of Multimodal LLMs for Medical Imaging
Abstract
Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need to understand which kinds of images MLLMs can use for generalization. Current research suggests that multi-task training outperforms single-task training because different tasks can benefit each other, but these studies often overlook the internal relationships among tasks and therefore provide limited guidance on selecting datasets to enhance specific tasks. To analyze this phenomenon, we employ compositional generalization (CG), the ability of models to understand novel combinations by recombining learned elements, as a guiding framework. Because medical images can be precisely defined by Modality, Anatomical area, and Task, they naturally provide an environment for exploring CG. We therefore assembled 106 medical datasets to create Med-MAT for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Additionally, further studies demonstrated that CG effectively supports datasets with limited data and delivers consistent performance across different backbones, highlighting its versatility and broad applicability. Med-MAT is publicly available at https://github.com/FreedomIntelligence/Med-MAT.
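To make the setup concrete, here is a minimal sketch of how a compositional-generalization split over (Modality, Anatomical area, Task) triplets could be constructed. The dataset names and triplet values below are illustrative assumptions, not the actual Med-MAT catalogue or the authors' code.

```python
# Illustrative MAT triplets (Modality, Anatomical area, Task); the entries
# below are hypothetical examples, not the real Med-MAT datasets.
datasets = [
    {"name": "ds_01", "modality": "X-ray", "anatomy": "Chest", "task": "Diagnosis"},
    {"name": "ds_02", "modality": "CT",    "anatomy": "Chest", "task": "Diagnosis"},
    {"name": "ds_03", "modality": "X-ray", "anatomy": "Bones", "task": "Diagnosis"},
    {"name": "ds_04", "modality": "CT",    "anatomy": "Bones", "task": "Diagnosis"},
]

def cg_split(datasets, held_out):
    """Hold out every dataset whose (modality, anatomy, task) triplet appears
    in `held_out`; the remaining datasets form the multi-task training pool.
    The held-out combination is unseen as a whole, but each of its individual
    elements still occurs in training, which is the CG condition."""
    train, test = [], []
    for d in datasets:
        triplet = (d["modality"], d["anatomy"], d["task"])
        (test if triplet in held_out else train).append(d)
    return train, test

# Hold out the CT + Bones + Diagnosis combination: CT, Bones, and Diagnosis
# each occur in training; only their combination is novel at test time.
train_sets, test_sets = cg_split(datasets, {("CT", "Bones", "Diagnosis")})
print([d["name"] for d in train_sets])  # ['ds_01', 'ds_02', 'ds_03']
print([d["name"] for d in test_sets])   # ['ds_04']
```

Under such a split, testing on the held-out combination probes whether the model recombines elements it has seen in training rather than merely memorizing seen dataset types.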
Community
Multimodal LLMs can use compositional generalization to understand unseen images.
The very first example question to the MLLM contains a prominent typo. I have a lot of trouble taking what should be painstakingly detailed research seriously when the training and evaluation datasets' underlying information theory and accuracy are suspicious from the outset.
Thank you for taking the time to review our paper. I sincerely apologize for the poor reading experience caused by my rushed submission. I will carefully check and correct all spelling errors in the latest version of the paper.
Regarding the error you mentioned, it only appears in an example in the paper's illustration and is not related to the real dataset used in the paper. I have just rechecked all the prompts in the Med-MAT dataset and found no errors.
I sincerely hope you will be able to review the paper again at your convenience. Your feedback is invaluable, and we will do our best to improve the paper so that it provides useful insights for the community.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities (2024)
- FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data (2024)
- Generalizable Single-Source Cross-modality Medical Image Segmentation via Invariant Causal Mechanisms (2024)
- A Survey of Medical Vision-and-Language Applications and Their Techniques (2024)
- Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine (2024)
- Multimodal Fusion Learning with Dual Attention for Medical Imaging (2024)
- Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models (2024)