Model Card for WhyLesionCLIP 👍🏽

Model Details
Get Started
Uses
Training Details
Evaluation
Citation

Model Details

WhyLesionCLIP aligns skin lesion images with text descriptions. It is fine-tuned from OpenCLIP (ViT-L/14) on ISIC images paired with clinical reports generated by GPT-4V. WhyLesionCLIP significantly outperforms PubMedCLIP, BioMedCLIP, and other CLIP baselines in zero-shot classification and linear probing on several skin lesion datasets (see results in Evaluation).

Although our CLIP models benefit from careful data curation, training converges quickly. This suggests that the current contrastive objective may not fully exploit the information in the data and may instead take shortcuts, such as telling images from different patients apart rather than focusing on diseases. Future research should explore more suitable objectives and larger-scale data collections to develop more robust medical foundation models.

Paper: https://arxiv.org/pdf/2405.14839
Website: https://yueyang1996.github.io/knobo/
Repository: https://github.com/YueYANG1996/KnoBo

How to Get Started with the Model

Use the code below to get started with the model.

pip install open_clip_torch

import torch
from PIL import Image
import open_clip

# Load the fine-tuned model and its image preprocessing pipeline from the Hugging Face Hub.
model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:yyupenn/whylesionclip")
model.eval()
tokenizer = open_clip.get_tokenizer("ViT-L-14")

# Prepare one image (with a batch dimension) and the candidate text descriptions.
image = preprocess(Image.open("test_skin.jpg")).unsqueeze(0)
text = tokenizer(["dark brown", "bleeding", "irregular shape"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so that the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities and apply a softmax over the text descriptions.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

Uses

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot medical image (skin lesion) classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models.

Direct Use

WhyLesionCLIP can be used for zero-shot skin lesion classification. You can use it to compute the similarity between a skin lesion image and a text description.

Downstream Use

WhyLesionCLIP can be used as a feature extractor: the embeddings it produces for skin lesion images and text descriptions can serve as inputs to downstream tasks, such as linear probing.

Out-of-Scope Use

WhyLesionCLIP should not be used for clinical diagnosis or treatment. It is not intended to be used for any clinical decision-making. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Training Details

Training Data

We use the ISIC dataset and generate clinical reports for 56,590 images with GPT-4V. We preprocess these reports by extracting medically relevant findings, each expressed as a short, concise term. In total, we assemble 438K image-text pairs for training WhyLesionCLIP.
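To illustrate how per-image findings become training pairs, the sketch below expands a mapping from image to extracted findings into (image, text) pairs. The function name `build_pairs`, the file names, and the whitespace stripping and de-duplication steps are assumptions made for illustration, not the authors' preprocessing code. With the figures above, 438K pairs over 56,590 images works out to roughly 7.7 findings per image on average.

```python
def build_pairs(findings_by_image):
    """Expand {image_id: [finding, ...]} into a flat list of (image_id, finding) pairs.

    Assumption: findings are stripped of surrounding whitespace, empty strings
    are skipped, and repeated findings for the same image are kept only once.
    """
    pairs = []
    for image_id, findings in findings_by_image.items():
        seen = set()
        for finding in findings:
            text = finding.strip()
            if text and text not in seen:
                seen.add(text)
                pairs.append((image_id, text))
    return pairs


# Hypothetical example; the file names and findings are placeholders.
example = {
    "image_001.jpg": ["dark brown", "irregular shape", "dark brown"],
    "image_002.jpg": ["bleeding"],
}
example_pairs = build_pairs(example)

# Average number of pairs per image implied by the reported totals.
avg_pairs_per_image = 438_000 / 56_590
```

Each resulting pair can then be treated as one image-text training example.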

Training Details

We utilize the training script from OpenCLIP and select ViT-L/14 as the backbone. Training is performed on 4 RTX A6000 GPUs for 10 epochs with a batch size of 128 and a learning rate of 1e−5. We choose checkpoints based on the lowest contrastive loss on validation sets.
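The checkpoint criterion can be sketched as follows, assuming the standard symmetric CLIP contrastive loss (cross-entropy over image-to-text and text-to-image similarities within a batch). The card does not spell out the exact loss form, so treat this as an illustration rather than the training code: the function names, the logit scale of 100 (borrowed from the scaling in the Get Started snippet), and the validation losses are all hypothetical. Plain Python lists are used to keep the example dependency-free.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric contrastive loss for a batch of L2-normalized embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matched pair; every
    other combination in the batch acts as a negative.
    """
    n = len(image_embs)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_embs]
        for img in image_embs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


def pick_checkpoint(val_loss_by_checkpoint):
    """Return the checkpoint name with the lowest validation loss."""
    return min(val_loss_by_checkpoint, key=val_loss_by_checkpoint.get)


# Hypothetical validation losses, one per saved epoch.
best = pick_checkpoint({"epoch_1": 0.92, "epoch_2": 0.41, "epoch_3": 0.57})
```

A lower value means matched image-text pairs are scored above the mismatched pairs in the same batch, which is what the selection rule rewards.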

Evaluation

Testing Data

We evaluate on five skin lesion classification datasets: HAM10000, BCN20000, PAD-UFES-20, Melanoma, and UWaterloo. We report zero-shot and linear probing accuracy on all five.

Baselines

We compare various CLIP models, including OpenAI-CLIP, OpenCLIP, PubMedCLIP, BioMedCLIP, PMC-CLIP and MedCLIP. We evaluate these models in both zero-shot and linear probe scenarios. In zero-shot, GPT-4 generates prompts for each class, and we use the ensemble of cosine similarities between the image and prompts as the score for each class. In linear probing, we use the CLIP models as image encoders to extract features for logistic regression. Additionally, we include DenseNet-121 (fine-tuned on the pretraining datasets with cross-entropy loss) as a baseline for linear probing.
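The zero-shot scoring described above can be sketched as follows. This assumes that "ensemble" means averaging the cosine similarities over a class's prompts; the card does not state the exact aggregation, so the mean is an assumption. The function names, class labels, and toy two-dimensional embeddings are placeholders for illustration; real embeddings would come from the model's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ensemble_scores(image_emb, prompt_embs_by_class):
    """Score each class by the mean cosine similarity over its prompts."""
    return {
        label: sum(cosine(image_emb, p) for p in prompts) / len(prompts)
        for label, prompts in prompt_embs_by_class.items()
    }


def predict(image_emb, prompt_embs_by_class):
    """Return the class whose prompt ensemble scores highest."""
    scores = ensemble_scores(image_emb, prompt_embs_by_class)
    return max(scores, key=scores.get)


# Toy example with hypothetical class names and 2-D embeddings.
toy_prompts = {
    "class_a": [[1.0, 0.0], [0.8, 0.6]],
    "class_b": [[0.0, 1.0], [0.6, 0.8]],
}
toy_prediction = predict([1.0, 0.0], toy_prompts)
```

Averaging over several prompts per class reduces the sensitivity of the prediction to the wording of any single prompt.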

Results

The figure below shows the zero-shot and linear probe performance of different models, averaged over the five skin lesion datasets.

Citation

Please cite our paper if you use this model in your work:

@article{yang2024textbook,
 title={A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis}, 
 author={Yue Yang and Mona Gandhi and Yufei Wang and Yifan Wu and Michael S. Yao and Chris Callison-Burch and James C. Gee and Mark Yatskar},
 journal={arXiv preprint arXiv:2405.14839},
 year={2024}
}
