---
license: mit
language:
  - en
tags:
  - medical
  - vision
widget:
  - src: https://d168r5mdg5gtkq.cloudfront.net/medpix/img/full/synpic9078.jpg
    candidate_labels: Chest X-Ray, Brain MRI, Abdomen CT Scan
    example_title: Abdomen CT Scan
---

# Model Card for PubMedCLIP

PubMedCLIP is a fine-tuned version of CLIP for the medical domain.

## Model Description

PubMedCLIP was trained on the Radiology Objects in COntext (ROCO) dataset, a large-scale multimodal medical imaging dataset. The ROCO dataset covers diverse imaging modalities (such as X-Ray, MRI, ultrasound, and fluoroscopy) and various human body regions (such as head, spine, chest, and abdomen), with images extracted from open-access PubMed articles.

The authors of PubMedCLIP have released three pre-trained models, which use ResNet-50, ResNet-50x4 and ViT-B/32 as image encoders. This repository includes only the ViT-B/32 variant of PubMedCLIP.

## Use with Transformers

```python
import requests
from PIL import Image

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("flaviagiammarino/pubmed-clip-vit-base-patch32")

url = "https://d168r5mdg5gtkq.cloudfront.net/medpix/img/full/synpic9078.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["Chest X-Ray", "Brain MRI", "Abdominal CT Scan"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
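
The same zero-shot classification can also be run through the Transformers `pipeline` API. The sketch below assumes the checkpoint is compatible with the standard `zero-shot-image-classification` pipeline; the candidate labels are simply the ones from the example above.

```python
from transformers import pipeline

# Load the checkpoint into the zero-shot image classification pipeline
# (assumes this checkpoint works with the standard pipeline task).
classifier = pipeline("zero-shot-image-classification", model="flaviagiammarino/pubmed-clip-vit-base-patch32")

url = "https://d168r5mdg5gtkq.cloudfront.net/medpix/img/full/synpic9078.jpg"

# The pipeline returns a list of {"label", "score"} dicts sorted by score
predictions = classifier(url, candidate_labels=["Chest X-Ray", "Brain MRI", "Abdominal CT Scan"])
for prediction in predictions:
    print(f"{prediction['label']}: {prediction['score']:.3f}")
```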

## Additional Information

### Licensing Information

The authors have released the model code and pre-trained checkpoints under the MIT License.

### Citation Information

```bibtex
@article{eslami2021does,
  title={Does clip benefit visual question answering in the medical domain as much as it does in the general domain?},
  author={Eslami, Sedigheh and de Melo, Gerard and Meinel, Christoph},
  journal={arXiv preprint arXiv:2112.13906},
  year={2021}
}
```