LAPVQA - Pretrain (Contrastive)

Part of the LAPVQA collection.

Description

A ViT-L/14 vision encoder trained from scratch on paired MIMIC-CXR chest X-rays (CXR) and radiology reports with InfoNCE contrastive learning, which aligns the image encoder with a 6-layer bidirectional text encoder. The vision encoder is intended for use as a frozen feature extractor in downstream CXR tasks.
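To make the training objective concrete, here is a minimal, dependency-free sketch of a symmetric InfoNCE loss over a batch of paired embeddings. It is an illustration, not the repository's implementation: the function names and the `temperature=0.07` default are assumptions, and the actual training code may use a learned temperature or different normalization.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of image_emb is paired with row i of text_emb.

    All other rows in the batch act as negatives. The temperature value
    is a hypothetical default, not taken from the model card.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    # Image-to-text: softmax over each row, target is the diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: softmax over each column, target is the diagonal entry.
    cols = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With correctly matched pairs the loss approaches zero; if every embedding in the batch is identical, it equals the log of the batch size, because the softmax cannot tell the positive apart from the negatives.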

Architecture

| Component | Detail |
|---|---|
| Vision backbone | ViT-L/14, 24-layer, 1024-dim, 16-head, patch 14, 384 px |
| Text encoder | 6-layer, 512-dim bidirectional transformer, GPT-2 vocab (50,257) |
| Projection | Linear → 512-dim shared embedding space |
| Loss | InfoNCE (symmetric softmax cross-entropy) |
| Training data | MIMIC-CXR (physionet.org/content/mimic-cxr) |
| Epochs | 50 |
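The number of patch tokens follows from the listed input size and patch size. The sketch below assumes a non-overlapping, unpadded patch embedding (for example, a strided convolution). The model card does not confirm this, and an implementation that pads or resizes the input would give a different grid. `patch_grid` is a hypothetical helper, and any class token is not counted.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, leftover pixels per side).

    Assumes non-overlapping patches with no padding, so pixels that do
    not fill a whole patch at the border are dropped.
    """
    per_side = image_size // patch_size
    leftover = image_size - per_side * patch_size
    return per_side, leftover

per_side, leftover = patch_grid(384, 14)
num_patches = per_side * per_side
print(per_side, num_patches, leftover)  # 27 729 6
```

Because 384 is not a multiple of 14, this assumption leaves 6 pixels per side unused. Check the patch embedding in the repository before relying on the token count.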

Downstream Evaluation (frozen encoder + linear probe)

| Dataset | Mean AUC |
|---|---|
| NIH CXR-14 (14-class) | 0.653 |
| CheXpert-5 (5-class) | 0.759 |
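For readers reproducing these numbers, the sketch below shows one common way to compute a mean AUC for multi-label predictions: a per-class ROC AUC followed by an unweighted average over classes. The model card does not state its averaging scheme, so the unweighted mean is an assumption, and `binary_auc` and `mean_auc` are hypothetical helper names.

```python
def binary_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    Counts the fraction of (positive, negative) pairs where the positive
    is scored higher; tied pairs count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positives and negatives")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(label_matrix, score_matrix):
    """Unweighted mean of per-class AUCs; rows are samples, columns are classes."""
    num_classes = len(label_matrix[0])
    aucs = []
    for c in range(num_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / num_classes
```

The pairwise formulation is quadratic in the number of samples, so it is only meant to show the definition. A library routine is the practical choice for full test sets.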

Files

| File | Description |
|---|---|
| encoder_final.pt | Vision encoder weights at the end of training |
| model_best.pt | Full model (encoder + text encoder) at best val loss |

Usage

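import torch
from lapvqa.pretrain.model import ContrastiveModel

# encoder_final.pt holds only the vision encoder weights (a state dict).
ckpt = torch.load("encoder_final.pt", map_location="cpu")
model = ContrastiveModel()
# Load the weights into the vision branch; the text encoder stays at its initial values.
model.vision_encoder.load_state_dict(ckpt)
model.eval()  # switch to inference mode (disables dropout)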

Citation

If you use these weights, please cite MIMIC-CXR:

@article{johnson2019mimic,
  title   = {MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports},
  author  = {Johnson, Alistair EW and others},
  journal = {Scientific Data},
  volume  = {6}, pages = {317}, year = {2019}
}