Part of the LAPVQA collection.
A ViT-L/14 vision encoder trained from scratch on MIMIC-CXR using a sigmoid (multi-label binary cross-entropy) contrastive loss. This is an alternative to InfoNCE that scores each image-text pair independently instead of making pairs compete within the batch.
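The difference between the two losses can be sketched in plain Python. This is a minimal illustration, not the training code from this repository: the function names, the `scale` and `bias` values, and the toy similarity matrix are all assumptions made for the example, and a real training loop would operate on tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def sigmoid_pair_loss(sim, scale=10.0, bias=-10.0):
    """Per-pair sigmoid BCE over an n x n image-text similarity matrix.

    sim[i][j] is the cosine similarity between image i and text j.
    Matching pairs (i == j) get label +1, every other pair gets label -1.
    scale and bias are illustrative values, not taken from this model.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = scale * sim[i][j] + bias
            # Each pair adds its own term; no normalization across the batch.
            total -= log_sigmoid(label * logit)
    return total / n


def info_nce_loss(sim, scale=10.0):
    """Image-to-text InfoNCE, shown for contrast: softmax over each row."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        row = [scale * s for s in sim[i]]
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        # The matching text competes against every other text in the batch.
        total -= row[i] - log_z
    return total / n


# Toy batch of two pairs: aligned embeddings versus swapped ones.
aligned = [[0.9, 0.1], [0.2, 0.8]]
swapped = [[0.1, 0.9], [0.8, 0.2]]
print(sigmoid_pair_loss(aligned), sigmoid_pair_loss(swapped))
print(info_nce_loss(aligned), info_nce_loss(swapped))
```

In the sigmoid version, changing one off-diagonal similarity changes exactly one term of the sum, whereas in InfoNCE it changes the softmax normalizer for the whole row.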
| Component | Detail |
|---|---|
| Vision backbone | ViT-L/14, 24-layer, 1024-dim, 16-head, patch 14, 384 px |
| Text encoder | 6-layer, 512-dim bidirectional transformer, GPT-2 vocab (50,257 tokens) |
| Projection | Linear projection into a 512-dim shared embedding space |
| Loss | Per-pair sigmoid BCE (SigLIP-style) |
| Training data | MIMIC-CXR (physionet.org/content/mimic-cxr) |
| Epochs | 50 |
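A quick calculation shows what the backbone dimensions imply for sequence length. The sketch below assumes a non-overlapping patch embedding with stride equal to the patch size and no padding; the model card does not state how the remainder is handled or whether a class token is added, so treat the numbers as an estimate. `patch_grid` is a hypothetical helper written for this example.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int, int]:
    """Return (patches per side, total patches, unused pixels per side).

    Assumes stride == patch_size and no padding (an assumption, not
    something this model card confirms).
    """
    per_side = image_size // patch_size
    leftover = image_size - per_side * patch_size
    return per_side, per_side * per_side, leftover


per_side, tokens, leftover = patch_grid(384, 14)
head_dim = 1024 // 16  # hidden width divided by number of attention heads

print(f"{per_side} x {per_side} = {tokens} patch tokens, {leftover} px unused per side")
print(f"attention head dimension: {head_dim}")
```

Because 384 is not a multiple of 14, a few edge pixels fall outside the patch grid under this assumption.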
| Dataset | Mean AUC |
|---|---|
| NIH CXR-14 (14-class) | 0.650 |
| CheXpert-5 (5-class) | 0.785 |
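For readers who want to reproduce a number like the ones above, here is a minimal sketch of a mean AUC computation. It assumes "Mean AUC" is the unweighted average of per-class AUROC; the card does not spell out the averaging or the evaluation protocol. The function names and toy data are hypothetical.

```python
def binary_auc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both classes present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(label_matrix, score_matrix):
    """Unweighted mean of per-class AUROC.

    Rows are images, columns are classes (multi-label, 0/1 targets).
    """
    n_classes = len(label_matrix[0])
    aucs = []
    for c in range(n_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / n_classes


# Toy example: four images, two findings.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.3, 0.4], [0.6, 0.7], [0.4, 0.5]]
print(mean_auc(y_true, y_score))
```

The pairwise formulation is quadratic in the number of samples, which is fine for illustration; a rank-based implementation is preferable on full test sets.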
| File | Description |
|---|---|
| `encoder_final.pt` | Vision encoder weights at end of training |
| `model_best.pt` | Full model at best validation loss |
| `model_epochXXX.pt` | Periodic epoch snapshots (every 10 epochs) |
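```python
import torch
from lapvqa.pretrain.model import ContrastiveModel

# Load the vision-encoder state dict onto the CPU.
ckpt = torch.load("encoder_final.pt", map_location="cpu")

# Build the full contrastive model, then load only the vision-encoder weights.
model = ContrastiveModel()
model.vision_encoder.load_state_dict(ckpt)
model.eval()  # switch to inference mode
```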