LAPVQA — Pretrain (Sigmoid)

Description

A ViT-L/14 vision encoder trained from scratch on MIMIC-CXR with a sigmoid contrastive loss (binary cross-entropy applied to every image-text pair). Unlike InfoNCE, which normalizes each pair's score against the other pairs in the batch with a softmax, the sigmoid loss scores each image-text pair independently.
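
To make the per-pair objective concrete, here is a minimal, dependency-free sketch of a SigLIP-style sigmoid loss. It is not the training code of this repository. The function names are hypothetical, the embeddings are assumed to be L2-normalized, and the `temperature` and `bias` defaults follow the initial values reported for SigLIP; this card does not state the values actually used.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def sigmoid_pair_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Per-pair sigmoid loss over a batch of paired embeddings.

    img_emb, txt_emb: lists of equal-length vectors, assumed L2-normalized,
    where img_emb[i] and txt_emb[i] form a matching pair.
    temperature, bias: illustrative defaults (assumption, see lead-in).
    """
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Dot product of image i with text j.
            sim = sum(a * b for a, b in zip(img_emb[i], txt_emb[j]))
            logit = temperature * sim + bias
            # +1 for the matching pair, -1 for every other pairing.
            label = 1.0 if i == j else -1.0
            # Each pair is its own binary classification problem;
            # no softmax over the batch is involved.
            total -= log_sigmoid(label * logit)
    return total / n


# Toy batch of two pairs with orthogonal unit vectors.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = sigmoid_pair_loss(images, aligned_texts)
loss_swapped = sigmoid_pair_loss(images, swapped_texts)
```

With these toy inputs the correctly paired batch yields a much lower loss than the swapped one, since every term depends only on its own pair's similarity.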

Architecture

Component	Detail
Vision backbone	ViT-L/14, 24-layer, 1024-dim, 16-head, patch 14, 384 px
Text encoder	6-layer, 512-dim bidirectional transformer, GPT-2 vocab (50 257)
Projection	Linear → 512-dim shared embedding space
Loss	Per-pair sigmoid BCE (SigLIP-style)
Training data	MIMIC-CXR (physionet.org/content/mimic-cxr)
Epochs	50
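
The backbone numbers above imply a token count and a per-head width that can be worked out directly. The sketch below assumes an unpadded, strided patch embedding (stride equal to the patch size), which is common in ViT implementations; this card does not say how the 384 px input is handled given that 384 is not a multiple of 14, so the actual implementation may pad or resize instead. The helper name `num_patches` is hypothetical.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Patch count for a square image under an unpadded strided patch embedding.

    Any remainder pixels at the right and bottom edges are dropped
    (assumption: no padding, stride == patch_size).
    """
    per_side = (image_size - patch_size) // patch_size + 1
    return per_side * per_side


# Values taken from the architecture table.
IMAGE_SIZE = 384
PATCH_SIZE = 14
HIDDEN_DIM = 1024
NUM_HEADS = 16

patches = num_patches(IMAGE_SIZE, PATCH_SIZE)  # 27 patches per side
head_dim = HIDDEN_DIM // NUM_HEADS             # width of each attention head

print(f"patches per image: {patches}")
print(f"attention head dim: {head_dim}")
```

Under that assumption each image becomes 27 × 27 = 729 patch tokens, and each of the 16 heads works on a 64-dimensional slice. Whether an extra class token is prepended is not stated here.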

Downstream Evaluation (frozen encoder + linear probe)

Dataset	Mean AUC
NIH CXR-14 (14-class)	0.650
CheXpert-5 (5-class)	0.785
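
For reference, a mean AUC of this kind can be computed as sketched below. This assumes the reported figure is the unweighted mean of per-class ROC AUCs over multi-label predictions, which this card does not confirm. The function names are hypothetical and the data is a toy example, not output from this model. The pairwise formulation is quadratic in the number of samples and is meant for illustration only.

```python
def binary_auc(labels, scores):
    """ROC AUC for one class via the Mann-Whitney statistic.

    Equals the probability that a random positive is scored above a
    random negative, with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positives and negatives")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(label_rows, score_rows):
    """Unweighted mean of per-class AUCs; rows are samples, columns are classes."""
    n_classes = len(label_rows[0])
    per_class = [
        binary_auc([row[c] for row in label_rows], [row[c] for row in score_rows])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Toy multi-label example: 4 samples, 2 classes.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.35, 0.6]]

result = mean_auc(toy_labels, toy_scores)
```

In the toy example the first class is ranked perfectly (AUC 1.0) and the second has one misordered pair out of four (AUC 0.75), so the mean is 0.875.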

Files

File	Description
`encoder_final.pt`	Vision encoder weights at end of training
`model_best.pt`	Full model at best validation loss
`model_epochXXX.pt`	Periodic epoch snapshots (every 10 epochs)

Usage

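import torch
from lapvqa.pretrain.model import ContrastiveModel

# encoder_final.pt holds the vision-encoder weights only (see Files)
ckpt = torch.load("encoder_final.pt", map_location="cpu")

# Build the contrastive model, then load the weights into its vision encoder
model = ContrastiveModel()
model.vision_encoder.load_state_dict(ckpt)
model.eval()  # switch to inference mode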
