LAPVQA
Collection
Chest X-ray models: pre-trained encoders and task heads for VQA, DiffVQA, RRG, detection, and grounding on MIMIC-CXR. β’ 14 items β’ Updated
Part of the LAPVQA collection.
Task heads for Differential VQA: given a prior and a current chest X-ray,
answer questions about radiological changes. Trained on MIMIC-Diff-VQA with five
frozen encoders. Each .pt file is a plain state dict of DiffVQAHead.
DiffVQAHead
vis_proj : Linear(vis_dim β 512) # shared for both images
frame_emb : Embedding(2, 512) # 0=reference, 1=current
memory : [ref_proj + frame_emb(0) ; curr_proj + frame_emb(1)] β [B, 2N, 512]
tok_emb : Embedding(50257, 512)
pos_emb : Embedding(200, 512)
decoder : 6 Γ TransformerDecoderLayer (pre-norm)
lm_head : Linear(512 β 50257, bias=False)
| File | Encoder | vis_dim |
|---|---|---|
clip-vit-l14_best.pt |
CLIP ViT-L/14 | 1024 |
coca_best.pt |
CoCa | 768 |
florence2_best.pt |
Florence-2 | 1024 |
siglip_best.pt |
SigLIP | 1152 |
owlv2_best.pt |
OWLv2 | 1024 |
| Encoder | BLEU-1 | BLEU-4 | ROUGE-1 | RadGraph-s |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.184 | 0.128 | 0.336 | 0.322 |
| CoCa | 0.196 | 0.138 | 0.320 | 0.317 |
| Florence-2 | 0.191 | 0.138 | 0.319 | 0.318 |
| SigLIP | 0.186 | 0.131 | 0.322 | 0.313 |
import torch
import tiktoken
from lapvqa.diffvqa.model import DiffVQAHead
ckpt = torch.load("coca_best.pt", map_location="cpu")
head = DiffVQAHead(vis_dim=768) # adjust vis_dim per encoder
head.load_state_dict(ckpt)
head.eval()
enc = tiktoken.get_encoding("gpt2")
bos_id = eos_id = enc.eot_token
# curr_vis, ref_vis: [B, N, vis_dim] β patch tokens from the frozen encoder
answers = head.generate(
curr_vis = curr_vis,
ref_vis = ref_vis,
prompt_ids = question_ids, # [B, Q]
bos_id = bos_id,
eos_id = eos_id,
max_new_tokens = 128,
)
decoded = [enc.decode(ids) for ids in answers]