---
tags:
- chest-xray
- radiology
- visual-question-answering
- differential-vqa
- mimic-cxr
license: apache-2.0
---

# LAPVQA — Differential VQA (Frozen Off-the-shelf Encoders)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

Task heads for **Differential VQA**: given a *prior* and a *current* chest X-ray, answer questions about radiological changes. Trained on MIMIC-Diff-VQA with five **frozen** encoders. Each `.pt` file is a plain state dict of `DiffVQAHead`.

## Architecture — `DiffVQAHead`

```
vis_proj  : Linear(vis_dim → 512)       # shared for both images
frame_emb : Embedding(2, 512)           # 0=reference, 1=current
memory    : [ref_proj + frame_emb(0) ; curr_proj + frame_emb(1)] → [B, 2N, 512]
tok_emb   : Embedding(50257, 512)
pos_emb   : Embedding(200, 512)
decoder   : 6 × TransformerDecoderLayer (pre-norm)
lm_head   : Linear(512 → 50257, bias=False)
```

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
| `coca_best.pt` | CoCa | 768 |
| `florence2_best.pt` | Florence-2 | 1024 |
| `siglip_best.pt` | SigLIP | 1152 |
| `owlv2_best.pt` | OWLv2 | 1024 |

## Results (test set)

| Encoder | BLEU-1 | BLEU-4 | ROUGE-1 | RadGraph-s |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.184 | 0.128 | 0.336 | 0.322 |
| CoCa | 0.196 | 0.138 | 0.320 | 0.317 |
| Florence-2 | 0.191 | 0.138 | 0.319 | 0.318 |
| SigLIP | 0.186 | 0.131 | 0.322 | 0.313 |

## Loading

```python
import torch
import tiktoken
from lapvqa.diffvqa.model import DiffVQAHead

ckpt = torch.load("coca_best.pt", map_location="cpu")
head = DiffVQAHead(vis_dim=768)  # adjust vis_dim per encoder
head.load_state_dict(ckpt)
head.eval()

enc = tiktoken.get_encoding("gpt2")
bos_id = eos_id = enc.eot_token

# curr_vis, ref_vis: [B, N, vis_dim] — patch tokens from the frozen encoder
answers = head.generate(
    curr_vis       = curr_vis,
    ref_vis        = ref_vis,
    prompt_ids     = question_ids,  # [B, Q]
    bos_id         = bos_id,
    eos_id         = eos_id,
    max_new_tokens = 128,
)
decoded = [enc.decode(ids) for ids in answers]
```
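
Because `vis_dim` has to match the encoder behind each checkpoint, it can help to look it up from the file name instead of hard-coding it. The following is a minimal sketch that assumes the file names and dimensions listed in the table above; `VIS_DIMS`, `vis_dim_for`, and `memory_shape` are hypothetical helper names for illustration and are not part of the `lapvqa` package.

```python
import os

# Checkpoint file name -> encoder feature width, copied from the table above.
VIS_DIMS = {
    "clip-vit-l14_best.pt": 1024,
    "coca_best.pt": 768,
    "florence2_best.pt": 1024,
    "siglip_best.pt": 1152,
    "owlv2_best.pt": 1024,
}

# Width of the shared projection space in the architecture listing.
HIDDEN_DIM = 512


def vis_dim_for(checkpoint_path: str) -> int:
    """Return the vis_dim that matches a checkpoint file (hypothetical helper)."""
    name = os.path.basename(checkpoint_path)
    if name not in VIS_DIMS:
        raise KeyError(f"unknown checkpoint {name!r}; expected one of {sorted(VIS_DIMS)}")
    return VIS_DIMS[name]


def memory_shape(batch_size: int, num_patches: int) -> tuple:
    """Shape of the decoder memory: reference and current tokens concatenated."""
    return (batch_size, 2 * num_patches, HIDDEN_DIM)


# Intended use (assumption: DiffVQAHead is imported as in the Loading section):
#   head = DiffVQAHead(vis_dim=vis_dim_for("siglip_best.pt"))
```

Passing a `vis_dim` that does not match the checkpoint would most likely surface as a shape mismatch on `vis_proj` when `load_state_dict` is called, so resolving it from the file name keeps that error from appearing late.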