---
tags:
- chest-xray
- radiology
- visual-question-answering
- differential-vqa
- mimic-cxr
license: apache-2.0
---

# LAPVQA — Differential VQA (Native / End-to-end)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

These DiffVQA checkpoints were trained **end-to-end**, with the encoder and head
optimized jointly. Each `.pt` file is a plain state dict for `DiffVQAHead`.
MAE-ViT-L/16 is the primary encoder studied.

## Results (test set, MAE-ViT-L/16)

| BLEU-4 | ROUGE-2 | RadGraph-s | BERTScore F1 |
|---|---|---|---|
| 0.472 | 0.573 | 0.288 | 0.938 |

## Checkpoints

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
| `coca_best.pt` | CoCa | 768 |
| `florence2_best.pt` | Florence-2 | 1024 |
| `mae-vit-l16_best.pt` | MAE ViT-L/16 | 1024 |
| `siglip_best.pt` | SigLIP | 1152 |

## Loading

```python
import torch
from lapvqa.diffvqa.model import DiffVQAHead

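# Each checkpoint is a plain state dict, so weights_only=True (PyTorch >= 1.13)
# is enough and avoids unpickling arbitrary objects.
ckpt = torch.load("mae-vit-l16_best.pt", map_location="cpu", weights_only=True)

# vis_dim must match the value listed for this checkpoint's encoder.
head = DiffVQAHead(vis_dim=1024)
head.load_state_dict(ckpt)
head.eval()  # inference mode: disables dropout and similar training-only layers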
```
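
Since `vis_dim` differs between checkpoints, it can help to look it up from the file name rather than hard-coding it. The snippet below is a minimal sketch of that idea: `VIS_DIMS` and `vis_dim_for` are hypothetical helpers, not part of `lapvqa`, and the values are simply copied from the file table above.

```python
from pathlib import Path

# vis_dim per checkpoint file, copied from the file table in this card.
# This mapping is an illustrative helper, not part of the lapvqa package.
VIS_DIMS = {
    "clip-vit-l14_best.pt": 1024,
    "coca_best.pt": 768,
    "florence2_best.pt": 1024,
    "mae-vit-l16_best.pt": 1024,
    "siglip_best.pt": 1152,
}


def vis_dim_for(checkpoint_path):
    """Return the vis_dim listed for a checkpoint, keyed by its file name."""
    name = Path(checkpoint_path).name  # ignore any leading directories
    try:
        return VIS_DIMS[name]
    except KeyError:
        raise ValueError(
            f"unknown checkpoint {name!r}; expected one of {sorted(VIS_DIMS)}"
        ) from None
```

With this in place, the head for any listed file could be built as `DiffVQAHead(vis_dim=vis_dim_for("siglip_best.pt"))` before calling `load_state_dict`.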