Ar-CXR — Arabic Chest X-ray Vision–Language Model

أشعة · Ar-CXR  ·  Arabic Chest X-ray Vision–Language Model  ·  by Vionex Digital Solutions
The first chest X-ray VLM that generates native Arabic radiology reports

⚕️ MEDICAL RESEARCH PROTOTYPE — NOT A MEDICAL DEVICE. Ar-CXR is released for research only. It must not be used for clinical decision-making, diagnosis, triage, or any patient-facing purpose.

Ar-CXR is the first chest X-ray vision–language model that generates native Arabic radiology reports. It couples a RAD-DINO image encoder with a Falcon-H1-7B Arabic-capable decoder in two configurations: a BLIP-2 Q-Former connector over the frozen encoder for report generation, and a feature-preserving MLP connector with low-rank adaptation (LoRA) of the vision encoder for grounding, which breaks the frozen-feature grounding ceiling. It is trained on CheXpert-Plus with reports machine-translated to Modern Standard Arabic.

This repository ships trained adapters only (the deltas we are allowed to redistribute), not the base-model weights. The first run downloads RAD-DINO and Falcon-H1 from their own repositories under their own licenses. See How to use.


What's in this repository

| Path | Contents | Used by |
|---|---|---|
| weights/generation/ | BLIP-2 Q-Former (64 queries), proj (768→3072), prefix LayerNorm | generate_report() |
| weights/decoder_lora/ | LoRA (r=64, α=128) adapters for Falcon-H1 | generate_report() |
| weights/connector/ | MLP connector (768→3072→3072), prefix LayerNorm, 11-way aux grounding head | predict_findings() |
| weights/vision_lora/ | LoRA (r=16, α=32) adapters for RAD-DINO | predict_findings() |
| config.json | Full Ar-CXR composite configuration | both |
| generation_config.json | Decoding settings used in the paper | |
| modeling_ar_cxr.py | Reference inference code (assembles base models + adapters) | |
| results/ | The exact evaluation JSONs behind every number below | |
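
One quick way to confirm that a downloaded snapshot has what each entry point needs is to look for the directories listed in the table above. The sketch below is a hypothetical helper and not part of modeling_ar_cxr.py; the only names taken from this repository are the four weight paths and the two method names.

```python
from pathlib import Path

# Directories each entry point depends on, per the table above.
REQUIRED_DIRS = {
    "generate_report": ("weights/generation", "weights/decoder_lora"),
    "predict_findings": ("weights/connector", "weights/vision_lora"),
}


def missing_dirs(repo_root, entry_point):
    """Return the required directories that are absent under repo_root."""
    root = Path(repo_root)
    return [d for d in REQUIRED_DIRS[entry_point] if not (root / d).is_dir()]
```

Calling missing_dirs(repo, "predict_findings") on a complete snapshot should return an empty list.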

Architecture

Ar-CXR is two trained configurations that share the two base models but use different visual connectors. They are loaded and run independently — the connectors are not interchangeable.

[Figure: Ar-CXR two-connector architecture: Q-Former generation + MLP grounding]

  • Generation uses a BLIP-2 Q-Former connector over the frozen encoder; its decoder-LoRA was trained (section-masked ITG, with prefix-LayerNorm and the fixed Arabic instruction) to read the 64-token Q-Former prefix. This is the configuration behind every report-generation number below.
  • Grounding uses an MLP connector + vision-LoRA; the 11-way aux head sits on the MLP output, so its gradient flows into the connector and (via LoRA) the encoder — it is not an inert probe. This is the configuration behind every AUROC below.
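
Both configurations rely on LoRA adapters (r=64, α=128 on the decoder; r=16, α=32 on the encoder, per the repository table). As a reminder of what those two numbers control, the sketch below computes the standard LoRA scaling factor α/r and the trainable parameter count for one adapted weight matrix. Which layers are adapted is not stated in this card, so the square 3072×3072 and 768×768 matrices are assumptions chosen only to match the widths quoted for the two base models.

```python
def lora_scaling(alpha, r):
    """Standard LoRA scaling: the weight update is (alpha / r) * B @ A."""
    return alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Assumed layer shapes, for illustration only.
decoder = {"scaling": lora_scaling(128, 64), "params": lora_param_count(3072, 3072, 64)}
encoder = {"scaling": lora_scaling(32, 16), "params": lora_param_count(768, 768, 16)}
print(decoder)
print(encoder)
```

Under these assumptions both adapters use the same scaling factor of 2, and each adapted matrix trains a small fraction of the parameters of the full matrix.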

⚠️ The two connectors are architecturally distinct: the generation decoder-LoRA reads the 64-token Q-Former prefix, while the grounding head reads the 257-token MLP prefix. Do not feed one connector's prefix to the other's head/decoder.
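
A cheap guard against mixing the two paths is to check the prefix shape before passing it on. The sketch below is hypothetical (PREFIX_TOKENS, PREFIX_WIDTH and check_prefix are not names from the released code); it encodes only the two token counts stated above and the 3072-wide connector output from the repository table.

```python
# Token counts stated above; 3072 is the connector output width.
PREFIX_TOKENS = {"generation": 64, "grounding": 257}
PREFIX_WIDTH = 3072


def check_prefix(shape, consumer):
    """Raise if a (tokens, width) prefix shape does not match its consumer."""
    tokens, width = shape
    expected = PREFIX_TOKENS[consumer]
    if tokens != expected or width != PREFIX_WIDTH:
        raise ValueError(
            f"{consumer} expects a ({expected}, {PREFIX_WIDTH}) prefix, got {shape}"
        )
    return True
```

For example, check_prefix((257, 3072), "generation") raises, because the generation decoder-LoRA was trained on the 64-token Q-Former prefix.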

The central grounding finding: connector design and a vision LoRA, not decoder scale, govern grounding. A 64-query Q-Former connector caps the grounding macro-AUROC at 0.667; an MLP connector lifts the frozen-feature ceiling to 0.730; vision-LoRA breaks it to 0.789.


Results

All numbers come straight from the JSONs in results/. No number here is estimated.

1. Visual grounding — connector ablation (macro-AUROC of the 11-finding head)

| Connector | macro-AUROC |
|---|---|
| Raw RAD-DINO (linear probe) | 0.613 |
| Q-Former, 64 queries (BLIP-2 default) | 0.667 |
| MLP connector (frozen encoder) | 0.730 |
| MLP + vision-LoRA (this model) | 0.789 |

Held-out test (n=10,810): 0.7895 (95% CI [0.785, 0.794]). External Stanford holdout (n=233): 0.7864, which falls inside the held-out test's 95% CI.
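
The step-by-step gains and the agreement between the two test sets can be read off with a few lines of arithmetic. The sketch below uses only the numbers reported above; treating "agreement" as the external point estimate lying inside the held-out interval is one reasonable reading, not necessarily the authors' exact test.

```python
ladder = [
    ("Raw RAD-DINO (linear probe)", 0.613),
    ("Q-Former, 64 queries", 0.667),
    ("MLP connector (frozen encoder)", 0.730),
    ("MLP + vision-LoRA", 0.789),
]

# Gain of each step over the previous one, rounded to three decimals.
gains = [round(b[1] - a[1], 3) for a, b in zip(ladder, ladder[1:])]
print(gains)  # [0.054, 0.063, 0.059]

# Does the external holdout estimate fall inside the held-out test's 95% CI?
ci_low, ci_high = 0.785, 0.794
external = 0.7864
print(ci_low <= external <= ci_high)  # True
```

The MLP connector and the vision-LoRA each add roughly as much as the Q-Former does over the raw linear probe.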

[Figure: Grounding connector ablation, macro-AUROC ladder]

2. Identical-protocol comparison vs TorchXRayVision (same images, same gold labels, 9 shared findings)

| Finding | TorchXRayVision DenseNet | Ar-CXR | Δ |
|---|---|---|---|
| macro (9 findings) | 0.669 | 0.768 | +0.099 |
| fracture | 0.476 | 0.697 | +0.221 |
| pneumonia | 0.575 | 0.739 | +0.164 |
| pneumothorax | 0.701 | 0.859 | +0.158 |
| cardiomegaly | 0.711 | 0.823 | +0.112 |
| effusion | 0.788 | 0.885 | +0.096 |

[Figure: Ar-CXR vs TorchXRayVision, per-finding AUROC]

Read conservatively: TorchXRayVision (TXV) is evaluated zero-shot under domain shift, against a CheXbert-on-impression label definition it was not trained on. The defensible claim is that Ar-CXR's grounding beats a widely used off-the-shelf classifier on this protocol, not that it beats supervised classifiers in general.
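
For readers less familiar with the metric, macro-AUROC is the unweighted mean of per-finding AUROCs. The sketch below is a minimal pure-Python version on toy data, not the evaluation code behind the numbers above; how findings without any positive case were handled in the real evaluation is not stated here, so this version simply raises.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(labels_by_finding, scores_by_finding):
    """Unweighted mean of per-finding AUROCs, plus the per-finding values."""
    per_finding = {
        name: auroc(labels_by_finding[name], scores_by_finding[name])
        for name in labels_by_finding
    }
    return sum(per_finding.values()) / len(per_finding), per_finding


# Toy data for two findings and four images (illustrative values only).
toy_labels = {"effusion": [0, 0, 1, 1], "fracture": [0, 1, 0, 1]}
toy_scores = {"effusion": [0.1, 0.4, 0.35, 0.8], "fracture": [0.2, 0.9, 0.3, 0.6]}
macro, per_finding = macro_auroc(toy_labels, toy_scores)
print(macro, per_finding)
```

The pairwise loop is quadratic in the number of images; a rank-based implementation gives the same value and scales better to test sets of this size.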

3. Arabic report generation vs open VLM baselines (n=200, image-only, identical Arabic instruction)

| Model | METEOR | chrF | BERTScore-F1 | CIDEr | Clinical Jaccard |
|---|---|---|---|---|---|
| Ar-CXR (ours) | 19.2 | 29.2 | 61.6 | 0.21 | 40.4 |
| Lingshu-7B (CXR specialist) | 5.5 | 22.1 | 56.0 | 0.05 | 6.8 |
| AIN-7B (Arabic VLM) | 7.6 | 25.1 | 53.6 | 0.09 | 17.3 |
| Qwen2.5-VL-7B | 5.6 | 22.2 | 51.9 | 0.05 | 8.7 |
| IDEFICS2-8B (EN→AR) | 4.5 | 15.4 | 51.2 | 0.06 | 4.5 |

Ar-CXR ranks first on every automatic metric. Note the clinical Jaccard gap (40.4 vs ≤17.3): the baselines, including the CXR specialist Lingshu, produce fluent text but miss the Arabic finding vocabulary.
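
The exact definition of the clinical Jaccard metric is not given in this card. The sketch below shows one plausible reading, as an assumption: extract the clinical terms that occur in each report from a fixed Arabic vocabulary, then take the Jaccard overlap of the two term sets. The four-term vocabulary and the function names are hypothetical.

```python
# Hypothetical mini-vocabulary; the real term list is not given in this card.
# Roughly: effusion, pneumothorax, cardiomegaly, fracture.
CLINICAL_TERMS = ("انصباب", "استرواح", "تضخم القلب", "كسر")


def term_set(report, vocabulary=CLINICAL_TERMS):
    """Set of vocabulary terms that occur in the report (substring match)."""
    return {term for term in vocabulary if term in report}


def clinical_jaccard(generated, reference, vocabulary=CLINICAL_TERMS):
    """Jaccard overlap of the clinical terms found in two reports."""
    a, b = term_set(generated, vocabulary), term_set(reference, vocabulary)
    if not a and not b:
        return 1.0  # assumption: two reports with no terms count as agreeing
    return len(a & b) / len(a | b)


generated = "يوجد انصباب جنبي أيمن مع تضخم القلب"
reference = "انصباب جنبي بسيط ولا يوجد استرواح"
print(clinical_jaccard(generated, reference))
```

Here the two reports share one term out of three distinct terms, so the score is 1/3. Plain substring matching ignores negation (the reference above says there is no pneumothorax), which any real implementation would need to decide how to handle.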


How to use

Requirements: accept the base-model licenses on the Hub (tiiuae/Falcon-H1-7B-Instruct, microsoft/rad-dino-maira-2) and use a GPU with about 18 GB of VRAM in bf16.

import sys

import torch
from huggingface_hub import snapshot_download
from PIL import Image

# Download the adapters and reference code, then make the code importable.
repo = snapshot_download("Vionex-digital/Ar-CXR")
sys.path.insert(0, repo)                 # needed when not running from a clone of this repo
from modeling_ar_cxr import ArCXR        # ships in this repo

# Assemble the base models and adapters (downloads the base models on first run).
model = ArCXR.from_pretrained_adapters(repo, device="cuda", dtype=torch.bfloat16)

image = Image.open("chest_xray.png").convert("RGB")

# 1) Generate an Arabic report
report = model.generate_report(image)
print(report)

# 2) Grounding: per-finding probabilities (research diagnostic, not a classifier)
print(model.predict_findings(image))   # {'effusion': 0.88, 'cardiomegaly': 0.82, ...}
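
When inspecting grounding outputs, it is often easier to look at the highest-scoring findings first. The helper below is a hypothetical convenience, not part of modeling_ar_cxr.py, and assumes predict_findings returns a plain dict of finding name to probability, as the comment above suggests. The example values are illustrative.

```python
def top_findings(probs, k=3):
    """Return the k findings with the highest probability, highest first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Example dictionary shaped like the comment above; the values are illustrative.
example = {"effusion": 0.88, "cardiomegaly": 0.82, "fracture": 0.07}
print(top_findings(example, k=2))  # [('effusion', 0.88), ('cardiomegaly', 0.82)]
```

These scores remain a research diagnostic; ranking them does not turn the head into a classifier.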

The Arabic instruction used in training/eval (baked into generate_report, no need to pass it) is:

اكتب تقرير أشعة صدر باللغة العربية بناءً على الصورة:
(English: "Write a chest X-ray report in Arabic based on the image:")

Decoding: greedy, repetition_penalty=1.3, no_repeat_ngram_size=3, max_new_tokens=200. The reported metrics use this greedy configuration. Generation is deterministic within a fixed environment, but greedy decoding is sensitive to near-ties between candidate tokens, so reports may differ by a few tokens (with clinically equivalent phrasing) across GPU, driver, or library versions. This is normal LLM behaviour, not a sign of a load error. The grounding head (predict_findings) is bitwise reproducible.
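
To make the two repetition controls concrete, the sketch below spells the settings out under the keyword names used by Hugging Face generate (an assumption about how generation_config.json is consumed) and re-implements the two rules in plain Python. The helper names are hypothetical, and the functions mirror the usual behaviour of these options rather than the library's actual code.

```python
# Decoding settings from the paragraph above, under Hugging Face keyword names
# (an assumption about how they are passed to the decoder).
DECODING = {
    "do_sample": False,          # greedy
    "repetition_penalty": 1.3,
    "no_repeat_ngram_size": 3,
    "max_new_tokens": 200,
}


def banned_next_tokens(tokens, n=3):
    """Tokens that would repeat an n-gram already present in `tokens`."""
    if len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        # An earlier occurrence of the current (n-1)-token suffix bans its successor.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


def penalize(logit, penalty=1.3):
    """Repetition penalty for a token that already appeared in the sequence."""
    return logit / penalty if logit > 0 else logit * penalty


print(banned_next_tokens(["a", "b", "c", "a", "b"]))  # {'c'}
```

With no_repeat_ngram_size=3, the sequence a b c a b cannot continue with c, since that would repeat the trigram a b c. The penalty shrinks positive logits and pushes negative ones further down, and a small change in logits at a near-tie is enough to flip the greedy choice, which is consistent with the cross-environment differences described above.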


Training data

  • Source: CheXpert-Plus — 223,462 radiographs, 187,711 studies, 64,725 patients.
  • Arabic reports: 221,247 reports machine-translated EN→Modern Standard Arabic. We do not redistribute the translated corpus (Stanford CheXpert-Plus data-use agreement).
  • Splits: patient-level (seed 42), 90/5/5; the official CheXpert validation studies are an external "Stanford holdout".
  • Gold labels: CheXbert run on each report's impression, mapped to 11 findings (positive-only).
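
The split and label rules above can be sketched as follows. This is an illustrative reconstruction, not the authors' preprocessing code: the shuffling method, the function names and the string encoding of CheXbert outputs are all assumptions. Only the seed, the 90/5/5 fractions, the patient-level grouping and the positive-only rule come from the list above.

```python
import random


def patient_level_split(patient_ids, seed=42, fractions=(0.90, 0.05, 0.05)):
    """Shuffle unique patients with a fixed seed and cut into train/val/test."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    return {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }


def positive_only(chexbert_label):
    """Map a CheXbert output to a binary target: only 'positive' counts as 1."""
    return 1 if chexbert_label == "positive" else 0
```

Splitting on patient IDs rather than on images or studies keeps every radiograph of a given patient in a single partition, so the held-out test stays patient-disjoint from training.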

License

This is a composite, research-only release. The redistributed adapters and code are released for non-commercial research. You must also comply with every upstream license; where their terms differ, the most restrictive applies:

| Component | Source | License |
|---|---|---|
| Decoder base | tiiuae/Falcon-H1-7B-Instruct | TII Falcon-LLM License 2.0 |
| Vision base | microsoft/rad-dino-maira-2 | MSR license (research use) |
| Training data | CheXpert-Plus | Stanford CheXpert-Plus Data Use Agreement |

See LICENSE.md and NOTICE.md. The model and its outputs are not for clinical use.


Citation

@techreport{khaled2026arcxr,
  title       = {Ar-CXR: A Native Arabic Chest X-ray Vision--Language Model for
                 Radiology Report Generation and Visual Grounding},
  institution = {Vionex Digital Solutions},
  year        = {2026}
}

AI-tool disclosure

Software-engineering and manuscript-preparation assistance was provided by an AI coding assistant under author supervision. All experiments, results, and claims were designed, executed, and verified by the authors.


Evaluation results (self-reported)

  • chrF on CheXpert-Plus (Arabic, machine-translated), 200-image test: 29.2
  • BERTScore-F1 (AraBERT-v02) on CheXpert-Plus (Arabic, machine-translated), 200-image test: 61.6
  • Arabic Clinical-term Jaccard on CheXpert-Plus (Arabic, machine-translated), 200-image test: 40.4
  • macro-AUROC (11 findings) on CheXpert-Plus patient-disjoint held-out test (n=10,810): 0.789