Ar-CXR — Arabic Chest X-ray Vision–Language Model

أشعة · Ar-CXR · Arabic Chest X-ray Vision–Language Model · by Vionex Digital Solutions
The first chest X-ray VLM that generates native Arabic radiology reports

⚕️ MEDICAL RESEARCH PROTOTYPE — NOT A MEDICAL DEVICE. Ar-CXR is released for research only. It must not be used for clinical decision-making, diagnosis, triage, or any patient-facing purpose.

Ar-CXR is the first chest X-ray vision–language model that generates native Arabic radiology reports. It couples a RAD-DINO image encoder with a Falcon-H1-7B Arabic-capable decoder. Report generation uses a BLIP-2 Q-Former connector over the frozen encoder; grounding uses a feature-preserving MLP connector plus low-rank adaptation (LoRA) of the vision encoder, which lifts grounding macro-AUROC past the 0.730 frozen-encoder ceiling to 0.789. It is trained on CheXpert-Plus with reports machine-translated to Modern Standard Arabic.

Language: Arabic (Modern Standard Arabic)
Model type: Multimodal vision–language model (image + text → Arabic report) + auxiliary CXR grounding head
Finetuned from: microsoft/rad-dino-maira-2 (vision) + tiiuae/Falcon-H1-7B-Instruct (decoder)
License: Composite, research-only — see License

This repository ships trained adapters only (the deltas we are allowed to redistribute), not the base-model weights. The first run downloads RAD-DINO and Falcon-H1 from their own repositories under their own licenses. See How to use.

What's in this repository

Path	Contents	Used by
`weights/generation/`	BLIP-2 Q-Former (64 queries), proj (768→3072), prefix LayerNorm	`generate_report()`
`weights/decoder_lora/`	LoRA (r=64, α=128) adapters for Falcon-H1	`generate_report()`
`weights/connector/`	MLP connector (768→3072→3072), prefix LayerNorm, 11-way aux grounding head	`predict_findings()`
`weights/vision_lora/`	LoRA (r=16, α=32) adapters for RAD-DINO	`predict_findings()`
`config.json`	Full Ar-CXR composite configuration	both
`generation_config.json`	Decoding settings used in the paper
`modeling_ar_cxr.py`	Reference inference code (assembles base models + adapters)
`results/`	The exact evaluation JSONs behind every number below
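
The LoRA settings in the table can be read through the standard LoRA formulation, where the low-rank update is scaled by α/r. The sketch below assumes the adapters use that standard scaling (not a variant such as rank-stabilised LoRA); the 768×768 layer is an illustrative shape only, since the card does not state which layers are adapted.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Standard LoRA multiplies the low-rank update B @ A by alpha / r."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


# Values from the table above.
decoder_scale = lora_scaling(r=64, alpha=128)
vision_scale = lora_scaling(r=16, alpha=32)

# Hypothetical layer shape: a square 768x768 projection in the vision encoder.
vision_params_per_layer = lora_param_count(768, 768, r=16)

print(decoder_scale, vision_scale, vision_params_per_layer)
```

Under this assumption both adapter sets apply the same effective scale, so the difference between them is the rank rather than the strength of the update.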

Architecture

Ar-CXR consists of two trained configurations that share the two base models but use different visual connectors. They are loaded and run independently, and the connectors are not interchangeable.

[Figure: Ar-CXR two-connector architecture (Q-Former for generation, MLP for grounding)]

Generation uses a BLIP-2 Q-Former connector over the frozen encoder. Its decoder-LoRA was trained with section-masked ITG (image-grounded text generation, in BLIP-2's terminology), prefix-LayerNorm, and the fixed Arabic instruction to read the 64-token Q-Former prefix. This is the configuration behind every report-generation number below.
Grounding uses an MLP connector plus vision-LoRA. The 11-way aux head sits on the MLP output, so its gradient flows into the connector and, via LoRA, into the encoder; it is not an inert probe. This is the configuration behind the Ar-CXR AUROC numbers below.

⚠️ The two connectors are architecturally distinct: the generation decoder-LoRA reads the 64-token Q-Former prefix, while the grounding head reads the 257-token MLP prefix. Do not feed one connector's prefix to the other's head/decoder.
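
One way to guard against mixing the two prefixes is a shape check before the decoder or grounding head is called. This is a minimal sketch, not part of `modeling_ar_cxr.py`: the function name `check_prefix` and the `(batch, tokens, dim)` layout are assumptions, while the token counts (64 and 257) and the 3072 width come from the text and table above.

```python
# Expected (tokens, dim) for each configuration, taken from this card.
EXPECTED_PREFIX = {
    "generation": (64, 3072),   # Q-Former queries projected to 3072
    "grounding": (257, 3072),   # MLP connector tokens, 3072 wide
}


def check_prefix(shape, target):
    """Return True if a (batch, tokens, dim) prefix shape matches `target`.

    Raises ValueError when the prefix belongs to the other connector
    or has an unexpected shape.
    """
    if target not in EXPECTED_PREFIX:
        raise ValueError(f"unknown target {target!r}")
    if len(shape) != 3:
        raise ValueError(f"expected (batch, tokens, dim), got {tuple(shape)}")
    tokens, dim = shape[1], shape[2]
    if (tokens, dim) != EXPECTED_PREFIX[target]:
        raise ValueError(
            f"{target} expects {EXPECTED_PREFIX[target]} per sample, "
            f"got {(tokens, dim)}"
        )
    return True


print(check_prefix((1, 64, 3072), "generation"))
```

Passing a 64-token Q-Former prefix with `target="grounding"` raises a `ValueError` instead of silently producing meaningless probabilities.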

The central grounding finding: connector design and vision-encoder LoRA, not decoder scale, govern grounding. With the encoder frozen, a 64-query Q-Former connector caps grounding macro-AUROC at 0.667 and an MLP connector raises that ceiling to 0.730; adding vision-LoRA pushes past it to 0.789.

Results

All numbers come straight from the JSONs in results/. No number here is estimated.

1. Visual grounding — connector ablation (macro-AUROC of the 11-finding head)

Connector	macro-AUROC
Raw RAD-DINO (linear probe)	0.613
Q-Former, 64 queries (BLIP-2 default)	0.667
MLP connector (frozen encoder)	0.730
MLP + vision-LoRA (this model)	0.789

Held-out test (n=10,810): 0.7895 (95% CI [0.785, 0.794]). External Stanford holdout (n=233): 0.7864, which falls inside the held-out test's 95% CI.
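
For readers unfamiliar with the metric, macro-AUROC is the unweighted mean of the per-finding AUROCs. The sketch below is a plain-Python illustration on toy data, not the evaluation code behind the numbers above; skipping findings that lack a positive or a negative example is an assumption, and the finding names and scores are made up.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties count as half a win. Returns None when one class is missing.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(labels_by_finding, scores_by_finding):
    """Unweighted mean of per-finding AUROCs, skipping undefined ones."""
    values = [
        auroc(labels_by_finding[f], scores_by_finding[f])
        for f in labels_by_finding
    ]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


# Toy data for two findings (hypothetical scores).
labels = {"effusion": [0, 0, 1, 1], "fracture": [1, 0, 1, 0]}
scores = {"effusion": [0.1, 0.4, 0.35, 0.8], "fracture": [0.9, 0.2, 0.7, 0.3]}

print(auroc(labels["effusion"], scores["effusion"]))  # 0.75
print(macro_auroc(labels, scores))                    # 0.875
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable at the scale of the held-out test.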

[Figure: Grounding connector ablation, macro-AUROC ladder]

2. Identical-protocol comparison vs TorchXRayVision (same images, same gold labels, 9 shared findings)

Finding	TorchXRayVision DenseNet	Ar-CXR	Δ
macro (9 findings)	0.669	0.768	+0.099
fracture	0.476	0.697	+0.221
pneumonia	0.575	0.739	+0.164
pneumothorax	0.701	0.859	+0.158
cardiomegaly	0.711	0.823	+0.112
effusion	0.788	0.885	+0.096

[Figure: Ar-CXR vs TorchXRayVision, per-finding AUROC]

Read conservatively: TXV is evaluated zero-shot under domain shift against a CheXbert-on-impression label definition it was not trained on. The defensible claim is that Ar-CXR's grounding beats a widely used off-the-shelf classifier on this protocol, not that it beats supervised classifiers in general.

3. Arabic report generation vs open VLM baselines (n=200, image-only, identical Arabic instruction)

Model	METEOR	chrF	BERTScore-F1	CIDEr	Clinical Jaccard
Ar-CXR (ours)	19.2	29.2	61.6	0.21	40.4
Lingshu-7B (CXR specialist)	5.5	22.1	56.0	0.05	6.8
AIN-7B (Arabic VLM)	7.6	25.1	53.6	0.09	17.3
Qwen2.5-VL-7B	5.6	22.2	51.9	0.05	8.7
IDEFICS2-8B (EN→AR)	4.5	15.4	51.2	0.06	4.5

Ar-CXR ranks first on every automatic metric. Note the clinical Jaccard gap (40.4 vs ≤17.3): the baselines, even Lingshu, a CXR specialist, produce fluent text but miss the Arabic finding vocabulary.
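
The card does not spell out how Clinical Jaccard is computed. The sketch below shows one plausible reading, as an assumption: extract the set of Arabic finding terms present in each report and take the Jaccard index of the two sets. The five-term vocabulary, the substring matching, and the convention for two empty sets are all hypothetical, and negation is not handled.

```python
# Hypothetical mini-vocabulary of Arabic finding terms (illustration only).
CLINICAL_TERMS = [
    "انصباب",        # effusion
    "استرواح",       # pneumothorax
    "تضخم القلب",    # cardiomegaly
    "كسر",           # fracture
    "التهاب رئوي",   # pneumonia
]


def extract_terms(report, vocab=CLINICAL_TERMS):
    """Return the vocabulary terms that occur as substrings of the report."""
    return {term for term in vocab if term in report}


def clinical_jaccard(generated, reference):
    """Jaccard index of the clinical-term sets of two reports, in [0, 1]."""
    a, b = extract_terms(generated), extract_terms(reference)
    if not a and not b:
        return 1.0  # assumed convention: two reports with no terms agree
    return len(a & b) / len(a | b)


generated = "يوجد انصباب جنبي مع تضخم القلب"
reference = "انصباب جنبي مع استرواح الصدر"

# One shared term (effusion) out of three distinct terms.
print(clinical_jaccard(generated, reference))
```

The table reports the metric on a 0 to 100 scale, so under this reading a value such as 40.4 would correspond to a mean Jaccard of 0.404 across the 200 test images.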

How to use

You must accept the base-model licenses on the Hub (tiiuae/Falcon-H1-7B-Instruct, microsoft/rad-dino-maira-2) and have a GPU with about 18 GB of VRAM in bf16.

import sys

import torch
from huggingface_hub import snapshot_download
from PIL import Image

# Download the adapters and reference code, then make modeling_ar_cxr.py importable
# (needed when you are not running from inside a clone of this repo).
repo = snapshot_download("Vionex-digital/Ar-CXR")
sys.path.insert(0, repo)
from modeling_ar_cxr import ArCXR        # ships in this repo

# Assembles the base models (downloaded on first run) with the trained adapters.
model = ArCXR.from_pretrained_adapters(repo, device="cuda", dtype=torch.bfloat16)

image = Image.open("chest_xray.png").convert("RGB")

# 1) Generate an Arabic report
report = model.generate_report(image)
print(report)

# 2) Grounding: per-finding probabilities (research diagnostic, not a classifier)
print(model.predict_findings(image))   # {'effusion': 0.88, 'cardiomegaly': 0.82, ...}
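
When inspecting the grounding output, it can help to sort the findings by probability. The helper below is a small sketch that assumes `predict_findings` returns a plain `dict` of finding name to probability, as the comment above suggests; `rank_findings` is a hypothetical name, and the `fracture` value is made up for the example.

```python
def rank_findings(probs, top_k=3):
    """Return the top_k (finding, probability) pairs, highest first."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Stand-in for model.predict_findings(image); values are illustrative.
example = {"cardiomegaly": 0.82, "fracture": 0.10, "effusion": 0.88}

for finding, prob in rank_findings(example, top_k=2):
    print(f"{finding}: {prob:.2f}")
```

This is for research inspection only; the ranking is not a diagnosis and carries no calibrated decision threshold.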

The Arabic instruction used in training/eval (baked into generate_report, no need to pass it) is shown below. In English it reads "Write a chest X-ray report in Arabic based on the image:".

اكتب تقرير أشعة صدر باللغة العربية بناءً على الصورة:

Decoding: greedy, repetition_penalty=1.3, no_repeat_ngram_size=3, max_new_tokens=200; all reported metrics use this configuration. Generation is deterministic within a fixed environment. Greedy decoding is sensitive at near-ties, however, so reports may differ by a few tokens (into clinically equivalent phrasings) across GPU, driver, and library versions. This is normal LLM behaviour, not a sign of a load error. The grounding head (predict_findings) is bitwise reproducible.
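
The decoding settings above can be written as keyword arguments in the style of Hugging Face `generate`. This is a sketch of how they might be expressed, not a copy of `generation_config.json`: `do_sample=False` with `num_beams=1` is the usual way to request greedy search, and `banned_next_tokens` is a hypothetical helper that illustrates what `no_repeat_ngram_size=3` forbids.

```python
# Greedy decoding with the penalties listed in this card.
GENERATION_KWARGS = {
    "do_sample": False,          # no sampling
    "num_beams": 1,              # single beam, i.e. greedy search
    "repetition_penalty": 1.3,
    "no_repeat_ngram_size": 3,
    "max_new_tokens": 200,
}


def banned_next_tokens(tokens, n=3):
    """Tokens that would complete an n-gram already present in `tokens`.

    With n=3, if the last two tokens already appeared somewhere followed
    by token t, then t is banned as the next token. Assumes n >= 2.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    prefix = tuple(tokens[-(n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


# The bigram (5, 6) was previously followed by 7, so 7 is banned next.
print(banned_next_tokens([5, 6, 7, 5, 6], n=3))
```

Because the n-gram ban and the repetition penalty both change which token wins at each step, a tiny numerical difference in the logits can flip a near-tie and send the rest of the report down a slightly different path.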

Training data

Source: CheXpert-Plus — 223,462 radiographs, 187,711 studies, 64,725 patients.
Arabic reports: 221,247 reports machine-translated EN→Modern Standard Arabic. We do not redistribute the translated corpus (Stanford CheXpert-Plus data-use agreement).
Splits: patient-level (seed 42), 90/5/5; the official CheXpert validation studies are an external "Stanford holdout".
Gold labels: CheXbert run on each report's impression, mapped to 11 findings (positive-only).
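
A patient-level split and positive-only labels, as described in the list above, might look like the following. This is a sketch under stated assumptions rather than the authors' procedure: the shuffle-then-slice method, the function names, and the patient ID format are hypothetical, and "positive-only" is read here as treating uncertain, negative, and blank CheXbert outputs as 0.

```python
import random


def patient_level_split(patient_ids, seed=42, percents=(90, 5, 5)):
    """Split unique patients into train/val/test so no patient spans two splits."""
    unique = sorted(set(patient_ids))      # sort first for a reproducible order
    rng = random.Random(seed)
    rng.shuffle(unique)
    n = len(unique)
    n_train = n * percents[0] // 100       # integer arithmetic avoids float rounding
    n_val = n * percents[1] // 100
    return {
        "train": set(unique[:n_train]),
        "val": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),
    }


def positive_only(chexbert_label):
    """Map a CheXbert class to a binary target: only 'positive' counts as 1."""
    return 1 if chexbert_label == "positive" else 0


# Hypothetical patient IDs; repeats stand for patients with several studies.
ids = [f"patient{i:05d}" for i in range(100)]
splits = patient_level_split(ids + ids[:10])

print({name: len(members) for name, members in splits.items()})
```

Each study or image then inherits the split of its patient, which is what keeps the held-out test patient-disjoint from training.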

License

This is a composite, research-only release. The redistributed adapters and code are released for non-commercial research. You must also comply with every upstream license; where terms differ, the most restrictive applies:

Component	Source	License
Decoder base	`tiiuae/Falcon-H1-7B-Instruct`	TII Falcon-LLM License 2.0
Vision base	`microsoft/rad-dino-maira-2`	MSR license (research use)
Training data	CheXpert-Plus	Stanford CheXpert-Plus Data Use Agreement

See LICENSE.md and NOTICE.md. The model and its outputs are not for clinical use.

Citation

@techreport{khaled2026arcxr,
  title       = {Ar-CXR: A Native Arabic Chest X-ray Vision--Language Model for
                 Radiology Report Generation and Visual Grounding},
  institution = {Vionex Digital Solutions},
  year        = {2026}
}

AI-tool disclosure

Software-engineering and manuscript-preparation assistance was provided by an AI coding assistant under author supervision. All experiments, results, and claims were designed, executed, and verified by the authors.

Evaluation results (self-reported)

Metric	Dataset	Value
chrF	CheXpert-Plus (Arabic, machine-translated), 200-image test	29.2
BERTScore-F1 (AraBERT-v02)	CheXpert-Plus (Arabic, machine-translated), 200-image test	61.6
Arabic Clinical-term Jaccard	CheXpert-Plus (Arabic, machine-translated), 200-image test	40.4
macro-AUROC (11 findings)	CheXpert-Plus patient-disjoint held-out test (n=10,810)	0.789