File size: 8,185 Bytes
96b4ee6
9bb8f14
bd2d75c
 
96b4ee6
bd2d75c
 
1d1737c
 
 
 
 
 
 
96b4ee6
9bb8f14
250e296
9bb8f14
 
 
e5a7bfb
9bb8f14
 
 
 
 
69ba5c3
 
 
 
9bb8f14
 
 
726c994
 
 
9bb8f14
726c994
 
9bb8f14
726c994
9bb8f14
 
 
 
 
 
 
 
 
 
 
 
 
22738c5
9bb8f14
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1d47dbc
9bb8f14
 
 
 
 
fcd5036
 
726c994
fcd5036
93af83c
fcd5036
 
9bb8f14
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
---
language: en
tags:
- exbert
license: mit
widget:
- text: "Left pleural effusion with adjacent [MASK]."
  example_title: "Radiology 1"
- text: "Heart size normal and lungs are [MASK]."
  example_title: "Radiology 2"
- text: "[MASK] is a tumor suppressor gene."
  example_title: "Biomedical"
- text: "The patient was on [MASK] for chronic atrial fibrillation"
  example_title: "Medication"
---

# CXR-BERT-general

[CXR-BERT](https://arxiv.org/abs/2204.09817) is a chest X-ray (CXR) domain-specific language model that makes use of an improved vocabulary, novel pretraining procedure, weight regularization, and text augmentations. The resulting model demonstrates improved performance on radiology natural language inference, radiology masked language model token prediction, and downstream vision-language processing tasks such as zero-shot phrase grounding and image classification.

First, we pretrain **CXR-BERT-general** from a randomly initialized BERT model via Masked Language Modeling (MLM) on abstracts [PubMed](https://pubmed.ncbi.nlm.nih.gov/) and clinical notes from the publicly-available [MIMIC-III](https://physionet.org/content/mimiciii/1.4/) and [MIMIC-CXR](https://physionet.org/content/mimic-cxr/). In that regard, the general model is expected be applicable for research in clinical domains other than the chest radiology through domain specific fine-tuning.

**CXR-BERT-specialized** is continually pretrained from CXR-BERT-general to further specialize in the chest X-ray domain. At the final stage, CXR-BERT is trained in a multi-modal contrastive learning framework, similar to the [CLIP](https://arxiv.org/abs/2103.00020) framework. The latent representation of [CLS] token is utilized to align text/image embeddings.

## Model variations

| Model                                             | Model identifier on HuggingFace                                                                             | Vocabulary     | Note                                                      |
| ------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | -------------- | --------------------------------------------------------- |
| CXR-BERT-general                                  | [microsoft/BiomedVLP-CXR-BERT-general](https://huggingface.co/microsoft/BiomedVLP-CXR-BERT-general)         | PubMed & MIMIC | Pretrained for biomedical literature and clinical domains |
| CXR-BERT-specialized (after multi-modal training) | [microsoft/BiomedVLP-CXR-BERT-specialized](https://huggingface.co/microsoft/BiomedVLP-CXR-BERT-specialized) | PubMed & MIMIC | Pretrained for chest X-ray domain                         |

## Citation

The corresponding manuscript is accepted to be presented at the [**European Conference on Computer Vision (ECCV) 2022**](https://eccv2022.ecva.net/)

```bibtex
@misc{https://doi.org/10.48550/arxiv.2204.09817,
  doi = {10.48550/ARXIV.2204.09817},
  url = {https://arxiv.org/abs/2204.09817},
  author = {Boecking, Benedikt and Usuyama, Naoto and Bannur, Shruthi and Castro, Daniel C. and Schwaighofer, Anton and Hyland, Stephanie and Wetscherek, Maria and Naumann, Tristan and Nori, Aditya and Alvarez-Valle, Javier and Poon, Hoifung and Oktay, Ozan},
  title = {Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing},
  publisher = {arXiv},
  year = {2022},
}
```

## Model Use

### Intended Use

This model is intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper.

#### Primary Intended Use

The primary intended use is to support AI researchers building on top of this work. CXR-BERT and its associated models should be helpful for exploring various clinical NLP & VLP research questions, especially in the radiology domain.

#### Out-of-Scope Use

**Any** deployed use case of the model --- commercial or otherwise --- is currently out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are not intended for deployed use cases. Please refer to [the associated paper](https://arxiv.org/abs/2204.09817) for more details.

## Data

This model builds upon existing publicly-available datasets:

- [PubMed](https://pubmed.ncbi.nlm.nih.gov/)
- [MIMIC-III](https://physionet.org/content/mimiciii/)
- [MIMIC-CXR](https://physionet.org/content/mimic-cxr/)

These datasets reflect a broad variety of sources ranging from biomedical abstracts to intensive care unit notes to chest X-ray radiology notes. The radiology notes are accompanied with their associated chest x-ray DICOM images in MIMIC-CXR dataset.  

## Performance

We demonstrate that this language model achieves state-of-the-art results in radiology natural language inference through its improved vocabulary and novel language pretraining objective leveraging semantics and discourse characteristics in radiology reports.

A highlight of comparison to other common models, including [ClinicalBERT](https://aka.ms/clinicalbert) and [PubMedBERT](https://aka.ms/pubmedbert):

|                                                 | RadNLI accuracy (MedNLI transfer) | Mask prediction accuracy | Avg. # tokens after tokenization | Vocabulary size |
| ----------------------------------------------- | :-------------------------------: | :----------------------: | :------------------------------: | :-------------: |
| RadNLI baseline                                 |               53.30               |            -             |                -                 |        -        |
| ClinicalBERT                                    |               47.67               |          39.84           |         78.98 (+38.15%)          |     28,996      |
| PubMedBERT                                      |               57.71               |          35.24           |         63.55 (+11.16%)          |     28,895      |
| CXR-BERT (after Phase-III)                      |               60.46               |          77.72           |          58.07 (+1.59%)          |     30,522      |
| **CXR-BERT (after Phase-III + Joint Training)** |             **65.21**             |        **81.58**         |        **58.07 (+1.59%)**        |     30,522      |

CXR-BERT also contributes to better vision-language representation learning through its improved text encoding capability. Below is the zero-shot phrase grounding performance on the **MS-CXR** dataset, which evaluates the quality of image-text latent representations.

| Vision–Language Pretraining Method | Text Encoder | MS-CXR Phrase Grounding (Avg. CNR Score) |
| ---------------------------------- | ------------ | :--------------------------------------: |
| Baseline                           | ClinicalBERT |                  0.769                   |
| Baseline                           | PubMedBERT   |                  0.773                   |
| ConVIRT                            | ClinicalBERT |                  0.818                   |
| GLoRIA                             | ClinicalBERT |                  0.930                   |
| **BioViL**                         | **CXR-BERT** |                **1.027**                 |
| **BioViL-L**                       | **CXR-BERT** |                **1.142**                 |

Additional details about performance can be found in the corresponding paper, [Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing](https://arxiv.org/abs/2204.09817).

## Limitations

This model was developed using English corpora, and thus can be considered English-only.

## Further information

Please refer to the corresponding paper, ["Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing", ECCV'22](https://arxiv.org/abs/2204.09817) for additional details on the model training and evaluation.

For additional inference pipelines with CXR-BERT, please refer to the [HI-ML GitHub](https://aka.ms/biovil-code) repository. The associated source files will soon be accessible through this link.