ozanoktay committed
Commit: c3d8060
Parent: a3e25dc

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -18,7 +18,7 @@ widget:
 
 [BioViL-T](https://arxiv.org/abs/2301.04558) is a domain-specific vision-language model designed to analyze chest X-rays (CXRs) and radiology reports. It was trained using a temporal multi-modal pre-training procedure, which distinguishes it from its predecessor model ([BioViL](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136960001.pdf)). In detail, BioViL-T takes advantage of the temporal structure between data points, resulting in improved downstream performance on multiple benchmarks, while using the same training dataset as its predecessor. In particular, the resultant model displays significant improvement in embedding temporal information present in the image and text modalities (see [results](#performance)), as well as in the joint space. The canonical model can be adapted to both single- and multi-image downstream applications, including natural language inference, phrase grounding, image/text classification, and language decoding.
 
- The corresponding BERT language model is trained in two stages as follows: First, we pretrain [CXR-BERT-general](https://huggingface.co/microsoft/BiomedVLP-CXR-BERT-general) from a randomly initialized BERT model via Masked Language Modeling (MLM) on abstracts from [PubMed](https://pubmed.ncbi.nlm.nih.gov/) and clinical notes from the publicly available [MIMIC-III](https://physionet.org/content/mimiciii/1.4/) and [MIMIC-CXR](https://physionet.org/content/mimic-cxr/). The general model can be fine-tuned for research in other clinical domains by adjusting the parameters specific to the target domain. At the second stage, BioViL-T is continually pretrained from CXR-BERT-general using a multi-modal pre-training procedure by utilising radiology reports and sequences of chest X-rays. We utilise the latent representation of the [CLS] token to align text/image embeddings.
+ The corresponding BERT language model is trained in two stages: First, we pretrain [CXR-BERT-general](https://huggingface.co/microsoft/BiomedVLP-CXR-BERT-general) from a randomly initialized BERT model via Masked Language Modeling (MLM) on abstracts from [PubMed](https://pubmed.ncbi.nlm.nih.gov/) and clinical notes from the publicly available [MIMIC-III](https://physionet.org/content/mimiciii/1.4/) and [MIMIC-CXR](https://physionet.org/content/mimic-cxr/). The general model can be fine-tuned for research in other clinical domains by adjusting the parameters specific to the target domain. In the second stage, BioViL-T is continually pretrained from CXR-BERT-general using a multi-modal pre-training procedure by utilising radiology reports and sequences of chest X-rays. We utilise the latent representation of the [CLS] token to align text and image embeddings.
 
 
 ## Language model variations
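
For context on the [CLS]-token alignment mentioned in the edited paragraph, the sketch below shows one way to obtain a [CLS] text embedding from the BioViL-T text encoder using the standard `transformers` API. This is a minimal illustration, not code from this commit: it assumes the Hub checkpoint `microsoft/BiomedVLP-BioViL-T` loads via `AutoModel` with `trust_remote_code=True`, and the generic hidden-state extraction is our assumption; the model card may expose its own projection helpers for the joint image/text space.

```python
# Minimal sketch (assumptions noted above): embed a radiology sentence with
# the BioViL-T text encoder and read off the [CLS]-token representation.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "microsoft/BiomedVLP-BioViL-T"  # text-encoder checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
model.eval()

report = "Pleural effusion has improved since the prior study."
inputs = tokenizer(report, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Latent representation of the [CLS] token: first position of the final layer.
cls_embedding = outputs.hidden_states[-1][:, 0, :]
print(cls_embedding.shape)
```

Because BioViL-T aligns text and image embeddings in a joint space, a vector obtained this way (after any projection the model card prescribes) can be compared against image-tower embeddings, e.g. via cosine similarity, for retrieval or phrase grounding.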