ozanoktay committed on
Commit fc335c4
1 Parent(s): c4dd911

Update README.md

Files changed (1)
  1. README.md +5 -4
README.md CHANGED
@@ -107,7 +107,10 @@ These datasets reflect a broad variety of sources ranging from biomedical abstra

 ## Performance

-The presented model achieves state-of-the-art results in radiology natural language inference by leveraging semantics and discourse characteristics at training time more efficiently. The experiments were performed on the RadNLI and MS-CXR-T benchmarks, which measure the quality of text embeddings in terms of static and temporal semantics respectively. BioViL-T is benchmarked against other commonly used language models, including [PubMedBERT](https://aka.ms/pubmedbert) and [CXR-BERT](https://aka.ms/biovil).
+The presented model achieves state-of-the-art results in radiology natural language inference by leveraging semantics and discourse characteristics more efficiently at training time.
+The experiments were performed on the RadNLI and MS-CXR-T benchmarks, which measure the quality of text embeddings in terms of static and temporal semantics, respectively.
+BioViL-T is benchmarked against other commonly used state-of-the-art domain-specific BERT models, including [PubMedBERT](https://aka.ms/pubmedbert) and [CXR-BERT](https://aka.ms/biovil).
+The results below show that BioViL-T yields sentence embeddings with increased sensitivity to temporal content (MS-CXR-T) whilst better capturing static content (RadNLI).

 | | MS-CXR-T | MS-CXR-T | RadNLI (2 classes) | RadNLI (2 classes) |
 | ----------------------------------------------- | :-------------------------------: | :----------------------: | :-------------------------: | :-------------: |
@@ -116,7 +119,6 @@ The presented model achieves state-of-the-art results in radiology natural langu
 | [CXR-BERT-General](https://huggingface.co/microsoft/BiomedVLP-CXR-BERT-general) | 62.60 | .601 | 87.59 | .902 |
 | [CXR-BERT-Specialized](https://huggingface.co/microsoft/BiomedVLP-CXR-BERT-specialized) | 78.12 | .837 | 89.66 | .932 |
 | **BioViL-T** | **87.77** | **.933** | **90.52** | **.947** |
-<br/>

 The novel pretraining framework also yields better vision-language representations. Below is the zero-shot phrase grounding performance obtained on the [MS-CXR](https://physionet.org/content/ms-cxr/0.1/) benchmark dataset, which evaluates the quality of image-text latent representations.

@@ -125,9 +127,8 @@ The novel pretraining framework yields also better vision-language representatio
 | BioViL | 1.07 +- 0.04 | 0.229 +- 0.005 |
 | BioViL-L | 1.21 +- 0.05 | 0.202 +- 0.010 |
 | **BioViL-T** | **1.33 +- 0.04** | **0.240 +- 0.005** |
-<br/>

-Additional experimental results and discussion can be found in the corresponding paper, [Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing](https://arxiv.org/abs/2301.04558).
+Additional experimental results and discussion can be found in the corresponding paper, ["Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing", CVPR'23](https://arxiv.org/abs/2301.04558).


 ## Limitations
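
The updated text above frames BioViL-T's gains in terms of sentence-embedding quality on RadNLI and MS-CXR-T. Below is a minimal sketch of how such embeddings could be probed with the standard `transformers` API, assuming the `microsoft/BiomedVLP-BioViL-T` checkpoint loads as a BERT-style encoder; the model card's own usage example, which may expose a dedicated projected-embedding method, is authoritative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint ID; see the BioViL-T model card for the exact usage.
MODEL_ID = "microsoft/BiomedVLP-BioViL-T"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

# Sentences probing temporal semantics (MS-CXR-T style): the first two are
# paraphrases, the third contradicts them.
sentences = [
    "Pleural effusion has worsened since the prior study.",
    "Interval worsening of the pleural effusion.",
    "The pleural effusion is stable.",
]

with torch.no_grad():
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    # Mean-pool the final hidden states as a generic sentence embedding; the
    # released model may provide a dedicated projected embedding instead.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    emb = (outputs.hidden_states[-1] * mask).sum(dim=1) / mask.sum(dim=1)
    emb = torch.nn.functional.normalize(emb, dim=-1)

# Pairwise cosine similarities: a temporally sensitive encoder should score
# the paraphrase pair higher than either pairing with the contradiction.
print(emb @ emb.T)
```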
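The phrase-grounding metrics (CNR and mIoU) can likewise be made concrete. The sketch below uses random tensors as stand-ins for the projected image-patch and phrase embeddings; the grid size, embedding dimension, threshold, and CNR formula are illustrative assumptions, not the paper's exact evaluation code.

```python
import torch
import torch.nn.functional as F

# Stand-in shapes: a real pipeline would take patch embeddings from the image
# encoder and the phrase embedding from the text encoder, both projected into
# the shared latent space. Grid size and dimension are assumptions.
H, W, D = 15, 15, 128
patch_emb = F.normalize(torch.randn(H, W, D), dim=-1)  # image patch embeddings
phrase_emb = F.normalize(torch.randn(D), dim=-1)       # text phrase embedding

# Zero-shot grounding: cosine similarity between the phrase and every patch
# yields a heatmap over the image.
similarity_map = patch_emb @ phrase_emb                # (H, W) heatmap

# Hypothetical ground-truth region for the phrase.
box = torch.zeros(H, W, dtype=torch.bool)
box[4:9, 6:12] = True

# CNR contrasts similarity inside vs. outside the annotated region
# (one common definition); mIoU scores the overlap of the thresholded
# heatmap with that region.
sim_in, sim_out = similarity_map[box], similarity_map[~box]
cnr = (sim_in.mean() - sim_out.mean()) / torch.sqrt(sim_in.var() + sim_out.var())
pred = similarity_map > similarity_map.quantile(0.9)   # simple threshold (assumption)
iou = (pred & box).sum() / (pred | box).sum()
print(f"CNR={cnr.item():.3f}  IoU={iou.item():.3f}")
```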