ouBioBERT-Base, Uncased

Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT) is a language model based on the BERT-Base (Devlin, et al., 2019) architecture. We pre-trained ouBioBERT on PubMed abstracts from the PubMed baseline (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline) using our pre-training method.

The details of the pre-training procedure can be found in Wada, et al. (2020).

Evaluation

We evaluated ouBioBERT on the Biomedical Language Understanding Evaluation (BLUE) benchmark (Peng, et al., 2019). Each score is reported as the mean (standard deviation) over five runs with different random seeds.

Dataset	Task Type	Score
MedSTS	Sentence similarity	84.9 (0.6)
BIOSSES	Sentence similarity	92.3 (0.8)
BC5CDR-disease	Named-entity recognition	87.4 (0.1)
BC5CDR-chemical	Named-entity recognition	93.7 (0.2)
ShARe/CLEFE	Named-entity recognition	80.1 (0.4)
DDI	Relation extraction	81.1 (1.5)
ChemProt	Relation extraction	75.0 (0.3)
i2b2 2010	Relation extraction	74.0 (0.8)
HoC	Document classification	86.4 (0.5)
MedNLI	Inference	83.6 (0.7)
Total	Macro average of the scores	83.8 (0.3)
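
As a quick sanity check on the Total row, the sketch below recomputes the macro average from the rounded per-task means listed in the table. It assumes "macro average" means the unweighted mean over the ten tasks; the names `blue_scores` and `macro_average` are illustrative and not part of our released code. The rounded values average to about 83.85, which agrees with the reported 83.8 up to rounding of the per-task scores. The standard deviation of the total (0.3) cannot be recovered this way because it depends on per-seed results that are not listed here.

```python
from statistics import mean

# Per-task mean scores copied from the table above (already rounded to one decimal).
blue_scores = {
    "MedSTS": 84.9,
    "BIOSSES": 92.3,
    "BC5CDR-disease": 87.4,
    "BC5CDR-chemical": 93.7,
    "ShARe/CLEFE": 80.1,
    "DDI": 81.1,
    "ChemProt": 75.0,
    "i2b2 2010": 74.0,
    "HoC": 86.4,
    "MedNLI": 83.6,
}


def macro_average(scores):
    """Return the unweighted mean over tasks (assumption: each dataset counts equally)."""
    return mean(scores.values())


total = macro_average(blue_scores)
print(f"Macro average over {len(blue_scores)} tasks: {total:.2f}")
```

Because every dataset carries the same weight, a small benchmark such as BIOSSES influences the total as much as a large one such as MedNLI.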

Code for Fine-tuning

The source code for fine-tuning is freely available in our repository.

Citation

If you use our work in your research, please kindly cite the following paper:

@misc{2005.07202,
Author = {Shoya Wada and Toshihiro Takeda and Shiro Manabe and Shozo Konishi and Jun Kamohara and Yasushi Matsumura},
Title = {A pre-training technique to localize medical BERT and enhance BioBERT},
Year = {2020},
Eprint = {arXiv:2005.07202},
}