Model Description

CompactBioBERT is a distilled version of the BioBERT model, trained for 100k steps with a total batch size of 192 on the PubMed dataset.

Distillation Procedure

This model has the same overall architecture as DistilBioBERT, but its training combines the distillation approaches of DistilBioBERT and TinyBioBERT. We use the same initialisation technique as DistilBioBERT and apply layer-to-layer distillation with three major components: MLM, layer, and output distillation.
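As a rough illustration of how three such components could be combined into one training objective, here is a minimal pure-Python sketch. The specific loss forms (cross-entropy for MLM, mean squared error between hidden states for layer distillation, KL divergence between output distributions for output distillation), the equal weights, and all function names are assumptions made for illustration. They are not taken from this model card and may differ from the losses actually used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_loss(student_logits, target_index):
    """Cross-entropy of the student's prediction for one masked token."""
    return -math.log(softmax(student_logits)[target_index])


def layer_loss(student_hidden, teacher_hidden):
    """Assumed form: mean squared error between aligned hidden vectors."""
    squared = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(squared) / len(squared)


def output_loss(student_logits, teacher_logits, temperature=1.0):
    """Assumed form: KL(teacher || student) over the output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(student_logits, teacher_logits, student_hidden,
                  teacher_hidden, target_index, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three components (weights are hypothetical)."""
    w_mlm, w_layer, w_out = weights
    return (w_mlm * mlm_loss(student_logits, target_index)
            + w_layer * layer_loss(student_hidden, teacher_hidden)
            + w_out * output_loss(student_logits, teacher_logits))


if __name__ == "__main__":
    loss = combined_loss(
        student_logits=[2.0, 0.5, -1.0],
        teacher_logits=[2.5, 0.0, -1.5],
        student_hidden=[0.1, -0.2, 0.3],
        teacher_hidden=[0.0, -0.1, 0.4],
        target_index=0,
    )
    print(f"combined loss: {loss:.4f}")
```

In a real training loop these terms would be computed on batched tensors, and the layer term would be summed over each student layer paired with its corresponding teacher layer.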

Initialisation

Following DistilBERT, we initialise the student model by taking weights from every other layer of the teacher.
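The layer selection can be sketched as below, assuming a 12-layer teacher and reading "every other layer" literally as indices 0, 2, 4, and so on. The exact indices copied are an assumption, since the text does not state them, and the helper names are hypothetical. Strings stand in for the actual layer weights.

```python
def every_other_layer(num_teacher_layers, start=0):
    """Indices of the teacher layers to copy (assumed: start, start+2, ...)."""
    return list(range(start, num_teacher_layers, 2))


def init_student_layers(teacher_layers, start=0):
    """Build the student's layer list from every other teacher layer."""
    indices = every_other_layer(len(teacher_layers), start)
    return [teacher_layers[i] for i in indices]


if __name__ == "__main__":
    # Placeholder objects standing in for the teacher's transformer layers.
    teacher = [f"teacher_layer_{i}" for i in range(12)]
    student = init_student_layers(teacher)
    print(len(student), student)
```

With a 12-layer teacher this yields a 6-layer student, matching the layer count given in the Architecture section.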

Architecture

In this model, the hidden dimension and the embedding size are both set to 768, and the vocabulary size is 28996. The model has 6 transformer layers, and the expansion rate of the feed-forward layer is 4. Overall, it has around 65 million parameters.
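As a sanity check on the parameter count, the following sketch tallies the weights of a standard BERT-style encoder from the figures above. The maximum sequence length of 512, the two token-type embeddings, and the exclusion of the pooler and MLM head are assumptions based on the usual BERT layout, not values stated in this card.

```python
def bert_encoder_params(vocab_size=28996, hidden=768, layers=6,
                        ffn_expansion=4, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (no pooler/head)."""
    ffn = hidden * ffn_expansion
    layer_norm = 2 * hidden  # scale and bias

    # Token, position, and token-type embeddings plus their LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)

    # Two linear layers: hidden -> ffn and ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)

    # Each transformer layer has two LayerNorms.
    per_layer = attention + feed_forward + 2 * layer_norm

    return embeddings + layers * per_layer


if __name__ == "__main__":
    total = bert_encoder_params()
    print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the total comes out close to the 65 million quoted above, with the embedding matrix accounting for roughly a third of it.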

Citation

If you use this model, please consider citing the following paper:

@misc{https://doi.org/10.48550/arxiv.2209.03182,
  doi = {10.48550/ARXIV.2209.03182},
  url = {https://arxiv.org/abs/2209.03182},
  author = {Rohanian, Omid and Nouriborji, Mohammadmahdi and Kouchaki, Samaneh and Clifton, David A.},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences, 68T50},
  title = {On the Effectiveness of Compact Biomedical Transformers},
  publisher = {arXiv},
  year = {2022}, 
  copyright = {arXiv.org perpetual, non-exclusive license}
}