ScholarBERT_100_WB Model

This is the ScholarBERT_100_WB variant of the ScholarBERT model family.

The model is pretrained on a large collection of scientific research articles (221B tokens). The pretraining data also includes Wikipedia+BookCorpus, the corpora used to pretrain the BERT-base and BERT-large models.

This is a cased (case-sensitive) model. By default, the tokenizer does not convert inputs to lowercase.
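A minimal sketch of how the case sensitivity could be checked, assuming the Hugging Face `transformers` library is installed and the checkpoint is reachable on the Hub. The repository id and the helper name `compare_casing` are assumptions made for illustration, not something this card specifies.

```python
# Assumption: Hub repository id for this checkpoint; replace it with the
# actual location if it differs.
ASSUMED_REPO_ID = "globuslabs/ScholarBERT_100_WB"


def compare_casing(text, repo_id=ASSUMED_REPO_ID):
    """Tokenize `text` as written and lowercased, and return both token lists.

    Hypothetical helper: with a cased tokenizer the two lists are expected
    to differ whenever `text` contains uppercase characters.
    """
    # Imported lazily so that defining the helper needs no extra dependency.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return tokenizer.tokenize(text), tokenizer.tokenize(text.lower())
```

Calling `compare_casing("DNA binds to RNA polymerase")` downloads the tokenizer files on first use. If the two returned lists are identical for mixed-case input, the tokenizer is lowercasing after all and its configuration is worth inspecting.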

The model is based on the same architecture as BERT-large and has a total of 340M parameters.

Model Architecture

| Hyperparameter   | Value |
|------------------|-------|
| Layers           | 24    |
| Hidden Size      | 1024  |
| Attention Heads  | 16    |
| Total Parameters | 340M  |
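As a rough sanity check on the figures above, the parameter count of a BERT-style encoder can be estimated from the layer count and hidden size. The sketch below assumes the standard BERT layout (feed-forward size of 4x the hidden size, 512 positions, 2 segment types, a pooler layer) and a vocabulary size of 30,522 borrowed from the original uncased BERT. The actual ScholarBERT vocabulary size is not stated here, so the result is an estimate only.

```python
def bert_param_count(layers, hidden, vocab_size,
                     intermediate=None, max_positions=512, type_vocab=2):
    """Estimate the parameter count of a BERT-style encoder with a pooler."""
    if intermediate is None:
        intermediate = 4 * hidden  # standard BERT feed-forward width

    # Token, position, and segment embeddings plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Query, key, value, and output projections, each with a bias.
    # The number of attention heads does not change the parameter count.
    attention = 4 * (hidden * hidden + hidden)

    # Two dense layers of the feed-forward block, each with a bias.
    feed_forward = hidden * intermediate + intermediate + intermediate * hidden + hidden

    # Two LayerNorms per encoder layer (weight + bias each).
    layer_norms = 2 * (2 * hidden)

    per_layer = attention + feed_forward + layer_norms
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# Assumption: vocabulary size taken from the original uncased BERT, not from this model.
ASSUMED_VOCAB_SIZE = 30522

total = bert_param_count(layers=24, hidden=1024, vocab_size=ASSUMED_VOCAB_SIZE)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes to roughly 335M, which is in line with the nominal 340M quoted for BERT-large. A different vocabulary size shifts the total by 1,024 parameters per token.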

Training Dataset

The vocabulary and the model are pretrained on 100% of the PRD scientific literature dataset and on Wikipedia+BookCorpus.

The PRD dataset is provided by Public.Resource.Org, Inc. (“Public Resource”), a nonprofit organization based in California. The dataset was constructed from a corpus of journal article files, from which we successfully extracted text for 75,496,055 articles from 178,928 journals. The articles span Arts & Humanities, Life Sciences & Biomedicine, Physical Sciences, Social Sciences, and Technology. The distribution of articles is shown below.

[Figure: corpus pie chart]

BibTeX entry and citation info

If using this model, please cite this paper:

@misc{hong2023diminishing,
      title={The Diminishing Returns of Masked Language Models to Science}, 
      author={Zhi Hong and Aswathy Ajith and Gregory Pauloski and Eamon Duede and Kyle Chard and Ian Foster},
      year={2023},
      eprint={2205.11342},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}