metadata
language: en
tags:
  - science
  - multi-disciplinary
license: apache-2.0

ScholarBERT_100_64bit Model

This is the ScholarBERT_100_64bit variant of the ScholarBERT model family. It differs from the ScholarBERT_100 model in that its tokenizer was trained with int64 token counts rather than the default int32, so the counts of very frequent tokens (e.g., "the") do not overflow.

The model is pretrained on a large collection of scientific research articles (221B tokens).
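To illustrate why int32 counts are at risk on a corpus of this size: the largest signed 32-bit value is 2,147,483,647, so in a 221B-token corpus any token that makes up roughly 1% of the text already exceeds it. The sketch below uses only the Python standard library to show the wraparound. The count of 3 billion is a hypothetical figure chosen for illustration, and the snippet is not the tokenizer training code, which this card does not show.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest signed 32-bit value


def as_int32(count: int) -> int:
    """Return `count` as it would be stored in a signed 32-bit counter."""
    # ctypes performs no overflow checking, so large values wrap around.
    return ctypes.c_int32(count).value


def as_int64(count: int) -> int:
    """Return `count` as it would be stored in a signed 64-bit counter."""
    return ctypes.c_int64(count).value


# Hypothetical occurrence count for a very frequent token such as "the".
the_count = 3_000_000_000

print(as_int32(the_count))  # -1294967296 (wrapped, no longer a valid count)
print(as_int64(the_count))  # 3000000000 (preserved)
```

A wrapped count turns negative, which would distort any frequency-based decision made while building the vocabulary.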

This is a cased (case-sensitive) model. By default, the tokenizer does not convert inputs to lowercase.
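A minimal sketch of how one might check the case-sensitive behavior, assuming the `transformers` library is installed and that the checkpoint is available on the Hugging Face Hub under the id `globuslabs/ScholarBERT_100_64bit`. That id is inferred from this card rather than stated in it, and the helper names `load_tokenizer` and `compare_casing` are hypothetical.

```python
# Assumption: the Hub repository id below is inferred from this model card.
MODEL_ID = "globuslabs/ScholarBERT_100_64bit"


def load_tokenizer(model_id: str = MODEL_ID):
    """Load the tokenizer (downloads its files on first use)."""
    # Imported lazily so the helpers below can be used without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)


def compare_casing(tokenizer, text: str):
    """Tokenize `text` as written and in lowercase, returning both results."""
    return tokenizer.tokenize(text), tokenizer.tokenize(text.lower())


if __name__ == "__main__":
    tok = load_tokenizer()
    original, lowered = compare_casing(tok, "The Raman spectrum of TiO2")
    print(original)
    print(lowered)
```

With a cased tokenizer the two token lists would normally differ, since the input is not lowercased before tokenization.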

The model is based on the same architecture as BERT-large and has a total of 340M parameters.

Model Architecture

| Hyperparameter   | Value |
|------------------|-------|
| Layers           | 24    |
| Hidden Size      | 1024  |
| Attention Heads  | 16    |
| Total Parameters | 340M  |
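As a rough sanity check on these numbers, the sketch below counts the parameters of the Transformer encoder stack from the layer count and hidden size. It assumes the standard BERT layout (feed-forward size of 4 times the hidden size, biases on every linear layer, two LayerNorms per layer), which follows from the BERT-large claim but is not spelled out in this card. The function names are hypothetical.

```python
def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Parameters in one BERT-style encoder layer (weights plus biases)."""
    attention = 4 * (hidden * hidden + hidden)      # Q, K, V, output projections
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden)          # two linear layers
    layer_norms = 2 * (2 * hidden)                  # scale and shift, twice
    return attention + feed_forward + layer_norms


def encoder_params(layers: int, hidden: int) -> int:
    """Parameters in the full encoder stack, assuming intermediate = 4 * hidden."""
    return layers * encoder_layer_params(hidden, 4 * hidden)


head_dim = 1024 // 16            # per-head dimension
total = encoder_params(24, 1024)

print(head_dim)  # 64
print(total)     # 302309376
```

The encoder stack accounts for about 302M parameters. The remainder of the 340M total would come from the embedding tables and output heads, whose size depends on the vocabulary size, which this card does not state.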

Training Dataset

Both the vocabulary and the model were trained on 100% of the PRD scientific literature dataset.

The PRD dataset is provided by Public.Resource.Org, Inc. (“Public Resource”), a nonprofit organization based in California. The dataset was constructed from a corpus of journal article files, from which we successfully extracted text from 75,496,055 articles in 178,928 journals. The articles span Arts & Humanities, Life Sciences & Biomedicine, Physical Sciences, Social Sciences, and Technology. The distribution of articles is shown below.

[Figure: corpus pie chart showing the distribution of articles across disciplines]

BibTeX entry and citation info

If using this model, please cite this paper:

@inproceedings{hong2023diminishing,
  title={The diminishing returns of masked language models to science},
  author={Hong, Zhi and Ajith, Aswathy and Pauloski, James and Duede, Eamon and Chard, Kyle and Foster, Ian},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2023},
  pages={1270--1283},
  year={2023}
}