

How to use this model directly from the 🤗/transformers library:

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
    model = AutoModel.from_pretrained("allenai/scibert_scivocab_cased")
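
As a quick sanity check that the checkpoint loads and runs, the sketch below encodes one sentence and reads off the [CLS] hidden state. The example sentence is arbitrary, and using [CLS] as a sentence vector is an illustrative choice, not a recommendation from the paper; the sketch also assumes a recent transformers version that returns model outputs as objects rather than tuples.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
    model = AutoModel.from_pretrained("allenai/scibert_scivocab_cased")
    model.eval()

    # Tokenize one (arbitrary) scientific sentence into a batch of size 1.
    inputs = tokenizer(
        "The transcription factor binds to the promoter region.",
        return_tensors="pt",
    )

    # Run the encoder without tracking gradients.
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence_length, hidden_size);
    # position 0 is the [CLS] token. SciBERT follows BERT-base, so the
    # hidden size is 768.
    cls_vector = outputs.last_hidden_state[:, 0, :]
    print(cls_vector.shape)  # torch.Size([1, 768])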

SciBERT

This is the pretrained model presented in SciBERT: A Pretrained Language Model for Scientific Text (Beltagy et al., 2019), a BERT model trained on scientific text.

The training corpus is 1.14M papers (3.1B tokens) taken from Semantic Scholar. We use the full text of the papers in training, not just the abstracts.

SciBERT has its own wordpiece vocabulary (scivocab), built to best match the training corpus. We trained cased and uncased versions.
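
To see what scivocab buys you, one can compare how the SciBERT tokenizer and a general-domain BERT tokenizer split a scientific term. A small sketch follows; using bert-base-cased as the baseline and "phosphorylation" as the example term are my choices, and the exact splits may vary with vocabulary version.

    from transformers import AutoTokenizer

    sci_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
    gen_tok = AutoTokenizer.from_pretrained("bert-base-cased")

    term = "phosphorylation"
    # scivocab tends to keep common scientific terms as single wordpieces,
    # while a general-domain vocabulary breaks them into sub-word pieces.
    print(sci_tok.tokenize(term))
    print(gen_tok.tokenize(term))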

Available models include:

  • scibert_scivocab_cased
  • scibert_scivocab_uncased

The original repo can be found at https://github.com/allenai/scibert.

If using these models, please cite the following paper:

@inproceedings{beltagy-etal-2019-scibert,
    title = "SciBERT: A Pretrained Language Model for Scientific Text",
    author = "Beltagy, Iz  and Lo, Kyle  and Cohan, Arman",
    booktitle = "EMNLP",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1371"
}