metadata

license: mit
language:
  - en

SciBERT Longformer

This is a Lonformer version of the SciBERT uncased model by Allen AI. The model is slower than SciBERT (~2.5x in my benchmarks) but can allow for 8x wider max_seq_length (4096 vs. 512) which is handy in the case of working with long texts, e.g. scientific full texts.

The conversion to Longformer was performed with a tutorial by Allen AI: see a Google Colab Notebook by Yury which closely follows the tutorial.

Note:

no additional MLM pretraining of the Longformer was performed, the collab notebook stops at step 3, and step 4 is not done. The model can be improved with this additional MLM pretraining, better to do so with scientific texts, e.g. S2ORC, again by Allen AI.
no extensive benchmarks of SciBERT Longformer vs. SciBERT were performed in terms of downstream task performance

Links:

the original SciBERT repo
the original Longformer repo

If using these models, please consider citing the following papers:

@inproceedings{beltagy-etal-2019-scibert,
    title = "SciBERT: A Pretrained Language Model for Scientific Text",
    author = "Beltagy, Iz  and Lo, Kyle  and Cohan, Arman",
    booktitle = "EMNLP",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1371"
}

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}