Cross-Document Language Modeling

CDLM: Cross-Document Language Modeling. Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E Peters, Arie Cattan and Ido Dagan. In EMNLP Findings, 2021.

Note that pretraining used document and sentence separators, which you may want to add to your own data. The document separators are <doc-s> and </doc-s> (the last two tokens in the vocabulary); the sentence separators are <s> and </s>.
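As a rough illustration of adding the separators, the sketch below wraps each sentence in <s> ... </s> and each document in <doc-s> ... </doc-s> before joining everything into one input string. The helper name `join_documents`, the example sentences, and the exact placement and spacing of the separators are assumptions made for illustration; check the original repo for the preprocessing actually used in pretraining.

```python
from typing import List

# Separator tokens named in this model card.
DOC_START, DOC_END = "<doc-s>", "</doc-s>"
SENT_START, SENT_END = "<s>", "</s>"


def join_documents(documents: List[List[str]]) -> str:
    """Join documents (each a list of sentences) into one separator-marked string.

    Hypothetical helper: the wrapping scheme here is an assumption,
    not necessarily the exact format used during pretraining.
    """
    wrapped_docs = []
    for sentences in documents:
        # Wrap every sentence in the sentence separators.
        body = " ".join(f"{SENT_START} {s.strip()} {SENT_END}" for s in sentences)
        # Wrap the whole document in the document separators.
        wrapped_docs.append(f"{DOC_START} {body} {DOC_END}")
    return " ".join(wrapped_docs)


docs = [
    ["A storm hit the coast.", "Thousands lost power."],
    ["Power was restored on Monday."],
]
text = join_documents(docs)
print(text)
```

When encoding such a string, the tokenizer may add its own <s> and </s> around the whole sequence; passing `add_special_tokens=False` to the tokenizer call is one way to avoid duplicating them.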

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
model = AutoModel.from_pretrained('biu-nlp/cdlm')

The original repo is here.

If you find our work useful, please cite the paper as:

@article{caciularu2021cross,
  title={Cross-Document Language Modeling},
  author={Caciularu, Avi and Cohan, Arman and Beltagy, Iz and Peters, Matthew E and Cattan, Arie and Dagan, Ido},
  journal={Findings of the Association for Computational Linguistics: EMNLP 2021},
  year={2021}
}
