---
language: en
tags:
- longformer
- cdlm
license: apache-2.0
inference: false
---

# Cross-Document Language Modeling

CDLM: Cross-Document Language Modeling.
Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E Peters, Arie Cattan and Ido Dagan. In EMNLP Findings, 2021. [PDF](https://arxiv.org/pdf/2101.00406.pdf)

Please note that during our pretraining we used the document and sentence separators, which you might want to add to your data. The document and sentence separators are ``, `` (the last two tokens in the vocabulary), and ``, ``, respectively.

```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
model = AutoModel.from_pretrained('biu-nlp/cdlm')
```

The original repo is [here](https://github.com/aviclu/CDLM).

If you find our work useful, please cite the paper as:

```bibtex
@article{caciularu2021cross,
  title={Cross-Document Language Modeling},
  author={Caciularu, Avi and Cohan, Arman and Beltagy, Iz and Peters, Matthew E and Cattan, Arie and Dagan, Ido},
  journal={Findings of the Association for Computational Linguistics: EMNLP 2021},
  year={2021}
}
```
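
As a rough illustration of the note on separators above, the sketch below wraps each sentence and each document in separator strings before the text is passed to the tokenizer. The helper `join_documents` and the separator strings `[DOC]`, `[/DOC]`, `[S]`, `[/S]` are hypothetical placeholders, not the real vocabulary entries, and the exact layout (separators around every sentence and every document) is an assumption rather than a confirmed description of the pretraining format.

```python
from typing import List


def join_documents(
    docs: List[List[str]],
    doc_start: str,
    doc_end: str,
    sent_start: str,
    sent_end: str,
) -> str:
    """Concatenate documents into one string, wrapping sentences and documents.

    `docs` is a list of documents, each given as a list of sentence strings.
    All separator strings are supplied by the caller.
    """
    wrapped_docs = []
    for sentences in docs:
        # Wrap every sentence in the sentence separators.
        wrapped = " ".join(f"{sent_start} {s} {sent_end}" for s in sentences)
        # Wrap the whole document in the document separators.
        wrapped_docs.append(f"{doc_start} {wrapped} {doc_end}")
    return " ".join(wrapped_docs)


# Hypothetical placeholder separators; substitute the model's real tokens.
docs = [
    ["A storm hit the coast.", "Thousands lost power."],
    ["The coast was hit by a storm."],
]
text = join_documents(docs, "[DOC]", "[/DOC]", "[S]", "[/S]")
print(text)
```

In practice, the placeholder strings would be replaced by the actual separator tokens read from the loaded tokenizer, and the resulting string would then be tokenized as usual.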