File size: 1,204 Bytes
c105869 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
---
language: en
tags:
- longformer
- cdlm
license: apache-2.0
inference: false
---
# Cross-Document Language Modeling
CDLM: Cross-Document Language Modeling.
Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E Peters, Arie Cattan and Ido Dagan. In EMNLP Findings, 2021. [PDF](https://arxiv.org/pdf/2101.00406.pdf)
Please note that during our pretraining we used the document and sentence separators, which you might want to add to your data. The document and sentence separators are `<doc-s>`, `</doc-s>` (the last two tokens in the vocabulary), and `<s>`, `</s>`, respectively.
```python
from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
model = AutoModel.from_pretrained('biu-nlp/cdlm')
```
The original repo is [here](https://github.com/aviclu/CDLM).
If you find our work useful, please cite the paper as:
```python
@article{caciularu2021cross,
title={Cross-Document Language Modeling},
author={Caciularu, Avi and Cohan, Arman and Beltagy, Iz and Peters, Matthew E and Cattan, Arie and Dagan, Ido},
journal={Findings of the Association for Computational Linguistics: EMNLP 2021},
year={2021}
}
``` |