---
language: en
tags:
- longformer
- cdlm
license: apache-2.0
inference: false
---

# Cross-Document Language Modeling

CDLM: Cross-Document Language Modeling.

Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E Peters, Arie Cattan and Ido Dagan. In EMNLP Findings, 2021. [PDF](https://arxiv.org/pdf/2101.00406.pdf)

Note that pretraining used explicit document and sentence separators, so you may want to add them to your own data. The document separators are `<doc-s>` and `</doc-s>` (the last two tokens in the vocabulary), and the sentence separators are `<s>` and `</s>`.

```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
model = AutoModel.from_pretrained('biu-nlp/cdlm')
```
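
As a minimal sketch of adding the separators mentioned above to your own data, the helper below wraps each sentence in `<s>` / `</s>` and each document in `<doc-s>` / `</doc-s>`. The function name `wrap_documents`, the whitespace between tokens, and the exact nesting order are assumptions made for illustration; the original repository may lay out its inputs differently.

```python
from typing import List

# Separator tokens named in this model card.
DOC_START, DOC_END = "<doc-s>", "</doc-s>"
SENT_START, SENT_END = "<s>", "</s>"


def wrap_documents(documents: List[List[str]]) -> str:
    """Join documents (each a list of sentences) into one separator-marked string.

    Hypothetical helper: the spacing and nesting here are assumptions,
    not the authors' confirmed preprocessing.
    """
    wrapped_docs = []
    for sentences in documents:
        # Wrap every sentence in the sentence separators.
        wrapped_sents = " ".join(
            f"{SENT_START} {sentence.strip()} {SENT_END}" for sentence in sentences
        )
        # Wrap the whole document in the document separators.
        wrapped_docs.append(f"{DOC_START} {wrapped_sents} {DOC_END}")
    return " ".join(wrapped_docs)


docs = [
    ["A storm hit the coast on Monday.", "Thousands lost power."],
    ["The coastal storm caused widespread outages."],
]
text = wrap_documents(docs)
print(text)
```

The resulting string can then be passed to the tokenizer loaded above. Since the tokenizer may add its own outer `<s>` / `</s>` pair by default, it may be worth checking whether `add_special_tokens=False` suits your setup.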

The original repo is [here](https://github.com/aviclu/CDLM).

If you find our work useful, please cite the paper as:

```bibtex
@article{caciularu2021cross,
  title={Cross-Document Language Modeling},
  author={Caciularu, Avi and Cohan, Arman and Beltagy, Iz and Peters, Matthew E and Cattan, Arie and Dagan, Ido},
  journal={Findings of the Association for Computational Linguistics: EMNLP 2021},
  year={2021}
}
```