cattana committed on
Commit
c105869
1 Parent(s): 7a4b09a

Create README.md

Files changed (1)
  1. README.md +39 -0
README.md ADDED
@@ -0,0 +1,39 @@
+ ---
+ language: en
+ tags:
+ - longformer
+ - cdlm
+ license: apache-2.0
+ inference: false
+
+ ---
+
+
+ # Cross-Document Language Modeling
+
+ CDLM: Cross-Document Language Modeling.
+ Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E. Peters, Arie Cattan, and Ido Dagan. In Findings of EMNLP, 2021. [PDF](https://arxiv.org/pdf/2101.00406.pdf)
+
+
+ During pretraining we used document and sentence separators, which you may want to add to your own data. The document separators are `<doc-s>` and `</doc-s>` (the last two tokens in the vocabulary), and the sentence separators are `<s>` and `</s>`.
+
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ # load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
+ model = AutoModel.from_pretrained('biu-nlp/cdlm')
+ ```
+
+ The original repo is [here](https://github.com/aviclu/CDLM).
+
+ If you find our work useful, please cite the paper as:
+
+ ```bibtex
+ @article{caciularu2021cross,
+ title={Cross-Document Language Modeling},
+ author={Caciularu, Avi and Cohan, Arman and Beltagy, Iz and Peters, Matthew E and Cattan, Arie and Dagan, Ido},
+ journal={Findings of the Association for Computational Linguistics: EMNLP 2021},
+ year={2021}
+ }
+ ```
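
The README names the separator tokens but does not show how to place them in the input. The following is a minimal sketch of one plausible layout, not the authors' confirmed preprocessing: it assumes each sentence is wrapped in `<s>` and `</s>`, each document in `<doc-s>` and `</doc-s>`, and that the pieces are joined with single spaces. The helper name `wrap_documents` and the example sentences are hypothetical.

```python
from typing import List

# Separator tokens as listed in the README.
DOC_START, DOC_END = "<doc-s>", "</doc-s>"
SENT_START, SENT_END = "<s>", "</s>"


def wrap_documents(documents: List[List[str]]) -> str:
    """Join documents (each a list of sentences) into one separator-marked string.

    Assumption: the exact token layout and the whitespace between tokens are
    illustrative; check the original repo for the format used in pretraining.
    """
    wrapped_docs = []
    for sentences in documents:
        # Wrap every sentence in the sentence separators.
        body = " ".join(f"{SENT_START} {s.strip()} {SENT_END}" for s in sentences)
        # Wrap the whole document in the document separators.
        wrapped_docs.append(f"{DOC_START} {body} {DOC_END}")
    return " ".join(wrapped_docs)


# Hypothetical two-document input.
example = wrap_documents([
    ["The storm hit the coast on Monday.", "Thousands lost power."],
    ["A hurricane made landfall early this week."],
])
print(example)
```

The resulting string could then be passed to the tokenizer loaded in the README. Because the text already contains `<s>` and `</s>`, it may be worth calling the tokenizer with `add_special_tokens=False` to avoid a second pair being added; whether that matches the pretraining setup is not stated in the README.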