---
language:
- ar
- en
tags:
- bert
- roberta
- exbert
license: mit
datasets:
- arabic_billion_words
- cc100
- gigaword
- oscar
- wikipedia
---

# An English-Arabic Bilingual Encoder

`roberta-large-eng-ara-128k` is an English–Arabic bilingual encoder: a 24-layer Transformer (`d_model = 1024`), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining, and additionally use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train on 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B/23.6B words) or GigaBERT v4 (Lan et al., 2020) (4.3B/6.1B words). We build a joint English–Arabic vocabulary of 128K subwords using SentencePiece (Kudo and Richardson, 2018), and additionally enforce coverage of all Arabic characters after normalization.
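
The encoder can be loaded with the Hugging Face `transformers` library. The snippet below is a minimal sketch assuming the checkpoint and its tokenizer are published on the Hub; the repo id is a placeholder and should be replaced with the actual path of this model.

```python
# Minimal usage sketch; the repo id below is a placeholder, not the confirmed Hub path.
from transformers import AutoModel, AutoTokenizer

repo_id = "roberta-large-eng-ara-128k"  # replace with the actual (possibly org-prefixed) Hub id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)

# Encode one English and one Arabic sentence with the shared 128K vocabulary.
batch = tokenizer(
    [
        "A bilingual encoder shares one vocabulary across languages.",
        "هذا نموذج ثنائي اللغة للإنجليزية والعربية.",
    ],
    padding=True,
    return_tensors="pt",
)
hidden = model(**batch).last_hidden_state
print(hidden.shape)  # (batch_size, sequence_length, 1024)
```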

## Pretraining Detail

We pretrain each encoder from scratch with a batch size of 2048 sequences and a sequence length of 512 for 250K steps, roughly 1/24 of the pretraining compute of XLM-R. Training takes roughly three weeks on 8 RTX6000 GPUs. We follow the pretraining recipe of RoBERTa (Liu et al., 2019) and XLM-R: we omit the next sentence prediction task and use a peak learning rate of 2e-4 with the Adam optimizer, a linear warmup of 10K steps followed by linear decay to 0, a multilingual sampling alpha of 0.3, and the fairseq (Ott et al., 2019) implementation.
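
To make the schedule concrete, here is a small illustrative sketch, not the actual fairseq training script, of the warmup-then-linear-decay learning rate and of language sampling with alpha = 0.3 under the common p ∝ n^alpha formulation, using the word counts quoted above; the helper names are ours.

```python
# Illustrative sketch of the recipe above (not the fairseq training code):
# linear warmup over 10K updates to a peak LR of 2e-4, then linear decay to 0
# by 250K updates, plus multilingual sampling with alpha = 0.3.
PEAK_LR = 2e-4
WARMUP_UPDATES = 10_000
TOTAL_UPDATES = 250_000

def learning_rate(step: int) -> float:
    """Learning rate at a given update under the warmup-then-decay schedule."""
    if step < WARMUP_UPDATES:
        return PEAK_LR * step / WARMUP_UPDATES
    remaining = max(TOTAL_UPDATES - step, 0)
    return PEAK_LR * remaining / (TOTAL_UPDATES - WARMUP_UPDATES)

def sampling_probs(word_counts: dict, alpha: float = 0.3) -> dict:
    """Language sampling probabilities proportional to (word count) ** alpha."""
    weights = {lang: count ** alpha for lang, count in word_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

print(learning_rate(10_000))                    # 2e-4 at the end of warmup
print(learning_rate(130_000))                   # 1e-4, halfway through decay
print(sampling_probs({"en": 26.8, "ar": 9.2}))  # Arabic is upsampled relative to its raw share
```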

## Citation

Please cite this paper for reference:

```bibtex
@inproceedings{yarmohammadi-etal-2021-everything,
    title = "Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction",
    author = "Yarmohammadi, Mahsa and
      Wu, Shijie and
      Marone, Marc and
      Xu, Haoran and
      Ebner, Seth and
      Qin, Guanghui and
      Chen, Yunmo and
      Guo, Jialiang and
      Harman, Craig and
      Murray, Kenton and
      White, Aaron Steven and
      Dredze, Mark and
      Van Durme, Benjamin",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    year = "2021",
}
```