An English-Arabic Bilingual Encoder

from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")

roberta-large-eng-ara-128k is an English-Arabic bilingual encoder: a 24-layer Transformer (d_model = 1024), the same size as XLM-R large. We pretrain on the same Common Crawl corpus as XLM-R, and additionally on English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train on 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B Arabic words/23.6B English words) or GigaBERT v4 (Lan et al., 2020) (4.3B Arabic words/6.1B English words). We build a joint English-Arabic vocabulary of 128K subwords using SentencePiece (Kudo and Richardson, 2018), and additionally enforce coverage of all Arabic characters after normalization.

Pretraining Details

We pretrain each encoder from scratch for 250K steps with a batch size of 2048 sequences and a sequence length of 512, roughly 1/24 of the pretraining compute of XLM-R. Training takes roughly three weeks on 8 RTX 6000 GPUs. We follow the pretraining recipe of RoBERTa (Liu et al., 2019) and XLM-R: we omit the next sentence prediction task and use the Adam optimizer with a peak learning rate of 2e-4, linear warmup over 10K steps followed by linear decay to 0, a multilingual sampling alpha of 0.3, and the fairseq (Ott et al., 2019) implementation.
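The numbers above can be turned into a small worked example. The following is a minimal sketch, not the fairseq code used for training: the function names are illustrative, the warmup is assumed to start from a learning rate of 0, and the sampling formula assumes the XLM-R style rule (probability proportional to corpus share raised to alpha), applied here to the word counts reported above as a stand-in for whatever size measure the training run actually used.

```python
# Hyperparameters as stated in this model card.
PEAK_LR = 2e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000
BATCH_SEQUENCES = 2048
SEQ_LEN = 512
ALPHA = 0.3

# Corpus sizes in words, as reported above (assumed proxy for sampling).
WORDS = {"ara": 9.2e9, "eng": 26.8e9}


def learning_rate(step):
    """Linear warmup from 0 to PEAK_LR, then linear decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


def sampling_probs(sizes, alpha):
    """Smoothed language sampling: p_i proportional to (n_i / N) ** alpha."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Upper bound on tokens seen, assuming every sequence is filled to SEQ_LEN.
tokens_per_step = BATCH_SEQUENCES * SEQ_LEN
total_tokens = tokens_per_step * TOTAL_STEPS

if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 130_000, 250_000):
        print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
    for lang, p in sampling_probs(WORDS, ALPHA).items():
        print(f"{lang}: raw share = {WORDS[lang] / sum(WORDS.values()):.3f}, "
              f"sampling prob = {p:.3f}")
    print(f"tokens per step: {tokens_per_step:,}")
    print(f"total tokens:    {total_tokens:,}")
```

Under these assumptions, each step covers about 1M tokens (2048 x 512), and alpha = 0.3 raises the Arabic sampling probability from its raw share of roughly 26% to roughly 42%, which is the intended effect of smoothing toward the lower-resource language.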

Citation

Please cite this paper for reference:

@inproceedings{yarmohammadi-etal-2021-everything,
    title = "Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction",
    author = "Yarmohammadi, Mahsa  and
      Wu, Shijie  and
      Marone, Marc  and
      Xu, Haoran  and
      Ebner, Seth  and
      Qin, Guanghui  and
      Chen, Yunmo and
      Guo, Jialiang and
      Harman, Craig  and
      Murray, Kenton and
      White, Aaron Steven  and
      Dredze, Mark and
      Van Durme, Benjamin",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    year = "2021",
}
