- ar
- en
- bert
- roberta
- exbert
license: mit
- arabic_billion_words
- cc100
- gigaword
- oscar
- wikipedia
# An English-Arabic Bilingual Encoder
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
`roberta-large-eng-ara-128k` is an English–Arabic bilingual encoders of 24-layer Transformers (d\_model= 1024), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining. Additionally, we also use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al.,2019). In total, we train with 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B words/23.6B words) or GigaBERT v4 (Lan et al., 2020) (4.3B words/6.1B words). We build an English–Arabic joint vocabulary using SentencePiece (Kudo and Richardson, 2018) with size of 128K. We additionally enforce coverage of all Arabic characters after normalization.
## Pretraining Detail
We pretrain each encoder with a batch size of 2048 sequences and 512 sequence length for 250K steps from scratch roughly 1/24 the amount of pretraining compute of XLM-R. Training takes 8 RTX6000 GPUs roughly three weeks. We follow the pretraining recipe of RoBERTa (Liu et al., 2019) and XLM-R. We omit the next sentence prediction task and use a learning rate of 2e-4, Adam optimizer, and linear warmup of 10K steps then decay linearly to 0, multilingual sampling alpha of 0.3, and the fairseq (Ott et al., 2019) implementation.
## Citation
Please cite this paper for reference:
title = "Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction",
author = "Yarmohammadi, Mahsa and
Wu, Shijie and
Marone, Marc and
Xu, Haoran and
Ebner, Seth and
Qin, Guanghui and
Chen, Yunmo and
Guo, Jialiang and
Harman, Craig and
Murray, Kenton and
White, Aaron Steven and
Dredze, Mark and
Van Durme, Benjamin",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
year = "2021",