metadata

language: ar
datasets:
  - wikipedia
  - Osian
  - 1.5B-Arabic-Corpus
  - oscar-arabic-unshuffled
  - Assafir(private)
  - Twitter(private)
widget:
  - text: ' عاصمة لبنان هي [MASK] .'

AraBERTv0.2-Twitter

AraBERTv0.2-Twitter-base/large are two new models for Arabic dialects and tweets, trained by continuing the pre-training with the MLM task on ~60M Arabic tweets (filtered from a collection of 100M).

The two new models have had emojis added to their vocabulary, along with common words that were not originally present. The pre-training was done with a maximum sentence length of 64 for only 1 epoch.

AraBERT is an Arabic pretrained language model based on Google's BERT architecture. AraBERT uses the same BERT-Base config. More details are available in the AraBERT Paper and in the AraBERT Meetup.

Other Models

Model	HuggingFace Model Name	Size (MB/Params)	Pre-Segmentation	DataSet (Sentences/Size/nWords)
AraBERTv0.2-base	bert-base-arabertv02	543MB / 136M	No	200M / 77GB / 8.6B
AraBERTv0.2-large	bert-large-arabertv02	1.38G / 371M	No	200M / 77GB / 8.6B
AraBERTv2-base	bert-base-arabertv2	543MB / 136M	Yes	200M / 77GB / 8.6B
AraBERTv2-large	bert-large-arabertv2	1.38G / 371M	Yes	200M / 77GB / 8.6B
AraBERTv0.1-base	bert-base-arabertv01	543MB / 136M	No	77M / 23GB / 2.7B
AraBERTv1-base	bert-base-arabert	543MB / 136M	Yes	77M / 23GB / 2.7B
AraBERTv0.2-Twitter-base	bert-base-arabertv02-twitter	543MB / 136M	No	Same as v02 + 60M Multi-Dialect Tweets
AraBERTv0.2-Twitter-large	bert-large-arabertv02-twitter	1.38G / 371M	No	Same as v02 + 60M Multi-Dialect Tweets

Preprocessing

The model was trained with a sequence length of 64; using a maximum length beyond 64 might result in degraded performance.
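One way to respect the 64-token limit is to cap the length at tokenization time. The sketch below is a minimal illustration, not part of the AraBERT code base: `encode_batch`, `load_tokenizer`, and `MAX_SEQ_LEN` are hypothetical names introduced here, and the only library behaviour assumed is the standard `transformers` tokenizer call with `truncation`, `padding`, and `max_length`.

```python
MAX_SEQ_LEN = 64  # sequence length used for the Twitter models' pre-training


def encode_batch(tokenizer, texts, max_length=MAX_SEQ_LEN):
    """Tokenize a list of texts, never exceeding MAX_SEQ_LEN tokens.

    `tokenizer` is expected to be a Hugging Face tokenizer (or any callable
    accepting the same keyword arguments).
    """
    # Cap the requested length instead of silently going past 64 tokens.
    capped = min(max_length, MAX_SEQ_LEN)
    return tokenizer(
        texts,
        truncation=True,       # cut off tokens beyond `capped`
        padding="max_length",  # pad shorter texts up to `capped`
        max_length=capped,
    )


def load_tokenizer(model_name="aubmindlab/bert-base-arabertv02-twitter"):
    """Load the tokenizer; imported lazily so the helper above has no hard dependency."""
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)
```

Calling `encode_batch(load_tokenizer(), ["..."])` would then return encodings whose sequences are at most 64 tokens long, even if a larger `max_length` is requested.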

It is recommended to apply our preprocessing function before training/testing on any dataset. The preprocessor will keep and space out emojis when used with a "twitter" model.
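To make "keep and space out emojis" concrete, here is a rough illustration of the idea. It is not the `ArabertPreprocessor` implementation: the function name `space_out_emojis` and the Unicode ranges below are assumptions chosen for this sketch and cover only common emoji blocks.

```python
import re

# Assumed, non-exhaustive emoji ranges: pictographs/emoticons and misc symbols.
EMOJI_PATTERN = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def space_out_emojis(text):
    """Surround every matched emoji with spaces, then collapse extra whitespace."""
    spaced = EMOJI_PATTERN.sub(lambda m: f" {m.group(0)} ", text)
    return " ".join(spaced.split())


print(space_out_emojis("مرحبا😀😀يا"))
```

Separating emojis this way lets each one be tokenized as its own vocabulary entry rather than being glued to a neighbouring word. For real use, rely on the preprocessor itself.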

from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "aubmindlab/bert-base-arabertv02-twitter"
arabert_prep = ArabertPreprocessor(model_name=model_name)

# Clean the raw text and keep the result for tokenization
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = arabert_prep.preprocess(text)

# Load the tokenizer and the masked-LM model from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
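Since the model is a masked language model, a quick sanity check is to fill in a `[MASK]` token, as in the widget example in the metadata. The following is a minimal sketch using the `transformers` fill-mask pipeline; `build_fill_mask` and `top_predictions` are hypothetical helper names, and the input is assumed to contain exactly one `[MASK]`.

```python
def build_fill_mask(model_name="aubmindlab/bert-base-arabertv02-twitter"):
    """Create a fill-mask pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily

    return pipeline("fill-mask", model=model_name)


def top_predictions(fill_mask, text, k=5):
    """Return (token, score) pairs for the k most likely fillers of [MASK].

    Assumes `text` contains a single [MASK], so the pipeline returns a flat
    list of dicts with `token_str` and `score` keys.
    """
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]
```

For example, `top_predictions(build_fill_mask(), " عاصمة لبنان هي [MASK] .")` would list the model's candidate completions for the widget sentence with their probabilities.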

If you used this model, please cite us as:

Google Scholar has our BibTeX wrong (missing name); use this instead:

@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}

Acknowledgments

Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts

Wissam Antoun: LinkedIn | Twitter | GitHub | wfa07@mail.aub.edu | wissam.antoun@gmail.com

Fady Baly: LinkedIn | Twitter | GitHub | fgb06@mail.aub.edu | baly.fady@gmail.com