metadata
language: ar
datasets:
  - wikipedia
  - Osian
  - 1.5B-Arabic-Corpus
  - oscar-arabic-unshuffled
  - Assafir(private)
  - Twitter(private)
widget:
  - text: ' عاصمة لبنان هي [MASK] .'

AraBERTv0.2-Twitter

AraBERTv0.2-Twitter-base/large are two new models for Arabic dialects and tweets, trained by continuing the pre-training with the MLM task on ~60M Arabic tweets (filtered from a collection of 100M).

The two new models have had emojis added to their vocabulary, in addition to common words that were not originally present. The pre-training used a maximum sequence length of 64 and ran for only 1 epoch.
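For readers unfamiliar with the MLM objective mentioned above, here is a minimal sketch of the standard BERT masking recipe: roughly 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random vocabulary token, and 10% are left unchanged. This is an illustration only. The model card does not state the exact masking settings used for the tweet pre-training, and `mask_tokens`, `vocab`, and the seed argument are hypothetical names, not part of the AraBERT code.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token list with the standard BERT MLM recipe (assumed settings).

    Returns (corrupted_tokens, labels); labels hold the original token at
    selected positions and None elsewhere (positions ignored by the loss).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

Calling `mask_tokens(tokens, vocab, seed=0)` on a tokenized tweet returns a corrupted copy of the same length together with the prediction targets.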

AraBERT is an Arabic pretrained language model based on Google's BERT architecture. AraBERT uses the same BERT-Base config. More details are available in the AraBERT Paper and in the AraBERT Meetup.

Other Models

| Model | HuggingFace Model Name | Size (MB/Params) | Pre-Segmentation | DataSet (Sentences/Size/nWords) |
|---|---|---|---|---|
| AraBERTv0.2-base | bert-base-arabertv02 | 543MB / 136M | No | 200M / 77GB / 8.6B |
| AraBERTv0.2-large | bert-large-arabertv02 | 1.38GB / 371M | No | 200M / 77GB / 8.6B |
| AraBERTv2-base | bert-base-arabertv2 | 543MB / 136M | Yes | 200M / 77GB / 8.6B |
| AraBERTv2-large | bert-large-arabertv2 | 1.38GB / 371M | Yes | 200M / 77GB / 8.6B |
| AraBERTv0.1-base | bert-base-arabertv01 | 543MB / 136M | No | 77M / 23GB / 2.7B |
| AraBERTv1-base | bert-base-arabert | 543MB / 136M | Yes | 77M / 23GB / 2.7B |
| AraBERTv0.2-Twitter-base | bert-base-arabertv02-twitter | 543MB / 136M | No | Same as v02 + 60M Multi-Dialect Tweets |
| AraBERTv0.2-Twitter-large | bert-large-arabertv02-twitter | 1.38GB / 371M | No | Same as v02 + 60M Multi-Dialect Tweets |

Preprocessing

The model was trained with a sequence length of 64. Using a max length beyond 64 might result in degraded performance.
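One way to respect the 64-token limit is to cap `max_length` whenever the tokenizer is called. The sketch below assumes the standard Hugging Face tokenizer call arguments (`truncation`, `padding`, `max_length`); the helper names `tokenizer_kwargs` and `encode` are hypothetical and not part of the AraBERT package.

```python
MAX_SEQ_LEN = 64  # sequence length stated in this model card


def tokenizer_kwargs(max_length=MAX_SEQ_LEN):
    """Keyword arguments for a Hugging Face tokenizer call, capped at 64 tokens."""
    return {
        "truncation": True,                          # cut inputs longer than max_length
        "padding": "max_length",                     # pad shorter inputs to max_length
        "max_length": min(max_length, MAX_SEQ_LEN),  # never exceed the trained length
    }


def encode(texts, model_name="aubmindlab/bert-base-arabertv02-twitter"):
    """Tokenize a string or list of strings (downloads the tokenizer on first use)."""
    # Imported lazily so tokenizer_kwargs() works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(texts, **tokenizer_kwargs())
```

Requesting a larger value, for example `tokenizer_kwargs(128)`, still yields `max_length=64`, so longer inputs are truncated rather than passed to the model at an untrained length.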

It is recommended to apply our preprocessing function before training/testing on any dataset. The preprocessor will keep and space out emojis when used with a "twitter" model.

from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "aubmindlab/bert-base-arabertv02-twitter"
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
# preprocess() returns the cleaned string, so keep the result
text_preprocessed = arabert_prep.preprocess(text)

# Load the tokenizer and the masked-LM model from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
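As a quick check that the checkpoint loads, the widget sentence from the metadata can be run through the Transformers `fill-mask` pipeline. This is a sketch, not the authors' evaluation code: `fill_mask` and `summarize` are hypothetical helper names, and the only assumption about the pipeline output is its documented shape, a list of dicts with `token_str` and `score` keys.

```python
def summarize(results):
    """Reduce fill-mask pipeline output to (token, rounded score) pairs."""
    return [(r["token_str"], round(r["score"], 3)) for r in results]


def fill_mask(text, model_name="aubmindlab/bert-base-arabertv02-twitter", top_k=5):
    """Return the top_k candidate tokens for the single [MASK] in text."""
    # Imported lazily so summarize() can be used without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)
    return summarize(unmasker(text, top_k=top_k))


if __name__ == "__main__":
    # Widget sentence from the model card metadata ("The capital of Lebanon is [MASK].")
    print(fill_mask("عاصمة لبنان هي [MASK] ."))
```

The first call downloads the model weights, so it needs network access.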

If you used this model, please cite us as:

Google Scholar has our BibTeX wrong (missing name); use this instead:

@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}

Acknowledgments

Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for their continuous support, to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts

Wissam Antoun: LinkedIn | Twitter | GitHub | wfa07@mail.aub.edu | wissam.antoun@gmail.com

Fady Baly: LinkedIn | Twitter | GitHub | fgb06@mail.aub.edu | baly.fady@gmail.com