Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/aubmindlab/bert-base-arabertv01/README.md
README.md
ADDED
---
language: ar
---

# AraBERT : Pre-training BERT for Arabic Language Understanding
<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>

**AraBERT** is an Arabic pretrained language model based on [Google's BERT architecture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT paper](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT meetup](https://github.com/WissamAntoun/pydata_khobar_meetup).

There are two versions of the model, AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were split using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).

The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words. The training corpora are a collection of publicly available large-scale raw Arabic text ([Arabic Wikipedia dumps](https://archive.org/details/arwiki-20190201), [the 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4), [the OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619), Assafir news articles, and 4 other manually crawled news websites (Al-Akhbar, Annahar, AL-Ahram, AL-Wafd) from [the Wayback Machine](http://web.archive.org/)).

We evaluate both AraBERT models on different downstream tasks and compare them to [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) and other state-of-the-art models (*to the best of our knowledge*). The tasks are Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSAS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition on the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp) dataset, and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL).

**Update 2 (21/5/2020):**
Added support for the [farasapy](https://github.com/MagedSaeed/farasapy) segmenter in ``preprocess_arabert.py``, which is ~6x faster than the ``py4j.java_gateway`` approach. Consider setting ``use_farasapy=True`` when calling ``preprocess`` and passing it an instance of ``FarasaSegmenter(interactive=True)``; ``interactive=True`` gives faster segmentation.
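
A minimal sketch of the call described above (it reuses the same arguments shown in the full example under "How to use" below; the sample sentence is just a placeholder):
```python
from farasa.segmenter import FarasaSegmenter
from arabert.preprocess_arabert import preprocess

# interactive=True for faster segmentation (see Update 2)
farasa_segmenter = FarasaSegmenter(interactive=True)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = preprocess(text,
                               do_farasa_tokenization=True,
                               farasa=farasa_segmenter,
                               use_farasapy=True)
```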

**Update 1 (21/4/2020):**
Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change of ```answer_start``` during preprocessing.
## Results (Acc.)

Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
---|:---:|:---:|:---:|:---:
HARD |95.7 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
ASTD |86.5 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|80.1|92.2|**92.6**
ArsenTD-Lev|52.4 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
AJGT|93 [Dahou et al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83.6|93.1|**93.8**
LABR|**87.5** [Dahou et al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
ARCD|mBERT|EM:34.2 F1:61.3|EM:51.14 F1:82.13|**EM:54.84 F1:82.15**

*If you tested AraBERT on a public dataset and want to add your results to the table above, open a pull request or contact us. Also make sure your code is available online so we can add it as a reference.*

## How to use

You can easily use AraBERT since it is almost fully compatible with existing codebases (use the [AraBERT repo](https://github.com/aub-mind/arabert) instead of the official BERT one; the only difference is in the ```tokenization.py``` file, where we modify the ```_is_punctuation``` function to make it compatible with the "+" symbol and the "[" and "]" characters).

To use HuggingFace's Transformers library, you only need to provide a list of tokens that the tokenizer must never split, and make sure that the text is pre-segmented:
**Note: not all libraries built on top of transformers support the `never_split` argument.**
```python
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess_arabert import never_split_tokens, preprocess
from farasa.segmenter import FarasaSegmenter

arabert_tokenizer = AutoTokenizer.from_pretrained(
    "aubmindlab/bert-base-arabert",
    do_lower_case=False,
    do_basic_tokenize=True,
    never_split=never_split_tokens)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")

# Preprocess the text to make it compatible with AraBERT using farasapy
farasa_segmenter = FarasaSegmenter(interactive=True)

# Alternatively, you can use a py4j JavaGateway to the Farasa Segmenter .jar, but it's slower
# (see Update 2):
# from py4j.java_gateway import JavaGateway
# gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
# farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = preprocess(text,
                               do_farasa_tokenization=True,
                               farasa=farasa_segmenter,
                               use_farasapy=True)
# text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"

print(arabert_tokenizer.tokenize(text_preprocessed))
# ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
```
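
Building on the variables from the block above, a minimal sketch (not part of the original card) of feeding the pre-segmented text to the model with standard `transformers` calls to get contextual embeddings:
```python
import torch

# Encode the pre-segmented text and run a forward pass through AraBERT
inputs = arabert_tokenizer(text_preprocessed, return_tensors="pt")
with torch.no_grad():
    outputs = arabert_model(**inputs)

# The first output holds the last hidden states: (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs[0]
print(last_hidden_states.shape)
```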

**AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
```python
from transformers import AutoTokenizer, AutoModel

arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01", do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
print(arabert_tokenizer.tokenize(text))
# ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
```
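
Since AraBERTv0.1 needs no pre-segmentation, a quick way to sanity-check the checkpoint is the standard `fill-mask` pipeline (a minimal sketch, not from the original card; the sentence is an arbitrary example with one word masked):
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv01")

# Predict the token hidden behind BERT's [MASK] placeholder
for prediction in fill_mask("ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر [MASK] في زمننا هذا ضروري"):
    print(prediction["token_str"], prediction["score"])
```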

The ```araBERT_(Updated_Demo_TF).ipynb``` notebook is a small demo on the AJGT dataset using TensorFlow (GPU and TPU compatible).

**Coming Soon:** Fine-tuning demo using HuggingFace's Trainer API

**AraBERT on ARCD**
During the preprocessing step, the ```answer_start``` character position needs to be recalculated. You can use ```arcd_preprocessing.py``` as shown below to clean and preprocess the ARCD dataset before running ```run_squad.py```. A more detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
```bash
python arcd_preprocessing.py \
    --input_file="/PATH_TO/arcd-test.json" \
    --output_file="arcd-test-pre.json" \
    --do_farasa_tokenization=True \
    --use_farasapy=True
```
```bash
python SOQAL/bert/run_squad.py \
    --vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
    --bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
    --init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
    --do_train=True \
    --train_file=turk_combined_all_pre.json \
    --do_predict=True \
    --predict_file=arcd-test-pre.json \
    --train_batch_size=32 \
    --predict_batch_size=24 \
    --learning_rate=3e-5 \
    --num_train_epochs=4 \
    --max_seq_length=384 \
    --doc_stride=128 \
    --do_lower_case=False \
    --output_dir="/PATH_TO/OUTPUT_PATH/" \
    --use_tpu=True \
    --tpu_name=$TPU_ADDRESS
```
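
To illustrate why ```answer_start``` has to be recomputed (this is only a conceptual sketch with a hypothetical helper, not code from ```arcd_preprocessing.py```; it assumes the preprocessed answer string still occurs verbatim in the preprocessed context):
```python
def realign_answer_start(preprocessed_context: str, preprocessed_answer: str) -> int:
    """Return the new character offset of the answer inside the preprocessed
    context, or -1 if the answer can no longer be found verbatim."""
    # Segmentation inserts "+" markers and extra spaces, so the original
    # answer_start character offset no longer points at the answer.
    return preprocessed_context.find(preprocessed_answer)

new_start = realign_answer_start(
    "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري",
    "ال+ مكتب",
)
print(new_start)  # new answer_start for the preprocessed SQuAD-style example
```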

## Model Weights and Vocab Download

Models | AraBERTv0.1 | AraBERTv1
---|:---:|:---:
TensorFlow|[Drive Link](https://drive.google.com/open?id=1-kVmTUZZ4DP2rzeHNjTPkY8OjnQCpomO) | [Drive Link](https://drive.google.com/open?id=1-d7-9ljKgDJP5mx73uBtio-TuUZCqZnt)
PyTorch| [Drive Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7yaJH-IOv)| [Drive Link](https://drive.google.com/open?id=1-69s6Pxqbi63HOQ1M9wTcr-Ovc6PWLLo)

**You can find the PyTorch models in HuggingFace's Transformers library under the ```aubmindlab``` username.**

## If you used this model please cite us as:
```bibtex
@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}
```

## Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, we couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for the continuous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks to [Habib Rahal](https://www.behance.net/rahalhabib) for putting a face to AraBERT.

## Contacts
**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>

**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>