---
language: ar
---

# AraBERT : Pre-training BERT for Arabic Language Understanding
<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>

**AraBERT** is an Arabic pretrained language model based on [Google's BERT architecture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT paper](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT meetup](https://github.com/WissamAntoun/pydata_khobar_meetup).

There are two versions of the model, AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were split using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).

The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words. The training corpora are a collection of publicly available large-scale raw Arabic text ([Arabic Wikidumps](https://archive.org/details/arwiki-20190201), [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4), [The OSIAN Corpus](https://www.aclweb.org/anthology/W19-4619), Assafir news articles, and 4 other manually crawled news websites (Al-Akhbar, Annahar, AL-Ahram, AL-Wafd) from [the Wayback Machine](http://web.archive.org/)).

We evaluate both AraBERT models on different downstream tasks and compare them to [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) and other state-of-the-art models (*to the best of our knowledge*). The tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL).

**Update 2 (21/5/2020):**
Added support for the [farasapy](https://github.com/MagedSaeed/farasapy) segmenter in ``preprocess_arabert.py``, which is ~6x faster than the ``py4j.java_gateway``. Consider setting ``use_farasapy=True`` when calling ``preprocess`` and passing it an instance of ``FarasaSegmenter(interactive=True)``, with ``interactive`` set to ``True`` for faster segmentation.

**Update 1 (21/4/2020):**
Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change of the ```answer_start``` character positions during preprocessing.
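
For illustration only (this is not the code in ```arcd_preprocessing.py```), the fix boils down to re-locating the answer text inside the preprocessed context, since segmentation shifts every character offset:

```python
# Hypothetical helper: after preprocessing/segmentation the stored
# ``answer_start`` offset no longer points at the answer, so the span has to
# be found again in the new context (returns -1 if it is no longer verbatim).
def realign_answer_start(preprocessed_context: str, preprocessed_answer: str) -> int:
    return preprocessed_context.find(preprocessed_answer)
```
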
## Results (Acc.)
Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
---|:---:|:---:|:---:|:---:
HARD |95.7 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
ASTD |86.5 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|80.1|92.2|**92.6**
ArsenTD-Lev|52.4 [ElJundi et al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
AJGT|93 [Dahou et al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83.6|93.1|**93.8**
LABR|**87.5** [Dahou et al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
ARCD|mBERT|EM: 34.2 F1: 61.3|EM: 51.14 F1: 82.13|**EM: 54.84 F1: 82.15**

*If you tested AraBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference.*

## How to use

You can easily use AraBERT since it is almost fully compatible with existing codebases (use this repo instead of the official BERT one; the only difference is in the ```tokenization.py``` file, where we modify the ```_is_punctuation``` function to make it compatible with the "+" symbol and the "[" and "]" characters).
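
As a rough illustration of that change (a hypothetical sketch, not the exact code in ```tokenization.py```), the idea is simply to stop treating these three characters as punctuation so that segmentation markers such as ``ال+`` and ``+نا`` survive basic tokenization:

```python
import unicodedata

# Hypothetical sketch of the modified _is_punctuation: whitelist "+", "[" and "]"
# so the basic tokenizer does not split them off, then fall back to the
# standard BERT punctuation test.
def _is_punctuation(char):
    if char in ("+", "[", "]"):
        return False
    cp = ord(char)
    if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    return unicodedata.category(char).startswith("P")
```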

To use HuggingFace's Transformers library, you only need to provide a list of tokens that the model must not split; also make sure that the text is pre-segmented:
**Not all libraries built on top of transformers support the `never_split` argument**
```python
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess_arabert import never_split_tokens, preprocess
from farasa.segmenter import FarasaSegmenter

arabert_tokenizer = AutoTokenizer.from_pretrained(
    "aubmindlab/bert-base-arabert",
    do_lower_case=False,
    do_basic_tokenize=True,
    never_split=never_split_tokens)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")

# Preprocess the text to make it compatible with AraBERT using farasapy
farasa_segmenter = FarasaSegmenter(interactive=True)

# or you can use a py4j JavaGateway to the Farasa segmenter .jar, but it's slower
# (see Update 2)
# from py4j.java_gateway import JavaGateway
# gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
# farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = preprocess(text,
                               do_farasa_tokenization=True,
                               farasa=farasa_segmenter,
                               use_farasapy=True)

# >>> text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"

arabert_tokenizer.tokenize(text_preprocessed)

# >>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
```

**AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
```python
from transformers import AutoTokenizer, AutoModel

arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01", do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
arabert_tokenizer.tokenize(text)

# >>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
```

The ```araBERT_(Updated_Demo_TF).ipynb``` notebook is a small demo on the AJGT dataset using TensorFlow (GPU and TPU compatible).

**Coming soon:** Fine-tuning demo using HuggingFace's Trainer API
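
In the meantime, here is a minimal sketch of what such fine-tuning could look like. This is not the official demo: the dataset below is a toy placeholder you would replace with real labeled examples (e.g. from AJGT), and the output path and hyperparameters are illustrative only.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)


class ToySentimentDataset(torch.utils.data.Dataset):
    """Tiny placeholder dataset; swap in real labeled Arabic sentences."""

    def __init__(self, texts, labels, tokenizer):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


# AraBERTv0.1 needs no pre-segmentation, so the raw text can be tokenized directly
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01")
model = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv01", num_labels=2)

train_dataset = ToySentimentDataset(["النص الأول", "النص الثاني"], [1, 0], tokenizer)

training_args = TrainingArguments(
    output_dir="./arabert-finetuned",   # hypothetical output path
    num_train_epochs=4,
    per_device_train_batch_size=32,
    learning_rate=3e-5,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```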

**AraBERT on ARCD**
During the preprocessing step the ```answer_start``` character position needs to be recalculated. You can use ```arcd_preprocessing.py``` as shown below to clean and preprocess the ARCD dataset before running ```run_squad.py```. A more detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
```bash
python arcd_preprocessing.py \
    --input_file="/PATH_TO/arcd-test.json" \
    --output_file="arcd-test-pre.json" \
    --do_farasa_tokenization=True \
    --use_farasapy=True
```
```bash
python SOQAL/bert/run_squad.py \
    --vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
    --bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
    --init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
    --do_train=True \
    --train_file=turk_combined_all_pre.json \
    --do_predict=True \
    --predict_file=arcd-test-pre.json \
    --train_batch_size=32 \
    --predict_batch_size=24 \
    --learning_rate=3e-5 \
    --num_train_epochs=4 \
    --max_seq_length=384 \
    --doc_stride=128 \
    --do_lower_case=False \
    --output_dir="/PATH_TO/OUTPUT_PATH/" \
    --use_tpu=True \
    --tpu_name=$TPU_ADDRESS
```
## Model Weights and Vocab Download
Models | AraBERTv0.1 | AraBERTv1
---|:---:|:---:
TensorFlow|[Drive Link](https://drive.google.com/open?id=1-kVmTUZZ4DP2rzeHNjTPkY8OjnQCpomO) | [Drive Link](https://drive.google.com/open?id=1-d7-9ljKgDJP5mx73uBtio-TuUZCqZnt)
PyTorch| [Drive Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7yaJH-IOv)| [Drive Link](https://drive.google.com/open?id=1-69s6Pxqbi63HOQ1M9wTcr-Ovc6PWLLo)

**You can find the PyTorch models in HuggingFace's Transformers library under the ```aubmindlab``` username**

## If you used this model please cite us as:
```
@inproceedings{antoun2020arabert,
  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
  pages={9}
}
```
## Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for the continuous support, to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

## Contacts
**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>

**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>