language:
  - ar
license: apache-2.0
widget:
  - text: الهدف من الحياة هو [MASK] .

CAMeLBERT: A collection of pre-trained models for Arabic NLP tasks

Model description

CAMeLBERT is a collection of BERT models pre-trained on Arabic texts of different sizes and variants, namely Modern Standard Arabic (MSA), dialectal Arabic (DA), and classical Arabic (CA). The details are described in the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models." We release eight models with different sizes and variants as follows:

| Model | Variant | Size | #Word |
|---|---|---|---|
| bert-base-camelbert-mix | CA,DA,MSA | 167GB | 17.3B |
| bert-base-camelbert-ca | CA | 6GB | 847M |
| bert-base-camelbert-da | DA | 54GB | 5.8B |
| bert-base-camelbert-msa | MSA | 107GB | 12.6B |
| bert-base-camelbert-msa-half | MSA | 53GB | 6.3B |
| bert-base-camelbert-msa-quarter | MSA | 27GB | 3.1B |
| bert-base-camelbert-msa-eighth | MSA | 14GB | 1.6B |
| bert-base-camelbert-msa-sixteenth | MSA | 6GB | 746M |

This model card describes CAMeLBERT-Mix (bert-base-camelbert-mix), a model pre-trained on a mixture of these variants: CA, DA, and MSA.

Intended uses

You can use the released model for either masked language modeling or next sentence prediction. However, it is mostly intended to be fine-tuned on an NLP task, such as NER, POS tagging, sentiment analysis, dialect identification, or poetry classification. We release our fine-tuning code here.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-camelbert-mix')
>>> unmasker("الهدف من الحياة هو [MASK] .")
[{'sequence': '[CLS] الهدف من الحياة هو النجاح. [SEP]',
  'score': 0.10861027985811234,
  'token': 6232,
  'token_str': 'النجاح'},
 {'sequence': '[CLS] الهدف من الحياة هو.. [SEP]',
  'score': 0.07626965641975403,
  'token': 18,
  'token_str': '.'},
 {'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
  'score': 0.05131986364722252,
  'token': 3696,
  'token_str': 'الحياة'},
 {'sequence': '[CLS] الهدف من الحياة هو الموت. [SEP]',
  'score': 0.03734956309199333,
  'token': 4295,
  'token_str': 'الموت'},
 {'sequence': '[CLS] الهدف من الحياة هو العمل. [SEP]',
  'score': 0.027189988642930984,
  'token': 2854,
  'token_str': 'العمل'}]

Note: Downloading our models through the library requires transformers>=3.5.0. With an older version, you can download the models manually.
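The list returned by the pipeline above can be post-processed with plain Python. The following is a minimal sketch, not part of the released code: the helper `top_tokens` and its `min_score` parameter are hypothetical names introduced here, and the sample data is copied from two entries of the output shown above.

```python
def top_tokens(predictions, min_score=0.0):
    """Return (token_str, score) pairs sorted by descending score.

    `predictions` is assumed to have the shape of the fill-mask output
    shown above: a list of dicts with 'token_str' and 'score' keys.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], p["score"]) for p in ranked if p["score"] >= min_score]


# Two entries copied from the pipeline output above (order shuffled on purpose).
sample = [
    {"score": 0.07626965641975403, "token": 18, "token_str": "."},
    {"score": 0.10861027985811234, "token": 6232, "token_str": "النجاح"},
]

for token, score in top_tokens(sample, min_score=0.05):
    print(f"{token}\t{score:.3f}")
```

Filtering on the score is one simple way to drop low-confidence fillers such as punctuation; the threshold itself is a judgment call.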

Here is how to use this model to get the features of a given text in PyTorch:

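from transformers import AutoTokenizer, AutoModel
# Load the tokenizer and the bare BERT encoder (no task-specific head)
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-mix')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-mix')
text = "مرحبا يا عالم."
# Tokenize into PyTorch tensors and run a forward pass
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output[0] is the last hidden state, with one vector per input token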

and in TensorFlow:

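from transformers import AutoTokenizer, TFAutoModel
# Load the tokenizer and the TensorFlow version of the bare BERT encoder
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-mix')
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-mix')
text = "مرحبا يا عالم."
# Tokenize into TensorFlow tensors and run a forward pass
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
# output[0] is the last hidden state, with one vector per input token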

Training data

Training procedure

We use the original implementation released by Google for pre-training. We follow the original English BERT model's hyperparameters for pre-training, unless otherwise specified.

Preprocessing

  • After extracting the raw text from each corpus, we apply the following pre-processing.
  • We first remove invalid characters and normalize white spaces using the utilities provided by the original BERT implementation.
  • We also remove lines without any Arabic characters.
  • We then remove diacritics and kashida using CAMeL Tools.
  • Finally, we split each line into sentences with a heuristics-based sentence segmenter.
  • We train a WordPiece tokenizer on the entire dataset (167 GB text) with a vocabulary size of 30,000 using HuggingFace's tokenizers.
  • We neither lowercase letters nor strip accents.
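The cleaning steps listed above can be approximated with the standard library alone. This is a rough sketch under stated assumptions, not the authors' pipeline: the Unicode ranges stand in for what CAMeL Tools removes, the sentence splitter is a naive placeholder for the heuristics-based segmenter, and the names `clean_line` and `preprocess` are hypothetical.

```python
import re

# Assumption: these code points approximate the diacritics that get removed
# (fathatan through sukun, plus superscript alef). Not taken from CAMeL Tools.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Kashida (tatweel) is the elongation character U+0640.
KASHIDA = re.compile(r"\u0640")
# Basic Arabic letters, excluding the kashida at U+0640.
ARABIC_LETTER = re.compile(r"[\u0621-\u063F\u0641-\u064A]")
# Hypothetical stand-in for the sentence segmenter: split on white space
# that follows '.', '!', '?' or the Arabic question mark.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")


def clean_line(line):
    """Return the cleaned line, or None if it contains no Arabic letters."""
    line = " ".join(line.split())  # normalize white space
    if not ARABIC_LETTER.search(line):
        return None
    line = DIACRITICS.sub("", line)
    return KASHIDA.sub("", line)


def preprocess(lines):
    """Clean each line and split the survivors into sentences."""
    sentences = []
    for raw in lines:
        cleaned = clean_line(raw)
        if cleaned is None:
            continue
        sentences.extend(s for s in SENTENCE_END.split(cleaned) if s)
    return sentences
```

For real data, the CAMeL Tools utilities mentioned in the list are the better choice; the regular expressions here only illustrate the order of operations.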

Pre-training

  • The model was trained on a single cloud TPU (v3-8) for one million steps in total.
  • The first 90,000 steps used a batch size of 1,024; the remaining steps used a batch size of 256.
  • The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%.
  • We use whole word masking and a duplicate factor of 10.
  • We set max predictions per sequence to 20 for the dataset with max sequence length of 128 tokens and 80 for the dataset with max sequence length of 512 tokens.
  • We use a random seed of 12345, masked language model probability of 0.15, and short sequence probability of 0.1.
  • The optimizer used is Adam with a learning rate of 1e-4, $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, a weight decay of 0.01, learning rate warmup for 10,000 steps, and linear decay of the learning rate afterwards.
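The schedule described in the list above can be sketched in a few lines of plain Python. The constants come from the list; the function names are hypothetical. One assumption is labeled in the code: the linear decay is measured from step 0 and runs to zero at the final step, which mirrors the original BERT optimizer but is not confirmed here for this model.

```python
TOTAL_STEPS = 1_000_000        # one million steps in total
WARMUP_STEPS = 10_000          # learning rate warmup
PEAK_LR = 1e-4                 # Adam learning rate
LARGE_BATCH_STEPS = 90_000     # first phase, batch size 1,024
LARGE_BATCH, SMALL_BATCH = 1024, 256


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay towards zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: decay is computed from step 0 and ends at TOTAL_STEPS,
    # as in the original BERT implementation.
    return PEAK_LR * max(0.0, 1.0 - step / TOTAL_STEPS)


def sequences_seen():
    """Total number of training sequences implied by the two batch sizes."""
    small_batch_steps = TOTAL_STEPS - LARGE_BATCH_STEPS
    return LARGE_BATCH_STEPS * LARGE_BATCH + small_batch_steps * SMALL_BATCH


def steps_per_sequence_length():
    """Steps spent at each maximum sequence length (90% at 128, 10% at 512)."""
    short_steps = TOTAL_STEPS * 90 // 100
    return {128: short_steps, 512: TOTAL_STEPS - short_steps}
```

Under these figures the model sees 325,120,000 sequences, with 900,000 steps at a maximum length of 128 tokens and 100,000 steps at 512.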

Evaluation results

  • We evaluate our pre-trained language models on five NLP tasks: NER, POS tagging, sentiment analysis, dialect identification, and poetry classification.
  • We fine-tune and evaluate the models using 12 datasets.
  • We used Hugging Face's transformers to fine-tune our CAMeLBERT models.
  • We used transformers v3.1.0 along with PyTorch v1.5.1.
  • The fine-tuning was done by adding a fully connected linear layer to the last hidden state.
  • We use the $F_{1}$ score as a metric for all tasks.
  • Code used for fine-tuning is available here.
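For readers unfamiliar with the metric named in the list above, here is a small illustrative computation of the F1 score as the harmonic mean of precision and recall. It is a sketch only: the list does not say which F1 variant is used for each task (for example, entity-level for NER), so the macro averaging over labels shown here is an assumption, and the function names are hypothetical.

```python
def f1_for_label(gold, pred, label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Assumption: unweighted mean of per-label F1 over all observed labels."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Toy labels, invented purely for illustration.
gold = ["pos", "neg", "pos", "neg"]
pred = ["pos", "pos", "pos", "neg"]
print(round(macro_f1(gold, pred), 3))
```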

Results

| Task | Dataset | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
|---|---|---|---|---|---|---|---|---|---|---|
| NER | ANERcorp | MSA | 80.2% | 66.2% | 74.2% | 82.4% | 82.3% | 82.0% | 82.3% | 80.5% |
| POS | PATB (MSA) | MSA | 97.3% | 96.6% | 96.5% | 97.4% | 97.4% | 97.4% | 97.4% | 97.4% |
| | ARZTB (EGY) | DA | 90.1% | 88.6% | 89.4% | 90.8% | 90.3% | 90.5% | 90.5% | 90.4% |
| | Gumar (GLF) | DA | 97.3% | 96.5% | 97.0% | 97.1% | 97.0% | 97.0% | 97.1% | 97.0% |
| SA | ASTD | MSA | 76.3% | 69.4% | 74.6% | 76.9% | 76.0% | 76.8% | 76.7% | 75.3% |
| | ArSAS | MSA | 92.7% | 89.4% | 91.8% | 93.0% | 92.6% | 92.5% | 92.5% | 92.3% |
| | SemEval | MSA | 69.0% | 58.5% | 68.4% | 72.1% | 70.7% | 72.8% | 71.6% | 71.2% |
| DID | MADAR-26 | DA | 62.9% | 61.9% | 61.8% | 62.6% | 62.0% | 62.8% | 62.0% | 62.2% |
| | MADAR-6 | DA | 92.5% | 91.5% | 92.2% | 91.9% | 91.8% | 92.2% | 92.1% | 92.0% |
| | MADAR-Twitter-5 | MSA | 75.7% | 71.4% | 74.2% | 77.6% | 78.5% | 77.3% | 77.7% | 76.2% |
| | NADI | DA | 24.7% | 17.3% | 20.1% | 24.9% | 24.6% | 24.6% | 24.9% | 23.8% |
| Poetry | APCD | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |

Results (Average)

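| | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
|---|---|---|---|---|---|---|---|---|---|
| Variant-wise-average[1] | MSA | 81.9% | 75.3% | 79.9% | 83.2% | 82.9% | 83.1% | 83.0% | 82.1% |
| | DA | 73.5% | 71.1% | 72.1% | 73.5% | 73.1% | 73.4% | 73.3% | 73.1% |
| | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |
| Macro-Average | ALL | 78.2% | 74.0% | 76.6% | 78.9% | 78.6% | 78.8% | 78.7% | 78.2% |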

[1]: Variant-wise-average refers to the average over the group of datasets in the same language variant.
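As a worked check of how the averages relate to the per-dataset results, the sketch below recomputes the Mix column from the rounded scores in the Results table. The function names are hypothetical. Treating the macro-average as the mean over all 12 datasets is an inference that matches the Mix figures, not something stated in the text. Because the inputs are already rounded, other columns can differ from the published averages by about 0.1.

```python
# (dataset, variant, Mix score in %) copied from the Results table.
MIX_SCORES = [
    ("ANERcorp", "MSA", 80.2),
    ("PATB", "MSA", 97.3),
    ("ARZTB", "DA", 90.1),
    ("Gumar", "DA", 97.3),
    ("ASTD", "MSA", 76.3),
    ("ArSAS", "MSA", 92.7),
    ("SemEval", "MSA", 69.0),
    ("MADAR-26", "DA", 62.9),
    ("MADAR-6", "DA", 92.5),
    ("MADAR-Twitter-5", "MSA", 75.7),
    ("NADI", "DA", 24.7),
    ("APCD", "CA", 79.8),
]


def variant_average(scores, variant):
    """Mean score over the datasets that belong to one language variant."""
    values = [score for _, v, score in scores if v == variant]
    return sum(values) / len(values)


def macro_average(scores):
    """Assumption: mean over all datasets, regardless of variant."""
    return sum(score for _, _, score in scores) / len(scores)


for variant in ("MSA", "DA", "CA"):
    print(variant, f"{variant_average(MIX_SCORES, variant):.1f}")
print("ALL", f"{macro_average(MIX_SCORES):.1f}")
```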

Acknowledgements

This research was supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).

Citation

@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}