Model Card for Arabic PL-BERT Models

This model card describes a collection of three Arabic BERT models trained with different objectives and datasets for phoneme-aware language modeling.

Model Details

Model Description

These models are Arabic adaptations of the PL-BERT (Phoneme-Level BERT) approach introduced in Li et al. (2023). The models incorporate phonemic information to enhance language understanding, with variations in training objectives and data preprocessing.

The collection includes three models:

  • mlm_p2g_non_diacritics: Trained with both MLM (Masked Language Modeling) and P2G (Phoneme-to-Grapheme) objectives on non-diacritized Arabic text
  • mlm_only_non_diacritics: Trained with only the MLM objective on non-diacritized Arabic text
  • mlm_only_with_diacritics: Fine-tuned version of mlm_only_non_diacritics on diacritized Arabic text

Developed by: Fadi (GitHub: Fadi987)
Model type: Transformer-based language models (BERT variants)
Language: Arabic

Model Sources

Training Details

Training Data

All models were initially trained on a cleaned version of the Arabic Wikipedia dataset. The dataset is available at wikipedia.20231101.ar.

For the mlm_only_with_diacritics model, a random sample of 200,000 entries (out of approximately 1.2 million) was selected from the Arabic Wikipedia dataset and fully diacritized using the state-of-the-art CATT diacritizer (Abjad AI, 2024), introduced in Alasmary et al. (2024) and licensed under CC BY-NC 4.0.
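A minimal sketch of this sampling step is shown below. It assumes the dataset is the Hugging Face wikimedia/wikipedia dump with the 20231101.ar configuration; the seed, column names, and the placeholder diacritization call are illustrative, not the exact pipeline used for these models.

```python
from datasets import load_dataset

# Assumed dataset id/config; the card refers to "wikipedia.20231101.ar"
wiki_ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train")

# Random 200k-entry subset (the seed is illustrative)
sample = wiki_ar.shuffle(seed=42).select(range(200_000))

text = sample[0]["text"]
# diacritized = catt_diacritize(text)  # hypothetical call into the CATT diacritizer
print(text[:80])
```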

Training Procedure

Model Architecture and Objectives

The models follow different training objectives:

  1. mlm_p2g_non_diacritics:

    • Trained with dual objectives similar to the original PL-BERT (see the sketch after this list):
      • Masked Language Modeling (MLM): Standard BERT pre-training objective
      • Phoneme-to-Grapheme (P2G): Predicting token IDs from phonemic representations
    • Tokenization was performed using aubmindlab/bert-base-arabertv2, which uses subword tokenization
    • Trained for 10 epochs on non-diacritized Wikipedia Arabic
  2. mlm_only_non_diacritics:

    • Trained with only the MLM objective
    • Removes the P2G objective which, according to ablation studies in the PL-BERT paper, had only a minimal effect on performance
    • This removal eliminated dependence on tokenization, which:
      • Reduced the model size considerably (word/subword tokenization has a much larger vocabulary than phoneme vocabulary)
      • Allowed phonemization of entire sentences at once, resulting in more accurate phonemization
    • Trained on non-diacritized Wikipedia Arabic
  3. mlm_only_with_diacritics:

    • Fine-tuned version of mlm_only_non_diacritics
    • Trained for 10 epochs on diacritized Arabic text
    • Uses the same MLM-only objective
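The sketch below shows one way the dual MLM + P2G objective can be wired up. The encoder, head sizes, and variable names are illustrative assumptions, not the authors' actual implementation; the mlm_only_* variants correspond to dropping the P2G term.

```python
import torch
import torch.nn as nn

class DualObjectivePLBERT(nn.Module):
    """Illustrative dual-objective model: a shared phoneme-level encoder feeds
    an MLM head (masked phoneme prediction) and a P2G head (predicting the
    grapheme/subword token id associated with each phoneme position)."""

    def __init__(self, encoder, hidden_size, phoneme_vocab_size, grapheme_vocab_size):
        super().__init__()
        self.encoder = encoder  # e.g. an ALBERT/BERT-style encoder (assumed)
        self.mlm_head = nn.Linear(hidden_size, phoneme_vocab_size)
        self.p2g_head = nn.Linear(hidden_size, grapheme_vocab_size)

    def forward(self, phoneme_ids, mlm_labels, grapheme_labels):
        hidden = self.encoder(phoneme_ids)          # (batch, seq_len, hidden_size)
        mlm_logits = self.mlm_head(hidden)
        p2g_logits = self.p2g_head(hidden)
        # Positions labeled -100 are ignored (e.g. unmasked MLM positions)
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        mlm_loss = loss_fn(mlm_logits.transpose(1, 2), mlm_labels)
        p2g_loss = loss_fn(p2g_logits.transpose(1, 2), grapheme_labels)
        # The mlm_only_* models simply omit the p2g term below
        return mlm_loss + p2g_loss
```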

Technical Considerations

Tokenization Challenges

For the mlm_p2g_non_diacritics model, a notable limitation was the use of subword tokenization. This approach is not ideal for pronunciation modeling because phonemizing parts of words independently loses the context of the word, which heavily affects pronunciation. The authors of the original PL-BERT paper used a word-level tokenizer for English, but a comparable high-quality word-level tokenizer was not available for Arabic. This limitation was addressed in the subsequent models by removing the P2G objective.
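To illustrate the issue, the sketch below contrasts phonemizing a full sentence with phonemizing subword pieces independently. It assumes an espeak-backed phonemizer (the phonemizer actually used for these models is not specified in this card), and the placeholder strings stand in for real Arabic text and tokenizer output.

```python
from phonemizer import phonemize

sentence = "..."                 # an Arabic sentence
subword_pieces = ["...", "..."]  # the same sentence split by a subword tokenizer

whole = phonemize(sentence, language="ar", backend="espeak")
piecewise = " ".join(
    phonemize(piece, language="ar", backend="espeak") for piece in subword_pieces
)

# `whole` and `piecewise` will generally disagree, because each piece is
# phonemized without word-level context; this is why the MLM-only models
# phonemize full sentences instead of tokenizer output.
print(whole)
print(piecewise)
```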

Diacritization

Arabic text can be written with or without diacritics (short vowel marks). The mlm_only_with_diacritics model specifically addresses this by training on fully diacritized text, which provides explicit pronunciation information that is typically absent in standard written Arabic.
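For readers unfamiliar with Arabic diacritics, the short sketch below shows that they are Unicode combining marks that can be stripped to recover the usual undiacritized form; the example word is illustrative only.

```python
import unicodedata

diacritized = "كَتَبَ"  # "kataba" with short-vowel marks

def strip_diacritics(text: str) -> str:
    # Drop combining marks; the Arabic harakat fall in this class
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(strip_diacritics(diacritized))  # كتب
```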

Uses

These models can be used for Arabic natural language understanding tasks where phonemic awareness may be beneficial, such as:

  • Text-to-speech
  • Speech recognition post-processing
  • Dialect identification
  • Pronunciation-sensitive applications

For examples of how these models can be used in code, see: https://github.com/Fadi987/StyleTTS2/blob/main/Utils/PLBERT/util.py
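As a rough starting point, the sketch below assumes the checkpoints follow the original PL-BERT / StyleTTS2 layout (an ALBERT-style encoder with a config.yml and a torch checkpoint). The file names, config keys, and key-prefix handling are assumptions; the linked util.py remains the authoritative loading code.

```python
import yaml
import torch
from transformers import AlbertConfig, AlbertModel

plbert_dir = "path/to/mlm_only_with_diacritics"  # local copy of one of the released models

with open(f"{plbert_dir}/config.yml") as f:
    plbert_config = yaml.safe_load(f)

# Assumes config.yml stores the encoder hyperparameters under "model_params"
albert_config = AlbertConfig(**plbert_config["model_params"])
plbert = AlbertModel(albert_config)

checkpoint_path = f"{plbert_dir}/step_XXXXXX.t7"  # pick the latest checkpoint file
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# The original util.py strips a "module.encoder." prefix added during multi-GPU
# training; adjust this to match the actual checkpoint keys.
state_dict = {k.replace("module.encoder.", ""): v for k, v in checkpoint["net"].items()}
plbert.load_state_dict(state_dict, strict=False)
plbert.eval()
```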

Bias, Risks, and Limitations

The models are trained on Wikipedia data, which may not represent all varieties of Arabic equally. The diacritization process, while state-of-the-art, may introduce some errors or biases in the training data.

The subword tokenization approach used in the mlm_p2g_non_diacritics model has limitations for phonemic modeling as noted above.

Citation

BibTeX:

@article{catt2024,
  title={CATT: Character-based Arabic Tashkeel Transformer},
  author={Alasmary, Faris and Zaafarani, Orjuwan and Ghannam, Ahmad},
  journal={arXiv preprint arXiv:2407.03236},
  year={2024}
}

@article{plbert2023,
  title={Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions},
  author={Li, Yinghao Aaron and Han, Cong and Jiang, Xilin and Mesgarani, Nima},
  journal={arXiv preprint arXiv:2301.08810},
  year={2023}
}