---
license: cc-by-nc-4.0
language:
- ar
pipeline_tag: token-classification
datasets:
- guymorlan/levanti
- community-datasets/tashkeela
---

# Levanti Diacritizer

This model adds diacritics to raw text in Palestinian colloquial Arabic. It is trained on a special subset of the Levanti dataset (to be released later).

The model is fine-tuned from the [TavBERT-ar](https://huggingface.co/tau/tavbert-ar) character-level encoder LM, with a multi-label token classification head. TavBERT-ar is first pre-trained on the Tashkeela dataset of diacritized classical Arabic text (after removing final diacritics from the text) and then trained for an additional 8 epochs on the diacritized subset of the Levanti dataset.

Each token (letter) of the input is classified into 5 positive categories: Shadda, Fatha, Kasra, Damma and Sukun. A multi-label model is used since a Shadda can accompany other diacritical marks.

# Transliterator

This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterates diacritized text in Palestinian Arabic.
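# Multi-label Decoding Sketch

The multi-label scheme described in the model description can be illustrated without loading the model. The sketch below is not the model's internal code. It assumes the label order (Shadda at index 0) and the 0.5 sigmoid threshold used in the usage example on this card. The logits are invented for illustration, and `decode_labels` and `apply_marks` are hypothetical helper names.

```python
import math

# Label order assumed from the usage example on this card (index 0 is Shadda).
LABELS = ["SHADDA", "FATHA", "KASRA", "DAMMA", "SUKUN"]
MARKS = ["\u0651", "\u064E", "\u0650", "\u064F", "\u0652"]


def sigmoid(x):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_labels(logits, threshold=0.5):
    """Return every label whose probability exceeds the threshold.

    Each label is thresholded on its own, so more than one can be active
    for the same letter (for example Shadda together with Kasra).
    """
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) > threshold]


def apply_marks(char, logits, threshold=0.5):
    """Append the predicted marks to one letter, writing Shadda last."""
    active = [sigmoid(z) > threshold for z in logits]
    out = char
    for i in range(1, len(MARKS)):
        if active[i]:
            out += MARKS[i]
    if active[0]:
        out += MARKS[0]
    return out


# Invented logits for a single letter: Shadda and Kasra are both active.
example_logits = [2.1, -3.0, 1.4, -2.2, -4.0]
print(decode_labels(example_logits))  # ['SHADDA', 'KASRA']
```

A softmax head would force exactly one label per letter, which could not represent a Shadda combined with a vowel mark. Independent sigmoid outputs avoid that restriction.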
# Example Usage

```python
from transformers import RobertaForTokenClassification, AutoTokenizer

model = RobertaForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")

# Maps each output label index to its diacritic character.
label2diacritic = {0: 'ّ',  # SHADDA
                   1: 'َ',  # FATHA
                   2: 'ِ',  # KASRA
                   3: 'ُ',  # DAMMA
                   4: 'ْ'}  # SUKUN


def arabic2diacritics(text, model, tokenizer):
    tokens = tokenizer(text, return_tensors="pt")
    # Threshold each label independently, then drop predictions for BOS and EOS.
    preds = (model(**tokens).logits.sigmoid() > 0.5)[0][1:-1]
    new_text = []
    # The tokenizer is character-level, so predictions align with the input characters.
    for p, c in zip(preds, text):
        new_text.append(c)
        for i in range(1, 5):
            if p[i]:
                new_text.append(label2diacritic[i])
        # check shadda last
        if p[0]:
            new_text.append(label2diacritic[0])

    new_text = "".join(new_text)
    return new_text


text = "بديش اروح عالمدرسة بكرا"
arabic2diacritics(text, model, tokenizer)
```

```
Out[1]: 'بِدِّيْش اْرُوْح عَالْمَدْرَسِة بُكْرَا'
```

# Attribution

Created by Guy Mor-Lan.

Contact: guy.mor AT mail.huji.ac.il