guymorlan committed on
Commit
056177e
•
1 Parent(s): c33e789

Update README.md

Files changed (1)
  1. README.md +13 -7
README.md CHANGED
@@ -12,9 +12,9 @@ datasets:
 
  This model adds diacritics to raw text in Palestinian colloquial Arabic.
  The model is trained on a special subset of the Levanti dataset (to be released later).
- The model is fine-tuned from Google's [CANINE-s](https://huggingface.co/google/canine-s) character level LM with a multi-label token classification head.
- CANINE-s is first pre-trained on the Tashkeela dataset of classical Arabic diacritized text (after removing final diacritics from the text) and then trained for an additional 5 epochs on the diacritized subset of the Levanti dataset.
- Each token (letter) of the input is classified into 6 positive categories: Shadda, Fatha, Kasra, Damma and Sukun (see `model.config.id2label`). A multi-label model is used since a Shadda can accompany other diacritical marks.
 
  # Transliterator
  This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterated diacritized text in Palestinian Arabic.
@@ -22,15 +22,20 @@ This model can be used in conjunction with [Levanti Transliterator](https://hugg
  # Example Usage
 
  ```python
- from transformers import CanineForTokenClassification, AutoTokenizer
- model = CanineForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
  tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")
 
- label2diacritic = {0: 'ّ', 1: 'َ', 2: 'ِ', 3: 'ُ', 4: ''}
 
  def arabic2diacritics(text, model, tokenizer):
      tokens = tokenizer(text, return_tensors="pt")
-     preds = (model(**tokens).logits.sigmoid() > 0.5)[0][1:-1] # remove CLS and SEP
      new_text = []
      for p, c in zip(preds, text):
          new_text.append(c)
@@ -48,6 +53,7 @@ text = "بديش اروح عالمدرسة بكرا"
  arabic2diacritics(text, model, tokenizer)
  ```
  ```
  ```
 
  # Attribution
 
 
  This model adds diacritics to raw text in Palestinian colloquial Arabic.
  The model is trained on a special subset of the Levanti dataset (to be released later).
+ The model is fine-tuned from the [TavBERT-ar](https://huggingface.co/tau/tavbert-ar) character level encoder LM, with a multi-label token classification head.
+ TavBert-ar is first pre-trained on the Tashkeela dataset of classical Arabic diacritized text (after removing final diacritics from the text) and then trained for an additional 8 epochs on the diacritized subset of the Levanti dataset.
+ Each token (letter) of the input is classified into 6 positive categories: Shadda, Fatha, Kasra, Damma and Sukun. A multi-label model is used since a Shadda can accompany other diacritical marks.
 
  # Transliterator
  This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterated diacritized text in Palestinian Arabic.
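
The changed README lines say the Tashkeela text was used "after removing final diacritics", but the diff does not show that step. The following is a minimal sketch of one possible reading, assuming "final diacritics" means the marks attached to the last letter of each word. The helper name `strip_final_diacritics` and the constant `ARABIC_DIACRITICS` are hypothetical and do not come from the repository.

```python
# Hypothetical helper: not taken from the model repository.
# Assumption: "final diacritics" are the marks on the last letter of each word.

# U+064B..U+0652: tanwin marks, fatha, damma, kasra, shadda, sukun.
ARABIC_DIACRITICS = "".join(chr(cp) for cp in range(0x064B, 0x0653))


def strip_final_diacritics(text):
    """Drop trailing diacritic marks from every space-separated word.

    Marks inside a word are kept. A word followed directly by
    punctuation is left unchanged, which a real pipeline would
    probably need to handle.
    """
    words = text.split(" ")
    return " ".join(word.rstrip(ARABIC_DIACRITICS) for word in words)


if __name__ == "__main__":
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E \u0648\u064E\u0644\u064E\u062F\u064C"
    print(strip_final_diacritics(sample))
```

Note that this version also removes a word-final shadda; whether the original preprocessing did the same is not stated in the README.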
 
  # Example Usage
 
  ```python
+ from transformers import RobertaForTokenClassification, AutoTokenizer
+ model = RobertaForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
  tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")
 
+ label2diacritic = {0: 'ّ', # SHADDA
+                    1: 'َ', # FATHA
+                    2: 'ِ', # KASRA
+                    3: 'ُ', # DAMMA
+                    4: 'ْ'} # SUKKUN
+
 
  def arabic2diacritics(text, model, tokenizer):
      tokens = tokenizer(text, return_tensors="pt")
+     preds = (model(**tokens).logits.sigmoid() > 0.5)[0][1:-1] # remove preds for BOS and EOS
      new_text = []
      for p, c in zip(preds, text):
          new_text.append(c)
 
  arabic2diacritics(text, model, tokenizer)
  ```
  ```
+ Out[1]: 'بِدِّيْش اْرُوْح عَالْمَدْرَسِة بُكْرَا'
  ```
 
  # Attribution
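
The second hunk stops at `new_text.append(c)`, so the diff does not show how the thresholded predictions are turned into marks. The following is a rough sketch of how that decoding step could work, not the author's actual code. It takes plain lists of booleans instead of the model's tensor so that it runs without downloading anything. The function name `apply_diacritics` and the choice to emit the shadda after any vowel mark are assumptions of this sketch; the label ids follow the `label2diacritic` mapping in the diff.

```python
# Sketch only: the real README function is truncated in this diff.
# Label ids mirror the label2diacritic mapping shown in the changed lines.
LABEL2DIACRITIC = {
    0: "\u0651",  # SHADDA
    1: "\u064E",  # FATHA
    2: "\u0650",  # KASRA
    3: "\u064F",  # DAMMA
    4: "\u0652",  # SUKUN
}


def apply_diacritics(text, preds):
    """Attach predicted marks to each character of `text`.

    `preds` holds one row of five booleans per character, i.e. the
    result of `sigmoid(logits) > 0.5` with the BOS/EOS rows removed.
    Each label is decided independently (multi-label), so a shadda
    can appear together with a vowel mark on the same letter.
    """
    out = []
    for flags, char in zip(preds, text):
        out.append(char)
        # Vowel marks and sukun first (labels 1..4).
        for label in range(1, 5):
            if flags[label]:
                out.append(LABEL2DIACRITIC[label])
        # Assumption: the shadda is written after the vowel mark.
        if flags[0]:
            out.append(LABEL2DIACRITIC[0])
    return "".join(out)


if __name__ == "__main__":
    letters = "\u0628\u062F"  # beh, dal
    flags = [
        [False, False, True, False, False],  # beh: kasra
        [True, False, True, False, False],   # dal: kasra + shadda
    ]
    print(apply_diacritics(letters, flags))
```

With a real model, `preds` would come from `(model(**tokens).logits.sigmoid() > 0.5)[0][1:-1].tolist()`, as in the `preds` line of the diff.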