---
language: 
- en
tags:
- text-classification
metrics:
- accuracy (balanced)
- F1 (weighted)
widget:
- text: "اسعدغيرك  انت مو بس اسعدت العماله ترا اسعدتنا"
  example_title: "Gulf"
- text: "  سبحان الله في الغيوم شكل قلب"
  example_title: "MSA"
- text: "بلاش تحطي صور متبرجة ع صفحتك..."
  example_title: "Gulf"
- text: "و حضرتك طيبة و شكرا علي الكلام الحلو ده يا مبهجة..."
  example_title: "Egyptian"
---
# Dialectical-MSA-detection

## Model description

This model was trained on 108,173 manually annotated sentences of user-generated content (e.g. tweets and online comments) to classify Arabic text into one of two categories: 'Dialectical' or 'MSA' (Modern Standard Arabic).
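
A minimal usage sketch with the `transformers` pipeline API (the Hub repository id below is a placeholder; substitute the id this model is actually published under):

```python
from transformers import pipeline

# Placeholder repository id; replace with this model's actual Hub id.
classifier = pipeline("text-classification", model="<user>/Dialectical-MSA-detection")

# Returns one of the two labels ('Dialectical' or 'MSA') with a confidence score.
print(classifier("سبحان الله في الغيوم شكل قلب"))
# -> [{'label': ..., 'score': ...}]
```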


## Training data

Dialectical-MSA-detection was trained on the dialect-annotated subset of the [Arabic Online Commentary (AOC) dataset (Zaidan and Callison-Burch, 2011)](https://github.com/sjeblee/AOC).
The AOC dataset was created by crawling the websites of three Arabic newspapers and extracting online articles and readers' comments.



## Training procedure

`xlm-roberta-base` was fine-tuned with the Hugging Face `Trainer` using the following hyperparameters.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",           # required argument; checkpoint path is illustrative
    num_train_epochs=4,               # total number of training epochs
    learning_rate=2e-5,               # learning rate
    per_device_train_batch_size=32,   # batch size per device during training
    per_device_eval_batch_size=4,     # batch size for evaluation
    warmup_steps=0,                   # number of warmup steps for the learning rate scheduler
    weight_decay=0.02,                # strength of weight decay
)
```
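
For context, here is a self-contained sketch of how these arguments could be passed to the `Trainer`. The two-sentence placeholder dataset and the 0/1 label mapping are assumptions for illustration, not the authors' exact training script:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # 'Dialectical' vs. 'MSA'
)

# Tiny placeholder standing in for the 108,173 annotated AOC sentences;
# the label mapping here is an assumption.
raw = Dataset.from_dict({
    "text": ["سبحان الله في الغيوم شكل قلب", "بلاش تحطي صور متبرجة ع صفحتك"],
    "label": [1, 0],
})
encoded = raw.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=training_args,    # the TrainingArguments defined above
    train_dataset=encoded,
    eval_dataset=encoded,  # illustrative only; use a held-out split in practice
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```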

## Eval results

The model was evaluated on a held-out 10% of the sentences (90-10 train-dev split), reaching an accuracy of 0.88 on the dev set.
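
Building on the sketch above, the split and the metric can be reproduced along these lines (the split seed is an assumption, as it is not documented):

```python
import numpy as np
import evaluate

# 90-10 train-dev split, mirroring the evaluation setup.
split = encoded.train_test_split(test_size=0.1, seed=42)
train_dataset, dev_dataset = split["train"], split["test"]

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # Dev-set accuracy, the figure reported above (0.88); pass this
    # function to the Trainer via its `compute_metrics` argument.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)
```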


## Limitations and bias

The model was trained on sentences from the online-commentary domain. Other forms of user-generated text (UGT), such as tweets, can differ in their degree of dialectness, so performance may vary on such data.


## BibTeX entry and citation info

```bibtex
@article{saadany2022semi,
  title={A Semi-supervised Approach for a Better Translation of Sentiment in Dialectical Arabic UGT},
  author={Saadany, Hadeel and Orasan, Constantin and Mohamed, Emad and Tantawy, Ashraf},
  journal={arXiv preprint arXiv:2210.11899},
  year={2022}
}
```