---
language:
- ar
metrics:
- bleu
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- t5
- Classification
- ArabicT5
- Text Classification
widget:
- example_title: Religion
  text: >
    الحمد لله رب العالمين والصلاة والسلام على سيد المرسلين نبينا محمد وآله وصحبه أجمعين،وبعد:فإنه يجب على العبد أن يتجنب الذنوب كلها دقها وجلها صغيرها وكبيرها وأن يتعاهد نفسه بالتوبة الصادقة والإنابة إلى ربه. قال تعالى: (وَتُوبُوا إِلَى اللَّهِ جَمِيعًا أَيُّهَا الْمُؤْمِنُونَ لَعَلَّكُمْ تُفْلِحُونَ)النور 31.
---

# Arabic text classification using deep learning (ArabicT5)

## Our experiment

  - The category mapping:

        category_mapping = {
            'Politics': 1,
            'Finance': 2,
            'Medical': 3,
            'Sports': 4,
            'Culture': 5,
            'Tech': 6,
            'Religion': 7
        }
    
  - Training parameters

|       Parameter       |    Value     |
| :-------------------: | :----------: |
|  Training batch size  |     `8`      |
| Evaluation batch size |     `8`      |
|     Learning rate     |    `1e-4`    |
|   Max input length    |    `200`     |
|   Max target length   |     `3`      |
|   Number of workers   |     `4`      |
|        Epochs         |     `2`      |

  - Results

|         Metric          |     Value     |
| :---------------------: | :-----------: |
|     Validation loss     |   `0.0479`    |
|        Accuracy         |   `96.49%`    |
|          BLEU           |   `96.49%`    |
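
The category mapping and the maximum target length of 3 suggest that each label is emitted as a short digit string. The sketch below shows one way such targets could be built and scored. The helpers `encode_target` and `exact_match_accuracy` are hypothetical names, and treating accuracy as an exact string match is an assumption, not a description of the original training script.

```python
# Category mapping copied from the model card.
category_mapping = {
    "Politics": 1,
    "Finance": 2,
    "Medical": 3,
    "Sports": 4,
    "Culture": 5,
    "Tech": 6,
    "Religion": 7,
}


def encode_target(category: str) -> str:
    """Hypothetical helper: turn a category name into a text target."""
    return str(category_mapping[category])


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference string (assumed metric)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Toy data for illustration only; these are not results from the model.
references = [encode_target(c) for c in ["Sports", "Culture", "Religion", "Tech"]]
predictions = ["4", "5", "7", "2"]
accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.2%}")  # 75.00%
```

Because each target is a single short string, a prediction is either fully right or fully wrong under this metric.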

## SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
  
  - Paper: <https://www.researchgate.net/publication/333605992_SANAD_Single-Label_Arabic_News_Articles_Dataset_for_Automatic_Text_Categorization>

  - Dataset: <https://data.mendeley.com/datasets/57zpx667y9/2>

## Arabic text classification using deep learning models

  - Paper: <https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413>

  - Their experiment: "Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU."

|         Model           |         Accuracy        |
| :---------------------: | :---------------------: |
|           CGRU          |          93.43%         |
|          HANGRU         |          95.81%         |

## Example usage
```python
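from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه  متابعه تفاجا زوار موقع القناه الاولي المغربي"

# Tokenize with the maximum input length used in training (200).
tokens = tokenizer(
    text,
    max_length=200,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# The target is a short category ID, so generation is capped at 3 tokens.
# Passing the attention mask makes the handling of padding tokens explicit.
output = model.generate(
    tokens["input_ids"],
    attention_mask=tokens["attention_mask"],
    max_length=3,
    length_penalty=10,
)

# Decode the generated token IDs into the predicted category number.
output = [
    tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for ids in output
]
print(output)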

```
```text
['5']
```
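
The model returns the category as a digit string, so it needs a reverse lookup to become a readable label. The following is a minimal sketch of that step, assuming the category mapping from this model card. `decode_prediction` and the `"Unknown"` fallback are hypothetical and are not part of the model or the `transformers` library.

```python
# Category mapping copied from the model card.
category_mapping = {
    "Politics": 1,
    "Finance": 2,
    "Medical": 3,
    "Sports": 4,
    "Culture": 5,
    "Tech": 6,
    "Religion": 7,
}

# Invert the mapping so a generated digit string points to its category name.
id_to_category = {str(idx): name for name, idx in category_mapping.items()}


def decode_prediction(generated: str) -> str:
    """Hypothetical helper: map a generated string to a category name.

    Returns "Unknown" (an assumed fallback) when the string is not a known ID.
    """
    return id_to_category.get(generated.strip(), "Unknown")


# `output` stands in for the list produced by the usage example.
output = ["5"]
labels = [decode_prediction(o) for o in output]
print(labels)  # ['Culture']
```

A generative classifier can in principle emit a string outside the label set, which is why the sketch includes a fallback instead of indexing the dictionary directly.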