# Arabic text classification using deep learning (ArabicT5)
# Our experiment
The category mapping: `category_mapping = {'Politics': 1, 'Finance': 2, 'Medical': 3, 'Sports': 4, 'Culture': 5, 'Tech': 6, 'Religion': 7}`
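Because the model generates the label as text (the target max length is only 3 tokens), the training targets are presumably the stringified category ids. A minimal sketch of that conversion (the `label_to_target` helper is an illustrative assumption, not part of the published code):

```python
# Category-to-id mapping from the model card
category_mapping = {
    'Politics': 1, 'Finance': 2, 'Medical': 3,
    'Sports': 4, 'Culture': 5, 'Tech': 6, 'Religion': 7
}

def label_to_target(category: str) -> str:
    # Stringify the id so it fits the 3-token generation target
    # (assumption about the training setup, based on the card's parameters).
    return str(category_mapping[category])

print(label_to_target('Culture'))  # '5'
```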
## Training parameters

| Parameter | Value |
|---|---|
| Training batch size | 8 |
| Evaluation batch size | 8 |
| Learning rate | 1e-4 |
| Max input length | 200 |
| Max target length | 3 |
| Number of workers | 4 |
| Epochs | 2 |
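The hyperparameters above could map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch, not the authors' actual training script; the `output_dir` and the use of the `Trainer` API are assumptions, and the max input/target lengths belong in the tokenization step rather than the training arguments:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the card's hyperparameters as Seq2SeqTrainingArguments
# (output_dir is a placeholder; the original training code is not published).
training_args = Seq2SeqTrainingArguments(
    output_dir="./arabict5-classification",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=2,
    dataloader_num_workers=4,
)
```

The max input length (200) and max target length (3) would be applied when tokenizing the inputs and labels, as shown in the usage example below.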
## Results

| Metric | Value |
|---|---|
| Validation loss | 0.0479 |
| Accuracy | 96.49% |
| BLEU | 96.49% |
# SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
# Arabic text classification using deep learning models
Paper: https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413
Their experiment: "Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU."

| Model | Accuracy |
|---|---|
| CGRU | 93.43% |
| HANGRU | 95.81% |
# Example usage

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Arabic news snippet about the Moroccan "Al Aoula" TV channel
text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه متابعه تفاجا زوار موقع القناه الاولي المغربي"

tokens = tokenizer(
    text,
    max_length=200,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
output = model.generate(
    tokens["input_ids"],
    max_length=3,
    length_penalty=10,
)
output = [
    tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for ids in output
]
print(output)
# ['5']
```
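The model returns the category id as a string (`'5'` here, which corresponds to Culture in the mapping above). To recover the human-readable label, the mapping can be inverted with a small helper (`id_to_category` is an illustrative name, not part of the published API):

```python
# Category mapping from the model card
category_mapping = {
    'Politics': 1, 'Finance': 2, 'Medical': 3,
    'Sports': 4, 'Culture': 5, 'Tech': 6, 'Religion': 7
}
# Invert it so the generated id string maps back to a label name
id_to_category = {str(v): k for k, v in category_mapping.items()}

prediction = '5'  # e.g. the decoded generation from the example above
print(id_to_category[prediction])  # 'Culture'
```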