---
license: mit
datasets:
- tashkeela
language:
- ar
metrics:
- accuracy
pipeline_tag: text2text-generation
---

# Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization

## Table of Contents

- [Introduction](#introduction)
- [Model Description](#model-description)
- [Benchmarks](#benchmarks)
- [Citation](#citation)
- [Contact](#contact)

## Introduction

Most previous work on learning Arabic diacritization relied on training models from scratch. In this work, we investigate how to leverage pre-trained language models to learn diacritization instead. We finetune token-free, pre-trained multilingual models (ByT5) to predict and insert the missing diacritics in Arabic text, a complex task that requires understanding both the semantics of the sentence and the morphological structure of its tokens. We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering, reducing WER by 40%. We release our finetuned models for the benefit of the research community.

## Model Description

ByT5 is a token-free model that operates directly on the raw bytes of text, which lets it handle diverse languages and orthographic nuances without a fixed subword vocabulary. Pre-trained on the multilingual mC4 corpus, it is a strong general-purpose text-to-text model. We finetuned it on the Tashkeela dataset for 13,000 steps, substantially improving its ability to restore the diacritical marks of Arabic text.
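
Because the model is byte-level, the tokenizer maps the UTF-8 bytes of the input directly to IDs rather than using a learned subword vocabulary. The snippet below is a minimal sketch of what that means for Arabic text; it assumes the checkpoint ships ByT5's standard tokenizer, whose IDs are simply byte values offset by its three special tokens.

```python
from transformers import AutoTokenizer

# Byte-level tokenizer bundled with the checkpoint (assumed to be ByT5's standard tokenizer).
tokenizer = AutoTokenizer.from_pretrained("basharalrfooh/Fine-Tashkeel")

text = "كيف"  # three Arabic letters -> six UTF-8 bytes
ids = tokenizer(text).input_ids

# ByT5 token IDs are the UTF-8 byte values shifted by 3 special tokens, plus a trailing </s> (ID 1).
print(ids)
print(list(text.encode("utf-8")))
```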
## Benchmarks

**Note: This model has been trained specifically for use with Classical Arabic.**

Our model attains a Diacritic Error Rate (DER) of 0.95% and a Word Error Rate (WER) of 2.49%.
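
For context, DER is the fraction of characters whose predicted diacritics differ from the reference, and WER is the fraction of words containing at least one diacritic error. The sketch below is a simplified illustration of how both can be computed for a single prediction/reference pair; it is not the paper's evaluation script, which may count only Arabic letters and handle case endings differently, and the helper names are our own.

```python
import re

# Arabic diacritics: fathatan through sukun (U+064B-U+0652).
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def split_base_and_marks(text: str):
    """Split text into its undiacritized characters and the diacritics attached to each."""
    base, marks = [], []
    for ch in text:
        if DIACRITICS.match(ch) and base:
            marks[-1] += ch  # attach the mark to the preceding base character
        else:
            base.append(ch)
            marks.append("")
    return base, marks

def der_wer(predicted: str, reference: str):
    """Diacritic Error Rate and Word Error Rate for one aligned prediction/reference pair.

    Assumes the model preserves the base characters, so the two strings line up
    position by position once the diacritics are separated out.
    """
    pred_base, pred_marks = split_base_and_marks(predicted)
    ref_base, ref_marks = split_base_and_marks(reference)
    assert pred_base == ref_base, "base characters must match for this simple alignment"

    der = sum(p != r for p, r in zip(pred_marks, ref_marks)) / len(ref_marks)
    wer = sum(p != r for p, r in zip(predicted.split(), reference.split())) / len(reference.split())
    return der, wer
```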
The following code sample shows how to use the model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

if __name__ == "__main__":
    # Undiacritized Arabic input text
    text = "كيف الحال"

    model_name = "basharalrfooh/Fine-Tashkeel"

    # Load the byte-level tokenizer and the finetuned ByT5 model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Encode the input, generate the diacritized text, and decode it back to a string
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_new_tokens=128)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print("Generated output:", decoded_output)
```
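
Since ByT5 operates on bytes, each Arabic character, and each inserted diacritic, typically occupies two bytes in the output, so `max_new_tokens` should comfortably exceed twice the character length of the input; 128 is more than enough for a short phrase like the one above.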
## Citation

```bibtex
@misc{alrfooh2023finetashkeel,
      title={Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization},
      author={Bashar Al-Rfooh and Gheith Abandah and Rami Al-Rfou},
      year={2023},
      eprint={2303.14588},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Contact

bashar@alrfou.com