Rawi V2 — Arabic Diacritizer

A lightweight BiLSTM that restores Arabic diacritics (tashkeel). Version 2 of TigreGotico/rawi.

Credits

  • TigreGotico — author: the corpus, the model, and the training notebook.
  • Mike Hansen — ran the training on his GPUs.

Results

Scored on the entire 817k-sentence held-out test split of TigreGotico/arabic_diacritized_text:

| metric | value |
| --- | --- |
| DER (all positions) | 2.29% |
| DER* (marked positions) | 3.37% |
| WER | 8.33% |
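The card does not spell out how the three metrics are computed. As a rough sketch, assuming the common definitions (DER is the share of non-space character positions whose predicted label differs from the reference, DER* is the same restricted to positions where the reference carries a mark, and WER is the share of whitespace-separated words with at least one wrong position), they could be computed as below. The function name `score` and the label representation (one string of marks per base character, empty for unmarked) are illustrative assumptions, not the evaluation code behind the numbers above.

```python
def score(bases, ref_labels, pred_labels):
    """Return (DER, DER*, WER) under the assumed definitions.

    bases:       list of base characters (whitespace separates words)
    ref_labels:  reference mark string per base character ("" = unmarked)
    pred_labels: predicted mark string per base character
    """
    assert len(bases) == len(ref_labels) == len(pred_labels)

    total = errors = 0            # all non-space positions
    marked = marked_errors = 0    # positions where the reference has a mark
    words = word_errors = 0
    word_has_chars = word_is_wrong = False

    def close_word():
        # Finish the current word and count it as wrong if any position was.
        nonlocal words, word_errors, word_has_chars, word_is_wrong
        if word_has_chars:
            words += 1
            word_errors += word_is_wrong
        word_has_chars = word_is_wrong = False

    for base, ref, pred in zip(bases, ref_labels, pred_labels):
        if base.isspace():
            close_word()
            continue
        wrong = ref != pred
        total += 1
        errors += wrong
        if ref:
            marked += 1
            marked_errors += wrong
        word_has_chars = True
        word_is_wrong = word_is_wrong or wrong
    close_word()

    der = errors / total if total else 0.0
    der_star = marked_errors / marked if marked else 0.0
    wer = word_errors / words if words else 0.0
    return der, der_star, wer
```

Under these definitions, a model that adds marks where the reference has none is penalized in DER and WER but not in DER*, which is why reporting both DER variants is informative.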

V2 is trained to abstain on unmarked positions, using correct ignore_index masking over the padded loss, so it does not over-mark. INT8 quantization is effectively lossless (2.30% DER versus the 2.29% reported above, in a 2.5 MB file).
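One way to read the masking remark: padding positions are excluded from the loss, while unmarked positions remain real targets of a "no mark" class, so the model is explicitly rewarded for abstaining. The pure-Python sketch below illustrates that idea only. The constants `PAD` and `NO_MARK` and the function `masked_cross_entropy` are hypothetical names; the actual training code lives in the notebook and may differ.

```python
import math

PAD = -100      # assumed padding label (mirrors PyTorch's default ignore_index)
NO_MARK = 0     # assumed index of the "no diacritic" class

def masked_cross_entropy(logits, targets, ignore_index=PAD):
    """Mean cross-entropy over positions whose target is not ignore_index.

    logits:  list of per-position score lists, one score per class
    targets: list of class indices, with ignore_index at padded positions
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue                      # padding contributes nothing
        peak = max(row)                   # subtract the max for stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]   # negative log-likelihood
        count += 1
    return total / count if count else 0.0

# Three positions: a marked one, an unmarked one, and one padding slot.
loss = masked_cross_entropy(
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]],
    [2, NO_MARK, PAD],
)
```

The key design point in this sketch is that `NO_MARK` and `PAD` are distinct values: if they shared an index, unmarked positions would be dropped from the loss and the model would never be trained to abstain.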

A note on the corpus: it aggregates many public Arabic sources. rawi-v2 was trained only on the train split and never on test.txt. About 1.7% of test sentences nevertheless also appear in train, because the aggregated sources contain near-duplicates; this is a smaller exposure than typical external baselines have on the same set.
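The card does not say how the train/test overlap was measured. A minimal sketch of one possible check, assuming two sentences count as duplicates when they are identical after removing all combining marks and collapsing whitespace, is shown below. `strip_marks` and `overlap_rate` are hypothetical helper names, and a different matching rule would give a different percentage.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks after NFD and collapse runs of whitespace.

    Note: under NFD this also folds hamza-carrying letters onto their
    base letter, so it is a deliberately loose comparison key.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(kept.split())

def overlap_rate(train_sentences, test_sentences):
    """Fraction of test sentences whose stripped form also occurs in train."""
    train_keys = {strip_marks(s) for s in train_sentences}
    test_keys = [strip_marks(s) for s in test_sentences]
    if not test_keys:
        return 0.0
    hits = sum(key in train_keys for key in test_keys)
    return hits / len(test_keys)
```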

Architecture

Embedding(236, 128) → 2-layer bidirectional LSTM(256) → Linear(512, 75). Input is normalized to NFD, with symbols (Unicode category So) dropped. The model predicts one of 75 NFD-based diacritic classes per base character (it also restores hamzas and the superscript alef).
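To make the NFD-based labelling concrete, here is a small sketch using only the standard library. It assumes each label is the run of combining marks that follows a base character in NFD; `split_marks` is a hypothetical helper, not the function used by the model. Under NFD, a letter such as U+0623 (alef with hamza above) decomposes into U+0627 plus the combining hamza U+0654, which is one plausible reason hamzas can be treated as predictable classes.

```python
import unicodedata

def split_marks(text):
    """Return (bases, labels): one string of combining marks per base char."""
    bases, labels = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "So":
            continue                      # symbols are dropped
        if unicodedata.combining(ch) and bases:
            labels[-1] += ch              # attach the mark to the previous base
        else:
            bases.append(ch)
            labels.append("")             # unmarked until a mark follows
    return bases, labels
```

Stripping the labels gives the model input, and the labels themselves give the per-character targets; the 75 classes in the card would then correspond to the distinct mark strings seen in training, though the exact class inventory is defined in rawi_v2.vocab.json.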

Files

| file | what |
| --- | --- |
| rawi_v2.onnx | fp32 ONNX, dynamic sequence length |
| rawi_v2.int8.onnx | INT8 dynamic quantization (2.5 MB, lossless) |
| rawi_v2.vocab.json | char_to_idx / diac_to_idx |
| diacritization_model_lstm_2.pth | original PyTorch checkpoint |
| rawi_v2_train.ipynb | training notebook |
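As a sanity check on the file sizes, the parameter count can be estimated from the layer sizes given under Architecture. The arithmetic below assumes the standard PyTorch LSTM layout (four gates, two bias vectors per direction) and 256 hidden units per direction, which matches the 512-wide input of the final linear layer; `lstm_params` is an illustrative helper.

```python
def lstm_params(input_size, hidden_size, bidirectional=True):
    """Parameters of one LSTM layer: 4 gates, input and recurrent weights, 2 biases."""
    per_direction = 4 * (
        input_size * hidden_size      # input-to-hidden weights
        + hidden_size * hidden_size   # hidden-to-hidden weights
        + 2 * hidden_size             # two bias vectors
    )
    return per_direction * (2 if bidirectional else 1)

embedding = 236 * 128                                  # vocabulary x embedding dim
lstm = lstm_params(128, 256) + lstm_params(512, 256)   # layer 1 + layer 2
linear = 512 * 75 + 75                                 # weights + bias
total = embedding + lstm + linear

print(total)  # 2436171
```

At roughly 2.4 million parameters, one byte per weight comes to about 2.4 MB, which is in line with the 2.5 MB INT8 file once graph metadata and any unquantized tensors are included.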

Usage

from text2tashkeel import Diacritizer        # pip install text2tashkeel

diacritizer = Diacritizer("rawi-v2")
print(diacritizer.diacritize("بسم الله الرحمن الرحيم"))
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

text2tashkeel bundles this model and reproduces the exact NFD normalization and letter-only decode in pure Python (no PyTorch at inference).
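The card does not define "letter-only decode". One plausible reading, offered purely as an assumption, is that predicted marks are written back only after characters that are letters, so spaces, digits, and punctuation never receive a diacritic. The sketch below illustrates that reading; `apply_marks` is a hypothetical helper, and the final NFC recomposition step is also an assumption rather than documented text2tashkeel behavior.

```python
import unicodedata

def apply_marks(bases, labels):
    """Interleave base characters with predicted marks, letters only."""
    out = []
    for base, label in zip(bases, labels):
        out.append(base)
        # Unicode general categories starting with "L" are letters.
        if unicodedata.category(base).startswith("L"):
            out.append(label)
    # Recompose sequences such as alef + combining hamza into single letters.
    return unicodedata.normalize("NFC", "".join(out))
```

With this rule, a spurious prediction on a space or punctuation mark is silently discarded instead of producing a stray combining character in the output.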
