Rawi V2 — Arabic Diacritizer
A lightweight BiLSTM that restores Arabic diacritics (tashkeel). Version 2 of
TigreGotico/rawi.
Credits
- TigreGotico — author: the corpus, the model, and the training notebook.
- Mike Hansen — ran the training on his GPUs.
Results
Scored on the entire 817k-sentence held-out test split of
TigreGotico/arabic_diacritized_text:
| metric | value |
|---|---|
| DER (all positions) | 2.29% |
| DER* (marked positions) | 3.37% |
| WER | 8.33% |
V2 is trained to abstain on unmarked positions (the padded loss is correctly
masked with ignore_index), so it does not over-mark. INT8 quantization is
effectively lossless (2.30% DER against 2.29% in the table above, at 2.5 MB).
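The three metrics can be sketched as follows. This is a minimal illustration, not the evaluation script behind the table: it assumes DER counts every letter, DER* counts only letters that carry a diacritic in the reference, and WER counts words with at least one wrong letter. The function name `error_rates` and the label-list input format are hypothetical.

```python
FATHA, KASRA, SHADDA = "\u064E", "\u0650", "\u0651"

def error_rates(gold, pred):
    """gold/pred: list of words, each a list of per-letter labels ("" = unmarked)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of words")
    total = marked = errors = marked_errors = bad_words = 0
    for gold_word, pred_word in zip(gold, pred):
        if len(gold_word) != len(pred_word):
            raise ValueError("aligned words must have the same number of letters")
        word_wrong = False
        for g, p in zip(gold_word, pred_word):
            wrong = g != p
            total += 1
            errors += wrong
            if g:  # the reference carries a diacritic at this position
                marked += 1
                marked_errors += wrong
            word_wrong = word_wrong or wrong
        bad_words += word_wrong
    return {
        "DER": errors / max(total, 1),
        "DER*": marked_errors / max(marked, 1),
        "WER": bad_words / max(len(gold), 1),
    }

# Two words, seven letters. The prediction gets one marked letter wrong
# and adds a spurious fatha on a letter the reference leaves unmarked.
gold = [[KASRA, "", KASRA], ["", "", FATHA + SHADDA, ""]]
pred = [[KASRA, "", FATHA], ["", FATHA, FATHA + SHADDA, ""]]
rates = error_rates(gold, pred)
print({name: round(value, 3) for name, value in rates.items()})
# {'DER': 0.286, 'DER*': 0.333, 'WER': 1.0}
```

Under these assumptions the spurious mark raises DER and WER but leaves DER* unchanged, so over-marking shows up only in the all-positions rate.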
A note on the corpus: it aggregates many public Arabic sources. rawi-v2 was
trained only on the train split and never on test.txt, though about 1.7% of test
sentences also appear in train as near-duplicates introduced by the aggregation.
That is a smaller exposure than typical external baselines have on the same set.
Architecture
Embedding(236, 128) → 2-layer bidirectional LSTM(256) → Linear(512, 75).
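As a rough size check, the parameter count implied by that line can be worked out by hand. The sketch below assumes that LSTM(256) means 256 hidden units per direction (consistent with the 512-wide input of the linear layer), that the LSTM follows the torch.nn.LSTM parameter layout with two bias vectors per gate, and that the linear layer has a bias. The helper `lstm_params` is hypothetical.

```python
def lstm_params(input_size, hidden_size, num_layers, bidirectional=True):
    """Count weights and biases, assuming the torch.nn.LSTM parameter layout."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first consume the concatenated outputs of both directions.
        in_features = input_size if layer == 0 else hidden_size * directions
        # 4 gates, each with input weights, recurrent weights, and two bias vectors.
        per_direction = 4 * hidden_size * (in_features + hidden_size + 2)
        total += directions * per_direction
    return total

embedding = 236 * 128             # one 128-dim vector per input character
lstm = lstm_params(128, 256, num_layers=2)
linear = 512 * 75 + 75            # weights plus bias for the 75 output classes
total = embedding + lstm + linear
print(embedding, lstm, linear, total)
# 30208 2367488 38475 2436171
```

At about 2.4 million parameters, the count is in the same range as the 2.5 MB INT8 file, assuming most weights are stored in one byte each.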
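Normalization is NFD with symbols (Unicode general category So) dropped; the
model predicts one of 75 NFD-based diacritic classes per base character. Besides
the usual tashkeel, it also restores hamzas (which NFD splits into a base letter
plus a combining mark) and the superscript alef.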
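The normalization and the per-character labels can be sketched with the standard library alone. This is an illustration under assumptions, not the code from the training notebook: the function names are hypothetical, and treating every character with a nonzero canonical combining class as part of the label is a guess at how the 75 classes are formed.

```python
import unicodedata

def normalize(text):
    """NFD-decompose, then drop characters in Unicode category So (other symbols)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "So")

def split_labels(text):
    """Return (bases, labels); labels[i] holds the combining marks on bases[i]."""
    bases, labels = [], []
    for ch in normalize(text):
        if unicodedata.combining(ch) and bases:
            labels[-1] += ch      # attach the mark to the preceding base character
        else:
            bases.append(ch)
            labels.append("")     # unmarked until a combining mark follows
    return "".join(bases), labels

# Alef with hamza above + fatha, kaf + fatha, lam + fatha, a space, then a star symbol.
bases, labels = split_labels("\u0623\u064E\u0643\u064E\u0644\u064E \u2605")
print([f"U+{ord(ch):04X}" for ch in bases])
# ['U+0627', 'U+0643', 'U+0644', 'U+0020']
print([[f"U+{ord(mark):04X}" for mark in label] for label in labels])
# [['U+064E', 'U+0654'], ['U+064E'], ['U+064E'], []]
```

The first label shows why hamza restoration falls out of the scheme: NFD turns U+0623 into a bare alef plus the combining hamza U+0654, so the hamza becomes part of the diacritic label. The star is removed because its category is So.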
Files
| file | what |
|---|---|
| rawi_v2.onnx | fp32 ONNX, dynamic sequence length |
| rawi_v2.int8.onnx | INT8 dynamic quantization (2.5 MB, lossless) |
| rawi_v2.vocab.json | char_to_idx / diac_to_idx |
| diacritization_model_lstm_2.pth | original PyTorch checkpoint |
| rawi_v2_train.ipynb | training notebook |
Usage
```python
from text2tashkeel import Diacritizer  # pip install text2tashkeel

Diacritizer("rawi-v2").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```
text2tashkeel bundles this model and reproduces the exact NFD normalization and
letter-only decode in pure Python (no PyTorch at inference).
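One possible reading of the letter-only decode is sketched below: predicted marks are attached only to letters, and spaces, digits, and punctuation pass through unchanged. The label table here is a hypothetical toy; the real mapping is the diac_to_idx entry in rawi_v2.vocab.json. Recomposing the result with NFC is also an assumption, not something the package is confirmed to do.

```python
import unicodedata

# Hypothetical toy label table: none, kasra, sukun, hamza above.
diac_to_idx = {"": 0, "\u0650": 1, "\u0652": 2, "\u0654": 3}
idx_to_diac = {index: marks for marks, index in diac_to_idx.items()}

def decode(bases, class_ids):
    """Re-attach one predicted label per character, skipping non-letters."""
    out = []
    for ch, class_id in zip(bases, class_ids):
        out.append(ch)
        if ch.isalpha():  # letter-only: predictions on other characters are ignored
            out.append(idx_to_diac[class_id])
    # Assumption: recompose so that alef + hamza above becomes a single code point again.
    return unicodedata.normalize("NFC", "".join(out))

print(decode("\u0628\u0633\u0645", [1, 2, 1]))  # ba, sin, mim
# بِسْمِ
```

Ignoring predictions on non-letters means a stray class on a space or digit cannot leak a combining mark into the output.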