Rawi V2 — Arabic Diacritizer

A lightweight BiLSTM that restores Arabic diacritics (tashkeel). Version 2 of TigreGotico/rawi.

Credits

TigreGotico — author: the corpus, the model, and the training notebook.
Mike Hansen — ran the training on his GPUs.

Results

Scored on the entire 817k-sentence held-out test split of TigreGotico/arabic_diacritized_text:

| metric | value |
| --- | --- |
| DER (all positions) | 2.29% |
| DER* (marked positions) | 3.37% |
| WER | 8.33% |
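
The card does not spell out how the three metrics are computed. The sketch below uses the common definitions and is only an illustration: DER counts base characters whose diacritics differ from the reference, DER* restricts that count to positions the reference marks, and WER counts words with at least one wrong position. `split_marks` and `diacritic_error_rates` are hypothetical names, and counting every non-space character as a position is an assumption, since the card does not say how punctuation or digits are treated.

```python
import unicodedata


def split_marks(text):
    """Return [base, marks] pairs for the NFD form of text."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and pairs:
            pairs[-1][1] += ch          # combining mark: attach to the previous base
        else:
            pairs.append([ch, ""])      # new base character, no marks yet
    return pairs


def diacritic_error_rates(gold, pred):
    """Compute DER, DER* and WER for one reference/prediction pair."""
    g, p = split_marks(gold), split_marks(pred)
    if [base for base, _ in g] != [base for base, _ in p]:
        raise ValueError("gold and prediction must share the same base characters")

    total = errors = marked = marked_errors = words = word_errors = 0
    in_word = word_wrong = False
    sentinel = [[" ", ""]]              # trailing space flushes the last word
    for (base, gold_marks), (_, pred_marks) in zip(g + sentinel, p + sentinel):
        if base.isspace():
            if in_word:
                words += 1
                word_errors += int(word_wrong)
            in_word = word_wrong = False
            continue
        in_word = True
        wrong = gold_marks != pred_marks
        total += 1
        errors += int(wrong)
        if gold_marks:                  # position carries a diacritic in the reference
            marked += 1
            marked_errors += int(wrong)
        word_wrong = word_wrong or wrong

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "DER": ratio(errors, total),
        "DER*": ratio(marked_errors, marked),
        "WER": ratio(word_errors, words),
    }
```

Under these definitions, a prediction that gets one diacritic wrong in a two-word phrase with seven letters, five of them marked in the reference, scores DER = 1/7, DER* = 1/5, and WER = 1/2. Because both strings pass through NFD first, the relative order of shadda and a short vowel does not count as an error.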

V2 is trained to abstain on unmarked positions: the loss applies ignore_index masking correctly to padded positions, so the model does not over-mark. INT8 quantization is effectively lossless (2.30% DER against the 2.29% reported above, at 2.5 MB).
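
The training loss itself is not reproduced in this card. As a rough illustration of what ignore_index masking means, the sketch below averages cross-entropy only over positions whose target is not the ignore value, so padding cannot pull the model toward any class. `masked_cross_entropy` is a hypothetical helper written in plain Python, and `-100` is assumed here because it is PyTorch's default `ignore_index`; the notebook may use a different value.

```python
import math

IGNORE_INDEX = -100  # assumed: PyTorch's default ignore_index, not confirmed by the card


def masked_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose target is not ignore_index.

    logits: one list of per-class scores per position.
    targets: one class id per position, or ignore_index for padding.
    """
    total, scored = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        scored += 1
    if scored == 0:
        raise ValueError("every position is masked")
    return total / scored
```

With this masking, a real character that carries no diacritic still has an ordinary target class and is scored, while padding is skipped entirely and also left out of the denominator.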

About the corpus: it aggregates many public Arabic sources. rawi-v2 was trained only on the train split and never on test.txt. However, about 1.7% of test sentences also appear in train because of near-duplicates introduced by aggregation, which is a smaller exposure than is typical for external baselines evaluated on the same set.

Architecture

Embedding(236, 128) → 2-layer bidirectional LSTM(256) → Linear(512, 75). Normalization is NFD with symbols (Unicode So) dropped; the model predicts one of 75 NFD-based diacritic classes per base character (it also restores hamzas and the superscript alef).
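
As a sanity check on the stated sizes, the layer shapes above can be turned into a parameter count. This is a back-of-the-envelope sketch that assumes PyTorch's `nn.LSTM` parametrization (four gates, two bias vectors per direction per layer) and reads "LSTM(256)" as a hidden size of 256 per direction, which matches the 512-wide input of the final linear layer. `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_size, hidden_size, num_layers, bidirectional=True):
    """Count weights and biases of a stacked LSTM, PyTorch-style."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first consume the concatenated outputs of both directions.
        layer_in = input_size if layer == 0 else hidden_size * directions
        weights = 4 * hidden_size * (layer_in + hidden_size)  # W_ih and W_hh for 4 gates
        biases = 2 * 4 * hidden_size                          # b_ih and b_hh
        total += directions * (weights + biases)
    return total


embedding = 236 * 128               # Embedding(236, 128)
lstm = lstm_params(128, 256, 2)     # 2-layer bidirectional LSTM(256)
linear = 512 * 75 + 75              # Linear(512, 75) with bias
total = embedding + lstm + linear

print(embedding, lstm, linear, total)  # 30208 2367488 38475 2436171
```

Roughly 2.44 million parameters at one byte each comes to about 2.4 MB, which is in line with the 2.5 MB reported for the INT8 file once graph metadata and any unquantized tensors are added.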

Files

| file | what |
| --- | --- |
| `rawi_v2.onnx` | fp32 ONNX, dynamic sequence length |
| `rawi_v2.int8.onnx` | INT8 dynamic quantization (2.5 MB, lossless) |
| `rawi_v2.vocab.json` | `char_to_idx` / `diac_to_idx` |
| `diacritization_model_lstm_2.pth` | original PyTorch checkpoint |
| `rawi_v2_train.ipynb` | training notebook |

Usage

```python
from text2tashkeel import Diacritizer  # pip install text2tashkeel

Diacritizer("rawi-v2").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```

text2tashkeel bundles this model and reproduces the exact NFD normalization and letter-only decode in pure Python (no PyTorch at inference).
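
To make the normalization and decode steps concrete, here is a minimal standard-library sketch. It is not the text2tashkeel implementation: the function names are hypothetical, "letter-only decode" is read as attaching predicted marks to letters only, and the final NFC recomposition is an assumption.

```python
import unicodedata


def normalize(text):
    """NFD-normalize and drop characters in Unicode category So (Symbol, other)."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if unicodedata.category(ch) != "So")


def split_letters_and_marks(text):
    """Split normalized text into base characters and the marks attached to each."""
    bases, marks = [], []
    for ch in normalize(text):
        if unicodedata.combining(ch) and bases:
            marks[-1] += ch        # combining mark belongs to the previous base
        else:
            bases.append(ch)
            marks.append("")
    return bases, marks


def apply_marks(bases, marks):
    """Attach marks to letters only; digits, spaces and punctuation stay bare."""
    pieces = [
        base + mark if base.isalpha() else base
        for base, mark in zip(bases, marks)
    ]
    # Assumption: recompose with NFC so that, for example, alef + hamza above
    # becomes a single code point again.
    return unicodedata.normalize("NFC", "".join(pieces))
```

Under NFD, a hamza carrier such as U+0623 decomposes into a bare alef plus a combining hamza, so a per-character classifier can treat the hamza as one more mark to predict. In a real pipeline `bases` would be mapped through `char_to_idx`, and `marks` would come from the predicted class for each position.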
