Koshur Diacritizer ByT5 Small
A ByT5-small model fine-tuned for Kashmiri/Koshur diacritic restoration: non-diacritic Kashmiri text → diacritic Kashmiri text. The average reviewer-rated accuracy of the model is approximately 77.5%. That is a reasonable first-model score for a low-resource diacritization task: the model captures most patterns but still has room to improve on edge cases and truncation issues.
Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
repo_id = "Omarrran/koshur-diacritizer-byt5-small"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
text = "کاشر زبان"
inputs = tokenizer(text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
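ByT5 spends one token per UTF-8 byte, and a Perso-Arabic letter usually takes two bytes, so a long sentence can run past the `max_new_tokens=256` limit used above and come back cut off. One possible workaround is to split the input on whitespace into byte-bounded chunks and diacritize each chunk separately. The sketch below is illustrative only: the helper name `chunk_by_bytes` and the 120-byte default are assumptions, not part of the released code.

```python
def chunk_by_bytes(text: str, max_bytes: int = 120) -> list[str]:
    """Split text on whitespace into chunks of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate.encode("utf-8")) > max_bytes:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed through `tokenizer` and `model.generate` as in the snippet above and the outputs rejoined with spaces. Chunking discards context across chunk boundaries, so it may cost some accuracy near the cuts.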
Final metrics
{
"validation": {
"loss": 0.061068128794431686,
"der_marked": 0.10005247507433969,
"der_all": 0.03764245052568145,
"wer": 0.1231150319412455,
"exact_match": 0.24492979719188768,
"runtime": 156.3775,
"samples_per_second": 8.198,
"steps_per_second": 0.262,
"epoch": 9.992486851990984
},
"test": {
"der_marked": 0.2011514510633298,
"der_all": 0.14687684306471502,
"wer": 0.21588209414870216,
"exact_match": 0.12782608695652173,
"n_sentences": 1150,
"n_units": 93255,
"n_marked": 17022
}
}
Generated training report
Kashmiri Diacritic Restoration — Run byt5small-ksdiac-extra-20260613
1. Method
We cast diacritic restoration as byte-level sequence-to-sequence transduction. The extra-dataset run was initialized from an earlier trained model and fine-tuned further; only the final retained checkpoint, training-checkpoints/checkpoint-6650, is kept in the Hub repo. Byte-level modelling avoids subword tokenisers that corrupt Perso-Arabic combining marks. The input is the un-diacritised (bare) skeleton; the target is the fully diacritised form. At inference, a skeleton guard rejects any output that alters the consonant skeleton, so the model can only add marks.
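The guard's implementation is not included in this report. A minimal sketch of the idea, assuming the skeleton is obtained by dropping Unicode combining marks (category `Mn`) and applying the letter fold reported in section 2, and assuming a rejected output falls back to the bare input (the function names `skeleton` and `guard` are hypothetical):

```python
import unicodedata

# Letter fold copied from the "Learned letter fold" entry in section 2.
FOLD = {'ٲ': 'ا', 'ؤ': 'و', 'آ': 'ا', 'ئ': 'ی', 'أ': 'ا',
        'ۂ': 'ہ', 'ۓ': 'ے', 'ٳ': 'ا', 'ٱ': 'ا', 'إ': 'ا'}


def skeleton(text: str) -> str:
    """Drop combining marks (category Mn) and fold letter variants."""
    return "".join(
        FOLD.get(ch, ch) for ch in text if unicodedata.category(ch) != "Mn"
    )


def guard(bare: str, predicted: str) -> str:
    """Accept the prediction only if its skeleton matches the input's.

    Falling back to the bare input on rejection is an assumption.
    """
    return predicted if skeleton(predicted) == skeleton(bare) else bare
```

With this check, a prediction that inserts, deletes, or swaps a base letter is discarded, while one that only adds marks or picks a folded letter variant passes through.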
2. Data understanding
- Column diacritic densities: {'input_text': 0.0, 'target_text': 0.13561}
- Chosen input column: input_text; target column: target_text
- Learned letter fold (10 entries): {'ٲ': 'ا', 'ؤ': 'و', 'آ': 'ا', 'ئ': 'ی', 'أ': 'ا', 'ۂ': 'ہ', 'ۓ': 'ے', 'ٳ': 'ا', 'ٱ': 'ا', 'إ': 'ا'}
- Alignment survival: 82.1% (kept 23727/28891; misaligned=16, dup=1974, len=3174)
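The alignment survival figure can be checked directly from the reported counts. The short calculation below assumes that `misaligned`, `dup`, and `len` are the three reasons a pair was dropped (reading `dup` as duplicates and `len` as a length filter is an interpretation, not something the report states):

```python
total, kept = 28891, 23727
dropped = {"misaligned": 16, "dup": 1974, "len": 3174}

survival = kept / total
print(f"survival = {survival:.1%}")            # survival = 82.1%
print(kept + sum(dropped.values()) == total)   # True: the drops account for every removed pair
print(21295 + 1282 + 1150 == kept)             # True: the three splits sum to the kept pairs
```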
Split statistics
| split | rows | mean chars | p95 chars | diac. density |
|---|---|---|---|---|
| train | 21295 | 113.6 | 191 | 0.1243 |
| validation | 1282 | 116.3 | 193 | 0.1246 |
| test | 1150 | 114.7 | 193 | 0.1237 |
3. Training configuration
| setting | value |
|---|---|
| model | Omarrran/koshur-diacritizer-byt5-small; retained checkpoint: training-checkpoints/checkpoint-6650 |
| epochs | 10.0 |
| lr | 0.0005 |
| effective batch | 32 |
| precision | bf16 |
| scheduler | cosine |
| max len (bytes) | 256 |
| GPU | NVIDIA L4 |
| torch | 2.4.1+cu121 |
| transformers | 4.44.2 |
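The checkpoint number is consistent with this configuration. A quick check, assuming the trainer counts only full optimizer steps per epoch (the rounding rule is an assumption; the row count comes from the split statistics):

```python
train_rows, effective_batch, epochs = 21295, 32, 10

# Full optimizer steps per epoch, dropping the partial last step.
steps_per_epoch = train_rows // effective_batch
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 665 6650
```

The total of 6650 steps matches the retained checkpoint name, checkpoint-6650.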
4. Results
| metric | validation | test |
|---|---|---|
| DER (marked) | 0.1001 | 0.2012 |
| DER (all) | 0.0376 | 0.1469 |
| WER | 0.1231 | 0.2159 |
| Exact match | 0.2449 | 0.1278 |
Test set: 1150 sentences, 93255 letters (17022 diacritised).
DER (marked) — the headline metric — is the error rate over letters that carry a diacritic in the reference. Lower is better.
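The evaluation script itself is not part of this report. A minimal sketch of how the two DER variants could be computed, assuming reference and prediction share the same base letters (which the skeleton guard is meant to ensure), that whitespace is not counted, and that a letter is "marked" when it carries at least one combining mark; the names `units` and `der` are hypothetical. Precomposed letters such as those in the letter fold are compared as whole units here but are not counted as marked, which may differ from the report's definition.

```python
import unicodedata


def units(text: str) -> list[tuple[str, str]]:
    """Split text into (base letter, attached marks) pairs, skipping whitespace."""
    out: list[tuple[str, str]] = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            if out:  # attach the mark to the preceding base letter
                base, marks = out[-1]
                out[-1] = (base, marks + ch)
        elif not ch.isspace():
            out.append((ch, ""))
    return out


def der(reference: str, prediction: str) -> tuple[float, float]:
    """Return (DER over marked letters, DER over all letters)."""
    ref, hyp = units(reference), units(prediction)
    if len(ref) != len(hyp):
        raise ValueError("reference and prediction differ in base letters")
    pairs = list(zip(ref, hyp))
    wrong_all = sum(r != h for r, h in pairs)
    marked = [(r, h) for r, h in pairs if r[1]]  # reference letter has a mark
    wrong_marked = sum(r != h for r, h in marked)
    return (
        wrong_marked / len(marked) if marked else 0.0,
        wrong_all / len(ref) if ref else 0.0,
    )
```

Because most letters carry no mark and are trivially correct, DER (all) comes out lower than DER (marked), as in the results table.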
6. Qualitative samples
7. Reproducibility
All artefacts are in this directory: run_config.json, dataset_stats.json, history.jsonl, metrics.json, predictions.jsonl, confusion.json, and model/.
Model tree for Omarrran/koshur-diacritizer-byt5-small
Base model: google/byt5-small
Evaluation results
- Test DER-marked on Combined Kashmiri Non-Diacritic to Diacritic Parallel Dataset (self-reported): 0.2012 (lower is better)
- Human review eval on Combined Kashmiri Non-Diacritic to Diacritic Parallel Dataset (self-reported): 77.6%
