arxiv:2606.15883

Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

Published on Jun 14

Upvote

Authors:

Haq Nawaz Malik ,

Nahfid Nissar ,

Faizan Iqbal

Abstract

A ByT5-small sequence-to-sequence model is developed for Kashmiri diacritic restoration, featuring a novel dataset and evaluation metrics demonstrating effective performance in handling low-resource language processing challenges.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. Experimental results on a held-out test set achieve a DERm of 0.2012 and a WER of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2606.15883

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.15883 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.15883 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.15883 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.