⚠️ Koshur OCR v3 (Beta Baseline Model)

THIS IS A PRELIMINARY BASELINE MODEL AND IS NOT RECOMMENDED FOR PRODUCTION USE.

Performance: This model (Version 3) has a high Character Error Rate (CER) of ~79%. It struggles with sentence context, frequently omits characters, and fails on complex cursive ligatures.

Next Version Coming Soon: Koshur OCR v4 is currently in training and is expected to bring a substantially lower CER and full Nastaliq/Naskh support. For anything beyond experimentation, please wait for the v4 release.
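For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the model output and the reference transcription, divided by the length of the reference. The following is a minimal sketch of that standard definition. This card does not say which evaluation script produced the ~79% figure, and the function names here are illustrative only.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One dropped character out of four reference characters.
print(cer("abcd", "abd"))
```

Python strings are compared code point by code point, so a combining diacritic counts as its own character in this sketch. CER can exceed 100% when the output contains many insertions.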

Model Overview

Koshur OCR v3 is a fine-tuned Vision Encoder-Decoder (TrOCR) model designed to transcribe printed and handwritten Kashmiri text (written in the Perso-Arabic script) from images into digital Unicode text.

Kashmiri presents unique challenges for OCR because it contains letters and vowel markings not found in standard Arabic or Urdu (such as ۆ, ۄ, and ؠ). Version 3 serves as the initial pilot implementation extending TrOCR to handle these character mappings.

Key Details

Architecture: TrOCR (Vision Transformer (ViT) encoder + RoBERTa-based text decoder).
Base Model: Initialized from RayR1/trocr-base-arabic-handwritten to leverage its strong baseline understanding of cursive Perso-Arabic script lines and characters.
Tokenizer: Custom-extended vocabulary containing all 47 letters of the Kashmiri alphabet, including the unique Kashmiri characters.
Language: Kashmiri (ks) written in the modified Perso-Arabic script.
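Extending a TrOCR vocabulary generally means adding the new characters to the tokenizer and then resizing the decoder's embedding matrix to match. The following is a rough sketch of that pattern, not the project's actual procedure, which this card does not describe. The helper names are hypothetical, and the membership check assumes a vocabulary that stores characters directly; a byte-level BPE vocabulary would need a different check.

```python
# The three Kashmiri-specific characters named in this card.
KASHMIRI_SPECIFIC = ["ۆ", "ۄ", "ؠ"]


def missing_characters(vocab, characters):
    """Return the characters that have no entry of their own in `vocab`."""
    return [ch for ch in characters if ch not in vocab]


def extend_vocabulary(model, processor, characters):
    """Add missing characters as tokens and resize the decoder to match.

    Assumes `model` is a VisionEncoderDecoderModel and `processor` is a
    TrOCRProcessor. Both are passed in, so nothing is downloaded here.
    """
    tokenizer = processor.tokenizer
    new_tokens = missing_characters(tokenizer.get_vocab(), characters)
    num_added = tokenizer.add_tokens(new_tokens)
    if num_added > 0:
        # New token ids need matching embedding rows in the text decoder.
        model.decoder.resize_token_embeddings(len(tokenizer))
    return num_added
```

With a loaded model and processor, `extend_vocabulary(model, processor, KASHMIRI_SPECIFIC)` returns the number of tokens added. The new embedding rows start untrained, so fine-tuning is still needed before the added characters can be predicted reliably.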

Training Dataset

This model was trained on a mixture of two datasets at different granularities (single words and full text lines):

600k Word-Segmented Kashmiri Dataset (Omarrran/600k_KS_OCR_Word_Segmented_Dataset): A collection of synthetic, single-word cropped images in diverse fonts (Naskh, Nastaliq, Nakash) and background textures.
Koshur Pixel Dataset (Omarrran/Koshur_Pixel): Sentence-level and paragraph text-line images, included to teach the decoder spelling, grammar, and right-to-left (RTL) reading order.

Training Details

Optimizer: AdamW
Learning Rate: $1 \times 10^{-5}$ (fine-tuning)
Learning Rate Scheduler: Cosine Annealing with Warmup
Batch Size: 8 with Gradient Accumulation of 4 (effective batch size: 32)
Mixed Precision: FP16 via PyTorch automatic mixed precision (autocast with GradScaler)
Validation Interval: End of each epoch
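The settings above can be combined roughly as follows. This is a minimal sketch, not the project's training script: a toy linear layer and random tensors stand in for the TrOCR model and the image batches, the warmup length is an assumption (the card does not state it), and the function names are illustrative.

```python
import math

import torch
from torch import nn


def cosine_with_warmup(step, warmup_steps, total_steps):
    """LR multiplier: linear warmup, then cosine decay towards zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def train_sketch(num_micro_batches=8, accum_steps=4, warmup_steps=1):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    use_amp = device == "cuda"  # FP16 autocast is only enabled on GPU here

    model = nn.Linear(16, 4).to(device)  # toy stand-in for the TrOCR model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    total_steps = num_micro_batches // accum_steps
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lambda step: cosine_with_warmup(step, warmup_steps, total_steps),
    )
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

    optimizer_steps = 0
    optimizer.zero_grad()
    for i in range(num_micro_batches):
        inputs = torch.randn(8, 16, device=device)  # micro-batch of 8
        targets = torch.randn(8, 4, device=device)
        with torch.autocast(device_type=device, enabled=use_amp):
            # Divide so the accumulated gradient averages over 4 x 8 = 32 samples.
            loss = nn.functional.mse_loss(model(inputs), targets) / accum_steps
        scaler.scale(loss).backward()

        if (i + 1) % accum_steps == 0:
            scaler.step(optimizer)  # unscales gradients, then steps AdamW
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()
            optimizer_steps += 1
    return optimizer_steps


print(train_sketch())  # number of optimizer updates performed
```

With 8 micro-batches and an accumulation of 4, the optimizer is updated twice, each update covering an effective batch of 32 samples.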

Limitations

Greedy Decoding Errors: Output quality depends heavily on the decoding method. With plain greedy decoding (num_beams=1), the model makes more spelling mistakes than with beam search.
Vowel Alignment: The model struggles to place small diacritics (Zabar, Zer, Pesh) correctly above or below characters.
Deletions: Occasionally skips words in a sentence if they are tightly packed or handwritten.
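To see the decoding sensitivity on your own images, the same line can be transcribed with greedy decoding and with beam search and the two outputs compared. The following is a small sketch under the assumption that `model` and `processor` have been loaded as shown in the How to Use section; the helper names `transcribe` and `compare_decoding` are hypothetical.

```python
def transcribe(model, processor, image, num_beams=4, max_length=128):
    """Run one image through the model with the given beam width."""
    pixel_values = processor(image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(model.device)
    generated_ids = model.generate(
        pixel_values,
        max_length=max_length,
        num_beams=num_beams,
        # early_stopping only applies to beam search.
        early_stopping=num_beams > 1,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text.strip()


def compare_decoding(model, processor, image):
    """Return greedy and 4-beam transcriptions of the same image."""
    return {
        "greedy": transcribe(model, processor, image, num_beams=1),
        "beam_4": transcribe(model, processor, image, num_beams=4),
    }
```

Calling `compare_decoding(model, processor, image)` on a PIL image returns both strings, which makes it easy to check whether beam search actually changes the result for a given line.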

How to Use

You can load and test this model using the Hugging Face transformers library:

import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, TrOCRProcessor

# 1. Load the model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "Faizaniqbal/Koshur-OCR-v3-checkpoints"

processor = TrOCRProcessor.from_pretrained(model_name)
model = VisionEncoderDecoderModel.from_pretrained(model_name).to(device)

# 2. Prepare an image
# Make sure your image is cropped to a line of Kashmiri text
image_path = "path/to/your/kashmiri_text_line.png"
image = Image.open(image_path).convert("RGB")

# 3. Preprocess and generate text
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# We recommend using Beam Search (num_beams=4) for better results in V3
generated_ids = model.generate(
    pixel_values,
    max_length=128,
    num_beams=4,
    early_stopping=True
)

predicted_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Predicted Kashmiri Text:", predicted_text.strip())

Citation & License

This model is released for research purposes under the Apache 2.0 License. If you use these checkpoints or datasets in your work, please cite the original Koshur OCR project.
