Muharaf Arabic Handwriting OCR: CNN-Transformer-CTC

Line-level handwritten Arabic OCR for the Muharaf corpus. The recogniser is a custom character-level CNN-Transformer-CTC model with attention-based height pooling, paired with a calibrated CTC decoder and a learned N-best reranker.

Headline result (test set, 1,334 lines): CER 0.1331 · WER 0.3921 with the full pipeline (backbone + calibrated decode + neural-LM rerank + risk-aware reranker). The published backbone alone with plain greedy CTC decoding scores CER ~0.170.

The full pipeline gives a ~81% relative CER reduction over the conventional CRNN baseline (CER 0.706) and a large improvement over every TrOCR transfer-learning attempt (best TrOCR CER ~0.78). The main finding of the project: for this Arabic handwriting task, direct character-level CTC alignment beats subword/token encoder–decoder generation (TrOCR).
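
The relative reductions quoted above follow directly from the reported CER values. A minimal sketch of the arithmetic (the helper name `relative_reduction` is introduced here for illustration only):

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction of the baseline error that has been removed."""
    return (baseline - new) / baseline

crnn_cer = 0.7060    # CRNN baseline, from the experiment ladder
final_cer = 0.1331   # full pipeline
greedy_cer = 0.170   # published backbone with plain greedy decoding (approximate)

print(f"{relative_reduction(crnn_cer, final_cer):.1%}")   # 81.1%
print(f"{relative_reduction(crnn_cer, greedy_cer):.1%}")  # 75.9%
```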


Quickstart (greedy inference)

# Shell: install dependencies
pip install torch pillow numpy huggingface_hub

# Python: download the repo and run greedy inference
from huggingface_hub import snapshot_download
import sys
from PIL import Image

# Download the repo (weights + helper code)
local = snapshot_download("sdkv2/muharaf-arabic-ocr")
sys.path.insert(0, local)

from submission_code.torch_models import load_ctc_backbone, recognise_line

model, charset, cfg = load_ctc_backbone(f"{local}/models/ctc_backbone/model.pt")

image = Image.open(f"{local}/data/demo_samples/sample_02_test-00000-of-00001_1230.png")
print(recognise_line(model, charset, image))
# -> قبلات حارة لامته لها والف تحية وسلام وبعده

A full walkthrough is in notebooks/05_final_model_inference.ipynb.
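
The internals of `recognise_line` are not shown in this card. As a rough illustration of what greedy CTC decoding does in general, here is a minimal sketch that assumes the per-frame argmax class ids are already available and uses a hypothetical id-to-character mapping (the real charset layout may differ):

```python
from itertools import groupby

BLANK = 0  # the model card places the CTC blank at index 0

def greedy_ctc_decode(frame_ids, charset):
    """Collapse consecutive repeats, then drop blanks.

    frame_ids: best class id per time step (argmax of the CTC logits).
    charset:   hypothetical mapping from class id to character.
    """
    collapsed = [k for k, _ in groupby(frame_ids)]
    return "".join(charset[i] for i in collapsed if i != BLANK)

# Toy example: the blank between the two runs of 1 keeps both characters.
charset = {1: "a", 2: "b"}
frames = [0, 1, 1, 0, 1, 2, 2, 0]
print(greedy_ctc_decode(frames, charset))  # aab
```

A repeated character survives only when a blank separates its two runs, which is why the blank class matters for doubled letters.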


Repository structure

config.json                # root model config (also the Hub download-count query file)
models/
  ctc_backbone/            # the visual recogniser (this is the model you load for inference)
    model.pt               #   torch checkpoint: {model_state, charset, model_config}
    config.json            #   human/machine-readable architecture + metrics
    training_metrics.json  #   full training history for this checkpoint
  char_ngram_lm/
    char_lm.json           # character 5-gram LM used by the calibrated beam decoder
  neural_char_lm/
    char_transformer_lm.pt # neural character-level Transformer LM for N-best reranking
    lm_metrics.json
    rerank_summary.json    # selected rerank weights + test metrics (V7 stage)
  risk_reranker/
    risk_reranker.pt       # supervised risk-aware N-best reranker (final stage)
    summary.json           # selected config + final test/oracle metrics (V9 stage)
  decode_config/
    v5_antideletion_summary.json  # calibrated-decode hyperparameters (blank penalty etc.)

submission_code/           # small, dependency-light helper package
  torch_models.py          #   CNNTransformerCTC architecture + load/inference helpers
  ocr_helpers.py           #   preprocessing, charset, greedy CTC decode, CER/WER

notebooks/                 # Jupyter notebooks (training + inference)
data/
  demo_samples/            # example line images + references for the inference demo
  results/                 # final metric table + report figures
  manifests/               # CSV manifests (train/val/test splits) for reproduction

The final system

The headline number is produced by a four-stage pipeline. Only the backbone is needed for a usable demo; the remaining stages are decoding/reranking refinements that do not retrain the visual model.

1. Visual backbone: models/ctc_backbone/model.pt (4.25M params)

  • Grayscale line image, resized to height 96 (aspect preserved), pixels normalised to [0, 1].
  • CNN frontend (4 conv blocks, BatchNorm + GELU, two 2× max-pools → 4× width stride).
  • Attention-based height pooling collapses the vertical axis into a width-wise sequence.
  • Linear projection → sinusoidal positional encoding → 4-layer Transformer encoder (d_model 256, 8 heads, FFN 1024, pre-norm).
  • Linear CTC classifier over a 166-class vocabulary (165 characters + CTC blank at index 0).
  • Trained with a small train-time blank-logit penalty (0.2) to counter CTC's deletion bias.
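
To make the attention-based height pooling step concrete, here is a dependency-free sketch. It assumes each position of the CNN feature map has a scalar attention logit (in the real model these would presumably come from a learned scoring layer; that detail is an assumption, not taken from the repository). For every width position, the logits are softmaxed over the height axis and used to average the feature vectors:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_height_pool(features, scores):
    """features[h][w] is a channel vector; scores[h][w] is a scalar logit.

    Returns one pooled channel vector per width position.
    """
    height, width = len(features), len(features[0])
    channels = len(features[0][0])
    pooled = []
    for w in range(width):
        weights = softmax([scores[h][w] for h in range(height)])
        pooled.append([
            sum(weights[h] * features[h][w][c] for h in range(height))
            for c in range(channels)
        ])
    return pooled

# Toy feature map: height 2, width 2, 2 channels.
features = [
    [[1.0, 0.0], [2.0, 2.0]],
    [[3.0, 4.0], [0.0, 0.0]],
]
scores = [
    [0.0, 10.0],   # column 0: equal attention; column 1: top row dominates
    [0.0, -10.0],
]
pooled = attention_height_pool(features, scores)
print(pooled[0])  # [2.0, 2.0], the plain average of the two rows
```

Unlike max or mean pooling over height, the weights can differ per column, so the model can focus on whichever rows carry ink at each horizontal position.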

2. Calibrated CTC decoder: models/char_ngram_lm/char_lm.json + models/decode_config/

Beam search (width 25, top-k 40) with a character 5-gram LM and anti-deletion calibration: lm_weight 0.2, length_bonus 0.1, blank_penalty 0.6, score_mode sum. The backbone with this decoder reaches CER ≈ 0.1416.
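
How the project's decoder applies `blank_penalty` is not spelled out here. One common reading, used in this sketch as an assumption, is that a constant is subtracted from the blank log-probability in every frame before the search, which nudges borderline frames away from blank and so counters deletions:

```python
import math

BLANK_ID = 0          # the model card places the CTC blank at index 0
BLANK_PENALTY = 0.6   # decode-time value quoted above

def penalise_blank(frame_logprobs, penalty=BLANK_PENALTY, blank_id=BLANK_ID):
    """Return a copy of the per-frame log-probabilities with the blank score lowered."""
    return [
        [lp - penalty if k == blank_id else lp for k, lp in enumerate(frame)]
        for frame in frame_logprobs
    ]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Toy frame: blank (index 0) narrowly beats character 1.
frames = [[math.log(0.55), math.log(0.45)]]
before = [argmax(f) for f in frames]
after = [argmax(f) for f in penalise_blank(frames)]
print(before, after)  # [0] [1]
```

Frames where the blank wins by a wide margin are unaffected; only near-ties flip, which is the intended anti-deletion effect.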

3. Neural-LM N-best reranking: models/neural_char_lm/char_transformer_lm.pt

A 4-layer character Transformer LM (d_model 256, 8 heads, val perplexity ≈ 7.37) rescores the 50-best list: acoustic 1.0, ngram_lm 0.15, neural_lm 0.2, length_bonus 0.14.
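
The weights above suggest a linear combination of per-hypothesis scores. The sketch below assumes exactly that: each hypothesis carries log-scores from the acoustic model and the two LMs, and the length bonus is multiplied by the character count. The field names and toy scores are hypothetical; only the four weights come from this card.

```python
WEIGHTS = {"acoustic": 1.0, "ngram_lm": 0.15, "neural_lm": 0.2, "length_bonus": 0.14}

def combined_score(hyp):
    """Weighted sum of log-scores plus a per-character length bonus."""
    return (
        WEIGHTS["acoustic"] * hyp["acoustic"]
        + WEIGHTS["ngram_lm"] * hyp["ngram_lm"]
        + WEIGHTS["neural_lm"] * hyp["neural_lm"]
        + WEIGHTS["length_bonus"] * len(hyp["text"])
    )

def rerank(nbest):
    """Pick the hypothesis with the highest combined score."""
    return max(nbest, key=combined_score)

# Toy 2-best list (all scores are made-up log-probabilities).
nbest = [
    {"text": "abc",  "acoustic": -3.0, "ngram_lm": -6.0, "neural_lm": -5.0},
    {"text": "abcd", "acoustic": -3.1, "ngram_lm": -5.0, "neural_lm": -4.0},
]
print(rerank(nbest)["text"])  # abcd
```

Here the second hypothesis wins despite a slightly worse acoustic score, because both LMs and the length bonus favour it.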

4. Risk-aware learned reranker: models/risk_reranker/risk_reranker.pt (final stage)

A small supervised scorer (feature_dim 81, hidden 64) trained on validation N-best lists with soft edit-risk supervision (risk_weight 0.2, target_temperature 2.0). Selects the final hypothesis → CER 0.1331.
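
The exact training objective is not given in this card. One plausible reading of "soft edit-risk supervision", offered purely as an assumption, is that each N-best list gets a soft target distribution that favours hypotheses with a lower edit distance to the reference, softened by the temperature, with an expected-risk term available as an auxiliary loss. The function names below are hypothetical:

```python
import math

def soft_targets(edit_distances, temperature=2.0):
    """Softmax over negative edit distances: lower distance, higher target mass."""
    logits = [-d / temperature for d in edit_distances]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_risk(probs, edit_distances):
    """Expected edit distance under a distribution over the hypotheses."""
    return sum(p * d for p, d in zip(probs, edit_distances))

# Toy N-best list whose hypotheses are 0, 2 and 4 edits from the reference.
distances = [0, 2, 4]
targets = soft_targets(distances)
print([round(t, 3) for t in targets])  # [0.665, 0.245, 0.09]
```

A higher temperature spreads target mass over near-miss hypotheses instead of rewarding only the single best one.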

Note: stages 2–4 are documented here with their exact selected configs, and the weights are included, but the full beam-decode + N-best + rerank driver code lives in the original project scripts. For most uses, the backbone with greedy decoding (or a small beam) is the practical path; reproducing the exact 0.1331 requires the offline decode/rerank pipeline.


Notebooks

Notebook                                   Contents
01_preprocessing_and_dataset_audit.ipynb   Image preprocessing + text normalization policy, manifest checks
02_ahcd_cnn_baseline.ipynb                 AHCD isolated-character CNN baseline
03_muharaf_crnn_ctc_baseline.ipynb         Line-level CRNN/CTC baseline workflow
04_results_summary.ipynb                   Final metric table + report figures
05_final_model_inference.ipynb             Load the published backbone and run OCR on demo line images

Experiment ladder

Stage                         Method                                                      Test CER        Test WER
CRNN baseline                 CNN/RNN/CTC baseline                                        0.7060          1.0215
TrOCR transfer                Best retained TrOCR run                                     ~0.7752         ~0.9913
V1 CNN-Transformer-CTC        New backbone, greedy                                        0.2732          0.7603
V2 aug + long                 Light aug, longer training, beam                            0.1753          0.5621
Calibrated CTC decoder        Blank/length calibration (V5/V6)                            0.1484          0.4392
Neural-LM rerank (V7)         Neural char-LM N-best reranking                             0.1379–0.1455   0.408–0.423
Learned reranker (V8)         Supervised N-best selection                                 0.1395–0.1403   0.403–0.405
Final (V13 + risk reranker)   Attention pooling + calibrated decode + risk-aware rerank   0.1331          0.3921
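
For readers unfamiliar with the metrics in this table, here is a minimal standalone sketch of CER and WER as normalised Levenshtein distance. The repository ships its own implementation in submission_code/ocr_helpers.py, which may aggregate over the corpus differently. Because insertions count as errors, WER can exceed 1, as it does for the CRNN baseline row.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate on whitespace tokens; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

ref, hyp = "abc def", "a bc d ef"
print(round(cer(ref, hyp), 4))  # 0.2857 (two inserted spaces over 7 characters)
print(wer(ref, hyp))            # 2.0 (four word errors over 2 reference words)
```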

Figures: CER progression · WER progression · Decoder gains · Oracle headroom


Training details (backbone)

  • Data: Muharaf line images. 22,091 train / 1,069 val / 1,334 test lines.
  • Input: grayscale, height 96, aspect-preserving resize, white-padded in batches.
  • Text: crnn_baseline normalization (NFC, tatweel removed, spaces collapsed); 165-character vocabulary, zero OOV on val/test.
  • Optimisation: the final checkpoint warm-starts from the V6 anti-deletion fine-tune, 20 epochs, AdamW, OneCycleLR, lr 5e-5, weight decay 1e-4, grad clip 5.0, effective batch size 8 (batch 4 × accum 2), fp16 mixed precision.
  • Loss: CTC (zero_infinity=True), blank id 0, train-time blank-logit penalty 0.2.
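
The text normalisation listed above (NFC, tatweel removal, space collapsing) can be sketched with the standard library alone. This is an approximation of the described policy, not the project's actual crnn_baseline code; in particular, stripping leading and trailing whitespace is an assumption.

```python
import re
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL, the elongation character

def normalise_text(text: str) -> str:
    """NFC-normalise, drop tatweel, and collapse whitespace runs to one space."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()

# Tatweel inside the first word and a run of three spaces are both removed.
print(normalise_text("س\u0640لام   عليكم") == "سلام عليكم")  # True
```

NFC also composes sequences such as alef followed by a combining madda (U+0627 U+0653) into the single code point U+0622, so visually identical strings map to the same label sequence.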

Full per-epoch history is in models/ctc_backbone/training_metrics.json.


Intended use & limitations

  • Intended use: research and educational OCR of Muharaf-style handwritten Arabic lines. Best on single, reasonably-segmented text lines similar to the training distribution.
  • Out of scope: printed Arabic, full-page layout/segmentation, other scripts, diacritized religious text, and modern documents far from the Muharaf domain. The model recognises a pre-segmented line; it does not detect lines.
  • Known weakness (under-generation): the dominant error is deletion. Predictions average ~46.7 chars vs ~49.0 reference chars; the model omits characters in dense/difficult handwriting. The anti-deletion decoder and reranker reduce but do not eliminate this.
  • Headroom: the N-best oracle reaches CER 0.110 vs the selected 0.133, so better candidate selection (not just better vision) is the most promising next step.

Dataset & license

The model is trained on the Muharaf handwritten Arabic dataset, which is governed by its own license and terms; please consult the dataset providers before commercial or redistributive use. The code and model weights in this repository are released under the MIT license (see YAML header); adjust if your dataset terms require otherwise.

Citation

@misc{muharaf_arabic_ocr_2026,
  title  = {Muharaf Arabic Handwriting OCR: a character-level CNN-Transformer-CTC pipeline},
  author = {Shihadeh, Aiden},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/sdkv2/muharaf-arabic-ocr}}
}