Muharaf Arabic Handwriting OCR — CNN-Transformer-CTC

Line-level handwritten Arabic OCR for the Muharaf corpus. The recogniser is a custom character-level CNN-Transformer-CTC model with attention-based height pooling, paired with a calibrated CTC decoder and a learned N-best reranker.

Headline result (test set, 1,334 lines): CER 0.1331 · WER 0.3921 with the full pipeline (backbone + calibrated decode + neural-LM rerank + risk-aware reranker). The published backbone alone with plain greedy CTC decoding scores CER ~0.170.

This is a roughly 81% relative CER reduction over the conventional CRNN baseline (CER 0.706) and a large improvement over every TrOCR transfer-learning attempt (best TrOCR CER ~0.78). The main finding of the project: for this Arabic handwriting task, direct character-level CTC alignment beats subword/token encoder–decoder generation (TrOCR).
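
The relative reductions follow directly from the reported CER values. The short check below recomputes them; `relative_reduction` is an illustrative helper, not part of the repository, and the TrOCR figure is the approximate value from the experiment ladder.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


crnn_cer = 0.706     # conventional CRNN baseline (test CER)
trocr_cer = 0.7752   # best retained TrOCR run (test CER, approximate)
final_cer = 0.1331   # full four-stage pipeline (test CER)

print(f"vs CRNN:  {relative_reduction(crnn_cer, final_cer):.1%}")   # 81.1%
print(f"vs TrOCR: {relative_reduction(trocr_cer, final_cer):.1%}")  # 82.8%
```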

Quickstart (greedy inference)

pip install torch pillow numpy huggingface_hub

from huggingface_hub import snapshot_download
import sys
from PIL import Image

# Download the repo (weights + helper code)
local = snapshot_download("sdkv2/muharaf-arabic-ocr")
sys.path.insert(0, local)

from submission_code.torch_models import load_ctc_backbone, recognise_line

model, charset, cfg = load_ctc_backbone(f"{local}/models/ctc_backbone/model.pt")

image = Image.open(f"{local}/data/demo_samples/sample_02_test-00000-of-00001_1230.png")
print(recognise_line(model, charset, image))
# -> قبلات حارة لامته لها والف تحية وسلام وبعده

A full walkthrough is in notebooks/05_final_model_inference.ipynb.
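
Greedy CTC decoding, which the quickstart relies on, amounts to an argmax per frame followed by merging consecutive repeats and dropping blanks. The sketch below shows that collapse rule on hand-written frame scores. `greedy_ctc_decode` and the two-letter charset are illustrative, and the repository's own helper in submission_code/ocr_helpers.py may differ in detail (for instance in how the charset is indexed).

```python
from typing import List, Sequence

BLANK_ID = 0  # the model card places the CTC blank at index 0


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]],
                      charset: Sequence[str]) -> str:
    """Argmax per frame, merge consecutive repeats, then drop blanks.

    Assumption of this sketch: charset[i] is the character for class id i + 1,
    because id 0 is reserved for the blank.
    """
    best_ids: List[int] = [max(range(len(f)), key=f.__getitem__)
                           for f in frame_scores]
    out: List[str] = []
    prev = BLANK_ID
    for idx in best_ids:
        # Emit only when the label changes and is not the blank.
        if idx != prev and idx != BLANK_ID:
            out.append(charset[idx - 1])
        prev = idx
    return "".join(out)


# Seven frames over {blank, 'a', 'b'}; per-frame argmax ids are 1 1 0 1 2 2 0.
frames = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
    [0.20, 0.10, 0.70],
    [0.80, 0.10, 0.10],
]
print(greedy_ctc_decode(frames, ["a", "b"]))  # aab
```

The blank between the second and fourth frames is what lets the decoder output the doubled "a"; without it the two runs of "a" would merge into one.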

Repository structure

config.json                # root model config (also the Hub download-count query file)
models/
  ctc_backbone/            # the visual recogniser (this is the model you load for inference)
    model.pt               #   torch checkpoint: {model_state, charset, model_config}
    config.json            #   human/machine-readable architecture + metrics
    training_metrics.json  #   full training history for this checkpoint
  char_ngram_lm/
    char_lm.json           # character 5-gram LM used by the calibrated beam decoder
  neural_char_lm/
    char_transformer_lm.pt # neural character-level Transformer LM for N-best reranking
    lm_metrics.json
    rerank_summary.json    # selected rerank weights + test metrics (V7 stage)
  risk_reranker/
    risk_reranker.pt       # supervised risk-aware N-best reranker (final stage)
    summary.json           # selected config + final test/oracle metrics (V9 stage)
  decode_config/
    v5_antideletion_summary.json  # calibrated-decode hyperparameters (blank penalty etc.)

submission_code/           # small, dependency-light helper package
  torch_models.py          #   CNNTransformerCTC architecture + load/inference helpers
  ocr_helpers.py           #   preprocessing, charset, greedy CTC decode, CER/WER

notebooks/                 # Jupyter notebooks (training + inference)
data/
  demo_samples/            # example line images + references for the inference demo
  results/                 # final metric table + report figures
  manifests/               # CSV manifests (train/val/test splits) for reproduction

The final system

The headline number is produced by a four-stage pipeline. Only the backbone is needed for a usable demo; the remaining stages are decoding/reranking refinements that do not retrain the visual model.

1. Visual backbone — models/ctc_backbone/model.pt (4.25M params)

Grayscale line image, resized to height 96 (aspect preserved), pixels normalised to [0, 1].
CNN frontend (4 conv blocks, BatchNorm + GELU, two 2× max-pools → 4× width stride).
Attention-based height pooling collapses the vertical axis into a width-wise sequence.
Linear projection → sinusoidal positional encoding → 4-layer Transformer encoder (d_model 256, 8 heads, FFN 1024, pre-norm).
Linear CTC classifier over a 166-class vocabulary (165 characters + CTC blank at index 0).
Trained with a small train-time blank-logit penalty (0.2) to counter CTC's deletion bias.
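
To make the height-pooling step concrete, here is a minimal pure-Python sketch of one common formulation: each row of a feature-map column is scored against a learned query vector, the scores are softmaxed over the height axis, and the rows are averaged with those weights. This is an assumption for illustration only. The actual module in submission_code/torch_models.py may score rows differently, and the names `attention_height_pool` and `query` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_height_pool(column: List[Vector], query: Vector) -> Vector:
    """Collapse one column of shape (height, channels) to (channels,).

    Each row is scored by its dot product with `query`; the softmax of the
    scores over the height axis weights the rows in the pooled output.
    """
    scores = [sum(f * q for f, q in zip(row, query)) for row in column]
    weights = softmax(scores)
    channels = len(column[0])
    return [sum(w * row[c] for w, row in zip(weights, column))
            for c in range(channels)]


# Toy feature map: width 2, height 3, 2 channels, indexed [x][y][c].
feature_map = [
    [[0.0, 0.0], [4.0, 1.0], [0.0, 0.0]],  # activation concentrated in the middle row
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # uniform column
]
query = [1.0, 0.0]  # hypothetical learned parameter

# One pooled vector per horizontal position: a width-wise sequence.
sequence = [attention_height_pool(col, query) for col in feature_map]
print(sequence)
```

In the first column the pooled value stays close to the strong middle row (about 3.86 in channel 0), whereas plain mean pooling over height would dilute it to 4/3. That is the motivation for attending over height rather than averaging.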

2. Calibrated CTC decoder — models/char_ngram_lm/char_lm.json + decode_config/

Beam search (width 25, top-k 40) with a character 5-gram LM and anti-deletion calibration: lm_weight 0.2, length_bonus 0.1, blank_penalty 0.6, score_mode sum. Backbone + this decoder ≈ CER 0.1416.
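
The effect of the blank penalty can be illustrated without a full beam search. The sketch below subtracts a constant from the blank log-score of every frame and compares the per-frame argmax before and after. The value 0.6 comes from the config above; how the real decoder applies it inside beam search is not shown in this document, so treat the helper names and the toy scores as assumptions.

```python
from typing import List

BLANK_ID = 0


def apply_blank_penalty(log_probs: List[List[float]],
                        penalty: float) -> List[List[float]]:
    """Subtract `penalty` from the blank log-score of every frame."""
    return [[lp - penalty if k == BLANK_ID else lp
             for k, lp in enumerate(frame)]
            for frame in log_probs]


def argmax_ids(log_probs: List[List[float]]) -> List[int]:
    """Best class id per frame."""
    return [max(range(len(f)), key=f.__getitem__) for f in log_probs]


# Three frames over {blank, 'a', 'b'}; in the second frame blank narrowly beats 'b'.
frames = [
    [-2.30, -0.22, -2.30],
    [-0.80, -3.00, -1.10],
    [-0.10, -3.00, -3.00],
]
print(argmax_ids(frames))                            # [1, 0, 0]
print(argmax_ids(apply_blank_penalty(frames, 0.6)))  # [1, 2, 0]
```

Only the frame where blank and a character were close changes its decision, so a borderline deletion is recovered while confident blanks are left alone.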

3. Neural-LM N-best reranking — models/neural_char_lm/char_transformer_lm.pt

A 4-layer character Transformer LM (d_model 256, 8 heads, val perplexity ≈ 7.37) rescores the 50-best list: acoustic 1.0, ngram_lm 0.15, neural_lm 0.2, length_bonus 0.14.
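
A weighted N-best rescoring of this kind can be sketched as a linear combination of per-hypothesis scores. The weights below are the ones quoted above. The `Hypothesis` container, the toy log-scores, and the choice to measure length in characters are assumptions made for illustration, and the original driver code may normalise features differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    text: str
    acoustic: float   # CTC log-score from the backbone
    ngram_lm: float   # character 5-gram LM log-probability
    neural_lm: float  # neural character LM log-probability


# Rerank weights quoted in the model card for this stage.
WEIGHTS = {"acoustic": 1.0, "ngram_lm": 0.15, "neural_lm": 0.2, "length_bonus": 0.14}


def rerank_score(h: Hypothesis) -> float:
    """Linear combination of the scores plus a per-character length bonus."""
    return (WEIGHTS["acoustic"] * h.acoustic
            + WEIGHTS["ngram_lm"] * h.ngram_lm
            + WEIGHTS["neural_lm"] * h.neural_lm
            + WEIGHTS["length_bonus"] * len(h.text))


def rerank(nbest: List[Hypothesis]) -> Hypothesis:
    """Return the highest-scoring hypothesis of an N-best list."""
    return max(nbest, key=rerank_score)


# Hypothetical 2-best list: the longer candidate is slightly less likely
# under every model, but the length bonus tips the decision towards it.
nbest = [
    Hypothesis("abc", acoustic=-4.00, ngram_lm=-6.0, neural_lm=-5.0),
    Hypothesis("abcd", acoustic=-4.05, ngram_lm=-6.2, neural_lm=-5.1),
]
best = rerank(nbest)
print(best.text, round(rerank_score(best), 2))  # abcd -5.44
```

With the length bonus removed, the shorter hypothesis would win (-5.90 against -6.00), which shows how the bonus counteracts the deletion bias described elsewhere in this card.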

4. Risk-aware learned reranker — models/risk_reranker/risk_reranker.pt (final stage)

A small supervised scorer (feature_dim 81, hidden 64) trained on validation N-best lists with soft edit-risk supervision (risk_weight 0.2, target_temperature 2.0). Selects the final hypothesis → CER 0.1331.

Note: stages 2–4 are documented here with their exact selected configs and the weights are included, but the full beam-decode + N-best + rerank driver code lives in the original project scripts. For most uses the backbone + greedy (or a small beam) is the practical path; reproducing the exact 0.1331 requires the offline decode/rerank pipeline.

Notebooks

Notebook	Contents
`01_preprocessing_and_dataset_audit.ipynb`	Image preprocessing + text normalization policy, manifest checks
`02_ahcd_cnn_baseline.ipynb`	AHCD isolated-character CNN baseline
`03_muharaf_crnn_ctc_baseline.ipynb`	Line-level CRNN/CTC baseline workflow
`04_results_summary.ipynb`	Final metric table + report figures
`05_final_model_inference.ipynb`	Load the published backbone and run OCR on demo line images

Experiment ladder

Stage	Method	Test CER	Test WER
CRNN baseline	CNN/RNN/CTC baseline	0.7060	1.0215
TrOCR transfer	Best retained TrOCR run	~0.7752	~0.9913
V1 CNN-Transformer-CTC	New backbone, greedy	0.2732	0.7603
V2 aug + long	Light aug, longer training, beam	0.1753	0.5621
Calibrated CTC decoder	Blank/length calibration (V5/V6)	0.1484	0.4392
Neural-LM rerank (V7)	Neural char-LM N-best reranking	0.1379–0.1455	0.408–0.423
Learned reranker (V8)	Supervised N-best selection	0.1395–0.1403	0.403–0.405
Final (V13 + risk reranker)	Attention pooling + calibrated decode + risk-aware rerank	0.1331	0.3921
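
For readers checking the table, CER and WER are edit-distance rates: the Levenshtein distance between reference and prediction divided by the reference length, in characters or words respectively. The sketch below is a minimal per-line implementation. How the project aggregates over the test set (summed edits over summed lengths, or a mean of per-line rates) is not stated here, and the function names are illustrative rather than those in ocr_helpers.py.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance with unit-cost substitution, insertion, deletion."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one line."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate of one line (whitespace tokenisation)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


print(cer("kitten", "sitting"))  # 0.5
print(wer("a b", "x y z"))       # 1.5
```

Because insertions count as errors but the denominator is the reference length, a rate can exceed 1.0, which is how the CRNN baseline reaches a WER of 1.0215.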

Training details (backbone)

Data: Muharaf line images. 22,091 train / 1,069 val / 1,334 test lines.
Input: grayscale, height 96, aspect-preserving resize, white-padded in batches.
Text: crnn_baseline normalization (NFC, tatweel removed, spaces collapsed); 165-character vocabulary, zero OOV on val/test.
Optimisation: the final checkpoint warm-starts from the V6 anti-deletion fine-tune, 20 epochs, AdamW, OneCycleLR, lr 5e-5, weight decay 1e-4, grad clip 5.0, effective batch size 8 (batch 4 × accum 2), fp16 mixed precision.
Loss: CTC (zero_infinity=True), blank id 0, train-time blank-logit penalty 0.2.

Full per-epoch history is in models/ctc_backbone/training_metrics.json.
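
The text normalisation listed above (NFC, tatweel removal, space collapsing) can be expressed in a few lines of standard-library Python. This is a sketch of the described policy rather than the project's own function: the order of the steps and the trimming of leading and trailing whitespace are assumptions, and `normalize_text` is a hypothetical name.

```python
import re
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida), a purely typographic elongation


def normalize_text(text: str) -> str:
    """Apply NFC, drop tatweel characters, and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()


# "salam" written with a tatweel after the first letter, followed by
# several spaces and a second word with a trailing space.
raw = "\u0633\u0640\u0644\u0627\u0645   \u0639\u0644\u064a\u0643\u0645 "
print(normalize_text(raw))
```

Applying the same function to both references and predictions before scoring keeps CER and WER from being inflated by purely typographic differences.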

Intended use & limitations

Intended use: research and educational OCR of Muharaf-style handwritten Arabic lines. Best on single, reasonably-segmented text lines similar to the training distribution.
Out of scope: printed Arabic, full-page layout/segmentation, other scripts, diacritized religious text, and modern documents far from the Muharaf domain. The model recognises a pre-segmented line; it does not detect lines.
Known weakness — under-generation: the dominant error is deletion. Predictions average ~46.7 chars vs ~49.0 reference chars; the model omits characters in dense/difficult handwriting. The anti-deletion decoder and reranker reduce but do not eliminate this.
Headroom: the N-best oracle reaches CER 0.110 vs the selected 0.133, so better candidate selection (not just better vision) is the most promising next step.
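
The oracle figure is the CER obtained if, for every line, the hypothesis closest to the reference were picked from the N-best list. A minimal sketch of that comparison follows. It assumes the selected hypothesis is stored first in each list and that CER is aggregated as summed edits over summed reference characters. Both are assumptions, as are the function names and the toy data.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def oracle_and_selected_cer(
        lines: List[Tuple[str, List[str]]]) -> Tuple[float, float]:
    """`lines` holds (reference, n_best) pairs; n_best[0] is the selected output."""
    ref_chars = sum(len(ref) for ref, _ in lines)
    selected = sum(edit_distance(ref, nbest[0]) for ref, nbest in lines)
    oracle = sum(min(edit_distance(ref, h) for h in nbest)
                 for ref, nbest in lines)
    return oracle / ref_chars, selected / ref_chars


# Toy data: in the first line the correct string is in the list but ranked second.
toy = [
    ("abcd", ["abd", "abcd"]),
    ("xyz", ["xyz", "xy"]),
]
oracle_cer, selected_cer = oracle_and_selected_cer(toy)
print(oracle_cer, round(selected_cer, 4))  # 0.0 0.1429
```

The gap between the two numbers is error that a better reranker could remove without touching the visual model, which is the sense in which candidate selection is described as the most promising next step.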

Dataset & license

The model is trained on the Muharaf handwritten Arabic dataset, which is governed by its own license and terms — please consult the dataset providers before commercial or redistributive use. The code and model weights in this repository are released under the MIT license (see YAML header); adjust if your dataset terms require otherwise.

Citation

@misc{muharaf_arabic_ocr_2026,
  title  = {Muharaf Arabic Handwriting OCR: a character-level CNN-Transformer-CTC pipeline},
  author = {Shihadeh, Aiden},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/sdkv2/muharaf-arabic-ocr}}
}
