Muharaf Arabic Handwriting OCR: CNN-Transformer-CTC
Line-level handwritten Arabic OCR for the Muharaf corpus. The recogniser is a custom character-level CNN-Transformer-CTC model with attention-based height pooling, paired with a calibrated CTC decoder and a learned N-best reranker.
Headline result (test set, 1,334 lines): CER 0.1331 · WER 0.3921 with the full pipeline (backbone + calibrated decode + neural-LM rerank + risk-aware reranker). The published backbone alone with plain greedy CTC decoding scores CER ~0.170.
This is an ~81% relative CER reduction over the conventional CRNN baseline (CER 0.706) and a large improvement over every TrOCR transfer-learning attempt (best TrOCR CER ~0.78). The main finding of the project: for this Arabic handwriting task, direct character-level CTC alignment beats subword/token encoder-decoder generation (TrOCR).
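The relative reduction can be checked directly from the CERs reported in this card. The snippet below is a small arithmetic sketch; the helper name `relative_reduction` is illustrative and not part of the repository.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


crnn_cer = 0.7060    # CRNN baseline, test CER
final_cer = 0.1331   # full pipeline, test CER
greedy_cer = 0.170   # published backbone with greedy decoding (approximate)

print(f"full pipeline vs CRNN:   {relative_reduction(crnn_cer, final_cer):.1%}")   # 81.1%
print(f"greedy backbone vs CRNN: {relative_reduction(crnn_cer, greedy_cer):.1%}")  # 75.9%
```

Most of the gain over the CRNN baseline therefore comes from the backbone itself; the decode and rerank stages add the remainder.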
Quickstart (greedy inference)
```bash
pip install torch pillow numpy huggingface_hub
```

```python
from huggingface_hub import snapshot_download
import sys
from PIL import Image

# Download the repo (weights + helper code)
local = snapshot_download("sdkv2/muharaf-arabic-ocr")
sys.path.insert(0, local)
from submission_code.torch_models import load_ctc_backbone, recognise_line

model, charset, cfg = load_ctc_backbone(f"{local}/models/ctc_backbone/model.pt")
image = Image.open(f"{local}/data/demo_samples/sample_02_test-00000-of-00001_1230.png")
print(recognise_line(model, charset, image))
# -> prints the recognised Arabic transcription of the line
```
A full walkthrough is in notebooks/05_final_model_inference.ipynb.
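For readers unfamiliar with greedy (best-path) CTC decoding, here is a minimal pure-Python sketch of the idea: take the argmax class per frame, merge consecutive repeats, then drop blanks. It is not the repository's `recognise_line` implementation. The toy charset, and the assumption that `charset[i]` is the character for class `i` with a placeholder at the blank index, are illustrative.

```python
from typing import List, Sequence

BLANK_ID = 0  # the model card places the CTC blank at index 0


def greedy_ctc_decode(frame_logits: Sequence[Sequence[float]],
                      charset: Sequence[str]) -> str:
    """Best-path decode: argmax per frame, merge repeats, drop blanks."""
    best_ids = [max(range(len(frame)), key=frame.__getitem__)
                for frame in frame_logits]
    chars: List[str] = []
    previous = BLANK_ID
    for idx in best_ids:
        # Emit a character only when it is not blank and differs from the
        # previous frame's class (repeats within a run are merged).
        if idx != BLANK_ID and idx != previous:
            chars.append(charset[idx])
        previous = idx
    return "".join(chars)


# Toy example: 3 classes (blank, "a", "b"), 6 frames.
toy_charset = ["<blank>", "a", "b"]
toy_logits = [
    [0.1, 2.0, 0.0],  # a
    [0.1, 1.5, 0.0],  # a (repeat, merged)
    [3.0, 0.2, 0.1],  # blank (separates the two a's)
    [0.0, 2.5, 0.3],  # a
    [0.0, 0.1, 1.9],  # b
    [0.2, 0.1, 2.2],  # b (repeat, merged)
]
print(greedy_ctc_decode(toy_logits, toy_charset))  # aab
```

The blank frame between the two runs of `a` is what lets CTC emit a doubled character. Without it, the repeats would collapse to a single `a`.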
Repository structure
config.json # root model config (also the Hub download-count query file)
models/
ctc_backbone/ # the visual recogniser (this is the model you load for inference)
model.pt # torch checkpoint: {model_state, charset, model_config}
config.json # human/machine-readable architecture + metrics
training_metrics.json # full training history for this checkpoint
char_ngram_lm/
char_lm.json # character 5-gram LM used by the calibrated beam decoder
neural_char_lm/
char_transformer_lm.pt # neural character-level Transformer LM for N-best reranking
lm_metrics.json
rerank_summary.json # selected rerank weights + test metrics (V7 stage)
risk_reranker/
risk_reranker.pt # supervised risk-aware N-best reranker (final stage)
summary.json # selected config + final test/oracle metrics (V9 stage)
decode_config/
v5_antideletion_summary.json # calibrated-decode hyperparameters (blank penalty etc.)
submission_code/ # small, dependency-light helper package
torch_models.py # CNNTransformerCTC architecture + load/inference helpers
ocr_helpers.py # preprocessing, charset, greedy CTC decode, CER/WER
notebooks/ # Jupyter notebooks (training + inference)
data/
demo_samples/ # example line images + references for the inference demo
results/ # final metric table + report figures
manifests/ # CSV manifests (train/val/test splits) for reproduction
The final system
The headline number is produced by a four-stage pipeline. Only the backbone is needed for a usable demo; the remaining stages are decoding/reranking refinements that do not retrain the visual model.
1. Visual backbone: models/ctc_backbone/model.pt (4.25M params)
- Grayscale line image, resized to height 96 (aspect preserved), pixels normalised to [0, 1].
- CNN frontend (4 conv blocks, BatchNorm + GELU, two 2× max-pools → 4× width stride).
- Attention-based height pooling collapses the vertical axis into a width-wise sequence.
- Linear projection → sinusoidal positional encoding → 4-layer Transformer encoder (d_model 256, 8 heads, FFN 1024, pre-norm).
- Linear CTC classifier over a 166-class vocabulary (165 characters + CTC blank at index 0).
- Trained with a small train-time blank-logit penalty (0.2) to counter CTC's deletion bias.
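The attention-based height pooling step can be illustrated with a small pure-Python sketch. It assumes the simplest possible form: one scoring vector (`score_w`, a hypothetical name) produces a score per row, a softmax over the rows of each column turns those scores into weights, and the column's feature vectors are averaged with those weights. The real module in `torch_models.py` may be parameterised differently.

```python
import math
from typing import List

Feature = List[float]


def attention_height_pool(fmap: List[List[Feature]], score_w: Feature) -> List[Feature]:
    """Collapse an H x W x C feature map into a W x C sequence.

    For each column, a softmax over the H rows (score = feature . score_w)
    gives the weights of a weighted sum over that column's feature vectors.
    """
    height, width = len(fmap), len(fmap[0])
    channels = len(score_w)
    pooled: List[Feature] = []
    for x in range(width):
        scores = [sum(f * w for f, w in zip(fmap[y][x], score_w))
                  for y in range(height)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        pooled.append([
            sum(weights[y] * fmap[y][x][c] for y in range(height))
            for c in range(channels)
        ])
    return pooled


# Toy feature map: H=2 rows, W=1 column, C=2 channels.
toy_map = [[[1.0, 0.0]], [[3.0, 2.0]]]
print(attention_height_pool(toy_map, [0.0, 0.0]))   # uniform weights: [[2.0, 1.0]]
print(attention_height_pool(toy_map, [10.0, 0.0]))  # weight concentrates on the second row
```

With a zero scoring vector the operation reduces to mean pooling over height. A learned scoring vector lets each column focus on the rows that actually contain ink.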
2. Calibrated CTC decoder: models/char_ngram_lm/char_lm.json + decode_config/
Beam search (width 25, top-k 40) with a character 5-gram LM and anti-deletion calibration:
lm_weight 0.2, length_bonus 0.1, blank_penalty 0.6, score_mode sum. Backbone + this decoder → CER 0.1416.
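The beam-search driver is not part of the published helper package, so the following is only a sketch of how the listed hyperparameters might enter the score. It assumes that `blank_penalty` is subtracted from the blank log-probability at every frame (without renormalising), and that `score_mode sum` means the CTC, LM, and length terms are added without length normalisation. Both are assumptions, and the function names are illustrative.

```python
import math
from typing import List

LM_WEIGHT = 0.2
LENGTH_BONUS = 0.1
BLANK_PENALTY = 0.6
BLANK_ID = 0


def penalise_blank(frame_log_probs: List[float],
                   penalty: float = BLANK_PENALTY) -> List[float]:
    """Lower the blank score for one frame; other classes are untouched."""
    adjusted = list(frame_log_probs)
    adjusted[BLANK_ID] -= penalty
    return adjusted


def hypothesis_score(ctc_log_prob: float, lm_log_prob: float, text: str) -> float:
    """Summed (not length-normalised) score for ranking beam hypotheses."""
    return ctc_log_prob + LM_WEIGHT * lm_log_prob + LENGTH_BONUS * len(text)


# A frame where blank narrowly beats a character flips after the penalty.
frame = [math.log(0.55), math.log(0.45)]
adjusted = penalise_blank(frame)
print(max(range(2), key=frame.__getitem__),
      max(range(2), key=adjusted.__getitem__))  # 0 1

# The length bonus can favour a slightly less likely but longer hypothesis.
short = hypothesis_score(ctc_log_prob=-1.00, lm_log_prob=-4.0, text="ab")
longer = hypothesis_score(ctc_log_prob=-1.05, lm_log_prob=-4.0, text="abc")
print(longer > short)  # True
```

Both mechanisms push in the same direction: they make emitting a character cheaper relative to skipping it, which is the intent of anti-deletion calibration.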
3. Neural-LM N-best reranking: models/neural_char_lm/char_transformer_lm.pt
A 4-layer character Transformer LM (d_model 256, 8 heads, val perplexity ≈ 7.37) rescores the 50-best list:
acoustic 1.0, ngram_lm 0.15, neural_lm 0.2, length_bonus 0.14.
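A weighted rescoring of this kind can be sketched as a linear combination of per-candidate scores. The example below assumes all scores are log-domain values and that the length bonus is applied per character; the `Candidate` class and the toy numbers are illustrative, not taken from the project scripts.

```python
from dataclasses import dataclass
from typing import List

WEIGHTS = {"acoustic": 1.0, "ngram_lm": 0.15, "neural_lm": 0.2, "length_bonus": 0.14}


@dataclass
class Candidate:
    text: str
    acoustic: float   # CTC log-score from the beam decoder
    ngram_lm: float   # character 5-gram LM log-probability
    neural_lm: float  # neural character LM log-probability


def rerank_score(c: Candidate) -> float:
    """Weighted sum of the three model scores plus a per-character bonus."""
    return (WEIGHTS["acoustic"] * c.acoustic
            + WEIGHTS["ngram_lm"] * c.ngram_lm
            + WEIGHTS["neural_lm"] * c.neural_lm
            + WEIGHTS["length_bonus"] * len(c.text))


def rerank(nbest: List[Candidate]) -> Candidate:
    """Return the highest-scoring candidate from an N-best list."""
    return max(nbest, key=rerank_score)


nbest = [
    Candidate("abc", acoustic=-2.0, ngram_lm=-6.0, neural_lm=-5.0),   # score -3.48
    Candidate("abcd", acoustic=-2.2, ngram_lm=-5.0, neural_lm=-3.5),  # score -3.09
]
print(rerank(nbest).text)  # abcd
```

In the toy list the second candidate wins despite a lower acoustic score, because both language models and the length bonus prefer it.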
4. Risk-aware learned reranker: models/risk_reranker/risk_reranker.pt (final stage)
A small supervised scorer (feature_dim 81, hidden 64) trained on validation N-best lists with soft
edit-risk supervision (risk_weight 0.2, target_temperature 2.0). Selects the final hypothesis → CER 0.1331.
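The card does not spell out the training objective, so the following is one plausible reading of "soft edit-risk supervision", offered only as a sketch. It assumes the soft targets are a softmax over negative edit distances divided by `target_temperature`, and that `risk_weight` scales an expected-edit-distance term added to the cross-entropy. The actual loss in the project scripts may differ.

```python
import math
from typing import List

TARGET_TEMPERATURE = 2.0
RISK_WEIGHT = 0.2


def softmax(values: List[float]) -> List[float]:
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(edit_distances: List[int],
                 temperature: float = TARGET_TEMPERATURE) -> List[float]:
    """Lower edit distance to the reference -> higher target probability."""
    return softmax([-d / temperature for d in edit_distances])


def risk_aware_loss(scores: List[float], edit_distances: List[int]) -> float:
    """Cross-entropy to the soft targets plus weighted expected edit distance."""
    probs = softmax(scores)
    targets = soft_targets(edit_distances)
    cross_entropy = -sum(t * math.log(p) for t, p in zip(targets, probs))
    expected_risk = sum(p * d for p, d in zip(probs, edit_distances))
    return cross_entropy + RISK_WEIGHT * expected_risk


# Three candidates for one line, with edit distances 0, 2 and 6 to the reference.
distances = [0, 2, 6]
good = risk_aware_loss([3.0, 0.0, -3.0], distances)  # scorer prefers the best candidate
bad = risk_aware_loss([-3.0, 0.0, 3.0], distances)   # scorer prefers the worst candidate
print(good < bad)  # True
```

Soft targets give partial credit to near-miss candidates, which matters when several hypotheses in the N-best list differ from the reference by only a character or two.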
Note: stages 2–4 are documented here with their exact selected configs and the weights are included, but the full beam-decode + N-best + rerank driver code lives in the original project scripts. For most uses the backbone + greedy (or a small beam) is the practical path; reproducing the exact 0.1331 requires the offline decode/rerank pipeline.
Notebooks
| Notebook | Contents |
|---|---|
| 01_preprocessing_and_dataset_audit.ipynb | Image preprocessing + text normalization policy, manifest checks |
| 02_ahcd_cnn_baseline.ipynb | AHCD isolated-character CNN baseline |
| 03_muharaf_crnn_ctc_baseline.ipynb | Line-level CRNN/CTC baseline workflow |
| 04_results_summary.ipynb | Final metric table + report figures |
| 05_final_model_inference.ipynb | Load the published backbone and run OCR on demo line images |
Experiment ladder
| Stage | Method | Test CER | Test WER |
|---|---|---|---|
| CRNN baseline | CNN/RNN/CTC baseline | 0.7060 | 1.0215 |
| TrOCR transfer | Best retained TrOCR run | ~0.7752 | ~0.9913 |
| V1 CNN-Transformer-CTC | New backbone, greedy | 0.2732 | 0.7603 |
| V2 aug + long | Light aug, longer training, beam | 0.1753 | 0.5621 |
| Calibrated CTC decoder | Blank/length calibration (V5/V6) | 0.1484 | 0.4392 |
| Neural-LM rerank (V7) | Neural char-LM N-best reranking | 0.1379–0.1455 | 0.408–0.423 |
| Learned reranker (V8) | Supervised N-best selection | 0.1395–0.1403 | 0.403–0.405 |
| Final (V13 + risk reranker) | Attention pooling + calibrated decode + risk-aware rerank | 0.1331 | 0.3921 |
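For reference, CER and WER in the table are edit-distance rates over characters and words respectively. The sketch below shows the standard definitions in pure Python. It is not the `ocr_helpers.py` implementation, and whether the repository aggregates per line or over the whole corpus is not stated here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits / reference characters."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits / reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


print(cer("kitten", "sitting"))  # 0.5
print(wer("a b", "x y z"))       # 1.5
```

Because insertions count as errors, the number of edits can exceed the reference length. That is how a WER above 1.0, such as the CRNN baseline's 1.0215, is possible.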
Training details (backbone)
- Data: Muharaf line images. 22,091 train / 1,069 val / 1,334 test lines.
- Input: grayscale, height 96, aspect-preserving resize, white-padded in batches.
- Text: crnn_baseline normalization (NFC, tatweel removed, spaces collapsed); 165-character vocabulary, zero OOV on val/test.
- Optimisation: the final checkpoint warm-starts from the V6 anti-deletion fine-tune, 20 epochs, AdamW, OneCycleLR, lr 5e-5, weight decay 1e-4, grad clip 5.0, effective batch size 8 (batch 4 × accum 2), fp16 mixed precision.
- Loss: CTC (zero_infinity=True), blank id 0, train-time blank-logit penalty 0.2.
Full per-epoch history is in models/ctc_backbone/training_metrics.json.
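The input handling listed above (height 96, aspect-preserving resize, white padding within a batch) together with the backbone's 4× width stride determines how many CTC frames each line produces. The sketch below works through that geometry. The rounding rule, right-side padding, and floor division for the frame count are assumptions, not details confirmed by the repository.

```python
from typing import List, Tuple

TARGET_HEIGHT = 96
WIDTH_STRIDE = 4  # two 2x max-pools along the width


def resized_width(orig_w: int, orig_h: int, target_h: int = TARGET_HEIGHT) -> int:
    """Width after scaling the line image to the target height."""
    return max(1, round(orig_w * target_h / orig_h))


def batch_geometry(sizes: List[Tuple[int, int]]) -> Tuple[int, List[int], List[int]]:
    """Return (padded batch width, white padding per image, CTC frames per image).

    `sizes` holds (width, height) of each original line image.
    """
    widths = [resized_width(w, h) for w, h in sizes]
    batch_w = max(widths)
    pads = [batch_w - w for w in widths]          # assumed: pad on one side only
    frames = [w // WIDTH_STRIDE for w in widths]  # assumed: floor division
    return batch_w, pads, frames


print(batch_geometry([(1200, 120), (800, 64)]))
# (1200, [240, 0], [240, 300])
```

A line resized to 960 pixels wide yields about 240 frames, comfortably more than the ~49 characters of an average reference line, which CTC needs in order to align every character.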
Intended use & limitations
- Intended use: research and educational OCR of Muharaf-style handwritten Arabic lines. Best on single, reasonably-segmented text lines similar to the training distribution.
- Out of scope: printed Arabic, full-page layout/segmentation, other scripts, diacritized religious text, and modern documents far from the Muharaf domain. The model recognises a pre-segmented line; it does not detect lines.
- Known weakness (under-generation): the dominant error is deletion. Predictions average ~46.7 chars vs ~49.0 reference chars; the model omits characters in dense/difficult handwriting. The anti-deletion decoder and reranker reduce but do not eliminate this.
- Headroom: the N-best oracle reaches CER 0.110 vs the selected 0.133, so better candidate selection (not just better vision) is the most promising next step.
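The oracle figure in the headroom bullet is the CER obtained if, for every line, the best candidate in the N-best list were always chosen. A minimal sketch of that comparison follows, assuming corpus-level CER (total edits over total reference characters). The data layout and function names are illustrative.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        new_row = [i]
        for j, h in enumerate(hyp, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (r != h)))
        row = new_row
    return row[-1]


def selected_and_oracle_cer(
    lines: List[Tuple[str, List[str], int]]
) -> Tuple[float, float]:
    """Each entry is (reference, n_best_list, index_chosen_by_the_reranker)."""
    total_chars = sum(len(ref) for ref, _, _ in lines)
    selected = sum(edit_distance(ref, nbest[i]) for ref, nbest, i in lines)
    oracle = sum(min(edit_distance(ref, h) for h in nbest) for ref, nbest, _ in lines)
    return selected / total_chars, oracle / total_chars


toy = [
    ("abcd", ["abd", "abcd"], 0),  # reranker picked a candidate with one deletion
    ("xyz", ["xyz", "xy"], 0),     # reranker picked the correct candidate
]
sel, orc = selected_and_oracle_cer(toy)
print(round(sel, 4), orc)  # 0.1429 0.0
```

With the reported numbers, closing the gap between 0.133 and 0.110 would be a further relative CER reduction of about 17% without touching the visual model.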
Dataset & license
The model is trained on the Muharaf handwritten Arabic dataset, which is governed by its own license and terms; please consult the dataset providers before commercial or redistributive use. The code and model weights in this repository are released under the MIT license (see YAML header); adjust if your dataset terms require otherwise.
Citation
@misc{muharaf_arabic_ocr_2026,
title = {Muharaf Arabic Handwriting OCR: a character-level CNN-Transformer-CTC pipeline},
author = {Shihadeh, Aiden},
year = {2026},
howpublished = {\url{https://huggingface.co/sdkv2/muharaf-arabic-ocr}}
}



