WordDetector: Word-Level Bounding Box Detection for Handwritten Text

A word-detection model that locates individual handwritten words in document images. It produces axis-aligned bounding boxes only, with no transcription or labels. Part of the Xournal++ HTR project.

Model details

| Property | Value |
|---|---|
| Architecture | Modified ResNet-18 encoder + U-Net-style decoder |
| Input | Grayscale image, resized to 448×448 |
| Output | 7 feature maps at 224×224 (segmentation + geometry) |
| Format | ONNX (softmax baked in, opset 17) |
| Parameters | ~11.2M |
| Training data | IAM Handwriting Database |
| Best validation F1 | 0.88 (lr=0.001, bs=16, 200 epochs) |
| License | MIT |

Usage

from xournalpp_htr.inference_models import WordDetectorModel

model = WordDetectorModel.from_pretrained()
boxes = model.detect(grayscale_image)  # list[BoundingBox]

Each BoundingBox has x_min, y_min, x_max, y_max in the original image's pixel coordinates.
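
Since the boxes are expressed in the original image's pixel coordinates, they can be used directly to cut out word regions. The sketch below assumes the image is a NumPy array indexed as [row, column] and uses a stand-in BoundingBox dataclass with the four fields named above; it is not the library's own class, and whether x_max and y_max are inclusive or exclusive is an assumption (treated as exclusive here).

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BoundingBox:
    """Stand-in with the four documented fields (not the library's class)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def crop_words(image: np.ndarray, boxes: list) -> list:
    """Return one sub-image per box, clamped to the image bounds."""
    height, width = image.shape[:2]
    crops = []
    for box in boxes:
        # Round outward so thin strokes on the box border are not cut off.
        x0 = max(0, int(np.floor(box.x_min)))
        y0 = max(0, int(np.floor(box.y_min)))
        x1 = min(width, int(np.ceil(box.x_max)))
        y1 = min(height, int(np.ceil(box.y_max)))
        # Skip boxes that fall entirely outside the image.
        if x1 > x0 and y1 > y0:
            crops.append(image[y0:y1, x0:x1])
    return crops
```

With the real model, the list returned by model.detect(...) would presumably take the place of the hand-built boxes.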

Install with pip install xournalpp-htr. This pulls in onnxruntime and huggingface-hub; PyTorch is not needed.

How it works

The model outputs 7 maps per image:

  • Segmentation (3 channels): word / surrounding margin / background (softmax classification)
  • Geometry (4 channels): per-pixel distance to the top, bottom, left, and right edges of the enclosing word bounding box

Post-processing decodes these maps into bounding boxes via connected-component analysis and DBSCAN clustering.
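
To make the decoding step concrete, here is a simplified sketch rather than the project's actual post-processing. It assumes channel 0 of the segmentation output is the "word" class and that the geometry channels are ordered top, bottom, left, right. Each word pixel votes for a box, pixels are grouped by 4-connected components, and each component's box is the median vote. The DBSCAN clustering stage mentioned above is left out to keep the example dependency-free.

```python
from collections import deque

import numpy as np


def decode_boxes(seg: np.ndarray, geo: np.ndarray) -> list:
    """Decode boxes from segmentation and geometry maps.

    seg: (3, H, W) class probabilities; channel 0 is assumed to be 'word'.
    geo: (4, H, W) distances to the top, bottom, left and right box edges.
    Returns (x_min, y_min, x_max, y_max) tuples in map coordinates.
    """
    word_mask = seg.argmax(axis=0) == 0
    height, width = word_mask.shape
    ys, xs = np.mgrid[0:height, 0:width]
    top, bottom, left, right = geo
    # Every pixel votes for the box it believes it lies in.
    votes = np.stack([xs - left, ys - top, xs + right, ys + bottom], axis=-1)

    visited = np.zeros_like(word_mask, dtype=bool)
    boxes = []
    for sy, sx in zip(*np.nonzero(word_mask)):
        if visited[sy, sx]:
            continue
        # Breadth-first flood fill over 4-connected word pixels.
        queue = deque([(sy, sx)])
        visited[sy, sx] = True
        component = []
        while queue:
            y, x = queue.popleft()
            component.append(votes[y, x])
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and word_mask[ny, nx] and not visited[ny, nx]:
                    visited[ny, nx] = True
                    queue.append((ny, nx))
        # The median is robust to a few badly regressed pixels.
        median_box = np.median(np.array(component), axis=0)
        boxes.append(tuple(float(v) for v in median_box))
    return boxes
```

One reason a clustering step is useful in practice is that touching words can merge into a single connected component; grouping pixels by the boxes they vote for can separate them again, which this sketch does not attempt.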

Training

Trained on the IAM Handwriting Database with an 80/20 random split. The best model was selected via a hyperparameter grid search over learning rates (0.0005, 0.001, 0.002) and batch sizes (16, 32, 64, 128) with early stopping (patience=50).
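
The search space described above is small enough to enumerate explicitly: 3 learning rates times 4 batch sizes gives 12 runs. The sketch below lists the grid and shows one common form of early stopping. The helper name is hypothetical, and monitoring validation F1 is an assumption, since the monitored metric is not stated here.

```python
from itertools import product

LEARNING_RATES = (0.0005, 0.001, 0.002)
BATCH_SIZES = (16, 32, 64, 128)
PATIENCE = 50


def best_epoch_with_early_stopping(val_scores, patience=PATIENCE):
    """Return (best_epoch, best_score) for a sequence of validation scores.

    Iteration stops once `patience` epochs have passed without improvement.
    Higher scores are assumed to be better (e.g. validation F1).
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score


# Every (learning rate, batch size) combination in the grid.
grid = list(product(LEARNING_RATES, BATCH_SIZES))
```

In a real run, each grid entry would train one model and the entry with the highest best score would be kept.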

| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 0.001 |
| Batch size | 16 |
| Max epochs | 200 |
| Loss | Cross-entropy (segmentation) + IoU (geometry) |
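
For readers unfamiliar with an IoU loss on distance maps, the sketch below shows one common formulation, the negative log of the IoU between predicted and target boxes at each word pixel. The exact form used in training is not specified here, so the -log(IoU) choice, the function name and the (4, N) layout are all assumptions.

```python
import numpy as np


def iou_geometry_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Mean -log(IoU) over N word pixels.

    pred, target: (4, N) non-negative distances to the
    top, bottom, left and right edges of the enclosing box.
    """
    pt, pb, pl, pr = pred
    tt, tb, tl, tr = target
    area_pred = (pt + pb) * (pl + pr)
    area_target = (tt + tb) * (tl + tr)
    # Both boxes contain the pixel, so along each direction the overlap
    # extends as far as the smaller of the two distances.
    inter_h = np.minimum(pt, tt) + np.minimum(pb, tb)
    inter_w = np.minimum(pl, tl) + np.minimum(pr, tr)
    intersection = inter_h * inter_w
    union = area_pred + area_target - intersection
    iou = (intersection + eps) / (union + eps)
    return float(np.mean(-np.log(iou)))
```

A perfect prediction gives an IoU of 1 and a loss of 0; the loss grows as the predicted box shrinks away from or overshoots the target.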

Full training instructions: README.

Intended use

This model is the detection stage in a handwriting recognition pipeline. It is designed to run on personal devices (laptops, edge) via ONNX Runtime, with no GPU required for inference. A separate transcription model (not yet available) would read the detected word regions.

Limitations

  • Detection only: no text transcription.
  • Grayscale input required.
  • Fixed 448×448 resize may distort aspect ratio on non-square images.
  • No training-time data augmentation (planned improvement).
  • Validated on IAM-style handwriting; performance on other styles (e.g. historical documents) may vary.
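
To illustrate the fixed-resize limitation: if the image is stretched to a square without padding, mapping a box from the 224×224 output maps back to the original image needs separate horizontal and vertical scale factors. The sketch below assumes a plain stretch; the function name is hypothetical and the library's own mapping may differ.

```python
def map_box_to_original(box, original_width, original_height, map_size=224):
    """Scale an (x_min, y_min, x_max, y_max) box from map coordinates
    to original-image pixel coordinates, assuming a plain stretch."""
    scale_x = original_width / map_size
    scale_y = original_height / map_size
    x_min, y_min, x_max, y_max = box
    return (x_min * scale_x, y_min * scale_y, x_max * scale_x, y_max * scale_y)
```

For a 2240×1120 page the two factors are 10 and 5, so a box that is square on the map covers a region twice as wide as it is tall on the page. Under that stretch, strokes appear horizontally compressed to the model.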

Citation

The architecture is based on WordDetectorNN by Harald Scheidl.
