WordDetector — Word-Level Bounding Box Detection for Handwritten Text

A word-detection model that locates individual handwritten words in document images. It outputs axis-aligned bounding boxes only, with no transcription or labels. It is part of the Xournal++ HTR project.

Model details

Property	Value
Architecture	Modified ResNet-18 encoder + U-Net-style decoder
Input	Grayscale image, resized to 448×448
Output	7 feature maps at 224×224 (segmentation + geometry)
Format	ONNX (softmax baked in, opset 17)
Parameters	~11.2M
Training data	IAM Handwriting Database
Best val F1	0.88 (lr=0.001, bs=16, 200 epochs)
License	MIT

Usage

from xournalpp_htr.inference_models import WordDetectorModel

model = WordDetectorModel.from_pretrained()
boxes = model.detect(grayscale_image)  # list[BoundingBox]

Each BoundingBox has x_min, y_min, x_max, y_max in the original image's pixel coordinates.

Install with pip install xournalpp-htr. This pulls in onnxruntime and huggingface-hub; PyTorch is not required.

How it works

The model outputs 7 maps per image:

Segmentation (3 channels): word / surrounding margin / background (softmax classification)
Geometry (4 channels): per-pixel distance to the top, bottom, left, and right edges of the enclosing word bounding box
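
To make the geometry channels concrete, the sketch below shows how one word pixel and its four edge distances imply a candidate box. The Box type, the function names, and the (top, bottom, left, right) tuple order are illustrative assumptions, not the package's actual internals or the ONNX channel order.

```python
from typing import NamedTuple


class Box(NamedTuple):
    # Hypothetical helper type, not the library's BoundingBox.
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def decode_pixel_box(x, y, top, bottom, left, right):
    """Box implied by one pixel at (x, y) and its distances to the four box edges."""
    return Box(x - left, y - top, x + right, y + bottom)


def decode_word_pixels(word_mask, geometry):
    """Collect one candidate box per pixel classified as 'word'.

    word_mask: 2D list of bools, indexed [y][x].
    geometry:  2D list of (top, bottom, left, right) tuples, same shape (assumed order).
    """
    boxes = []
    for y, row in enumerate(word_mask):
        for x, is_word in enumerate(row):
            if is_word:
                boxes.append(decode_pixel_box(x, y, *geometry[y][x]))
    return boxes


# Toy 2x3 map: three word pixels that all belong to the same word.
word_mask = [[False, True, True],
             [False, True, False]]
geometry = [[(0, 0, 0, 0), (0, 1, 0, 1), (0, 1, 1, 0)],
            [(0, 0, 0, 0), (1, 0, 0, 1), (0, 0, 0, 0)]]
candidates = decode_word_pixels(word_mask, geometry)
```

In this toy input every word pixel votes for the same box, which is why a later clustering step can collapse the many per-pixel candidates into one box per word.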

Post-processing decodes these maps into bounding boxes via connected-component analysis and DBSCAN clustering.
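
The clustering idea can be sketched without external dependencies. The code below is a simplified stand-in, not the model's actual post-processing: it links candidate boxes whose IoU exceeds a threshold, takes connected groups of linked boxes, and returns the per-coordinate median of each group. The function names and the 0.5 threshold are assumptions chosen for illustration; the real pipeline uses DBSCAN, whose parameters are not stated here.

```python
from statistics import median


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def cluster_boxes(boxes, min_iou=0.5):
    """Group boxes linked by IoU >= min_iou; return one median box per group."""
    unvisited = set(range(len(boxes)))
    merged = []
    while unvisited:
        seed = min(unvisited)  # lowest index first, so output order is deterministic
        unvisited.remove(seed)
        stack, group = [seed], [seed]
        while stack:
            i = stack.pop()
            linked = [j for j in unvisited if iou(boxes[i], boxes[j]) >= min_iou]
            for j in linked:
                unvisited.remove(j)
            stack.extend(linked)
            group.extend(linked)
        merged.append(tuple(median(boxes[k][c] for k in group) for c in range(4)))
    return merged


candidates = [
    (10, 10, 50, 30), (11, 9, 51, 31), (9, 11, 49, 29),  # votes for one word
    (80, 10, 120, 30), (81, 10, 121, 30),                # votes for a second word
]
words = cluster_boxes(candidates)
```

Taking the median rather than the mean keeps a single badly regressed pixel from dragging the merged box outward.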

Training

Trained on the IAM Handwriting Database with an 80/20 random split. The best model was selected via a hyperparameter grid search over learning rates (0.0005, 0.001, 0.002) and batch sizes (16, 32, 64, 128) with early stopping (patience=50).
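
As a rough illustration of the search described above, the sketch below enumerates the 3 × 4 grid and applies a patience-based stopping rule to a list of validation scores. The helper name and the choice of a higher-is-better validation score as the monitored metric are assumptions; the actual training script may be organized differently.

```python
from itertools import product

learning_rates = [0.0005, 0.001, 0.002]
batch_sizes = [16, 32, 64, 128]

# Every (learning rate, batch size) pair in the grid: 12 runs in total.
grid = list(product(learning_rates, batch_sizes))


def best_epoch_with_early_stopping(val_scores, patience=50):
    """Return (best_epoch, best_score), stopping once `patience` epochs pass without improvement.

    val_scores: one validation score per epoch, higher is better (assumed metric).
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score


# Small patience so the stopping rule is visible on a short toy curve.
toy_scores = [0.1, 0.3, 0.2, 0.25, 0.9]
stopped = best_epoch_with_early_stopping(toy_scores, patience=2)
```

In the toy run training stops before the final score is ever evaluated, which is the usual trade-off of early stopping: a shorter run at the risk of missing a late improvement.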

Hyperparameter	Value
Optimizer	Adam
Learning rate	0.001
Batch size	16
Max epochs	200
Loss	Cross-entropy (segmentation) + IoU (geometry)
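
One common way to define an IoU loss on distance-style geometry maps is the negative log of the per-pixel IoU between the predicted and target boxes. The sketch below assumes that formulation, a (top, bottom, left, right) ordering, and a small epsilon for numerical stability; the exact loss used for this model may differ.

```python
import math


def geometry_iou_loss(pred, target, eps=1e-6):
    """-log(IoU) for one pixel; pred and target are (top, bottom, left, right) distances."""
    pt, pb, pl, pr = pred
    tt, tb, tl, tr = target
    area_pred = (pt + pb) * (pl + pr)
    area_target = (tt + tb) * (tl + tr)
    # Both boxes contain the pixel, so the overlap extends by the smaller
    # distance in each of the four directions.
    inter = (min(pt, tt) + min(pb, tb)) * (min(pl, tl) + min(pr, tr))
    union = area_pred + area_target - inter
    return -math.log((inter + eps) / (union + eps))


perfect = geometry_iou_loss((5, 5, 10, 10), (5, 5, 10, 10))
half_width = geometry_iou_loss((5, 5, 5, 5), (5, 5, 10, 10))
```

A perfect prediction gives a loss of zero, while a box covering half of the target area gives an IoU of 0.5 and a loss of about ln 2.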

Full training instructions: README.

Intended use

This model is the detection stage in a handwriting recognition pipeline. It is designed to run on personal devices (laptops, edge) via ONNX Runtime — no GPU required for inference. A separate transcription model (not yet available) would read the detected word regions.

Limitations

Detection only — no text transcription.
Grayscale input required.
Fixed 448×448 resize may distort aspect ratio on non-square images.
No training-time data augmentation (planned improvement).
Validated on IAM-style handwriting; performance on other styles (e.g. historical documents) may vary.
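
The aspect-ratio limitation above can be illustrated by mapping a box from the 224×224 output maps back to original pixel coordinates. This is a minimal sketch that assumes a plain stretch resize with no padding; the function name is hypothetical and the package may handle the mapping differently.

```python
def scale_box_to_original(box, original_width, original_height, map_size=224):
    """Map (x_min, y_min, x_max, y_max) from map coordinates to original image pixels."""
    scale_x = original_width / map_size
    scale_y = original_height / map_size
    x_min, y_min, x_max, y_max = box
    return (x_min * scale_x, y_min * scale_y, x_max * scale_x, y_max * scale_y)


# A 2240x1120 page: each map pixel spans 10 original pixels in x but only 5 in y.
restored = scale_box_to_original((10, 20, 30, 40), original_width=2240, original_height=1120)
```

Because the horizontal and vertical scale factors differ for a non-square page, the network sees words squeezed along one axis, which is where the distortion comes from.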

Citation

The architecture is based on WordDetectorNN by Harald Scheidl.
