WordDetector: Word-Level Bounding Box Detection for Handwritten Text
A word-detection model that locates individual handwritten words in document images. It produces axis-aligned bounding boxes only, with no transcription or labels. Part of the Xournal++ HTR project.
Model details
| Property | Value |
|---|---|
| Architecture | Modified ResNet-18 encoder + U-Net-style decoder |
| Input | Grayscale image, resized to 448×448 |
| Output | 7 feature maps at 224×224 (segmentation + geometry) |
| Format | ONNX (softmax baked in, opset 17) |
| Parameters | ~11.2M |
| Training data | IAM Handwriting Database |
| Best validation F1 | 0.88 (learning rate 0.001, batch size 16, 200 epochs) |
| License | MIT |
Usage
from xournalpp_htr.inference_models import WordDetectorModel
model = WordDetectorModel.from_pretrained()
boxes = model.detect(grayscale_image) # list[BoundingBox]
Each BoundingBox has x_min, y_min, x_max, y_max in the original
image's pixel coordinates.
Requires pip install xournalpp-htr, which pulls in onnxruntime and
huggingface-hub; PyTorch is not needed.
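The mapping from the resized model input back to original pixel coordinates can be illustrated with a minimal sketch. It assumes the image is stretched to 448×448 without padding, so the horizontal and vertical scale factors differ for non-square inputs. The function name and the tuple layout are hypothetical and are not part of the xournalpp_htr API.

```python
INPUT_SIZE = 448  # side length of the resized model input, from the table above


def to_original_coords(box, orig_width, orig_height, input_size=INPUT_SIZE):
    """Map (x_min, y_min, x_max, y_max) from resized space to original pixels.

    Assumption: the image was stretched to input_size x input_size
    with no letterboxing or padding.
    """
    sx = orig_width / input_size   # horizontal scale factor
    sy = orig_height / input_size  # vertical scale factor
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


# An 896x448 page: x coordinates are doubled, y coordinates are unchanged.
print(to_original_coords((100, 50, 200, 80), orig_width=896, orig_height=448))
```

Because model.detect already returns boxes in original coordinates, this step is only relevant when working with raw model outputs.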
How it works
The model outputs 7 maps per image:
- Segmentation (3 channels): word / surrounding margin / background (softmax classification)
- Geometry (4 channels): per-pixel distance to the top, bottom, left, and right edges of the enclosing word bounding box
Post-processing decodes these maps into bounding boxes via connected-component analysis and DBSCAN clustering.
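How a geometry-map pixel turns into a box can be sketched as follows. This is an illustrative simplification, not the project's post-processing: each word pixel votes for one candidate box through its four distances, and candidates for the same word are merged with a coordinate-wise median rather than with connected components and DBSCAN. All function names are hypothetical, and using 1 - IoU as a clustering distance is an assumption.

```python
from statistics import median


def decode_pixel(x, y, top, bottom, left, right):
    """Candidate box (x_min, y_min, x_max, y_max) voted for by one pixel."""
    return (x - left, y - top, x + right, y + bottom)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes.

    1 - iou(a, b) is one plausible distance for clustering candidates.
    """
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_candidates(boxes):
    """Coordinate-wise median of candidate boxes assumed to share a word."""
    return tuple(median(coord) for coord in zip(*boxes))


# Three pixels inside the same word; the third has slightly noisy distances.
candidates = [
    decode_pixel(20, 25, top=5, bottom=15, left=10, right=30),
    decode_pixel(30, 30, top=10, bottom=10, left=20, right=20),
    decode_pixel(40, 35, top=14, bottom=5, left=30, right=11),
]
print(merge_candidates(candidates))
```

The median keeps a single noisy vote from shifting the merged box, which is one reason dense per-pixel voting tends to be robust.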
Training
Trained on the IAM Handwriting Database with an 80/20 random split. The best model was selected via a hyperparameter grid search over learning rates (0.0005, 0.001, 0.002) and batch sizes (16, 32, 64, 128) with early stopping (patience=50).
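The search space and the early-stopping rule described above can be sketched as below. The grid values and the patience come from this card; the helper names are hypothetical, and treating the monitored quantity as a validation score where higher is better is an assumption.

```python
from itertools import product

LEARNING_RATES = (0.0005, 0.001, 0.002)
BATCH_SIZES = (16, 32, 64, 128)
PATIENCE = 50


def grid():
    """All (learning_rate, batch_size) pairs in the search space."""
    return list(product(LEARNING_RATES, BATCH_SIZES))


def stopping_epoch(val_scores, patience=PATIENCE):
    """Index of the epoch at which training would stop.

    Training stops once the score has failed to improve for `patience`
    consecutive epochs; otherwise it runs through the last epoch.
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores) - 1


print(len(grid()))  # 3 learning rates x 4 batch sizes
```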
| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 0.001 |
| Batch size | 16 |
| Max epochs | 200 |
| Loss | Cross-entropy (segmentation) + IoU (geometry) |
Full training instructions: README.
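The two loss terms listed above can be illustrated per pixel with a minimal sketch. The card only names the terms, so the specifics here are assumptions: the IoU term is written as -ln(IoU) over the four predicted distances, distances are taken to be non-negative, and the function names are hypothetical. How the two terms are weighted against each other is not stated in the card and is left out.

```python
import math


def iou_loss(pred, target, eps=1e-6):
    """-ln(IoU) for two boxes given as (top, bottom, left, right)
    distances measured from the same pixel."""
    pt, pb, pl, pr = pred
    tt, tb, tl, tr = target
    # Both boxes contain the pixel, so the overlap along each axis is
    # the sum of the smaller distance on each side.
    inter_h = min(pt, tt) + min(pb, tb)
    inter_w = min(pl, tl) + min(pr, tr)
    inter = inter_h * inter_w
    union = (pt + pb) * (pl + pr) + (tt + tb) * (tl + tr) - inter
    return -math.log((inter + eps) / (union + eps))


def cross_entropy(probs, target_class, eps=1e-12):
    """Negative log-likelihood of the true class for one pixel."""
    return -math.log(probs[target_class] + eps)


# A prediction half the size of the target in each direction has IoU 0.25.
print(iou_loss((5, 5, 5, 5), (10, 10, 10, 10)))
print(cross_entropy((0.7, 0.2, 0.1), target_class=0))
```

Measuring both boxes from the same pixel is what makes the intersection computable without any coordinate bookkeeping.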
Intended use
This model is the detection stage in a handwriting recognition pipeline. It is designed to run on personal devices (laptops, edge devices) via ONNX Runtime, with no GPU required for inference. A separate transcription model (not yet available) would read the detected word regions.
Limitations
- Detection only: no text transcription.
- Grayscale input required.
- Fixed 448×448 resize may distort aspect ratio on non-square images.
- No training-time data augmentation (planned improvement).
- Validated on IAM-style handwriting; performance on other styles (e.g. historical documents) may vary.
Citation
The architecture is based on WordDetectorNN by Harald Scheidl.