HTRflow - A tool for HTR and OCR

Community Article Published October 1, 2024


The National Archives of Sweden (Riksarkivet) preserves an extensive collection of archival records from a wide range of sources, a significant portion of them handwritten. In digitized form, however, these materials are just images. Unless they are transcribed into searchable text, they are unlikely to foster new research and knowledge, which limits the potential for data-driven studies and large-scale analysis.

Furthermore, the challenge today is to facilitate scalable and accessible Handwritten Text Recognition (HTR) and Optical Character Recognition (OCR) across different types of materials. At Riksarkivet, we often find ourselves reinventing the wheel to meet these goals. To address this, the AI Lab at Riksarkivet (AIRA) has released HTRflow, an open-source package designed to simplify and streamline the use of HTR and OCR for everyone.

What is HTR?

HTR is a technology that enables the automatic conversion of handwritten text in images or scanned documents into machine-readable text. Unlike traditional OCR, which is optimized for printed or typewritten text, HTR focuses on interpreting the nuances and variations of human handwriting. This technology is essential for digitizing historical documents, making them searchable and accessible for modern research and analysis.

Below you can see an example of material that has been transcribed using HTR:

The tool used above is a Spaces demo, the HTRflow app.

You can perform HTR using Python and a Transformer-based Optical Character Recognition (TrOCR) model from the Hugging Face 🤗 Hub. Specifically, the model Riksarkivet/trocr-base-handwritten-hist-swe-2 is trained to recognize handwritten Swedish historical text. The example below shows how to use this model with Transformers:

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# Load the processor and model
processor = TrOCRProcessor.from_pretrained('Riksarkivet/trocr-base-handwritten-hist-swe-2')
model = VisionEncoderDecoderModel.from_pretrained('Riksarkivet/trocr-base-handwritten-hist-swe-2')

# Load an image of a single handwritten text line, e.g. a local file or an image URL
image_url = 'https://example.com/your_handwritten_image.jpg'  # Replace with your image URL
image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

# Preprocess the image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate transcription (you can adjust parameters like max_length and num_beams)
generated_ids = model.generate(pixel_values, max_length=512)

# Decode the generated IDs to text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)

What is HTRflow?

HTRflow is an open-source tool that simplifies the process of performing HTR and OCR tasks using pre-trained models in a "pipeline/blueprint" style. Key features include:

  • Flexibility: Customize the HTR/OCR process for different kinds of materials.
  • Compatibility: HTRflow supports all models trained by the AI lab - and more!
  • YAML pipelines: HTRflow YAML pipelines are easy to create, modify and share.
  • Export: Export results as ALTO XML, PAGE XML, plain text or JSON.
  • Evaluation: Compare results from different pipelines with ground truth.
Illustration of HTRflow pipeline

At its core, HTRflow uses a pipeline pattern that operates on Collection instances, which act as the data structure. HTRflow's data handling is based on a hierarchical tree of interconnected nodes. This structure mirrors the physical layout of a document, with pages, paragraphs, lines, marginalia and words, as shown below:

Collection         # <-- Document
├── Page           # <-- A page from the document
│   ├── Node       # <-- A paragraph from the page
│   │   ├── Node   # <-- A text line with transcription
│   │   └── Node  
│   └── Node      
│       ├── Node   
│       └── Node
│
└── Page (Parent)
    ├── Node (Child)
    └── Node (Child)

Each node in this hierarchical tree knows its parent and its children, effectively modeling the document hierarchy (e.g. Document > Page > Paragraph > Line > Word). This parent-child relationship allows for efficient organization and processing of the document's elements.
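
The parent-child relationship can be illustrated with a minimal sketch. The TreeNode class, its add_child() helper and its path() method are hypothetical names made up for this example; they are not HTRflow's actual classes, which carry more information.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class TreeNode:
    """Hypothetical stand-in for a node in the document tree (not HTRflow's class)."""

    label: str
    parent: TreeNode | None = field(default=None, repr=False)
    children: list[TreeNode] = field(default_factory=list)

    def add_child(self, label: str) -> TreeNode:
        """Create a child node that knows this node as its parent."""
        child = TreeNode(label, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Follow the parent links up to the root and join the labels."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.label)
            node = node.parent
        return " > ".join(reversed(parts))


document = TreeNode("Document")
page = document.add_child("Page")
paragraph = page.add_child("Paragraph")
line = paragraph.add_child("Line")

print(line.path())  # Document > Page > Paragraph > Line
```

Because every node holds a reference to its parent, a text line can always find the page it belongs to without searching the tree from the top.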

Processing results, such as recognized text, are stored at the appropriate node level. For example, text recognized from a text line in the image is stored in that line's SegmentNode, as illustrated below:

Collection                        
 └── Page                    
      └── SegmentNode ── TextRec             

To interact with the Collection, you can traverse through the nodes down to the leaves of the tree. Methods like traverse() and leaves() enable easy navigation through the tree, allowing you to apply functions to all nodes, specific levels, or leaf nodes only. Each pipeline step takes a Collection and returns an updated Collection. Essentially, you declaratively define at each step what type of model or function should run before updating the Collection instances.
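
As a rough sketch of this idea, the example below implements a tiny tree with traverse() and leaves() and runs one placeholder step over it. The method names are borrowed from the text above, but the signatures, the Node class, the recognize_text step and the run_pipeline helper are assumptions made for illustration and may differ from HTRflow's real API.

```python
from typing import Callable, Iterator, List, Optional


class Node:
    """Hypothetical tree node; HTRflow's real classes are richer."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None):
        self.label = label
        self.children = children or []
        self.data = {}  # processing results are stored on the node

    def traverse(self, keep: Callable[["Node"], bool] = lambda n: True) -> Iterator["Node"]:
        """Yield this node and its descendants (depth-first) that pass `keep`."""
        if keep(self):
            yield self
        for child in self.children:
            yield from child.traverse(keep)

    def leaves(self) -> Iterator["Node"]:
        """Yield only the nodes that have no children."""
        return self.traverse(lambda n: not n.children)


def recognize_text(root: Node) -> Node:
    """Placeholder for a text recognition step: attaches dummy text to each leaf."""
    for i, leaf in enumerate(root.leaves(), start=1):
        leaf.data["text"] = f"line {i}"
    return root


def run_pipeline(root: Node, steps: List[Callable[[Node], Node]]) -> Node:
    """Each step takes the tree and returns the updated tree."""
    for step in steps:
        root = step(root)
    return root


collection = Node("Collection", [
    Node("Page", [Node("Line"), Node("Line")]),
    Node("Page", [Node("Line")]),
])
collection = run_pipeline(collection, [recognize_text])

texts = [leaf.data["text"] for leaf in collection.leaves()]
print(texts)  # ['line 1', 'line 2', 'line 3']
```

Since every step has the same shape (tree in, tree out), steps can be reordered or swapped without changing the code that runs them.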

Here's an illustration of how the tree structure is updated by an HTR step (TextRecognition) and a reading order step (OrderLines):

# Before Processing:           # After Processing:

Collection                   Collection
 ├── Page                     ├── Page ── ReadOrder1
 |    └── SegmentNode         |    └── SegmentNode1 ── TextRec 
 └── Page                ->   └── Page ── ReadOrder2
      ├── SegmentNode              ├── SegmentNode21 ── TextRec 
      └── SegmentNode              └── SegmentNode22 ── TextRec 

By organizing the document into a hierarchical tree structure, HTRflow enables efficient processing and manipulation of data, ensuring that each element is handled appropriately. This structure not only preserves the document's natural layout but also allows for the results to be easily exported in various formats such as ALTO XML, PAGE XML, plain text, or JSON, facilitating further analysis and integration into other workflows.
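
To see why a tree makes export straightforward, consider the simplified sketch below. It uses plain nested dictionaries rather than HTRflow's Collection, and the to_plain_text() and to_json() helpers are hypothetical; HTRflow's own exporters, including the ALTO XML and PAGE XML ones, are more involved.

```python
import json

# A simplified stand-in for a processed document: pages containing text lines.
document = {
    "label": "Collection",
    "pages": [
        {"label": "Page 1", "lines": ["first line", "second line"]},
        {"label": "Page 2", "lines": ["third line"]},
    ],
}


def to_plain_text(doc: dict) -> str:
    """Join the lines of each page, separating pages with a blank line."""
    pages = ["\n".join(page["lines"]) for page in doc["pages"]]
    return "\n\n".join(pages)


def to_json(doc: dict) -> str:
    """Serialize the whole hierarchy, keeping non-ASCII characters readable."""
    return json.dumps(doc, ensure_ascii=False, indent=2)


plain = to_plain_text(document)
print(plain)
print(to_json(document))
```

Both formats are produced from the same hierarchy: plain text flattens it, while JSON keeps the nesting intact.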

Note that HTRflow is not currently integrated with Hugging Face 🤗 as a library; however, you can download models as you would with Transformers.

Using HTRflow

To use HTRflow, install it with pip:

pip install htrflow

Once HTRflow is installed, run it with:

htrflow pipeline <path/to/pipeline.yaml> <path/to/image>

The pipeline sub-command tells HTRflow to apply the pipeline defined in the given YAML file to the given image. To get started, try the example pipeline in the next section.

An example pipeline

Here is an example of an HTRflow pipeline:

steps:
- step: Segmentation
  settings:
    model: yolo
    model_settings:
      model: Riksarkivet/yolov9-lines-within-regions-1
- step: TextRecognition
  settings:
    model: TrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-hist-swe-2
- step: OrderLines
- step: Export
  settings:
    format: txt
    dest: outputs

This pipeline consists of four steps:

  1. Segmentation: Segments the image into lines using a YOLO model.
  2. TextRecognition: Transcribes the segmented lines using TrOCR.
  3. OrderLines: Orders the lines according to the reading order.
  4. Export: Exports the result as a text file to a directory called outputs.
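
To give an intuition for step 3, here is a deliberately naive sketch that orders lines by the position of their bounding boxes, top to bottom and then left to right. The Line class and the order_lines() function are hypothetical, and this is only an assumption about what a reading order step might do; HTRflow's OrderLines step may use a different and more robust method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical segmented text line with the top-left corner of its box."""

    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def order_lines(lines: List[Line]) -> List[Line]:
    """Naive reading order: sort top to bottom, then left to right."""
    return sorted(lines, key=lambda line: (line.y, line.x))


lines = [
    Line("third", x=40, y=300),
    Line("first", x=42, y=100),
    Line("second", x=38, y=200),
]
ordered = [line.text for line in order_lines(lines)]
print(ordered)  # ['first', 'second', 'third']
```

A simple sort like this breaks down on multi-column pages or marginalia, which is one reason reading order is a dedicated pipeline step.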

To run the demo pipeline on your selected image, paste the pipeline content into an empty text file and save it as pipeline.yaml. Assuming the input image is called image.jpg, run HTRflow with:

htrflow pipeline pipeline.yaml image.jpg
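
If you prefer to script these steps, the sketch below writes the example pipeline to pipeline.yaml and assembles the same command from Python. The build_command() helper is a hypothetical name, and the script only invokes HTRflow when the htrflow executable and image.jpg are both present.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# The example pipeline from this article, copied verbatim.
PIPELINE_YAML = """\
steps:
- step: Segmentation
  settings:
    model: yolo
    model_settings:
      model: Riksarkivet/yolov9-lines-within-regions-1
- step: TextRecognition
  settings:
    model: TrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-hist-swe-2
- step: OrderLines
- step: Export
  settings:
    format: txt
    dest: outputs
"""


def build_command(pipeline: Path, image: Path) -> List[str]:
    """Assemble the CLI call shown above as an argument list."""
    return ["htrflow", "pipeline", str(pipeline), str(image)]


pipeline_path = Path("pipeline.yaml")
image_path = Path("image.jpg")  # replace with your own image

pipeline_path.write_text(PIPELINE_YAML, encoding="utf-8")
command = build_command(pipeline_path, image_path)
print(" ".join(command))

# Only run HTRflow if it is installed and the input image exists.
if shutil.which("htrflow") and image_path.exists():
    subprocess.run(command, check=True)
else:
    print("Skipping run: htrflow is not installed or image.jpg is missing.")
```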

Here are the segmentation results:

A page from Bergskollegium dated 1698. Source

And here is the transcribed text:

Mästaren med det lofliga hammarsmeds hämbetet bör¬
römliga Sven Svensson Hjerpe, som med sin hustru rör¬
bära och dygdesanna Lisa Jansdotter bortflyttrande
till Lungsundt Sochn; bekoma härifrån följande
bewis, at mannen är född 1746, hustrun 1754.
begge i sina Christandoms stycken grundrade och i
sin lofnad förelgifwande. warit till hittwarden
sin 25/4 och wid Förhören på behörig tid.
Carlskoja fyld 21 Sept: 1773. Bengt Forsman
adj: Past:
Sundberg
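
Some lines in the output end with "¬". The sketch below assumes that this character marks a word broken across two lines and shows one possible way to rejoin such words after export. The join_broken_words() function is a hypothetical post-processing helper and is not part of HTRflow.

```python
from typing import List


def join_broken_words(lines: List[str], marker: str = "¬") -> List[str]:
    """Merge each line ending in `marker` with the line that follows it."""
    joined = []
    carry = ""
    for line in lines:
        line = carry + line.strip()
        if line.endswith(marker):
            # Drop the marker and hold the text until the next line arrives.
            carry = line[: -len(marker)]
        else:
            joined.append(line)
            carry = ""
    if carry:
        joined.append(carry)
    return joined


lines = [
    "Mästaren med det lofliga hammarsmeds hämbetet bör¬",
    "römliga Sven Svensson Hjerpe, som med sin hustru rör¬",
    "bära och dygdesanna Lisa Jansdotter bortflyttrande",
]
result = join_broken_words(lines)
print(result)
```

Note that joining words this way also merges the affected lines, so the original line breaks are no longer preserved.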

And that's how you make an HTRflow pipeline! 🎉


For more details, see the HTRflow documentation.