Image classification using fine-tuned ViT - for historical documents sorting

Goal: solve the task of sorting archive page images (for their further content-based processing)

Scope: image processing, training and evaluation of the ViT model, input file/directory processing, output of class 🏷️ (category) results for the top-N predictions, summarizing predictions into a tabular format, HF 😊 hub support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF to PNG conversion

Model description 📇

🔲 Fine-tuned model repository: UFAL's vit-historical-page ^1 🔗

🔳 Base model repository: Google's vit-base-patch16-224 ^2 🔗

The model was trained on a manually annotated dataset of historical documents, in particular images of pages from archival documents whose paper sources were scanned into digital form. The images contain various combinations of text 📄, tables 📏, drawings 📈, and photos 🌄; the categories 🏷️ described below were formed based on those archival documents.

The key use case of the provided model and data processing pipeline is to classify an input PNG image (a page from a scanned PDF of a paper source) into one of the categories, each of which leads to its own content-specific data processing pipeline. In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts) / handwritten ✏️ / plain printed 📄 text or text structured in a tabular 📏 format, as well as to mark the presence of photographed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
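For orientation, running the classifier through the Hugging Face transformers image-classification pipeline might look like the minimal sketch below; the PNG file name is a placeholder, and top_k=3 mirrors the top-N output described later.

```python
from transformers import pipeline

# Minimal sketch: load the fine-tuned checkpoint from the HF hub.
classifier = pipeline("image-classification", model="ufal/vit-historical-page")

# "page_0001.png" is a placeholder for a single PNG page from a scanned PDF.
predictions = classifier("page_0001.png", top_k=3)

# Each prediction carries a category label and a confidence score.
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```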

Data 📜

Training set of the model: 8950 images

Evaluation set (10% of the whole dataset, with the same category proportions as below) model_EVAL.csv 📎: 995 images

Manual ✍ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from different sources of archival documents dating from 1920 to 2020. The disproportion of the categories 🏷️ is NOT intentional, but rather a result of the nature of the source data.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents were one page long and some were much longer (dozens or hundreds of pages). The specific content and language of the source data are irrelevant given the model's vision resolution; however, all of the data samples came from archaeological reports, which may somewhat affect drawing detection, since commonly drawn objects are ceramic pieces, arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.
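The repository provides its own Win/Lin data preparation scripts; purely as an illustration of the PDF to PNG step, a minimal stand-in using the pdf2image package (which requires poppler) could look like this, with file names and DPI chosen arbitrarily.

```python
from pathlib import Path
from pdf2image import convert_from_path  # needs the poppler utilities installed

def pdf_to_pngs(pdf_path: str, out_dir: str, dpi: int = 300) -> None:
    """Split one scanned PDF into numbered PNG pages (illustrative sketch only)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        page.save(out / f"{Path(pdf_path).stem}_{i:04d}.png", "PNG")

pdf_to_pngs("report.pdf", "pages/")  # "report.pdf" is a placeholder file name
```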

Categories 🏷️

Label    Ratio   Description
DRAW     11.89%  📈 - drawings, maps, paintings with text
DRAW_L    8.17%  📈📏 - drawings, etc. with a table legend or inside tabular layout / forms
LINE_HW   5.99%  ✏️📏 - handwritten text lines inside tabular layout / forms
LINE_P    6.06%  📏 - printed text lines inside tabular layout / forms
LINE_T   13.39%  📏 - machine-typed text lines inside tabular layout / forms
PHOTO    10.21%  🌄 - photos with text
PHOTO_L   7.86%  🌄📏 - photos inside tabular layout / forms or with a tabular annotation
TEXT      8.58%  📰 - mixed types of printed and handwritten texts
TEXT_HW   7.36%  ✏️📄 - only handwritten text
TEXT_P    6.95%  📄 - only printed text
TEXT_T   13.53%  📄 - only machine-typed text

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings 📈 OR photos 🌄)
  • type of text 📄 (handwritten ✏️ OR printed OR typed OR mixed 📰)
  • presence of tabular layout / forms 📏

The reason for this distinction is that different types of pages go through different processing pipelines after classification.
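A toy sketch of such routing is shown below; the handler names are hypothetical and stand in for the downstream pipelines, which are not part of this repository.

```python
# Hypothetical placeholders for the content-specific downstream pipelines.
def ocr_typed(png):        print("typed-text OCR      <-", png)
def ocr_handwritten(png):  print("handwriting OCR     <-", png)
def extract_table(png):    print("table extraction    <-", png)
def extract_graphics(png): print("graphics extraction <-", png)

# Predicted category label -> processing step (only a few labels shown).
HANDLERS = {
    "TEXT_T": ocr_typed,
    "TEXT_HW": ocr_handwritten,
    "LINE_T": extract_table,
    "DRAW": extract_graphics,
    # ... the remaining categories would map to their own handlers
}

def route(page_png: str, label: str) -> None:
    HANDLERS[label](page_png)

route("page_0001.png", "TEXT_T")  # placeholder file name and label
```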

Training

During training, image transformations were applied sequentially with a 50% chance (a possible composition is sketched after the list below).

Image preprocessing steps 👀
  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
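A rough reconstruction of that composition with torchvision, under the assumption that each transform is wrapped in RandomApply with p=0.5 (the actual training script may differ):

```python
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

def maybe(t):
    """Apply the given transform with a 50% chance."""
    return transforms.RandomApply([t], p=0.5)

# The augmentations listed above, applied sequentially, each with probability 0.5.
train_transforms = transforms.Compose([
    maybe(transforms.ColorJitter(brightness=0.5)),
    maybe(transforms.ColorJitter(contrast=0.5)),
    maybe(transforms.ColorJitter(saturation=0.5)),
    maybe(transforms.ColorJitter(hue=0.5)),
    maybe(transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))),
    maybe(transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))),
])
```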

No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons for this are pages containing specific form types, the general text orientation on the pages, and the default resizing of the model input to square 224x224 images.

Training hyperparameters 👀
  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"
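These settings map onto Hugging Face TrainingArguments roughly as follows (a sketch: output_dir is a placeholder, older transformers versions spell the first option evaluation_strategy, and the "accuracy" metric is assumed to come from a compute_metrics callback passed to the Trainer).

```python
import numpy as np
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page-out",  # placeholder output directory
    eval_strategy="epoch",                 # "evaluation_strategy" on older versions
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# "accuracy" must be produced by the Trainer's compute_metrics callback, e.g.:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}
```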

Results 📊

Evaluation set's accuracy (Top-3): 99.6%

TOP-3 confusion matrix - trained ViT

Evaluation set's accuracy (Top-1): 97.3%

TOP-1 confusion matrix - trained ViT

Result tables

Table columns

  • FILE - name of the file
  • PAGE - page number
  • CLASS-N - label of the category 🏷️ for the top-N guess
  • SCORE-N - confidence score of the category 🏷️ for the top-N guess
  • TRUE - actual label of the category 🏷️
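For illustration, a table with these columns could be assembled from classifier outputs along the following lines (a pandas sketch; file names, page numbers, and TRUE labels are placeholders supplied by the caller):

```python
import pandas as pd
from transformers import pipeline

classifier = pipeline("image-classification", model="ufal/vit-historical-page")

def summarize(pages, top_n=3):
    """pages: iterable of (png_path, page_number, true_label) tuples."""
    rows = []
    for png, page, true_label in pages:
        row = {"FILE": png, "PAGE": page, "TRUE": true_label}
        for n, pred in enumerate(classifier(png, top_k=top_n), start=1):
            row[f"CLASS-{n}"] = pred["label"]
            row[f"SCORE-{n}"] = round(pred["score"], 4)
        rows.append(row)
    return pd.DataFrame(rows)

# Placeholder input; real paths, page numbers, and labels come from your data.
print(summarize([("page_0001.png", 1, "TEXT_T")]))
```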

Contacts 📧

For support write to 📧 lutsai.k@gmail.com 📧

Official repository: UFAL ^3

Acknowledgements 🙏

  • Developed by UFAL ^5 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^5
  • Model type: fine-tuned ViT ^2 with 224x224 input resolution

©️ 2022 UFAL & ATRIUM
