Image classification using fine-tuned ViT - for sorting historical documents

Goal: sort archive page images into categories so they can be passed on to further content-based processing

Scope: image processing, training and evaluation of the ViT model, processing of input files and directories, output of the top-N predicted categories 🏷️, summarizing predictions in a tabular format, and HF 😊 Hub support for the model

Model description 📇

🔲 Fine-tuned model repository: vit-historical-page ^1 🔗

🔳 Base model repository: Google's vit-base-patch16-224 ^2 🔗

Data 📜

Training set of the model: 8950 images

Categories 🏷️

Label     Ratio    Description
DRAW      11.89%   📈 - drawings, maps, paintings with text
DRAW_L    8.17%    📈📏 - drawings ... with a table legend or inside tabular layout / forms
LINE_HW   5.99%    ✏️📏 - handwritten text lines inside tabular layout / forms
LINE_P    6.06%    📏 - printed text lines inside tabular layout / forms
LINE_T    13.39%   📏 - machine-typed text lines inside tabular layout / forms
PHOTO     10.21%   🌄 - photos with text
PHOTO_L   7.86%    🌄📏 - photos inside tabular layout / forms or with a tabular annotation
TEXT      8.58%    📰 - mixed types of printed and handwritten texts
TEXT_HW   7.36%    ✏️📄 - only handwritten text
TEXT_P    6.95%    📄 - only printed text
TEXT_T    13.53%   📄 - only machine-typed text

Evaluation set (same proportions): 995 images
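
To make the proportions concrete, the sketch below converts the listed ratios into approximate per-category image counts for both sets. The counts are estimates obtained by rounding, not figures reported in this card, and the names `RATIOS` and `approx_counts` are illustrative only.

```python
# Approximate per-category image counts implied by the ratio table.
# Assumption: counts are estimated as round(ratio * set size); the exact
# per-category counts are not given in this card.
RATIOS = {
    "DRAW": 11.89, "DRAW_L": 8.17, "LINE_HW": 5.99, "LINE_P": 6.06,
    "LINE_T": 13.39, "PHOTO": 10.21, "PHOTO_L": 7.86, "TEXT": 8.58,
    "TEXT_HW": 7.36, "TEXT_P": 6.95, "TEXT_T": 13.53,
}

def approx_counts(total_images, ratios=RATIOS):
    """Return {label: estimated image count} for a set of the given size."""
    return {label: round(total_images * pct / 100) for label, pct in ratios.items()}

train_counts = approx_counts(8950)  # training set size from the card
eval_counts = approx_counts(995)    # evaluation set size from the card

for label in RATIOS:
    print(f"{label:8s} train~{train_counts[label]:5d} eval~{eval_counts[label]:4d}")
```

Because each count is rounded independently, the estimated totals may differ from 8950 and 995 by a few images.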

Data preprocessing

During training, the following transforms were applied randomly with a 50% chance:

  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
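
One way to read "applied randomly with a 50% chance" is that each transform is gated by its own coin flip. The library-free sketch below illustrates only that gating logic: `maybe_apply` and `build_pipeline` are hypothetical helpers, and the stand-in functions replace the real torchvision/PIL transforms. In torchvision, a comparable effect is usually obtained with `transforms.RandomApply([...], p=0.5)`. Whether the transforms were gated one by one or as a group is not stated here.

```python
import random

def maybe_apply(transform, p=0.5, rng=random):
    """Wrap `transform` so it runs with probability p, else passes the input through."""
    def wrapped(x):
        return transform(x) if rng.random() < p else x
    return wrapped

def build_pipeline(transforms, p=0.5, rng=random):
    """Chain transforms, each gated independently by its own coin flip."""
    gated = [maybe_apply(t, p, rng) for t in transforms]
    def pipeline(x):
        for t in gated:
            x = t(x)
        return x
    return pipeline

# Stand-ins for the six image transforms: each appends its name to a list,
# so it is visible which ones fired for a given "image".
names = ["brightness", "contrast", "saturation", "hue", "sharpness", "blur"]
stand_ins = [lambda tags, n=n: tags + [n] for n in names]

rng = random.Random(0)  # seeded for repeatability
augment = build_pipeline(stand_ins, p=0.5, rng=rng)
applied = [augment([]) for _ in range(1000)]
mean_applied = sum(len(a) for a in applied) / len(applied)
print(f"average transforms applied per image: {mean_applied:.2f}")  # expected value is 3 of 6
```

With independent gating, some images receive no augmentation at all and others receive all six, which keeps unmodified pages in the training stream.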

Training Hyperparameters

  • eval_strategy="epoch"
  • save_strategy="epoch"
  • learning_rate=5e-5
  • per_device_train_batch_size=8
  • per_device_eval_batch_size=8
  • num_train_epochs=3
  • warmup_ratio=0.1
  • logging_steps=10
  • load_best_model_at_end=True
  • metric_for_best_model="accuracy"
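
For a sense of scale, the values above imply the step counts computed below. This assumes a single device, no gradient accumulation, and a linear schedule with warmup (the Hugging Face Trainer default), none of which is stated in this card. The helper `lr_at` is a hypothetical illustration of that schedule, not the training code itself.

```python
import math

TRAIN_IMAGES = 8950   # training set size from the card
BATCH_SIZE = 8        # per_device_train_batch_size
EPOCHS = 3            # num_train_epochs
WARMUP_RATIO = 0.1    # warmup_ratio
PEAK_LR = 5e-5        # learning_rate

# Assumption: one device and no gradient accumulation.
steps_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def lr_at(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

Under these assumptions, roughly the first tenth of the optimizer steps ramps the learning rate up to 5e-5, and with logging_steps=10 the loss would be logged a few hundred times over the run.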

Results 📊

Evaluation set accuracy (Top-3): 99.6%

Figure: Top-3 confusion matrix, trained ViT

Evaluation set accuracy (Top-1): 97.3%

Figure: Top-1 confusion matrix, trained ViT
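
Top-N accuracy counts a page as correct when its true label appears among the N highest-scoring categories. The sketch below shows the calculation on made-up scores; the `top_k_accuracy` helper and the sample data are illustrative and are not the evaluation code behind the numbers above.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of rows whose true label is among the k highest-scoring labels."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        ranked = sorted(scores, key=scores.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)

# Made-up per-page scores, restricted to three categories for brevity.
score_rows = [
    {"TEXT": 0.80, "TEXT_P": 0.15, "TEXT_HW": 0.05},
    {"TEXT": 0.30, "TEXT_P": 0.60, "TEXT_HW": 0.10},
    {"TEXT": 0.45, "TEXT_P": 0.35, "TEXT_HW": 0.20},
    {"TEXT": 0.50, "TEXT_P": 0.10, "TEXT_HW": 0.40},
]
true_labels = ["TEXT", "TEXT_P", "TEXT_HW", "TEXT_HW"]

top1 = top_k_accuracy(score_rows, true_labels, k=1)
top2 = top_k_accuracy(score_rows, true_labels, k=2)
print(top1, top2)
```

On the 995-image evaluation set, 97.3% and 99.6% correspond to roughly 968 and 991 correctly classified pages.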

Result tables

Table columns

  • FILE - name of the file
  • PAGE - number of the page
  • CLASS-N - label of the N-th ranked predicted category 🏷️ (top-N guess)
  • SCORE-N - score of the N-th ranked predicted category 🏷️ (top-N guess)
  • TRUE - actual label of the category 🏷️
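
The column layout above can be sketched with the standard `csv` module. The function names, the sample file name, the sample predictions, and the ordering of CLASS-N before SCORE-N columns are all assumptions made for illustration; the project's own output code may differ.

```python
import csv
import io

def result_header(top_n):
    """Build the column names: FILE, PAGE, CLASS-1..N, SCORE-1..N, TRUE."""
    classes = [f"CLASS-{i}" for i in range(1, top_n + 1)]
    scores = [f"SCORE-{i}" for i in range(1, top_n + 1)]
    return ["FILE", "PAGE", *classes, *scores, "TRUE"]

def write_results(rows, top_n, out):
    """Write one row per page; each input row is (file, page, [(label, score), ...], true)."""
    writer = csv.writer(out)
    writer.writerow(result_header(top_n))
    for file_name, page, predictions, true_label in rows:
        top = predictions[:top_n]
        labels = [label for label, _ in top]
        scores = [f"{score:.3f}" for _, score in top]
        writer.writerow([file_name, page, *labels, *scores, true_label])

# Hypothetical predictions for two pages of one file.
rows = [
    ("archive_doc", 1, [("TEXT_T", 0.91), ("TEXT", 0.06), ("LINE_T", 0.02)], "TEXT_T"),
    ("archive_doc", 2, [("PHOTO_L", 0.55), ("PHOTO", 0.40), ("DRAW_L", 0.03)], "PHOTO"),
]
buffer = io.StringIO()
write_results(rows, top_n=3, out=buffer)
print(buffer.getvalue())
```

In this made-up sample, the second page is wrong at Top-1 but correct at Top-2, which is the kind of case the TRUE column makes easy to spot.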

Contacts 📧

For support, write to 📧 lutsai.k@gmail.com 📧

Official repository: UFAL ^3

Acknowledgements 🙏

  • Developed by UFAL ^5 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^5
  • Model type: fine-tuned ViT ^2 with a 224x224 input resolution

©️ 2022 UFAL & ATRIUM
