Image classification using fine-tuned ViT - for historical documents sorting

Goal: solve the task of sorting archive page images (for their further content-based processing)

Scope: image processing, training and evaluation of the ViT model, input file/directory processing, output of class 🏷️ (category) results for the top-N predictions, summarizing predictions into a tabular format, HF 😊 hub support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF to PNG conversion

Model description 📇

🔲 Fine-tuned model repository: UFAL's vit-historical-page ^1 🔗

🔳 Base model repository: Google's vit-base-patch16-224 ^2 🔗

The model was trained on a manually annotated dataset of historical documents, in particular images of pages from archival documents whose paper sources were scanned into digital form. The images contain various combinations of text 📄, tables 📏, drawings 📈, and photos 🌄; the categories 🏷️ described below were formed based on those archival documents.

The key use case of the provided model and data processing pipeline is to classify an input PNG image (a page from a scanned PDF of a paper source) into one of the categories, each of which leads to its own content-specific data processing pipeline. In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts) / handwritten ✏️ / plain printed 📄 text or text structured in a tabular 📏 format, as well as to mark the presence of photographed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
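For orientation, running the classifier through the Hugging Face transformers image-classification pipeline might look like the minimal sketch below; the PNG file name is a placeholder, and top_k=3 mirrors the top-N output described later.

```python
from transformers import pipeline

# Minimal sketch: load the fine-tuned checkpoint from the HF hub.
classifier = pipeline("image-classification", model="ufal/vit-historical-page")

# "page_0001.png" is a placeholder for a single PNG page from a scanned PDF.
predictions = classifier("page_0001.png", top_k=3)

# Each prediction carries a category label and a confidence score.
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```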

Data 📜

Training set of the model: 8950 images

Evaluation set (10% of the whole dataset, with the same category proportions as below) model_EVAL.csv 📎: 995 images

Manual ✍ annotation was performed beforehand and took some time ⌛; the categories 🏷️ were formed from different sources of archival documents dating from 1920 to 2020. The disproportion of the categories 🏷️ is NOT intentional, but rather a result of the nature of the source data.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents were one page long and some were much longer (dozens or hundreds of pages). The specific content and language of the source data are irrelevant given the model's vision resolution; however, all of the data samples came from archaeological reports, which may somewhat affect drawing detection, since commonly drawn objects are ceramic pieces, arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.
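The repository provides its own Win/Lin data preparation scripts; purely as an illustration of the PDF to PNG step, a minimal stand-in using the pdf2image package (which requires poppler) could look like this, with file names and DPI chosen arbitrarily.

```python
from pathlib import Path
from pdf2image import convert_from_path  # needs the poppler utilities installed

def pdf_to_pngs(pdf_path: str, out_dir: str, dpi: int = 300) -> None:
    """Split one scanned PDF into numbered PNG pages (illustrative sketch only)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        page.save(out / f"{Path(pdf_path).stem}_{i:04d}.png", "PNG")

pdf_to_pngs("report.pdf", "pages/")  # "report.pdf" is a placeholder file name
```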

Categories 🏷️

Label    Ratio   Description
DRAW     11.89%  📈 - drawings, maps, paintings with text
DRAW_L    8.17%  📈📏 - drawings, etc. with a table legend or inside tabular layout / forms
LINE_HW   5.99%  ✏️📏 - handwritten text lines inside tabular layout / forms
LINE_P    6.06%  📏 - printed text lines inside tabular layout / forms
LINE_T   13.39%  📏 - machine-typed text lines inside tabular layout / forms
PHOTO    10.21%  🌄 - photos with text
PHOTO_L   7.86%  🌄📏 - photos inside tabular layout / forms or with a tabular annotation
TEXT      8.58%  📰 - mixed types of printed and handwritten texts
TEXT_HW   7.36%  ✏️📄 - only handwritten text
TEXT_P    6.95%  📄 - only printed text
TEXT_T   13.53%  📄 - only machine-typed text

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings 📈 OR photos 🌄)
  • type of text 📄 (handwritten ✏️ OR printed OR typed OR mixed 📰)
  • presence of tabular layout / forms 📏

The reason for this distinction is that different types of pages go through different processing pipelines after classification.
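A toy sketch of such routing is shown below; the handler names are hypothetical and stand in for the downstream pipelines, which are not part of this repository.

```python
# Hypothetical placeholders for the content-specific downstream pipelines.
def ocr_typed(png):        print("typed-text OCR      <-", png)
def ocr_handwritten(png):  print("handwriting OCR     <-", png)
def extract_table(png):    print("table extraction    <-", png)
def extract_graphics(png): print("graphics extraction <-", png)

# Predicted category label -> processing step (only a few labels shown).
HANDLERS = {
    "TEXT_T": ocr_typed,
    "TEXT_HW": ocr_handwritten,
    "LINE_T": extract_table,
    "DRAW": extract_graphics,
    # ... the remaining categories would map to their own handlers
}

def route(page_png: str, label: str) -> None:
    HANDLERS[label](page_png)

route("page_0001.png", "TEXT_T")  # placeholder file name and label
```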

Training

During training, image transformations were applied sequentially with a 50% chance (a possible composition is sketched after the list below).

Image preprocessing steps 👀
  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
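A rough reconstruction of that composition with torchvision, under the assumption that each transform is wrapped in RandomApply with p=0.5 (the actual training script may differ):

```python
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

def maybe(t):
    """Apply the given transform with a 50% chance."""
    return transforms.RandomApply([t], p=0.5)

# The augmentations listed above, applied sequentially, each with probability 0.5.
train_transforms = transforms.Compose([
    maybe(transforms.ColorJitter(brightness=0.5)),
    maybe(transforms.ColorJitter(contrast=0.5)),
    maybe(transforms.ColorJitter(saturation=0.5)),
    maybe(transforms.ColorJitter(hue=0.5)),
    maybe(transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))),
    maybe(transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))),
])
```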

No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons for this are pages containing specific form types, the general text orientation on the pages, and the default resizing of the model input to square 224x224 images.

Training hyperparameters 👀
  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"
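These settings map onto Hugging Face TrainingArguments roughly as follows (a sketch: output_dir is a placeholder, older transformers versions spell the first option evaluation_strategy, and the "accuracy" metric is assumed to come from a compute_metrics callback passed to the Trainer).

```python
import numpy as np
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-historical-page-out",  # placeholder output directory
    eval_strategy="epoch",                 # "evaluation_strategy" on older versions
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# "accuracy" must be produced by the Trainer's compute_metrics callback, e.g.:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}
```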

Results 📊

Evaluation set's accuracy (Top-3): 99.6%

TOP-3 confusion matrix - trained ViT

Evaluation set's accuracy (Top-1): 97.3%

TOP-1 confusion matrix - trained ViT

Result tables

Table columns

  • FILE - name of the file
  • PAGE - page number
  • CLASS-N - label of the category 🏷️ for the top-N guess
  • SCORE-N - confidence score of the category 🏷️ for the top-N guess
  • TRUE - actual label of the category 🏷️
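For illustration, a table with these columns could be assembled from classifier outputs along the following lines (a pandas sketch; file names, page numbers, and TRUE labels are placeholders supplied by the caller):

```python
import pandas as pd
from transformers import pipeline

classifier = pipeline("image-classification", model="ufal/vit-historical-page")

def summarize(pages, top_n=3):
    """pages: iterable of (png_path, page_number, true_label) tuples."""
    rows = []
    for png, page, true_label in pages:
        row = {"FILE": png, "PAGE": page, "TRUE": true_label}
        for n, pred in enumerate(classifier(png, top_k=top_n), start=1):
            row[f"CLASS-{n}"] = pred["label"]
            row[f"SCORE-{n}"] = round(pred["score"], 4)
        rows.append(row)
    return pd.DataFrame(rows)

# Placeholder input; real paths, page numbers, and labels come from your data.
print(summarize([("page_0001.png", 1, "TEXT_T")]))
```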

Contacts 📧

For support write to 📧 lutsai.k@gmail.com 📧

Official repository: UFAL ^3

Acknowledgements 🙏

  • Developed by UFAL ^5 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^5
  • Model type: fine-tuned ViT ^2 with 224x224 input resolution

©️ 2022 UFAL & ATRIUM
