Image classification using fine-tuned ViT for sorting historical documents
Goal: sort archive page images for further content-based processing
Scope: image processing, training and evaluation of the ViT model, processing of input files and directories, output of the top-N predicted classes (categories), summarizing predictions in a tabular format, and HF hub support for the model
Versions
There are currently 2 versions of the model available for download. Both have the same set of categories
but different data annotations. The latest, v2.0,
is considered the default.
Version | Pages | N-page files | PDFs | Description |
---|---|---|---|---|
v1.0 | 10073 | ~104 | 3896 | annotations with mistakes, more heterogeneous data |
v2.0 | 11940 | ~509 | 5002 | more diverse pages in each category, fewer annotation mistakes |
Model description
Fine-tuned model repository: vit-historical-page ^1
Base model repository: Google's vit-base-patch16-224 ^2
Data
Training set of the model: 8950 images for v1.0
Training set of the model: 10745 images for v2.0
Categories
v1.0 version Categories:
Label | Ratio | Description |
---|---|---|
DRAW | 11.89% | drawings, maps, paintings with text |
DRAW_L | 8.17% | drawings, etc. with a table legend or inside tabular layout / forms |
LINE_HW | 5.99% | handwritten text lines inside tabular layout / forms |
LINE_P | 6.06% | printed text lines inside tabular layout / forms |
LINE_T | 13.39% | machine typed text lines inside tabular layout / forms |
PHOTO | 10.21% | photos with text |
PHOTO_L | 7.86% | photos inside tabular layout / forms or with a tabular annotation |
TEXT | 8.58% | mixed types of printed and handwritten texts |
TEXT_HW | 7.36% | only handwritten text |
TEXT_P | 6.95% | only printed text |
TEXT_T | 13.53% | only machine typed text |
v2.0 version Categories:
Label | Ratio | Description |
---|---|---|
DRAW | 9.12% | drawings, maps, paintings with text |
DRAW_L | 9.14% | drawings, etc. with a table legend or inside tabular layout / forms |
LINE_HW | 8.84% | handwritten text lines inside tabular layout / forms |
LINE_P | 9.15% | printed text lines inside tabular layout / forms |
LINE_T | 9.2% | machine typed text lines inside tabular layout / forms |
PHOTO | 9.05% | photos with text |
PHOTO_L | 9.1% | photos inside tabular layout / forms or with a tabular annotation |
TEXT | 9.14% | mixed types of printed and handwritten texts |
TEXT_HW | 9.14% | only handwritten text |
TEXT_P | 9.07% | only printed text |
TEXT_T | 9.05% | only machine typed text |
Evaluation set (same proportions): 995 images for v1.0
Evaluation set (same proportions): 1194 images for v2.0
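As a quick sanity check on the training and evaluation counts, the share of images held out for evaluation can be computed directly. The sketch below uses only the numbers stated in this card; the dictionary layout and the `eval_share` helper are illustrative choices, not part of the project code.

```python
# Training / evaluation image counts as stated in this model card.
SPLITS = {
    "v1.0": {"train": 8950, "eval": 995},
    "v2.0": {"train": 10745, "eval": 1194},
}


def eval_share(train: int, evaluation: int) -> float:
    """Fraction of the annotated images held out for evaluation."""
    return evaluation / (train + evaluation)


shares = {
    version: eval_share(counts["train"], counts["eval"])
    for version, counts in SPLITS.items()
}

for version, share in shares.items():
    print(f"{version}: {share:.2%} of images held out for evaluation")
```

Both versions come out at roughly a 90/10 split between training and evaluation images.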
Data preprocessing
During training the following transforms were applied randomly with a 50% chance:
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
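The 50% gating can be pictured as a coin flip in front of each transform. The sketch below is dependency-free and assumes each transform is gated independently (the card does not say whether the chance applies per transform or to the whole set). `maybe_apply`, `tag`, and the string-recording stand-ins are hypothetical names for illustration, not the project's code. In torchvision, the same gating is commonly written as `transforms.RandomApply([t], p=0.5)`.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Transform = Callable[[List[str]], List[str]]


def maybe_apply(
    sample: List[str],
    transforms: Sequence[Tuple[str, Transform]],
    p: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Run each transform on the sample with probability p, in order."""
    rng = rng or random.Random()
    for _name, transform in transforms:
        if rng.random() < p:  # independent coin flip per transform (assumption)
            sample = transform(sample)
    return sample


def tag(name: str) -> Transform:
    """Stand-in for a real image transform: it only records that it ran."""
    return lambda history: history + [name]


# Names mirror the list above; the real transforms operate on PIL images.
NAMES = ("brightness", "contrast", "saturation", "hue", "sharpness", "blur")
PIPELINE = [(name, tag(name)) for name in NAMES]

rng = random.Random(0)  # seeded only to make the demo repeatable
runs = [maybe_apply([], PIPELINE, p=0.5, rng=rng) for _ in range(2000)]
blur_rate = sum("blur" in run for run in runs) / len(runs)
print(f"blur applied in {blur_rate:.1%} of samples")
```

Under this reading, any given training image sees a random subset of the six augmentations, and about one image in 64 passes through unchanged.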
Training Hyperparameters
- eval_strategy "epoch"
- save_strategy "epoch"
- learning_rate 5e-5
- per_device_train_batch_size 8
- per_device_eval_batch_size 8
- num_train_epochs 3
- warmup_ratio 0.1
- logging_steps 10
- load_best_model_at_end True
- metric_for_best_model "accuracy"
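To make `warmup_ratio` concrete, the sketch below estimates the schedule length implied by these settings and the training set sizes given earlier. It assumes a single device, no gradient accumulation, and that the warmup length is the total step count times the ratio, rounded up; none of these is stated in the card, and `schedule` is a hypothetical helper.

```python
import math


def schedule(train_images: int, batch_size: int = 8, epochs: int = 3,
             warmup_ratio: float = 0.1) -> dict:
    """Estimate optimizer steps and warmup steps for one training run."""
    # The last batch of an epoch may be partial, hence the ceiling.
    steps_per_epoch = math.ceil(train_images / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


v1 = schedule(8950)   # v1.0 training set size
v2 = schedule(10745)  # v2.0 training set size
print("v1.0:", v1)
print("v2.0:", v2)
```

With `logging_steps 10`, a run of this length would log a few hundred times, and evaluation plus checkpointing would happen three times, once per epoch.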
Results
v1.0 Evaluation set's accuracy (Top-3): 99.6%
v2.0 Evaluation set's accuracy (Top-3): 99.75%
v1.0 Evaluation set's accuracy (Top-1): 97.3%
v2.0 Evaluation set's accuracy (Top-1): 96.82%
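Top-N accuracy counts a prediction as correct when the true label appears among the N highest-scoring categories. A minimal sketch of that metric follows; the function name and the toy scores are invented for illustration and are not taken from the evaluation data.

```python
from typing import Dict, Sequence


def top_n_accuracy(scores: Sequence[Dict[str, float]],
                   truths: Sequence[str], n: int) -> float:
    """Share of samples whose true label is among the n best-scored labels."""
    if len(scores) != len(truths) or not truths:
        raise ValueError("scores and truths must be non-empty and equal in length")
    hits = 0
    for per_class, truth in zip(scores, truths):
        # Labels sorted from highest to lowest score.
        ranked = sorted(per_class, key=per_class.get, reverse=True)
        hits += truth in ranked[:n]
    return hits / len(truths)


# Toy per-category scores for four pages (made-up numbers).
toy_scores = [
    {"TEXT": 0.70, "TEXT_P": 0.20, "TEXT_HW": 0.10},
    {"TEXT": 0.50, "TEXT_P": 0.30, "TEXT_HW": 0.20},
    {"DRAW": 0.60, "PHOTO": 0.30, "DRAW_L": 0.08, "PHOTO_L": 0.02},
    {"LINE_T": 0.90, "LINE_P": 0.06, "LINE_HW": 0.04},
]
toy_truths = ["TEXT", "TEXT_P", "PHOTO_L", "LINE_T"]

top1 = top_n_accuracy(toy_scores, toy_truths, n=1)
top3 = top_n_accuracy(toy_scores, toy_truths, n=3)
print(f"Top-1: {top1:.2%}, Top-3: {top3:.2%}")
```

Top-3 accuracy can never be lower than Top-1 on the same data, which matches the pattern in the figures reported for both versions.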
Result tables
v1.0 Manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
v1.0 Manually checked evaluation dataset results (TOP-1): model_TOP-1_EVAL.csv
v2.0 Manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
v2.0 Manually checked evaluation dataset results (TOP-1): model_TOP-1_EVAL.csv
Table columns
- FILE - name of the file
- PAGE - number of the page
- CLASS-N - label of the category predicted as the TOP-N guess
- SCORE-N - score of the category predicted as the TOP-N guess
- TRUE - actual label of the category
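A result table with these columns can be read with the standard `csv` module. The sketch below assumes that N in CLASS-N / SCORE-N expands to 1, 2, 3 in a TOP-3 table (so `CLASS-1` is the best guess) and that the file is comma-separated; the sample rows are made up and do not come from the published CSV files.

```python
import csv
import io

# Made-up rows in the assumed TOP-3 layout.
SAMPLE = """FILE,PAGE,CLASS-1,SCORE-1,CLASS-2,SCORE-2,CLASS-3,SCORE-3,TRUE
doc_a.pdf,1,TEXT_T,0.91,TEXT,0.05,LINE_T,0.02,TEXT_T
doc_a.pdf,2,LINE_T,0.55,TEXT_T,0.40,TEXT,0.03,TEXT_T
doc_b.pdf,1,PHOTO,0.48,PHOTO_L,0.45,DRAW,0.04,DRAW_L
"""


def hit_rates(rows, max_n=3):
    """Map each n in 1..max_n to the share of rows whose TRUE label
    appears among the first n CLASS-* guesses."""
    rows = list(rows)
    rates = {}
    for n in range(1, max_n + 1):
        hits = sum(
            row["TRUE"] in [row[f"CLASS-{k}"] for k in range(1, n + 1)]
            for row in rows
        )
        rates[n] = hits / len(rows)
    return rates


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rates = hit_rates(rows)
for n, rate in rates.items():
    print(f"TOP-{n} hit rate: {rate:.2%}")
```

To run the same check on a downloaded table, replace `io.StringIO(SAMPLE)` with an opened file handle, for example `open(path, newline="")`, after confirming that the real column names match this assumption.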
Contacts
For support, write to lutsai.k@gmail.com
Official repository: UFAL ^3
Acknowledgements
- Developed by UFAL ^5
- Funded by ATRIUM ^4
- Shared by ATRIUM ^4 & UFAL ^5
- Model type: fine-tuned ViT ^2 with a 224x224 resolution size
© 2022 UFAL & ATRIUM