metadata

license: apache-2.0
tags:
  - image-classification
  - generated_from_trainer
metrics:
  - f1
model-index:
  - name: vit_tickers_binaryclf
    results: []

vit_tickers_binaryclf

This model is a fine-tuned version of google/vit-base-patch16-224-in21k on the cord dataset. It achieves the following results on the evaluation set:

Loss: 0.0116
F1: 0.9991

Model description

This model is a binary classifier fine-tuned from ViT to predict whether an input image is a picture or scan of one or more tickets, or something else.

Intended uses & limitations

Use this model to classify your images as tickets or non-tickets. The images in the ticket group can then be passed to a multimodal information extraction step, such as Visual Named Entity Recognition, to extract the ticket items, amounts, total, etc. Check the CORD dataset for more information.
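A minimal sketch of this filtering step, assuming the checkpoint is loaded through the `transformers` image-classification pipeline. `MODEL_ID` is a hypothetical placeholder for the actual Hub repository id, and the label name `ticket` is assumed from the class names used in this card; check the `id2label` mapping in the model config before relying on it.

```python
from typing import Dict, List

# Hypothetical placeholders: replace with the real Hub repo id and label name.
MODEL_ID = "your-namespace/vit_tickers_binaryclf"
TICKET_LABEL = "ticket"


def is_ticket(predictions: List[Dict], threshold: float = 0.5) -> bool:
    """Return True if the ticket label's score reaches the threshold.

    `predictions` has the shape returned by the image-classification
    pipeline: a list of {"label": str, "score": float} dicts.
    """
    for pred in predictions:
        if pred["label"] == TICKET_LABEL:
            return pred["score"] >= threshold
    return False


def classify(image_path: str) -> bool:
    """Run the classifier on one image and report whether it is a ticket."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("image-classification", model=MODEL_ID)
    return is_ticket(classifier(image_path))
```

Images for which `classify` returns `True` would then go on to the information extraction step; the 0.5 threshold is an arbitrary default and can be raised if false positives are costly.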

Training and evaluation data

Two datasets were used for the positive class (ticket):

cord
https://expressexpense.com/blog/free-receipt-images-ocr-machine-learning-dataset/

For the negative class (no_ticket), the following datasets were used:

A subset of RVL-CDIP
A subset of visual-genome

Training procedure

The datasets were loaded with different distributions of positive and negative examples. The images were then normalized and resized to match the input that ViT expects.
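The resizing and normalization step can be sketched as follows. This is an illustration, not the training code: it assumes the usual preprocessing settings of the google/vit-base-patch16-224-in21k checkpoint (224x224 input, bilinear resampling, per-channel mean and standard deviation of 0.5), and the function names are made up for this example. In practice the feature extractor shipped with the checkpoint does this for you.

```python
IMAGE_SIZE = 224  # assumed ViT input resolution (patch16-224)
MEAN = 0.5        # assumed per-channel mean
STD = 0.5         # assumed per-channel standard deviation


def normalize(pixel_value: int) -> float:
    """Map an 8-bit channel value (0-255) to the normalized range [-1, 1]."""
    return (pixel_value / 255.0 - MEAN) / STD


def preprocess(image_path: str):
    """Load an image and return a float32 array of shape (3, 224, 224)."""
    # Imported lazily: requires Pillow and NumPy.
    import numpy as np
    from PIL import Image

    image = Image.open(image_path).convert("RGB")
    image = image.resize((IMAGE_SIZE, IMAGE_SIZE), resample=Image.BILINEAR)
    pixels = np.asarray(image, dtype=np.float32) / 255.0  # scale to [0, 1]
    pixels = (pixels - MEAN) / STD                        # same formula as normalize()
    return pixels.transpose(2, 0, 1)                      # HWC -> CHW
```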

Several runs were carried out, varying the data distribution and the hyperparameters, to maximize F1.
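For reference, the F1 score for the positive class can be computed as below. This is a plain-Python sketch; the card does not state which F1 variant or library was used, so treating `ticket` as the positive class (encoded as 1) is an assumption.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    # With no positive labels and no positive predictions, F1 is undefined;
    # 0.0 is returned here by convention.
    return 2 * tp / denominator if denominator else 0.0
```

With the labels `[1, 1, 0, 0]` and predictions `[1, 0, 0, 1]` there is one true positive, one false positive and one false negative, so the function returns 0.5.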

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 16
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1
mixed_precision_training: Native AMP
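The values above map onto `transformers.TrainingArguments` roughly as shown below. This is a sketch under assumptions: training on a single device (so the batch sizes are per-device values), "Native AMP" corresponding to `fp16=True`, and a hypothetical `output_dir`. Settings not listed in this card are left at their library defaults.

```python
# Hyperparameters as reported in this card, keyed by TrainingArguments names.
HYPERPARAMETERS = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 16,  # assumes a single device
    "per_device_eval_batch_size": 8,    # assumes a single device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 1,
    "fp16": True,                       # "Native AMP"; requires a CUDA device
}


def build_training_arguments(output_dir: str = "vit_tickers_binaryclf"):
    """Create TrainingArguments from the values above (output_dir is hypothetical)."""
    # Imported lazily: requires transformers and torch.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMETERS)
```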

Training results

Training Loss	Epoch	Step	Validation Loss	F1
0.0026	0.28	500	0.0187	0.9982
0.0186	0.56	1000	0.0116	0.9991
0.0006	0.84	1500	0.0044	0.9997

Framework versions

Transformers 4.21.2
Pytorch 1.11.0+cu102
Datasets 2.4.0
Tokenizers 0.12.1