---
license: apache-2.0
tags:
- image-classification
- generated_from_trainer
metrics:
- f1
base_model: google/vit-base-patch16-224-in21k
model-index:
- name: vit_receipts_classifier
  results: []
---

# vit_receipts_classifier

This model is a fine-tuned version of [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k), trained on CORD, RVL-CDIP, Visual Genome, and an external receipt dataset for binary classification (`ticket` vs `no_ticket`).

Here, "ticket" is used as a synonym for "receipt".

It achieves the following results on the evaluation set, which contains scanned, photographed, and mobile-captured images (color and grayscale) from the datasets above:
- Loss: 0.0116
- F1: 0.9991

## Model description

This model is a binary classifier fine-tuned from ViT that predicts whether an input image is a picture or scan of one or more receipts, or something else.

## Intended uses & limitations

Use this model to classify your images as tickets or non-tickets. On the images classified as tickets, you can then apply multimodal information extraction, such as visual named entity recognition, to extract the ticket items, amounts, totals, etc. See the CORD dataset for more information.
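
A minimal inference sketch, assuming the checkpoint is loaded with the Transformers `image-classification` pipeline and that its labels are `ticket` and `no_ticket` as described above. The repository id placeholder, the helper names `load_classifier` and `is_ticket`, and the 0.5 threshold are illustrative assumptions, not part of this card.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def load_classifier(model_id: str):
    """Build an image-classification pipeline for the given checkpoint.

    `model_id` is a placeholder: pass the Hub repository id or a local path.
    """
    from transformers import pipeline  # imported lazily; requires `transformers`

    return pipeline("image-classification", model=model_id)


def is_ticket(predictions: List[Prediction], threshold: float = 0.5) -> bool:
    """Return True if the `ticket` score reaches the (assumed) threshold.

    `predictions` has the shape the pipeline returns for one image:
    a list of {"label": str, "score": float} dictionaries.
    """
    scores = {p["label"]: float(p["score"]) for p in predictions}
    return scores.get("ticket", 0.0) >= threshold


# Usage (needs the model weights, so it is left commented out):
# classifier = load_classifier("<path-or-repo-id-of-this-model>")
# predictions = classifier("receipt.jpg")
# print(is_ticket(predictions))
```

Keeping the decision rule separate from model loading makes it easy to adjust the threshold if false positives are costly for the downstream extraction step.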

## Training and evaluation data

Two datasets were used for the positive class (`ticket`):
- `cord` 
- `https://expressexpense.com/blog/free-receipt-images-ocr-machine-learning-dataset/`

For the negative class (`no_ticket`), the following datasets were used:
- A subset of `RVL-CDIP`
- A subset of `visual-genome`

## Training procedure

The datasets were loaded with different proportions of positive and negative examples. Images were then resized and normalized to match the input format ViT expects.
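
The resizing and normalization step can be sketched as follows. The 224×224 size comes from the base model name; the per-channel mean and standard deviation of 0.5 are assumed to match the base checkpoint's image processor, and the function names are hypothetical. In practice, the image processor shipped with the checkpoint should handle this.

```python
import numpy as np

IMAGE_SIZE = 224  # from the base model name (patch16-224)
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)  # assumed processor default
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)   # assumed processor default


def load_and_resize(path: str) -> np.ndarray:
    """Open an image, force 3 channels, and resize to IMAGE_SIZE x IMAGE_SIZE."""
    from PIL import Image  # imported lazily; requires Pillow

    with Image.open(path) as source:
        # convert("RGB") also covers grayscale scans by replicating the channel.
        resized = source.convert("RGB").resize((IMAGE_SIZE, IMAGE_SIZE), Image.BILINEAR)
        return np.asarray(resized, dtype=np.uint8)


def normalize(pixels: np.ndarray) -> np.ndarray:
    """Map HxWx3 uint8 pixels to a 3xHxW float32 array in [-1, 1]."""
    scaled = pixels.astype(np.float32) / 255.0  # rescale to [0, 1]
    normalized = (scaled - MEAN) / STD          # per-channel normalization
    return normalized.transpose(2, 0, 1)        # channels-first layout


# Usage (needs an image file, so it is left commented out):
# model_input = normalize(load_and_resize("receipt.jpg"))
```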

Several runs were carried out, varying the data distribution and the hyperparameters, to maximize F1.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
- mixed_precision_training: Native AMP
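
As an illustration of the `linear` scheduler type listed above, the sketch below decays the learning rate from 0.0002 to zero over the single training epoch. It assumes no warmup (none is listed) and roughly 1786 optimizer steps per epoch, a figure estimated from the training results table (500 steps ≈ 0.28 epoch) rather than stated in this card.

```python
BASE_LR = 2e-4      # learning_rate from the list above
TOTAL_STEPS = 1786  # assumption: estimated as 500 steps / 0.28 epoch


def linear_lr(step: int, base_lr: float = BASE_LR, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at `step` under linear decay to zero, without warmup."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps)


# Print the learning rate at the evaluation steps logged during training.
for step in (0, 500, 1000, 1500):
    print(f"step {step:>4}: lr = {linear_lr(step):.2e}")
```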

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0026        | 0.28  | 500  | 0.0187          | 0.9982 |
| 0.0186        | 0.56  | 1000 | 0.0116          | 0.9991 |
| 0.0006        | 0.84  | 1500 | 0.0044          | 0.9997 |
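
For reference, the F1 column is the harmonic mean of precision and recall. The sketch below computes it from true and predicted labels; treating `ticket` as the positive class and the function name `binary_f1` are assumptions, since the card does not state how F1 was averaged.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "ticket") -> float:
    """F1 = 2 * TP / (2 * TP + FP + FN) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels for illustration only (not taken from the evaluation set).
y_true = ["ticket", "ticket", "no_ticket", "no_ticket"]
y_pred = ["ticket", "no_ticket", "no_ticket", "no_ticket"]
print(binary_f1(y_true, y_pred))
```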


### Framework versions

- Transformers 4.21.2
- Pytorch 1.11.0+cu102
- Datasets 2.4.0
- Tokenizers 0.12.1