---
datasets:
- Matthijs/snacks
model-index:
- name: matteopilotto/vit-base-patch16-224-in21k-snacks
  results:
  - task:
      type: image-classification
      name: Image Classification
    dataset:
      name: Matthijs/snacks
      type: Matthijs/snacks
      config: default
      split: test
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.8928571428571429
      verified: true
    - name: Precision Macro
      type: precision
      value: 0.8990033704680036
      verified: true
    - name: Precision Micro
      type: precision
      value: 0.8928571428571429
      verified: true
    - name: Precision Weighted
      type: precision
      value: 0.8972398709051788
      verified: true
    - name: Recall Macro
      type: recall
      value: 0.8914608843537415
      verified: true
    - name: Recall Micro
      type: recall
      value: 0.8928571428571429
      verified: true
    - name: Recall Weighted
      type: recall
      value: 0.8928571428571429
      verified: true
    - name: F1 Macro
      type: f1
      value: 0.892544821273258
      verified: true
    - name: F1 Micro
      type: f1
      value: 0.8928571428571429
      verified: true
    - name: F1 Weighted
      type: f1
      value: 0.8924168605019522
      verified: true
    - name: loss
      type: loss
      value: 0.479541540145874
      verified: true
---

# Vision Transformer fine-tuned on the `Matthijs/snacks` dataset

Vision Transformer (ViT) model pre-trained on ImageNet-21k and fine-tuned on [**Matthijs/snacks**](https://huggingface.co/datasets/Matthijs/snacks) for 5 epochs using various data augmentation transformations from `torchvision`.

The model achieves **94.97%** accuracy on the validation set and **94.43%** on the test set.
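For reference, here is a minimal inference sketch using the `transformers` pipeline API with this card's checkpoint; the image path is a placeholder:

```python
from transformers import pipeline

# load the fine-tuned checkpoint (model id from this card's metadata)
classifier = pipeline(
    'image-classification',
    model='matteopilotto/vit-base-patch16-224-in21k-snacks',
)

# classify an image; the path is a placeholder
preds = classifier('path/to/snack.jpg')
print(preds)  # list of {'label': ..., 'score': ...} dicts, highest score first
```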

## Data augmentation pipeline

The code block below shows the transformations applied during pre-processing to augment the original dataset.
The augmented images were generated on the fly with the `set_transform` method.

```python
from transformers import ViTFeatureExtractor
from torchvision.transforms import (
    Compose,
    Normalize,
    Resize,
    RandomResizedCrop,
    RandomHorizontalFlip,
    RandomAdjustSharpness,
    ToTensor
)

checkpoint = 'google/vit-base-patch16-224-in21k'
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)

# transformations on the training set
train_aug_transforms = Compose([
    RandomResizedCrop(size=feature_extractor.size),
    RandomHorizontalFlip(p=0.5),
    RandomAdjustSharpness(sharpness_factor=5, p=0.5),
    ToTensor(),
    Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])

# transformations on the validation/test set
valid_aug_transforms = Compose([
    Resize(size=(feature_extractor.size, feature_extractor.size)),
    ToTensor(),
    Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])
```
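As a minimal sketch, the transforms above could be attached to the dataset with `set_transform` as follows; the split names and the `image` column are assumptions based on the dataset's standard layout:

```python
from datasets import load_dataset

dataset = load_dataset('Matthijs/snacks')

def preprocess_train(batch):
    # apply the training augmentations to each PIL image on the fly;
    # the 'image' column name is an assumption about the dataset layout
    batch['pixel_values'] = [
        train_aug_transforms(img.convert('RGB')) for img in batch['image']
    ]
    return batch

def preprocess_valid(batch):
    # deterministic resize + normalize for evaluation
    batch['pixel_values'] = [
        valid_aug_transforms(img.convert('RGB')) for img in batch['image']
    ]
    return batch

# transforms run lazily whenever examples are accessed,
# so each epoch sees freshly augmented training images
dataset['train'].set_transform(preprocess_train)
dataset['validation'].set_transform(preprocess_valid)
dataset['test'].set_transform(preprocess_valid)
```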