metadata

datasets:
  - Matthijs/snacks
model-index:
  - name: matteopilotto/vit-base-patch16-224-in21k-snacks
    results:
      - task:
          type: image-classification
          name: Image Classification
        dataset:
          name: Matthijs/snacks
          type: Matthijs/snacks
          config: default
          split: test
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.8928571428571429
            verified: true
          - name: Precision Macro
            type: precision
            value: 0.8990033704680036
            verified: true
          - name: Precision Micro
            type: precision
            value: 0.8928571428571429
            verified: true
          - name: Precision Weighted
            type: precision
            value: 0.8972398709051788
            verified: true
          - name: Recall Macro
            type: recall
            value: 0.8914608843537415
            verified: true
          - name: Recall Micro
            type: recall
            value: 0.8928571428571429
            verified: true
          - name: Recall Weighted
            type: recall
            value: 0.8928571428571429
            verified: true
          - name: F1 Macro
            type: f1
            value: 0.892544821273258
            verified: true
          - name: F1 Micro
            type: f1
            value: 0.8928571428571429
            verified: true
          - name: F1 Weighted
            type: f1
            value: 0.8924168605019522
            verified: true
          - name: loss
            type: loss
            value: 0.479541540145874
            verified: true

Vision Transformer fine-tuned on `Matthijs/snacks` dataset

Vision Transformer (ViT) model pre-trained on ImageNet-21k and fine-tuned on Matthijs/snacks for 5 epochs using various data augmentation transformations from torchvision.

The model achieves a 94.97% and 94.43% accuracy on the validation and test set, respectively.

Data augmentation pipeline

The code block below shows the various transformations applied during pre-processing to augment the original dataset. The augmented images where generated on-the-fly with the set_transform method.

from transformers import ViTFeatureExtractor
from torchvision.transforms import (
    Compose,
    Normalize,
    Resize,
    RandomResizedCrop,
    RandomHorizontalFlip,
    RandomAdjustSharpness,
    ToTensor
)

checkpoint = 'google/vit-base-patch16-224-in21k'
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)

# transformations on the training set
train_aug_transforms = Compose([
    RandomResizedCrop(size=feature_extractor.size),
    RandomHorizontalFlip(p=0.5),
    RandomAdjustSharpness(sharpness_factor=5, p=0.5),
    ToTensor(),
    Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])

# transformations on the validation/test set
valid_aug_transforms = Compose([
    Resize(size=(feature_extractor.size, feature_extractor.size)),
    ToTensor(),
    Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])

Vision Transformer fine-tuned on Matthijs/snacks dataset

Data augmentation pipeline

Vision Transformer fine-tuned on `Matthijs/snacks` dataset