metadata

license: mit
metrics:
  - accuracy
  - f1
pipeline_tag: image-classification

Recently, someone asked me whether you can classify dog images into their respective breeds instead of just telling cats from dogs, as in my last notebook. I say YES!

Due to the complexity of the problem, we will be using one of the most advanced computer vision architectures, the Vision Transformer (ViT), introduced in a 2020 Google paper.

The difference between the Vision Transformer and the traditional Convolutional Neural Network (CNN) is how each treats an image. A Vision Transformer splits the input image into fixed-size patches, say 16 x 16 pixels, and feeds them into the Transformer as a sequence, adding positional embeddings and relating the patches to one another through self-attention. A CNN instead slides convolution and pooling layers over the image, which build in inductive biases such as locality and translation invariance. What this means is that the Vision Transformer can use its own judgement to attend to any patch of the image in a global fashion through its self-attention mechanism, without us having to guide the network, as we often do with a CNN, by centering, cropping, or drawing bounding boxes on our images to help its convolutions.
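To make the patch-and-attend idea concrete, here is a minimal NumPy sketch, not the actual ViT implementation. It cuts a 224x224 image into 16x16 patches, adds position embeddings, and runs one single-head self-attention pass. The random weights, the small `embed_dim`, and the helper names `patchify` and `self_attention` are all assumptions made for illustration; a real ViT also prepends a classification token and stacks many multi-head layers.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all patches
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # random stand-in for a 224x224 RGB photo
patches = patchify(image)          # 14 x 14 = 196 patches of 16 * 16 * 3 values

embed_dim = 64  # hypothetical size, kept small for the demo
w_embed = rng.normal(size=(patches.shape[1], embed_dim)) * 0.02
pos_embed = rng.normal(size=(patches.shape[0], embed_dim)) * 0.02
tokens = patches @ w_embed + pos_embed  # patch embedding + position embedding

w_q, w_k, w_v = (rng.normal(size=(embed_dim, embed_dim)) * 0.1 for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
print(patches.shape, attn.shape)
```

Each row of `attn` holds one patch's attention weights over all 196 patches, which is the "global" view described above: nothing restricts a patch to its local neighbourhood the way a convolution kernel does.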

This makes the Vision Transformer architecture more flexible and scalable, allowing us to create foundation models in computer vision, similar to NLP foundation models like BERT and GPT. Such models are pre-trained, with or without supervision, on massive amounts of image data and can then generalize to different computer vision tasks such as image classification, recognition, and segmentation. This cross-pollination helps us move closer towards the goal of Artificial General Intelligence.

One thing about Vision Transformers is that they have weaker inductive biases than Convolutional Neural Networks, which is what enables their scalability and flexibility. This feature (or bug, depending on who you ask) means that most well-performing pre-trained models need more data, despite having fewer parameters than their CNN counterparts.

Luckily, for this model we will use a Vision Transformer from Google, hosted on Hugging Face, that was pre-trained on the ImageNet-21k dataset (14 million images, 21k classes) with 16x16 patches at 224x224 resolution, which lets us bypass that data limitation. We will fine-tune this model on our "small" dog breeds dataset of around 20 thousand images, the Stanford Dogs dataset imported into Kaggle by Jessica Li, to classify dog images into 120 dog breeds!
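As a rough sketch of what that setup could look like with the `transformers` library: the checkpoint name `google/vit-base-patch16-224-in21k` is an assumption (the text above does not name the exact checkpoint), the breed names are placeholders, and `make_label_maps` and `build_model` are hypothetical helpers, not the code used to train this model.

```python
def make_label_maps(breeds):
    """Build the id <-> label dictionaries that a classification head uses."""
    id2label = {i: name for i, name in enumerate(breeds)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id

def build_model(breeds, checkpoint="google/vit-base-patch16-224-in21k"):
    """Load a pre-trained ViT and attach a fresh head with one output per breed.

    The checkpoint name is an assumption, chosen to match the description
    (ImageNet-21k, 16x16 patches, 224x224 resolution).
    """
    # Imported lazily so the label helper above works without transformers.
    from transformers import ViTForImageClassification, ViTImageProcessor

    id2label, label2id = make_label_maps(breeds)
    processor = ViTImageProcessor.from_pretrained(checkpoint)
    model = ViTForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(breeds),
        id2label=id2label,
        label2id=label2id,
    )
    return processor, model

# Placeholder names; the real ones would come from the Stanford Dogs folders.
breeds = [f"breed_{i:03d}" for i in range(120)]
id2label, label2id = make_label_maps(breeds)
print(len(id2label), id2label[0])
```

Calling `processor, model = build_model(breeds)` downloads the pre-trained weights and should give a model whose classification layer has 120 outputs; that new layer starts untrained, which is why the fine-tuning step on the dog images is still needed.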