Model description

This model is an implementation of the distillation recipe proposed in DeiT.
See the Keras example Distilling Vision Transformers for a full walkthrough.

Full credits go to Sayak Paul.

In the original Vision Transformers (ViT) paper (Dosovitskiy et al.), the authors concluded that to perform on par with Convolutional Neural Networks (CNNs), ViTs need to be pre-trained on larger datasets. The larger the better. This is mainly due to the lack of inductive biases in the ViT architecture -- unlike CNNs, they don't have layers that exploit locality.

Many groups have proposed ways to deal with the data-intensiveness of ViT training. One such way was shown in the Data-efficient Image Transformers (DeiT) paper (Touvron et al.). The authors introduced a distillation technique that is specific to transformer-based vision models. DeiT was among the first works to show that it is possible to train ViTs well without very large pre-training datasets.
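
To make the recipe concrete, below is a minimal sketch of DeiT's "hard" distillation objective, in which the student carries an extra distillation token whose logits are trained to match the teacher's hard predictions. The identifiers (`student`, `teacher`, `cls_logits`, `dist_logits`) are illustrative assumptions, not the exact names used in the Keras example.

```python
import tensorflow as tf

# Sketch of DeiT's hard-label distillation loss (hypothetical identifiers).
cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def hard_distillation_loss(images, labels, student, teacher):
    # The student emits two sets of logits: one from the class token
    # and one from the dedicated distillation token.
    cls_logits, dist_logits = student(images, training=True)

    # The teacher's hard predictions serve as targets for the
    # distillation head.
    teacher_labels = tf.argmax(teacher(images, training=False), axis=-1)

    # Average the ground-truth loss and the teacher-matching loss.
    return 0.5 * cce(labels, cls_logits) + 0.5 * cce(teacher_labels, dist_logits)
```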

Intended uses & limitations

The model is trained for demonstration purposes and is not guaranteed to give the best results in production.
For better results, follow the Keras example and tune it to your needs.
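
For quick experimentation, the model can be loaded straight from the Hub; a minimal sketch, assuming the `huggingface_hub` package with its Keras mixin is installed:

```python
from huggingface_hub import from_pretrained_keras

# Download and deserialize the Keras model from the Hub.
model = from_pretrained_keras("keras-io/deit")
model.summary()
```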

Training and evaluation data

The model is trained and evaluated on the TF Flowers dataset.
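
A minimal sketch of loading the dataset with TensorFlow Datasets; the 90/10 split shown here is an illustrative assumption, not necessarily the split used in the example:

```python
import tensorflow_datasets as tfds

# TF Flowers ships only a single "train" split, so validation data
# is carved out of it.
train_ds, val_ds = tfds.load(
    "tf_flowers",
    split=["train[:90%]", "train[90%:]"],
    as_supervised=True,  # yields (image, label) pairs
)
```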

Training procedure

The training procedure follows the Keras example exactly, except that the batch size is reduced from the original 256 to 16 so that the model fits in the memory of a single V100 GPU.

Training hyperparameters

The following hyperparameters were used during training:

| Hyperparameter | Value |
| --- | --- |
| name | AdamW |
| learning_rate | 6.25000029685907e-05 |
| decay | 0.0 |
| beta_1 | 0.8999999761581421 |
| beta_2 | 0.9990000128746033 |
| epsilon | 1e-07 |
| amsgrad | False |
| weight_decay | 9.999999747378752e-05 |
| exclude_from_weight_decay | None |
| training_precision | float32 |
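
In rounded form, these settings correspond to an AdamW optimizer configured roughly as below. This is a sketch using the AdamW optimizer built into recent Keras releases; the exact optimizer class used by the example may differ (e.g. the TensorFlow Addons variant).

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.AdamW(
    learning_rate=6.25e-5,  # logged as 6.25000029685907e-05
    weight_decay=1e-4,      # logged as 9.999999747378752e-05
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    amsgrad=False,
)
```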

Model Plot

(A plot of the model architecture is available in the model repository.)
