## Model description
This model is an implementation of the distillation recipe proposed in DeiT.
Visit the Keras example on [Distilling Vision Transformers](https://keras.io/examples/vision/deit/).
Full credits to: Sayak Paul
In the original Vision Transformers (ViT) paper (Dosovitskiy et al.), the authors concluded that ViTs need to be pre-trained on larger datasets to perform on par with Convolutional Neural Networks (CNNs); the larger, the better. This is mainly due to the lack of inductive biases in the ViT architecture: unlike CNNs, ViTs don't have layers that exploit locality.
Many groups have proposed different ways to deal with the data-intensiveness of ViT training. One such way was shown in the Data-efficient image Transformers (DeiT) paper (Touvron et al.). The authors introduced a distillation technique specific to transformer-based vision models. DeiT is among the first works to show that it's possible to train ViTs well without larger datasets.
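In a nutshell, DeiT prepends a learnable distillation token to the patch sequence; its output head is supervised by a strong convnet teacher rather than the ground-truth labels. The snippet below is a minimal sketch of the hard-distillation objective, not the exact code from the example; the function and tensor names are illustrative.

```python
import tensorflow as tf

def hard_distillation_loss(labels, cls_logits, dist_logits, teacher_logits):
    """Hard-distillation loss as described in the DeiT paper (a sketch).

    `cls_logits` come from the class token, `dist_logits` from the
    distillation token, and `teacher_logits` from a frozen CNN teacher.
    """
    ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    # The classification head learns from the ground-truth labels.
    student_loss = ce(labels, cls_logits)
    # The distillation head learns from the teacher's hard (argmax) predictions.
    teacher_labels = tf.argmax(teacher_logits, axis=-1)
    distillation_loss = ce(teacher_labels, dist_logits)
    # DeiT weights the two terms equally in the hard-distillation setting.
    return 0.5 * student_loss + 0.5 * distillation_loss
```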
## Intended uses & limitations
The model is trained for demonstrative purposes and does not guarantee the best results in production.
For better results, follow and tune the Keras example to your needs.
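If you simply want to try the model, it can be loaded straight from the Hub. A minimal sketch, assuming the `huggingface_hub` Keras integration; the repo id and the 224x224 input resolution are assumptions, so check this model page for the actual values.

```python
import tensorflow as tf
from huggingface_hub import from_pretrained_keras

# Repo id is a placeholder; use this model's actual id on the Hub.
model = from_pretrained_keras("keras-io/deit")

# A dummy batch of RGB images; the 224x224 resolution is an assumption.
images = tf.random.uniform((1, 224, 224, 3))
probs = model.predict(images)
```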
## Training and evaluation data
The model is trained and evaluated on the TF Flowers dataset.
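TF Flowers ships as a single `train` split with five classes, so a validation split has to be carved out manually. A minimal loading sketch via TensorFlow Datasets; the 90/10 split fractions here are illustrative, not necessarily the ones used:

```python
import tensorflow_datasets as tfds

# tf_flowers has only a "train" split and 5 flower categories.
train_ds, val_ds = tfds.load(
    "tf_flowers",
    split=["train[:90%]", "train[90%:]"],
    as_supervised=True,  # yields (image, label) pairs
)
```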
## Training procedure
The training procedure follows the Keras example exactly, with one exception: the batch size is reduced from the original 256 to 16 so that the model fits in the memory of a single V100 GPU.
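Continuing from the loading sketch above, the reduced batch size plugs into the input pipeline as sketched below; the preprocessing step is abbreviated, and the Keras example defines the full resize/augmentation pipeline.

```python
import tensorflow as tf

BATCH_SIZE = 16  # reduced from 256 to fit in a single V100's memory

def preprocess(image, label):
    # The 224x224 resolution is an assumption; the Keras example
    # defines the exact resize/augmentation pipeline.
    image = tf.image.resize(image, (224, 224))
    return tf.cast(image, tf.float32) / 255.0, label

# `train_ds` is the tf_flowers split from the loading sketch above.
train_ds = (
    train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)
```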
### Training hyperparameters
The following hyperparameters were used during training:
| name | learning_rate | decay | beta_1 | beta_2 | epsilon | amsgrad | weight_decay | exclude_from_weight_decay | training_precision |
|---|---|---|---|---|---|---|---|---|---|
| AdamW | 6.25e-05 | 0.0 | 0.9 | 0.999 | 1e-07 | False | 1e-04 | None | float32 |
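For reference, the table corresponds to the following optimizer construction (values rounded from their float32 representations). The fields match the signature of `tfa.optimizers.AdamW` from TensorFlow Addons, which is shown here as a plausible match; treat that choice as an assumption.

```python
import tensorflow_addons as tfa

# Hyperparameters from the table above, rounded to their intended values.
optimizer = tfa.optimizers.AdamW(
    learning_rate=6.25e-5,
    weight_decay=1e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    amsgrad=False,
)
```

Note that 6.25e-05 is consistent with linearly scaling a base learning rate of 1e-3 by the reduced batch size (1e-3 × 16 / 256).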
## Model Plot

*(The model architecture plot is not reproduced here.)*