---
library_name: tf-keras
---

## Model description
**This model is an implementation of the distillation recipe proposed in DeiT.**   
Visit Keras example on [Distilling Vision Transformers](https://keras.io/examples/vision/deit/).   
   
Full credits to: [Sayak Paul](https://twitter.com/RisingSayak)   
   
In the original Vision Transformers (ViT) paper (Dosovitskiy et al.), the authors concluded that to perform on par with Convolutional Neural Networks (CNNs), ViTs need to be pre-trained on larger datasets. The larger the better. This is mainly due to the lack of inductive biases in the ViT architecture -- unlike CNNs, they don't have layers that exploit locality.  
   
Many groups have proposed ways to reduce the data requirements of ViT training. One such approach was presented in the Data-efficient image Transformers (DeiT) paper (Touvron et al.). The authors introduced a distillation technique that is specific to transformer-based vision models. DeiT is among the first works to show that it's possible to train ViTs well without using larger datasets.
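
The hard-label distillation objective described in the DeiT paper can be sketched in a few lines. This is a minimal, framework-free illustration and not the code used to train this model: the function names (`softmax`, `cross_entropy`, `hard_distillation_loss`) and the example logits are hypothetical, and it assumes the student exposes two heads (a class head and a distillation head) while the teacher supplies its arg-max prediction as a hard label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer class label."""
    return -math.log(softmax(logits)[label])


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, true_label):
    """Hard-label distillation loss for a single example (illustrative).

    The class head is supervised by the ground-truth label, the
    distillation head by the teacher's arg-max prediction, and the
    two cross-entropy terms are averaged with equal weight.
    """
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    student_loss = cross_entropy(cls_logits, true_label)
    distill_loss = cross_entropy(dist_logits, teacher_label)
    return 0.5 * student_loss + 0.5 * distill_loss


# Hypothetical logits for one image over 5 classes (TF Flowers has 5 classes).
cls_logits = [2.0, 0.1, -1.0, 0.3, 0.0]
dist_logits = [0.2, 1.5, -0.5, 0.0, 0.1]
teacher_logits = [0.5, 3.0, -2.0, 0.0, 0.2]  # teacher predicts class 1
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, true_label=0)
print(f"hard distillation loss: {loss:.4f}")
```

In this sketch the two heads can be pulled toward different targets when the teacher disagrees with the ground truth, which is the behaviour the separate distillation head is meant to allow.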

## Intended uses & limitations

The model was trained for demonstration purposes and is not guaranteed to give the best results in production.   
For better results, follow the [Keras example](https://keras.io/examples/vision/deit/) and adapt it to your needs.

## Training and evaluation data

The model is trained and evaluated on the [TF Flowers dataset](https://www.tensorflow.org/datasets/catalog/tf_flowers).

## Training procedure

The training procedure follows the [Keras example](https://keras.io/examples/vision/deit/) exactly, with one exception.   
The batch size is reduced from the original 256 to 16 so that the model fits in the memory of a single V100 GPU.

### Training hyperparameters

The following hyperparameters were used during training:

| name | learning_rate | decay | beta_1 | beta_2 | epsilon | amsgrad | weight_decay | exclude_from_weight_decay | training_precision |
|----|-------------|-----|------|------|-------|-------|------------|-------------------------|------------------|
|AdamW|6.25000029685907e-05|0.0|0.8999999761581421|0.9990000128746033|1e-07|False|9.999999747378752e-05|None|float32|
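
The long decimals in the table appear to be float32 values printed at double precision. The short sketch below, using only the standard library, shows how round values such as 0.9 or 1e-4 turn into those strings. The `to_float32` helper and the `nominal` dictionary are hypothetical names introduced for illustration, and the round values are an assumption inferred from the table, not stated in the original card.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Assumed round values behind the serialized hyperparameters.
nominal = {
    "learning_rate": 6.25e-05,
    "beta_1": 0.9,
    "beta_2": 0.999,
    "weight_decay": 1e-04,
}

for name, value in nominal.items():
    # repr() shows the float32 value with full double-precision digits.
    print(f"{name}: {value!r} -> {to_float32(value)!r}")
```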

## Model Plot

<details>
<summary>View Model Plot</summary>

![Model Image](./model.png)

</details>