joheras committed on
Commit
db5e1cd
1 Parent(s): 3a54f67

Create README.md

  1. README.md +65 -0
README.md ADDED
@@ -0,0 +1,65 @@
---
tags:
- image-classification
- keras
license: apache-2.0
---
# Train a Vision Transformer on small datasets

Author: [Jónathan Heras](https://twitter.com/_Jonathan_Heras)

[Keras Blog](https://keras.io/examples/vision/vit_small_ds/) | [Colab Notebook](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/vit_small_ds.ipynb)

In the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929), the authors note that Vision Transformers (ViTs) are data-hungry. In their experiments, a ViT beats state-of-the-art Convolutional Neural Network (CNN) models only when it is pretrained on a large dataset such as JFT-300M and then fine-tuned on a medium-sized dataset such as ImageNet.

The self-attention layer of a ViT lacks locality inductive bias (the notion that image pixels are locally correlated and that their correlation maps are translation-invariant), which is why ViTs need more data. CNNs, on the other hand, look at images through spatial sliding windows, which helps them get better results with smaller datasets.

In the paper [Vision Transformer for Small-Size Datasets](https://arxiv.org/abs/2112.13492v1), the authors set out to tackle the lack of locality inductive bias in ViTs.

The main ideas are:

- Shifted Patch Tokenization
- Locality Self Attention
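
To make the two ideas concrete, here is a minimal NumPy sketch. It is not the code from the Keras example: the function names are hypothetical, the learnable temperature of the paper is passed in as a plain number, and the layer normalization and linear projection that follow tokenization are omitted. Shifted Patch Tokenization concatenates the image with four diagonally shifted copies before cutting patches; Locality Self Attention scales the attention scores by a temperature and masks the diagonal so that a token cannot attend to itself.

```python
import numpy as np


def shifted_patch_tokens(image, patch_size):
    """Concatenate an image with four diagonally shifted copies, then patchify.

    `image` has shape (height, width, channels). Shifts are half a patch wide
    and the vacated border is filled with zeros (an assumption of this sketch).
    """
    height, width, channels = image.shape
    shift = patch_size // 2
    # Zero-pad so that every shifted view can be cut out with plain slicing.
    padded = np.pad(image, ((shift, shift), (shift, shift), (0, 0)))
    offsets = [
        (shift, shift),          # the original image
        (0, 0),                  # content moved down and right
        (0, 2 * shift),          # content moved down and left
        (2 * shift, 0),          # content moved up and right
        (2 * shift, 2 * shift),  # content moved up and left
    ]
    views = [padded[r:r + height, c:c + width] for r, c in offsets]
    stacked = np.concatenate(views, axis=-1)  # (height, width, 5 * channels)

    rows, cols = height // patch_size, width // patch_size
    patches = stacked[:rows * patch_size, :cols * patch_size]
    patches = patches.reshape(rows, patch_size, cols, patch_size, 5 * channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch.
    return patches.reshape(rows * cols, patch_size * patch_size * 5 * channels)


def locality_self_attention(queries, keys, values, temperature):
    """Single-head attention with a temperature and a masked diagonal.

    Needs at least two tokens, because each token ignores itself.
    """
    queries = np.asarray(queries, dtype=float)
    keys = np.asarray(keys, dtype=float)
    scores = queries @ keys.T / temperature
    # Diagonal masking: a token may not attend to itself.
    np.fill_diagonal(scores, -np.inf)
    # Numerically stable softmax over each row.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ np.asarray(values, dtype=float), weights
```

For a 72x72 RGB image and a patch size of 6, `shifted_patch_tokens` returns 144 tokens of length 540 (6 x 6 pixels x 15 channels). A common starting value for `temperature` is the square root of the key dimension.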

# Use the pretrained model

The model is pretrained on the CIFAR-100 dataset with the following hyperparameters:
```python
# DATA
NUM_CLASSES = 100  # CIFAR-100 has 100 classes
INPUT_SHAPE = (32, 32, 3)  # 32x32 RGB images
BUFFER_SIZE = 512
BATCH_SIZE = 256

# AUGMENTATION
IMAGE_SIZE = 72  # side length after resizing
PATCH_SIZE = 6  # side length of each square patch
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 12 * 12 = 144

# OPTIMIZER
LEARNING_RATE = 0.001
WEIGHT_DECAY = 0.0001

# TRAINING
EPOCHS = 50

# ARCHITECTURE
LAYER_NORM_EPS = 1e-6
TRANSFORMER_LAYERS = 8
PROJECTION_DIM = 64
NUM_HEADS = 4
TRANSFORMER_UNITS = [
    PROJECTION_DIM * 2,
    PROJECTION_DIM,
]
MLP_HEAD_UNITS = [
    2048,
    1024,
]
```
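
The patch arithmetic in this configuration can be checked by hand. The sketch below assumes that the 32x32 inputs are resized to `IMAGE_SIZE` x `IMAGE_SIZE` before being cut into patches, as the `AUGMENTATION` group suggests, and that patches are cut from a plain three-channel image; `CHANNELS`, `patches_per_side`, and `values_per_patch` are names introduced here for illustration.

```python
IMAGE_SIZE = 72
PATCH_SIZE = 6
CHANNELS = 3  # last entry of INPUT_SHAPE

# Non-overlapping square patches along one side of the resized image.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
# Total number of patches, i.e. the sequence length seen by the transformer.
num_patches = patches_per_side ** 2
# Number of raw pixel values in one flattened patch.
values_per_patch = PATCH_SIZE * PATCH_SIZE * CHANNELS

print(patches_per_side)  # 12
print(num_patches)       # 144
print(values_per_patch)  # 108
```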
I used the `AdamW` optimizer with a cosine decay learning rate schedule. You can find the entire implementation in the Keras blog post.
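
As a rough illustration of what a cosine decay schedule does, the following sketch computes the learning rate at a given optimizer step. It is a plain cosine decay written for this README, not the schedule class from the blog post, and it leaves out any warm-up phase; `cosine_decay_lr` and `total_steps` are hypothetical names.

```python
import math


def cosine_decay_lr(step, total_steps, base_lr=0.001):
    """Learning rate after `step` optimizer updates under plain cosine decay."""
    # Clamp so that steps outside [0, total_steps] stay at the end points.
    progress = min(max(step, 0), total_steps) / total_steps
    # Falls smoothly from base_lr at step 0 to 0 at total_steps.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

If all 50,000 CIFAR-100 training images were used with a batch size of 256, one epoch would take 196 steps, so `total_steps` would be 196 * 50 = 9800 for 50 epochs. The actual value depends on how the training set is split.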

To use the pretrained model:
```python
from huggingface_hub import from_pretrained_keras
loaded_model = from_pretrained_keras("keras-io/vit_small_ds_v2")
```
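
Once the model is loaded, a prediction could look like the sketch below. It rests on several assumptions that are not stated in this README: that the loaded object exposes the usual Keras `predict` method, that it accepts raw images of shape `(32, 32, 3)` and does any resizing and normalization internally, and that it returns one score per class. `load_model` and `predict_classes` are hypothetical helper names, and loading requires network access plus compatible versions of `huggingface_hub` and TensorFlow/Keras.

```python
import numpy as np

REPO_ID = "keras-io/vit_small_ds_v2"
INPUT_SHAPE = (32, 32, 3)


def load_model(repo_id=REPO_ID):
    """Download the model from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helper below works without huggingface_hub installed.
    from huggingface_hub import from_pretrained_keras

    return from_pretrained_keras(repo_id)


def predict_classes(model, images):
    """Return the predicted class index for each image in a batch."""
    batch = np.asarray(images, dtype="float32")
    if batch.ndim == 3:  # a single image: add the batch axis
        batch = batch[np.newaxis]
    if batch.shape[1:] != INPUT_SHAPE:
        raise ValueError(
            f"expected images of shape {INPUT_SHAPE}, got {batch.shape[1:]}"
        )
    # Assumed output shape: (batch_size, number_of_classes).
    scores = np.asarray(model.predict(batch))
    return scores.argmax(axis=-1)
```

A typical call would be `predict_classes(load_model(), image)`, where `image` is a single CIFAR-100 image or a batch of them. The returned indices refer to the 100 fine-grained CIFAR-100 classes.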