EdoAbati committed on
Commit
1e90e58
1 Parent(s): 9953f90

Added 'intended uses & limitations'

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -11,15 +11,16 @@ tags:
 
  As discussed in the [Vision Transformers (ViT)](https://arxiv.org/abs/2010.11929) paper, a Transformer-based architecture for vision typically requires a larger dataset than usual, as well as a longer pre-training schedule. ImageNet-1k (which has about a million images) is considered to fall under the medium-sized data regime with respect to ViTs. This is primarily because, unlike CNNs, ViTs (or a typical Transformer-based architecture) do not have well-informed inductive biases (such as convolutions for processing images). This begs the question: can't we combine the benefits of convolution and the benefits of Transformers in a single network architecture? These benefits include parameter-efficiency, and self-attention to process long-range and global dependencies (interactions between different regions in an image).
 
- In [Escaping the Big Data Paradigm with Compact Transformers](https://arxiv.org/abs/2104.05704), Hassani et al. present an approach for doing exactly this. They proposed the Compact Convolutional Transformer (CCT) architecture. This example is an implementation of CCT.
+ In [Escaping the Big Data Paradigm with Compact Transformers](https://arxiv.org/abs/2104.05704), Hassani et al. present an approach for doing exactly this. They proposed the Compact Convolutional Transformer (CCT) architecture.
 
  ## Intended uses & limitations
 
- More information needed
+ - In the original paper, the authors use _AutoAugment_ to induce stronger regularization. In this example, the standard geometric augmentations (like random cropping and flipping) are used.
+ - The CCT model was trained for 30 epochs. Its plot in the 'Training Metrics' tab shows no signs of overfitting, which suggests that the network can be trained for longer (perhaps with a bit more regularization) to obtain better performance. Performance may be improved further with additional recipes such as a cosine decay learning rate schedule and other data augmentation techniques like AutoAugment, MixUp, or CutMix.
 
  ## Training and evaluation data
 
- The model is trained using the CIFAR-10 dataset.
+ The model is trained using the CIFAR-10 dataset. 10% of the data is used for validation.
 
  ## Training procedure
 
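
The added limitations text names a cosine decay learning rate schedule as one possible improvement. As a rough illustration only, the sketch below computes such a schedule in plain Python. The function name `cosine_decay_lr`, the base learning rate of `1e-3`, and the figure of 100 steps per epoch are assumptions made for this example and do not come from the README; only the 30-epoch count is taken from the text above.

```python
import math


def cosine_decay_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Return the learning rate at `step` under a cosine decay schedule.

    The rate starts at `base_lr` (assumed value, not from the README) and
    decays smoothly to `min_lr` over `total_steps` optimizer steps.
    """
    # Clamp the step so that values outside the schedule stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    # Half a cosine period: 1.0 at progress 0, 0.0 at progress 1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    # 30 epochs as in the README; 100 steps per epoch is a hypothetical value.
    total = 30 * 100
    for step in (0, total // 4, total // 2, total):
        print(step, cosine_decay_lr(step, total))
```

In practice a training framework's built-in scheduler would likely be used instead; this version is only meant to show the shape of the decay.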