Update README.md
README.md (changed)
```diff
@@ -5,9 +5,13 @@ tags:
 - vision
 ---
 
+# Compact Convolutional Transformers
+
 ## Model description
 
-More information needed
+As discussed in the [Vision Transformers (ViT)](https://arxiv.org/abs/2010.11929) paper, a Transformer-based architecture for vision typically requires a larger dataset than usual, as well as a longer pre-training schedule. ImageNet-1k (which has about a million images) is considered to fall under the medium-sized data regime with respect to ViTs. This is primarily because, unlike CNNs, ViTs (or a typical Transformer-based architecture) do not have well-informed inductive biases (such as convolutions for processing images). This raises the question: can't we combine the benefits of convolution and the benefits of Transformers in a single network architecture? These benefits include parameter efficiency and self-attention for processing long-range and global dependencies (interactions between different regions in an image).
+
+In [Escaping the Big Data Paradigm with Compact Transformers](https://arxiv.org/abs/2104.05704), Hassani et al. present an approach for doing exactly this. They propose the Compact Convolutional Transformer (CCT) architecture. This example is an implementation of CCT.
 
 ## Intended uses & limitations
 
@@ -15,7 +19,7 @@ More information needed
 
 ## Training and evaluation data
 
-More information needed
+The model is trained using the CIFAR-10 dataset.
 
 ## Training procedure
 
```
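For context on what the new model description refers to, below is a minimal sketch of the CCT idea in Keras (the framework the original keras-io example uses). All hyperparameters here (`PROJECTION_DIM`, depth, the two-layer conv stem) are illustrative assumptions, not the configuration of this checkpoint. The two ingredients that distinguish CCT from ViT are a small convolutional tokenizer in place of non-overlapping patch slicing, and a learned "sequence pooling" in place of a [CLS] token.

```python
# Minimal CCT-style sketch in Keras. Shapes and hyperparameters are
# assumptions for illustration; they are not this checkpoint's config.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10      # CIFAR-10
PROJECTION_DIM = 128  # assumed token width
NUM_HEADS = 2         # assumed
NUM_LAYERS = 2        # assumed

def conv_tokenizer(x):
    """Convolutional tokenizer: overlapping convs + pooling build the
    token sequence, giving the model a convolutional inductive bias."""
    for filters in (64, PROJECTION_DIM):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
    # Flatten the spatial grid into a token sequence: (batch, seq, dim).
    return layers.Reshape((-1, PROJECTION_DIM))(x)

inputs = keras.Input(shape=(32, 32, 3))
tokens = conv_tokenizer(inputs)

# Standard pre-norm Transformer encoder blocks over the conv tokens.
for _ in range(NUM_LAYERS):
    x1 = layers.LayerNormalization()(tokens)
    attn = layers.MultiHeadAttention(
        num_heads=NUM_HEADS, key_dim=PROJECTION_DIM // NUM_HEADS
    )(x1, x1)
    tokens = layers.Add()([tokens, attn])
    x2 = layers.LayerNormalization()(tokens)
    mlp = layers.Dense(PROJECTION_DIM * 2, activation="gelu")(x2)
    mlp = layers.Dense(PROJECTION_DIM)(mlp)
    tokens = layers.Add()([tokens, mlp])

# Sequence pooling: an attention-weighted average of all tokens
# replaces ViT's [CLS] token.
scores = layers.Dense(1)(layers.LayerNormalization()(tokens))  # (batch, seq, 1)
weights = layers.Softmax(axis=1)(scores)
pooled = tf.reduce_sum(weights * tokens, axis=1)               # (batch, dim)

outputs = layers.Dense(NUM_CLASSES)(pooled)
model = keras.Model(inputs, outputs)
model.summary()
```

Because the tokenizer's convolutions already encode locality, the paper reports that CCT trains well on small datasets and can work with reduced or no positional embeddings, which is the point of "escaping the big data paradigm".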
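The training-data section only names CIFAR-10. For reference, the dataset ships with Keras; the card does not state the exact preprocessing used for this checkpoint, so the snippet below only shows loading.

```python
from tensorflow import keras

# CIFAR-10 as distributed with Keras: 50,000 train / 10,000 test
# images of shape (32, 32, 3) with integer labels in [0, 10).
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 1)
```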