This model contains just the `IPUConfig` files for running the ViT base model.

**This model contains no model weights, only an IPUConfig.**

## Model description

The Vision Transformer (ViT) is an image recognition model that applies a Transformer architecture, of the kind widely used for NLP pretraining, over patches of the image.

It uses a standard Transformer encoder as used in NLP. This simple yet scalable strategy works surprisingly well when coupled with pre-training on large amounts of data and transferred to multiple image recognition benchmarks, while requiring substantially fewer computational resources to train.
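The patch mechanism described above can be sketched in a few lines; this is an illustrative NumPy example (not part of the model card) showing how an image is split into 16x16 patches, each flattened into one token vector:

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch_size x patch_size patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Rearrange into a grid of patches: (rows, cols, patch_size, patch_size, C).
    patches = image.reshape(
        h // patch_size, patch_size, w // patch_size, patch_size, c
    ).transpose(0, 2, 1, 3, 4)
    # Each patch becomes one "token" of length patch_size * patch_size * C.
    return patches.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image yields (224/16)**2 = 196 tokens of dimension 16*16*3 = 768.
tokens = image_to_patches(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

These token vectors are then linearly projected and fed to the Transformer encoder, exactly as word embeddings would be in NLP.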

Paper link: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf)

## Usage

The `IPUConfig` can be loaded through the `optimum-graphcore` library; the repository name below is illustrative, as it is not shown in this model card:

```python
from optimum.graphcore import IPUConfig

# The repository name here is illustrative.
ipu_config = IPUConfig.from_pretrained("Graphcore/vit-base-ipu")
```