# vit_base_patch16_384
Implementation of the Vision Transformer (ViT) proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf).
The following image from the authors shows the architecture.
![image](https://github.com/FrancescoSaverioZuppichini/glasses/blob/develop/docs/_static/images/ViT.png?raw=true)
``` python
ViT.vit_small_patch16_224()
ViT.vit_base_patch16_224()
ViT.vit_base_patch16_384()
ViT.vit_base_patch32_384()
ViT.vit_huge_patch16_224()
ViT.vit_huge_patch32_384()
ViT.vit_large_patch16_224()
ViT.vit_large_patch16_384()
ViT.vit_large_patch32_384()
```
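The name encodes the model size, patch size, and input resolution: `vit_base_patch16_384` is the base model operating on 16x16 patches of a 384x384 image. A minimal forward pass, assuming the import path below:

``` python
import torch
from glasses.models import ViT

# the input resolution must match the variant, 384x384 here
model = ViT.vit_base_patch16_384()
logits = model(torch.randn((1, 3, 384, 384)))
print(logits.shape)  # torch.Size([1, 1000]) with the default 1000 classes
```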
Examples:
``` python
import torch
from torch import nn

# change activation
ViT.vit_base_patch16_224(activation=nn.SELU)
# change number of classes (default is 1000)
ViT.vit_base_patch16_224(n_classes=100)
# pass a different block, default is TransformerEncoderBlock
ViT.vit_base_patch16_224(block=MyCoolTransformerBlock)
# get features
model = ViT.vit_base_patch16_224()
# first access .features; this activates the forward hooks and tells the model you'd like to get the features
model.encoder.features
model(torch.randn((1, 3, 224, 224)))
# get the features from the encoder
features = model.encoder.features
print([x.shape for x in features])
# [torch.Size([1, 197, 768]), torch.Size([1, 197, 768]), ...]
# to change the tokens, subclass ViTTokens
class MyTokens(ViTTokens):
    def __init__(self, emb_size: int):
        super().__init__(emb_size)
        self.my_new_token = nn.Parameter(torch.randn(1, 1, emb_size))

ViT(tokens=MyTokens)
```
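The `197` in the feature shapes above is the token sequence length for a 224x224 input with 16x16 patches: (224/16)^2 = 196 patch tokens plus one class token, while `768` is the base model's embedding size. A quick sanity check of that arithmetic:

``` python
img_size, patch_size = 224, 16
n_tokens = (img_size // patch_size) ** 2 + 1  # patch tokens + class token
print(n_tokens)  # 197
```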