ybelkada committed
Commit 6b9e8f1
1 Parent(s): e449556

Update README.md

Files changed (1)
  1. README.md +0 -4
README.md CHANGED
@@ -8,10 +8,6 @@ datasets:
 ---
 
 # Vision Transformer (base-sized model) - Hybrid
-| ![Pull figure](https://s3.amazonaws.com/moonup/production/uploads/1670350379252-62441d1d9fdefb55a0b7d12c.png) |
-|:--:|
-| <b>Figure 1 from the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929), describing the model's architecture</b> |
-
 
 The hybrid Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. It is the first paper that successfully trains a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures. ViT hybrid is a slight variant of the [plain Vision Transformer](vit) that leverages a convolutional backbone (specifically, [BiT](bit)) whose features are used as initial "tokens" for the Transformer.
 
 
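For context, the paragraph in the diff describes how the hybrid variant replaces the plain ViT patch embedding with features from a BiT convolutional backbone. Below is a minimal inference sketch, assuming the `google/vit-hybrid-base-bit-384` checkpoint (the checkpoint name is an assumption, not stated in this diff) and the `ViTHybrid*` classes shipped in recent versions of `transformers`:

```python
# Minimal inference sketch (not part of this commit). The checkpoint name is
# an assumption; the ViTHybrid* classes come from the transformers library.
from PIL import Image
import requests
from transformers import ViTHybridImageProcessor, ViTHybridForImageClassification

# Load a sample image (the COCO cats image commonly used in model cards).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor resizes and normalizes the image; inside the model, the BiT
# backbone produces a feature map that is fed to the Transformer encoder
# as the initial token sequence.
processor = ViTHybridImageProcessor.from_pretrained("google/vit-hybrid-base-bit-384")
model = ViTHybridForImageClassification.from_pretrained("google/vit-hybrid-base-bit-384")

inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits

# The classification head predicts one of the 1,000 ImageNet classes.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

The pipeline is identical to plain ViT from the outside; the difference is internal, where the initial tokens come from the BiT convolutional feature map rather than from raw 16x16 pixel patches.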