This model is a TensorFlow port of DINO ViT B-16. The backbone was pre-trained with the DINO pretext task. Its head layer was then trained with the backbone kept frozen. The ImageNet-1k dataset was used for training. You can refer to this notebook to see how the porting was done.
 Emerging Properties in Self-Supervised Vision Transformers: https://arxiv.org/abs/2104.14294
 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929
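
The second training stage described above (training only the head on top of a frozen backbone) can be illustrated with a minimal Keras sketch. This is not the original training code: the tiny `build_toy_backbone` network is a hypothetical stand-in for the pre-trained DINO ViT B-16 backbone, and the 10-class head, random data, input size, and SGD settings are assumptions chosen only to keep the example small and self-contained.

```python
import numpy as np
import tensorflow as tf


def build_toy_backbone(feature_dim=32):
    """Hypothetical stand-in for the pre-trained backbone (not a real ViT)."""
    return tf.keras.Sequential(
        [
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(feature_dim, activation="relu"),
        ],
        name="toy_backbone",
    )


backbone = build_toy_backbone()
# Freeze the backbone so that only the head receives gradient updates.
backbone.trainable = False

inputs = tf.keras.Input(shape=(16, 16, 3))
features = backbone(inputs, training=False)
# Linear classification head; 10 classes is an arbitrary choice for the sketch.
outputs = tf.keras.layers.Dense(10, name="head")(features)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Random data standing in for images and labels.
rng = np.random.default_rng(0)
images = rng.random((64, 16, 16, 3)).astype("float32")
labels = rng.integers(0, 10, size=(64,))

# Snapshot the weights so the effect of freezing can be checked after training.
backbone_before = [w.numpy().copy() for w in backbone.weights]
head_before = [w.numpy().copy() for w in model.get_layer("head").weights]

model.fit(images, labels, epochs=1, batch_size=16, verbose=0)

backbone_after = [w.numpy() for w in backbone.weights]
head_after = [w.numpy() for w in model.get_layer("head").weights]
```

After `fit`, comparing the snapshots shows that the backbone weights are unchanged while the head weights have moved, which is the behavior expected from a frozen-backbone setup.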