---
library_name: keras
tags:
- tokenization
---

## TensorFlow Keras implementation of Learning to tokenize in Vision Transformers

Full credits to [Sayak Paul](https://twitter.com/RisingSayak) and [Aritra Roy Gosthipaty](https://twitter.com/ariG23498) for this work.

## Intended uses & limitations

Vision Transformers ([Dosovitskiy et al.](https://arxiv.org/abs/2010.11929)) and many other Transformer-based architectures ([Liu et al.](https://arxiv.org/abs/2103.14030), [Yuan et al.](https://arxiv.org/abs/2101.11986), etc.) have shown strong results in image recognition. The following is a brief overview of the components involved in the Vision Transformer architecture for image classification:

* Extract small patches from input images.
* Linearly project those patches.
* Add positional embeddings to the linear projections.
* Run the projections through a series of Transformer ([Vaswani et al.](https://arxiv.org/abs/1706.03762)) blocks.
* Finally, take the representation from the final Transformer block and attach a classification head.

If we take 224x224 images and extract 16x16 patches, we get a total of 196 patches (also called tokens) per image (the patch-extraction sketch at the end of this card makes this concrete). The number of patches grows with the resolution, leading to a higher memory footprint. Could we use a reduced number of patches without compromising performance? Ryoo et al. investigate this question in [TokenLearner: Adaptive Space-Time Tokenization for Videos](https://openreview.net/forum?id=z-l1kpDXs88). They introduce a novel module called **TokenLearner** that can adaptively reduce the number of patches used by a Vision Transformer (ViT); a sketch of the module is included at the end of this card. With TokenLearner incorporated into the standard ViT architecture, they are able to reduce the amount of compute (measured in FLOPS) used by the model.

In this example, we implement the TokenLearner module and demonstrate its performance with a mini ViT on the CIFAR-10 dataset. We make use of the following references:

* [Official TokenLearner code](https://github.com/google-research/scenic/blob/main/scenic/projects/token_learner/model.py)
* [Image Classification with ViTs on keras.io](https://keras.io/examples/vision/image_classification_with_vision_transformer/)
* [TokenLearner slides from NeurIPS 2021](https://nips.cc/media/neurips-2021/Slides/26578.pdf)

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (values rounded from their float32 representations):

| name  | learning_rate | decay | beta_1 | beta_2 | epsilon | amsgrad | weight_decay | exclude_from_weight_decay | training_precision |
|-------|---------------|-------|--------|--------|---------|---------|--------------|---------------------------|--------------------|
| AdamW | 0.001         | 0.0   | 0.9    | 0.999  | 1e-07   | False   | 1e-04        | None                      | float32            |
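
### Patch extraction sketch

To make the patch arithmetic in the overview concrete, here is a minimal sketch of patch extraction using `tf.image.extract_patches`. It is illustrative only: this checkpoint is a mini ViT trained on CIFAR-10, not on 224x224 inputs.

```python
import tensorflow as tf

# A batch with one 224x224 RGB image (random values for illustration).
images = tf.random.uniform((1, 224, 224, 3))

# Extract non-overlapping 16x16 patches.
patches = tf.image.extract_patches(
    images=images,
    sizes=[1, 16, 16, 1],
    strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)  # -> (1, 14, 14, 16 * 16 * 3)

# Flatten the 14x14 grid into a token sequence:
# (224 // 16) ** 2 = 196 tokens, each a 768-dimensional patch vector.
tokens = tf.reshape(patches, (images.shape[0], -1, patches.shape[-1]))
print(tokens.shape)  # (1, 196, 768)
```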
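
### TokenLearner module sketch

The TokenLearner module itself can be sketched as follows, adapted from the official code and the keras.io references above. Each of the S learned tokens is produced by a sigmoid spatial attention map (computed here by a small stack of convolutions) followed by spatial average pooling. The number of tokens (8), the depth of the conv stack, and the feature shapes are illustrative assumptions, not necessarily the exact settings of this checkpoint.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_TOKENS = 8  # assumption: the paper typically uses 8 or 16 learned tokens


def token_learner(inputs, number_of_tokens=NUM_TOKENS):
    # inputs: a (B, H, W, C) feature map, i.e. the ViT token sequence
    # reshaped back onto its spatial grid.
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)

    # Predict one spatial attention map per learned token.
    attention_maps = keras.Sequential(
        [
            layers.Conv2D(number_of_tokens, (3, 3), activation=tf.nn.gelu,
                          padding="same", use_bias=False),
            layers.Conv2D(number_of_tokens, (3, 3), activation=tf.nn.gelu,
                          padding="same", use_bias=False),
            layers.Conv2D(number_of_tokens, (3, 3), activation=tf.nn.gelu,
                          padding="same", use_bias=False),
            layers.Conv2D(number_of_tokens, (3, 3), activation="sigmoid",
                          padding="same", use_bias=False),
            layers.Reshape((-1, number_of_tokens)),  # (B, H*W, S)
            layers.Permute((2, 1)),                  # (B, S, H*W)
        ]
    )(x)

    # Weight the input features by each attention map, then average over
    # space to obtain S output tokens of dimension C.
    num_channels = inputs.shape[-1]
    inputs = layers.Reshape((1, -1, num_channels))(inputs)  # (B, 1, H*W, C)
    attended = attention_maps[..., tf.newaxis] * inputs     # (B, S, H*W, C)
    return tf.reduce_mean(attended, axis=2)                 # (B, S, C)


# Example: reduce a 14x14 grid of 64-dimensional tokens to 8 learned tokens.
feature_map = tf.random.normal((2, 14, 14, 64))
print(token_learner(feature_map).shape)  # (2, 8, 64)
```

Every Transformer block placed after the module then operates on S tokens instead of H*W, which is where the FLOPS savings come from.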
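
### Optimizer configuration sketch

For completeness, the optimizer row in the hyperparameter table corresponds to the configuration below. The snippet assumes the AdamW implementation from TensorFlow Addons (`tfa.optimizers.AdamW`), which is what the keras.io example linked above uses; it is a sketch of how to reproduce the settings, not the original training script.

```python
import tensorflow_addons as tfa

# Assumption: AdamW from TensorFlow Addons, configured with the (rounded)
# values from the hyperparameter table above.
optimizer = tfa.optimizers.AdamW(
    learning_rate=1e-3,
    weight_decay=1e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    amsgrad=False,
)
```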