---
library_name: keras
tags:
- tokenization
---

## TensorFlow Keras implementation of Learning to tokenize in Vision Transformers

Full credits to [Sayak Paul](https://twitter.com/RisingSayak) and [Aritra Roy Gosthipaty](https://twitter.com/ariG23498) for this work.

## Intended uses & limitations

Vision Transformers ([Dosovitskiy et al.](https://arxiv.org/abs/2010.11929)) and many other Transformer-based architectures ([Liu et al.](https://arxiv.org/abs/2103.14030), [Yuan et al.](https://arxiv.org/abs/2101.11986), etc.) have shown strong results in image recognition. The following is a brief overview of the components involved in the Vision Transformer architecture for image classification:

* Extract small patches from input images.
* Linearly project those patches.
* Add positional embeddings to the linear projections.
* Run the projections through a series of Transformer ([Vaswani et al.](https://arxiv.org/abs/1706.03762)) blocks.
* Finally, take the representation from the final Transformer block and attach a classification head.

If we take 224x224 images and extract 16x16 patches, we get a total of 196 patches (also called tokens) per image (the patch-extraction sketch at the end of this card makes this concrete). The number of patches grows with the resolution, leading to a higher memory footprint. Could we use a reduced number of patches without compromising performance? Ryoo et al. investigate this question in [TokenLearner: Adaptive Space-Time Tokenization for Videos](https://openreview.net/forum?id=z-l1kpDXs88). They introduce a novel module called **TokenLearner** that can adaptively reduce the number of patches used by a Vision Transformer (ViT); a sketch of the module is included at the end of this card. With TokenLearner incorporated into the standard ViT architecture, they are able to reduce the amount of compute (measured in FLOPS) used by the model.

In this example, we implement the TokenLearner module and demonstrate its performance with a mini ViT on the CIFAR-10 dataset. We make use of the following references:

* [Official TokenLearner code](https://github.com/google-research/scenic/blob/main/scenic/projects/token_learner/model.py)
* [Image Classification with ViTs on keras.io](https://keras.io/examples/vision/image_classification_with_vision_transformer/)
* [TokenLearner slides from NeurIPS 2021](https://nips.cc/media/neurips-2021/Slides/26578.pdf)

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (values rounded from their float32 representations):

| name  | learning_rate | decay | beta_1 | beta_2 | epsilon | amsgrad | weight_decay | exclude_from_weight_decay | training_precision |
|-------|---------------|-------|--------|--------|---------|---------|--------------|---------------------------|--------------------|
| AdamW | 0.001         | 0.0   | 0.9    | 0.999  | 1e-07   | False   | 1e-04        | None                      | float32            |
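
### Patch extraction sketch

To make the patch arithmetic in the overview concrete, here is a minimal sketch of patch extraction using `tf.image.extract_patches`. It is illustrative only: this checkpoint is a mini ViT trained on CIFAR-10, not on 224x224 inputs.

```python
import tensorflow as tf

# A batch with one 224x224 RGB image (random values for illustration).
images = tf.random.uniform((1, 224, 224, 3))

# Extract non-overlapping 16x16 patches.
patches = tf.image.extract_patches(
    images=images,
    sizes=[1, 16, 16, 1],
    strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)  # -> (1, 14, 14, 16 * 16 * 3)

# Flatten the 14x14 grid into a token sequence:
# (224 // 16) ** 2 = 196 tokens, each a 768-dimensional patch vector.
tokens = tf.reshape(patches, (images.shape[0], -1, patches.shape[-1]))
print(tokens.shape)  # (1, 196, 768)
```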
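
### TokenLearner module sketch

The TokenLearner module itself can be sketched as follows, adapted from the official code and the keras.io references above. Each of the S learned tokens is produced by a sigmoid spatial attention map (computed here by a small stack of convolutions) followed by spatial average pooling. The number of tokens (8), the depth of the conv stack, and the feature shapes are illustrative assumptions, not necessarily the exact settings of this checkpoint.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_TOKENS = 8  # assumption: the paper typically uses 8 or 16 learned tokens


def token_learner(inputs, number_of_tokens=NUM_TOKENS):
    # inputs: a (B, H, W, C) feature map, i.e. the ViT token sequence
    # reshaped back onto its spatial grid.
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)

    # Predict one spatial attention map per learned token.
    attention_maps = keras.Sequential(
        [
            layers.Conv2D(number_of_tokens, (3, 3), activation=tf.nn.gelu,
                          padding="same", use_bias=False),
            layers.Conv2D(number_of_tokens, (3, 3), activation=tf.nn.gelu,
                          padding="same", use_bias=False),
            layers.Conv2D(number_of_tokens, (3, 3), activation=tf.nn.gelu,
                          padding="same", use_bias=False),
            layers.Conv2D(number_of_tokens, (3, 3), activation="sigmoid",
                          padding="same", use_bias=False),
            layers.Reshape((-1, number_of_tokens)),  # (B, H*W, S)
            layers.Permute((2, 1)),                  # (B, S, H*W)
        ]
    )(x)

    # Weight the input features by each attention map, then average over
    # space to obtain S output tokens of dimension C.
    num_channels = inputs.shape[-1]
    inputs = layers.Reshape((1, -1, num_channels))(inputs)  # (B, 1, H*W, C)
    attended = attention_maps[..., tf.newaxis] * inputs     # (B, S, H*W, C)
    return tf.reduce_mean(attended, axis=2)                 # (B, S, C)


# Example: reduce a 14x14 grid of 64-dimensional tokens to 8 learned tokens.
feature_map = tf.random.normal((2, 14, 14, 64))
print(token_learner(feature_map).shape)  # (2, 8, 64)
```

Every Transformer block placed after the module then operates on S tokens instead of H*W, which is where the FLOPS savings come from.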
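
### Optimizer configuration sketch

For completeness, the optimizer row in the hyperparameter table corresponds to the configuration below. The snippet assumes the AdamW implementation from TensorFlow Addons (`tfa.optimizers.AdamW`), which is what the keras.io example linked above uses; it is a sketch of how to reproduce the settings, not the original training script.

```python
import tensorflow_addons as tfa

# Assumption: AdamW from TensorFlow Addons, configured with the (rounded)
# values from the hyperparameter table above.
optimizer = tfa.optimizers.AdamW(
    learning_rate=1e-3,
    weight_decay=1e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    amsgrad=False,
)
```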