# Vision Transformer (ViT)

Note

This is a recently introduced model, so the API hasn’t been tested extensively. There may be some bugs, and fixing them may require slight breaking changes in the future. If you see something strange, file a GitHub Issue.

## Overview

The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. It’s the first paper that successfully trains a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures.

The abstract from the paper is the following:

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Tips:

• To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.

• The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it is often beneficial to use a higher resolution than pre-training (Touvron et al., 2019), (Kolesnikov et al., 2020). The authors report the best results with a resolution of 384x384 during fine-tuning.

• As the Vision Transformer expects each image to be of the same size (resolution), one can use ViTFeatureExtractor to resize (or rescale) and normalize images for the model.

• Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of each checkpoint. For example, google/vit-base-patch16-224 refers to a base-sized architecture with patch resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the hub.

• The available checkpoints are either (1) pre-trained on ImageNet-21k (a collection of 14 million images and 21k classes) only, or (2) also fine-tuned on ImageNet (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).

• The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed an experiment with a self-supervised pre-training objective, namely masked patch prediction (inspired by masked language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.
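
The patch-splitting step described in the first tip can be sketched in NumPy. This is a minimal illustration, not the library's implementation: the function name `image_to_patch_sequence` is hypothetical, the image is assumed to be a `(C, H, W)` array whose height and width are multiples of the patch size, and the linear embedding and position embeddings are left out.

```python
import numpy as np


def image_to_patch_sequence(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (C, H, W) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    channels, height, width = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (C, rows, P, cols, P) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(channels, rows, patch_size, cols, patch_size)
    patches = patches.transpose(1, 3, 2, 4, 0)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


image = np.random.rand(3, 224, 224).astype(np.float32)
patches = image_to_patch_sequence(image, patch_size=16)
num_tokens = patches.shape[0] + 1  # one extra position for the [CLS] token
print(patches.shape, num_tokens)
# (196, 768) 197
```

With a 224x224 input and 16x16 patches, the encoder therefore sees 196 patch tokens plus the [CLS] token.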

The original code (written in JAX) can be found here.

Note that we converted the weights from Ross Wightman’s timm library, who already converted the weights from JAX to PyTorch. Credits go to him!

## ViTConfig

class transformers.ViTConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.0, attention_probs_dropout_prob=0.0, initializer_range=0.02, layer_norm_eps=1e-12, is_encoder_decoder=False, image_size=224, patch_size=16, num_channels=3, **kwargs)

This is the configuration class to store the configuration of a ViTModel. It is used to instantiate a ViT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ViT google/vit-base-patch16-224 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Parameters
• hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.

• num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

• num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

• intermediate_size (int, optional, defaults to 3072) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

• hidden_act (str or function, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.

• hidden_dropout_prob (float, optional, defaults to 0.0) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

• attention_probs_dropout_prob (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.

• initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

• layer_norm_eps (float, optional, defaults to 1e-12) – The epsilon used by the layer normalization layers.

• gradient_checkpointing (bool, optional, defaults to False) – If True, use gradient checkpointing to save memory at the expense of slower backward pass.

• image_size (int, optional, defaults to 224) – The size (resolution) of each image.

• patch_size (int, optional, defaults to 16) – The size (resolution) of each patch.

• num_channels (int, optional, defaults to 3) – The number of input channels.
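
To make the relationship between these parameters concrete, here is a small sketch that derives a few shape quantities from the default values. `ViTShapeSketch` is a hypothetical stand-in written for this illustration, not the real `ViTConfig`, and it assumes square images and an even split of `hidden_size` across attention heads.

```python
from dataclasses import dataclass


@dataclass
class ViTShapeSketch:
    """Hypothetical stand-in for a few ViTConfig fields (not the real class)."""

    hidden_size: int = 768
    num_attention_heads: int = 12
    image_size: int = 224
    patch_size: int = 16
    num_channels: int = 3

    def attention_head_size(self) -> int:
        # Assumption: hidden_size is split evenly across the attention heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be a multiple of num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    def num_patches(self) -> int:
        # Assumption: square images that the patch size divides exactly.
        if self.image_size % self.patch_size != 0:
            raise ValueError("image_size must be a multiple of patch_size")
        return (self.image_size // self.patch_size) ** 2

    def flattened_patch_size(self) -> int:
        # Number of raw pixel values in one patch before the linear embedding.
        return self.patch_size * self.patch_size * self.num_channels


base = ViTShapeSketch()
print(base.attention_head_size(), base.num_patches(), base.flattened_patch_size())
# 64 196 768

large_input = ViTShapeSketch(image_size=384)
print(large_input.num_patches())
# 576
```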

Example:

>>> from transformers import ViTModel, ViTConfig

>>> # Initializing a ViT vit-base-patch16-224 style configuration
>>> configuration = ViTConfig()

>>> # Initializing a model from the vit-base-patch16-224 style configuration
>>> model = ViTModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config


## ViTFeatureExtractor

class transformers.ViTFeatureExtractor(image_mean=None, image_std=None, do_normalize=True, do_resize=True, size=224, **kwargs)

Constructs a ViT feature extractor.

This feature extractor inherits from FeatureExtractionMixin which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Parameters
• image_mean (List[float], defaults to [0.5, 0.5, 0.5]) – The sequence of means for each channel, to be used when normalizing images.

• image_std (List[float], defaults to [0.5, 0.5, 0.5]) – The sequence of standard deviations for each channel, to be used when normalizing images.

• do_normalize (bool, optional, defaults to True) – Whether or not to normalize the input with mean and standard deviation.

• do_resize (bool, optional, defaults to True) – Whether to resize the input to a certain size.

• size (int, optional, defaults to 224) – Resize the input to the given size. Only has an effect if do_resize is set to True.

__call__(images: Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, List[PIL.Image.Image], List[numpy.ndarray], List[torch.Tensor]], return_tensors: Optional[Union[str, transformers.file_utils.TensorType]] = None, **kwargs) → transformers.feature_extraction_utils.BatchFeature

Main method to prepare one or several image(s) for the model.

Warning

NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so it is most efficient to pass PIL images.

Parameters
• images (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]) – The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is the number of channels, and H and W are the image height and width.

• return_tensors (str or TensorType, optional) –

If set, will return tensors instead of lists of Python numbers. Acceptable values are:

• 'tf': Return TensorFlow tf.constant objects.

• 'pt': Return PyTorch torch.Tensor objects.

• 'np': Return NumPy np.ndarray objects.

• 'jax': Return JAX jnp.ndarray objects.

Returns

A BatchFeature with the following fields:

• pixel_values – Pixel values to be fed to a model.

Return type

BatchFeature
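
The normalization controlled by `image_mean` and `image_std` can be sketched as follows. This is an illustrative approximation rather than the extractor's actual code: `normalize_image` is a hypothetical helper, the input is assumed to be a `uint8` array of shape `(C, H, W)` that has already been resized, and rescaling to `[0, 1]` before normalizing is an assumption of this sketch.

```python
import numpy as np


def normalize_image(pixels, image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5)):
    """Normalize a uint8 (C, H, W) image channel by channel.

    Returns a float32 array of the same shape.
    """
    # Assumption: 8-bit pixel values are first rescaled to the range [0, 1].
    scaled = pixels.astype(np.float32) / 255.0
    # Reshape the per-channel statistics so they broadcast over H and W.
    mean = np.asarray(image_mean, dtype=np.float32)[:, None, None]
    std = np.asarray(image_std, dtype=np.float32)[:, None, None]
    return (scaled - mean) / std


pixels = np.zeros((3, 224, 224), dtype=np.uint8)
pixels[0] = 255  # first channel fully saturated, the other two black
pixel_values = normalize_image(pixels)
print(pixel_values.shape, pixel_values.min(), pixel_values.max())
```

With the default mean and standard deviation of 0.5, values in `[0, 1]` are mapped to `[-1, 1]`.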

## ViTModel

class transformers.ViTModel(config, add_pooling_layer=True)

The bare ViT Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters

config (ViTConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

forward(pixel_values=None, head_mask=None, output_attentions=None, output_hidden_states=None, return_dict=None)

The ViTModel forward method, overrides the __call__() special method.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance instead of calling forward() directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Parameters
• pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) – Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using ViTFeatureExtractor. See transformers.ViTFeatureExtractor.__call__() for details.

• head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) –

Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.

• output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

• output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

• return_dict (bool, optional) – Whether or not to return a ModelOutput instead of a plain tuple.
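
As a rough illustration of the `head_mask` argument, the sketch below builds a per-layer mask and applies it to post-softmax attention maps by broadcasting. It assumes the convention that 1 keeps a head and 0 nullifies it; `apply_head_mask` is a hypothetical helper, and the attention maps are random stand-ins rather than model outputs.

```python
import numpy as np

num_layers, num_heads, seq_len = 12, 12, 197

# Assumed convention: 1.0 keeps a head, 0.0 nullifies it.
head_mask = np.ones((num_layers, num_heads), dtype=np.float32)
head_mask[0, 3] = 0.0    # switch off head 3 in the first layer
head_mask[11, :6] = 0.0  # switch off the first six heads in the last layer


def apply_head_mask(attention_probs: np.ndarray, layer_mask: np.ndarray) -> np.ndarray:
    """Scale each head's attention map by its mask value.

    attention_probs: (batch_size, num_heads, seq_len, seq_len)
    layer_mask:      (num_heads,)
    """
    return attention_probs * layer_mask[None, :, None, None]


# Random post-softmax attention maps for a single layer, for illustration only.
rng = np.random.default_rng(0)
scores = rng.standard_normal((1, num_heads, seq_len, seq_len)).astype(np.float32)
scores -= scores.max(axis=-1, keepdims=True)
attention_probs = np.exp(scores)
attention_probs /= attention_probs.sum(axis=-1, keepdims=True)

masked = apply_head_mask(attention_probs, head_mask[0])
print(masked.shape, bool((masked[0, 3] == 0).all()))
# (1, 12, 197, 197) True
```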

Returns

A BaseModelOutputWithPooling (if return_dict=True is passed or when config.return_dict=True) or a tuple of torch.FloatTensor comprising various elements depending on the configuration (ViTConfig) and inputs.

• last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

• pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function.

• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

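Examples:

>>> from transformers import ViTFeatureExtractor, ViTModel
>>> from PIL import Image
>>> import requests

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # Load the feature extractor and the pre-trained weights
>>> feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
>>> model = ViTModel.from_pretrained('google/vit-base-patch16-224')

>>> inputs = feature_extractor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state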


Return type

BaseModelOutputWithPooling or tuple(torch.FloatTensor)
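
The shapes listed under Returns follow directly from the configuration. The sketch below computes them for the default base configuration; `expected_output_shapes` is a hypothetical helper, and it assumes square inputs and a sequence made of one [CLS] token plus one token per patch.

```python
def expected_output_shapes(
    batch_size,
    image_size=224,
    patch_size=16,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
):
    """Return the tensor shapes one would expect from a ViT forward pass."""
    # One token per patch plus the [CLS] token.
    sequence_length = (image_size // patch_size) ** 2 + 1
    hidden = (batch_size, sequence_length, hidden_size)
    attention = (batch_size, num_attention_heads, sequence_length, sequence_length)
    return {
        "last_hidden_state": hidden,
        "pooler_output": (batch_size, hidden_size),
        # Embedding output plus the output of each layer.
        "hidden_states": [hidden] * (num_hidden_layers + 1),
        # One attention map per layer.
        "attentions": [attention] * num_hidden_layers,
    }


shapes = expected_output_shapes(batch_size=1)
print(shapes["last_hidden_state"], shapes["pooler_output"])
# (1, 197, 768) (1, 768)
print(len(shapes["hidden_states"]), shapes["attentions"][0])
# 13 (1, 12, 197, 197)
```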

## ViTForImageClassification

class transformers.ViTForImageClassification(config)

ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters

config (ViTConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

forward(pixel_values=None, head_mask=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)

The ViTForImageClassification forward method, overrides the __call__() special method.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance instead of calling forward() directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Parameters
• pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) – Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using ViTFeatureExtractor. See transformers.ViTFeatureExtractor.__call__() for details.

• head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) –

Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.

• output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

• output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

• return_dict (bool, optional) – Whether or not to return a ModelOutput instead of a plain tuple.

• labels (torch.LongTensor of shape (batch_size,), optional) – Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss); if config.num_labels > 1, a classification loss is computed (Cross-Entropy).
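
The rule for choosing the loss from `config.num_labels` can be sketched in NumPy. This is an illustration of the rule as stated above, not the model's own code: `classification_or_regression_loss` is a hypothetical helper, and both losses are assumed to be averaged over the batch.

```python
import numpy as np


def classification_or_regression_loss(logits, labels, num_labels):
    """Pick the loss from num_labels: MSE for 1 label, cross-entropy otherwise."""
    if num_labels == 1:
        # Regression: mean squared error between predictions and targets.
        return float(np.mean((logits.reshape(-1) - labels.reshape(-1)) ** 2))
    # Classification: numerically stable log-softmax, then mean negative
    # log-likelihood of the correct class.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Classification: uniform logits over 4 classes give a loss of ln(4).
class_loss = classification_or_regression_loss(
    np.zeros((2, 4)), np.array([0, 3]), num_labels=4
)

# Regression: errors of 1 and 2 give a mean squared error of 2.5.
reg_loss = classification_or_regression_loss(
    np.array([[1.0], [3.0]]), np.array([0.0, 1.0]), num_labels=1
)
print(round(class_loss, 4), reg_loss)
# 1.3863 2.5
```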

Returns

A SequenceClassifierOutput (if return_dict=True is passed or when config.return_dict=True) or a tuple of torch.FloatTensor comprising various elements depending on the configuration (ViTConfig) and inputs.

• loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.

• logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).

• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

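Examples:

>>> from transformers import ViTFeatureExtractor, ViTForImageClassification
>>> from PIL import Image
>>> import requests

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
>>> model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

>>> inputs = feature_extractor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits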

Return type

SequenceClassifierOutput or tuple(torch.FloatTensor)
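
Because the returned logits are scores before SoftMax, turning them into a prediction takes one more step. The sketch below shows that step in NumPy under stated assumptions: the logits and the `id2label` mapping are made-up toy values, not output from a real checkpoint.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw scores to probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical label mapping and logits for a batch of one image.
id2label = {0: "cat", 1: "dog", 2: "remote"}
logits = np.array([[2.0, 0.5, 1.0]])

probs = softmax(logits)
predicted = int(probs.argmax(axis=-1)[0])
top2 = [int(i) for i in np.argsort(probs[0])[::-1][:2]]
print(id2label[predicted], [id2label[i] for i in top2])
# cat ['cat', 'remote']
```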