Transformers documentation

Perceiver

Transformers

You are viewing v4.13.0 version. A newer version v4.48.2 is available.

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Perceiver

Overview

The Perceiver IO model was proposed in Perceiver IO: A General Architecture for Structured Inputs & Outputs by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.

Perceiver IO is a generalization of Perceiver to handle arbitrary outputs in addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio. This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example, Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.

The abstract from the paper is the following:

The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation.

Here’s a TLDR explaining how Perceiver works:

The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512 tokens. Perceiver aims to solve this issue by, instead of performing self-attention on the inputs, perform it on a set of latent variables, and only use the inputs for cross-attention. In this way, the time and memory requirements don’t depend on the length of the inputs anymore, as one uses a fixed amount of latent variables, like 256 or 512. These are randomly initialized, after which they are trained end-to-end using backpropagation.

Internally, PerceiverModel will create the latents, which is a tensor of shape (batch_size, num_latents, d_latents). One must provide inputs (which could be text, images, audio, you name it!) to the model, which it will use to perform cross-attention with the latents. The output of the Perceiver encoder is a tensor of the same shape. One can then, similar to BERT, convert the last hidden states of the latents to classification logits by averaging along the sequence dimension, and placing a linear layer on top of that to project the d_latents to num_labels.

This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up work, PerceiverIO, they generalized it to let the model also produce outputs of arbitrary size. How, you might ask? The idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the last hidden states of the latents, using the outputs as queries, and the latents as keys and values.

So let’s say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver’s input length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes, providing inputs of length 2048 to the model. If one now masks out certain of these 2048 tokens, one can define the outputs as being of shape: (batch_size, 2048, 768). Next, one performs cross-attention with the final hidden states of the latents to update the outputs tensor. After cross-attention, one still has a tensor of shape (batch_size, 2048, 768). One can then place a regular language modeling head on top, to project the last dimension to the vocabulary size of the model, i.e. creating logits of shape (batch_size, 2048, 262) (as Perceiver uses a vocabulary size of 262 byte IDs).

This model was contributed by <nielsr>. The original code can be found here.

Perceiver specific outputs

class transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput < source >

( logits: FloatTensor = None last_hidden_state: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

logits (torch.FloatTensor of shape (batch_size, num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Base class for Perceiver base model’s outputs, with potential hidden states, attentions and cross-attentions.

class transformers.models.perceiver.modeling_perceiver.PerceiverDecoderOutput < source >

( logits: FloatTensor = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

logits (torch.FloatTensor of shape (batch_size, num_labels)) — Output of the basic decoder.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Base class for Perceiver decoder outputs, with potential cross-attentions.

class transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput < source >

( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Masked language modeling (MLM) loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, num_latents, num_latents). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Base class for Perceiver’s masked language model outputs.

class transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput < source >

Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Base class for Perceiver’s outputs of sequence/image classification models, optical flow and multimodal autoencoding.

PerceiverConfig

class transformers.PerceiverConfig < source >

( num_latents = 256 d_latents = 1280 d_model = 768 num_blocks = 1 num_self_attends_per_block = 26 num_self_attention_heads = 8 num_cross_attention_heads = 8 qk_channels = None v_channels = None cross_attention_shape_for_attention = 'kv' self_attention_widening_factor = 1 cross_attention_widening_factor = 1 hidden_act = 'gelu' attention_probs_dropout_prob = 0.1 position_embedding_init_scale = 0.02 initializer_range = 0.02 layer_norm_eps = 1e-12 is_encoder_decoder = False use_query_residual = True vocab_size = 262 max_position_embeddings = 2048 image_size = 56 train_size = [368, 496] num_frames = 16 audio_samples_per_frame = 1920 samples_per_patch = 16 output_shape = [1, 16, 224, 224] **kwargs )

Parameters

num_latents (int, optional, defaults to 256) — The number of latents.
d_latents (int, optional, defaults to 1280) — Dimension of the latent embeddings.
d_model (int, optional, defaults to 768) — Dimension of the inputs.
num_blocks (int, optional, defaults to 1) — Number of blocks in the Transformer encoder.
num_self_attends_per_block (int, optional, defaults to 26) — The number of self-attention layers per block.
num_self_attention_heads (int, optional, defaults to 8) — Number of attention heads for each self-attention layer in the Transformer encoder.
num_cross_attention_heads (int, optional, defaults to 8) — Number of attention heads for each cross-attention layer in the Transformer encoder.
qk_channels (int, optional) — Dimension to project the queries + keys before applying attention in the cross-attention and self-attention layers of the encoder. Will default to preserving the dimension of the queries if not specified.
v_channels (int, optional) — Dimension to project the values before applying attention in the cross-attention and self-attention layers of the encoder. Will default to preserving the dimension of the queries if not specified.
cross_attention_shape_for_attention (str, optional, defaults to 'kv') — Dimension to use when downsampling the queries and keys in the cross-attention layer of the encoder.
self_attention_widening_factor (int, optional, defaults to 1) — Dimension of the feed-forward layer in the cross-attention layer of the Transformer encoder.
cross_attention_widening_factor (int, optional, defaults to 1) — Dimension of the feed-forward layer in the self-attention layers of the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
use_query_residual (float, optional, defaults to True) — Whether to add a query residual in the cross-attention layer of the encoder.
vocab_size (int, optional, defaults to 262) — Vocabulary size for the masked language modeling model.
max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that the masked language modeling model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
image_size (int, optional, defaults to 56) — Size of the images after preprocessing, for PerceiverForImageClassificationLearned.
train_size (List[int], optional, defaults to [368, 496]) — Training size of the images for the optical flow model.
num_frames (int, optional, defaults to 16) — Number of video frames used for the multimodal autoencoding model.
audio_samples_per_frame (int, optional, defaults to 1920) — Number of audio samples per frame for the multimodal autoencoding model.
samples_per_patch (int, optional, defaults to 16) — Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model.
output_shape (List[int], optional, defaults to [1, 16, 224, 224]) — Shape of the output for the multimodal autoencoding model.

This is the configuration class to store the configuration of a PerceiverModel. It is used to instantiate an Perceiver model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Perceiver deepmind/language-perceiver architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import PerceiverModel, PerceiverConfig

>>> # Initializing a Perceiver deepmind/language-perceiver style configuration
>>> configuration = PerceiverConfig()

>>> # Initializing a model from the deepmind/language-perceiver style configuration
>>> model = PerceiverModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

PerceiverTokenizer

class transformers.PerceiverTokenizer < source >

( pad_token = '[PAD]' bos_token = '[BOS]' eos_token = '[EOS]' mask_token = '[MASK]' cls_token = '[CLS]' sep_token = '[SEP]' model_max_length = 2048 **kwargs )

Parameters

pad_token (str, optional, defaults to "[PAD]") — The token used for padding, for example when batching sequences of different lengths.
bos_token (str, optional, defaults to "[BOS]") — The BOS token (reserved in the vocab, but not actually used).
eos_token (str, optional, defaults to "[EOS]") — The end of sequence token (reserved in the vocab, but not actually used).

Construct a Perceiver tokenizer. The Perceiver simply uses raw bytes utf-8 encoding.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

build_inputs_with_special_tokens < source >

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Parameters

token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequence for sequence classification tasks. A sequence has the following format:

single sequence: [CLS] X [SEP]
pair of sequences: [CLS] A [SEP] B [SEP]

get_special_tokens_mask < source >

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) → List[int]

Parameters

token_ids_0 (List[int]) — List of IDs.
token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

List[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

create_token_type_ids_from_sequences < source >

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Parameters

token_ids_0 (List[int]) — The first tokenized sequence.
token_ids_1 (List[int], optional) — The second tokenized sequence.

Returns

List[int]

The token type ids.

Create the token type IDs corresponding to the sequences passed. What are token type IDs?

Should be overridden in a subclass if the model has a special way of building those.

PerceiverFeatureExtractor

class transformers.PerceiverFeatureExtractor < source >

( do_center_crop = True crop_size = 256 do_resize = True size = 224 resample = 3 do_normalize = True image_mean = None image_std = None **kwargs )

Parameters

do_center_crop (bool, optional, defaults to True) — Whether to crop the input at the center. If the input size is smaller than crop_size along any edge, the image is padded with 0’s and then center cropped.
crop_size (int, optional, defaults to 256) — Desired output size when applying center-cropping. Only has an effect if do_center_crop is set to True.
do_resize (bool, optional, defaults to True) — Whether to resize the input to a certain size.
size (int or Tuple(int), optional, defaults to 224) — Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an integer is provided, then the input will be resized to (size, size). Only has an effect if do_resize is set to True.
resample (int, optional, defaults to PIL.Image.BICUBIC) — An optional resampling filter. This can be one of PIL.Image.NEAREST, PIL.Image.BOX, PIL.Image.BILINEAR, PIL.Image.HAMMING, PIL.Image.BICUBIC or PIL.Image.LANCZOS. Only has an effect if do_resize is set to True.
do_normalize (bool, optional, defaults to True) — Whether or not to normalize the input with image_mean and image_std.
image_mean (List[int], defaults to [0.485, 0.456, 0.406]) — The sequence of means for each channel, to be used when normalizing images.
image_std (List[int], defaults to [0.229, 0.224, 0.225]) — The sequence of standard deviations for each channel, to be used when normalizing images.

Constructs a Perceiver feature extractor.

This feature extractor inherits from ImageFeatureExtractionMixin which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

center_crop < source >

( image )

Parameters

image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to resize.

Crops image to self.crop_size using a center crop. Note that if the image is too small to be cropped to the size given, it will be padded (so the returned result has the size asked).

PerceiverTextPreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor < source >

( config )

Text preprocessing for Perceiver Encoder.

PerceiverImagePreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor < source >

( config prep_type = 'conv' spatial_downsample: int = 4 temporal_downsample: int = 1 position_encoding_type: str = 'fourier' in_channels: int = 3 out_channels: int = 64 conv_after_patching: bool = False conv_after_patching_in_channels: int = 54 conv2d_use_batchnorm: bool = True concat_or_add_pos: str = 'concat' project_pos_dim: int = -1 **position_encoding_kwargs )

Parameters

config (PerceiverConfig) — Model configuration.
prep_type (str, optional, defaults to "conv") — Preprocessing type. Can be “conv1x1”, “conv”, “patches”, “pixels”.
spatial_downsample (int, optional, defaults to 4) — Spatial downsampling factor.
temporal_downsample (int, optional, defaults to 1) — Temporal downsampling factor (only relevant in case a time dimension is present).
position_encoding_type (str, optional, defaults to "fourier") — Position encoding type. Can be “fourier” or “trainable”.
in_channels (int, optional, defaults to 3) — Number of channels in the input.
out_channels (int, optional, defaults to 64) — Number of channels in the output.
conv_after_patching (bool, optional, defaults to False) — Whether to apply a convolutional layer after patching.
conv_after_patching_in_channels (int, optional, defaults to 54) — Number of channels in the input of the convolutional layer after patching.
conv2d_use_batchnorm (bool, optional, defaults to True) — Whether to use batch normalization in the convolutional layer.
concat_or_add_pos (str, optional, defaults to "concat") — How to concatenate the position encoding to the input. Can be “concat” or “add”.
project_pos_dim (int, optional, defaults to -1) — Dimension of the position encoding to project to. If -1, no projection is applied.
**position_encoding_kwargs (Dict, optional) — Keyword arguments for the position encoding.

Image preprocessing for Perceiver Encoder.

Note: the out_channels argument refers to the output channels of a convolutional layer, if prep_type is set to “conv1x1” or “conv”. If one adds absolute position embeddings, one must make sure the num_channels of the position encoding kwargs are set equal to the out_channels.

PerceiverOneHotPreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor < source >

( config )

Parameters

config (PerceiverConfig) — Model configuration.

One-hot preprocessor for Perceiver Encoder. Can be used to add a dummy index dimension to the input.

PerceiverAudioPreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor < source >

( config prep_type: str = 'patches' samples_per_patch: int = 96 position_encoding_type: str = 'fourier' concat_or_add_pos: str = 'concat' out_channels = 64 project_pos_dim = -1 **position_encoding_kwargs )

Parameters

config (PerceiverConfig) — Model configuration.
prep_type (str, optional, defaults to "patches") — Preprocessor type to use. Only “patches” is supported.
samples_per_patch (int, optional, defaults to 96) — Number of samples per patch.
position_encoding_type (str, optional, defaults to "fourier") — Type of position encoding to use. Can be “trainable” or “fourier”.
concat_or_add_pos (str, optional, defaults to "concat") — How to concatenate the position encoding to the input. Can be “concat” or “add”.
out_channels (int, optional, defaults to 64) — Number of channels in the output.
project_pos_dim (int, optional, defaults to -1) — Dimension of the position encoding to project to. If -1, no projection is applied.
**position_encoding_kwargs (Dict, optional) — Keyword arguments for the position encoding.

Audio preprocessing for Perceiver Encoder.

PerceiverMultimodalPreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor < source >

( modalities: typing.Mapping[str, typing.Callable[..., typing.Tuple[torch.Tensor, typing.Optional[torch.Tensor], torch.Tensor]]] mask_probs: typing.Union[typing.Mapping[str, float], NoneType] = None min_padding_size: int = 2 )

Parameters

modalities (Dict[str, PreprocessorType]) — Dict mapping modality name to preprocessor.
mask_probs (Dict[str, float]) — Dict mapping modality name to masking probability of that modality.
min_padding_size (int, optional, defaults to 2) — The minimum padding size for all modalities. The final output will have num_channels equal to the maximum channels across all modalities plus min_padding_size.

Multimodal preprocessing for Perceiver Encoder.

Inputs for each modality are preprocessed, then padded with trainable position embeddings to have the same number of channels.

PerceiverProjectionPostprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor < source >

( in_channels out_channels )

Parameters

in_channels (int) — Number of channels in the input.
out_channels (int) — Number of channels in the output.

Projection postprocessing for Perceiver. Can be used to convert the project the channels of the decoder output to a lower dimension.

PerceiverAudioPostprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor < source >

( config in_channels postproc_type: str = 'patches' )

Parameters

config (PerceiverConfig) — Model configuration.
in_channels (int) — Number of channels in the input.
postproc_type (str, optional, defaults to "patches") — Postprocessor type to use. Currently, only “patches” is supported.

Audio postprocessing for Perceiver. Can be used to convert the decoder output to audio features.

PerceiverClassificationPostprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor < source >

( config in_channels )

Parameters

config (PerceiverConfig) — Model configuration.
in_channels (int) — Number of channels in the input.

Classification postprocessing for Perceiver. Can be used to convert the decoder output to classification logits.

PerceiverMultimodalPostprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor < source >

( modalities: typing.Mapping[str, typing.Callable[..., typing.Any]] input_is_dict: bool = False )

Parameters

modalities (Dict[str, PostprocessorType]) — Dictionary mapping modality name to postprocessor class for that modality.
input_is_dict (bool, optional, defaults to False) — If True, input is assumed to be dictionary structured, and outputs keep the same dictionary shape. If False, input is a tensor which is sliced up during postprocessing by modality_sizes.

Multimodal postprocessing for Perceiver.

PerceiverModel

class transformers.PerceiverModel < source >

( config decoder = None input_preprocessor: typing.Callable[..., typing.Tuple[torch.Tensor, typing.Optional[torch.Tensor], torch.Tensor]] = None output_postprocessor: typing.Callable[..., typing.Any] = None )

Parameters

config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
decoder (DecoderType, optional) — Optional decoder to use to decode the latent representation of the encoder. Examples include transformers.models.perceiver.modeling_perceiver.PerceiverBasicDecoder, transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder, transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder.
input_preprocessor (PreprocessorType, optional) — Optional input preprocessor to use. Examples include transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor.
output_postprocessor (PostprocessorType, optional) — Optional output postprocessor to use. Examples include transformers.models.perceiver.modeling_perceiver.PerceiverImagePostprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor.
Note that you can define your own decoders, preprocessors and/or postprocessors to fit your use-case. —

The Perceiver: a scalable, fully attentional architecture. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

( inputs attention_mask = None subsampled_output_points = None head_mask = None output_attentions = None output_hidden_states = None return_dict = None ) → PerceiverModelOutput or tuple(torch.FloatTensor)

Parameters

inputs (torch.FloatTensor) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

PerceiverModelOutput or tuple(torch.FloatTensor)

A PerceiverModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

logits (torch.FloatTensor of shape (batch_size, num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The PerceiverModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import PerceiverTokenizer, PerceiverModel
>>> import torch

>>> tokenizer = PerceiverTokenizer.from_pretrained('deepmind/language-perceiver')
>>> model = PerceiverModel.from_pretrained('deepmind/language-perceiver')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

PerceiverForMaskedLM

class transformers.PerceiverForMaskedLM < source >

( config )

Parameters

config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Example use of Perceiver for masked language modeling. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

( inputs = None attention_mask = None head_mask = None output_attentions = None output_hidden_states = None labels = None return_dict = None ) → PerceiverMaskedLMOutput or tuple(torch.FloatTensor)

Parameters

inputs (torch.FloatTensor) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]

Returns

PerceiverMaskedLMOutput or tuple(torch.FloatTensor)

A PerceiverMaskedLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Masked language modeling (MLM) loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, num_latents, num_latents). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The PerceiverForMaskedLM forward method, overrides the __call__ special method.

Example:

>>> from transformers import PerceiverTokenizer, PerceiverForMaskedLM
>>> import torch

>>> tokenizer = PerceiverTokenizer.from_pretrained('deepmind/language-perceiver')
>>> model = PerceiverForMaskedLM.from_pretrained('deepmind/language-perceiver')

>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
>>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits

PerceiverForSequenceClassification

class transformers.PerceiverForSequenceClassification < source >

( config )

Parameters

config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Example use of Perceiver for text classification. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

( inputs = None attention_mask = None head_mask = None output_attentions = None output_hidden_states = None labels = None return_dict = None ) → PerceiverClassifierOutput or tuple(torch.FloatTensor)

Parameters

inputs (torch.FloatTensor) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns —

Examples —:

from transformers import PerceiverTokenizer, PerceiverForSequenceClassification

tokenizer = PerceiverTokenizer.from_pretrained(‘deepmind/language-perceiver’) model = PerceiverForSequenceClassification.from_pretrained(‘deepmind/language-perceiver’)

text = “hello world” inputs = tokenizer(images=image, return_tensors=“pt”).input_ids outputs = model(inputs=inputs) logits = outputs.logits

Returns

PerceiverClassifierOutput or tuple(torch.FloatTensor)

A PerceiverClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The PerceiverForSequenceClassification forward method, overrides the __call__ special method.

Example of single-label classification:

>>> from transformers import PerceiverTokenizer, PerceiverForSequenceClassification
>>> import torch

>>> tokenizer = PerceiverTokenizer.from_pretrained('deepmind/language-perceiver')
>>> model = PerceiverForSequenceClassification.from_pretrained('deepmind/language-perceiver')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits

Example of multi-label classification:

>>> from transformers import PerceiverTokenizer, PerceiverForSequenceClassification
>>> import torch

>>> tokenizer = PerceiverTokenizer.from_pretrained('deepmind/language-perceiver')
>>> model = PerceiverForSequenceClassification.from_pretrained('deepmind/language-perceiver', problem_type="multi_label_classification")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([[1, 1]], dtype=torch.float) # need dtype=float for BCEWithLogitsLoss
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits

PerceiverForImageClassificationLearned

class transformers.PerceiverForImageClassificationLearned < source >

( config )

Parameters

config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Example use of Perceiver for image classification, for tasks such as ImageNet.

This model uses learned position embeddings. In other words, this model is not given any privileged information about the structure of images. As shown in the paper, this model can achieve a top-1 accuracy of 72.7 on ImageNet.

PerceiverForImageClassificationLearned uses transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor (with prep_type = “conv1x1”) to preprocess the input images, and transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder to decode the latent representation of ~transformers.PerceiverModel into classification logits.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

Parameters

inputs (torch.FloatTensor) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns

PerceiverClassifierOutput or tuple(torch.FloatTensor)

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The PerceiverForImageClassificationLearned forward method, overrides the __call__ special method.

Examples:

>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned
>>> from PIL import Image
>>> import requests

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained('deepmind/vision-perceiver-learned')
>>> model = PerceiverForImageClassificationLearned.from_pretrained('deepmind/vision-perceiver-learned')

>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])

PerceiverForImageClassificationFourier

class transformers.PerceiverForImageClassificationFourier < source >

( config )

Parameters

config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Example use of Perceiver for image classification, for tasks such as ImageNet.

This model uses fixed 2D Fourier position embeddings. As shown in the paper, this model can achieve a top-1 accuracy of 79.0 on ImageNet, and 84.5 when pre-trained on a large-scale dataset (i.e. JFT).

PerceiverForImageClassificationLearned uses transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor (with prep_type = “pixels”) to preprocess the input images, and transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder to decode the latent representation of ~transformers.PerceiverModel into classification logits.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

Parameters

inputs (torch.FloatTensor) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns

PerceiverClassifierOutput or tuple(torch.FloatTensor)

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The PerceiverForImageClassificationFourier forward method, overrides the __call__ special method.

Examples:

>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationFourier
>>> from PIL import Image
>>> import requests

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained('deepmind/vision-perceiver-fourier')
>>> model = PerceiverForImageClassificationFourier.from_pretrained('deepmind/vision-perceiver-fourier')

>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])

PerceiverForImageClassificationConvProcessing

class transformers.PerceiverForImageClassificationConvProcessing < source >

( config )

Parameters

config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Example use of Perceiver for image classification, for tasks such as ImageNet.

This model uses a 2D conv+maxpool preprocessing network. As shown in the paper, this model can achieve a top-1 accuracy of 82.1 on ImageNet.

PerceiverForImageClassificationLearned uses transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor (with prep_type = “conv”) to preprocess the input images, and transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder to decode the latent representation of ~transformers.PerceiverModel into classification logits.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

Parameters

inputs (torch.FloatTensor) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns

PerceiverClassifierOutput or tuple(torch.FloatTensor)

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The PerceiverForImageClassificationConvProcessing forward method, overrides the __call__ special method.

Examples:

>>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationConvProcessing
>>> from PIL import Image
>>> import requests

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> feature_extractor = PerceiverFeatureExtractor.from_pretrained('deepmind/vision-perceiver-conv')
>>> model = PerceiverForImageClassificationConvProcessing.from_pretrained('deepmind/vision-perceiver-conv')

>>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])

PerceiverForOpticalFlow

class transformers.PerceiverForOpticalFlow < source >

( config )

Parameters

config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI. PerceiverForOpticalFlow uses transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor (with prep_type = “patches”) to preprocess the input images, and transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder to decode the latent representation of ~transformers.PerceiverModel.

As input, one concatenates 2 subsequent frames along the channel dimension and extract a 3 x 3 patch around each pixel (leading to 3 x 3 x 3 x 2 = 54 values for each pixel). Fixed Fourier position encodings are used to encode the position of each pixel in the patch. Next, one applies the Perceiver encoder. To decode, one queries the latent representation using the same encoding used for the input.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

Parameters

inputs (torch.FloatTensor) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the optical flow loss. Indices should be in [0, ..., config.num_labels - 1].

Returns

PerceiverClassifierOutput or tuple(torch.FloatTensor)

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The PerceiverForOpticalFlow forward method, overrides the __call__ special method.

Examples:

>>> from transformers import PerceiverForOpticalFlow
>>> import torch

>>> model = PerceiverForOpticalFlow.from_pretrained('deepmind/optical-flow-perceiver')

>>> # in the Perceiver IO paper, the authors extract a 3 x 3 patch around each pixel,
>>> # leading to 3 x 3 x 3 = 27 values for each pixel (as each pixel also has 3 color channels)
>>> # patches have shape (batch_size, num_frames, num_channels, height, width)
>>> # the authors train on resolutions of 368 x 496
>>> patches = torch.randn(1, 2, 27, 368, 496)
>>> outputs = model(inputs=patches)
>>> logits = outputs.logits

PerceiverForMultimodalAutoencoding

class transformers.PerceiverForMultimodalAutoencoding < source >

( config )

Parameters

config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Example use of Perceiver for multimodal (video) autoencoding, for tasks such as Kinetics-700.

PerceiverForMultimodalAutoencoding uses transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor to preprocess the 3 modalities: images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every modality separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to the same number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver encoder.

transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder is used to decode the latent representation of ~transformers.PerceiverModel. This decoder uses each modality-specific decoder to construct queries. The decoder queries are created based on the inputs after preprocessing. However, autoencoding an entire video in a single forward pass is computationally infeasible, hence one only uses parts of the decoder queries to do cross-attention with the latent representation. This is determined by the subsampled indices for each modality, which can be provided as additional input to the forward pass of PerceiverForMultimodalAutoencoding.

transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder also pads the decoder queries of the different modalities to the same number of channels, in order to concatenate them along the time dimension. Next, cross-attention is performed with the latent representation of PerceiverModel.

Finally, transformers.models.perceiver.modeling_perceiver.PerceiverMultiModalPostprocessor is used to turn this tensor into an actual video. It first splits up the output into the different modalities, and then applies the respective postprocessor for each modality.

Note that, by masking the classification label during evaluation (i.e. simply providing a tensor of zeros for the “label” modality), this auto-encoding model becomes a Kinetics 700 video classifier.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

( inputs = None attention_mask = None subsampled_output_points = None head_mask = None output_attentions = None output_hidden_states = None labels = None return_dict = None ) → PerceiverClassifierOutput or tuple(torch.FloatTensor)

Parameters

inputs (torch.FloatTensor) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns

PerceiverClassifierOutput or tuple(torch.FloatTensor)

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The PerceiverForMultimodalAutoencoding forward method, overrides the __call__ special method.

Examples:

>>> from transformers import PerceiverForMultimodalAutoencoding
>>> import torch

>>> images = torch.randn((1, 16, 3, 224, 224))
>>> audio = torch.randn((1, 30720, 1))
>>> inputs = dict(image=images, audio=audio, label=torch.zeros((images.shape[0], 700)))

>>> model = PerceiverForMultimodalAutoencoding.from_pretrained('deepmind/multimodal-perceiver')

>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits

←Hubert Pegasus→