Model outputs¶

All models have outputs that are instances of subclasses of ModelOutput. Those are data structures containing all the information returned by the model, but that can also be used as tuples or dictionaries.

Let’s see of this looks on an example:

from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)

The outputs object is a SequenceClassifierOutput, as we can see in the documentation of that class below, it means it has an optional loss, a logits an optional hidden_states and an optional attentions attribute. Here we have the loss since we passed along labels, but we don’t have hidden_states and attentions because we didn’t pass output_hidden_states=True or output_attentions=True.

You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you will get None. Here for instance outputs.loss is the loss computed by the model, and outputs.attentions is None.

When considering our outputs object as tuple, it only considers the attributes that don’t have None values. Here for instance, it has two elements, loss then logits, so

outputs[:2]

will return the tuple (outputs.loss, outputs.logits) for instance.

When considering our outputs object as dictionary, it only considers the attributes that don’t have None values. Here for instance, it has two keys that are loss and logits.

We document here the generic model outputs that are used by more than one model type. Specific output types are documented on their corresponding model page.

ModelOutput¶

class transformers.file_utils.ModelOutput[source]¶

Base class for all model outputs as dataclass. Has a __getitem__ that allows indexing by integer or slice (like a tuple) or strings (like a dictionary) that will ignore the None attributes. Otherwise behaves like a regular python dictionary.

Warning

You can’t unpack a ModelOutput directly. Use the to_tuple() method to convert it to a tuple before.

to_tuple() → Tuple[Any][source]¶

Convert self to a tuple containing all the attributes/keys that are not None.

BaseModelOutput¶

class transformers.modeling_outputs.BaseModelOutput(last_hidden_state: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for model’s outputs, with potential hidden states and attentions.

Parameters
  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BaseModelOutputWithPooling¶

class transformers.modeling_outputs.BaseModelOutputWithPooling(last_hidden_state: torch.FloatTensor = None, pooler_output: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for model’s outputs that also contains a pooling of the last hidden states.

Parameters
  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BaseModelOutputWithCrossAttentions¶

class transformers.modeling_outputs.BaseModelOutputWithCrossAttentions(last_hidden_state: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for model’s outputs, with potential hidden states and attentions.

Parameters
  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

BaseModelOutputWithPoolingAndCrossAttentions¶

class transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state: torch.FloatTensor = None, pooler_output: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for model’s outputs that also contains a pooling of the last hidden states.

Parameters
  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and optionally if config.is_encoder_decoder=True 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

BaseModelOutputWithPast¶

class transformers.modeling_outputs.BaseModelOutputWithPast(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).

Parameters
  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) –

    Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and optionally if config.is_encoder_decoder=True 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BaseModelOutputWithPastAndCrossAttentions¶

class transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).

Parameters
  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) –

    Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and optionally if config.is_encoder_decoder=True 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Seq2SeqModelOutput¶

class transformers.modeling_outputs.Seq2SeqModelOutput(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for model encoder’s outputs that also contains : pre-computed hidden states that can speed up sequential decoding.

Parameters
  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) –

    Sequence of hidden-states at the output of the last layer of the decoder of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

CausalLMOutput¶

class transformers.modeling_outputs.CausalLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for causal language model (or autoregressive) outputs.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

CausalLMOutputWithCrossAttentions¶

class transformers.modeling_outputs.CausalLMOutputWithCrossAttentions(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for causal language model (or autoregressive) outputs.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of torch.FloatTensor tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if model is used in encoder-decoder setting. Only relevant if config.is_decoder = True.

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

CausalLMOutputWithPast¶

class transformers.modeling_outputs.CausalLMOutputWithPast(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for causal language model (or autoregressive) outputs.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head))

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

MaskedLMOutput¶

class transformers.modeling_outputs.MaskedLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for masked language models outputs.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Masked language modeling (MLM) loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Seq2SeqLMOutput¶

class transformers.modeling_outputs.Seq2SeqLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for sequence-to-sequence language models outputs.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

NextSentencePredictorOutput¶

class transformers.modeling_outputs.NextSentencePredictorOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for outputs of models predicting if two sentences are consecutive or not.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when next_sentence_label is provided) – Next sequence prediction (classification) loss.

  • logits (torch.FloatTensor of shape (batch_size, 2)) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

SequenceClassifierOutput¶

class transformers.modeling_outputs.SequenceClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for outputs of sentence classification models.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.

  • logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Seq2SeqSequenceClassifierOutput¶

class transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for outputs of sequence-to-sequence sentence classification models.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when label is provided) – Classification (or regression if config.num_labels==1) loss.

  • logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

MultipleChoiceModelOutput¶

class transformers.modeling_outputs.MultipleChoiceModelOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for outputs of multiple choice models.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification loss.

  • logits (torch.FloatTensor of shape (batch_size, num_choices)) –

    num_choices is the second dimension of the input tensors. (see input_ids above).

    Classification scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TokenClassifierOutput¶

class transformers.modeling_outputs.TokenClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for outputs of token classification models.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

QuestionAnsweringModelOutput¶

class transformers.modeling_outputs.QuestionAnsweringModelOutput(loss: Optional[torch.FloatTensor] = None, start_logits: torch.FloatTensor = None, end_logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for outputs of question answering models.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

  • start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).

  • end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Seq2SeqQuestionAnsweringModelOutput¶

class transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput(loss: Optional[torch.FloatTensor] = None, start_logits: torch.FloatTensor = None, end_logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

Base class for outputs of sequence-to-sequence question answering models.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

  • start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).

  • end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

TFBaseModelOutput¶

class transformers.modeling_tf_outputs.TFBaseModelOutput(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for model’s outputs, with potential hidden states and attentions.

Parameters
  • last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple(tf.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFBaseModelOutputWithPooling¶

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, pooler_output: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for model’s outputs that also contains a pooling of the last hidden states.

Parameters
  • last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (tf.Tensor of shape (batch_size, hidden_size)) –

    Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

    This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFBaseModelOutputWithPoolingAndCrossAttentions¶

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, pooler_output: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for model’s outputs that also contains a pooling of the last hidden states.

Parameters
  • last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (tf.Tensor of shape (batch_size, hidden_size)) –

    Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

    This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

TFBaseModelOutputWithPast¶

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPast(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).

Parameters
  • last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) –

    Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFBaseModelOutputWithPastAndCrossAttentions¶

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).

Parameters
  • last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) –

    Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(tf.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

TFSeq2SeqModelOutput¶

class transformers.modeling_tf_outputs.TFSeq2SeqModelOutput(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for model encoder’s outputs that also contains : pre-computed hidden states that can speed up sequential decoding.

Parameters
  • last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) –

    Sequence of hidden-states at the output of the last layer of the decoder of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

TFCausalLMOutput¶

class transformers.modeling_tf_outputs.TFCausalLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for causal language model (or autoregressive) outputs.

Parameters
  • loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFCausalLMOutputWithCrossAttentions¶

class transformers.modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for causal language model (or autoregressive) outputs.

Parameters
  • loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

TFCausalLMOutputWithPast¶

class transformers.modeling_tf_outputs.TFCausalLMOutputWithPast(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for causal language model (or autoregressive) outputs.

Parameters
  • loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFMaskedLMOutput¶

class transformers.modeling_tf_outputs.TFMaskedLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for masked language models outputs.

Parameters
  • loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Masked language modeling (MLM) loss.

  • logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFSeq2SeqLMOutput¶

class transformers.modeling_tf_outputs.TFSeq2SeqLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for sequence-to-sequence language models outputs.

Parameters
  • loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Language modeling loss.

  • logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

TFNextSentencePredictorOutput¶

class transformers.modeling_tf_outputs.TFNextSentencePredictorOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for outputs of models predicting if two sentences are consecutive or not.

Parameters
  • loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when next_sentence_label is provided) – Next sentence prediction loss.

  • logits (tf.Tensor of shape (batch_size, 2)) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFSequenceClassifierOutput¶

class transformers.modeling_tf_outputs.TFSequenceClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for outputs of sentence classification models.

Parameters
  • loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.

  • logits (tf.Tensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFSeq2SeqSequenceClassifierOutput¶

class transformers.modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for outputs of sequence-to-sequence sentence classification models.

Parameters
  • loss (tf.Tensor of shape (1,), optional, returned when label is provided) – Classification (or regression if config.num_labels==1) loss.

  • logits (tf.Tensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

TFMultipleChoiceModelOutput¶

class transformers.modeling_tf_outputs.TFMultipleChoiceModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for outputs of multiple choice models.

Parameters
  • loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided) – Classification loss.

  • logits (tf.Tensor of shape (batch_size, num_choices)) –

    num_choices is the second dimension of the input tensors. (see input_ids above).

    Classification scores (before SoftMax).

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFTokenClassifierOutput¶

class transformers.modeling_tf_outputs.TFTokenClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for outputs of token classification models.

Parameters
  • loss (tf.Tensor of shape (n,), optional, where n is the number of unmasked labels, returned when labels is provided) – Classification loss.

  • logits (tf.Tensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFQuestionAnsweringModelOutput¶

class transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, start_logits: tensorflow.python.framework.ops.Tensor = None, end_logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for outputs of question answering models.

Parameters
  • loss (tf.Tensor of shape (batch_size, ), optional, returned when start_positions and end_positions are provided) – Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

  • start_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).

  • end_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).

  • hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TFSeq2SeqQuestionAnsweringModelOutput¶

class transformers.modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, start_logits: tensorflow.python.framework.ops.Tensor = None, end_logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]¶

Base class for outputs of sequence-to-sequence question answering models.

Parameters
  • loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

  • start_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).

  • end_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).

  • past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) –

    List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head)).

    Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxBaseModelOutput¶

class transformers.modeling_flax_outputs.FlaxBaseModelOutput(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for model’s outputs, with potential hidden states and attentions.

Parameters
  • last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxBaseModelOutputWithPast¶

class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPast(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Dict[str, jax._src.numpy.lax_numpy.ndarray]] = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for model’s outputs, with potential hidden states and attentions.

Parameters
  • last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • past_key_values (Dict[str, jnp.ndarray]) – Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast auto-regressive decoding. Pre-computed key and value hidden-states are of shape [batch_size, max_length].

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxBaseModelOutputWithPooling¶

class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, pooler_output: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for model’s outputs that also contains a pooling of the last hidden states.

Parameters
  • last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (jnp.ndarray of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxBaseModelOutputWithPastAndCrossAttentions¶

class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).

Parameters
  • last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) –

    Sequence of hidden-states at the output of the last layer of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and optionally if config.is_encoder_decoder=True 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

FlaxSeq2SeqModelOutput¶

class transformers.modeling_flax_outputs.FlaxSeq2SeqModelOutput(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, decoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, decoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_last_hidden_state: Optional[jax._src.numpy.lax_numpy.ndarray] = None, encoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for model encoder’s outputs that also contains : pre-computed hidden states that can speed up sequential decoding.

Parameters
  • last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) –

    Sequence of hidden-states at the output of the last layer of the decoder of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxCausalLMOutputWithCrossAttentions¶

class transformers.modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions(logits: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for causal language model (or autoregressive) outputs.

Parameters
  • logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of jnp.ndarray tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if model is used in encoder-decoder setting. Only relevant if config.is_decoder = True.

    Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

FlaxMaskedLMOutput¶

class transformers.modeling_flax_outputs.FlaxMaskedLMOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for masked language models outputs.

Parameters
  • logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxSeq2SeqLMOutput¶

class transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, decoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, decoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_last_hidden_state: Optional[jax._src.numpy.lax_numpy.ndarray] = None, encoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for sequence-to-sequence language models outputs.

Parameters
  • logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxNextSentencePredictorOutput¶

class transformers.modeling_flax_outputs.FlaxNextSentencePredictorOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for outputs of models predicting if two sentences are consecutive or not.

Parameters
  • logits (jnp.ndarray of shape (batch_size, 2)) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxSequenceClassifierOutput¶

class transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for outputs of sentence classification models.

Parameters
  • logits (jnp.ndarray of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxSeq2SeqSequenceClassifierOutput¶

class transformers.modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, decoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, decoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_last_hidden_state: Optional[jax._src.numpy.lax_numpy.ndarray] = None, encoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for outputs of sequence-to-sequence sentence classification models.

Parameters
  • logits (jnp.ndarray of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).

  • past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxMultipleChoiceModelOutput¶

class transformers.modeling_flax_outputs.FlaxMultipleChoiceModelOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for outputs of multiple choice models.

Parameters
  • logits (jnp.ndarray of shape (batch_size, num_choices)) –

    num_choices is the second dimension of the input tensors. (see input_ids above).

    Classification scores (before SoftMax).

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxTokenClassifierOutput¶

class transformers.modeling_flax_outputs.FlaxTokenClassifierOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for outputs of token classification models.

Parameters
  • logits (jnp.ndarray of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxQuestionAnsweringModelOutput¶

class transformers.modeling_flax_outputs.FlaxQuestionAnsweringModelOutput(start_logits: jax._src.numpy.lax_numpy.ndarray = None, end_logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for outputs of question answering models.

Parameters
  • start_logits (jnp.ndarray of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).

  • end_logits (jnp.ndarray of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

FlaxSeq2SeqQuestionAnsweringModelOutput¶

class transformers.modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput(start_logits: jax._src.numpy.lax_numpy.ndarray = None, end_logits: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, decoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, decoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_last_hidden_state: Optional[jax._src.numpy.lax_numpy.ndarray] = None, encoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

Base class for outputs of sequence-to-sequence question answering models.

Parameters
  • start_logits (jnp.ndarray of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).

  • end_logits (jnp.ndarray of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).

  • past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) –

    Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.