Model outputs
All models have outputs that are instances of subclasses of ModelOutput. Those are data structures containing all the information returned by the model, but that can also be used as tuples or dictionaries.
Let’s see how this looks in an example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(**inputs, labels=labels)
The outputs object is a SequenceClassifierOutput. As we can see in the documentation of that class below, this means it has an optional loss, a logits, an optional hidden_states and an optional attentions attribute. Here we have the loss since we passed along labels, but we don’t have hidden_states and attentions because we didn’t pass output_hidden_states=True or output_attentions=True.
You can access each attribute as you usually would, and if that attribute has not been returned by the model, you will get None. Here for instance, outputs.loss is the loss computed by the model, and outputs.attentions is None.
When considering our outputs object as a tuple, it only considers the attributes that don’t have None values. Here for instance, it has two elements, loss then logits, so outputs[:2] will return the tuple (outputs.loss, outputs.logits).
When considering our outputs object as a dictionary, it only considers the attributes that don’t have None values. Here for instance, it has two keys, loss and logits.
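The three views described above can be illustrated without downloading a model. The sketch below is a toy stand-in written with the standard library only: ToyOutput is a hypothetical class, not the library's ModelOutput, and it mimics only the behavior described in this section. Attribute access returns None for fields that were not set, while the tuple and dictionary views skip them.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ToyOutput:
    """Hypothetical stand-in for a model output; NOT the transformers class."""

    loss: Optional[float] = None
    logits: Optional[list] = None
    hidden_states: Optional[tuple] = None
    attentions: Optional[tuple] = None

    def _present(self) -> dict:
        # Keep only the fields that were actually set (not None), in order.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def keys(self) -> list:
        return list(self._present())

    def to_tuple(self) -> tuple:
        return tuple(self._present().values())

    def __getitem__(self, key: Any) -> Any:
        if isinstance(key, str):
            return self._present()[key]  # dictionary-style lookup
        return self.to_tuple()[key]  # tuple-style index or slice


outputs = ToyOutput(loss=0.25, logits=[0.1, -0.3])
print(outputs.attentions)  # None
print(outputs.keys())      # ['loss', 'logits']
print(outputs[:2])         # (0.25, [0.1, -0.3])
```

With a real model, the same three access patterns apply to the object returned by the forward pass.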
We document here the generic model outputs that are used by more than one model type. Specific output types are documented on their corresponding model page.
ModelOutput
Base class for all model outputs as dataclass. Has a __getitem__ that allows indexing by integer or slice (like a tuple) or strings (like a dictionary) that will ignore the None attributes. Otherwise behaves like a regular Python dictionary.
You can’t unpack a ModelOutput directly. Use the to_tuple() method to convert it to a tuple first.
to_tuple(): Convert self to a tuple containing all the attributes/keys that are not None.
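One plausible reason for this restriction, offered here as an assumption rather than something the text states: a ModelOutput otherwise behaves like a regular Python dictionary, and unpacking a dictionary yields its keys, not its values. The sketch below uses a plain dict as a stand-in to show the pitfall. With a real output, to_tuple() plays the role of the conversion in the last step.

```python
# A plain dict stands in for a model output here (illustration only).
outputs = {"loss": 0.25, "logits": [0.1, -0.3]}

# Unpacking a dict iterates over its keys, which is rarely what is wanted.
first, second = outputs
print(first, second)  # loss logits

# Converting to a tuple of values first gives the expected result.
loss, logits = tuple(outputs.values())
print(loss, logits)  # 0.25 [0.1, -0.3]
```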
BaseModelOutput
class transformers.modeling_outputs.BaseModelOutput(last_hidden_state: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs, with potential hidden states and attentions.
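As a rough illustration of the shapes listed above, the sketch below computes what one would expect for a hypothetical model. The helper expected_shapes and all the sizes are assumptions made up for this example, and nothing here calls the library.

```python
def expected_shapes(batch_size, sequence_length, hidden_size,
                    num_layers, num_heads, has_embedding_layer=True):
    """Return the shapes described for BaseModelOutput (hypothetical helper)."""
    hidden = (batch_size, sequence_length, hidden_size)
    attention = (batch_size, num_heads, sequence_length, sequence_length)
    # One entry per layer, plus one for the embeddings when they exist.
    num_hidden = num_layers + (1 if has_embedding_layer else 0)
    return {
        "last_hidden_state": hidden,
        "hidden_states": (hidden,) * num_hidden,
        "attentions": (attention,) * num_layers,
    }


shapes = expected_shapes(batch_size=1, sequence_length=8, hidden_size=768,
                         num_layers=12, num_heads=12)
print(shapes["last_hidden_state"])   # (1, 8, 768)
print(len(shapes["hidden_states"]))  # 13
print(shapes["attentions"][0])       # (1, 12, 8, 8)
```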
BaseModelOutputWithPooling
class transformers.modeling_outputs.BaseModelOutputWithPooling(last_hidden_state: FloatTensor = None, pooler_output: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that also contains a pooling of the last hidden states.
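The BERT-style pooling mentioned for pooler_output (a linear layer followed by tanh, applied to the first token) can be sketched with plain Python lists in place of tensors. The function toy_pooler, the identity weights and the zero bias are all made up for illustration, and the bias term itself is an assumption since the text only mentions a linear layer.

```python
import math


def toy_pooler(last_hidden_state, weight, bias):
    """Apply tanh(W @ h + b) to the first token of each sequence (toy version)."""
    pooled = []
    for sequence in last_hidden_state:
        first_token = sequence[0]  # hidden-state of the classification token
        pooled.append([
            math.tanh(sum(w * h for w, h in zip(row, first_token)) + b)
            for row, b in zip(weight, bias)
        ])
    return pooled


# Made-up values: batch_size=1, sequence_length=2, hidden_size=2.
last_hidden_state = [[[0.0, 1.0], [5.0, 5.0]]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, for readability
bias = [0.0, 0.0]

pooler_output = toy_pooler(last_hidden_state, weight, bias)
print(len(pooler_output), len(pooler_output[0]))  # 1 2, i.e. (batch_size, hidden_size)
```

Note how the sequence dimension disappears: only the first token contributes, which is why pooler_output has shape (batch_size, hidden_size).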
BaseModelOutputWithCrossAttentions
class transformers.modeling_outputs.BaseModelOutputWithCrossAttentions(last_hidden_state: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs, with potential hidden states and attentions.
BaseModelOutputWithPoolingAndCrossAttentions
class transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state: FloatTensor = None, pooler_output: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Base class for model’s outputs that also contains a pooling of the last hidden states.
BaseModelOutputWithPast
class transformers.modeling_outputs.BaseModelOutputWithPast(last_hidden_state: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).
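The effect of past_key_values on shapes can be traced with a small bookkeeping sketch. Everything below is hypothetical (the function decode_step_shapes and the sizes are made up for illustration) and it tracks only shapes, not tensors. It assumes the usual decoding loop in which, once a cache is passed in, only the newest token is fed to the model, so last_hidden_state has sequence length 1 while the cached keys and values grow by one position per step.

```python
def decode_step_shapes(batch_size, num_heads, head_dim, hidden_size,
                       cached_length, new_tokens):
    """Shapes after one forward pass with a key/value cache (toy model)."""
    total_length = cached_length + new_tokens
    return {
        # Hidden-states are produced only for the tokens fed in this step.
        "last_hidden_state": (batch_size, new_tokens, hidden_size),
        # Each layer caches one key and one value tensor of this shape.
        "past_key_value": (batch_size, num_heads, total_length, head_dim),
    }


# First pass: a 5-token prompt and an empty cache.
step = decode_step_shapes(2, 12, 64, 768, cached_length=0, new_tokens=5)
cached = step["past_key_value"][2]

# Later passes: one new token at a time, reusing the cache.
for _ in range(3):
    step = decode_step_shapes(2, 12, 64, 768, cached_length=cached, new_tokens=1)
    cached = step["past_key_value"][2]

print(step["last_hidden_state"])  # (2, 1, 768)
print(step["past_key_value"])     # (2, 12, 8, 64)
```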
BaseModelOutputWithPastAndCrossAttentions
class transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions(last_hidden_state: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).
Seq2SeqModelOutput
class transformers.modeling_outputs.Seq2SeqModelOutput(last_hidden_state: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model encoder’s outputs that also contains pre-computed hidden states that can speed up sequential decoding.
CausalLMOutput
class transformers.modeling_outputs.CausalLMOutput(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for causal language model (or autoregressive) outputs.
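Since logits are scores before SoftMax, turning them into next-token probabilities takes one extra step. The sketch below uses plain Python lists in place of tensors and a made-up three-token vocabulary. With real outputs, the same operation would be applied along the last (vocabulary) dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the last position of one sequence (vocab size 3).
last_position_logits = [2.0, 0.5, -1.0]
probabilities = softmax(last_position_logits)

# Greedy choice: the token id with the highest probability.
next_token_id = max(range(len(probabilities)), key=probabilities.__getitem__)

print(round(sum(probabilities), 6))  # 1.0
print(next_token_id)                 # 0
```

Because softmax preserves ordering, taking the argmax of the raw logits would select the same token. The probabilities matter when sampling or reporting confidence.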
CausalLMOutputWithCrossAttentions
class transformers.modeling_outputs.CausalLMOutputWithCrossAttentions(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of torch.FloatTensor tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if the model is used in an encoder-decoder setting. Only relevant if config.is_decoder = True. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Base class for causal language model (or autoregressive) outputs.
CausalLMOutputWithPast
class transformers.modeling_outputs.CausalLMOutputWithPast(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for causal language model (or autoregressive) outputs.
MaskedLMOutput
class transformers.modeling_outputs.MaskedLMOutput(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Masked language modeling (MLM) loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for masked language model outputs.
Seq2SeqLMOutput
class transformers.modeling_outputs.Seq2SeqLMOutput(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for sequence-to-sequence language model outputs.
NextSentencePredictorOutput
class transformers.modeling_outputs.NextSentencePredictorOutput(loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None)
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when next_sentence_label is provided): Next sequence prediction (classification) loss.
logits (torch.FloatTensor of shape (batch_size, 2)): Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of models predicting if two sentences are consecutive or not.
SequenceClassifierOutput
class transformers.modeling_outputs.SequenceClassifierOutput
( loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sentence classification models.
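As a rough illustration of how the logits field of a sequence classification output is typically consumed, the sketch below turns one row of scores into probabilities with a softmax, and reads a single score as a regression value when there is only one label. Plain Python lists stand in for tensors here, and the helper name interpret_logits is hypothetical; it is not part of the library.

```python
import math


def interpret_logits(row):
    """Interpret one row of logits of length num_labels (hypothetical helper)."""
    if len(row) == 1:
        # num_labels == 1: the single score is read as a regression value.
        return {"regression": row[0]}
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    probs = [e / total for e in exps]
    return {"label": probs.index(max(probs)), "probs": probs}


result = interpret_logits([-1.2, 0.3, 2.5])
print(result["label"], result["probs"])
```

With real model outputs the same steps would be applied to outputs.logits, usually through the framework's own softmax and argmax functions.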
Seq2SeqSequenceClassifierOutput
class transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput
( loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence sentence classification models.
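The nested layout of past_key_values can be hard to picture from the prose alone. The following sketch only enumerates the shapes described for that field, using plain tuples instead of tensors. The helper name expected_cache_shapes is hypothetical, and the example sizes are made up for illustration.

```python
def expected_cache_shapes(n_layers, batch_size, num_heads,
                          sequence_length, encoder_sequence_length,
                          embed_size_per_head):
    """Return the tensor shapes described for past_key_values (hypothetical helper)."""
    self_attn = (batch_size, num_heads, sequence_length, embed_size_per_head)
    cross_attn = (batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
    # Per layer: 2 self-attention tensors, then 2 additional cross-attention tensors.
    per_layer = (self_attn, self_attn, cross_attn, cross_attn)
    return tuple(per_layer for _ in range(n_layers))


shapes = expected_cache_shapes(n_layers=2, batch_size=1, num_heads=4,
                               sequence_length=3, encoder_sequence_length=7,
                               embed_size_per_head=16)
print(len(shapes), len(shapes[0]))
for shape in shapes[0]:
    print(shape)
```

The outer tuple has one entry per layer, and only the cross-attention tensors depend on the encoder sequence length.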
MultipleChoiceModelOutput
class transformers.modeling_outputs.MultipleChoiceModelOutput
( loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
logits (torch.FloatTensor of shape (batch_size, num_choices)) — Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of multiple choice models.
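Since the logits of a multiple choice output hold one score per candidate answer, a prediction is usually the index of the largest score in each row. The sketch below shows that step on nested Python lists standing in for a (batch_size, num_choices) tensor; pick_choices is a hypothetical helper name and the scores are invented.

```python
def pick_choices(logits):
    """Return the index of the highest-scoring choice for each example."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Two examples, four candidate answers each (made-up scores).
batch_logits = [[0.1, 2.3, -0.4, 0.9], [1.5, 0.2, 0.3, 1.4]]
print(pick_choices(batch_logits))
```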
TokenClassifierOutput
class transformers.modeling_outputs.TokenClassifierOutput
( loss: typing.Optional[torch.FloatTensor] = None, logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of token classification models.
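Token classification logits carry one score vector per token, so decoding means taking the best label at every position. The sketch below walks a (batch_size, sequence_length, num_labels) structure built from plain lists. The helper decode_token_labels and the id2label mapping are made up for illustration; a real mapping would normally come from the model configuration.

```python
def decode_token_labels(logits, id2label):
    """Map each token's score vector to its highest-scoring label name."""
    decoded = []
    for sequence in logits:        # iterate over batch_size
        tags = []
        for scores in sequence:    # iterate over sequence_length
            best = max(range(len(scores)), key=scores.__getitem__)
            tags.append(id2label[best])
        decoded.append(tags)
    return decoded


id2label = {0: "O", 1: "B-PER", 2: "I-PER"}  # hypothetical label set
logits = [[[2.0, 0.1, 0.0], [0.2, 3.1, 0.4], [0.1, 0.3, 1.9]]]
print(decode_token_labels(logits, id2label))
```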
QuestionAnsweringModelOutput
class transformers.modeling_outputs.QuestionAnsweringModelOutput
( loss: typing.Optional[torch.FloatTensor] = None, start_logits: FloatTensor = None, end_logits: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss, computed as the sum of the cross-entropy losses for the start and end positions.
start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of question answering models.
Seq2SeqQuestionAnsweringModelOutput
class transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput
( loss: typing.Optional[torch.FloatTensor] = None, start_logits: FloatTensor = None, end_logits: FloatTensor = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss, computed as the sum of the cross-entropy losses for the start and end positions.
start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence question answering models.
TFBaseModelOutput
class transformers.modeling_tf_outputs.TFBaseModelOutput
( last_hidden_state: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs, with potential hidden states and attentions.
TFBaseModelOutputWithPooling
class transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling
( last_hidden_state: Tensor = None, pooler_output: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. This output is usually not a good summary of the semantic content of the input; you are often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that also contains a pooling of the last hidden states.
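The pooler_output description suggests averaging the sequence of hidden-states instead of relying on the pooled first token. A minimal sketch of that averaging is shown below on nested Python lists standing in for last_hidden_state. The attention_mask argument (1 for real tokens, 0 for padding) and the helper name mean_pool are assumptions introduced for illustration.

```python
def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per example, skipping positions where the mask is 0."""
    pooled = []
    for tokens, mask in zip(last_hidden_state, attention_mask):
        hidden_size = len(tokens[0])
        kept = [vec for vec, keep in zip(tokens, mask) if keep]
        count = max(len(kept), 1)  # avoid dividing by zero for an all-padding row
        pooled.append([sum(vec[i] for vec in kept) / count
                       for i in range(hidden_size)])
    return pooled


# batch_size=1, sequence_length=3, hidden_size=2; the last token is padding.
hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
print(mean_pool(hidden, mask))
```

The result has shape (batch_size, hidden_size), the same as pooler_output, so it can be used as a drop-in sentence representation.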
TFBaseModelOutputWithPoolingAndCrossAttentions
class transformers.modeling_tf_outputs.TFBaseModelOutputWithPoolingAndCrossAttentions
( last_hidden_state: Tensor = None, pooler_output: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. This output is usually not a good summary of the semantic content of the input; you are often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs that also contains a pooling of the last hidden states.
TFBaseModelOutputWithPast
class transformers.modeling_tf_outputs.TFBaseModelOutputWithPast
( last_hidden_state: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).
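To make the interaction between last_hidden_state and past_key_values more concrete, the sketch below tracks only the shapes involved across two decoding steps. It assumes that the cache's sequence_length dimension covers every token processed so far, which is an assumption made for this illustration rather than something stated above. The helper decode_step_shapes and all sizes are hypothetical.

```python
def decode_step_shapes(cached_len, new_tokens, batch_size, num_heads,
                       embed_size_per_head, hidden_size, n_layers):
    """Return (last_hidden_state shape, list of per-layer cache shapes)."""
    # Assumption: the cache grows to cover cached plus newly processed tokens.
    total_len = cached_len + new_tokens
    last_hidden_state = (batch_size, new_tokens, hidden_size)
    layer_cache = (2, batch_size, num_heads, total_len, embed_size_per_head)
    return last_hidden_state, [layer_cache for _ in range(n_layers)]


sizes = dict(batch_size=1, num_heads=2, embed_size_per_head=4,
             hidden_size=8, n_layers=3)
# First call: a 5-token prompt and no cache yet.
hidden, cache = decode_step_shapes(cached_len=0, new_tokens=5, **sizes)
print(hidden, cache[0])
# Second call: one new token fed together with the cache from the first call.
hidden, cache = decode_step_shapes(cached_len=5, new_tokens=1, **sizes)
print(hidden, cache[0])
```

In the second call only one position is returned in last_hidden_state, matching the (batch_size, 1, hidden_size) shape mentioned for the case where past_key_values is used.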
TFBaseModelOutputWithPastAndCrossAttentions
class transformers.modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions
( last_hidden_state: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).
TFSeq2SeqModelOutput
class transformers.modeling_tf_outputs.TFSeq2SeqModelOutput
( last_hidden_state: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model encoder’s outputs that also contains pre-computed hidden states that can speed up sequential decoding.
TFCausalLMOutput
class transformers.modeling_tf_outputs.TFCausalLMOutput
( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters

loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for causal language model (or autoregressive) outputs.
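The loss shape (n,) means one value per non-masked label rather than a single averaged scalar. The sketch below illustrates that idea with a per-token cross-entropy over plain lists, where the logits and labels are assumed to be already flattened and aligned position by position. The mask value -100 and the helper per_token_loss are assumptions made for this example, not taken from the text above.

```python
import math

IGNORE_INDEX = -100  # assumed marker for masked labels


def per_token_loss(logits, labels):
    """Return one cross-entropy value per label that is not masked."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked positions contribute no loss entry
        # log-sum-exp with max subtraction for numerical stability
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[label])
    return losses


logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2], [0.0, 3.0, 0.0]]
labels = [0, IGNORE_INDEX, 1]
losses = per_token_loss(logits, labels)
print(len(losses), losses)
```

Three positions go in but only two loss values come out, since the masked position is dropped; averaging the list would give the usual scalar loss.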
TFCausalLMOutputWithCrossAttentions
class transformers.modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions
( loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None )
Parameters

loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Base class for causal language model (or autoregressive) outputs.
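As a rough illustration of the "returned when" conditions in the parameter list for TFCausalLMOutputWithCrossAttentions, the sketch below maps call-time options to the fields that would be populated. The helper returned_fields and its boolean arguments are hypothetical and only mirror the documented conditions; this is not how the library builds the output object.

```python
# Field order follows the class signature of TFCausalLMOutputWithCrossAttentions.
FIELD_ORDER = [
    "loss", "logits", "past_key_values",
    "hidden_states", "attentions", "cross_attentions",
]


def returned_fields(labels=False, use_cache=False,
                    output_hidden_states=False, output_attentions=False):
    """Return the names of the fields that would not be None (hypothetical helper)."""
    present = {"logits"}  # logits carries no "returned when" condition
    if labels:
        present.add("loss")
    if use_cache:
        present.add("past_key_values")
    if output_hidden_states:
        present.add("hidden_states")
    if output_attentions:
        # Both attention tuples are documented under the same flag.
        present.update({"attentions", "cross_attentions"})
    return [name for name in FIELD_ORDER if name in present]


print(returned_fields(labels=True, output_attentions=True))
```

Fields left out of the returned list correspond to attributes that would be None on the output.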
TFCausalLMOutputWithPast
class transformers.modeling_tf_outputs.TFCausalLMOutputWithPast
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided): Language modeling loss (for next-token prediction).

logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for causal language model (or autoregressive) outputs.
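To make the past_key_values layout concrete, here is a small sketch that tracks only the shapes of the cache during decoding. It assumes the cached sequence_length grows by one for each generated token, which is the usual behaviour of key/value caching but is not stated on this page. The helper names past_shape and decode_shapes are made up for illustration.

```python
def past_shape(batch_size, num_heads, seq_len, head_dim):
    """Documented per-layer shape; the leading 2 stacks keys and values."""
    return (2, batch_size, num_heads, seq_len, head_dim)


def decode_shapes(n_layers, batch_size, num_heads, head_dim, prompt_len, new_tokens):
    """Per-layer cache shapes after the prompt and after each generated token."""
    history = []
    for step in range(new_tokens + 1):
        seq_len = prompt_len + step  # assumption: one cached position per token
        history.append([past_shape(batch_size, num_heads, seq_len, head_dim)] * n_layers)
    return history


shapes = decode_shapes(n_layers=2, batch_size=1, num_heads=4,
                       head_dim=8, prompt_len=5, new_tokens=3)
for step, per_layer in enumerate(shapes):
    print(step, per_layer[0])
```

Each entry of shapes is a list with one shape per layer, matching the "list of length config.n_layers" wording in the parameter description.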
TFMaskedLMOutput
class transformers.modeling_tf_outputs.TFMaskedLMOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided): Masked language modeling (MLM) loss.

logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for masked language model outputs.
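The loss is documented as having shape (n,), with n the number of non-masked labels. A minimal pure-Python sketch of that idea follows: one cross-entropy value per label that is kept. Marking excluded positions with -100 is an assumption made for this example, and per_token_loss is a hypothetical helper, not a library function.

```python
import math

IGNORE_INDEX = -100  # assumption: sentinel for positions excluded from the loss


def log_softmax(row):
    """Numerically stable log-softmax over one row of vocabulary scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def per_token_loss(logits, labels):
    """logits: [seq_len][vocab_size]; labels: [seq_len]. One loss per kept label."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # skipped positions contribute no entry
        losses.append(-log_softmax(row)[label])
    return losses


logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]  # made-up scores, vocab_size = 2
labels = [IGNORE_INDEX, 0, 1]
losses = per_token_loss(logits, labels)
print(len(losses))
```

Here three positions are scored but only two carry a label, so the result has two entries.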
TFSeq2SeqLMOutput
class transformers.modeling_tf_outputs.TFSeq2SeqLMOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided): Language modeling loss.

logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.

encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for sequence-to-sequence language model outputs.
TFNextSentencePredictorOutput
class transformers.modeling_tf_outputs.TFNextSentencePredictorOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when next_sentence_label is provided): Next sentence prediction loss.

logits (tf.Tensor of shape (batch_size, 2)): Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for outputs of models predicting if two sentences are consecutive or not.
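Because the logits are scores before SoftMax, turning them into probabilities is left to the caller. The sketch below applies a softmax to each row of a (batch_size, 2) list of made-up values. Which of the two columns means "is a continuation" depends on the model and is not specified here, so the example deliberately avoids naming them.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits of shape (batch_size, 2) for two sentence pairs.
logits = [[2.0, 0.0], [-1.0, 1.0]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 3) for p in row])
```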
TFSequenceClassifierOutput
class transformers.modeling_tf_outputs.TFSequenceClassifierOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (batch_size,), optional, returned when labels is provided): Classification (or regression if config.num_labels==1) loss.

logits (tf.Tensor of shape (batch_size, config.num_labels)): Classification (or regression if config.num_labels==1) scores (before SoftMax).

hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for outputs of sentence classification models.
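The logits serve two purposes depending on config.num_labels: class scores when there are several labels, or a single regression value when num_labels is 1. A small sketch of how a caller might decode one row in each case is given below. The function decode is hypothetical and the values are invented.

```python
def decode(logits_row, num_labels):
    """Turn one row of logits into a prediction (hypothetical helper)."""
    if num_labels == 1:
        # Regression: the single score is the prediction itself.
        return logits_row[0]
    # Classification: index of the highest score (softmax would not change it).
    return max(range(num_labels), key=lambda i: logits_row[i])


class_prediction = decode([0.1, 2.0, -1.0], num_labels=3)
regression_prediction = decode([0.73], num_labels=1)
print(class_prediction, regression_prediction)
```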
TFSeq2SeqSequenceClassifierOutput
class transformers.modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (1,), optional, returned when labels is provided): Classification (or regression if config.num_labels==1) loss.

logits (tf.Tensor of shape (batch_size, config.num_labels)): Classification (or regression if config.num_labels==1) scores (before SoftMax).

past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.

encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for outputs of sequence-to-sequence sentence classification models.
TFMultipleChoiceModelOutput
class transformers.modeling_tf_outputs.TFMultipleChoiceModelOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (batch_size,), optional, returned when labels is provided): Classification loss.

logits (tf.Tensor of shape (batch_size, num_choices)): Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids above).

hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for outputs of multiple choice models.
TFTokenClassifierOutput
class transformers.modeling_tf_outputs.TFTokenClassifierOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, logits: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (n,), optional, where n is the number of unmasked labels, returned when labels is provided): Classification loss.

logits (tf.Tensor of shape (batch_size, sequence_length, config.num_labels)): Classification scores (before SoftMax).

hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for outputs of token classification models.
TFQuestionAnsweringModelOutput
class transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, start_logits: Tensor = None, end_logits: Tensor = None, hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (batch_size,), optional, returned when start_positions and end_positions are provided): Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

start_logits (tf.Tensor of shape (batch_size, sequence_length)): Span-start scores (before SoftMax).

end_logits (tf.Tensor of shape (batch_size, sequence_length)): Span-end scores (before SoftMax).

hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for outputs of question answering models.
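One common way to use start_logits and end_logits is to pick the span whose start score plus end score is largest, with the end not before the start. The sketch below does this for a single example in plain Python. The helper best_span and its max_answer_len cap are assumptions for illustration and are not part of the output class.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start + end scores with end >= start."""
    best = (None, None, float("-inf"))
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Made-up scores for one sequence of four tokens.
start_logits = [0.1, 3.0, 0.2, 0.0]
end_logits = [0.0, 0.5, 2.5, 4.0]

print(best_span(start_logits, end_logits))
print(best_span(start_logits, end_logits, max_answer_len=2))
```

The cap on span length is a design choice that keeps very long, unlikely answers out of consideration; dropping it simply searches all valid pairs.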
TFSeq2SeqQuestionAnsweringModelOutput
class transformers.modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput
(loss: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, start_logits: Tensor = None, end_logits: Tensor = None, past_key_values: typing.Optional[typing.List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: typing.Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: typing.Optional[typing.Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Parameters

loss (tf.Tensor of shape (1,), optional, returned when labels is provided): Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

start_logits (tf.Tensor of shape (batch_size, sequence_length)): Span-start scores (before SoftMax).

end_logits (tf.Tensor of shape (batch_size, sequence_length)): Span-end scores (before SoftMax).

past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.

encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for outputs of sequence-to-sequence question answering models.
FlaxBaseModelOutput
class transformers.modeling_flax_outputs.FlaxBaseModelOutput
(last_hidden_state: ndarray = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None)
Parameters

last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.

hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for model’s outputs, with potential hidden states and attentions.
Returns a new object replacing the specified fields with new values.
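The sentence about returning a new object with replaced fields describes copy-on-update behaviour: the original output is left untouched and a modified copy is produced. The sketch below shows the same pattern with the standard library's dataclasses.replace on a frozen dataclass. ToyOutput is a hypothetical stand-in with plain tuples instead of arrays, and the Flax classes may implement this differently.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyOutput:
    """Hypothetical stand-in for an output class; not part of transformers."""
    last_hidden_state: Tuple[float, ...]
    hidden_states: Optional[Tuple[Tuple[float, ...], ...]] = None


original = ToyOutput(last_hidden_state=(0.1, 0.2))

# Build a copy with one field changed; `original` stays as it was.
updated = replace(original, hidden_states=((0.0, 0.0), (0.1, 0.2)))

print(original.hidden_states)
print(updated.hidden_states)
```

Immutable outputs with copy-on-update fit functional code, where values are passed through transformations rather than mutated in place.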
FlaxBaseModelOutputWithPast
class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPast
(last_hidden_state: ndarray = None, past_key_values: typing.Union[typing.Dict[str, jax._src.numpy.ndarray.ndarray], NoneType] = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None)
Parameters

last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.

past_key_values (Dict[str, jnp.ndarray]): Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast auto-regressive decoding. Pre-computed key and value hidden-states are of shape [batch_size, max_length].

hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for model’s outputs, with potential hidden states and attentions.
Returns a new object replacing the specified fields with new values.
FlaxBaseModelOutputWithPooling
class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling
(last_hidden_state: ndarray = None, pooler_output: ndarray = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None)
Parameters

last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.

pooler_output (jnp.ndarray of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for model’s outputs that also contains a pooling of the last hidden states.
Returns a new object replacing the specified fields with new values.
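The pooler_output description (first token's last-layer hidden state, passed through a Linear layer and a Tanh) can be sketched in a few lines of plain Python for a single sequence. The function pool, the tiny 2-dimensional hidden size and the identity weights are all invented for the example and say nothing about real trained weights.

```python
import math


def pool(last_hidden_state, weight, bias):
    """Pool one sequence.

    last_hidden_state: [sequence_length][hidden_size]
    weight: [hidden_size][hidden_size], bias: [hidden_size]
    """
    first_token = last_hidden_state[0]  # classification token position
    return [
        math.tanh(sum(w * x for w, x in zip(row, first_token)) + b)
        for row, b in zip(weight, bias)
    ]


hidden = [[0.5, -1.0], [9.0, 9.0]]  # two tokens, hidden_size = 2
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled = pool(hidden, identity, [0.0, 0.0])
print(pooled)
```

Only the first row of hidden influences the result, which is why the second token's large values have no effect on pooled.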
FlaxBaseModelOutputWithPastAndCrossAttentions
class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions
(last_hidden_state: ndarray = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None, hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None, cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None)
Parameters

last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True): Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Base class for model’s outputs that may also contain a past key/values (to speed up sequential decoding).
Returns a new object replacing the specified fields with new values.
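The nested past_key_values structure described for FlaxBaseModelOutputWithPastAndCrossAttentions is easier to see when written out. The sketch below builds only the shapes: one inner tuple per layer, holding two self-attention entries and, for encoder-decoder models, two more cross-attention entries. The helper past_key_values_shapes is hypothetical, and the key-then-value order inside each pair is an assumption.

```python
def past_key_values_shapes(n_layers, batch_size, num_heads, seq_len, head_dim,
                           encoder_seq_len=None):
    """Shapes of the nested cache; pass encoder_seq_len for encoder-decoder models."""
    self_attn = (batch_size, num_heads, seq_len, head_dim)
    layer = (self_attn, self_attn)  # assumed order: key, value
    if encoder_seq_len is not None:
        cross = (batch_size, num_heads, encoder_seq_len, head_dim)
        layer += (cross, cross)  # cross-attention key, value
    return (layer,) * n_layers


decoder_only = past_key_values_shapes(2, 1, 4, 7, 16)
encoder_decoder = past_key_values_shapes(2, 1, 4, 7, 16, encoder_seq_len=11)

print(len(decoder_only), len(decoder_only[0]))
print(len(encoder_decoder), len(encoder_decoder[0]))
```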
FlaxSeq2SeqModelOutput
class transformers.modeling_flax_outputs.FlaxSeq2SeqModelOutput
< source >( last_hidden_state: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None decoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_last_hidden_state: typing.Optional[jax._src.numpy.ndarray.ndarray] = None encoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters
last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for model encoder’s outputs that also contains precomputed hidden states that can speed up sequential decoding.
Returns a new object replacing the specified fields with new values.
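The "returns a new object" behaviour described above can be illustrated with the standard library. This is only a sketch: `ToyOutput` is a hypothetical stand-in, not one of the Flax output classes, and `dataclasses.replace` is used here on the assumption that the documented method follows the same copy-and-update semantics.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyOutput:
    """Hypothetical stand-in for an output class (not the real Flax type)."""

    logits: Tuple[float, ...]
    hidden_states: Optional[tuple] = None


original = ToyOutput(logits=(0.1, 0.9))

# Build a modified copy; the original instance is left untouched.
updated = replace(original, logits=(0.5, 0.5))
```

Fields that are not named in the call, such as `hidden_states` here, are carried over unchanged.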
FlaxCausalLMOutputWithCrossAttentions
class transformers.modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions
< source >( logits: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of jnp.ndarray tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if model is used in encoder-decoder setting. Only relevant if config.is_decoder = True. Contains precomputed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Base class for causal language model (or autoregressive) outputs.
Returns a new object replacing the specified fields with new values.
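Because the logits are scores before SoftMax, a common way to use them is to normalize the last position and pick the most likely token. The following is a minimal sketch under stated assumptions: nested Python lists stand in for a jnp.ndarray of shape (batch_size, sequence_length, vocab_size), and `softmax` and `greedy_next_token` are illustrative helpers, not library functions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_next_token(logits):
    """Return the most likely next-token id for every sequence in the batch.

    logits: nested list shaped (batch_size, sequence_length, vocab_size).
    """
    next_ids = []
    for sequence in logits:
        probs = softmax(sequence[-1])  # only the last position predicts the next token
        next_ids.append(max(range(len(probs)), key=probs.__getitem__))
    return next_ids


# Toy batch: 1 sequence, 2 positions, vocabulary of 3 tokens.
toy_logits = [[[0.0, 1.0, 0.5], [2.0, -1.0, 0.0]]]
next_token_ids = greedy_next_token(toy_logits)
```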
FlaxMaskedLMOutput
class transformers.modeling_flax_outputs.FlaxMaskedLMOutput
< source >( logits: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for masked language models outputs.
Returns a new object replacing the specified fields with new values.
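For masked language modeling, the scores of interest usually sit at the masked position. As a rough sketch, assuming one example's logits are given as a nested list shaped (sequence_length, vocab_size) and that the mask position is already known, the top candidates can be ranked like this (`top_k_at_position` is a hypothetical helper):

```python
def top_k_at_position(logits, position, k=2):
    """Return the k highest-scoring vocabulary ids at one sequence position.

    logits: nested list shaped (sequence_length, vocab_size) for one example.
    """
    scores = logits[position]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: 3 positions, vocabulary of 4 tokens, mask assumed at position 1.
toy_logits = [
    [0.1, 0.2, 0.3, 0.0],
    [1.5, -0.5, 2.5, 0.7],
    [0.0, 0.0, 0.0, 0.0],
]
candidates = top_k_at_position(toy_logits, position=1, k=2)
```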
FlaxSeq2SeqLMOutput
class transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput
< source >( logits: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None decoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_last_hidden_state: typing.Optional[jax._src.numpy.ndarray.ndarray] = None encoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for sequence-to-sequence language models outputs.
Returns a new object replacing the specified fields with new values.
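The nested layout of past_key_values can be easier to follow as a table of shapes. The sketch below only computes the shapes the description implies; `expected_cache_shapes` is a hypothetical helper, and the ordering inside each layer tuple (self-attention key, self-attention value, cross-attention key, cross-attention value) is an assumption, since the text only states that there are two tensors of each shape.

```python
def expected_cache_shapes(n_layers, batch_size, num_heads,
                          sequence_length, encoder_sequence_length,
                          embed_size_per_head):
    """Return one tuple of four shapes per decoder layer."""
    self_attn = (batch_size, num_heads, sequence_length, embed_size_per_head)
    cross_attn = (batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
    # Assumed order: self key, self value, cross key, cross value.
    per_layer = (self_attn, self_attn, cross_attn, cross_attn)
    return tuple(per_layer for _ in range(n_layers))


shapes = expected_cache_shapes(
    n_layers=2, batch_size=1, num_heads=4,
    sequence_length=5, encoder_sequence_length=7, embed_size_per_head=16,
)
```

The self-attention entries grow with the number of decoded tokens, while the cross-attention entries depend only on the encoder input length, which is why they can be computed once and reused.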
FlaxNextSentencePredictorOutput
class transformers.modeling_flax_outputs.FlaxNextSentencePredictorOutput
< source >( logits: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

logits (jnp.ndarray of shape (batch_size, 2)) — Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of models predicting if two sentences are consecutive or not.
Returns a new object replacing the specified fields with new values.
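The two scores per example can be turned into a probability with a softmax. This is a sketch under an explicit assumption: index 0 is treated as "the second sentence is a continuation", which is the BERT convention but is not stated in the text above, so check the model page before relying on it. `continuation_probability` is a hypothetical helper and plain lists stand in for a jnp.ndarray of shape (batch_size, 2).

```python
import math


def continuation_probability(logits, continuation_index=0):
    """Softmax each pair of scores and return P(continuation) per example.

    continuation_index is an assumption about the label layout.
    """
    probabilities = []
    for pair in logits:
        highest = max(pair)
        exps = [math.exp(score - highest) for score in pair]
        probabilities.append(exps[continuation_index] / sum(exps))
    return probabilities


toy_logits = [[2.0, 0.0], [0.0, 0.0]]
probs = continuation_probability(toy_logits)
```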
FlaxSequenceClassifierOutput
class transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput
< source >( logits: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

logits (jnp.ndarray of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sentence classification models.
Returns a new object replacing the specified fields with new values.
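The distinction between classification and regression in the logits description can be sketched as follows. `predict` is a hypothetical helper, and nested lists stand in for a jnp.ndarray of shape (batch_size, config.num_labels).

```python
def predict(logits, num_labels):
    """Interpret sequence-level logits.

    With num_labels == 1 the single score is read as a regression value;
    otherwise the index of the highest score is the predicted class.
    """
    if num_labels == 1:
        return [row[0] for row in logits]
    return [max(range(num_labels), key=row.__getitem__) for row in logits]


class_ids = predict([[0.2, 1.3, -0.4], [2.0, 0.1, 0.3]], num_labels=3)
regression_values = predict([[0.75], [-1.5]], num_labels=1)
```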
FlaxSeq2SeqSequenceClassifierOutput
class transformers.modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput
< source >( logits: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None decoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_last_hidden_state: typing.Optional[jax._src.numpy.ndarray.ndarray] = None encoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

logits (jnp.ndarray of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence sentence classification models.
Returns a new object replacing the specified fields with new values.
FlaxMultipleChoiceModelOutput
class transformers.modeling_flax_outputs.FlaxMultipleChoiceModelOutput
< source >( logits: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

logits (jnp.ndarray of shape (batch_size, num_choices)) — num_choices is the second dimension of the input tensors (see input_ids above). Classification scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of multiple choice models.
Returns a new object replacing the specified fields with new values.
FlaxTokenClassifierOutput
class transformers.modeling_flax_outputs.FlaxTokenClassifierOutput
< source >( logits: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

logits (jnp.ndarray of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of token classification models.
Returns a new object replacing the specified fields with new values.
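Token classification logits carry one row of label scores per token, so decoding is a per-token argmax followed by a label lookup. A minimal sketch for one example, assuming logits are given as a nested list shaped (sequence_length, config.num_labels); the `id2label` mapping below is made up for illustration and `label_tokens` is a hypothetical helper.

```python
# Hypothetical label mapping; real models ship their own in the config.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}


def label_tokens(logits):
    """Map each token's highest-scoring label id to its label name."""
    labels = []
    for token_scores in logits:
        best_id = max(range(len(token_scores)), key=token_scores.__getitem__)
        labels.append(id2label[best_id])
    return labels


toy_logits = [
    [2.0, 0.1, 0.0],
    [0.0, 3.0, 0.5],
    [0.1, 0.2, 1.9],
]
predicted_labels = label_tokens(toy_logits)
```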
FlaxQuestionAnsweringModelOutput
class transformers.modeling_flax_outputs.FlaxQuestionAnsweringModelOutput
< source >( start_logits: ndarray = None end_logits: ndarray = None hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

start_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
end_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of question answering models.
Returns a new object replacing the specified fields with new values.
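Taking the argmax of start_logits and end_logits independently can yield an end position before the start. One common workaround, sketched here for a single example, is to score every valid (start, end) pair and keep the best one. `best_span` and its `max_answer_len` parameter are hypothetical, and plain lists stand in for one row of each jnp.ndarray.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start + end score with end >= start."""
    best_start, best_end, best_score = None, None, float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_start, best_end, best_score = start, end, score
    return best_start, best_end, best_score


# Independent argmax would pick start=2 and end=0, which is not a valid span.
toy_start = [0.1, 0.2, 3.0, 0.0]
toy_end = [2.5, 0.0, 1.0, 0.4]
span = best_span(toy_start, toy_end)
```

The exhaustive search is quadratic in sequence length, which is acceptable for a sketch but may need a top-k restriction on longer inputs.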
FlaxSeq2SeqQuestionAnsweringModelOutput
class transformers.modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput
< source >( start_logits: ndarray = None end_logits: ndarray = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[jax._src.numpy.ndarray.ndarray]]] = None decoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None decoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None cross_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_last_hidden_state: typing.Optional[jax._src.numpy.ndarray.ndarray] = None encoder_hidden_states: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None encoder_attentions: typing.Optional[typing.Tuple[jax._src.numpy.ndarray.ndarray]] = None )
Parameters

start_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
end_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for outputs of sequence-to-sequence question answering models.
Returns a new object replacing the specified fields with new values.