Model outputs
PyTorch models have outputs that are instances of subclasses of ModelOutput. Those are data structures containing all the information returned by the model, but that can also be used as tuples or dictionaries.
Let's see how this looks in an example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)
The outputs object is a SequenceClassifierOutput. As we can see in the documentation of that class below, this means it has an optional loss, a logits, an optional hidden_states and an optional attentions attribute. Here we have the loss since we passed along labels, but we don't have hidden_states and attentions because we didn't pass output_hidden_states=True or output_attentions=True.
You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you will get None. Here for instance outputs.loss is the loss computed by the model, and outputs.attentions is None.
When considering our outputs object as a tuple, only the attributes that don't have None values are taken into account. Here, for instance, it has two elements, loss then logits, so outputs[:2] will return the tuple (outputs.loss, outputs.logits).
When considering our outputs object as a dictionary, only the attributes that don't have None values are taken into account. Here, for instance, it has two keys: loss and logits.
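The three access styles described above can be imitated with a small plain-Python stand-in. This is only a sketch of the described behaviour, not the library's implementation: ToyClassifierOutput and its helper _present are hypothetical names, and plain floats and lists replace tensors.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional, Tuple


@dataclass
class ToyClassifierOutput:
    """Hypothetical stand-in for a model output; not the transformers class."""

    loss: Optional[float] = None
    logits: Optional[list] = None
    hidden_states: Optional[tuple] = None
    attentions: Optional[tuple] = None

    def _present(self) -> dict:
        # Keep fields in declaration order, dropping those left as None.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def keys(self) -> list:
        return list(self._present())

    def to_tuple(self) -> Tuple[Any, ...]:
        return tuple(self._present().values())

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._present()[key]  # dictionary-style access
        return self.to_tuple()[key]  # tuple-style access (int or slice)


# Only loss and logits are set, as in the example above.
outputs = ToyClassifierOutput(loss=0.7, logits=[[0.1, -0.2]])
first_two = outputs[:2]          # tuple view skips the None fields
available_keys = outputs.keys()  # dictionary view skips them too
missing = outputs.attentions     # attribute access still works and gives None
```

With these toy values, the tuple view and the dictionary view both contain exactly two entries, loss then logits, while attribute access on a field that was not returned yields None.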
We document here the generic model outputs that are used by more than one model type. Specific output types are documented on their corresponding model page.
ModelOutput

class transformers.file_utils.ModelOutput
Base class for all model outputs as dataclass. Has a __getitem__ that allows indexing by integer or slice (like a tuple) or strings (like a dictionary) that will ignore the None attributes. Otherwise behaves like a regular Python dictionary.
Warning
You can't unpack a ModelOutput directly. Use the to_tuple() method to convert it to a tuple first.
pop(k[, d]) → v
Remove specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised.
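One plausible reading of the warning, given that the class otherwise behaves like a regular dictionary, is that unpacking iterates over keys rather than values. The source does not state the reason, so the sketch below is an assumption. It uses a plain dict as a stand-in and converts its values to a tuple by hand, in the role that to_tuple() plays on a real output.

```python
# Plain dict standing in for a model output with two non-None fields.
outputs = {"loss": 0.7, "logits": [[0.1, -0.2]]}

# Unpacking a dictionary iterates over its keys, not its values.
first, second = outputs

# Converting to a tuple of values first gives the intended result.
loss, logits = tuple(outputs.values())
```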

BaseModelOutput

class transformers.modeling_outputs.BaseModelOutput(last_hidden_state: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs, with potential hidden states and attentions.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
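To make the tuple lengths and shapes listed above concrete, here is a small bookkeeping sketch in plain Python. The function expected_shapes and the sizes passed to it (12 layers, 12 heads, hidden size 768) are illustrative assumptions, not values taken from any particular model.

```python
def expected_shapes(batch_size, sequence_length, hidden_size, num_layers, num_heads):
    """Shapes a BaseModelOutput would carry, following the parameter list above."""
    hidden_state = (batch_size, sequence_length, hidden_size)
    attention = (batch_size, num_heads, sequence_length, sequence_length)
    return {
        "last_hidden_state": hidden_state,
        # One entry for the embedding output plus one per layer.
        "hidden_states": (hidden_state,) * (num_layers + 1),
        # One entry per layer.
        "attentions": (attention,) * num_layers,
    }


shapes = expected_shapes(
    batch_size=1, sequence_length=8, hidden_size=768, num_layers=12, num_heads=12
)
```

Under these assumed sizes, hidden_states has 13 entries and attentions has 12, which is a quick sanity check to apply to a real output.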
BaseModelOutputWithPooling

class transformers.modeling_outputs.BaseModelOutputWithPooling(last_hidden_state: torch.FloatTensor = None, pooler_output: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs that also contains a pooling of the last hidden states.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
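The pooler_output description above amounts to applying tanh(W h + b) to the first token's last-layer hidden-state. The following plain-Python sketch spells that out with nested lists instead of tensors. The function name pool_first_token and the toy weight and bias values are assumptions for illustration; a real model uses its trained Linear layer.

```python
import math


def pool_first_token(last_hidden_state, weight, bias):
    """Apply tanh(W @ h_first + b) to the first token of each sequence."""
    pooled = []
    for sequence in last_hidden_state:
        first = sequence[0]  # hidden-state of the classification token
        pooled.append(
            [
                math.tanh(sum(w * h for w, h in zip(row, first)) + b)
                for row, b in zip(weight, bias)
            ]
        )
    return pooled


# Batch of one sequence with two tokens and hidden_size 2 (toy values).
last_hidden_state = [[[0.0, 1.0], [5.0, 5.0]]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, chosen for readability
bias = [0.0, 0.0]
pooler_output = pool_first_token(last_hidden_state, weight, bias)
```

The result has shape (batch_size, hidden_size), and only the first token contributes to it: the second token's values never enter the computation.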
BaseModelOutputWithCrossAttentions

class transformers.modeling_outputs.BaseModelOutputWithCrossAttentions(last_hidden_state: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs, with potential hidden states and attentions.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
BaseModelOutputWithPoolingAndCrossAttentions

class transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state: torch.FloatTensor = None, pooler_output: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs that also contains a pooling of the last hidden states.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
BaseModelOutputWithPast

class transformers.modeling_outputs.BaseModelOutputWithPast(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[List[torch.FloatTensor]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (List[torch.FloatTensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of torch.FloatTensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
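Why cached keys and values speed up sequential decoding can be seen in a toy single-head attention written in plain Python. Everything here is an illustrative assumption rather than the library's code: project is a made-up projection that doubles a vector, the same projection serves as both key and value, and project_log only counts how often it runs.

```python
import math

project_log = []  # one entry per projection call, used only for counting


def project(token):
    """Toy key/value projection (hypothetical: just doubles the vector)."""
    project_log.append(1)
    return [2.0 * x for x in token]


def attend(query, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token embeddings

# Without a cache: every step projects the whole prefix again.
no_cache = []
for t in range(len(tokens)):
    prefix = [project(tok) for tok in tokens[: t + 1]]
    no_cache.append(attend(tokens[t], prefix, prefix))
calls_without_cache = len(project_log)

# With a cache: earlier projections are kept, only the new token is projected.
project_log.clear()
cache, with_cache = [], []
for token in tokens:
    cache.append(project(token))
    with_cache.append(attend(token, cache, cache))
calls_with_cache = len(project_log)
```

Both loops produce the same attention outputs, but the cached loop runs the projection once per token instead of once per token per step. In this sketch, cache plays the role of past_key_values for a single layer and head.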
BaseModelOutputWithPastAndCrossAttentions

class transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[List[torch.FloatTensor]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (List[torch.FloatTensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of torch.FloatTensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Seq2SeqModelOutput

class transformers.modeling_outputs.Seq2SeqModelOutput(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[List[torch.FloatTensor]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model encoder's outputs that also contains precomputed hidden states that can speed up sequential decoding.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (List[torch.FloatTensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of torch.FloatTensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
CausalLMOutput

class transformers.modeling_outputs.CausalLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
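As a rough illustration of how a next-token prediction loss relates to the logits, the sketch below computes a mean cross-entropy in plain Python for a single sequence. It assumes the common convention that the scores at position t are compared with the token at position t + 1, which the text above does not spell out. The function names and toy values are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one row of vocabulary scores."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of position t's scores against token t + 1."""
    losses = []
    # The last position has no following token, so its scores are not used.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        losses.append(-log_softmax(scores)[target])
    return sum(losses) / len(losses)


# One sequence of 3 tokens over a 4-token vocabulary (toy values).
logits = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0],
]
token_ids = [2, 1, 3]
loss = next_token_loss(logits, token_ids)
```

Because the first two rows are uniform over four tokens, the loss here equals log(4), the value expected from a model that guesses uniformly at random.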
CausalLMOutputWithCrossAttentions

class transformers.modeling_outputs.CausalLMOutputWithCrossAttentions(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
CausalLMOutputWithPastAndCrossAttentions

class transformers.modeling_outputs.CausalLMOutputWithPastAndCrossAttentions(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[List[torch.FloatTensor]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[torch.FloatTensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of torch.FloatTensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
CausalLMOutputWithPast

class transformers.modeling_outputs.CausalLMOutputWithPast(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[List[torch.FloatTensor]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[torch.FloatTensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of torch.FloatTensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
MaskedLMOutput

class transformers.modeling_outputs.MaskedLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for masked language models outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Masked language modeling (MLM) loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Seq2SeqLMOutput

class transformers.modeling_outputs.Seq2SeqLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[List[torch.FloatTensor]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for sequence-to-sequence language models outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[torch.FloatTensor], optional, returned when use_cache=True is passed or when config.use_cache=True): List of torch.FloatTensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
NextSentencePredictorOutput

class transformers.modeling_outputs.NextSentencePredictorOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of models predicting if two sentences are consecutive or not.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when next_sentence_label is provided): Next sequence prediction (classification) loss.
logits (torch.FloatTensor of shape (batch_size, 2)): Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
SequenceClassifierOutput

class transformers.modeling_outputs.SequenceClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of sentence classification models.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
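The loss switches between classification and regression depending on config.num_labels. The sketch below illustrates that switch in plain Python. It assumes mean squared error for the regression case and mean cross-entropy for the classification case; the documentation above does not name the exact loss functions, and sequence_classification_loss is a hypothetical helper, not a library function.

```python
import math


def sequence_classification_loss(logits, labels, num_labels):
    """Mean loss over a batch; logits has shape (batch_size, num_labels)."""
    if num_labels == 1:
        # Regression (assumed): mean squared error on the single score.
        errors = [(row[0] - float(y)) ** 2 for row, y in zip(logits, labels)]
        return sum(errors) / len(errors)
    # Classification (assumed): cross-entropy on the scores before SoftMax.
    losses = []
    for row, y in zip(logits, labels):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in row))
        losses.append(log_norm - row[y])
    return sum(losses) / len(losses)


regression_loss = sequence_classification_loss([[1.0], [3.0]], [0.0, 3.0], num_labels=1)
classification_loss = sequence_classification_loss([[0.0, 0.0]], [0], num_labels=2)
```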
Seq2SeqSequenceClassifierOutput

class transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[List[torch.FloatTensor]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of sequence-to-sequence sentence classification models.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (List[torch.FloatTensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of torch.FloatTensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head).
Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
MultipleChoiceModelOutput

class transformers.modeling_outputs.MultipleChoiceModelOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of multiple choice models.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification loss.
logits (torch.FloatTensor of shape (batch_size, num_choices)) – Classification scores (before SoftMax).
num_choices is the second dimension of the input tensors (see input_ids).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
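Because the logits have one score per choice, picking the predicted answer is an argmax over the second dimension. A minimal sketch with hypothetical logits (pick_choices is an illustrative helper, not part of the library):

```python
def pick_choices(logits):
    """Return the index of the highest-scoring choice for each example."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Hypothetical logits of shape (batch_size=2, num_choices=4).
logits = [[0.1, 2.3, -0.5, 0.0], [1.5, 0.2, 0.9, 1.4]]
predicted = pick_choices(logits)
```

SoftMax is monotonic, so taking the argmax of the raw logits gives the same choice as taking the argmax of the probabilities.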
TokenClassifierOutput

class transformers.modeling_outputs.TokenClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of token classification models.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
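Token classification logits carry one score vector per token, so decoding means taking an argmax at every position. The sketch below assumes a hypothetical ID2LABEL mapping and made-up logits; a real model would supply its own label mapping.

```python
# Hypothetical label mapping, for illustration only.
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER"}


def decode_token_labels(logits, id2label):
    """Map logits of shape (batch_size, sequence_length, num_labels) to tag names."""
    batch = []
    for sequence in logits:
        tags = []
        for token_scores in sequence:
            best = max(range(len(token_scores)), key=token_scores.__getitem__)
            tags.append(id2label[best])
        batch.append(tags)
    return batch


# One sequence of three tokens, three labels each.
logits = [[[3.0, 0.1, 0.2], [0.0, 2.5, 0.3], [0.1, 0.4, 1.9]]]
tags = decode_token_labels(logits, ID2LABEL)
```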
QuestionAnsweringModelOutput

class transformers.modeling_outputs.QuestionAnsweringModelOutput(loss: Optional[torch.FloatTensor] = None, start_logits: torch.FloatTensor = None, end_logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of question answering models.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss: the sum of the cross-entropy losses for the start and end positions.
start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
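The start and end logits are scored independently, so taking the argmax of each separately can yield an end position that precedes the start. One common way to decode, sketched here as an assumption rather than the library's own post-processing, is to search for the highest-scoring pair with start <= end. The helper best_span and its max_answer_length parameter are hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_length=30):
    """Return ((start, end), score) maximising start_logit + end_logit with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_logits), start + max_answer_length)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best, best_score


# Hypothetical logits for one sequence of four tokens.
start_logits = [0.1, 4.0, 0.3, 0.2]
end_logits = [5.0, 0.2, 3.5, 0.1]
span, score = best_span(start_logits, end_logits)
```

In this example the independent argmaxes would be start 1 and end 0, which is not a valid span; the constrained search returns the span from token 1 to token 2 instead.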
Seq2SeqQuestionAnsweringModelOutput

class transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput(loss: Optional[torch.FloatTensor] = None, start_logits: torch.FloatTensor = None, end_logits: torch.FloatTensor = None, past_key_values: Optional[List[torch.FloatTensor]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of sequence-to-sequence question answering models.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss: the sum of the cross-entropy losses for the start and end positions.
start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
past_key_values (List[torch.FloatTensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of torch.FloatTensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head).
Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
TFBaseModelOutput

class transformers.modeling_tf_outputs.TFBaseModelOutput(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model's outputs, with potential hidden states and attentions.
Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
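To make "attention weights after the attention softmax, used to compute the weighted average" concrete, here is a minimal single-head, single-query sketch in plain Python. It assumes standard scaled dot-product attention, which the text above does not spell out, and uses made-up vectors; it is an illustration, not the library's implementation.

```python
import math


def attention(query, keys, values):
    """Return (weights, context) for one query attending over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax over the key positions gives the attention weights.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted average of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, context = attention(query, keys, values)
```

Each row of a returned attentions tensor plays the role of weights here: it is non-negative and sums to 1 over the last dimension.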
TFBaseModelOutputWithPooling

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, pooler_output: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model's outputs that also contains a pooling of the last hidden states.
Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
This output is usually not a good summary of the semantic content of the input; you are often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
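The advice to average the sequence of hidden-states instead of relying on pooler_output can be sketched as a masked mean over last_hidden_state. The example uses nested Python lists in place of tensors, and the attention_mask argument (1 for real tokens, 0 for padding) is an assumption about how padding is marked; mean_pool is a hypothetical helper.

```python
def mean_pool(last_hidden_state, attention_mask):
    """Average hidden-states over non-padding tokens.

    last_hidden_state: (batch_size, sequence_length, hidden_size) nested lists.
    attention_mask: (batch_size, sequence_length), 1 = real token, 0 = padding.
    """
    pooled = []
    for tokens, mask in zip(last_hidden_state, attention_mask):
        hidden_size = len(tokens[0])
        sums = [0.0] * hidden_size
        for vector, keep in zip(tokens, mask):
            if keep:
                for d in range(hidden_size):
                    sums[d] += vector[d]
        count = max(sum(mask), 1)  # avoid dividing by zero on an all-padding row
        pooled.append([s / count for s in sums])
    return pooled


last_hidden_state = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
attention_mask = [[1, 1, 0]]  # the third token is padding and is ignored
sentence_vectors = mean_pool(last_hidden_state, attention_mask)
```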
TFBaseModelOutputWithPast

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPast(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model's outputs that may also contain past key/values (to speed up sequential decoding).
Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head).
Contains precomputed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
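The shapes involved in cached decoding can be traced with a little arithmetic. The sketch below computes the expected shapes for one decoding step under two assumptions that are not stated above: that the cache grows by one position per generated token, and that embed_size_per_head equals hidden_size // num_heads. Both helper functions are hypothetical.

```python
def cache_shapes(n_layers, batch_size, num_heads, sequence_length, embed_size_per_head):
    """One (2, batch, heads, seq, head_dim) shape per layer: keys and values stacked."""
    shape = (2, batch_size, num_heads, sequence_length, embed_size_per_head)
    return [shape for _ in range(n_layers)]


def decode_step_shapes(n_layers, batch_size, num_heads, hidden_size, cached_length):
    """Expected output shapes after feeding one new token with a cache of cached_length."""
    head_dim = hidden_size // num_heads  # assumed relationship
    return {
        # Only the newest position is returned when past_key_values is used.
        "last_hidden_state": (batch_size, 1, hidden_size),
        # The cache now covers the old positions plus the new one (assumed).
        "past_key_values": cache_shapes(
            n_layers, batch_size, num_heads, cached_length + 1, head_dim
        ),
    }


shapes = decode_step_shapes(
    n_layers=12, batch_size=1, num_heads=12, hidden_size=768, cached_length=5
)
```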
TFSeq2SeqModelOutput

class transformers.modeling_tf_outputs.TFSeq2SeqModelOutput(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model encoder's outputs that also contain precomputed hidden states that can speed up sequential decoding.
Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the decoder of the model.
If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head).
Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
TFCausalLMOutput

class transformers.modeling_tf_outputs.TFCausalLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
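A next-token prediction loss pairs the logits at position t with the token at position t + 1. The sketch below shows that shift for a single sequence in plain Python, assuming mean cross-entropy over the shifted positions; it illustrates the idea and is not the library's loss code. next_token_loss is a hypothetical helper.

```python
import math


def log_softmax(row):
    """Log-probabilities for one vector of vocabulary scores."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in row))
    return [s - log_norm for s in row]


def next_token_loss(logits, input_ids):
    """Mean cross-entropy where logits[t] is scored against input_ids[t + 1].

    logits: (sequence_length, vocab_size) for one sequence.
    """
    losses = []
    for position in range(len(input_ids) - 1):
        target = input_ids[position + 1]
        losses.append(-log_softmax(logits[position])[target])
    return sum(losses) / len(losses)


# Uniform logits over a vocabulary of three tokens: every target has probability 1/3.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
input_ids = [2, 0, 1]
loss = next_token_loss(logits, input_ids)
```

The logits at the final position have no following token to predict, so they do not contribute to the loss.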
TFCausalLMOutputWithPast

class transformers.modeling_tf_outputs.TFCausalLMOutputWithPast(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head).
Contains precomputed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
TFMaskedLMOutput

class transformers.modeling_tf_outputs.TFMaskedLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for masked language model outputs.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Masked language modeling (MLM) loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
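A masked language modeling loss is usually computed only at the masked positions. The sketch below assumes the convention that positions to skip carry the label -100; that value, the IGNORE_INDEX name, and the masked_lm_loss helper are assumptions for illustration and are not taken from the text above.

```python
import math

IGNORE_INDEX = -100  # assumed marker for positions that do not contribute to the loss


def masked_lm_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose label is not ignore_index.

    logits: (sequence_length, vocab_size); labels: (sequence_length,).
    """
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # unmasked position: skipped
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in row))
        losses.append(log_norm - row[label])
    # Returning 0.0 when nothing is masked is a choice made for this sketch.
    return sum(losses) / len(losses) if losses else 0.0


logits = [[5.0, 0.0], [0.0, 0.0], [0.0, 5.0]]
labels = [-100, 1, -100]  # only the middle token was masked
loss = masked_lm_loss(logits, labels)
```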
TFSeq2SeqLMOutput

class transformers.modeling_tf_outputs.TFSeq2SeqLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for sequence-to-sequence language model outputs.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Language modeling loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head).
Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
TFNextSentencePredictorOutput

class transformers.modeling_tf_outputs.TFNextSentencePredictorOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of models predicting if two sentences are consecutive or not.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when next_sentence_label is provided) – Next sentence prediction loss.
logits (tf.Tensor of shape (batch_size, 2)) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
TFSequenceClassifierOutput

class transformers.modeling_tf_outputs.TFSequenceClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of sentence classification models.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
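The optional fields on these classes interact with the tuple and dictionary views described at the top of this page: fields that are None are skipped. The following stand-in dataclass is a simplified sketch of that behaviour, not the library's ModelOutput implementation; SketchOutput and its methods are hypothetical names, and plain Python values stand in for tensors.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class SketchOutput:
    """Hypothetical stand-in with the same field layout as a classifier output."""

    loss: Optional[Any] = None
    logits: Any = None
    hidden_states: Optional[Any] = None
    attentions: Optional[Any] = None

    def to_dict(self):
        # Keep only the fields that were actually returned (not None).
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def to_tuple(self):
        return tuple(self.to_dict().values())

    def __getitem__(self, key):
        # String keys behave like a dictionary, integers and slices like a tuple.
        if isinstance(key, str):
            return self.to_dict()[key]
        return self.to_tuple()[key]


outputs = SketchOutput(loss=0.25, logits=[[0.1, 0.9]])
first_two = outputs[:2]          # (loss, logits)
by_name = outputs["logits"]      # same object as outputs.logits
missing = outputs.attentions     # None, because it was never set
```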
TFSeq2SeqSequenceClassifierOutput

class transformers.modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of sequence-to-sequence sentence classification models.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
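The shapes listed above can be summarized with a small bookkeeping sketch. It is not library code: expected_shapes is a hypothetical helper, the configuration values are invented, and it assumes that embed_size_per_head equals hidden_size divided evenly by num_heads, which the text above does not state.

```python
def expected_shapes(n_layers, batch_size, num_heads, sequence_length, hidden_size):
    """Return the per-entry shapes described for the decoder-side outputs."""
    # Assumption: the hidden size is split evenly across attention heads.
    embed_size_per_head = hidden_size // num_heads
    return {
        # One tensor per layer, stacking keys and values along the first axis.
        "past_key_values": [
            (2, batch_size, num_heads, sequence_length, embed_size_per_head)
        ] * n_layers,
        # Embedding output plus one entry per layer.
        "decoder_hidden_states": [
            (batch_size, sequence_length, hidden_size)
        ] * (n_layers + 1),
        # One attention map per layer.
        "decoder_attentions": [
            (batch_size, num_heads, sequence_length, sequence_length)
        ] * n_layers,
    }


# Hypothetical configuration: 6 layers, 8 heads, hidden size 512.
shapes = expected_shapes(
    n_layers=6, batch_size=1, num_heads=8, sequence_length=10, hidden_size=512
)
for name, entries in shapes.items():
    print(name, len(entries), entries[0])
```

The point to notice is the off-by-one between the two tuples: hidden states have one more entry than attentions because the embedding output is included.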
TFMultipleChoiceModelOutput

class transformers.modeling_tf_outputs.TFMultipleChoiceModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of multiple choice models.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Classification loss.
logits (tf.Tensor of shape (batch_size, num_choices)) – Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids above).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
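The relationship between the input layout and the logits shape can be illustrated with plain nested lists. This sketch assumes an input_ids layout of (batch_size, num_choices, sequence_length); the token ids and logit values are invented for illustration.

```python
# Hypothetical input_ids of shape (batch_size=2, num_choices=3, sequence_length=4):
# every example carries one tokenized sequence per candidate answer.
input_ids = [
    [[101, 7, 8, 102], [101, 7, 9, 102], [101, 7, 10, 102]],
    [[101, 4, 5, 102], [101, 4, 6, 102], [101, 4, 3, 102]],
]
batch_size = len(input_ids)
num_choices = len(input_ids[0])  # second dimension of the input

# Hypothetical logits of shape (batch_size, num_choices): one score per choice.
logits = [[0.2, 1.7, -0.3], [2.1, 0.0, 0.4]]
assert len(logits) == batch_size
assert all(len(row) == num_choices for row in logits)

# The predicted answer is the index of the highest-scoring choice.
best_choice = [max(range(num_choices), key=row.__getitem__) for row in logits]
print(best_choice)
```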
TFTokenClassifierOutput

class transformers.modeling_tf_outputs.TFTokenClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of token classification models.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Classification loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
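For token classification the logits carry one score vector per token, so decoding happens along the last axis. The sketch below assumes a three-label tagging scheme; the id2label mapping and the logit values are hypothetical, and nested lists stand in for a tf.Tensor.

```python
# Hypothetical label mapping for a small named-entity tagging scheme.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

# Hypothetical logits of shape (batch_size=1, sequence_length=4, num_labels=3).
logits = [
    [
        [3.0, 0.1, 0.2],
        [0.2, 2.5, 0.1],
        [0.1, 0.3, 1.9],
        [2.2, 0.4, 0.1],
    ]
]


def argmax(scores):
    """Index of the largest score in a flat list."""
    return max(range(len(scores)), key=scores.__getitem__)


# One predicted label per token, per sequence.
predicted_tags = [
    [id2label[argmax(token_scores)] for token_scores in sequence]
    for sequence in logits
]
print(predicted_tags)
```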
TFQuestionAnsweringModelOutput

class transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, start_logits: tensorflow.python.framework.ops.Tensor = None, end_logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of question answering models.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss, computed as the sum of the cross-entropy losses for the start and end positions.
start_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
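The two logit vectors score start and end positions independently, so a caller typically has to combine them into a valid span. One common approach, sketched here as an assumption rather than as the library's own decoding routine, is to pick the pair with the highest summed score subject to start <= end and a maximum answer length. The helper best_span, its max_answer_length parameter, and the logit values are all hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_length=15):
    """Return ((start, end), score) for the best span with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        last = min(len(end_logits), start + max_answer_length)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best, best_score


# Hypothetical logits for one example with sequence_length=5.
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1]
end_logits = [0.0, 5.0, 0.2, 3.5, 0.1]

span, score = best_span(start_logits, end_logits)
print(span, score)
```

Taking the two argmax positions independently would give start 2 and end 1 here, which is not a valid span; the constraint selects (2, 3) instead.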
TFSeq2SeqQuestionAnsweringModelOutput

class transformers.modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, start_logits: tensorflow.python.framework.ops.Tensor = None, end_logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of sequence-to-sequence question answering models.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss, computed as the sum of the cross-entropy losses for the start and end positions.
start_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
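The span extraction loss described for the question answering outputs can be sketched for a single example as follows. This follows the wording above literally (a sum of two cross-entropy terms); an actual model may reduce or scale the terms differently, for example across the batch. The helper cross_entropy and all numeric values are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against a single target index."""
    shift = max(logits)
    # log-sum-exp computed stably, then subtract the target logit.
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[target]


# Hypothetical logits for one example with sequence_length=5.
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1]
end_logits = [0.0, 0.2, 0.2, 3.5, 0.1]

# Hypothetical gold answer span: tokens 2 through 3.
start_position, end_position = 2, 3

start_loss = cross_entropy(start_logits, start_position)
end_loss = cross_entropy(end_logits, end_position)
loss = start_loss + end_loss
print(loss)
```

The loss is small when both gold positions already receive the largest logits, and grows as probability mass moves to other positions.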