Pegasus¶
DISCLAIMER: If you see something strange, file a GitHub Issue and assign @patrickvonplaten.
Overview¶
The Pegasus model was proposed in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
According to the abstract,
Pegasus’ pretraining task is intentionally similar to summarization: important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
Pegasus achieves SOTA summarization performance on all 12 downstream tasks, as measured by ROUGE and human eval.
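The gap-sentence idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' pre-training code: the sentence splitter is naive, importance is approximated with a simple word-overlap score rather than the paper's selection criterion, and the mask string and the helper names (split_sentences, overlap_score, gap_sentence_example) are placeholders invented for this sketch.

```python
import re

MASK = "<mask_1>"  # placeholder sentence-mask string chosen for this sketch


def split_sentences(document):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def overlap_score(sentence, others):
    """Fraction of the sentence's distinct words that also occur in the other sentences."""
    words = set(re.findall(r"\w+", sentence.lower()))
    rest = set(re.findall(r"\w+", " ".join(others).lower()))
    return len(words & rest) / len(words) if words else 0.0


def gap_sentence_example(document, num_gaps=1):
    """Return (masked_input, target) for one document."""
    sentences = split_sentences(document)
    scores = [
        overlap_score(s, sentences[:i] + sentences[i + 1:])
        for i, s in enumerate(sentences)
    ]
    # Pick the highest-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_gaps])
    # The encoder sees the document with the chosen sentences masked out ...
    masked_input = " ".join(MASK if i in chosen else s for i, s in enumerate(sentences))
    # ... and the decoder must generate the removed sentences as one sequence.
    target = " ".join(sentences[i] for i in chosen)
    return masked_input, target


doc = (
    "The storm hit the coast on Monday. "
    "Officials said the storm closed roads along the coast. "
    "Schools will reopen next week."
)
masked_input, target = gap_sentence_example(doc, num_gaps=1)
print(masked_input)
print(target)
```

The removed sentence becomes the generation target, which is why the objective resembles producing a summary from the rest of the document.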
The Authors’ code can be found here.
Checkpoints¶
All the checkpoints are fine-tuned for summarization, except pegasus-large, from which the other checkpoints are fine-tuned.
Each checkpoint takes 2.2 GB on disk and has 568M parameters.
FP16 is not supported (help/ideas on this appreciated!).
Summarizing xsum in fp32 takes about 400ms/sample, with default parameters on a v100 GPU.
Full replication results and correctly pre-processed data can be found in this Issue.
Distilled checkpoints are described in this paper.
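The disk size and parameter count quoted in this section are consistent with weights stored as 32-bit floats. The arithmetic below is a rough sanity check, assuming 4 bytes per parameter and decimal gigabytes; the helper name checkpoint_size_gb is made up for this sketch, and real checkpoint files also carry a small amount of metadata.

```python
NUM_PARAMETERS = 568_000_000  # parameter count quoted in this section
BYTES_PER_FP32 = 4            # one 32-bit float occupies 4 bytes


def checkpoint_size_gb(num_parameters, bytes_per_value):
    """Approximate size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bytes_per_value / 1e9


fp32_size_gb = checkpoint_size_gb(NUM_PARAMETERS, BYTES_PER_FP32)
print(f"{fp32_size_gb:.2f} GB")  # about 2.27 GB, close to the 2.2 GB on disk
```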
Examples¶
Script to fine-tune pegasus on the XSUM dataset. Data download instructions at examples/seq2seq/.
The adafactor optimizer is recommended for pegasus fine-tuning.
Implementation Notes¶
All models are transformer encoder-decoders with 16 layers in each component.
The implementation is completely inherited from BartForConditionalGeneration.
Some key configuration differences:
- static, sinusoidal position embeddings
- no layernorm_embedding (PegasusConfig.normalize_embedding=False)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix
- more beams are used (num_beams=8)
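To make "static, sinusoidal position embeddings" concrete, here is a minimal pure-Python sketch of a fixed position table. It assumes the interleaved sine/cosine layout from the original Transformer paper; the library's own table may order the columns differently, and the function name sinusoidal_position_table is hypothetical.

```python
import math


def sinusoidal_position_table(num_positions, dim):
    """Build a fixed (num_positions x dim) table of position embeddings.

    Even columns hold sines and odd columns hold cosines (interleaved layout).
    Nothing here is learned: the table depends only on position and dimension.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even for this sketch")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


table = sinusoidal_position_table(num_positions=4, dim=8)
print(table[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

Because the table is a pure function of position, it can be rebuilt at load time instead of being stored as a trained weight.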
All pretrained pegasus checkpoints are the same except for three attributes: tokenizer.model_max_length (maximum input size), max_length (the maximum number of tokens to generate) and length_penalty.
The code to convert checkpoints trained in the author’s repo can be found in convert_pegasus_tf_to_pytorch.py.
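Since length_penalty is one of the attributes that varies between checkpoints, a small sketch of its effect may help. It assumes the common beam-search convention of dividing a hypothesis's summed log-probability by its length raised to length_penalty; the candidate scores and the helper names are invented for illustration.

```python
def length_normalized_score(sum_logprobs, length, length_penalty):
    """Score a finished hypothesis: summed log-probability / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up finished hypotheses: a short one and a longer, less probable one.
CANDIDATES = {
    "short": {"sum_logprobs": -4.0, "length": 5},
    "long": {"sum_logprobs": -6.0, "length": 12},
}


def best_candidate(length_penalty):
    """Return the name of the hypothesis with the highest normalized score."""
    return max(
        CANDIDATES,
        key=lambda name: length_normalized_score(
            CANDIDATES[name]["sum_logprobs"],
            CANDIDATES[name]["length"],
            length_penalty,
        ),
    )


print(best_candidate(0.0))  # no normalization: the short hypothesis wins
print(best_candidate(1.0))  # full normalization: the long hypothesis wins
```

Under this convention, larger values of length_penalty favor longer outputs, which is one reason a checkpoint tuned for long summaries can ship a different value than one tuned for single-sentence summaries.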
Usage Example¶
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
src_text = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]
model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest', return_tensors="pt").to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
PegasusConfig¶
class transformers.PegasusConfig(activation_dropout=0.0, extra_pos_embeddings=2, activation_function='gelu', vocab_size=50265, d_model=1024, encoder_ffn_dim=4096, encoder_layers=12, encoder_attention_heads=16, decoder_ffn_dim=4096, decoder_layers=12, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, attention_dropout=0.0, dropout=0.1, max_position_embeddings=1024, init_std=0.02, classifier_dropout=0.0, num_labels=3, is_encoder_decoder=True, normalize_before=False, add_final_layer_norm=False, do_blenderbot_90_layernorm=False, scale_embedding=False, normalize_embedding=True, static_position_embeddings=False, add_bias_logits=False, force_bos_token_to_be_generated=False, use_cache=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, **common_kwargs)
This is the configuration class to store the configuration of a PegasusForConditionalGeneration. It is used to instantiate a Pegasus model according to the specified arguments, defining the model architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Parameters
vocab_size (int, optional, defaults to 96103) – Vocabulary size of the Pegasus model. Defines the number of different tokens that can be represented by the input_ids passed when calling PegasusForConditionalGeneration.
d_model (int, optional, defaults to 1024) – Dimensionality of the layers and the pooler layer.
encoder_layers (int, optional, defaults to 16) – Number of encoder layers.
decoder_layers (int, optional, defaults to 16) – Number of decoder layers.
encoder_attention_heads (int, optional, defaults to 16) – Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (int, optional, defaults to 16) – Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (int, optional, defaults to 4096) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the decoder.
encoder_ffn_dim (int, optional, defaults to 4096) – Dimensionality of the “intermediate” (i.e., feed-forward) layer in the encoder.
activation_function (str or function, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
dropout (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
activation_dropout (float, optional, defaults to 0.0) – The dropout ratio for activations inside the fully connected layer.
classifier_dropout (float, optional, defaults to 0.0) – The dropout ratio for the classifier.
max_position_embeddings (int, optional, defaults to 1024) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
init_std (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (bool, optional, defaults to False) – This should be completed, specific to marian.
normalize_before (bool, optional, defaults to True) – Call layernorm before attention ops.
normalize_embedding (bool, optional, defaults to False) – Call layernorm after embeddings.
static_position_embeddings (bool, optional, defaults to True) – Don’t learn positional embeddings, use sinusoidal.
add_final_layer_norm (bool, optional, defaults to True) – Why not add another layernorm?
scale_embedding (bool, optional, defaults to True) – Scale embeddings by multiplying by sqrt(d_model).
eos_token_id (int, optional, defaults to 2) – End of stream token id.
pad_token_id (int, optional, defaults to 1) – Padding token id.
bos_token_id (int, optional, defaults to 0) – Beginning of stream token id.
encoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the encoder. See the LayerDrop paper for more details.
decoder_layerdrop (float, optional, defaults to 0.0) – The LayerDrop probability for the decoder. See the LayerDrop paper for more details.
extra_pos_embeddings (int, optional, defaults to 2) – How many extra learned positional embeddings to use. Should be pad_token_id+1 for bart.
is_encoder_decoder (bool, optional, defaults to True) – Whether this is an encoder/decoder model.
force_bos_token_to_be_generated (bool, optional, defaults to False) – Whether or not to force BOS token to be generated at step 1 (after decoder_start_token_id).
PegasusTokenizer¶
Warning: add_tokens does not work at the moment.
class transformers.PegasusTokenizer(*args, pad_token='<pad>', **kwargs)
Construct a Pegasus tokenizer.
PegasusTokenizer is identical to ReformerTokenizer and adds a new prepare_seq2seq_batch().
Refer to superclass ReformerTokenizer for usage examples and documentation concerning the initialization parameters and other methods.
__call__(text: Union[str, List[str], List[List[str]]], text_pair: Optional[Union[str, List[str], List[List[str]]]] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.tokenization_utils_base.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, return_tensors: Optional[Union[str, transformers.tokenization_utils_base.TensorType]] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs) → transformers.tokenization_utils_base.BatchEncoding
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
Parameters
text (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
add_special_tokens (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.
padding (bool, str or PaddingStrategy, optional, defaults to False) – Activates and controls padding. Accepts the following values:
- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) – Activates and controls truncation. Accepts the following values:
- True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) – Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer will skip the pre-tokenization step. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_tensors (str or TensorType, optional) – If set, will return tensors instead of list of python integers. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) – Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.
return_attention_mask (bool, optional) – Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.
return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences.
return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) – Whether or not to return (char_start, char_end) for each token. This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast; if using Python’s tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
**kwargs – passed to the self.tokenize() method
Returns
A BatchEncoding with the following fields:
- input_ids – List of token ids to be fed to a model.
- token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names).
- attention_mask – List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).
- overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).
- num_truncated_tokens – Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
- special_tokens_mask – List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
- length – The length of the inputs (when return_length=True)
Return type
BatchEncoding
prepare_seq2seq_batch(src_texts: List[str], tgt_texts: Optional[List[str]] = None, max_length: Optional[int] = None, max_target_length: Optional[int] = None, return_tensors: str = None, truncation=True, padding='longest', **unused) → transformers.tokenization_utils_base.BatchEncoding
Prepare model inputs for translation. For best performance, translate one sentence at a time.
Parameters
src_texts (List[str]) – List of documents to summarize or source language texts.
tgt_texts (list, optional) – List of summaries or target language texts.
max_length (int, optional) – Controls the maximum length for encoder inputs (documents to summarize or source language texts). If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
max_target_length (int, optional) – Controls the maximum length of decoder inputs (target language texts or summaries). If left unset or set to None, this will use the max_length value.
padding (bool, str or PaddingStrategy, optional, defaults to 'longest') – Activates and controls padding. Accepts the following values:
- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
- False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
return_tensors (str or TensorType, optional) – If set, will return tensors instead of list of python integers. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.
truncation (bool, str or TruncationStrategy, optional, defaults to True) – Activates and controls truncation. Accepts the following values:
- True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- False or 'do_not_truncate': No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
**kwargs – Additional keyword arguments passed along to self.__call__.
Returns
A BatchEncoding with the following fields:
- input_ids – List of token ids to be fed to the encoder.
- attention_mask – List of indices specifying which tokens should be attended to by the model.
- labels – List of token ids for tgt_texts.
The full set of keys [input_ids, attention_mask, labels] will only be returned if tgt_texts is passed. Otherwise, input_ids and attention_mask will be the only keys.
Return type
BatchEncoding
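The padding and truncation options documented above can be illustrated on plain lists of token ids. The sketch below is a toy stand-in, not the tokenizer's implementation: the function name truncate_and_pad, the pad id of 0, and the sample ids are all assumptions made for the example, and it only covers single sequences (no pairs).

```python
def truncate_and_pad(batch, pad_token_id=0, padding="longest", truncation=True, max_length=None):
    """Toy truncation and padding on lists of token ids.

    padding: "longest", "max_length", or False (no padding).
    truncation: when True and max_length is set, cut each sequence to max_length.
    """
    if truncation and max_length is not None:
        batch = [ids[:max_length] for ids in batch]

    if padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' needs max_length in this sketch")
        target = max_length
    else:
        target = None  # no padding: sequences keep their own lengths

    input_ids, attention_mask = [], []
    for ids in batch:
        pad = max(target - len(ids), 0) if target is not None else 0
        input_ids.append(ids + [pad_token_id] * pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = truncate_and_pad([[5, 6, 7, 8, 9], [3, 4]], max_length=4)
print(encoded["input_ids"])       # [[5, 6, 7, 8], [3, 4, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Padding to the longest sequence keeps batches as small as possible, while padding to a fixed max_length gives every batch the same shape.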
PegasusForConditionalGeneration¶
class transformers.PegasusForConditionalGeneration(config: transformers.models.bart.configuration_bart.BartConfig)
The Pegasus Model for summarization
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
Parameters
config (BartConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
PyTorch version of Google’s Pegasus model for summarization. Available models are listed here.
This class overrides BartForConditionalGeneration. Please check the superclass for the appropriate documentation alongside usage examples.
Examples:
>>> from transformers import PegasusTokenizer, PegasusForConditionalGeneration
>>> from typing import List
>>> PGE_ARTICLE = "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
>>> mname = "google/pegasus-xsum"
>>> model = PegasusForConditionalGeneration.from_pretrained(mname)
>>> tok = PegasusTokenizer.from_pretrained(mname)
>>> batch = tok.prepare_seq2seq_batch(src_texts=[PGE_ARTICLE], return_tensors="pt")  # don't need tgt_text for inference
>>> gen = model.generate(**batch)  # for forward pass: model(**batch)
>>> summary: List[str] = tok.batch_decode(gen, skip_special_tokens=True)
>>> # batch_decode returns a list, so compare its first element with the expected string
>>> assert summary[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
TFPegasusForConditionalGeneration¶
class transformers.TFPegasusForConditionalGeneration(*args, **kwargs)
Pegasus model for summarization
This model inherits from TFBartForConditionalGeneration. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
Note
TF 2.0 models accept two formats as inputs:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
This second option is useful when using the tf.keras.Model.fit() method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Parameters
config (PegasusConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
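The three ways of packing inputs into the first positional argument, described in the note for this class, can be pictured with a toy dispatcher. This is not the library's input-processing code: FakeTensor, INPUT_ORDER and gather_inputs are hypothetical names, and the positional order is simply taken from the list example in the note.

```python
class FakeTensor:
    """Stand-in for a framework tensor so the sketch runs without TensorFlow."""

    def __init__(self, name):
        self.name = name


# Assumed positional order, copied from the list example in the note.
INPUT_ORDER = ("input_ids", "attention_mask", "token_type_ids")


def gather_inputs(inputs):
    """Map the first positional argument to a dict of named inputs."""
    if isinstance(inputs, dict):
        # Dictionary form: names are given explicitly.
        unknown = set(inputs) - set(INPUT_ORDER)
        if unknown:
            raise ValueError(f"unexpected input names: {sorted(unknown)}")
        return dict(inputs)
    if isinstance(inputs, (list, tuple)):
        # List form: names are implied by position.
        if len(inputs) > len(INPUT_ORDER):
            raise ValueError("too many positional inputs")
        return dict(zip(INPUT_ORDER, inputs))
    # Single-tensor form: treated as input_ids only.
    return {"input_ids": inputs}


ids, mask, types = FakeTensor("ids"), FakeTensor("mask"), FakeTensor("types")
as_single = gather_inputs(ids)
as_list = gather_inputs([ids, mask])
as_dict = gather_inputs({"input_ids": ids, "token_type_ids": types})
print(sorted(as_list))  # ['attention_mask', 'input_ids']
```

The list form depends on position, which is why the order given in the docstring matters; the dictionary form avoids that dependency by naming each input.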