Pegasus¶

DISCLAIMER: If you see something strange, file a Github Issue and assign @sshleifer.

Overview¶

The Pegasus model was proposed in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019. According to the abstract,

Pegasus’ pretraining task is intentionally similar to summarization: important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
Pegasus achieves SOTA summarization performance on all 12 downstream tasks, as measured by ROUGE and human eval.

The Authors’ code can be found here.

Checkpoints¶

All the checkpoints are finetuned for summarization, besides pegasus-large, whence the other checkpoints are finetuned. - Each checkpoint is 2.2 GB on disk and 568M parameters. - FP16 is not supported (help/ideas on this appreciated!). - Summarizing xsum in fp32 takes about 400ms/sample, with default parameters on a v100 GPU. - For XSUM, The paper reports rouge1,rouge2, rougeL of paper: 47.21/24.56/39.25. As of Aug 9, this port scores 46.91/24.34/39.1. The gap is likely because of different alpha/length_penalty implementations in beam search.

Implementation Notes¶

All models are transformer encoder-decoders with 16 layers in each component.
The implementation is completely inherited from BartForConditionalGeneration
Some key configuration differences:
- static, sinusoidal position embeddings
- no layernorm_embedding (PegasusConfig.normalize_embedding=False)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
- num_beams=8
All pretrained pegasus checkpoints are the same besides three attributes: tokenizer.model_max_length (max input size), max_length (max num tokens to generate) and length_penalty
Code to convert checkpoints trained in the author’s repo can be found in convert_pegasus_tf_to_pytorch.py

Usage Example¶

from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."

PegasusForConditionalGeneration¶

This class inherits all functionality from BartForConditionalGeneration, see that page for method signatures. Available models are listed at Model List

class transformers.PegasusForConditionalGeneration(config: transformers.configuration_bart.BartConfig)[source]¶

The Pegasus Model for summarization

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters: config (BartConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

authorized_missing_keys = ['final_logits_bias', 'encoder\\.version', 'decoder\\.version', 'model.encoder.embed_positions', 'model.decoder.embed_positions']¶

Pytorch version of google’s pegasus model for summarization. Model API is identical to BartForConditionalGeneration. Available models are listed at Model List

Examples:

>>> from transformers import PegasusTokenizer, PegasusForConditionalGeneration
>>> from typing import List
>>> PGE_ARTICLE = "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
>>> mname = "google/pegasus-xsum"

>>> model = PegasusForConditionalGeneration.from_pretrained(mname)
>>> tok = PegasusTokenizer.from_pretrained(mname)
>>> batch = tok.prepare_seq2seq_batch(src_texts=[PGE_ARTICLE])  # don't need tgt_text for inference
>>> gen = model.generate(**batch)  # for forward pass: model(**batch)
>>> summary: List[str] = tok.batch_decode(gen, skip_special_tokens=True)
>>> assert summary == "California's largest electricity provider has turned off power to tens of thousands of customers."

config_class¶: alias of transformers.configuration_pegasus.PegasusConfig

PegasusConfig¶

This config fully inherits from BartConfig, but pegasus uses different default values: Up to date parameter values can be seen in S3. As of Aug 10, 2020, they are:

dict(
vocab_size=96103,
max_position_embeddings=512,
d_model=1024,
encoder_ffn_dim=4096,
decoder_ffn_dim=4096,
encoder_attention_heads=16,
decoder_attention_heads=16,
encoder_layers=16,
decoder_layers=16,
dropout=0.1,
attention_dropout=0.1,
activation_dropout=0.1,
pad_token_id=0,
eos_token_id=1,
is_encoder_decoder=True,
normalize_before=True,
scale_embedding=True,
normalize_embedding=False,
add_final_layer_norm=True,
static_position_embeddings=True,
num_beams=8,
activation_function="relu",
)

PegasusTokenizer¶

warning: add_tokens does not work at the moment.

class transformers.PegasusTokenizer(*args, **kwargs)[source]¶

__call__(text: Union[str, List[str], List[List[str]]], text_pair: Optional[Union[str, List[str], List[List[str]]]] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.tokenization_utils_base.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, return_tensors: Optional[Union[str, transformers.tokenization_utils_base.TensorType]] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs) → transformers.tokenization_utils_base.BatchEncoding¶

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.

Parameters

text (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
add_special_tokens (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.
padding (bool, str or PaddingStrategy, optional, defaults to False) –
Activates and controls padding. Accepts the following values:
- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
- True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.

If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer will skip the pre-tokenization step. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_tensors (str or TensorType, optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) –
Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.

What are token type IDs?
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.

What are attention masks?
return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences.
return_special_tokens_mask (bool, optional, defaults to False) – Wheter or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) –
Whether or not to return (char_start, char_end) for each token.

This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) – Whether or not to print informations and warnings.
**kwargs – passed to the self.tokenize() method

Returns

A BatchEncoding with the following fields:

input_ids – List of token ids to be fed to a model.

What are input IDs?
token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names).

What are token type IDs?
attention_mask – List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).

What are attention masks?
overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).
num_truncated_tokens – Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
special_tokens_mask – List of 0s and 1s, with 0 specifying added special tokens and 1 specifying regual sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length – The length of the inputs (when return_length=True)

Return type

BatchEncoding

prepare_seq2seq_batch(src_texts: List[str], tgt_texts: Optional[List[str]] = None, max_length: Optional[int] = None, max_target_length: Optional[int] = None, return_tensors: str = 'pt', truncation=True, padding='longest', **unused) → transformers.tokenization_utils_base.BatchEncoding[source]¶

Parameters

src_texts – (list): list of documents to summarize or source language texts
tgt_texts – (list, optional): list of tgt language texts or summaries.
max_length (int, optional) – Controls the maximum length for encoder inputs (documents to summarize or source language texts) If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
max_target_length (int, optional) – Controls the maximum length of decoder inputs (target language texts or summaries) If left unset or set to None, this will use the max_length value.
padding (bool, str or PaddingStrategy, optional, defaults to False) –
Activates and controls padding. Accepts the following values:
- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
return_tensors (str or TensorType, optional, defaults to “pt”) –
If set, will return tensors instead of list of python integers. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.
truncation (bool, str or TruncationStrategy, optional, defaults to True) –
Activates and controls truncation. Accepts the following values:
- True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).

Returns

A BatchEncoding with the following fields:

input_ids – List of token ids to be fed to the encoder.
attention_mask – List of indices specifying which tokens should be attended to by the model.
decoder_input_ids – List of token ids to be fed to the decoder.
decoder_attention_mask – List of indices specifying which tokens should be attended to by the decoder.
This does not include causal mask, which is built by the model.

The full set of keys [input_ids, attention_mask, decoder_input_ids, decoder_attention_mask], will only be returned if tgt_texts is passed. Otherwise, input_ids, attention_mask will be the only keys.

Return type

BatchEncoding

Prepare model inputs for summarization or translation.