XLM-ProphetNet

DISCLAIMER: If you see something strange, file a Github Issue and assign @patrickvonplaten

Overview
The XLM-ProphetNet model was proposed in ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
XLM-ProphetNet is an encoder-decoder model that can predict n future tokens for “ngram” language modeling instead of just the next token. Its architecture is identical to ProphetNet, but the model was pretrained on the multi-lingual “wiki100” Wikipedia dump.
The abstract from the paper is the following:
In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.
The authors’ code can be found here.
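As a quick orientation, the checkpoint used throughout the examples on this page is microsoft/xprophetnet-large-wiki100-cased; a minimal loading sketch (see the class-specific examples below for full usage):

>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetForConditionalGeneration

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')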
XLMProphetNetConfig

class transformers.XLMProphetNetConfig(activation_dropout=0.1, activation_function='gelu', vocab_size=30522, hidden_size=1024, encoder_ffn_dim=4096, num_encoder_layers=12, num_encoder_attention_heads=16, decoder_ffn_dim=4096, num_decoder_layers=12, num_decoder_attention_heads=16, attention_dropout=0.1, dropout=0.1, max_position_embeddings=512, init_std=0.02, is_encoder_decoder=True, add_cross_attention=True, decoder_start_token_id=0, ngram=2, num_buckets=32, relative_max_distance=128, disable_ngram_loss=False, eps=0.0, use_cache=True, pad_token_id=0, bos_token_id=1, eos_token_id=2, **kwargs)

This class overrides ProphetNetConfig. Please check the superclass for the appropriate documentation alongside usage examples.
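For illustration, a minimal sketch of instantiating a randomly initialized model from a configuration; the reduced layer counts below are arbitrary example values, not the pretrained defaults:

>>> from transformers import XLMProphetNetConfig, XLMProphetNetModel

>>> config = XLMProphetNetConfig(num_encoder_layers=2, num_decoder_layers=2, ngram=2)
>>> model = XLMProphetNetModel(config)  # randomly initialized, no pretrained weights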
XLMProphetNetTokenizer

class transformers.XLMProphetNetTokenizer(vocab_file, bos_token='[SEP]', eos_token='[SEP]', sep_token='[SEP]', unk_token='[UNK]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', sp_model_kwargs: Optional[Dict[str, Any]] = None, **kwargs)

Adapted from RobertaTokenizer and XLNetTokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Parameters
- vocab_file (str) – Path to the vocabulary file.
- bos_token (str, optional, defaults to "[SEP]") – The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
  Note: When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.
- eos_token (str, optional, defaults to "[SEP]") – The end of sequence token.
  Note: When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.
- sep_token (str, optional, defaults to "[SEP]") – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- cls_token (str, optional, defaults to "[CLS]") – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- unk_token (str, optional, defaults to "[UNK]") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to "[PAD]") – The token used for padding, for example when batching sequences of different lengths.
- mask_token (str, optional, defaults to "[MASK]") – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
- additional_special_tokens (List[str], optional) – Additional special tokens used by the tokenizer.
- sp_model_kwargs (dict, optional) – Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:
  - enable_sampling: Enable subword regularization.
  - nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.
    - nbest_size = {0,1}: No sampling is performed.
    - nbest_size > 1: Samples from the nbest_size results.
    - nbest_size < 0: Assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
  - alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
sp_model

The SentencePiece processor that is used for every conversion (string, tokens and IDs).

Type
SentencePieceProcessor
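As an illustrative (not prescriptive) sketch, subword regularization can be enabled by passing sp_model_kwargs when loading the tokenizer; the sampling values below are arbitrary example settings:

>>> from transformers import XLMProphetNetTokenizer

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained(
...     'microsoft/xprophetnet-large-wiki100-cased',
...     sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},  # example values only
... )
>>> tokens = tokenizer.tokenize("Hello, my dog is cute")  # tokenization is now sampled, so it may vary between calls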
build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An XLMProphetNet sequence has the following format:

- single sequence: X [SEP]
- pair of sequences: A [SEP] B [SEP]

Parameters
- token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
- token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns
List of input IDs with the appropriate special tokens.

Return type
List[int]
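To make the formats above concrete, a minimal sketch (assuming the tokenizer loaded from microsoft/xprophetnet-large-wiki100-cased as elsewhere on this page):

>>> ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
>>> ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("How are you?"))

>>> single = tokenizer.build_inputs_with_special_tokens(ids_a)       # X [SEP]
>>> pair = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)  # A [SEP] B [SEP]
>>> assert single[-1] == tokenizer.sep_token_id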
convert_tokens_to_string(tokens)

Converts a sequence of tokens (strings for sub-words) into a single string.
create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. XLMProphetNet does not make use of token type ids, therefore a list of zeros is returned.

Parameters
- token_ids_0 (List[int]) – List of IDs.
- token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns
List of zeros.

Return type
List[int]
get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int]

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

Parameters
- token_ids_0 (List[int]) – List of IDs.
- token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
- already_has_special_tokens (bool, optional, defaults to False) – Whether or not the token list is already formatted with special tokens for the model.

Returns
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type
List[int]
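For illustration, a minimal sketch of reading the mask back from already-formatted input IDs (assuming the same tokenizer as above):

>>> ids = tokenizer("Hello, my dog is cute").input_ids  # already ends with [SEP]
>>> mask = tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
>>> mask[-1]  # 1 marks the trailing [SEP]; sequence tokens are 0
1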
get_vocab()

Returns the vocabulary as a dictionary of token to index. tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns
The vocabulary.

Return type
Dict[str, int]
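A small sketch of the equivalence described above (assuming the same tokenizer as in the other examples):

>>> vocab = tokenizer.get_vocab()
>>> token = tokenizer.tokenize("hello")[0]
>>> assert vocab[token] == tokenizer.convert_tokens_to_ids(token)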
save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) → Tuple[str]

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won’t save the configuration and special token mappings of the tokenizer. Use _save_pretrained() to save the whole state of the tokenizer.

Parameters
- save_directory (str) – The directory in which to save the vocabulary.
- filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.

Returns
Paths to the files saved.

Return type
Tuple(str)
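For illustration, a minimal sketch of saving just the vocabulary files to a temporary directory:

>>> import os, tempfile

>>> save_dir = tempfile.mkdtemp()
>>> saved_files = tokenizer.save_vocabulary(save_dir)
>>> all(os.path.isfile(f) for f in saved_files)
True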
property vocab_size

Size of the base vocabulary (without the added tokens).

Type
int
XLMProphetNetModel

class transformers.XLMProphetNetModel(config)

This class overrides ProphetNetModel. Please check the superclass for the appropriate documentation alongside usage examples.

Example:
>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetModel

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetModel.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')

>>> input_ids = tokenizer("Studies have been shown that owning a dog is good for you", return_tensors="pt").input_ids  # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

>>> last_hidden_states = outputs.last_hidden_state  # main stream hidden states
>>> last_hidden_states_ngram = outputs.last_hidden_state_ngram  # predict hidden states
XLMProphetNetEncoder

class transformers.XLMProphetNetEncoder(config: transformers.models.prophetnet.configuration_prophetnet.ProphetNetConfig, word_embeddings: torch.nn.modules.sparse.Embedding = None)

This class overrides ProphetNetEncoder. Please check the superclass for the appropriate documentation alongside usage examples.

Example:
>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetEncoder
>>> import torch

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetEncoder.from_pretrained('patrickvonplaten/xprophetnet-large-uncased-standalone')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
XLMProphetNetDecoder

class transformers.XLMProphetNetDecoder(config: transformers.models.prophetnet.configuration_prophetnet.ProphetNetConfig, word_embeddings: torch.nn.modules.sparse.Embedding = None)

This class overrides ProphetNetDecoder. Please check the superclass for the appropriate documentation alongside usage examples.

Example:
>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetDecoder
>>> import torch

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetDecoder.from_pretrained('patrickvonplaten/xprophetnet-large-uncased-standalone', add_cross_attention=False)
>>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
XLMProphetNetForConditionalGeneration

class transformers.XLMProphetNetForConditionalGeneration(config: transformers.models.prophetnet.configuration_prophetnet.ProphetNetConfig)

This class overrides ProphetNetForConditionalGeneration. Please check the superclass for the appropriate documentation alongside usage examples.

Example:
>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetForConditionalGeneration

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')

>>> input_ids = tokenizer("Studies have been shown that owning a dog is good for you", return_tensors="pt").input_ids  # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

>>> logits_next_token = outputs.logits  # logits to predict next token as usual
>>> logits_ngram_next_tokens = outputs.logits_ngram  # logits to predict 2nd, 3rd, ... next tokens
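Because this is a standard encoder-decoder head, it can also be used with generate(); a minimal sketch continuing from the example above (the beam-search settings are illustrative, not tuned):

>>> generated_ids = model.generate(input_ids, num_beams=4, max_length=32, early_stopping=True)
>>> print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))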
XLMProphetNetForCausalLM

class transformers.XLMProphetNetForCausalLM(config)

This class overrides ProphetNetForCausalLM. Please check the superclass for the appropriate documentation alongside usage examples.

Example:
>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetForCausalLM
>>> import torch

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetForCausalLM.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits

>>> # Model can also be used with EncoderDecoder framework
>>> from transformers import EncoderDecoderModel, XLMProphetNetTokenizer, XLMRobertaTokenizer
>>> import torch

>>> tokenizer_enc = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
>>> tokenizer_dec = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("xlm-roberta-large", 'microsoft/xprophetnet-large-wiki100-cased')

>>> ARTICLE = (
...     "the us state department said wednesday it had received no "
...     "formal word from bolivia that it was expelling the us ambassador there "
...     "but said the charges made against him are `` baseless ."
... )
>>> input_ids = tokenizer_enc(ARTICLE, return_tensors="pt").input_ids
>>> labels = tokenizer_dec("us rejects charges against its ambassador in bolivia", return_tensors="pt").input_ids
>>> outputs = model(input_ids=input_ids, decoder_input_ids=labels[:, :-1], labels=labels[:, 1:])

>>> loss = outputs.loss