Transformers documentation

Generation

Transformers

You are viewing v4.25.1 version. A newer version v4.46.3 is available.

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Generation

Each framework has a generate method for auto-regressive text generation implemented in their respective GenerationMixin class:

PyTorch generate() is implemented in GenerationMixin.
TensorFlow generate() is implemented in TFGenerationMixin.
Flax/JAX generate() is implemented in FlaxGenerationMixin.

GenerationConfig

class transformers.GenerationConfig

< source >

( **kwargs )

Parameters that control the length of the output

max_length (int, optional, defaults to 20) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
max_new_tokens (int, optional) — The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
min_length (int, optional, defaults to 0) — The minimum length of the sequence to be generated.
early_stopping (bool, optional, defaults to False) — Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
max_time(float, optional) — The maximum amount of time you allow the computation to run for in seconds. generation will still finish the current pass after allocated time has been passed.

Parameters that control the generation strategy used

do_sample (bool, optional, defaults to False) — Whether or not to use sampling ; use greedy decoding otherwise.
num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
num_beam_groups (int, optional, defaults to 1) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams. this paper for more details.
penalty_alpha (float, optional) — The values balance the model confidence and the degeneration penalty in contrastive search decoding.

Parameters for manipulation of the model output logits

temperature (float, optional, defaults to 1.0) — The value used to module the next token probabilities.
top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p (float, optional, defaults to 1.0) — If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
typical_p (float, optional, defaults to 1.0) — The amount of probability mass from the original distribution to be considered in typical decoding. If set to 1.0 it takes no effect. See this paper for more details.
diversity_penalty (float, optional, defaults to 0.0) — This value is subtracted from a beam’s score if it generates a token same as any beam from other group at a particular time. Note that diversity_penalty is only effective if group beam search is enabled.
repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.
no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.
bad_words_ids(List[List[int]], optional) — List of token ids that are not allowed to be generated. In order to get the token ids of the words that should not appear in the generated text, use tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids.
force_words_ids(List[List[int]] or List[List[List[int]]], optional) — List of token ids that must be generated. If given a List[List[int]], this is treated as a simple list of words that must be included, the opposite to bad_words_ids. If given List[List[List[int]]], this triggers a disjunctive constraint, where one can allow different forms of each word.
use_cache (bool, optional, defaults to True) — Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
renormalize_logits (bool, optional, defaults to False) — Whether to renormalize the logits after applying all the logits processors or warpers (including the custom ones). It’s highly recommended to set this flag to True as the search algorithms suppose the score logits are normalized but some logit processors or warpers break the normalization.
forced_bos_token_id (int, optional, defaults to model.config.forced_bos_token_id) — The id of the token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART where the first generated token needs to be the target language token.
forced_eos_token_id (int, optional, defaults to model.config.forced_eos_token_id) — The id of the token to force as the last generated token when max_length is reached.
remove_invalid_values (bool, optional, defaults to model.config.remove_invalid_values) — Whether to remove possible nan and inf outputs of the model to prevent the generation method to crash. Note that using remove_invalid_values can slow down generation.
exponential_decay_length_penalty (tuple(int, float), optional) — This Tuple adds an exponentially increasing length penalty, after a certain amount of tokens have been generated. The tuple shall consist of: (start_index, decay_factor) where start_index indicates where penalty starts and decay_factor represents the factor of exponential decay
suppress_tokens (List[int], optional) — A list of tokens that will be supressed at generation. The SupressTokens logit processor will set their log probs to -inf so that they are not sampled.
begin_suppress_tokens (List[int], optional) — A list of tokens that will be supressed at the begining of the generation. The SupressBeginTokens logit processor will set their log probs to -inf so that they are not sampled.
forced_decoder_ids (List[List[int]], optional) — A list of pairs of integers which indicates a mapping from generation indices to token indices that will be forced before sampling. For example, [[1, 123]] means the second generated token will always be a token of index 123.

Parameters that define the output variables of `generate`

num_return_sequences(int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.

Special tokens that can be used at generation time

pad_token_id (int, optional) — The id of the padding token.
bos_token_id (int, optional) — The id of the beginning-of-sequence token.
eos_token_id (int, optional) — The id of the end-of-sequence token.

Generation parameters exclusive to encoder-decoder models

encoder_no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size that occur in the encoder_input_ids cannot occur in the decoder_input_ids.
decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.

Wild card

Class that holds a configuration for a generation task.

A generation configuration file can be loaded and saved to disk. Loading and using a generation configuration file does not change a model configuration or weights. It only affects the model’s behavior at generation time.

from_pretrained

< source >

( pretrained_model_name: typing.Union[str, os.PathLike] config_file_name: typing.Union[str, os.PathLike, NoneType] = None **kwargs ) → GenerationConfig

Parameters

pretrained_model_name (str or os.PathLike) — This can be either:
- a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
- a path to a directory containing a configuration file saved using the save_pretrained() method, e.g., ./my_model_directory/.
config_file_name (str or os.PathLike, optional, defaults to "generation_config.json") — Name of the generation configuration JSON file to be loaded from pretrained_model_name.
cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
force_download (bool, optional, defaults to False) — Whether or not to force to (re-)download the configuration files and override the cached versions if they exist.
resume_download (bool, optional, defaults to False) — Whether or not to delete incompletely received file. Attempts to resume the download if such a file exists.
proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
use_auth_token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).
revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

To test a pull request you made on the Hub, you can pass `revision=“refs/pr/“.
return_unused_kwargs (bool, optional, defaults to False) — If False, then this function returns just the final configuration object.

If True, then this functions returns a Tuple(config, unused_kwargs) where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e., the part of kwargs which has not been used to update config and is otherwise ignored.
subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
kwargs (Dict[str, Any], optional) — The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not configuration attributes is controlled by the return_unused_kwargs keyword parameter.

Returns

GenerationConfig

The configuration object instantiated from this pretrained model.

Instantiate a GenerationConfig from a generation configuration file.

Examples:

>>> from transformers import GenerationConfig

>>> # Download configuration from huggingface.co and cache.
>>> generation_config = GenerationConfig.from_pretrained("gpt2")

>>> # E.g. config was saved using *save_pretrained('./test/saved_model/')*
>>> generation_config.save_pretrained("./test/saved_model/")
>>> generation_config = GenerationConfig.from_pretrained("./test/saved_model/")

>>> # You can also specify configuration names to your generation configuration file
>>> generation_config.save_pretrained("./test/saved_model/", config_file_name="my_configuration.json")
>>> generation_config = GenerationConfig.from_pretrained("./test/saved_model/", "my_configuration.json")

>>> # If you'd like to try a minor variation to an existing configuration, you can also pass generation
>>> # arguments to `.from_pretrained()`. Be mindful that typos and unused arguments will be ignored
>>> generation_config, unused_kwargs = GenerationConfig.from_pretrained(
...     "gpt2", top_k=1, foo=False, return_unused_kwargs=True
... )
>>> generation_config.top_k
1

>>> unused_kwargs
{'foo': False}

save_pretrained

< source >

( save_directory: typing.Union[str, os.PathLike] config_file_name: typing.Union[str, os.PathLike, NoneType] = None push_to_hub: bool = False **kwargs )

Parameters

save_directory (str or os.PathLike) — Directory where the configuration JSON file will be saved (will be created if it does not exist).
config_file_name (str or os.PathLike, optional, defaults to "generation_config.json") — Name of the generation configuration JSON file to be saved in save_directory.
push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace). kwargs — Additional key word arguments passed along to the push_to_hub() method.

Save a generation configuration object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method.

GenerationMixin

class transformers.GenerationMixin

< source >

( )

A class containing all functions for auto-regressive text generation, to be used as a mixin in PreTrainedModel.

The class exposes generate(), which can be used for:

greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
contrastive search by calling contrastive_search() if penalty_alpha>0 and top_k>1
multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.
beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True.
diverse beam-search decoding by calling group_beam_search(), if num_beams>1 and num_beam_groups>1.
constrained beam-search decoding by calling constrained_beam_search(), if constraints!=None or force_words_ids!=None.

generate

< source >

( inputs: typing.Optional[torch.Tensor] = None max_length: typing.Optional[int] = None min_length: typing.Optional[int] = None do_sample: typing.Optional[bool] = None early_stopping: typing.Optional[bool] = None num_beams: typing.Optional[int] = None temperature: typing.Optional[float] = None penalty_alpha: typing.Optional[float] = None top_k: typing.Optional[int] = None top_p: typing.Optional[float] = None typical_p: typing.Optional[float] = None repetition_penalty: typing.Optional[float] = None bad_words_ids: typing.Optional[typing.Iterable[int]] = None force_words_ids: typing.Union[typing.Iterable[int], typing.Iterable[typing.Iterable[int]], NoneType] = None bos_token_id: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None length_penalty: typing.Optional[float] = None no_repeat_ngram_size: typing.Optional[int] = None encoder_no_repeat_ngram_size: typing.Optional[int] = None num_return_sequences: typing.Optional[int] = None max_time: typing.Optional[float] = None max_new_tokens: typing.Optional[int] = None decoder_start_token_id: typing.Optional[int] = None use_cache: typing.Optional[bool] = None num_beam_groups: typing.Optional[int] = None diversity_penalty: typing.Optional[float] = None prefix_allowed_tokens_fn: typing.Union[typing.Callable[[int, torch.Tensor], typing.List[int]], NoneType] = None logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None renormalize_logits: typing.Optional[bool] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None constraints: typing.Optional[typing.List[transformers.generation.beam_constraints.Constraint]] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None forced_bos_token_id: typing.Optional[int] = None forced_eos_token_id: typing.Optional[int] = None remove_invalid_values: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False exponential_decay_length_penalty: typing.Union[typing.Tuple[int, float], NoneType] = None suppress_tokens: typing.Optional[typing.List[int]] = None begin_suppress_tokens: typing.Optional[typing.List[int]] = None forced_decoder_ids: typing.Optional[typing.List[typing.List[int]]] = None **model_kwargs ) → ModelOutput or torch.LongTensor

Parameters

inputs (torch.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should of in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.
max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
max_new_tokens (int, optional) — The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
min_length (int, optional, defaults to model.config.min_length or 10 if the config does not set any value) — The minimum length of the sequence to be generated.
do_sample (bool, optional, defaults to model.config.do_sample or False if the config does not set any value) — Whether or not to use sampling ; use greedy decoding otherwise.
early_stopping (bool, optional, defaults to False) — Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
num_beams (int, optional, defaults to model.config.num_beams or 1 if the config does not set any value) — Number of beams for beam search. 1 means no beam search.
temperature (float, optional, defaults to model.config.temperature or 1.0 if the config does not set any value) — The value used to module the next token probabilities.
penalty_alpha (float, optional, defaults to model.config.penalty_alpha or None if the config does not set any value) — The values balance the model confidence and the degeneration penalty in contrastive search decoding.
top_k (int, optional, defaults to model.config.top_k or 50 if the config does not set any value) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p (float, optional, defaults to model.config.top_p or 1.0 if the config does not set any value) — If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
typical_p (float, optional, defaults to model.config.typical_p or 1.0 if the config does not set any value) — The amount of probability mass from the original distribution to be considered in typical decoding. If set to 1.0 it takes no effect. See this paper for more details.
repetition_penalty (float, optional, defaults to model.config.repetition_penalty or 1.0 if the config does not set any value) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
pad_token_id (int, optional, defaults to model.config.pad_token_id) — The id of the padding token.
bos_token_id (int, optional, defaults to model.config.bos_token_id) — The id of the beginning-of-sequence token.
eos_token_id (int, optional, defaults to model.config.eos_token_id) — The id of the end-of-sequence token.
length_penalty (float, optional, defaults to model.config.length_penalty or 1.0 if the config does not set any value) — Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.
no_repeat_ngram_size (int, optional, defaults to model.config.no_repeat_ngram_size or 0 if the config does not set any value) — If set to int > 0, all ngrams of that size can only occur once.
encoder_no_repeat_ngram_size (int, optional, defaults to model.config.encoder_no_repeat_ngram_size or 0 if the config does not set any value) — If set to int > 0, all ngrams of that size that occur in the encoder_input_ids cannot occur in the decoder_input_ids.
bad_words_ids(List[List[int]], optional, defaults to model.config.bad_words_ids) — List of token ids that are not allowed to be generated. In order to get the token ids of the words that should not appear in the generated text, use tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids.
force_words_ids(List[List[int]] or List[List[List[int]]], optional) — List of token ids that must be generated. If given a List[List[int]], this is treated as a simple list of words that must be included, the opposite to bad_words_ids. If given List[List[List[int]]], this triggers a disjunctive constraint, where one can allow different forms of each word.
num_return_sequences(int, optional, defaults to model.config.num_return_sequences or 1 if the config does not set any value) — The number of independently computed returned sequences for each element in the batch.
max_time(float, optional) — The maximum amount of time you allow the computation to run for in seconds. generation will still finish the current pass after allocated time has been passed.
attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are in [0, 1], 1 for tokens that are not masked, and 0 for masked tokens. If not provided, will default to a tensor the same shape as input_ids that masks the pad token. What are attention masks?
decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.
use_cache (bool, optional, defaults to True) — Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
num_beam_groups (int, optional, defaults to model.config.num_beam_groups or 1 if the config does not set any value) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams. this paper for more details.
diversity_penalty (float, optional, defaults to model.config.diversity_penalty or 0.0 if the config does not set any value) — This value is subtracted from a beam’s score if it generates a token same as any beam from other group at a particular time. Note that diversity_penalty is only effective if group beam search is enabled.
prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) — If provided, this function constraints the beam search to allowed tokens only at each step. If not provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens inputs_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and a model’s config. If a logit processor is passed that is already created with the arguments or a model’s config an error is thrown. This feature is intended for advanced users.
renormalize_logits (bool, optional, defaults to False) — Whether to renormalize the logits after applying all the logits processors or warpers (including the custom ones). It’s highly recommended to set this flag to True as the search algorithms suppose the score logits are normalized but some logit processors or warpers break the normalization.
stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a model’s config. If a stopping criteria is passed that is already created with the arguments or a model’s config an error is thrown. This feature is intended for advanced users.
constraints (List[Constraint], optional) — Custom constraints that can be added to the generation to ensure that the output will contain the use of certain tokens as defined by Constraint objects, in the most sensible way possible.
output_attentions (bool, optional, defaults to model.config.output_attentions or False if the config does not set any value) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to model.config.output_hidden_states or False if the config does not set any value) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to model.config.output_scores or False if the config does not set any value) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to model.config.return_dict_in_generate or False if the config does not set any value) — Whether or not to return a ModelOutput instead of a plain tuple.
forced_bos_token_id (int, optional, defaults to model.config.forced_bos_token_id) — The id of the token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART where the first generated token needs to be the target language token.
forced_eos_token_id (int, optional, defaults to model.config.forced_eos_token_id) — The id of the token to force as the last generated token when max_length is reached.
remove_invalid_values (bool, optional, defaults to model.config.remove_invalid_values) — Whether to remove possible nan and inf outputs of the model to prevent the generation method to crash. Note that using remove_invalid_values can slow down generation.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
exponential_decay_length_penalty (tuple(int, float), optional, defaults to model.config.exponential_decay_length_penalty) — This Tuple adds an exponentially increasing length penalty, after a certain amount of tokens have been generated. The tuple shall consist of: (start_index, decay_factor) where start_index indicates where penalty starts and decay_factor represents the factor of exponential decay
suppress_tokens (List[int], optional, defaults to model.config.suppress_tokens) — A list of tokens that will be supressed at generation. The SupressTokens logit processor will set their log probs to -inf so that they are not sampled.
begin_suppress_tokens (List[int], optional, defaults to model.config.begin_suppress_tokens) — A list of tokens that will be supressed at the begining of the generation. The SupressBeginTokens logit processor will set their log probs to -inf so that they are not sampled.
forced_decoder_ids (List[List[int]], optional, defaults to model.config.forced_decoder_ids) — A list of pairs of integers which indicates a mapping from generation indices to token indices that will be forced before sampling. For example, [[1, 123]] means the second generated token will always be a token of index 123. modelkwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder*.

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor.

If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible ModelOutput types are:

If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible ModelOutput types are:

Generates sequences of token ids for models with a language modeling head. The method supports the following generation methods for text-decoder, text-to-text, speech-to-text, and vision-to-text models:

greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
contrastive search by calling contrastive_search() if penalty_alpha>0. and top_k>1
multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.
beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True.
diverse beam-search decoding by calling group_beam_search(), if num_beams>1 and num_beam_groups>1.
constrained beam-search decoding by calling constrained_beam_search(), if constraints!=None or force_words_ids!=None.

Apart from inputs, all the arguments below will default to the value of the attribute of the same name as defined in the model’s config (config.json) which in turn defaults to the PretrainedConfig of the model.

Most of these parameters are explained in more detail in this blog post.

Examples:

Greedy Decoding:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> prompt = "Today I believe we can finally"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids

>>> # generate up to 30 tokens
>>> outputs = model.generate(input_ids, do_sample=False, max_length=30)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today I believe we can finally get to the point where we can make a difference in the lives of the people of the United States of America.\n']

Multinomial Sampling:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> prompt = "Today I believe we can finally"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids

>>> # sample up to 30 tokens
>>> torch.manual_seed(0)
>>> outputs = model.generate(input_ids, do_sample=True, max_length=30)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today I believe we can finally get rid of discrimination," said Rep. Mark Pocan (D-Wis.).\n\n"Just look at the']

Beam-search decoding:

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

>>> sentence = "Paris is one of the densest populated areas in Europe."
>>> input_ids = tokenizer(sentence, return_tensors="pt").input_ids

>>> outputs = model.generate(input_ids, num_beams=5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Paris ist eines der dichtesten besiedelten Gebiete Europas.']

greedy_search

< source >

( input_ids: LongTensor logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific keyword arguments will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using greedy decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForCausalLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     StoppingCriteriaList,
...     MaxLengthCriteria,
... )

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
>>> model.config.pad_token_id = model.config.eos_token_id

>>> input_prompt = "It might be possible to"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(10, eos_token_id=model.config.eos_token_id),
...     ]
... )
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])

>>> outputs = model.greedy_search(
...     input_ids, logits_processor=logits_processor, stopping_criteria=stopping_criteria
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["It might be possible to get a better understanding of the nature of the problem, but it's not"]

sample

< source >

( input_ids: LongTensor logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None logits_warper: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs ) → SampleDecoderOnlyOutput, SampleEncoderDecoderOutput or torch.LongTensor

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Returns

SampleDecoderOnlyOutput, SampleEncoderDecoderOutput or torch.LongTensor

A torch.LongTensor containing the generated tokens (default behaviour) or a SampleDecoderOnlyOutput if model.config.is_encoder_decoder=False and return_dict_in_generate=True or a SampleEncoderDecoderOutput if model.config.is_encoder_decoder=True.

Generates sequences of token ids for models with a language modeling head using multinomial sampling and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForCausalLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     TopKLogitsWarper,
...     TemperatureLogitsWarper,
...     StoppingCriteriaList,
...     MaxLengthCriteria,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> # set pad_token_id to eos_token_id because GPT2 does not have a EOS token
>>> model.config.pad_token_id = model.config.eos_token_id

>>> input_prompt = "Today is a beautiful day, and"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(15, eos_token_id=model.config.eos_token_id),
...     ]
... )
>>> # instantiate logits processors
>>> logits_warper = LogitsProcessorList(
...     [
...         TopKLogitsWarper(50),
...         TemperatureLogitsWarper(0.7),
...     ]
... )

>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])

>>> torch.manual_seed(0)
>>> outputs = model.sample(
...     input_ids,
...     logits_processor=logits_processor,
...     logits_warper=logits_warper,
...     stopping_criteria=stopping_criteria,
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today is a beautiful day, and a wonderful day.\n\nI was lucky enough to meet the']

beam_search

< source >

( input_ids: LongTensor beam_scorer: BeamScorer logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
beam_scorer (BeamScorer) — An derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # lets run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     num_beams=num_beams,
...     device=model.device,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.beam_search(input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs)

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

beam_sample

< source >

( input_ids: LongTensor beam_scorer: BeamScorer logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None logits_warper: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
beam_scorer (BeamScorer) — A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using beam search multinomial sampling and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     TopKLogitsWarper,
...     TemperatureLogitsWarper,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

>>> # lets run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     max_length=model.config.max_length,
...     num_beams=num_beams,
...     device=model.device,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id)]
... )
>>> # instantiate logits processors
>>> logits_warper = LogitsProcessorList(
...     [
...         TopKLogitsWarper(50),
...         TemperatureLogitsWarper(0.7),
...     ]
... )

>>> outputs = model.beam_sample(
...     input_ids, beam_scorer, logits_processor=logits_processor, logits_warper=logits_warper, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

contrastive_search

< source >

( input_ids: LongTensor top_k: typing.Optional[int] = 1 penalty_alpha: typing.Optional[float] = 0 logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None logits_warper: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
top_k (int, optional, defaults to 1) — The size of the candidate set that is used to re-rank for contrastive search
penalty_alpha (float, optional, defaults to 0) — The degeneration penalty for contrastive search; activate when it is larger than 0
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific keyword arguments will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using contrastive search and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForCausalLM,
...     StoppingCriteriaList,
...     MaxLengthCriteria,
... )

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
>>> model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
>>> # set pad_token_id to eos_token_id because OPT does not have a PAD token
>>> model.config.pad_token_id = model.config.eos_token_id
>>> input_prompt = "DeepMind Company is"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt")
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=64)])
>>> outputs = model.contrastive_search(
...     **input_ids, penalty_alpha=0.6, top_k=4, stopping_criteria=stopping_criteria
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['DeepMind Company is a company that focuses on the development and commercialization of artificial intelligence (AI). DeepMind’s mission is to help people understand and solve problems that are difficult to solve in the world today.\n\nIn this post, we talk about the benefits of deep learning in business and how it']

group_beam_search

< source >

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
beam_scorer (BeamScorer) — An derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

model_kwargs — Additional model specific kwargs that will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using diverse beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     HammingDiversityLogitsProcessor,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # lets run diverse beam search using 6 beams
>>> num_beams = 6
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     max_length=model.config.max_length,
...     num_beams=num_beams,
...     device=model.device,
...     num_beam_groups=3,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         HammingDiversityLogitsProcessor(5.5, num_beams=6, num_beam_groups=3),
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.group_beam_search(
...     input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

constrained_beam_search

< source >

( input_ids: LongTensor constrained_beam_scorer: ConstrainedBeamSearchScorer logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = None **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
constrained_beam_scorer (ConstrainedBeamSearchScorer) — A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation, while satisfying a list of positive constraints. For more information, the documentation of ConstrainedBeamSearchScorer should be read.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using constrained beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     ConstrainedBeamSearchScorer,
...     PhrasalConstraint,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # lets run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> constraint_str = "Sie"
>>> constraint_token_ids = tokenizer.encode(constraint_str)[:-1]  # slice to remove eos token
>>> constraints = [PhrasalConstraint(token_ids=constraint_token_ids)]


>>> # instantiate beam scorer
>>> beam_scorer = ConstrainedBeamSearchScorer(
...     batch_size=1, num_beams=num_beams, device=model.device, constraints=constraints
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.constrained_beam_search(
...     input_ids, beam_scorer, constraints=constraints, logits_processor=logits_processor, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt sind Sie?']

TFGenerationMixin

class transformers.TFGenerationMixin

< source >

( )

A class containing all of the functions supporting generation, to be used as a mixin in TFPreTrainedModel.

The class exposes generate(), which can be used for:

greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
contrastive search by calling contrastive_search() if penalty_alpha>0 and top_k>1
multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.

generate

< source >

( input_ids = None max_length = None max_new_tokens = None min_length = None do_sample = None early_stopping = None num_beams = None temperature = None penalty_alpha = None top_k = None top_p = None repetition_penalty = None bad_words_ids = None bos_token_id = None pad_token_id = None eos_token_id = None length_penalty = None no_repeat_ngram_size = None num_return_sequences = None attention_mask = None decoder_start_token_id = None use_cache = None output_scores = None output_attentions = None output_hidden_states = None return_dict_in_generate = None forced_bos_token_id = None forced_eos_token_id = None suppress_tokens: typing.Optional[typing.List[int]] = None begin_suppress_tokens: typing.Optional[typing.List[int]] = None forced_decoder_ids: typing.Optional[typing.List[typing.List[int]]] = None **model_kwargs ) → ModelOutput or tf.Tensor

Parameters

input_ids (tf.Tensor of shape (batch_size, sequence_length), `(batch_size, sequence_length, —
feature_dim)` or (batch_size, num_channels, height, width), optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should of in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.
max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
max_new_tokens (int, optional) — The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
min_length (int, optional, defaults to 10) — The minimum length of the sequence to be generated.
do_sample (bool, optional, defaults to False) — Whether or not to use sampling ; use greedy decoding otherwise.
early_stopping (bool, optional, defaults to False) — Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
temperature (float, optional, defaults to 1.0) — The value used to module the next token probabilities.
penalty_alpha (float, optional) — The values balance the model confidence and the degeneration penalty in contrastive search decoding.
top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p (float, optional, defaults to 1.0) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
pad_token_id (int, optional) — The id of the padding token.
bos_token_id (int, optional) — The id of the beginning-of-sequence token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.
no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.
bad_words_ids(List[int], optional) — List of token ids that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
num_return_sequences(int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.
attention_mask (tf.Tensor of dtype=tf.int32 and shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are in [0, 1], 1 for tokens that are not masked, and 0 for masked tokens.

If not provided, will default to a tensor the same shape as input_ids that masks the pad token.

What are attention masks?
decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.
use_cache (bool, optional, defaults to True) — Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
forced_bos_token_id (int, optional) — The id of the token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART where the first generated token needs to be the target language token.
forced_eos_token_id (int, optional) — The id of the token to force as the last generated token when max_length is reached.
suppress_tokens (List[int], optional, defaults to model.config.suppress_tokens) — A list of tokens that will be supressed at generation. The SupressTokens logit processor will set their log probs to -inf so that they are not sampled.
begin_suppress_tokens (List[int], optional, defaults to model.config.begin_suppress_tokens) — A list of tokens that will be supressed at the begining of the generation. The SupressBeginTokens logit processor will set their log probs to -inf so that they are not sampled.
forced_decoder_ids (List[List[int]], optional, defaults to model.config.forced_decoder_ids) — A list of pairs of integers which indicates a mapping from generation indices to token indices that will be forced before sampling. For example, [[1, 123]] means the second generated token will always be a token of index 123. model_specific_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model.

Returns

ModelOutput or tf.Tensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a tf.Tensor.

If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible ModelOutput types are:

TFGreedySearchDecoderOnlyOutput,
TFSampleDecoderOnlyOutput,
TFBeamSearchDecoderOnlyOutput,
TFBeamSampleDecoderOnlyOutput

If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible ModelOutput types are:

TFGreedySearchEncoderDecoderOutput,
TFSampleEncoderDecoderOutput,
TFBeamSearchEncoderDecoderOutput,
TFBeamSampleEncoderDecoderOutput

greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
contrastive search by calling contrastive_search() if penalty_alpha>0 and top_k>1
multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.

Adapted in part from Facebook’s XLM beam search code.

Apart from input_ids and attention_mask, all the arguments below will default to the value of the attribute of the same name inside the PretrainedConfig of the model. The default values indicated are the default values of those config.

Most of these parameters are explained in more detail in this blog post.

Examples:

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "distilgpt2"
)  # Download model and configuration from huggingface.co and cache.
outputs = model.generate(max_length=40)  # do greedy decoding
print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

tokenizer = AutoTokenizer.from_pretrained("openai-gpt")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "openai-gpt"
)  # Download model and configuration from huggingface.co and cache.
input_context = "The dog"
input_ids = tokenizer.encode(input_context, return_tensors="tf")  # encode input context
outputs = model.generate(
    input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5
)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'
for i in range(3):  #  3 output sequences were generated
    print(f"Generated {i}: {tokenizer.decode(outputs[i], skip_special_tokens=True)}")

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "distilgpt2"
)  # Download model and configuration from huggingface.co and cache.
input_context = "The dog"
input_ids = tokenizer.encode(input_context, return_tensors="tf")  # encode input context
outputs = model.generate(
    input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3, do_sample=True
)  # generate 3 candidates using sampling
for i in range(3):  #  3 output sequences were generated
    print(f"Generated {i}: {tokenizer.decode(outputs[i], skip_special_tokens=True)}")

tokenizer = AutoTokenizer.from_pretrained("ctrl")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "ctrl"
)  # Download model and configuration from huggingface.co and cache.
input_context = "Legal My neighbor is"  # "Legal" is one of the control codes for ctrl
input_ids = tokenizer.encode(input_context, return_tensors="tf")  # encode input context
outputs = model.generate(
    input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2
)  # generate sequences
print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "gpt2"
)  # Download model and configuration from huggingface.co and cache.
input_context = "My cute dog"
bad_words_ids = [
    tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ["idiot", "stupid", "shut up"]
]
input_ids = tokenizer.encode(input_context, return_tensors="tf")  # encode input context
outputs = model.generate(
    input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids
)  # generate sequences without allowing bad_words to be generated

FlaxGenerationMixin

class transformers.FlaxGenerationMixin

< source >

( )

A class containing all functions for auto-regressive text generation, to be used as a mixin in FlaxPreTrainedModel.

The class exposes generate(), which can be used for:

greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False.
multinomial sampling by calling _sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling _beam_search() if num_beams>1 and do_sample=False.

generate

< source >

( input_ids: ndarray max_length: typing.Optional[int] = None max_new_tokens: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None bos_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None decoder_start_token_id: typing.Optional[int] = None do_sample: typing.Optional[bool] = None prng_key: typing.Optional[jax._src.numpy.ndarray.ndarray] = None top_k: typing.Optional[int] = None top_p: typing.Optional[float] = None temperature: typing.Optional[float] = None num_beams: typing.Optional[int] = None no_repeat_ngram_size: typing.Optional[int] = None min_length: typing.Optional[int] = None forced_bos_token_id: typing.Optional[int] = None forced_eos_token_id: typing.Optional[int] = None length_penalty: typing.Optional[float] = None early_stopping: typing.Optional[bool] = None trace: bool = True params: typing.Union[typing.Dict[str, jax._src.numpy.ndarray.ndarray], NoneType] = None **model_kwargs )

Parameters

input_ids (jnp.ndarray of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
max_new_tokens (int, optional) — The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
do_sample (bool, optional, defaults to False) — Whether or not to use sampling ; use greedy decoding otherwise.
temperature (float, optional, defaults to 1.0) — The value used to module the next token probabilities.
top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p (float, optional, defaults to 1.0) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
pad_token_id (int, optional) — The id of the padding token.
bos_token_id (int, optional) — The id of the beginning-of-sequence token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.
trace (bool, optional, defaults to True) — Whether to trace generation. Setting trace=False should only be used for debugging and will lead to a considerably slower runtime.
params (Dict[str, jnp.ndarray], optional) — Optionally the model parameters can be passed. Can be useful for parallelized generation. modelkwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder*. Also accepts encoder_outputs to skip encoder part.

greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False.
multinomial sampling by calling _sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling _beam_search() if num_beams>1 and do_sample=False.

Most of these parameters are explained in more detail in this blog post.

Examples:

>>> from transformers import AutoTokenizer, FlaxAutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
>>> model = FlaxAutoModelForCausalLM.from_pretrained("distilgpt2")
>>> input_context = "The dog"
>>> # encode input context
>>> input_ids = tokenizer(input_context, return_tensors="np").input_ids
>>> # generate candidates using sampling
>>> outputs = model.generate(input_ids=input_ids, max_length=20, top_k=30, do_sample=True)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)

←Models ONNX→