Transformers documentation

Generation

Transformers

You are viewing v4.21.3 version. A newer version v4.56.1 is available.

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Generation

Each framework has a generate method for auto-regressive text generation implemented in their respective GenerationMixin class:

PyTorch generate() is implemented in GenerationMixin.
TensorFlow generate() is implemented in TFGenerationMixin.
Flax/JAX generate() is implemented in FlaxGenerationMixin.

GenerationMixin

class transformers.generation_utils.GenerationMixin

< source >

( )

A class containing all functions for auto-regressive text generation, to be used as a mixin in PreTrainedModel.

The class exposes generate(), which can be used for:

greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.
beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True.
diverse beam-search decoding by calling group_beam_search(), if num_beams>1 and num_beam_groups>1.
constrained beam-search decoding by calling constrained_beam_search(), if constraints!=None or force_words_ids!=None.

generate

< source >

( inputs: typing.Optional[torch.Tensor] = None max_length: typing.Optional[int] = None min_length: typing.Optional[int] = None do_sample: typing.Optional[bool] = None early_stopping: typing.Optional[bool] = None num_beams: typing.Optional[int] = None temperature: typing.Optional[float] = None top_k: typing.Optional[int] = None top_p: typing.Optional[float] = None typical_p: typing.Optional[float] = None repetition_penalty: typing.Optional[float] = None bad_words_ids: typing.Optional[typing.Iterable[int]] = None force_words_ids: typing.Union[typing.Iterable[int], typing.Iterable[typing.Iterable[int]], NoneType] = None bos_token_id: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None length_penalty: typing.Optional[float] = None no_repeat_ngram_size: typing.Optional[int] = None encoder_no_repeat_ngram_size: typing.Optional[int] = None num_return_sequences: typing.Optional[int] = None max_time: typing.Optional[float] = None max_new_tokens: typing.Optional[int] = None decoder_start_token_id: typing.Optional[int] = None use_cache: typing.Optional[bool] = None num_beam_groups: typing.Optional[int] = None diversity_penalty: typing.Optional[float] = None prefix_allowed_tokens_fn: typing.Union[typing.Callable[[int, torch.Tensor], typing.List[int]], NoneType] = None logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = [] renormalize_logits: typing.Optional[bool] = None stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = [] constraints: typing.Optional[typing.List[transformers.generation_beam_constraints.Constraint]] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None forced_bos_token_id: typing.Optional[int] = None forced_eos_token_id: typing.Optional[int] = None remove_invalid_values: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False exponential_decay_length_penalty: typing.Union[typing.Tuple[typing.Union[int, float]], NoneType] = None **model_kwargs ) → ModelOutput or torch.LongTensor

Parameters

inputs (torch.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should of in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.
max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
max_new_tokens (int, optional) — The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
min_length (int, optional, defaults to 10) — The minimum length of the sequence to be generated.
do_sample (bool, optional, defaults to False) — Whether or not to use sampling ; use greedy decoding otherwise.
early_stopping (bool, optional, defaults to False) — Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
temperature (float, optional, defaults to 1.0) — The value used to module the next token probabilities.
top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p (float, optional, defaults to 1.0) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
typical_p (float, optional, defaults to 1.0) — The amount of probability mass from the original distribution to be considered in typical decoding. If set to 1.0 it takes no effect. See this paper for more details.
repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
pad_token_id (int, optional) — The id of the padding token.
bos_token_id (int, optional) — The id of the beginning-of-sequence token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length. 1.0 means that the beam score is penalized by the sequence length. 0.0 means no penalty. Set to values < 0.0 in order to encourage the model to generate longer sequences, to a value > 0.0 in order to encourage the model to produce shorter sequences.
no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.
encoder_no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size that occur in the encoder_input_ids cannot occur in the decoder_input_ids.
bad_words_ids(List[List[int]], optional) — List of token ids that are not allowed to be generated. In order to get the token ids of the words that should not appear in the generated text, use tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids.
force_words_ids(List[List[int]] or List[List[List[int]]], optional) — List of token ids that must be generated. If given a List[List[int]], this is treated as a simple list of words that must be included, the opposite to bad_words_ids. If given List[List[List[int]]], this triggers a disjunctive constraint, where one can allow different forms of each word.
num_return_sequences(int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.
max_time(float, optional) — The maximum amount of time you allow the computation to run for in seconds. generation will still finish the current pass after allocated time has been passed.
attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are in [0, 1], 1 for tokens that are not masked, and 0 for masked tokens. If not provided, will default to a tensor the same shape as input_ids that masks the pad token. What are attention masks?
decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token. use_cache — (bool, optional, defaults to True): Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
num_beam_groups (int, optional, defaults to 1) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams. this paper for more details.
diversity_penalty (float, optional, defaults to 0.0) — This value is subtracted from a beam’s score if it generates a token same as any beam from other group at a particular time. Note that diversity_penalty is only effective if group beam search is enabled.
prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) — If provided, this function constraints the beam search to allowed tokens only at each step. If not provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens inputs_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and a model’s config. If a logit processor is passed that is already created with the arguments or a model’s config an error is thrown. This feature is intended for advanced users. renormalize_logits — (bool, optional, defaults to False): Whether to renormalize the logits after applying all the logits processors or warpers (including the custom ones). It’s highly recommended to set this flag to True as the search algorithms suppose the score logits are normalized but some logit processors or warpers break the normalization.
stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a model’s config. If a stopping criteria is passed that is already created with the arguments or a model’s config an error is thrown. This feature is intended for advanced users.
constraints (List[Constraint], optional) — Custom constraints that can be added to the generation to ensure that the output will contain the use of certain tokens as defined by Constraint objects, in the most sensible way possible.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
forced_bos_token_id (int, optional) — The id of the token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART where the first generated token needs to be the target language token.
forced_eos_token_id (int, optional) — The id of the token to force as the last generated token when max_length is reached.
remove_invalid_values (bool, optional) — Whether to remove possible nan and inf outputs of the model to prevent the generation method to crash. Note that using remove_invalid_values can slow down generation.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
exponential_decay_length_penalty (tuple(int, float), optional) — This Tuple adds an exponentially increasing length penalty, after a certain amount of tokens have been generated. The tuple shall consist of: (start_index, decay_factor) where start_index indicates where penalty starts and decay_factor represents the factor of exponential decay

modelkwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder*.

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor.

If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible ModelOutput types are:

If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible ModelOutput types are:

Generates sequences of token ids for models with a language modeling head. The method supports the following generation methods for text-decoder, text-to-text, speech-to-text, and vision-to-text models:

greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.
beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True.
diverse beam-search decoding by calling group_beam_search(), if num_beams>1 and num_beam_groups>1.
constrained beam-search decoding by calling constrained_beam_search(), if constraints!=None or force_words_ids!=None.

Apart from inputs, all the arguments below will default to the value of the attribute of the same name as defined in the model’s config (config.json) which in turn defaults to the PretrainedConfig of the model.

Most of these parameters are explained in more detail in this blog post.

Examples:

Greedy Decoding:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> prompt = "Today I believe we can finally"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids

>>> # generate up to 30 tokens
>>> outputs = model.generate(input_ids, do_sample=False, max_length=30)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today I believe we can finally get to the point where we can make a difference in the lives of the people of the United States of America.\n']

Multinomial Sampling:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> prompt = "Today I believe we can finally"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids

>>> # sample up to 30 tokens
>>> torch.manual_seed(0)
>>> outputs = model.generate(input_ids, do_sample=True, max_length=30)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today I believe we can finally get rid of discrimination," said Rep. Mark Pocan (D-Wis.).\n\n"Just look at the']

Beam-search decoding:

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

>>> sentence = "Paris is one of the densest populated areas in Europe."
>>> input_ids = tokenizer(sentence, return_tensors="pt").input_ids

>>> outputs = model.generate(input_ids)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Paris ist eines der dichtesten besiedelten Gebiete Europas.']

greedy_search

< source >

( input_ids: LongTensor logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific keyword arguments will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using greedy decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForCausalLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     StoppingCriteriaList,
...     MaxLengthCriteria,
... )

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> # set pad_token_id to eos_token_id because GPT2 does not have a EOS token
>>> model.config.pad_token_id = model.config.eos_token_id

>>> input_prompt = "It might be possible to"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(10, eos_token_id=model.config.eos_token_id),
...     ]
... )
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])

>>> outputs = model.greedy_search(
...     input_ids, logits_processor=logits_processor, stopping_criteria=stopping_criteria
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["It might be possible to get a better understanding of the nature of the problem, but it's not"]

sample

< source >

( input_ids: LongTensor logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None logits_warper: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using multinomial sampling and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForCausalLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     TopKLogitsWarper,
...     TemperatureLogitsWarper,
...     StoppingCriteriaList,
...     MaxLengthCriteria,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> # set pad_token_id to eos_token_id because GPT2 does not have a EOS token
>>> model.config.pad_token_id = model.config.eos_token_id

>>> input_prompt = "Today is a beautiful day, and"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(15, eos_token_id=model.config.eos_token_id),
...     ]
... )
>>> # instantiate logits processors
>>> logits_warper = LogitsProcessorList(
...     [
...         TopKLogitsWarper(50),
...         TemperatureLogitsWarper(0.7),
...     ]
... )

>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])

>>> torch.manual_seed(0)
>>> outputs = model.sample(
...     input_ids,
...     logits_processor=logits_processor,
...     logits_warper=logits_warper,
...     stopping_criteria=stopping_criteria,
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today is a beautiful day, and a wonderful day.\n\nI was lucky enough to meet the']

beam_search

< source >

( input_ids: LongTensor beam_scorer: BeamScorer logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
beam_scorer (BeamScorer) — An derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # lets run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     num_beams=num_beams,
...     device=model.device,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.beam_search(input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs)

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

beam_sample

< source >

( input_ids: LongTensor beam_scorer: BeamScorer logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None logits_warper: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
beam_scorer (BeamScorer) — A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using beam search multinomial sampling and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     TopKLogitsWarper,
...     TemperatureLogitsWarper,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

>>> # lets run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     max_length=model.config.max_length,
...     num_beams=num_beams,
...     device=model.device,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id)]
... )
>>> # instantiate logits processors
>>> logits_warper = LogitsProcessorList(
...     [
...         TopKLogitsWarper(50),
...         TemperatureLogitsWarper(0.7),
...     ]
... )

>>> outputs = model.beam_sample(
...     input_ids, beam_scorer, logits_processor=logits_processor, logits_warper=logits_warper, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

group_beam_search

< source >

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
beam_scorer (BeamScorer) — An derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

model_kwargs — Additional model specific kwargs that will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using diverse beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     HammingDiversityLogitsProcessor,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # lets run diverse beam search using 6 beams
>>> num_beams = 6
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     max_length=model.config.max_length,
...     num_beams=num_beams,
...     device=model.device,
...     num_beam_groups=3,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         HammingDiversityLogitsProcessor(5.5, num_beams=6, num_beam_groups=3),
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.group_beam_search(
...     input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

constrained_beam_search

< source >

( input_ids: LongTensor constrained_beam_scorer: ConstrainedBeamSearchScorer logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = None **model_kwargs )

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
constrained_beam_scorer (ConstrainedBeamSearchScorer) — A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation, while satisfying a list of positive constraints. For more information, the documentation of ConstrainedBeamSearchScorer should be read.
logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
pad_token_id (int, optional) — The id of the padding token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using constrained beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     ConstrainedBeamSearchScorer,
...     PhrasalConstraint,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # lets run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> constraint_str = "Sie"
>>> constraint_token_ids = tokenizer.encode(constraint_str)[:-1]  # slice to remove eos token
>>> constraints = [PhrasalConstraint(token_ids=constraint_token_ids)]


>>> # instantiate beam scorer
>>> beam_scorer = ConstrainedBeamSearchScorer(
...     batch_size=1, num_beams=num_beams, device=model.device, constraints=constraints
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.constrained_beam_search(
...     input_ids, beam_scorer, constraints=constraints, logits_processor=logits_processor, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt sind Sie?']

TFGenerationMixin

class transformers.generation_tf_utils.TFGenerationMixin

< source >

( )

A class containing all of the functions supporting generation, to be used as a mixin in TFPreTrainedModel.

generate

< source >

( input_ids = None max_length = None max_new_tokens = None min_length = None do_sample = None early_stopping = None num_beams = None temperature = None top_k = None top_p = None repetition_penalty = None bad_words_ids = None bos_token_id = None pad_token_id = None eos_token_id = None length_penalty = None no_repeat_ngram_size = None num_return_sequences = None attention_mask = None decoder_start_token_id = None use_cache = None output_scores = None output_attentions = None output_hidden_states = None return_dict_in_generate = None forced_bos_token_id = None forced_eos_token_id = None **model_kwargs ) → ModelOutput or tf.Tensor

Parameters

input_ids (tf.Tensor of shape (batch_size, sequence_length), `(batch_size, sequence_length, —
feature_dim)` or (batch_size, num_channels, height, width), optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should of in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.
max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
max_new_tokens (int, optional) — The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
min_length (int, optional, defaults to 10) — The minimum length of the sequence to be generated.
do_sample (bool, optional, defaults to False) — Whether or not to use sampling ; use greedy decoding otherwise.
early_stopping (bool, optional, defaults to False) — Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
temperature (float, optional, defaults to 1.0) — The value used to module the next token probabilities.
top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p (float, optional, defaults to 1.0) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
pad_token_id (int, optional) — The id of the padding token.
bos_token_id (int, optional) — The id of the beginning-of-sequence token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length. 1.0 means no penalty.

Set to values < 1.0 in order to encourage the model to generate shorter sequences, to a value > 1.0 in order to encourage the model to produce longer sequences.
no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.
bad_words_ids(List[int], optional) — List of token ids that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
num_return_sequences(int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.
attention_mask (tf.Tensor of dtype=tf.int32 and shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are in [0, 1], 1 for tokens that are not masked, and 0 for masked tokens.

If not provided, will default to a tensor the same shape as input_ids that masks the pad token.

What are attention masks?
decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token. use_cache — (bool, optional, defaults to True): Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
forced_bos_token_id (int, optional) — The id of the token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART where the first generated token needs to be the target language token.
forced_eos_token_id (int, optional) — The id of the token to force as the last generated token when max_length is reached. model_specific_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model.

Returns

ModelOutput or tf.Tensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a tf.Tensor.

If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible ModelOutput types are:

TFGreedySearchDecoderOnlyOutput,
TFSampleDecoderOnlyOutput,
TFBeamSearchDecoderOnlyOutput,
TFBeamSampleDecoderOnlyOutput

If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible ModelOutput types are:

TFGreedySearchEncoderDecoderOutput,
TFSampleEncoderDecoderOutput,
TFBeamSearchEncoderDecoderOutput,
TFBeamSampleEncoderDecoderOutput

Generates sequences for models with a language modeling head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.

Adapted in part from Facebook’s XLM beam search code.

Apart from input_ids and attention_mask, all the arguments below will default to the value of the attribute of the same name inside the PretrainedConfig of the model. The default values indicated are the default values of those config.

Most of these parameters are explained in more detail in this blog post.

Examples:

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "distilgpt2"
)  # Download model and configuration from huggingface.co and cache.
outputs = model.generate(max_length=40)  # do greedy decoding
print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

tokenizer = AutoTokenizer.from_pretrained("openai-gpt")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "openai-gpt"
)  # Download model and configuration from huggingface.co and cache.
input_context = "The dog"
input_ids = tokenizer.encode(input_context, return_tensors="tf")  # encode input context
outputs = model.generate(
    input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5
)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'
for i in range(3):  #  3 output sequences were generated
    print(f"Generated {i}: {tokenizer.decode(outputs[i], skip_special_tokens=True)}")

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "distilgpt2"
)  # Download model and configuration from huggingface.co and cache.
input_context = "The dog"
input_ids = tokenizer.encode(input_context, return_tensors="tf")  # encode input context
outputs = model.generate(
    input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3, do_sample=True
)  # generate 3 candidates using sampling
for i in range(3):  #  3 output sequences were generated
    print(f"Generated {i}: {tokenizer.decode(outputs[i], skip_special_tokens=True)}")

tokenizer = AutoTokenizer.from_pretrained("ctrl")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "ctrl"
)  # Download model and configuration from huggingface.co and cache.
input_context = "Legal My neighbor is"  # "Legal" is one of the control codes for ctrl
input_ids = tokenizer.encode(input_context, return_tensors="tf")  # encode input context
outputs = model.generate(
    input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2
)  # generate sequences
print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
    "gpt2"
)  # Download model and configuration from huggingface.co and cache.
input_context = "My cute dog"
bad_words_ids = [
    tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ["idiot", "stupid", "shut up"]
]
input_ids = tokenizer.encode(input_context, return_tensors="tf")  # encode input context
outputs = model.generate(
    input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids
)  # generate sequences without allowing bad_words to be generated

FlaxGenerationMixin

class transformers.generation_flax_utils.FlaxGenerationMixin

< source >

( )

A class containing all functions for auto-regressive text generation, to be used as a mixin in FlaxPreTrainedModel.

The class exposes generate(), which can be used for:

greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False.
multinomial sampling by calling _sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling _beam_search if num_beams>1 and do_sample=False.

generate

< source >

( input_ids: ndarray max_length: typing.Optional[int] = None max_new_tokens: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None bos_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None decoder_start_token_id: typing.Optional[int] = None do_sample: typing.Optional[bool] = None prng_key: typing.Optional[jax._src.numpy.ndarray.ndarray] = None top_k: typing.Optional[int] = None top_p: typing.Optional[float] = None temperature: typing.Optional[float] = None num_beams: typing.Optional[int] = None no_repeat_ngram_size: typing.Optional[int] = None min_length: typing.Optional[int] = None forced_bos_token_id: typing.Optional[int] = None forced_eos_token_id: typing.Optional[int] = None length_penalty: typing.Optional[float] = None early_stopping: typing.Optional[bool] = None trace: bool = True params: typing.Union[typing.Dict[str, jax._src.numpy.ndarray.ndarray], NoneType] = None **model_kwargs )

Parameters

input_ids (jnp.ndarray of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
max_new_tokens (int, optional) — The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
do_sample (bool, optional, defaults to False) — Whether or not to use sampling ; use greedy decoding otherwise.
temperature (float, optional, defaults to 1.0) — The value used to module the next token probabilities.
top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
top_p (float, optional, defaults to 1.0) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
pad_token_id (int, optional) — The id of the padding token.
bos_token_id (int, optional) — The id of the beginning-of-sequence token.
eos_token_id (int, optional) — The id of the end-of-sequence token.
num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.
trace (bool, optional, defaults to True) — Whether to trace generation. Setting trace=False should only be used for debugging and will lead to a considerably slower runtime.
params (Dict[str, jnp.ndarray], optional) — Optionally the model parameters can be passed. Can be useful for parallelized generation. modelkwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder*. Also accepts encoder_outputs to skip encoder part.

greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False.
multinomial sampling by calling _sample() if num_beams=1 and do_sample=True.
beam-search decoding by calling _beam_search if num_beams>1 and do_sample=False.

Most of these parameters are explained in more detail in this blog post.

Examples:

>>> from transformers import AutoTokenizer, FlaxAutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
>>> model = FlaxAutoModelForCausalLM.from_pretrained("distilgpt2")
>>> input_context = "The dog"
>>> # encode input context
>>> input_ids = tokenizer(input_context, return_tensors="np").input_ids
>>> # generate candidates using sampling
>>> outputs = model.generate(input_ids=input_ids, max_length=20, top_k=30, do_sample=True)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)

←Models ONNX→