Generation
Each framework has a generate method for auto-regressive text generation, implemented in its respective GenerationMixin
class:
- PyTorch generate() is implemented in GenerationMixin.
- TensorFlow generate() is implemented in TFGenerationMixin.
- Flax/JAX generate() is implemented in FlaxGenerationMixin.
GenerationMixin
A class containing all functions for auto-regressive text generation, to be used as a mixin in PreTrainedModel.
The class exposes generate(), which can be used for:
- greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
- multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
- beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.
- beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True.
- diverse beam-search decoding by calling group_beam_search() if num_beams>1 and num_beam_groups>1.
- constrained beam-search decoding by calling constrained_beam_search() if constraints!=None or force_words_ids!=None.
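The dispatch rules listed above can be sketched as a small pure-Python function. This is an illustration only, not the library's code: select_decoding_mode is a hypothetical helper, and the precedence it applies when several conditions hold at once (for example constraints together with several beam groups) is an assumption of this sketch.

```python
from typing import Optional, Sequence


def select_decoding_mode(
    num_beams: int = 1,
    do_sample: bool = False,
    num_beam_groups: int = 1,
    constraints: Optional[Sequence] = None,
    force_words_ids: Optional[Sequence] = None,
) -> str:
    """Return the name of the decoding method the rules above would pick."""
    # Constrained search: chosen whenever constraints or forced words are given.
    if constraints is not None or force_words_ids is not None:
        return "constrained_beam_search"
    # Diverse (group) beam search needs several beams split into several groups.
    if num_beams > 1 and num_beam_groups > 1:
        return "group_beam_search"
    # Plain beam search, with or without sampling.
    if num_beams > 1:
        return "beam_sample" if do_sample else "beam_search"
    # A single beam: sampling or greedy decoding.
    return "sample" if do_sample else "greedy_search"
```

For instance, calling the helper with num_beams=4 and do_sample=True selects beam_sample, while the defaults select greedy_search.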
generate
(
inputs: typing.Optional[torch.Tensor] = None
max_length: typing.Optional[int] = None
min_length: typing.Optional[int] = None
do_sample: typing.Optional[bool] = None
early_stopping: typing.Optional[bool] = None
num_beams: typing.Optional[int] = None
temperature: typing.Optional[float] = None
top_k: typing.Optional[int] = None
top_p: typing.Optional[float] = None
typical_p: typing.Optional[float] = None
repetition_penalty: typing.Optional[float] = None
bad_words_ids: typing.Optional[typing.Iterable[int]] = None
force_words_ids: typing.Union[typing.Iterable[int], typing.Iterable[typing.Iterable[int]], NoneType] = None
bos_token_id: typing.Optional[int] = None
pad_token_id: typing.Optional[int] = None
eos_token_id: typing.Optional[int] = None
length_penalty: typing.Optional[float] = None
no_repeat_ngram_size: typing.Optional[int] = None
encoder_no_repeat_ngram_size: typing.Optional[int] = None
num_return_sequences: typing.Optional[int] = None
max_time: typing.Optional[float] = None
max_new_tokens: typing.Optional[int] = None
decoder_start_token_id: typing.Optional[int] = None
use_cache: typing.Optional[bool] = None
num_beam_groups: typing.Optional[int] = None
diversity_penalty: typing.Optional[float] = None
prefix_allowed_tokens_fn: typing.Union[typing.Callable[[int, torch.Tensor], typing.List[int]], NoneType] = None
logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = []
renormalize_logits: typing.Optional[bool] = None
stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = []
constraints: typing.Optional[typing.List[transformers.generation_beam_constraints.Constraint]] = None
output_attentions: typing.Optional[bool] = None
output_hidden_states: typing.Optional[bool] = None
output_scores: typing.Optional[bool] = None
return_dict_in_generate: typing.Optional[bool] = None
forced_bos_token_id: typing.Optional[int] = None
forced_eos_token_id: typing.Optional[int] = None
remove_invalid_values: typing.Optional[bool] = None
synced_gpus: typing.Optional[bool] = False
exponential_decay_length_penalty: typing.Union[typing.Tuple[typing.Union[int, float]], NoneType] = None
**model_kwargs
)
→
ModelOutput or torch.LongTensor
Parameters
- inputs (torch.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None, the method initializes it with bos_token_id and a batch size of 1. For decoder-only models, inputs should be in the format of input_ids. For encoder-decoder models, inputs can represent any of input_ids, input_values, input_features, or pixel_values.
- max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
- max_new_tokens (int, optional) — The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
- min_length (int, optional, defaults to model.config.min_length or 10 if the config does not set any value) — The minimum length of the sequence to be generated.
- do_sample (bool, optional, defaults to model.config.do_sample or False if the config does not set any value) — Whether or not to use sampling; use greedy decoding otherwise.
- early_stopping (bool, optional, defaults to False) — Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
- num_beams (int, optional, defaults to model.config.num_beams or 1 if the config does not set any value) — Number of beams for beam search. 1 means no beam search.
- temperature (float, optional, defaults to model.config.temperature or 1.0 if the config does not set any value) — The value used to modulate the next token probabilities.
- top_k (int, optional, defaults to model.config.top_k or 50 if the config does not set any value) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
- top_p (float, optional, defaults to model.config.top_p or 1.0 if the config does not set any value) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
- typical_p (float, optional, defaults to model.config.typical_p or 1.0 if the config does not set any value) — The amount of probability mass from the original distribution to be considered in typical decoding. If set to 1.0 it has no effect. See this paper for more details.
- repetition_penalty (float, optional, defaults to model.config.repetition_penalty or 1.0 if the config does not set any value) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
- pad_token_id (int, optional, defaults to model.config.pad_token_id) — The id of the padding token.
- bos_token_id (int, optional, defaults to model.config.bos_token_id) — The id of the beginning-of-sequence token.
- eos_token_id (int, optional, defaults to model.config.eos_token_id) — The id of the end-of-sequence token.
- length_penalty (float, optional, defaults to model.config.length_penalty or 1.0 if the config does not set any value) — Exponential penalty to the length, used with beam-based generation. It is applied as an exponent to the sequence length, which in turn divides the score of the sequence. 0.0 means no penalty. Because the score is the log-likelihood of the sequence (i.e. negative), values > 0.0 encourage longer sequences, while values < 0.0 encourage shorter sequences.
- no_repeat_ngram_size (int, optional, defaults to model.config.no_repeat_ngram_size or 0 if the config does not set any value) — If set to int > 0, all ngrams of that size can only occur once.
- encoder_no_repeat_ngram_size (int, optional, defaults to model.config.encoder_no_repeat_ngram_size or 0 if the config does not set any value) — If set to int > 0, all ngrams of that size that occur in the encoder_input_ids cannot occur in the decoder_input_ids.
- bad_words_ids (List[List[int]], optional, defaults to model.config.bad_words_ids) — List of token ids that are not allowed to be generated. In order to get the token ids of the words that should not appear in the generated text, use tokenizer(bad_words, add_prefix_space=True, add_special_tokens=False).input_ids.
- force_words_ids (List[List[int]] or List[List[List[int]]], optional) — List of token ids that must be generated. If given a List[List[int]], this is treated as a simple list of words that must be included, the opposite of bad_words_ids. If given a List[List[List[int]]], this triggers a disjunctive constraint, where one can allow different forms of each word.
- num_return_sequences (int, optional, defaults to model.config.num_return_sequences or 1 if the config does not set any value) — The number of independently computed returned sequences for each element in the batch.
- max_time (float, optional) — The maximum amount of time you allow the computation to run for in seconds. Generation will still finish the current pass after the allocated time has passed.
- attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are in [0, 1], 1 for tokens that are not masked, and 0 for masked tokens. If not provided, will default to a tensor the same shape as input_ids that masks the pad token. What are attention masks?
- decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.
- use_cache (bool, optional, defaults to True) — Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
- num_beam_groups (int, optional, defaults to model.config.num_beam_groups or 1 if the config does not set any value) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams. See this paper for more details.
- diversity_penalty (float, optional, defaults to model.config.diversity_penalty or 0.0 if the config does not set any value) — This value is subtracted from a beam’s score if it generates the same token as any beam from another group at a particular time. Note that diversity_penalty is only effective if group beam search is enabled.
- prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens input_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
- logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and a model’s config. If a logit processor is passed that is already created with the arguments or a model’s config, an error is thrown. This feature is intended for advanced users.
- renormalize_logits (bool, optional, defaults to False) — Whether to renormalize the logits after applying all the logits processors or warpers (including the custom ones). It’s highly recommended to set this flag to True, as the search algorithms suppose the score logits are normalized but some logit processors or warpers break the normalization.
- stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a model’s config. If a stopping criterion is passed that is already created with the arguments or a model’s config, an error is thrown. This feature is intended for advanced users.
- constraints (List[Constraint], optional) — Custom constraints that can be added to the generation to ensure that the output will contain the use of certain tokens as defined by Constraint objects, in the most sensible way possible.
- output_attentions (bool, optional, defaults to model.config.output_attentions or False if the config does not set any value) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
- output_hidden_states (bool, optional, defaults to model.config.output_hidden_states or False if the config does not set any value) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
- output_scores (bool, optional, defaults to model.config.output_scores or False if the config does not set any value) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
- return_dict_in_generate (bool, optional, defaults to model.config.return_dict_in_generate or False if the config does not set any value) — Whether or not to return a ModelOutput instead of a plain tuple.
- forced_bos_token_id (int, optional, defaults to model.config.forced_bos_token_id) — The id of the token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART where the first generated token needs to be the target language token.
- forced_eos_token_id (int, optional, defaults to model.config.forced_eos_token_id) — The id of the token to force as the last generated token when max_length is reached.
- remove_invalid_values (bool, optional, defaults to model.config.remove_invalid_values) — Whether to remove possible nan and inf outputs of the model to prevent the generation method from crashing. Note that using remove_invalid_values can slow down generation.
- synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3).
- exponential_decay_length_penalty (tuple(int, float), optional, defaults to model.config.exponential_decay_length_penalty) — This tuple adds an exponentially increasing length penalty after a certain number of tokens have been generated. The tuple shall consist of (start_index, decay_factor), where start_index indicates where the penalty starts and decay_factor represents the factor of exponential decay.
- model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.
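To make the top_k and top_p descriptions above concrete, here is a minimal pure-Python sketch of which token ids survive the two filters. It is an illustration under stated assumptions, not the library's implementation: top_k_top_p_keep and softmax are hypothetical helpers, top_k=0 is taken here to mean "no top-k filtering", and top-p is assumed to be applied to the distribution renormalized over the top-k survivors.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_keep(logits: List[float], top_k: int = 0, top_p: float = 1.0) -> List[int]:
    """Return the sorted token ids that survive top-k, then top-p filtering."""
    # Rank token ids from most to least probable.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    if top_p < 1.0:
        # Renormalize over the tokens that survived top-k.
        probs = softmax([logits[i] for i in order])
        kept, cumulative = [], 0.0
        for token_id, p in zip(order, probs):
            kept.append(token_id)
            cumulative += p
            # Stop once the kept tokens reach the requested probability mass.
            if cumulative >= top_p:
                break
        order = kept
    return sorted(order)
```

With token probabilities 0.5, 0.3, 0.15 and 0.05, top_p=0.7 keeps the first two tokens: 0.5 alone falls short of 0.7, while 0.5 + 0.3 reaches it.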
Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.LongTensor.
If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible
ModelOutput types are:
- GreedySearchDecoderOnlyOutput,
- SampleDecoderOnlyOutput,
- BeamSearchDecoderOnlyOutput,
- BeamSampleDecoderOnlyOutput
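If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible
ModelOutput types are:
- GreedySearchEncoderDecoderOutput,
- SampleEncoderDecoderOutput,
- BeamSearchEncoderDecoderOutput,
- BeamSampleEncoderDecoderOutput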
Generates sequences of token ids for models with a language modeling head. The method supports the following generation methods for text-decoder, text-to-text, speech-to-text, and vision-to-text models:
- greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False.
- multinomial sampling by calling sample() if num_beams=1 and do_sample=True.
- beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False.
- beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True.
- diverse beam-search decoding by calling group_beam_search() if num_beams>1 and num_beam_groups>1.
- constrained beam-search decoding by calling constrained_beam_search() if constraints!=None or force_words_ids!=None.
Apart from inputs, all the arguments below will default to the value of the attribute of the same name as
defined in the model’s config (config.json), which in turn defaults to the
PretrainedConfig of the model.
Most of these parameters are explained in more detail in this blog post.
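The fallback from an argument to the model's config can be pictured with a short sketch. ToyConfig and resolve are hypothetical names introduced for illustration only and are not part of the library; the field values are illustrative as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a model config; only two fields are modelled."""

    max_length: int = 20
    num_beams: int = 1


def resolve(value: Optional[int], config_value: int) -> int:
    # An explicit argument wins; None means "use the value from the config".
    return value if value is not None else config_value


config = ToyConfig(num_beams=4)
max_length = resolve(None, config.max_length)  # falls back to the config value
num_beams = resolve(1, config.num_beams)  # the explicit argument wins
```

Checking against None rather than truthiness matters here, since a legitimate explicit value such as 0 must not be replaced by the config value.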
Examples:
Greedy Decoding:
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> prompt = "Today I believe we can finally"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
>>> # generate a sequence of up to 30 tokens (max_length counts the prompt tokens as well)
>>> outputs = model.generate(input_ids, do_sample=False, max_length=30)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today I believe we can finally get to the point where we can make a difference in the lives of the people of the United States of America.\n']
Multinomial Sampling:
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> prompt = "Today I believe we can finally"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
>>> # sample a sequence of up to 30 tokens (max_length counts the prompt tokens as well)
>>> torch.manual_seed(0)
>>> outputs = model.generate(input_ids, do_sample=True, max_length=30)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today I believe we can finally get rid of discrimination," said Rep. Mark Pocan (D-Wis.).\n\n"Just look at the']
Beam-search decoding:
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
>>> sentence = "Paris is one of the densest populated areas in Europe."
>>> input_ids = tokenizer(sentence, return_tensors="pt").input_ids
>>> outputs = model.generate(input_ids, num_beams=5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Paris ist eines der dichtesten besiedelten Gebiete Europas.']
greedy_search
(
input_ids: LongTensor
logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None
stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None
max_length: typing.Optional[int] = None
pad_token_id: typing.Optional[int] = None
eos_token_id: typing.Optional[int] = None
output_attentions: typing.Optional[bool] = None
output_hidden_states: typing.Optional[bool] = None
output_scores: typing.Optional[bool] = None
return_dict_in_generate: typing.Optional[bool] = None
synced_gpus: typing.Optional[bool] = False
**model_kwargs
)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
- logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
- stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
- max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
- pad_token_id (int, optional) — The id of the padding token.
- eos_token_id (int, optional) — The id of the end-of-sequence token.
- output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
- output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
- output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
- return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
- synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3).
- model_kwargs — Additional model specific keyword arguments will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.
Generates sequences of token ids for models with a language modeling head using greedy decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
Examples:
>>> from transformers import (
... AutoTokenizer,
... AutoModelForCausalLM,
... LogitsProcessorList,
... MinLengthLogitsProcessor,
... StoppingCriteriaList,
... MaxLengthCriteria,
... )
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
>>> model.config.pad_token_id = model.config.eos_token_id
>>> input_prompt = "It might be possible to"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids
>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
... [
... MinLengthLogitsProcessor(10, eos_token_id=model.config.eos_token_id),
... ]
... )
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
>>> outputs = model.greedy_search(
... input_ids, logits_processor=logits_processor, stopping_criteria=stopping_criteria
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["It might be possible to get a better understanding of the nature of the problem, but it's not"]
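For intuition, the loop that greedy decoding performs can be sketched without any library. This is a simplified illustration, not the library's implementation: greedy_decode, score_fn and toy_scores are hypothetical stand-ins introduced here, and logits processors, batching and padding are left out.

```python
from typing import Callable, List

# Hypothetical toy "model": maps a token-id sequence to next-token scores.
ScoreFn = Callable[[List[int]], List[float]]


def greedy_decode(
    score_fn: ScoreFn, input_ids: List[int], max_length: int, eos_token_id: int
) -> List[int]:
    """Repeatedly append the highest-scoring token until EOS or max_length."""
    sequence = list(input_ids)
    while len(sequence) < max_length:
        scores = score_fn(sequence)
        # Greedy step: take the argmax over the vocabulary.
        next_token = max(range(len(scores)), key=lambda i: scores[i])
        sequence.append(next_token)
        if next_token == eos_token_id:
            break
    return sequence


def toy_scores(sequence: List[int]) -> List[float]:
    # Vocabulary of 5 ids; always prefers (last token + 1) modulo 5.
    preferred = (sequence[-1] + 1) % 5
    return [1.0 if i == preferred else 0.0 for i in range(5)]
```

With the toy scorer, starting from [0] with eos_token_id=3 yields the ids 0, 1, 2, 3 and then stops, because the end-of-sequence id was produced before max_length was reached.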
sample
(
input_ids: LongTensor
logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None
stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None
logits_warper: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None
max_length: typing.Optional[int] = None
pad_token_id: typing.Optional[int] = None
eos_token_id: typing.Optional[int] = None
output_attentions: typing.Optional[bool] = None
output_hidden_states: typing.Optional[bool] = None
output_scores: typing.Optional[bool] = None
return_dict_in_generate: typing.Optional[bool] = None
synced_gpus: typing.Optional[bool] = False
**model_kwargs
)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
- logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
- stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
- logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
- max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
- pad_token_id (int, optional) — The id of the padding token.
- eos_token_id (int, optional) — The id of the end-of-sequence token.
- output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
- output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
- output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
- return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
- synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3).
- model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.
Generates sequences of token ids for models with a language modeling head using multinomial sampling and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
Examples:
>>> from transformers import (
... AutoTokenizer,
... AutoModelForCausalLM,
... LogitsProcessorList,
... MinLengthLogitsProcessor,
... TopKLogitsWarper,
... TemperatureLogitsWarper,
... StoppingCriteriaList,
... MaxLengthCriteria,
... )
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
>>> model.config.pad_token_id = model.config.eos_token_id
>>> input_prompt = "Today is a beautiful day, and"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids
>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
... [
... MinLengthLogitsProcessor(15, eos_token_id=model.config.eos_token_id),
... ]
... )
>>> # instantiate logits warpers
>>> logits_warper = LogitsProcessorList(
... [
... TopKLogitsWarper(50),
... TemperatureLogitsWarper(0.7),
... ]
... )
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
>>> torch.manual_seed(0)
>>> outputs = model.sample(
... input_ids,
... logits_processor=logits_processor,
... logits_warper=logits_warper,
... stopping_criteria=stopping_criteria,
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today is a beautiful day, and a wonderful day.\n\nI was lucky enough to meet the']
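The effect of the two warpers used above (top-k filtering followed by temperature scaling) and of the multinomial draw can be sketched in plain Python. This is an illustrative approximation rather than the library's code: warp_and_sample is a hypothetical helper, and it handles a single position with no batching.

```python
import math
import random
from typing import List


def warp_and_sample(
    logits: List[float], temperature: float, top_k: int, rng: random.Random
) -> int:
    """Apply top-k filtering and temperature scaling, then sample one token id."""
    # Keep only the top_k highest-scoring token ids.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # A temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in kept]
    # Unnormalized softmax weights; random.choices normalizes them internally.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # Multinomial draw over the surviving tokens.
    return rng.choices(kept, weights=weights, k=1)[0]
```

Passing a seeded random.Random instance plays the same role as torch.manual_seed(0) in the example above: it makes the draw reproducible.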
beam_search
(
input_ids: LongTensor
beam_scorer: BeamScorer
logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None
stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None
max_length: typing.Optional[int] = None
pad_token_id: typing.Optional[int] = None
eos_token_id: typing.Optional[int] = None
output_attentions: typing.Optional[bool] = None
output_hidden_states: typing.Optional[bool] = None
output_scores: typing.Optional[bool] = None
return_dict_in_generate: typing.Optional[bool] = None
synced_gpus: typing.Optional[bool] = False
**model_kwargs
)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
- beam_scorer (BeamScorer) — A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, see the documentation of BeamScorer.
- logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
- stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
- max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
- pad_token_id (int, optional) — The id of the padding token.
- eos_token_id (int, optional) — The id of the end-of-sequence token.
- output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
- output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
- output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
- return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
- synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3).
- model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.
Generates sequences of token ids for models with a language modeling head using beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
Examples:
>>> from transformers import (
... AutoTokenizer,
... AutoModelForSeq2SeqLM,
... LogitsProcessorList,
... MinLengthLogitsProcessor,
... BeamSearchScorer,
... )
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
>>> # let's run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id
>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
... "encoder_outputs": model.get_encoder()(
... encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
... )
... }
>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
... batch_size=1,
... num_beams=num_beams,
... device=model.device,
... )
>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
... [
... MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
... ]
... )
>>> outputs = model.beam_search(input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']
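The core idea of beam search, keeping the num_beams best partial sequences at every step, can be sketched in plain Python. This is a deliberately simplified illustration, not the library's algorithm: toy_beam_search, TABLE and toy_log_probs are hypothetical names, and end-of-sequence handling, length penalties and batching are omitted.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical toy "model": maps a token-id sequence to next-token log-probabilities.
LogProbFn = Callable[[List[int]], List[float]]


def toy_beam_search(
    log_prob_fn: LogProbFn, start: List[int], num_beams: int, steps: int
) -> List[Tuple[List[int], float]]:
    """Keep the num_beams best partial sequences by summed log-probability."""
    beams = [(list(start), 0.0)]
    for _ in range(steps):
        candidates = []
        for sequence, score in beams:
            # Expand every beam with every possible next token.
            for token_id, log_prob in enumerate(log_prob_fn(sequence)):
                candidates.append((sequence + [token_id], score + log_prob))
        # Keep only the best num_beams candidates for the next step.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


# Next-token probabilities keyed by the tokens generated after the start token.
TABLE = {(): [0.6, 0.4], (0,): [0.5, 0.5], (1,): [0.9, 0.1]}


def toy_log_probs(sequence: List[int]) -> List[float]:
    return [math.log(p) for p in TABLE[tuple(sequence[1:])]]
```

In this toy setting a single beam commits to token 0 first (probability 0.6) and ends with a sequence of probability 0.3, whereas two beams also keep the continuation starting with token 1 and find the sequence of probability 0.4 × 0.9 = 0.36.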
beam_sample
(
input_ids: LongTensor
beam_scorer: BeamScorer
logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None
stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None
logits_warper: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None
max_length: typing.Optional[int] = None
pad_token_id: typing.Optional[int] = None
eos_token_id: typing.Optional[int] = None
output_attentions: typing.Optional[bool] = None
output_hidden_states: typing.Optional[bool] = None
output_scores: typing.Optional[bool] = None
return_dict_in_generate: typing.Optional[bool] = None
synced_gpus: typing.Optional[bool] = False
**model_kwargs
)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
- beam_scorer (BeamScorer) — A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, see the documentation of BeamScorer.
- logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
- stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
- logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
- max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
- pad_token_id (int, optional) — The id of the padding token.
- eos_token_id (int, optional) — The id of the end-of-sequence token.
- output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
- output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
- output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
- return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
- synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3).
- model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.
Generates sequences of token ids for models with a language modeling head using beam search multinomial sampling and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
Examples:
>>> from transformers import (
... AutoTokenizer,
... AutoModelForSeq2SeqLM,
... LogitsProcessorList,
... MinLengthLogitsProcessor,
... TopKLogitsWarper,
... TemperatureLogitsWarper,
... BeamSearchScorer,
... )
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
>>> # let's run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id
>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
... "encoder_outputs": model.get_encoder()(
... encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
... )
... }
>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
... batch_size=1,
... max_length=model.config.max_length,
... num_beams=num_beams,
... device=model.device,
... )
>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
... [MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id)]
... )
>>> # instantiate logits warpers
>>> logits_warper = LogitsProcessorList(
... [
... TopKLogitsWarper(50),
... TemperatureLogitsWarper(0.7),
... ]
... )
>>> outputs = model.beam_sample(
... input_ids, beam_scorer, logits_processor=logits_processor, logits_warper=logits_warper, **model_kwargs
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']
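The two warpers used in the example above reshape the next-token distribution before sampling. The following is a minimal standard-library sketch of that idea, not the library's implementation: the helper names `apply_top_k`, `apply_temperature` and `softmax` are hypothetical, and the logits are made-up values. It applies top-k filtering first and temperature second, in the same order as the list passed to logits_warper.

```python
import math


def apply_top_k(logits, top_k):
    """Keep the top_k largest logits and set the rest to -inf (ties at the cutoff are kept)."""
    if top_k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[top_k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def apply_temperature(logits, temperature):
    """Divide every logit by the temperature; values below 1 sharpen the distribution."""
    return [x / temperature for x in logits]


def softmax(logits):
    """Convert logits to probabilities; -inf entries end up with probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores over a 4-token vocabulary
warped = apply_temperature(apply_top_k(logits, top_k=2), temperature=0.7)
probs = softmax(warped)  # only the two best tokens keep non-zero probability
```

Sampling from `probs` instead of the raw distribution is what makes the combination of beam search and multinomial sampling less likely to pick low-probability tokens.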
group_beam_search
< source >( input_ids: LongTensor beam_scorer: BeamScorer logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = False **model_kwargs )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
- beam_scorer (BeamScorer) — A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.
- logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
- stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
- max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
- pad_token_id (int, optional) — The id of the padding token.
- eos_token_id (int, optional) — The id of the end-of-sequence token.
- output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
- output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
- output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
- return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
- synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3).
- model_kwargs — Additional model specific kwargs that will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.
Generates sequences of token ids for models with a language modeling head using diverse beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
Examples:
>>> from transformers import (
... AutoTokenizer,
... AutoModelForSeq2SeqLM,
... LogitsProcessorList,
... MinLengthLogitsProcessor,
... HammingDiversityLogitsProcessor,
... BeamSearchScorer,
... )
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
>>> # let's run diverse beam search using 6 beams
>>> num_beams = 6
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id
>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
... "encoder_outputs": model.get_encoder()(
... encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
... )
... }
>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
... batch_size=1,
... max_length=model.config.max_length,
... num_beams=num_beams,
... device=model.device,
... num_beam_groups=3,
... )
>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
... [
... HammingDiversityLogitsProcessor(5.5, num_beams=6, num_beam_groups=3),
... MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
... ]
... )
>>> outputs = model.group_beam_search(
... input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']
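Diverse beam search splits the beams into groups and discourages later groups from repeating the tokens that earlier groups picked at the same step. The sketch below illustrates that Hamming-style penalty with plain Python lists. The function name `hamming_diversity_penalty` and the scores are illustrative assumptions; the value 5.5 simply mirrors the penalty passed to HammingDiversityLogitsProcessor in the example above.

```python
from collections import Counter


def hamming_diversity_penalty(scores, previous_group_tokens, diversity_penalty):
    """Lower the score of every token already chosen by earlier groups at this step.

    scores: per-token scores for one beam of the current group.
    previous_group_tokens: token ids picked at this step by beams of earlier groups.
    """
    counts = Counter(previous_group_tokens)
    return [s - diversity_penalty * counts.get(tok, 0) for tok, s in enumerate(scores)]


scores = [-0.1, -0.5, -2.0, -3.0]  # toy log-probabilities over a 4-token vocabulary
penalised = hamming_diversity_penalty(
    scores, previous_group_tokens=[0, 0], diversity_penalty=5.5
)
# Token 0 was chosen twice by earlier groups, so it loses 2 * 5.5 and is no longer the best.
best = max(range(len(penalised)), key=penalised.__getitem__)
```

A larger penalty pushes the groups further apart; a penalty of 0 leaves the scores unchanged, so the groups tend to converge on the same hypotheses.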
constrained_beam_search
< source >( input_ids: LongTensor constrained_beam_scorer: ConstrainedBeamSearchScorer logits_processor: typing.Optional[transformers.generation_logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None max_length: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None output_scores: typing.Optional[bool] = None return_dict_in_generate: typing.Optional[bool] = None synced_gpus: typing.Optional[bool] = None **model_kwargs )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
- constrained_beam_scorer (ConstrainedBeamSearchScorer) — A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation, while satisfying a list of positive constraints. For more information, the documentation of ConstrainedBeamSearchScorer should be read.
- logits_processor (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsProcessor used to modify the prediction scores of the language modeling head applied at each generation step.
- stopping_criteria (StoppingCriteriaList, optional) — An instance of StoppingCriteriaList. List of instances of class derived from StoppingCriteria used to tell if the generation loop should stop.
- logits_warper (LogitsProcessorList, optional) — An instance of LogitsProcessorList. List of instances of class derived from LogitsWarper used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.
- max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.
- pad_token_id (int, optional) — The id of the padding token.
- eos_token_id (int, optional) — The id of the end-of-sequence token.
- output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
- output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
- output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
- return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
- synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3).
- model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.
Generates sequences of token ids for models with a language modeling head using constrained beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
Examples:
>>> from transformers import (
... AutoTokenizer,
... AutoModelForSeq2SeqLM,
... LogitsProcessorList,
... MinLengthLogitsProcessor,
... ConstrainedBeamSearchScorer,
... PhrasalConstraint,
... )
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
>>> # let's run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id
>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
... "encoder_outputs": model.get_encoder()(
... encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
... )
... }
>>> constraint_str = "Sie"
>>> constraint_token_ids = tokenizer.encode(constraint_str)[:-1] # slice to remove eos token
>>> constraints = [PhrasalConstraint(token_ids=constraint_token_ids)]
>>> # instantiate beam scorer
>>> beam_scorer = ConstrainedBeamSearchScorer(
... batch_size=1, num_beams=num_beams, device=model.device, constraints=constraints
... )
>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
... [
... MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
... ]
... )
>>> outputs = model.constrained_beam_search(
... input_ids, beam_scorer, constraints=constraints, logits_processor=logits_processor, **model_kwargs
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt sind Sie?']
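A phrasal constraint such as the one built with PhrasalConstraint above asks that a given run of token ids appear in the output. The sketch below only checks a finished sequence for such a run; it does not attempt the incremental bookkeeping that a constrained beam scorer needs during the search. The function name `contains_phrase` and all token ids are made up for illustration.

```python
def contains_phrase(sequence, phrase):
    """Return True if `phrase` occurs as a contiguous run of token ids in `sequence`."""
    n = len(phrase)
    return any(sequence[i:i + n] == phrase for i in range(len(sequence) - n + 1))


# Hypothetical token ids; a real tokenizer would produce different values.
constraint_token_ids = [292, 15]
satisfied = contains_phrase([0, 100, 292, 15, 58, 1], constraint_token_ids)
# Same tokens, but not adjacent and in order, so the phrase does not count as present.
violated = contains_phrase([0, 292, 100, 15, 1], constraint_token_ids)
```

The check is order-sensitive and requires adjacency, which is why a multi-token word has to be passed as one constraint rather than as several single-token ones.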
TFGenerationMixin
A class containing all of the functions supporting generation, to be used as a mixin in TFPreTrainedModel.
generate
< source >(
input_ids = None
max_length = None
max_new_tokens = None
min_length = None
do_sample = None
early_stopping = None
num_beams = None
temperature = None
top_k = None
top_p = None
repetition_penalty = None
bad_words_ids = None
bos_token_id = None
pad_token_id = None
eos_token_id = None
length_penalty = None
no_repeat_ngram_size = None
num_return_sequences = None
attention_mask = None
decoder_start_token_id = None
use_cache = None
output_scores = None
output_attentions = None
output_hidden_states = None
return_dict_in_generate = None
forced_bos_token_id = None
forced_eos_token_id = None
**model_kwargs
)
→
ModelOutput or tf.Tensor
Parameters
- input_ids (tf.Tensor of shape (batch_size, sequence_length), (batch_size, sequence_length, feature_dim) or (batch_size, num_channels, height, width), optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None, the method initializes it with bos_token_id and a batch size of 1. For decoder-only models, inputs should be in the format of input_ids. For encoder-decoder models, inputs can represent any of input_ids, input_values, input_features, or pixel_values.
- max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
- max_new_tokens (int, optional) — The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
- min_length (int, optional, defaults to 10) — The minimum length of the sequence to be generated.
- do_sample (bool, optional, defaults to False) — Whether or not to use sampling; use greedy decoding otherwise.
- early_stopping (bool, optional, defaults to False) — Whether to stop the beam search when at least num_beams sentences are finished per batch or not.
- num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
- temperature (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.
- top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
- top_p (float, optional, defaults to 1.0) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
- repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
- pad_token_id (int, optional) — The id of the padding token.
- bos_token_id (int, optional) — The id of the beginning-of-sequence token.
- eos_token_id (int, optional) — The id of the end-of-sequence token.
- length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length. 1.0 means no penalty. Set to values < 1.0 in order to encourage the model to generate shorter sequences, to a value > 1.0 in order to encourage the model to produce longer sequences.
- no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.
- bad_words_ids (List[int], optional) — List of token ids that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
- num_return_sequences (int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.
- attention_mask (tf.Tensor of dtype=tf.int32 and shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are in [0, 1], 1 for tokens that are not masked, and 0 for masked tokens. If not provided, will default to a tensor the same shape as input_ids that masks the pad token.
- decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.
- use_cache (bool, optional, defaults to True) — Whether or not the model should use the past key/value attentions (if applicable to the model) to speed up decoding.
- output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
- output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
- output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
- return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
- forced_bos_token_id (int, optional) — The id of the token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART where the first generated token needs to be the target language token.
- forced_eos_token_id (int, optional) — The id of the token to force as the last generated token when max_length is reached.
- model_specific_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model.
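The top_p parameter described above keeps the smallest set of most probable tokens whose probabilities add up to at least top_p. A minimal sketch of that nucleus filter on a plain list of probabilities follows; the function name `top_p_filter` and the probability values are illustrative, and the renormalisation step is an assumption made so that the result can be sampled from directly.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalise the surviving probabilities so they sum to 1 again.
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution
filtered = top_p_filter(probs, top_p=0.7)  # the two most likely tokens survive
```

With top_p=1.0, the default, every token is kept and the distribution is unchanged.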
Returns
ModelOutput or tf.Tensor
A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a tf.Tensor.
If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible ModelOutput types are:
- TFGreedySearchDecoderOnlyOutput
- TFSampleDecoderOnlyOutput
- TFBeamSearchDecoderOnlyOutput
- TFBeamSampleDecoderOnlyOutput
If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible ModelOutput types are:
- TFGreedySearchEncoderDecoderOutput
- TFSampleEncoderDecoderOutput
- TFBeamSearchEncoderDecoderOutput
- TFBeamSampleEncoderDecoderOutput
Generates sequences for models with a language modeling head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.
Adapted in part from Facebook’s XLM beam search code.
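For beam search, the length_penalty parameter enters through the score used to rank finished hypotheses. The sketch below assumes the common formulation in which the summed log-probability is divided by the length raised to length_penalty; the function name `beam_score` and the numbers are illustrative only.

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalised score: summed log-probabilities divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two toy hypotheses as (summed log-probability, length in tokens).
short_hyp = (-4.0, 4)
long_hyp = (-9.0, 10)

# A penalty below 1.0 normalises weakly, so the shorter hypothesis ranks higher.
short_wins = beam_score(*short_hyp, length_penalty=0.5) > beam_score(*long_hyp, length_penalty=0.5)
# A penalty above 1.0 normalises strongly, so the longer hypothesis ranks higher.
long_wins = beam_score(*long_hyp, length_penalty=2.0) > beam_score(*short_hyp, length_penalty=2.0)
```

Because log-probabilities are negative, dividing by a larger power of the length moves long hypotheses closer to zero, which is why larger values favour longer outputs.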
Apart from input_ids and attention_mask, all the arguments below will default to the value of the attribute of the same name inside the PretrainedConfig of the model. The default values indicated are the default values of that config.
Most of these parameters are explained in more detail in this blog post.
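The fallback rule stated above, where an argument left as None takes the value of the config attribute of the same name, can be sketched as follows. `ToyConfig` and `resolve` are hypothetical stand-ins written for this illustration; they are not part of the library.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a model config; not the real PretrainedConfig."""
    max_length: int = 20
    num_beams: int = 1
    do_sample: bool = False


def resolve(value, config, name):
    """Use the explicit argument if one was given, else the config attribute of the same name."""
    return value if value is not None else getattr(config, name)


config = ToyConfig(num_beams=4)
max_length = resolve(None, config, "max_length")  # not passed, so the config value is used
num_beams = resolve(2, config, "num_beams")       # an explicit argument wins over the config
do_sample = resolve(False, config, "do_sample")   # False is an explicit value, not "unset"
```

Testing against None rather than truthiness matters here: it lets a caller pass False or 0 and have that choice respected.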
Examples:
from transformers import AutoTokenizer, TFAutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("distilgpt2") # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
"distilgpt2"
) # Download model and configuration from huggingface.co and cache.
outputs = model.generate(max_length=40) # do greedy decoding
print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
tokenizer = AutoTokenizer.from_pretrained("openai-gpt") # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
"openai-gpt"
) # Download model and configuration from huggingface.co and cache.
input_context = "The dog"
input_ids = tokenizer.encode(input_context, return_tensors="tf") # encode input context
outputs = model.generate(
input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5
) # generate 3 independent sequences using beam search decoding (5 beams) from initial context 'The dog'
for i in range(3): # 3 output sequences were generated
    print(f"Generated {i}: {tokenizer.decode(outputs[i], skip_special_tokens=True)}")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2") # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
"distilgpt2"
) # Download model and configuration from huggingface.co and cache.
input_context = "The dog"
input_ids = tokenizer.encode(input_context, return_tensors="tf") # encode input context
outputs = model.generate(
input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3, do_sample=True
) # generate 3 candidates using sampling
for i in range(3): # 3 output sequences were generated
    print(f"Generated {i}: {tokenizer.decode(outputs[i], skip_special_tokens=True)}")
tokenizer = AutoTokenizer.from_pretrained("ctrl") # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
"ctrl"
) # Download model and configuration from huggingface.co and cache.
input_context = "Legal My neighbor is" # "Legal" is one of the control codes for ctrl
input_ids = tokenizer.encode(input_context, return_tensors="tf") # encode input context
outputs = model.generate(
input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2
) # generate sequences
print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
tokenizer = AutoTokenizer.from_pretrained("gpt2") # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained(
"gpt2"
) # Download model and configuration from huggingface.co and cache.
input_context = "My cute dog"
bad_words_ids = [
tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ["idiot", "stupid", "shut up"]
]
input_ids = tokenizer.encode(input_context, return_tensors="tf") # encode input context
outputs = model.generate(
input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids
) # generate sequences without allowing bad_words to be generated
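The last example passes bad_words_ids as a list of token-id sequences, one per banned word. A rough sketch of how such a list can be turned into a per-step ban is shown below: a single-token entry is always banned, while a multi-token entry bans its final token only once the text already ends with the preceding tokens. The function name `banned_next_tokens` and the token ids are assumptions for illustration, and entries are assumed to be non-empty.

```python
def banned_next_tokens(generated, bad_words_ids):
    """Return the set of token ids that must not be generated next.

    generated: token ids produced so far.
    bad_words_ids: list of non-empty token-id sequences that must not appear.
    """
    banned = set()
    for bad in bad_words_ids:
        prefix, last = bad[:-1], bad[-1]
        # An empty prefix means a single-token bad word, which is always banned.
        if not prefix or generated[-len(prefix):] == prefix:
            banned.add(last)
    return banned


bad_words_ids = [[7], [3, 4]]  # made-up token ids
after_neutral_text = banned_next_tokens([1, 2], bad_words_ids)
after_bad_prefix = banned_next_tokens([1, 3], bad_words_ids)
```

In the second call the text ends with token 3, so token 4 joins the banned set because emitting it would complete the two-token sequence.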
FlaxGenerationMixin
A class containing all functions for auto-regressive text generation, to be used as a mixin in FlaxPreTrainedModel.
The class exposes generate(), which can be used for:
- greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False.
- multinomial sampling by calling _sample() if num_beams=1 and do_sample=True.
- beam-search decoding by calling _beam_search() if num_beams>1 and do_sample=False.
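The dispatch rules listed above can be sketched as a small function. The name `select_strategy` is hypothetical, and treating the combination that the list does not mention (num_beams>1 with do_sample=True) as unsupported is an assumption of this sketch.

```python
def select_strategy(num_beams=1, do_sample=False):
    """Map the two flags to the name of the decoding routine from the list above."""
    if num_beams == 1:
        return "_sample" if do_sample else "_greedy_search"
    if not do_sample:
        return "_beam_search"
    # Assumption: the list above does not cover beam sampling, so the sketch rejects it.
    raise NotImplementedError("num_beams > 1 with do_sample=True is not covered here")


greedy = select_strategy(num_beams=1, do_sample=False)
sampling = select_strategy(num_beams=1, do_sample=True)
beam = select_strategy(num_beams=4, do_sample=False)
```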
generate
< source >( input_ids: ndarray max_length: typing.Optional[int] = None max_new_tokens: typing.Optional[int] = None pad_token_id: typing.Optional[int] = None bos_token_id: typing.Optional[int] = None eos_token_id: typing.Optional[int] = None decoder_start_token_id: typing.Optional[int] = None do_sample: typing.Optional[bool] = None prng_key: typing.Optional[jax._src.numpy.ndarray.ndarray] = None top_k: typing.Optional[int] = None top_p: typing.Optional[float] = None temperature: typing.Optional[float] = None num_beams: typing.Optional[int] = None no_repeat_ngram_size: typing.Optional[int] = None min_length: typing.Optional[int] = None forced_bos_token_id: typing.Optional[int] = None forced_eos_token_id: typing.Optional[int] = None length_penalty: typing.Optional[float] = None early_stopping: typing.Optional[bool] = None trace: bool = True params: typing.Union[typing.Dict[str, jax._src.numpy.ndarray.ndarray], NoneType] = None **model_kwargs )
Parameters
- input_ids (jnp.ndarray of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
- max_length (int, optional, defaults to model.config.max_length) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. In general, prefer the use of max_new_tokens, which ignores the number of tokens in the prompt.
- max_new_tokens (int, optional) — The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
- do_sample (bool, optional, defaults to False) — Whether or not to use sampling; use greedy decoding otherwise.
- temperature (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.
- top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
- top_p (float, optional, defaults to 1.0) — If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
- pad_token_id (int, optional) — The id of the padding token.
- bos_token_id (int, optional) — The id of the beginning-of-sequence token.
- eos_token_id (int, optional) — The id of the end-of-sequence token.
- num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
- decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.
- trace (bool, optional, defaults to True) — Whether to trace generation. Setting trace=False should only be used for debugging and will lead to a considerably slower runtime.
- params (Dict[str, jnp.ndarray], optional) — Optionally the model parameters can be passed. Can be useful for parallelized generation.
- model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_. Also accepts encoder_outputs to skip encoder part.
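The relationship between max_length and max_new_tokens described above is simple arithmetic: max_length counts the prompt, max_new_tokens does not. The sketch below makes that explicit. The function name `resolve_max_length` and the `config_max_length` fallback are hypothetical, and giving max_new_tokens precedence when both are passed is an assumption of this sketch rather than documented behaviour.

```python
def resolve_max_length(prompt_length, max_length=None, max_new_tokens=None, config_max_length=20):
    """Return the total sequence length (prompt plus generated tokens) at which generation stops."""
    if max_new_tokens is not None:
        # Assumption: max_new_tokens takes precedence if both arguments are given.
        return prompt_length + max_new_tokens
    if max_length is not None:
        return max_length
    return config_max_length


from_new_tokens = resolve_max_length(prompt_length=5, max_new_tokens=10)
from_max_length = resolve_max_length(prompt_length=5, max_length=12)
from_config = resolve_max_length(prompt_length=5)
```

With a 5-token prompt, max_length=12 leaves room for only 7 new tokens, whereas max_new_tokens=10 always allows 10 regardless of how long the prompt is.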
Generates sequences of token ids for models with a language modeling head. The method supports the following generation methods for text-decoder, text-to-text, speech-to-text, and vision-to-text models:
- greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False.
- multinomial sampling by calling _sample() if num_beams=1 and do_sample=True.
- beam-search decoding by calling _beam_search() if num_beams>1 and do_sample=False.
Apart from input_ids, all the arguments below will default to the value of the attribute of the same name as defined in the model’s config (config.json), which in turn defaults to the PretrainedConfig of the model.
Most of these parameters are explained in more detail in this blog post.
Examples:
>>> from transformers import AutoTokenizer, FlaxAutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
>>> model = FlaxAutoModelForCausalLM.from_pretrained("distilgpt2")
>>> input_context = "The dog"
>>> # encode input context
>>> input_ids = tokenizer(input_context, return_tensors="np").input_ids
>>> # generate candidates using sampling
>>> outputs = model.generate(input_ids=input_ids, max_length=20, top_k=30, do_sample=True)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)