SeamlessM4T
Overview
The SeamlessM4T model was proposed in SeamlessM4T — Massively Multilingual & Multimodal Machine Translation by the Seamless Communication team from Meta AI. This is the version 1 release of the model. For the updated version 2 release, refer to the Seamless M4T v2 docs.
SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
SeamlessM4T enables multiple tasks without relying on separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
SeamlessM4TModel can perform all the above tasks, but each task also has its own dedicated sub-model.
The abstract from the paper is the following:
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication
Usage
First, load the processor and a checkpoint of the model:
>>> from transformers import AutoProcessor, SeamlessM4TModel
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.
Here is how to use the processor to process text and audio:
>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]
>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
Speech
SeamlessM4TModel can seamlessly generate text or speech with few or no changes. Let’s target Russian voice translation:
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
With essentially the same code, we have translated English text and Arabic speech into Russian speech samples.
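The output is a waveform array at the sampling rate defined in the model configuration (16 kHz by default). As a minimal sketch, assuming the optional scipy dependency is installed, you can save it to a WAV file:
>>> import scipy.io.wavfile
>>> sample_rate = model.config.sampling_rate  # 16000 Hz by default
>>> scipy.io.wavfile.write("translated_speech.wav", rate=sample_rate, data=audio_array_from_text)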
Text
Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass generate_speech=False to SeamlessM4TModel.generate().
This time, let’s translate to French.
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
Tips
1. Use dedicated models
SeamlessM4TModel is the Transformers top-level model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task; the rest of the code is exactly the same:
>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task; you only have to remove generate_speech=False.
>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")
Feel free to try out SeamlessM4TForSpeechToText and SeamlessM4TForTextToSpeech as well.
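For instance, here is a minimal sketch of S2TT with the dedicated speech-to-text model, reusing the processor and audio_inputs from the usage section. Note that, unlike SeamlessM4TModel, this model returns the text token ids directly, so the extra [0] indexing from the earlier example is not needed:
>>> from transformers import SeamlessM4TForSpeechToText
>>> model = SeamlessM4TForSpeechToText.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra")
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)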
2. Change the speaker identity
You can change the speaker used for speech synthesis with the spkr_id argument. Some spkr_id values work better than others for some languages!
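For example, reusing the SeamlessM4TModel and text_inputs from the usage section, a quick sketch (the chosen spkr_id is an arbitrary illustration; valid ids range from 0 to config.vocoder_num_spkrs - 1, i.e. 200 speakers by default):
>>> # spkr_id=4 is only an example value
>>> audio_array = model.generate(**text_inputs, tgt_lang="rus", spkr_id=4)[0].cpu().numpy().squeeze()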
3. Change the generation strategy
You can use different generation strategies for speech and text generation, e.g. .generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True), which will successively perform beam-search decoding on the text model and multinomial sampling on the speech model.
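For example, assuming the SeamlessM4TModel and text_inputs from the usage section, a sketch could look like this; the prefixed arguments are forwarded to the corresponding sub-model's generate call:
>>> # beam search for the text decoder, multinomial sampling for the speech (unit) decoder
>>> outputs = model.generate(**text_inputs, tgt_lang="rus", text_num_beams=4, speech_do_sample=True)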
4. Generate speech and text at the same time
Use return_intermediate_token_ids=True with SeamlessM4TModel to return both speech and text!
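A minimal sketch, reusing the SeamlessM4TModel and text_inputs from the usage section and assuming the output exposes the waveform and sequences fields of SeamlessM4TGenerationOutput:
>>> outputs = model.generate(**text_inputs, tgt_lang="rus", return_intermediate_token_ids=True)
>>> speech = outputs.waveform[0].cpu().numpy().squeeze()
>>> translated_text = processor.decode(outputs.sequences[0].tolist(), skip_special_tokens=True)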
Model architecture
SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as “unit tokens,” from the translated text.
Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the HiFi-GAN architecture is placed on top of the second seq2seq model.
Here’s how the generation process works:
- Input text or speech is processed through its specific encoder.
- A decoder creates text tokens in the desired language.
- If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens.
- These unit tokens are then passed through the final vocoder to produce the actual speech.
This model was contributed by ylacombe. The original code can be found here.
SeamlessM4TModel
class transformers.SeamlessM4TModel
< source >( config current_modality = 'text' )
Parameters
- config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- current_modality (
str
, optional, defaults to"text"
) — Default modality. Used to initialize the model.
The original SeamlessM4T Model transformer which can be used for all the available tasks (S2ST, S2TT, T2TT, T2ST). This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
generate
< source >( input_ids: typing.Optional[torch.Tensor] = None input_features: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None spkr_id: typing.Optional[int] = 0 generate_speech: typing.Optional[bool] = True **kwargs ) → Union[SeamlessM4TGenerationOutput, Tuple[Tensor], ModelOutput]
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of input sequence tokens in the vocabulary.Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- input_features (
torch.FloatTensor
of shape(batch_size, sequence_length, num_banks)
, optional) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. - return_intermediate_token_ids (
bool
, optional) — IfTrue
, also returns the intermediate generated text and unit tokens. Set toTrue
if you also want to get translated text alongside the audio. Note that ifgenerate_speech=True
, this parameter will be ignored. - tgt_lang (
str
, optional) — The language to use as target language for translation. - spkr_id (
int
, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower thanconfig.vocoder_num_spkrs
. - generate_speech (
bool
, optional, defaults toTrue
) — IfFalse
, will only return the text tokens and won’t generate speech. - kwargs (optional) —
Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword
arguments are of two types:
- Without a prefix, they will be entered as
**kwargs
for thegenerate
method of each sub-model, except fordecoder_input_ids
which will only be passed through the text components. - With a text_ or speech_ prefix, they will be input for the
generate
method of the text model and speech model respectively. They take priority over the keywords without a prefix.
This means you can, for example, specify a generation strategy for one generation but not for the other.
Returns
Union[SeamlessM4TGenerationOutput, Tuple[Tensor], ModelOutput]
- If
generate_speech
andreturn_intermediate_token_ids
, returnsSeamlessM4TGenerationOutput
. - If
generate_speech
and notreturn_intermediate_token_ids
, returns a tuple composed of waveforms of shape(batch_size, sequence_length)
and waveform_lengths
which gives the length of each sample. - If
generate_speech=False
, it will return ModelOutput
.
Generates translated token ids and/or translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_ids=input_ids, num_beams=4, speech_do_sample=True)
will successively
perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
SeamlessM4TForTextToSpeech
class transformers.SeamlessM4TForTextToSpeech
< source >( config: SeamlessM4TConfig )
Parameters
- config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The text-to-speech SeamlessM4T Model transformer which can be used for T2ST. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
generate
< source >( input_ids: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None spkr_id: typing.Optional[int] = 0 **kwargs ) → Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- return_intermediate_token_ids (
bool
, optional) — IfTrue
, also returns the intermediate generated text and unit tokens. Set toTrue
if you also want to get translated text alongside the audio. - tgt_lang (
str
, optional) — The language to use as target language for translation. - spkr_id (
int
, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower thanconfig.vocoder_num_spkrs
. - kwargs (optional) —
Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword
arguments are of two types:
- Without a prefix, they will be entered as
**kwargs
for thegenerate
method of each sub-model, except fordecoder_input_ids
which will only be passed through the text components. - With a text_ or speech_ prefix, they will be input for the
generate
method of the text model and speech model respectively. They take priority over the keywords without a prefix.
This means you can, for example, specify a generation strategy for one generation but not for the other.
Returns
Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]
- If
return_intermediate_token_ids
, returnsSeamlessM4TGenerationOutput
. - If not
return_intermediate_token_ids
, returns a tuple composed of waveforms of shape(batch_size, sequence_length)
and waveform_lengths
which gives the length of each sample.
Generates translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_ids, num_beams=4, speech_do_sample=True)
will successively perform
beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
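As a minimal sketch of T2ST with this dedicated model, mirroring the usage section above:
>>> from transformers import AutoProcessor, SeamlessM4TForTextToSpeech
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TForTextToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
>>> # without return_intermediate_token_ids, generate returns (waveforms, waveform_lengths)
>>> audio_array = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()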
SeamlessM4TForSpeechToSpeech
class transformers.SeamlessM4TForSpeechToSpeech
< source >( config )
Parameters
- config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The speech-to-speech SeamlessM4T Model transformer which can be used for S2ST. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
generate
< source >( input_features: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None spkr_id: typing.Optional[int] = 0 **kwargs ) → Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]
Parameters
- input_features (
torch.FloatTensor
of shape(batch_size, sequence_length, num_banks)
) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. - return_intermediate_token_ids (
bool
, optional) — IfTrue
, also returns the intermediate generated text and unit tokens. Set toTrue
if you also want to get translated text alongside the audio. - tgt_lang (
str
, optional) — The language to use as target language for translation. - spkr_id (
int
, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower thanconfig.vocoder_num_spkrs
. - kwargs (optional) —
Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword
arguments are of two types:
- Without a prefix, they will be entered as
**kwargs
for thegenerate
method of each sub-model, except fordecoder_input_ids
which will only be passed through the text components. - With a text_ or speech_ prefix, they will be input for the
generate
method of the text model and speech model respectively. They take priority over the keywords without a prefix.
This means you can, for example, specify a generation strategy for one generation but not for the other.
Returns
Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]
- If
return_intermediate_token_ids
, returnsSeamlessM4TGenerationOutput
. - If not
return_intermediate_token_ids
, returns a tuple composed of waveforms of shape(batch_size, sequence_length)
and waveform_lengths
which gives the length of each sample.
Generates translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_features, num_beams=4, speech_do_sample=True)
will successively perform
beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
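A minimal sketch of S2ST with this dedicated model, with audio_sample loaded as in the usage section above:
>>> from transformers import AutoProcessor, SeamlessM4TForSpeechToSpeech
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> audio_array = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()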
SeamlessM4TForTextToText
class transformers.SeamlessM4TForTextToText
< source >( config: SeamlessM4TConfig )
Parameters
- config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The text-to-text SeamlessM4T Model transformer which can be used for T2TT. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs )
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of input sequence tokens in the vocabulary.Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
- decoder_input_ids (
torch.LongTensor
of shape(batch_size, target_sequence_length)
, optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
Bart uses the
eos_token_id
as the starting token fordecoder_input_ids
generation. Ifpast_key_values
is used, optionally only the lastdecoder_input_ids
have to be input (seepast_key_values
).For translation and summarization training,
decoder_input_ids
should be provided. If nodecoder_input_ids
is provided, the model will create this tensor by shifting theinput_ids
to the right for denoising pre-training following the paper. - decoder_attention_mask (
torch.LongTensor
of shape(batch_size, target_sequence_length)
, optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids
. Causal mask will also be used by default.If you want to change padding behavior, you should read
modeling_bart._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more information on the default strategy. - encoder_outputs (
tuple(tuple(torch.FloatTensor)
, optional) — Tuple consists of (last_hidden_state
, optional:hidden_states
, optional:attentions
)last_hidden_state
of shape(batch_size, sequence_length, hidden_size)
, optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. - past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) — Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
) and 2 additional tensors of shape(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
.Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding.If
past_key_values
are used, the user can optionally input only the lastdecoder_input_ids
(those that don’t have their past key value states given to this model) of shape(batch_size, 1)
instead of alldecoder_input_ids
of shape(batch_size, sequence_length)
. - inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_ids
indices into associated vectors than the model’s internal embedding lookup matrix. - decoder_inputs_embeds (
torch.FloatTensor
of shape(batch_size, target_sequence_length, hidden_size)
, optional) — Optionally, instead of passingdecoder_input_ids
you can choose to directly pass an embedded representation. Ifpast_key_values
is used, optionally only the lastdecoder_inputs_embeds
have to be input (seepast_key_values
). This is useful if you want more control over how to convertdecoder_input_ids
indices into associated vectors than the model’s internal embedding lookup matrix.If
decoder_input_ids
anddecoder_inputs_embeds
are both unset,decoder_inputs_embeds
takes the value ofinputs_embeds
. - labels (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Labels for computing the masked language modeling loss. Indices should be in[-100, 0, ..., config.vocab_size]
(seeinput_ids
docstring) Tokens with indices set to-100
are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size]
- use_cache (
bool
, optional) — If set toTrue
,past_key_values
key value states are returned and can be used to speed up decoding (seepast_key_values
). - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The SeamlessM4TForTextToText forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
generate
< source >( input_ids = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) → ModelOutput or torch.LongTensor
Parameters
- input_ids (
torch.Tensor
of varying shape depending on the modality, optional) — Indices of input sequence tokens in the vocabulary.Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- tgt_lang (
str
, optional) — The language to use as target language for translation. - generation_config (
~generation.GenerationConfig
, optional) — The generation configuration to be used as base parametrization for the generation call.**kwargs
passed to generate matching the attributes ofgeneration_config
will override them. Ifgeneration_config
is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json
model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation. - logits_processor (
LogitsProcessorList
, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users. - stopping_criteria (
StoppingCriteriaList
, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users. - prefix_allowed_tokens_fn (
Callable[[int, torch.Tensor], List[int]]
, optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id
andinput_ids
. It has to return a list with the allowed tokens for the next generation step conditioned on the batch IDbatch_id
and the previously generated tokensinputs_ids
. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval. - synced_gpus (
bool
, optional, defaults toFalse
) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) - kwargs (
Dict[str, Any]
, optional) — Ad hoc parametrization ofgenerate_config
and/or additional model-specific kwargs that will be forwarded to theforward
function of the model.
Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True
) or a torch.LongTensor.
Generates sequences of token ids.
Most generation-controlling parameters are set in generation_config
which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config
by passing the corresponding
parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True)
.
For an overview of generation strategies and code examples, check out the following guide.
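A minimal sketch of T2TT with this dedicated model, assuming generate returns the token ids directly (the default when return_dict_in_generate is not set):
>>> from transformers import AutoProcessor, SeamlessM4TForTextToText
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra")
>>> translated_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)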
SeamlessM4TForSpeechToText
class transformers.SeamlessM4TForSpeechToText
< source >( config: SeamlessM4TConfig )
Parameters
- config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The speech-to-text SeamlessM4T Model transformer which can be used for S2TT. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_features: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs )
Parameters
- input_features (
torch.FloatTensor
of shape(batch_size, sequence_length, num_banks)
) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. - attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
- decoder_input_ids (
torch.LongTensor
of shape(batch_size, target_sequence_length)
, optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
Bart uses the
eos_token_id
as the starting token fordecoder_input_ids
generation. Ifpast_key_values
is used, optionally only the lastdecoder_input_ids
have to be input (seepast_key_values
).For translation and summarization training,
decoder_input_ids
should be provided. If nodecoder_input_ids
is provided, the model will create this tensor by shifting theinput_ids
to the right for denoising pre-training following the paper. - decoder_attention_mask (
torch.LongTensor
of shape(batch_size, target_sequence_length)
, optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids
. Causal mask will also be used by default.If you want to change padding behavior, you should read
modeling_bart._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more information on the default strategy. - encoder_outputs (
tuple(tuple(torch.FloatTensor)
, optional) — Tuple consists of (last_hidden_state
, optional:hidden_states
, optional:attentions
)last_hidden_state
of shape(batch_size, sequence_length, hidden_size)
, optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. - past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) — Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
) and 2 additional tensors of shape(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
.Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding.If
past_key_values
are used, the user can optionally input only the lastdecoder_input_ids
(those that don’t have their past key value states given to this model) of shape(batch_size, 1)
instead of alldecoder_input_ids
of shape(batch_size, sequence_length)
. - inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_ids
indices into associated vectors than the model’s internal embedding lookup matrix. - decoder_inputs_embeds (
torch.FloatTensor
of shape(batch_size, target_sequence_length, hidden_size)
, optional) — Optionally, instead of passingdecoder_input_ids
you can choose to directly pass an embedded representation. Ifpast_key_values
is used, optionally only the lastdecoder_inputs_embeds
have to be input (seepast_key_values
). This is useful if you want more control over how to convertdecoder_input_ids
indices into associated vectors than the model’s internal embedding lookup matrix.If
decoder_input_ids
anddecoder_inputs_embeds
are both unset,decoder_inputs_embeds
takes the value ofinputs_embeds
. - labels (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Labels for computing the masked language modeling loss. Indices should be in[-100, 0, ..., config.vocab_size]
(seeinput_ids
docstring) Tokens with indices set to-100
are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size]
- use_cache (
bool
, optional) — If set toTrue
,past_key_values
key value states are returned and can be used to speed up decoding (seepast_key_values
). - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The SeamlessM4TForSpeechToText forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
generate
< source >( input_features = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) → ModelOutput or torch.LongTensor
Parameters
- input_features (
torch.FloatTensor
of shape(batch_size, sequence_length, num_banks)
) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. - tgt_lang (
str
, optional) — The language to use as target language for translation. - generation_config (
~generation.GenerationConfig
, optional) — The generation configuration to be used as base parametrization for the generation call.**kwargs
passed to generate matching the attributes ofgeneration_config
will override them. Ifgeneration_config
is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json
model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation. - logits_processor (
LogitsProcessorList
, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users. - stopping_criteria (
StoppingCriteriaList
, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users. - prefix_allowed_tokens_fn (
Callable[[int, torch.Tensor], List[int]]
, optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id
andinput_ids
. It has to return a list with the allowed tokens for the next generation step conditioned on the batch IDbatch_id
and the previously generated tokensinputs_ids
. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval. - synced_gpus (
bool
, optional, defaults toFalse
) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) - kwargs (
Dict[str, Any]
, optional) — Ad hoc parametrization ofgenerate_config
and/or additional model-specific kwargs that will be forwarded to theforward
function of the model.
Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True
) or a torch.LongTensor.
Generates sequences of token ids.
Most generation-controlling parameters are set in generation_config
which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config
by passing the corresponding
parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True)
.
For an overview of generation strategies and code examples, check out the following guide.
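Since ASR is one of the supported tasks, a minimal sketch of transcription with this model is to set tgt_lang to the language of the input audio. The language code used below ("arb" for Modern Standard Arabic, matching the corpus from the usage section) is an assumption; check the checkpoint's supported language codes:
>>> from transformers import AutoProcessor, SeamlessM4TForSpeechToText
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TForSpeechToText.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")  # audio loaded as in the usage section
>>> # tgt_lang equal to the source language performs transcription rather than translation
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="arb")
>>> transcription = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)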
SeamlessM4TConfig
class transformers.SeamlessM4TConfig
< source >( vocab_size = 256102 t2u_vocab_size = 10082 hidden_size = 1024 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True max_position_embeddings = 1024 is_encoder_decoder = True encoder_layerdrop = 0.05 decoder_layerdrop = 0.05 activation_function = 'relu' dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 scale_embedding = True encoder_layers = 24 encoder_ffn_dim = 8192 encoder_attention_heads = 16 decoder_layers = 24 decoder_ffn_dim = 8192 decoder_attention_heads = 16 decoder_start_token_id = 3 max_new_tokens = 256 pad_token_id = 0 bos_token_id = 2 eos_token_id = 3 speech_encoder_layers = 24 speech_encoder_attention_heads = 16 speech_encoder_intermediate_size = 4096 speech_encoder_hidden_act = 'swish' speech_encoder_dropout = 0.0 add_adapter = True speech_encoder_layerdrop = 0.1 feature_projection_input_dim = 160 num_conv_pos_embeddings = 128 num_conv_pos_embedding_groups = 16 adaptor_kernel_size = 8 adaptor_stride = 8 adaptor_dropout = 0.1 num_adapter_layers = 1 position_embeddings_type = 'relative' rotary_embedding_base = 10000 max_source_positions = 4096 conv_depthwise_kernel_size = 31 t2u_bos_token_id = 0 t2u_pad_token_id = 1 t2u_eos_token_id = 2 t2u_decoder_start_token_id = 2 t2u_max_new_tokens = 1024 t2u_encoder_layers = 6 t2u_encoder_ffn_dim = 8192 t2u_encoder_attention_heads = 16 t2u_decoder_layers = 6 t2u_decoder_ffn_dim = 8192 t2u_decoder_attention_heads = 16 t2u_max_position_embeddings = 2048 sampling_rate = 16000 upsample_initial_channel = 512 upsample_rates = [5, 4, 4, 2, 2] upsample_kernel_sizes = [11, 8, 8, 4, 4] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] leaky_relu_slope = 0.1 unit_hifi_gan_vocab_size = 10000 unit_embed_dim = 1280 lang_embed_dim = 256 spkr_embed_dim = 256 vocoder_num_langs = 36 vocoder_num_spkrs = 200 variance_predictor_kernel_size = 3 var_pred_dropout = 0.5 vocoder_offset = 4 **kwargs )
Parameters
- vocab_size (
int
, optional, defaults to 256102) — Vocabulary size of the SeamlessM4T model. Defines the number of different tokens that can be represented by theinputs_ids
passed when calling ~SeamlessM4TModel, ~SeamlessM4TForTextToSpeech or ~SeamlessM4TForTextToText. - t2u_vocab_size (
int
, optional, defaults to 10082) — Unit vocabulary size of the SeamlessM4T model. Defines the number of different unit tokens that can be represented by theinputs_ids
passed when calling the Text-To-Units sub-model of ~SeamlessM4TModel, ~SeamlessM4TForSpeechToSpeech or ~SeamlessM4TForTextToSpeech.
Parameters shared across sub-models
- hidden_size (
int
, optional, defaults to 1024) — Dimensionality of the “intermediate” layers in the architecture. - initializer_range (
float
, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - layer_norm_eps (
float
, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers. - use_cache (
bool
, optional, defaults toTrue
) — Whether or not the model should return the last key/values attentions (not used by all models). - max_position_embeddings (
int
, optional, defaults to 1024) — The maximum sequence length that this model's text encoder and decoder might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). - is_encoder_decoder (
bool
, optional, defaults toTrue
) — Whether the model is used as an encoder/decoder or not. - encoder_layerdrop (
float
, optional, defaults to 0.05) — The LayerDrop probability for the encoders. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556) for more details. - decoder_layerdrop (
float
, optional, defaults to 0.05) — The LayerDrop probability for the decoders. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556) for more details. - activation_function (
str
orfunction
, optional, defaults to"relu"
) — The non-linear activation function (function or string) in the decoder and feed-forward layers. If string,"gelu"
,"relu"
,"selu"
,"swish"
and"gelu_new"
are supported. - dropout (
float
, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler. - attention_dropout (
float
, optional, defaults to 0.1) — The dropout probability for all attention layers. - activation_dropout (
float
, optional, defaults to 0.0) — The dropout probability for all activation layers in the model. - scale_embedding (
bool
, optional, defaults toTrue
) — Scale embeddings by dividing by sqrt(d_model).
Text encoder and text decoder specific parameters
- encoder_layers (
int
, optional, defaults to 24) — Number of hidden layers in the Transformer text encoder. - encoder_ffn_dim (
int
, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text encoder. - encoder_attention_heads (
int
, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text encoder. - decoder_layers (
int
, optional, defaults to 24) — Number of hidden layers in the Transformer text decoder. - decoder_ffn_dim (
int
, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text decoder. - decoder_attention_heads (
int
, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text decoder. - decoder_start_token_id (
int
, optional, defaults to 3) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only applied in the text decoder. - max_new_tokens (
int
, optional, defaults to 256) — The maximum numbers of text tokens to generate, ignoring the number of tokens in the prompt. - pad_token_id (
int
, optional, defaults to 0) — The id of the padding text token. Only applied to the text-decoder model. - bos_token_id (
int
, optional, defaults to 2) — The id of the beginning-of-stream text token. Only applied to the text-decoder model. - eos_token_id (
int
, optional, defaults to 3) — The id of the end-of-stream text token. Only applied to the text-decoder model.
Speech encoder specific parameters
- speech_encoder_layers (
int
, optional, defaults to 24) — Number of hidden layers in the Transformer speech encoder. - speech_encoder_attention_heads (
int
, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer speech encoder. - speech_encoder_intermediate_size (
int
, optional, defaults to 4096) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer speech encoder. - speech_encoder_hidden_act (
str
orfunction
, optional, defaults to"swish"
) — The non-linear activation function (function or string) in the speech encoder. If string,"gelu"
,"relu"
,"selu"
,"swish"
and"gelu_new"
are supported. - speech_encoder_dropout (
float
, optional, defaults to 0.0) — The dropout probability for all layers in the speech encoder. - add_adapter (
bool
, optional, defaults toTrue
) — Add an adapter layer on top of the speech encoder. - speech_encoder_layerdrop (
float
, optional, defaults to 0.1) — The LayerDrop probability for the speech encoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556) for more details. - feature_projection_input_dim (
int
, optional, defaults to 160) — Input dimension of the input feature projection of the speech encoder, i.e. the dimension after processing input audio with SeamlessM4TFeatureExtractor. - num_conv_pos_embeddings (
int
, optional, defaults to 128) — Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional embeddings layer of the speech encoder. - num_conv_pos_embedding_groups (
int
, optional, defaults to 16) — Number of groups of 1D convolutional positional embeddings layer of the speech encoder. - adaptor_kernel_size (
int
, optional, defaults to 8) — Kernel size of the convolutional layers in the adapter network. Only relevant ifadd_adapter is True
. - adaptor_stride (
int
, optional, defaults to 8) — Stride of the convolutional layers in the adapter network. Only relevant ifadd_adapter is True
. - adaptor_dropout (
float
, optional, defaults to 0.1) — The dropout probability for all layers in the speech adapter. - num_adapter_layers (
int
, optional, defaults to 1) — Number of convolutional layers that should be used in the adapter network. Only relevant ifadd_adapter is True
. - position_embeddings_type (
str
, optional, defaults to"relative"
) — Can be specified torelative
orrotary
for relative or rotary position embeddings respectively. If leftNone
no relative position embedding is applied. Only applied to the speech encoder. - rotary_embedding_base (
int
, optional, defaults to 10000) — If"rotary"
position embeddings are used, defines the size of the embedding base. Only applied to the speech encoder. - max_source_positions (
int
, optional, defaults to 4096) — if"relative"
position embeddings are used, defines the maximum source input positions. Only applied to the speech encoder. - conv_depthwise_kernel_size (
int
, optional, defaults to 31) — Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder.
Text-To-Unit (t2u) model specific parameters
- t2u_bos_token_id (
int
, optional, defaults to 0) — The id of the beginning-of-stream unit token. Only applied to the text-to-unit seq2seq model. - t2u_pad_token_id (
int
, optional, defaults to 1) — The id of the padding unit token. Only applied to the text-to-unit seq2seq model. - t2u_eos_token_id (
int
, optional, defaults to 2) — The id of the end-of-stream unit token. Only applied to the text-to-unit seq2seq model. - t2u_decoder_start_token_id (
int
, optional, defaults to 2) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only applied to the text-to-unit seq2seq model. - t2u_max_new_tokens (
int
, optional, defaults to 1024) — The maximum numbers of unit tokens to generate, ignoring the number of tokens in the prompt. Only applied to the text-to-unit seq2seq model. - t2u_encoder_layers (
int
, optional, defaults to 6) — Number of hidden layers in the Transformer text-to-unit encoder. - t2u_encoder_ffn_dim (
int
, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit encoder. - t2u_encoder_attention_heads (
int
, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text-to-unit encoder. - t2u_decoder_layers (
int
, optional, defaults to 6) — Number of hidden layers in the Transformer text-to-unit decoder. - t2u_decoder_ffn_dim (
int
, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit decoder. - t2u_decoder_attention_heads (
int
, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text-to-unit decoder. - t2u_max_position_embeddings (
int
, optional, defaults to 2048) — The maximum sequence length that this model's text-to-unit component might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
Hifi-Gan Vocoder specific parameters
- sampling_rate (
int
, optional, defaults to 16000) — The sampling rate at which the output audio will be generated, expressed in hertz (Hz). - upsample_initial_channel (
int
, optional, defaults to 512) — The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only. - upsample_rates (
Tuple[int]
orList[int]
, optional, defaults to[5, 4, 4, 2, 2]
) — A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network. The length of upsample_rates defines the number of convolutional layers and has to match the length of upsample_kernel_sizes. Applies to the vocoder only. - upsample_kernel_sizes (
Tuple[int]
orList[int]
, optional, defaults to[11, 8, 8, 4, 4]
) — A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling network. The length of upsample_kernel_sizes defines the number of convolutional layers and has to match the length of upsample_rates. Applies to the vocoder only. - resblock_kernel_sizes (
Tuple[int]
orList[int]
, optional, defaults to[3, 7, 11]
) — A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive field fusion (MRF) module. Applies to the vocoder only. - resblock_dilation_sizes (
Tuple[Tuple[int]]
orList[List[int]]
, optional, defaults to[[1, 3, 5], [1, 3, 5], [1, 3, 5]]
) — A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in the multi-receptive field fusion (MRF) module. Applies to the vocoder only. - leaky_relu_slope (
float
, optional, defaults to 0.1) — The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder only. - unit_hifi_gan_vocab_size (
int
, optional, defaults to 10000) — Vocabulary size of the SeamlessM4T vocoder. Defines the number of different unit tokens that can be represented by theinputs_ids
passed when calling the vocoder of ~SeamlessM4TModel, ~SeamlessM4TForSpeechToSpeech or ~SeamlessM4TForTextToSpeech. - unit_embed_dim (
int
, optional, defaults to 1280) — The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only. - lang_embed_dim (
int
, optional, defaults to 256) — The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only. - spkr_embed_dim (
int
, optional, defaults to 256) — The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only. - vocoder_num_langs (
int
, optional, defaults to 36) — Number of languages supported by the vocoder. Might be different from t2u_num_langs
. - vocoder_num_spkrs (
int
, optional, defaults to 200) — Number of speakers supported by the vocoder. - variance_predictor_kernel_size (
int
, optional, defaults to 3) — Kernel size of the duration predictor. Applies to the vocoder only. - var_pred_dropout (
float
, optional, defaults to 0.5) — The dropout probability of the duration predictor. Applies to the vocoder only. - vocoder_offset (
int
, optional, defaults to 4) — Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only.
This is the configuration class to store the configuration of a ~SeamlessM4TModel. It is used to instantiate a SeamlessM4T model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the SeamlessM4T “facebook/hf-seamless-m4t-medium” architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import SeamlessM4TModel, SeamlessM4TConfig
>>> # Initializing a SeamlessM4T "facebook/hf-seamless-m4t-medium" style configuration
>>> configuration = SeamlessM4TConfig()
>>> # Initializing a model from the "facebook/hf-seamless-m4t-medium" style configuration
>>> model = SeamlessM4TModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
SeamlessM4TTokenizer
class transformers.SeamlessM4TTokenizer
< source >( vocab_file bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' tokenizer_file = None src_lang = 'eng' tgt_lang = 'fra' sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None additional_special_tokens = None **kwargs )
Parameters
- vocab_file (
str
) — Path to the vocabulary file. - bos_token (
str
, optional, defaults to"<s>"
) — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the
cls_token
. - eos_token (
str
, optional, defaults to"</s>"
) — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the
sep_token
. - sep_token (
str
, optional, defaults to"</s>"
) — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. - cls_token (
str
, optional, defaults to"<s>"
) — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens. - unk_token (
str
, optional, defaults to"<unk>"
) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. - pad_token (
str
, optional, defaults to"<pad>"
) — The token used for padding, for example when batching sequences of different lengths. - tokenizer_file (
str
, optional) — The path to a tokenizer file to use instead of the vocab file. - src_lang (
str
, optional, defaults to"eng"
) — The language to use as source language for translation. - tgt_lang (
str
, optional, defaults to"fra"
) — The language to use as target language for translation. - sp_model_kwargs (
Dict[str, Any]
, optional) — Additional keyword arguments to pass to the model initialization. - additional_special_tokens (tuple or list of
str
ortokenizers.AddedToken
, optional) — A tuple or a list of additional special tokens. Can be used to specify the list of languages that will be supported by the tokenizer.
Construct a SeamlessM4T tokenizer.
Adapted from RobertaTokenizer and XLNetTokenizer. Based on SentencePiece.
The tokenization method is <language code> <tokens> <eos>
for source language documents, and <eos> <language code> <tokens> <eos>
for target language documents.
Examples:
>>> from transformers import SeamlessM4TTokenizer
>>> tokenizer = SeamlessM4TTokenizer.from_pretrained(
... "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
__call__
< source >( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True pad_to_multiple_of: typing.Optional[int] = 2 src_lang: typing.Optional[str] = None tgt_lang: typing.Optional[str] = None **kwargs )
Parameters
- text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_pair (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_pair_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
- pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
- src_lang (str, optional) — A string representing the source language. If not specified, the last src_lang specified (either during initialization or when calling this tokenizer) will be used.
- tgt_lang (str, optional) — A string representing the target language. If not specified, the last tgt_lang specified (either during initialization or when calling this tokenizer) will be used.
- kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to PreTrainedTokenizer.__call__().
build_inputs_with_special_tokens
< source >( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
- token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.
- token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
List of input IDs with the appropriate special tokens.
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An NLLB sequence has the following format, where X represents the sequence:
- input_ids (for encoder): X [eos, src_lang_code]
- decoder_input_ids (for decoder): X [eos, tgt_lang_code]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
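A small sketch of this method on the tokenizer from the example above (hedged: the exact special-token IDs depend on the configured source language):
>>> tokens = tokenizer.tokenize("UN Chief Says There Is No Military Solution in Syria")
>>> ids = tokenizer.convert_tokens_to_ids(tokens)
>>> with_special = tokenizer.build_inputs_with_special_tokens(ids)
>>> # with_special is ids plus the eos and language-code special tokens in the format described above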
get_special_tokens_mask
< source >( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) → List[int]
Parameters
- token_ids_0 (List[int]) — List of IDs.
- token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
- already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.
Returns
List[int]
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.
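For illustration, reusing the same tokenizer (hedged, since the exact positions depend on the tokenization):
>>> encoded = tokenizer("Hello!")["input_ids"]
>>> mask = tokenizer.get_special_tokens_mask(encoded, already_has_special_tokens=True)
>>> # mask[i] is 1 where encoded[i] is a special token (language code, eos, ...), 0 otherwise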
create_token_type_ids_from_sequences
< source >( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Create a mask from the two sequences passed to be used in a sequence-pair classification task. NLLB does not make use of token type ids, therefore a list of zeros is returned.
SeamlessM4TTokenizerFast
class transformers.SeamlessM4TTokenizerFast
< source >( vocab_file = None tokenizer_file = None bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' src_lang = 'eng' tgt_lang = 'fra' additional_special_tokens = None **kwargs )
Parameters
- vocab_file (str, optional) — Path to the vocabulary file.
- tokenizer_file (str, optional) — The path to a tokenizer file to use instead of the vocab file.
- bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence; the token used is the cls_token.
- eos_token (str, optional, defaults to "</s>") — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence; the token used is the sep_token.
- sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- cls_token (str, optional, defaults to "<s>") — The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
- src_lang (str, optional, defaults to "eng") — The language to use as the source language for translation.
- tgt_lang (str, optional, defaults to "fra") — The language to use as the target language for translation.
- additional_special_tokens (tuple or list of str or tokenizers.AddedToken, optional) — A tuple or a list of additional special tokens.
Construct a “fast” SeamlessM4T tokenizer (backed by HuggingFace’s tokenizers library). Based on BPE.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
The tokenization method is <language code> <tokens> <eos> for source language documents, and <eos> <language code> <tokens> <eos> for target language documents.
Examples:
>>> from transformers import SeamlessM4TTokenizerFast
>>> tokenizer = SeamlessM4TTokenizerFast.from_pretrained(
... "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
__call__
< source >( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True pad_to_multiple_of: typing.Optional[int] = 2 src_lang: typing.Optional[str] = None tgt_lang: typing.Optional[str] = None **kwargs )
Parameters
- text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_pair (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_pair_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
- pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
- src_lang (str, optional) — A string representing the source language. If not specified, the last src_lang specified (either during initialization or when calling this tokenizer) will be used.
- tgt_lang (str, optional) — A string representing the target language. If not specified, the last tgt_lang specified (either during initialization or when calling this tokenizer) will be used.
- kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to PreTrainedTokenizerFast.__call__().
SeamlessM4TFeatureExtractor
class transformers.SeamlessM4TFeatureExtractor
< source >( feature_size = 80 sampling_rate = 16000 num_mel_bins = 80 padding_value = 0.0 stride = 2 **kwargs )
Parameters
- feature_size (int, optional, defaults to 80) — The feature dimension of the extracted features.
- sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
- num_mel_bins (int, optional, defaults to 80) — Number of Mel-frequency bins.
- padding_value (float, optional, defaults to 0.0) — The value that is used to fill the padding vectors.
- stride (int, optional, defaults to 2) — Stride used to reshape audios from shape (batch_size, num_frames, num_mel_bins) to (batch_size, num_frames // stride, num_mel_bins * stride).
Constructs a SeamlessM4T feature extractor.
This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
This class extracts mel-filter bank features from raw speech.
__call__
< source >( raw_speech: typing.Union[numpy.ndarray, typing.List[float], typing.List[numpy.ndarray], typing.List[typing.List[float]]] padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True pad_to_multiple_of: typing.Optional[int] = 2 max_length: typing.Optional[int] = None truncation: bool = False return_tensors: typing.Union[transformers.utils.generic.TensorType, str, NoneType] = None sampling_rate: typing.Optional[int] = None return_attention_mask: typing.Optional[bool] = None do_normalize_per_mel_bins: typing.Optional[bool] = True **kwargs )
Parameters
- raw_speech (np.ndarray, List[float], List[np.ndarray], List[List[float]], List[List[List[float]]]) — The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float values, a list of numpy arrays, a list of lists of float values, or a list of lists of lists of float values. If raw_speech is a one-dimensional np.ndarray or a List[float], raw_speech is considered a single-channel, single-sample sound. In all other cases, the first dimension of raw_speech, whether from an np.ndarray or a List[...], corresponds to the number of samples in the batch, and the number of channels (i.e. mono or stereo character) is derived from the other dimensions (1D -> single-channel waveform batches; 2D -> stereo-channel waveform batches).
- padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
- pad_to_multiple_of (int, optional, defaults to 2) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta), or on TPUs, which benefit from having sequence lengths be a multiple of 128.
- max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
- truncation (bool) — Activates truncation to cut input sequences longer than max_length to max_length.
- return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature extractor's default. For SeamlessM4T models, attention_mask should always be passed for batched inference, to avoid subtle bugs.
- return_tensors (str or TensorType, optional) — If set, will return tensors instead of lists of python integers. Acceptable values are:
  - 'tf': Return TensorFlow tf.constant objects.
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return Numpy np.ndarray objects.
- sampling_rate (int, optional) — The sampling rate at which the raw_speech input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.
- do_normalize_per_mel_bins (bool, optional, defaults to True) — Whether or not to zero-mean unit-variance normalize the input per mel-channel.
- kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to the tokenizer or the feature extractor.
Main method to featurize and prepare for the model one or several sequence(s).
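A minimal, hedged sketch of featurizing a waveform (random audio stands in for real speech; the output shape follows the stride reshaping documented above):
>>> import numpy as np
>>> from transformers import SeamlessM4TFeatureExtractor
>>> feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> waveform = np.random.randn(16000).astype(np.float32)  # 1 second of placeholder mono audio at 16 kHz
>>> features = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
>>> # features["input_features"] has shape (batch_size, num_frames // stride, num_mel_bins * stride)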
SeamlessM4TProcessor
class transformers.SeamlessM4TProcessor
< source >( feature_extractor tokenizer )
Parameters
- feature_extractor (SeamlessM4TFeatureExtractor) — The audio processor is a required input.
- tokenizer (SeamlessM4TTokenizerFast) — The tokenizer is a required input.
Constructs a SeamlessM4T processor which wraps a SeamlessM4T feature extractor and a SeamlessM4T tokenizer into a single processor.
SeamlessM4TProcessor offers all the functionalities of SeamlessM4TFeatureExtractor and SeamlessM4TTokenizerFast. See the __call__() and decode() methods for more information.
__call__
< source >( text = None audios = None src_lang = None tgt_lang = None **kwargs ) → BatchEncoding
Parameters
- text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- audios (np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]) — The audio or batch of audios to be prepared. Each audio can be a NumPy array or a PyTorch tensor. In case of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is the number of channels and T the sample length of the audio.
- src_lang (str, optional) — The language code of the input texts/audios. If not specified, the last src_lang specified will be used.
- tgt_lang (str, optional) — The code of the target language. If not specified, the last tgt_lang specified will be used.
- kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to the feature extractor and/or the tokenizer.
Returns
A BatchEncoding with the following fields:
- input_ids — List of token ids to be fed to a model. Returned when text is not None.
- attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
- input_features — Audio input features to be fed to a model. Returned when audios is not None.
Main method to prepare one or several sequence(s) and audio(s) for the model. This method forwards the text and kwargs arguments to SeamlessM4TTokenizerFast's __call__() if text is not None to encode the text. To prepare the audio(s), this method forwards the audios and kwargs arguments to SeamlessM4TFeatureExtractor's __call__() if audios is not None. Please refer to the docstring of the above two methods for more information.
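A minimal, hedged sketch of both code paths described above (text and audio are passed in separate calls; the random waveform is only a placeholder for real 16 kHz speech):
>>> import numpy as np
>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> text_inputs = processor(text="How is the weather today?", src_lang="eng", return_tensors="pt")  # input_ids, attention_mask
>>> waveform = np.random.randn(2 * 16000).astype(np.float32)  # 2 seconds of placeholder mono audio
>>> audio_inputs = processor(audios=waveform, sampling_rate=16000, return_tensors="pt")  # input_features, attention_mask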
SeamlessM4TCodeHifiGan
class transformers.SeamlessM4TCodeHifiGan
< source >( config )
Parameters
- config (SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Code HiFi-GAN vocoder as described in this repository. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: LongTensor spkr_id: Tensor lang_id: Tensor )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using SeamlessM4TTextToUnitForConditionalGeneration. What are input IDs?
- spkr_id (int, optional) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
- lang_id (str, optional) — The language id to use as the target language for translation.
SeamlessM4THifiGan
forward
< source >( input_embeds: FloatTensor ) → torch.FloatTensor
Parameters
- spectrogram (torch.FloatTensor) — Tensor containing the log-mel spectrograms. Can be batched and of shape (batch_size, sequence_length, model_in_dim), or un-batched and of shape (sequence_length, model_in_dim). Note that model_in_dim is the sum of config.unit_embed_dim, config.lang_embed_dim and config.spkr_embed_dim.
Returns
torch.FloatTensor
Tensor containing the speech waveform. If the input spectrogram is batched, it will be of shape (batch_size, num_frames). If un-batched, it will be of shape (num_frames,).
Converts a log-mel spectrogram into a speech waveform. Passing a batch of log-mel spectrograms returns a batch of speech waveforms. Passing a single, un-batched log-mel spectrogram returns a single, un-batched speech waveform.
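A rough sketch of that conversion, assuming SeamlessM4THifiGan can be imported from the SeamlessM4T modeling module and instantiated from a bare SeamlessM4TConfig (both assumptions for illustration; in practice the vocoder is driven internally by the speech-generating models):
>>> import torch
>>> from transformers import SeamlessM4TConfig
>>> from transformers.models.seamless_m4t.modeling_seamless_m4t import SeamlessM4THifiGan  # assumed import path
>>> config = SeamlessM4TConfig()
>>> hifi_gan = SeamlessM4THifiGan(config)  # randomly initialized, for shape illustration only
>>> model_in_dim = config.unit_embed_dim + config.lang_embed_dim + config.spkr_embed_dim
>>> spectrogram = torch.randn(120, model_in_dim)  # un-batched input of shape (sequence_length, model_in_dim)
>>> waveform = hifi_gan(spectrogram)  # un-batched waveform of shape (num_frames,)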
SeamlessM4TTextToUnitModel
class transformers.SeamlessM4TTextToUnitModel
< source >( config: SeamlessM4TConfig embed_tokens_decoder: typing.Optional[torch.nn.modules.sparse.Embedding] = None )
Parameters
- config (SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- embed_tokens_decoder (nn.Embedding, optional) — Input embedding of the decoder.
Bare Transformer text-to-unit encoder-decoder. The encoder is a SeamlessM4TEncoder without embeddings and the decoder is a SeamlessM4TDecoder.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
SeamlessM4TTextToUnitForConditionalGeneration
class transformers.SeamlessM4TTextToUnitForConditionalGeneration
< source >( config: SeamlessM4TConfig embed_tokens_decoder: typing.Optional[torch.nn.modules.sparse.Embedding] = None )
Parameters
- config (SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- embed_tokens_decoder (nn.Embedding, optional) — Input embedding of the decoder.
Transformer text-to-unit encoder-decoder with a language model head. The base encoder-decoder model is a SeamlessM4TTextToUnitModel.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
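These text-to-unit classes are building blocks used internally by the speech-generating SeamlessM4T models rather than entry points on their own. As a hedged pointer, the sub-model can be inspected from a loaded SeamlessM4TModel (the attribute name t2u_model below is an assumption about the implementation, not a documented API):
>>> from transformers import SeamlessM4TModel
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> text_to_unit = model.t2u_model  # assumed attribute holding the text-to-unit sub-model
>>> print(type(text_to_unit).__name__)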
forward
< source >( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.
- decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default. If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size), is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The SeamlessM4TTextToUnitForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.