Transformers documentation

SeamlessM4T

You are viewing v4.39.0 version. A newer version v4.47.1 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

SeamlessM4T

Overview

The SeamlessM4T model was proposed in SeamlessM4T — Massively Multilingual & Multimodal Machine Translation by the Seamless Communication team from Meta AI.

This is the version 1 release of the model. For the updated version 2 release, refer to the Seamless M4T v2 docs.

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T enables multiple tasks without relying on separate models:

  • Speech-to-speech translation (S2ST)
  • Speech-to-text translation (S2TT)
  • Text-to-speech translation (T2ST)
  • Text-to-text translation (T2TT)
  • Automatic speech recognition (ASR)

SeamlessM4TModel can perform all the above tasks, but each task also has its own dedicated sub-model.

The abstract from the paper is the following:

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

Usage

First, load the processor and a checkpoint of the model:

>>> from transformers import AutoProcessor, SeamlessM4TModel

>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

You can seamlessly use this model on text or on audio, to generated either translated text or translated audio.

Here is how to use the processor to process text and audio:

>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]

>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")

>>> # now, process some English test as well
>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")

Speech

SeamlessM4TModel can seamlessly generate text or speech with few or no changes. Let’s target Russian voice translation:

>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

With basically the same code, I’ve translated English text and Arabic speech to Russian speech samples.

Text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass generate_speech=False to SeamlessM4TModel.generate(). This time, let’s translate to French.

>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

Tips

1. Use dedicated models

SeamlessM4TModel is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code:

>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")

Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task, you only have to remove generate_speech=False.

>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")

Feel free to try out SeamlessM4TForSpeechToText and SeamlessM4TForTextToSpeech as well.

2. Change the speaker identity

You have the possibility to change the speaker used for speech synthesis with the spkr_id argument. Some spkr_id works better than other for some languages!

3. Change the generation strategy

You can use different generation strategies for speech and text generation, e.g .generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True) which will successively perform beam-search decoding on the text model, and multinomial sampling on the speech model.

4. Generate speech and text at the same time

Use return_intermediate_token_ids=True with SeamlessM4TModel to return both speech and text !

Model architecture

SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as “unit tokens,” from the translated text.

Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the HiFi-GAN architecture is placed on top of the second seq2seq model.

Here’s how the generation process works:

  • Input text or speech is processed through its specific encoder.
  • A decoder creates text tokens in the desired language.
  • If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens.
  • These unit tokens are then passed through the final vocoder to produce the actual speech.

This model was contributed by ylacombe. The original code can be found here.

SeamlessM4TModel

class transformers.SeamlessM4TModel

< >

( config current_modality = 'text' )

Parameters

  • config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • current_modality (str, optional, defaults to "text") — Default modality. Used to initialize the model.

The original SeamlessM4T Model transformer which can be used for every tasks available (S2ST, S2TT, T2TT, T2ST). This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

generate

< >

( input_ids: Optional = None input_features: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None spkr_id: Optional = 0 generate_speech: Optional = True **kwargs ) Union[SeamlessM4TGenerationOutput, Tuple[Tensor], ModelOutput]

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks), optional) — Input audio features. This should be returnes by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details.
  • return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio. Note that if generate_speech=True, this parameter will be ignored.
  • tgt_lang (str, optional) — The language to use as target language for translation.
  • spkr_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
  • generate_speech (bool, optional, defaults to True) — If False, will only returns the text tokens and won’t generate speech.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:

    • Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
    • With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. It has the priority over the keywords without a prefix.

    This means you can, for example, specify a generation strategy for one generation but not for the other.

Returns

Union[SeamlessM4TGenerationOutput, Tuple[Tensor], ModelOutput]

  • If generate_speech and return_intermediate_token_ids, returns SeamlessM4TGenerationOutput.
  • If generate_speech and not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length)and and waveform_lengths which gives the length of each sample.
  • If generate_speech=False, it will returns ModelOutput.

Generates translated token ids and/or translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_ids=input_ids, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4TForTextToSpeech

class transformers.SeamlessM4TForTextToSpeech

< >

( config: SeamlessM4TConfig )

Parameters

  • config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The text-to-speech SeamlessM4T Model transformer which can be used for T2ST. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

generate

< >

( input_ids: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None spkr_id: Optional = 0 **kwargs ) Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio.
  • tgt_lang (str, optional) — The language to use as target language for translation.
  • spkr_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:

    • Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
    • With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. It has the priority over the keywords without a prefix.

    This means you can, for example, specify a generation strategy for one generation but not for the other.

Returns

Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]

  • If return_intermediate_token_ids, returns SeamlessM4TGenerationOutput.
  • If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length)and and waveform_lengths which gives the length of each sample.

Generates translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_ids, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4TForSpeechToSpeech

class transformers.SeamlessM4TForSpeechToSpeech

< >

( config )

Parameters

  • config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The speech-to-speech SeamlessM4T Model transformer which can be used for S2ST. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

generate

< >

( input_features: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None spkr_id: Optional = 0 **kwargs ) Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]

Parameters

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returnes by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details.
  • return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio.
  • tgt_lang (str, optional) — The language to use as target language for translation.
  • spkr_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:

    • Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
    • With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. It has the priority over the keywords without a prefix.

    This means you can, for example, specify a generation strategy for one generation but not for the other.

Returns

Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]

  • If return_intermediate_token_ids, returns SeamlessM4TGenerationOutput.
  • If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length)and and waveform_lengths which gives the length of each sample.

Generates translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_features, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4TForTextToText

class transformers.SeamlessM4TForTextToText

< >

( config: SeamlessM4TConfig )

Parameters

  • config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The text-to-text SeamlessM4T Model transformer which can be used for T2TT. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( input_ids: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are decoder input IDs?

    Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

    For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

    If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.

  • encoder_outputs (tuple(tuple(torch.FloatTensor), optional) — Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions) last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

    If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The SeamlessM4TForTextToText forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

generate

< >

( input_ids = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) ModelOutput or torch.LongTensor

Parameters

  • input_ids (torch.Tensor of varying shape depending on the modality, optional) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • tgt_lang (str, optional) — The language to use as target language for translation.
  • generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which had the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
  • logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
  • stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
  • prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) — If provided, this function constraints the beam search to allowed tokens only at each step. If not provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens inputs_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
  • kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model.

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor. The possible ModelOutput types are:

Generates sequences of token ids.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4TForSpeechToText

class transformers.SeamlessM4TForSpeechToText

< >

( config: SeamlessM4TConfig )

Parameters

  • config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The speech-to-text SeamlessM4T Model transformer which can be used for S2TT. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( input_features: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )

Parameters

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returnes by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details.
  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are decoder input IDs?

    Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

    For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

    If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.

  • encoder_outputs (tuple(tuple(torch.FloatTensor), optional) — Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions) last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

    If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The SeamlessM4TForSpeechToText forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

generate

< >

( input_features = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) ModelOutput or torch.LongTensor

Parameters

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returnes by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details.
  • tgt_lang (str, optional) — The language to use as target language for translation.
  • generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which had the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
  • logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
  • stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
  • prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) — If provided, this function constraints the beam search to allowed tokens only at each step. If not provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens inputs_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
  • kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model.

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor. The possible ModelOutput types are:

Generates sequences of token ids.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4TConfig

class transformers.SeamlessM4TConfig

< >

( vocab_size = 256102 t2u_vocab_size = 10082 hidden_size = 1024 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True max_position_embeddings = 1024 is_encoder_decoder = True encoder_layerdrop = 0.05 decoder_layerdrop = 0.05 activation_function = 'relu' dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 scale_embedding = True encoder_layers = 24 encoder_ffn_dim = 8192 encoder_attention_heads = 16 decoder_layers = 24 decoder_ffn_dim = 8192 decoder_attention_heads = 16 decoder_start_token_id = 3 max_new_tokens = 256 pad_token_id = 0 bos_token_id = 2 eos_token_id = 3 speech_encoder_layers = 24 speech_encoder_attention_heads = 16 speech_encoder_intermediate_size = 4096 speech_encoder_hidden_act = 'swish' speech_encoder_dropout = 0.0 add_adapter = True speech_encoder_layerdrop = 0.1 feature_projection_input_dim = 160 num_conv_pos_embeddings = 128 num_conv_pos_embedding_groups = 16 adaptor_kernel_size = 8 adaptor_stride = 8 adaptor_dropout = 0.1 num_adapter_layers = 1 position_embeddings_type = 'relative' rotary_embedding_base = 10000 max_source_positions = 4096 conv_depthwise_kernel_size = 31 t2u_bos_token_id = 0 t2u_pad_token_id = 1 t2u_eos_token_id = 2 t2u_decoder_start_token_id = 2 t2u_max_new_tokens = 1024 t2u_encoder_layers = 6 t2u_encoder_ffn_dim = 8192 t2u_encoder_attention_heads = 16 t2u_decoder_layers = 6 t2u_decoder_ffn_dim = 8192 t2u_decoder_attention_heads = 16 t2u_max_position_embeddings = 2048 sampling_rate = 16000 upsample_initial_channel = 512 upsample_rates = [5, 4, 4, 2, 2] upsample_kernel_sizes = [11, 8, 8, 4, 4] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] leaky_relu_slope = 0.1 unit_hifi_gan_vocab_size = 10000 unit_embed_dim = 1280 lang_embed_dim = 256 spkr_embed_dim = 256 vocoder_num_langs = 36 vocoder_num_spkrs = 200 variance_predictor_kernel_size = 3 var_pred_dropout = 0.5 vocoder_offset = 4 **kwargs )

Parameters

Parameters shared across sub-models

  • hidden_size (int, optional, defaults to 1024) — Dimensionality of the “intermediate” layers in the architecture.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
  • max_position_embeddings (int, optional, defaults to 1024) — The maximum sequence length that this model text encoder and decoder might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
  • is_encoder_decoder (bool, optional, defaults to True) — Whether the model is used as an encoder/decoder or not.
  • encoder_layerdrop (float, optional, defaults to 0.05) — The LayerDrop probability for the encoders. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556) for more details.
  • decoder_layerdrop (float, optional, defaults to 0.05) — The LayerDrop probability for the decoders. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556) for more details.
  • activation_function (str or function, optional, defaults to "relu") — The non-linear activation function (function or string) in the decoder and feed-forward layers. If string, "gelu", "relu", "selu", "swish" and "gelu_new" are supported.
  • dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler.
  • attention_dropout (float, optional, defaults to 0.1) — The dropout probability for all attention layers.
  • activation_dropout (float, optional, defaults to 0.0) — The dropout probability for all activation layers in the model.
  • scale_embedding (bool, optional, defaults to True) — Scale embeddings by diving by sqrt(d_model).

Text encoder and text decoder specific parameters

  • encoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer text encoder.
  • encoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text encoder.
  • encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text encoder.
  • decoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer text decoder.
  • decoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text decoder.
  • decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text decoder.
  • decoder_start_token_id (int, optional, defaults to 3) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only applied in the text decoder.
  • max_new_tokens (int, optional, defaults to 256) — The maximum numbers of text tokens to generate, ignoring the number of tokens in the prompt.
  • pad_token_id (int, optional, defaults to 0) — The id of the padding text token. Only applied to the text-decoder model.
  • bos_token_id (int, optional, defaults to 2) — The id of the beginning-of-stream text token. Only applied to the text-decoder model.
  • eos_token_id (int, optional, defaults to 3) — The id of the end-of-stream text token. Only applied to the text-decoder model.

Speech encoder specific parameters

  • speech_encoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer speech encoder.
  • speech_encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer speech encoder.
  • speech_encoder_intermediate_size (int, optional, defaults to 4096) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer speech encoder.
  • speech_encoder_hidden_act (str or function, optional, defaults to "swish") — The non-linear activation function (function or string) in the speech encoder. If string, "gelu", "relu", "selu", "swish" and "gelu_new" are supported.
  • speech_encoder_dropout (float, optional, defaults to 0.0) — The dropout probability for all layers in the speech encoder.
  • add_adapter (bool, optional, defaults to True) — Add an adapter layer on top of the speech encoder.
  • speech_encoder_layerdrop (float, optional, defaults to 0.1) — The LayerDrop probability for the speech encoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556) for more details.
  • feature_projection_input_dim (int, optional, defaults to 160) — Input dimension of the input feature projection of the speech encoder, i.e the dimension after processing input audios with SeamlessM4TFeatureExtractor.
  • num_conv_pos_embeddings (int, optional, defaults to 128) — Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional embeddings layer of the speech encoder.
  • num_conv_pos_embedding_groups (int, optional, defaults to 16) — Number of groups of 1D convolutional positional embeddings layer of the speech encoder.
  • adaptor_kernel_size (int, optional, defaults to 8) — Kernel size of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
  • adaptor_stride (int, optional, defaults to 8) — Stride of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
  • adaptor_dropout (float, optional, defaults to 0.1) — The dropout probability for all layers in the speech adapter.
  • num_adapter_layers (int, optional, defaults to 1) — Number of convolutional layers that should be used in the adapter network. Only relevant if add_adapter is True.
  • position_embeddings_type (str, optional, defaults to "relative") — Can be specified to relative or rotary for relative or rotary position embeddings respectively. If left None no relative position embedding is applied. Only applied to the speech encoder.
  • rotary_embedding_base (int, optional, defaults to 10000) — If "rotary" position embeddings are used, defines the size of the embedding base. Only applied to the speech encoder.
  • max_source_positions (int, optional, defaults to 4096) — if "relative" position embeddings are used, defines the maximum source input positions. Only applied to the speech encoder.
  • conv_depthwise_kernel_size (int, optional, defaults to 31) — Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder.

Text-To-Unit (t2u) model specific parameters

  • t2u_bos_token_id (int, optional, defaults to 0) — The id of the beginning-of-stream unit token. Only applied to the text-to-unit seq2seq model.
  • t2u_pad_token_id (int, optional, defaults to 1) — The id of the padding unit token. Only applied to the text-to-unit seq2seq model.
  • t2u_eos_token_id (int, optional, defaults to 2) — The id of the end-of-stream unit token. Only applied to the text-to-unit seq2seq model.
  • t2u_decoder_start_token_id (int, optional, defaults to 2) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only applied to the text-to-unit seq2seq model.
  • t2u_max_new_tokens (int, optional, defaults to 1024) — The maximum numbers of unit tokens to generate, ignoring the number of tokens in the prompt. Only applied to the text-to-unit seq2seq model.
  • t2u_encoder_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer text-to-unit encoder.
  • t2u_encoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit encoder.
  • t2u_encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text-to-unit encoder.
  • t2u_decoder_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer text-to-unit decoder.
  • t2u_decoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit decoder.
  • t2u_decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text-to-unit decoder.
  • t2u_max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model text-to-unit component might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

    Hifi-Gan Vocoder specific parameters

  • sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the output audio will be generated, expressed in hertz (Hz).
  • upsample_initial_channel (int, optional, defaults to 512) — The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only.
  • upsample_rates (Tuple[int] or List[int], optional, defaults to [5, 4, 4, 2, 2]) — A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network. The length of upsample_rates defines the number of convolutional layers and has to match the length of upsample_kernel_sizes. Applies to the vocoder only.
  • upsample_kernel_sizes (Tuple[int] or List[int], optional, defaults to [11, 8, 8, 4, 4]) — A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling network. The length of upsample_kernel_sizes defines the number of convolutional layers and has to match the length of upsample_rates. Applies to the vocoder only.
  • resblock_kernel_sizes (Tuple[int] or List[int], optional, defaults to [3, 7, 11]) — A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
  • resblock_dilation_sizes (Tuple[Tuple[int]] or List[List[int]], optional, defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]]) — A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
  • leaky_relu_slope (float, optional, defaults to 0.1) — The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder only.
  • unit_hifi_gan_vocab_size (int, optional, defaults to 10000) — Vocabulary size of the SeamlessM4T vocoder. Defines the number of different unit tokens that can be represented by the inputs_ids passed when calling the vocoder of ~SeamlessM4TModel, ~SeamlessM4TForSpeechToSpeech or ~SeamlessM4TForTextToSpeech.
  • unit_embed_dim (int, optional, defaults to 1280) — The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only.
  • lang_embed_dim (int, optional, defaults to 256) — The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only.
  • spkr_embed_dim (int, optional, defaults to 256) — The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only.
  • vocoder_num_langs (int, optional, defaults to 36) — Number of langs supported by the vocoder. Might be different from t2u_num_langs.
  • vocoder_num_spkrs (int, optional, defaults to 200) — Number of speakers supported by the vocoder.
  • variance_predictor_kernel_size (int, optional, defaults to 3) — Kernel size of the duration predictor. Applies to the vocoder only.
  • var_pred_dropout (float, optional, defaults to 0.5) — The dropout probability of the duration predictor. Applies to the vocoder only.
  • vocoder_offset (int, optional, defaults to 4) — Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only.

This is the configuration class to store the configuration of a ~SeamlessM4TModel. It is used to instantiate an SeamlessM4T model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SeamlessM4T “facebook/hf-seamless-m4t-medium” architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

>>> from transformers import SeamlessM4TModel, SeamlessM4TConfig

>>> # Initializing a SeamlessM4T "facebook/hf-seamless-m4t-medium" style configuration
>>> configuration = SeamlessM4TConfig()

>>> # Initializing a model from the "facebook/hf-seamless-m4t-medium" style configuration
>>> model = SeamlessM4TModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

SeamlessM4TTokenizer

class transformers.SeamlessM4TTokenizer

< >

( vocab_file bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' tokenizer_file = None src_lang = 'eng' tgt_lang = 'fra' sp_model_kwargs: Optional = None additional_special_tokens = None add_prefix_space = True **kwargs )

Parameters

  • vocab_file (str) — Path to the vocabulary file.
  • bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

    When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
  • cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
  • pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
  • tokenizer_file (str, optional) — The path to a tokenizer file to use instead of the vocab file.
  • src_lang (str, optional, defaults to "eng") — The language to use as source language for translation.
  • tgt_lang (str, optional, defaults to "fra") — The language to use as target language for translation.
  • sp_model_kwargs (Dict[str, Any], optional) — Additional keyword arguments to pass to the model initialization.
  • additional_special_tokens (tuple or list of str or tokenizers.AddedToken, optional) — A tuple or a list of additional special tokens. Can be used to specify the list of languages that will be supported by the tokenizer.
  • add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows to treat the leading word just as any other word.

Construct a SeamlessM4T tokenizer.

Adapted from RobertaTokenizer and XLNetTokenizer. Based on SentencePiece.

The tokenization method is <language code> <tokens> <eos> for source language documents, and <eos> <language code> <tokens> <eos> for target language documents.

Examples:

>>> from transformers import SeamlessM4TTokenizer

>>> tokenizer = SeamlessM4TTokenizer.from_pretrained(
...     "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")

__call__

< >

( text: Union = None text_pair: Union = None text_target: Union = None text_pair_target: Union = None padding: Union = True pad_to_multiple_of: Optional = 2 src_lang: Optional = None tgt_lang: Optional = None **kwargs )

Parameters

  • text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_pair (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_pair_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value.

    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).

  • src_lang (str, optional) — A string representing the source language. If not specified, the last src_lang specified (either during initialization or when calling this tokenizer) will be used.
  • tgt_lang (str, optional) — A string representing the target language. If not specified, the last tgt_lang specified (either during initialization or when calling this tokenizer) will be used.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to PreTrainedTokenizer.call().

build_inputs_with_special_tokens

< >

( token_ids_0: List token_ids_1: Optional = None ) List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. An NLLB sequence has the following format, where X represents the sequence:

  • input_ids (for encoder) X [eos, src_lang_code]
  • decoder_input_ids: (for decoder) X [eos, tgt_lang_code]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.

get_special_tokens_mask

< >

( token_ids_0: List token_ids_1: Optional = None already_has_special_tokens: bool = False ) List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
  • already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

List[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

create_token_type_ids_from_sequences

< >

( token_ids_0: List token_ids_1: Optional = None ) List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of zeros.

Create a mask from the two sequences passed to be used in a sequence-pair classification task. nllb does not make use of token type ids, therefore a list of zeros is returned.

save_vocabulary

< >

( save_directory: str filename_prefix: Optional = None )

SeamlessM4TTokenizerFast

class transformers.SeamlessM4TTokenizerFast

< >

( vocab_file = None tokenizer_file = None bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' src_lang = 'eng' tgt_lang = 'fra' additional_special_tokens = None **kwargs )

Parameters

  • vocab_file (str, optional) — Path to the vocabulary file.
  • tokenizer_file (str, optional) — The path to a tokenizer file to use instead of the vocab file.
  • bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

    When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
  • cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
  • pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
  • src_lang (str, optional, defaults to "eng") — The language to use as source language for translation.
  • tgt_lang (str, optional, defaults to "fra") — The language to use as target language for translation.
  • additional_special_tokens (tuple or list of str or tokenizers.AddedToken, optional) — A tuple or a list of additional special tokens.

Construct a “fast” SeamlessM4T tokenizer (backed by HuggingFace’s tokenizers library). Based on BPE.

This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

The tokenization method is <language code> <tokens> <eos> for source language documents, and <eos> <language code> <tokens> <eos> for target language documents.

Examples:

>>> from transformers import SeamlessM4TTokenizerFast

>>> tokenizer = SeamlessM4TTokenizerFast.from_pretrained(
...     "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")

__call__

< >

( text: Union = None text_pair: Union = None text_target: Union = None text_pair_target: Union = None padding: Union = True pad_to_multiple_of: Optional = 2 src_lang: Optional = None tgt_lang: Optional = None **kwargs )

Parameters

  • text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_pair (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • text_pair_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value.

    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).

  • src_lang (str, optional) — A string representing the source language. If not specified, the last src_lang specified (either during initialization or when calling this tokenizer) will be used.
  • tgt_lang (str, optional) — A string representing the target language. If not specified, the last tgt_lang specified (either during initialization or when calling this tokenizer) will be used.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to PreTrainedTokenizerFast.call().

SeamlessM4TFeatureExtractor

class transformers.SeamlessM4TFeatureExtractor

< >

( feature_size = 80 sampling_rate = 16000 num_mel_bins = 80 padding_value = 0.0 stride = 2 **kwargs )

Parameters

  • feature_size (int, optional, defaults to 80) — The feature dimension of the extracted features.
  • sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
  • num_mel_bins (int, optional, defaults to 80) — Number of Mel-frequency bins.
  • padding_value (float, optional, defaults to 0.0) — The value that is used to fill the padding vectors.
  • stride (int, optional, defaults to 2) — Stride used to reshape audios from shape (batch_size,num_frames,num_mel_bins) to (batch_size,num_frames//stride,num_mel_bins*stride).

Constructs a SeamlessM4T feature extractor.

This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

This class extracts mel-filter bank features from raw speech.

__call__

< >

( raw_speech: Union padding: Union = True pad_to_multiple_of: Optional = 2 max_length: Optional = None truncation: bool = False return_tensors: Union = None sampling_rate: Optional = None return_attention_mask: Optional = None do_normalize_per_mel_bins: Optional = True **kwargs )

Parameters

  • raw_speech (np.ndarray, torch.Tensor, List[float], List[np.ndarray], List[torch.Tensor], —
  • List[List[float]], List[List[List[float]]]) — The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a torch tensor, a list of float values, a list of numpy arrays, a list of torch tensors, a list of list of float values or a list of a list of list of float values. If raw_speech is a one-dimensional np.ndarray, torch.Tensor or a List[float], raw_speech is considered a single-channel, single-sample sound. In all other cases, the first dimension of raw_speech, whether from an np.ndarray, a torch.Tensor or a List[...], corresponds to the number of samples in the batch, and the number of channels (i.e. mono or stereo character) is derived from the other dimensions (1D -> single-channel waveform batches; 2D-> stereo-channel waveform batches).
  • padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • pad_to_multiple_of (int, optional, defaults to 2) — If set will pad the sequence to a multiple of the provided value.

    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.

  • max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
  • truncation (bool) — Activates truncation to cut input sequences longer than max_length to max_length.
  • return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature_extractor’s default.

    What are attention masks?

    For SeamlessM4T models, attention_mask should always be passed for batched inference, to avoid subtle bugs.

  • return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • sampling_rate (int, optional) — The sampling rate at which the raw_speech input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.
  • do_normalize_per_mel_bins (bool, optional, defaults to True) — Whether or not to zero-mean unit-variance normalize the input per mel-channel.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to the tokenizer or the feature extractor.

Main method to featurize and prepare for the model one or several sequence(s).

SeamlessM4TProcessor

class transformers.SeamlessM4TProcessor

< >

( feature_extractor tokenizer )

Parameters

Constructs a SeamlessM4T processor which wraps a SeamlessM4T feature extractor and a SeamlessM4T tokenizer into a single processor.

SeamlessM4TProcessor offers all the functionalities of SeamlessM4TFeatureExtractor and SeamlessM4TTokenizerFast. See the call() and decode() for more information.

__call__

< >

( text = None audios = None src_lang = None tgt_lang = None **kwargs ) BatchEncoding

Parameters

  • text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
  • audios (np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]) — The audio or batch of audios to be prepared. Each audio can be NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is a number of channels, and T the sample length of the audio.
  • src_lang (str, optional) — The language code of the input texts/audios. If not specified, the last src_lang specified will be used.
  • tgt_lang (str, optional) — The code of the target language. If not specified, the last tgt_lang specified will be used.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to the feature extractor and/or the tokenizer.

Returns

BatchEncoding

A BatchEncoding with the following fields:

  • input_ids — List of token ids to be fed to a model. Returned when text is not None.
  • attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names and if text is not None).
  • input_features — Audio input features to be fed to a model. Returned when audios is not None.

Main method to prepare for the model one or several sequences(s) and audio(s). This method forwards the text and kwargs arguments to SeamlessM4TTokenizerFast’s call() if text is not None to encode the text. To prepare the audio(s), this method forwards the audios and kwrags arguments to SeamlessM4TFeatureExtractor’s call() if audios is not None. Please refer to the doctsring of the above two methods for more information.

SeamlessM4TCodeHifiGan

class transformers.SeamlessM4TCodeHifiGan

< >

( config )

Parameters

  • config (SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Code HiFi-GAN vocoder as described in this repository. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( input_ids: LongTensor spkr_id: Tensor lang_id: Tensor )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTextToUnitForConditionalGeneration. What are input IDs?

  • spkr_id (int, optional) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
  • tgt_lang (str, optional) — The language id to use as target language for translation.

SeamlessM4THifiGan

class transformers.SeamlessM4THifiGan

< >

( config: SeamlessM4TConfig )

forward

< >

( input_embeds: FloatTensor ) torch.FloatTensor

Parameters

  • spectrogram (torch.FloatTensor) — Tensor containing the log-mel spectrograms. Can be batched and of shape (batch_size, sequence_length, model_in_dim), or un-batched and of shape (sequence_length, model_in_dim). Note that model_in_dim is the sum of config.unit_embed_dim, config.lang_embed_dim and config.spkr_embed_dim.

Returns

torch.FloatTensor

Tensor containing the speech waveform. If the input spectrogram is batched, will be of shape (batch_size, num_frames,). If un-batched, will be of shape (num_frames,).

Converts a log-mel spectrogram into a speech waveform. Passing a batch of log-mel spectrograms returns a batch of speech waveforms. Passing a single, un-batched log-mel spectrogram returns a single, un-batched speech waveform.

SeamlessM4TTextToUnitModel

class transformers.SeamlessM4TTextToUnitModel

< >

( config: SeamlessM4TConfig embed_tokens_decoder: Optional = None )

Parameters

  • config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • embed_tokens_decoder (nn.Embedding, optional) — input embedding of the decoder.

Transformer bare text-to-unit encoder-decoder. The encoder is a SeamlessM4TEncoder without embeddings and the decoder is a SeamlessM4TDecoder. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

SeamlessM4TTextToUnitForConditionalGeneration

class transformers.SeamlessM4TTextToUnitForConditionalGeneration

< >

( config: SeamlessM4TConfig embed_tokens_decoder: Optional = None )

Parameters

  • config (~SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • embed_tokens_decoder (nn.Embedding, optional) — input embedding of the decoder.

Transformer text-to-unit encoder-decoder with a language model head. The base encoder-decoder model is a SeamlessM4TTextToUnit. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< >

( input_ids: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are input IDs?

  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

    What are decoder input IDs?

    Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

    For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

    If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.

  • encoder_outputs (tuple(tuple(torch.FloatTensor), optional) — Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions) last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

    If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The SeamlessM4TTextToUnitForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.