UniSpeech-SAT

Overview

The UniSpeech-SAT model was proposed in UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

The abstract from the paper is the following:

Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years witness great successes in applying self-supervised learning in speech recognition, while limited exploration was attempted in applying SSL for modeling speaker characteristics. In this paper, we aim to improve the existing SSL framework for speaker representation learning. Two methods are introduced for enhancing the unsupervised speaker information extraction. First, we apply the multi-task learning to the current SSL framework, where we integrate the utterance-wise contrastive loss with the SSL objective function. Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, where additional overlapped utterances are created unsupervisely and incorporate during training. We integrate the proposed methods into the HuBERT framework. Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance in universal representation learning, especially for speaker identification oriented tasks. An ablation study is performed verifying the efficacy of each proposed method. Finally, we scale up training dataset to 94 thousand hours public audio data and achieve further performance improvement in all SUPERB tasks.

Tips:

  • UniSpeechSat is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. Please use Wav2Vec2Processor for the feature extraction.
  • The UniSpeechSat model can be fine-tuned using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2CTCTokenizer.
  • UniSpeechSat performs especially well on speaker verification, speaker identification, and speaker diarization tasks.
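
As a rough illustration of the "float array corresponding to the raw waveform" mentioned in the tips, the sketch below converts 16-bit PCM samples to floats and applies zero-mean, unit-variance normalization. It assumes this is the kind of preprocessing the feature extractor performs; the helper names `pcm16_to_float` and `normalize`, the toy sine signal, and the 16 kHz sampling rate are all hypothetical and not part of the library.

```python
import math


def pcm16_to_float(samples):
    """Scale 16-bit PCM integers to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def normalize(waveform, eps=1e-7):
    """Zero-mean, unit-variance normalization of a 1-D waveform."""
    mean = sum(waveform) / len(waveform)
    var = sum((x - mean) ** 2 for x in waveform) / len(waveform)
    return [(x - mean) / math.sqrt(var + eps) for x in waveform]


# Hypothetical toy signal: 0.1 s of a 440 Hz sine sampled at 16 kHz, stored as int16.
sampling_rate = 16000
pcm = [int(10000 * math.sin(2 * math.pi * 440 * n / sampling_rate)) for n in range(1600)]

# This float list is the kind of input the model expects (before batching into tensors).
waveform = normalize(pcm16_to_float(pcm))
```

In practice the processor handles this step; the sketch only shows what "raw waveform as floats" means.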

This model was contributed by patrickvonplaten. The authors' code can be found here.

UniSpeechSatConfig

class transformers.UniSpeechSatConfig

( vocab_size = 32 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout = 0.1 activation_dropout = 0.1 attention_dropout = 0.1 feat_proj_dropout = 0.0 feat_quantizer_dropout = 0.0 final_dropout = 0.1 layerdrop = 0.1 initializer_range = 0.02 layer_norm_eps = 1e-05 feat_extract_norm = 'group' feat_extract_activation = 'gelu' conv_dim = (512, 512, 512, 512, 512, 512, 512) conv_stride = (5, 2, 2, 2, 2, 2, 2) conv_kernel = (10, 3, 3, 3, 3, 2, 2) conv_bias = False num_conv_pos_embeddings = 128 num_conv_pos_embedding_groups = 16 do_stable_layer_norm = False apply_spec_augment = True mask_time_prob = 0.05 mask_time_length = 10 mask_time_min_masks = 2 mask_feature_prob = 0.0 mask_feature_length = 10 mask_feature_min_masks = 0 num_codevectors_per_group = 320 num_codevector_groups = 2 contrastive_logits_temperature = 0.1 num_negatives = 100 codevector_dim = 256 proj_codevector_dim = 256 diversity_loss_weight = 0.1 ctc_loss_reduction = 'mean' ctc_zero_infinity = False use_weighted_layer_sum = False classifier_proj_size = 256 pad_token_id = 0 bos_token_id = 1 eos_token_id = 2 num_clusters = 504 **kwargs )

This is the configuration class to store the configuration of a UniSpeechSatModel. It is used to instantiate a UniSpeechSat model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the UniSpeechSat facebook/unispeech_sat-base-960h architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import UniSpeechSatModel, UniSpeechSatConfig

>>> # Initializing a UniSpeechSat facebook/unispeech_sat-base-960h style configuration
>>> configuration = UniSpeechSatConfig()

>>> # Initializing a model from the facebook/unispeech_sat-base-960h style configuration
>>> model = UniSpeechSatModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
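
To make the role of `conv_kernel` and `conv_stride` more concrete, the following sketch estimates how many feature frames the convolutional feature encoder produces for a given number of audio samples. It assumes each layer is an unpadded 1-D convolution, so the standard output-length formula applies; the function name `feat_extract_output_length` and the 16 kHz one-second input are illustrative assumptions.

```python
def feat_extract_output_length(num_samples, conv_kernel, conv_stride):
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(conv_kernel, conv_stride):
        length = (length - kernel) // stride + 1
    return length


# Default values taken from the UniSpeechSatConfig signature above.
conv_kernel = (10, 3, 3, 3, 3, 2, 2)
conv_stride = (5, 2, 2, 2, 2, 2, 2)

# One second of audio at an assumed 16 kHz sampling rate.
frames = feat_extract_output_length(16000, conv_kernel, conv_stride)
print(frames)  # 49
```

With the default strides the encoder downsamples by a factor of 320, which is roughly one frame every 20 ms at 16 kHz.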

UniSpeechSat specific outputs

class transformers.models.unispeech_sat.modeling_unispeech_sat.UniSpeechSatBaseModelOutput

( last_hidden_state: FloatTensor = None extract_features: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Output type of UniSpeechSatModel, with optional hidden states and attentions.

class transformers.models.unispeech_sat.modeling_unispeech_sat.UniSpeechSatForPreTrainingOutput

( loss: typing.Optional[torch.FloatTensor] = None logits: FloatTensor = None projected_states: FloatTensor = None projected_quantized_states: FloatTensor = None codevector_perplexity: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Output type of UniSpeechSatForPreTraining, with optional hidden states and attentions.

UniSpeechSatModel

class transformers.UniSpeechSatModel

( config: UniSpeechSatConfig )

The bare UniSpeechSat Model transformer outputting raw hidden-states without any specific head on top. UniSpeechSat was proposed in UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values attention_mask = None mask_time_indices = None output_attentions = None output_hidden_states = None return_dict = None ) UniSpeechSatBaseModelOutput or tuple(torch.FloatTensor)

The UniSpeechSatModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance instead of calling forward directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import Wav2Vec2Processor, UniSpeechSatModel
>>> from datasets import load_dataset

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-sat-base-plus')
>>> model = UniSpeechSatModel.from_pretrained('microsoft/unispeech-sat-base-plus')

>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
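
The `mask_time_indices` argument and the `mask_time_prob`, `mask_time_length`, and `mask_time_min_masks` config fields relate to masking spans of feature frames along the time axis. The sketch below is a simplified, pure-Python illustration of span masking under those three parameters; it is not the library's implementation, and the function name `sample_mask_indices` and the `seed` parameter are hypothetical.

```python
import random


def sample_mask_indices(sequence_length, mask_prob, mask_length, min_masks=0, seed=0):
    """Return a boolean list marking masked frames (simplified span masking)."""
    rng = random.Random(seed)
    # Expected number of spans, with a random offset acting as probabilistic rounding.
    num_spans = int(mask_prob * sequence_length / mask_length + rng.random())
    num_spans = max(num_spans, min_masks)
    # Never request more spans than fit into the sequence.
    if num_spans * mask_length > sequence_length:
        num_spans = sequence_length // mask_length
    # Draw distinct start positions; spans may still overlap each other.
    starts = rng.sample(range(sequence_length - mask_length + 1), num_spans)
    mask = [False] * sequence_length
    for start in starts:
        for offset in range(mask_length):
            mask[start + offset] = True
    return mask


# Config defaults from above, applied to 49 frames (about one second of 16 kHz audio).
mask = sample_mask_indices(49, mask_prob=0.05, mask_length=10, min_masks=2)
print(sum(mask), "of", len(mask), "frames masked")
```

For short inputs the expected span count rounds down to zero or one, so `min_masks` determines how many spans are drawn.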

UniSpeechSatForCTC

class transformers.UniSpeechSatForCTC

( config )

UniSpeechSat Model with a language modeling head on top for Connectionist Temporal Classification (CTC). UniSpeechSat was proposed in UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values attention_mask = None output_attentions = None output_hidden_states = None return_dict = None labels = None ) CausalLMOutput or tuple(torch.FloatTensor)

The UniSpeechSatForCTC forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance instead of calling forward directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import Wav2Vec2Processor, UniSpeechSatForCTC
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-sat-base-plus')
>>> model = UniSpeechSatForCTC.from_pretrained('microsoft/unispeech-sat-base-plus')

>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)

>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)

>>> # compute loss
>>> with processor.as_target_processor():
...     inputs["labels"] = processor(dataset[0]["text"], return_tensors="pt").input_ids

>>> loss = model(**inputs).loss
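
The `batch_decode` step above turns per-frame argmax ids into text. A minimal sketch of greedy CTC decoding is shown below: consecutive repeated ids are collapsed, then blank tokens are dropped. The toy vocabulary and the function name `ctc_greedy_decode` are hypothetical, and using id 0 as the blank is an assumption based on `pad_token_id = 0` in the config; the real tokenizer also handles details such as word delimiters that are omitted here.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated ids, then remove blanks and join the remaining tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical character vocabulary; id 0 plays the role of the CTC blank.
id_to_token = {0: "<pad>", 1: "h", 2: "e", 3: "l", 4: "o"}

# Per-frame argmax ids; the blank between the two runs of 3 keeps both "l" characters.
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
print(ctc_greedy_decode(frame_ids, id_to_token))  # hello
```

The blank token is what allows CTC to emit the same character twice in a row.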

UniSpeechSatForSequenceClassification

class transformers.UniSpeechSatForSequenceClassification

( config )

UniSpeechSat Model with a sequence classification head on top (a linear layer over the pooled output) for tasks like SUPERB Keyword Spotting.

UniSpeechSat was proposed in UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values attention_mask = None output_attentions = None output_hidden_states = None return_dict = None labels = None ) SequenceClassifierOutput or tuple(torch.FloatTensor)

The UniSpeechSatForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance instead of calling forward directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForSequenceClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-plus')
>>> model = UniSpeechSatForSequenceClassification.from_pretrained('microsoft/unispeech-sat-base-plus')

>>> # audio file is decoded on the fly
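>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> logits = model(**inputs).logits
>>> # .item() converts the one-element tensor to a plain int so it can be used as a dict key
>>> predicted_class_ids = torch.argmax(logits, dim=-1).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]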

>>> # compute loss - target_label is e.g. "down"
>>> target_label = model.config.id2label[0]
>>> inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
>>> loss = model(**inputs).loss
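
The phrase "a linear layer over the pooled output" can be read as: average the frame-level hidden states over time, then map the pooled vector to class logits. The sketch below illustrates that reading with plain Python lists. The weights, the tiny hidden size, and the label names are made up for illustration, and the actual head may include further layers not shown here.

```python
def mean_pool(hidden_states):
    """Average a list of frame vectors over the time axis."""
    num_frames = len(hidden_states)
    return [sum(column) / num_frames for column in zip(*hidden_states)]


def linear(vector, weight, bias):
    """Compute weight @ vector + bias for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


# Hypothetical values: 3 frames with hidden size 2, and a 3-class head.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.5, 0.0]
id2label = {0: "yes", 1: "no", 2: "down"}

pooled = mean_pool(hidden_states)      # [2.0, 2.0]
logits = linear(pooled, weight, bias)  # [2.0, 2.5, -4.0]

# The argmax must be a plain int to serve as a key into id2label.
predicted_class_id = max(range(len(logits)), key=logits.__getitem__)
print(id2label[predicted_class_id])  # no
```

Pooling over time is what lets a variable-length utterance map to a single label.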

UniSpeechSatForPreTraining

class transformers.UniSpeechSatForPreTraining

( config: UniSpeechSatConfig )

UniSpeechSat Model with a quantizer and VQ head on top. UniSpeechSat was proposed in UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values attention_mask = None output_attentions = None output_hidden_states = None return_dict = None ) UniSpeechSatForPreTrainingOutput or tuple(torch.FloatTensor)

The UniSpeechSatForPreTraining forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance instead of calling forward directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> import torch
>>> from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForPreTraining
>>> from transformers.models.unispeech_sat.modeling_unispeech_sat import _compute_mask_indices
>>> from datasets import load_dataset
>>> import soundfile as sf

>>> # UniSpeechSat reuses the Wav2Vec2 feature extractor (see the tips above)
>>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("patrickvonplaten/unispeech_sat-base")
>>> model = UniSpeechSatForPreTraining.from_pretrained("patrickvonplaten/unispeech_sat-base")


>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch


>>> ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)

>>> input_values = feature_extractor(ds["speech"][0], return_tensors="pt").input_values  # Batch size 1

>>> # compute masked indices
>>> batch_size, raw_sequence_length = input_values.shape
>>> sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length)
>>> mask_time_indices = _compute_mask_indices((batch_size, sequence_length), mask_prob=0.2, mask_length=2, device=model.device)

>>> with torch.no_grad():
...     outputs = model(input_values, mask_time_indices=mask_time_indices)

>>> # compute cosine similarity between predicted (=projected_states) and target (=projected_quantized_states)
>>> cosine_sim = torch.cosine_similarity(
...     outputs.projected_states, outputs.projected_quantized_states, dim=-1
... )

>>> # show that cosine similarity is much higher than random
>>> assert cosine_sim[mask_time_indices].mean() > 0.5

>>> # for contrastive loss training model should be put into train mode
>>> model.train()
>>> loss = model(input_values, mask_time_indices=mask_time_indices).loss
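
The cosine-similarity check above compares predicted and target vectors at masked positions. As a hedged illustration of how such similarities can feed a contrastive objective, the sketch below computes a temperature-scaled contrastive loss in the style of wav2vec 2.0, using `contrastive_logits_temperature = 0.1` from the config. Whether UniSpeechSatForPreTraining computes its loss exactly this way is an assumption; the vectors and function names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(predicted, target, negatives, temperature=0.1):
    """Cross-entropy of picking the true target among target + negatives."""
    logits = [cosine_similarity(predicted, target) / temperature]
    logits += [cosine_similarity(predicted, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# Hypothetical 2-D vectors: the target is close to the prediction, the negatives are not.
predicted = [1.0, 0.0]
target = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss = contrastive_loss(predicted, target, negatives)
print(f"similarity={cosine_similarity(predicted, target):.3f}, loss={loss:.6f}")
```

A low temperature sharpens the distribution, so a prediction that is close to its target and far from the negatives yields a loss near zero.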