transformers documentation

UniSpeech

UniSpeech

Overview

The UniSpeech model was proposed in UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang .

The abstract from the paper is the following:

In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.

Tips:

  • UniSpeech is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. Please use Wav2Vec2Processor for the feature extraction.
  • UniSpeech model can be fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using Wav2Vec2CTCTokenizer.

This model was contributed by patrickvonplaten. The Authors’ code can be found here.

UniSpeechConfig

class transformers.UniSpeechConfig < > expand 

( vocab_size = 32 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout = 0.1 activation_dropout = 0.1 attention_dropout = 0.1 feat_proj_dropout = 0.0 feat_quantizer_dropout = 0.0 final_dropout = 0.1 layerdrop = 0.1 initializer_range = 0.02 layer_norm_eps = 1e-05 feat_extract_norm = 'group' feat_extract_activation = 'gelu' conv_dim = (512, 512, 512, 512, 512, 512, 512) conv_stride = (5, 2, 2, 2, 2, 2, 2) conv_kernel = (10, 3, 3, 3, 3, 2, 2) conv_bias = False num_conv_pos_embeddings = 128 num_conv_pos_embedding_groups = 16 do_stable_layer_norm = False apply_spec_augment = True mask_time_prob = 0.05 mask_time_length = 10 mask_time_min_masks = 2 mask_feature_prob = 0.0 mask_feature_length = 10 mask_feature_min_masks = 0 num_codevectors_per_group = 320 num_codevector_groups = 2 contrastive_logits_temperature = 0.1 num_negatives = 100 codevector_dim = 256 proj_codevector_dim = 256 diversity_loss_weight = 0.1 ctc_loss_reduction = 'mean' ctc_zero_infinity = False use_weighted_layer_sum = False classifier_proj_size = 256 num_ctc_classes = 80 pad_token_id = 0 bos_token_id = 1 eos_token_id = 2 replace_prob = 0.5 **kwargs )

This is the configuration class to store the configuration of a UniSpeechModel. It is used to instantiate an UniSpeech model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the UniSpeech facebook/unispeech-base-960h architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import UniSpeechModel, UniSpeechConfig

>>> # Initializing a UniSpeech facebook/unispeech-base-960h style configuration
>>> configuration = UniSpeechConfig()

>>> # Initializing a model from the facebook/unispeech-base-960h style configuration
>>> model = UniSpeechModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

UniSpeech specific outputs

class transformers.models.unispeech.modeling_unispeech.UniSpeechBaseModelOutput < > expand 

( last_hidden_state: FloatTensor = None extract_features: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Output type of UniSpeechBaseModelOutput, with potential hidden states and attentions.

class transformers.models.unispeech.modeling_unispeech.UniSpeechForPreTrainingOutput < > expand 

( loss: typing.Optional[torch.FloatTensor] = None projected_states: FloatTensor = None projected_quantized_states: FloatTensor = None codevector_perplexity: FloatTensor = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Output type of UniSpeechForPreTrainingOutput, with potential hidden states and attentions.

UniSpeechModel

class transformers.UniSpeechModel < > expand 

( config: UniSpeechConfig )

The bare UniSpeech Model transformer outputting raw hidden-states without any specific head on top. UniSpeech was proposed in UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < > expand 

( input_values attention_mask = None mask_time_indices = None output_attentions = None output_hidden_states = None return_dict = None ) UniSpeechBaseModelOutput or tuple(torch.FloatTensor)

The UniSpeechModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import Wav2Vec2Processor, UniSpeechModel
>>> from datasets import load_dataset

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-large-1500h-cv')
>>> model = UniSpeechModel.from_pretrained('microsoft/unispeech-large-1500h-cv')

>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

UniSpeechForCTC

class transformers.UniSpeechForCTC < > expand 

( config )

UniSpeech Model with a language modeling head on top for Connectionist Temporal Classification (CTC). UniSpeech was proposed in UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < > expand 

( input_values attention_mask = None output_attentions = None output_hidden_states = None return_dict = None labels = None ) CausalLMOutput or tuple(torch.FloatTensor)

The UniSpeechForCTC forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import Wav2Vec2Processor, UniSpeechForCTC
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-large-1500h-cv')
>>> model = UniSpeechForCTC.from_pretrained('microsoft/unispeech-large-1500h-cv')

>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)

>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)

>>> # compute loss
>>> with processor.as_target_processor():
...     inputs["labels"] = processor(dataset[0]["text"], return_tensors="pt").input_ids

>>> loss = model(**inputs).loss

UniSpeechForSequenceClassification

class transformers.UniSpeechForSequenceClassification < > expand 

( config )

UniSpeech Model with a sequence classification head on top (a linear layer over the pooled output) for tasks like SUPERB Keyword Spotting.

UniSpeech was proposed in UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < > expand 

( input_values attention_mask = None output_attentions = None output_hidden_states = None return_dict = None labels = None ) SequenceClassifierOutput or tuple(torch.FloatTensor)

The UniSpeechForSequenceClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import Wav2Vec2FeatureExtractor, UniSpeechForSequenceClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-large-1500h-cv')
>>> model = UniSpeechForSequenceClassification.from_pretrained('microsoft/unispeech-large-1500h-cv')

>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt")
>>> logits = model(**inputs).logits
>>> predicted_class_ids = torch.argmax(logits, dim=-1)
>>> predicted_label = model.config.id2label[predicted_class_ids]

>>> # compute loss - target_label is e.g. "down"
>>> target_label = model.config.id2label[0]
>>> inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
>>> loss = model(**inputs).loss

UniSpeechForPreTraining

class transformers.UniSpeechForPreTraining < > expand 

( config: UniSpeechConfig )

UniSpeech Model with a vector-quantization module and ctc loss for pre-training. UniSpeech was proposed in UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < > expand 

( input_values attention_mask = None output_attentions = None output_hidden_states = None return_dict = None ) UniSpeechForPreTrainingOutput or tuple(torch.FloatTensor)

The UniSpeechForPreTraining forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> import torch
>>> from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
>>> from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices
>>> from datasets import load_dataset
>>> import soundfile as sf

>>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("patrickvonplaten/wav2vec2-base")
>>> model = Wav2Vec2ForPreTraining.from_pretrained("patrickvonplaten/wav2vec2-base")


>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch


>>> ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)

>>> input_values = feature_extractor(ds["speech"][0], return_tensors="pt").input_values  # Batch size 1

>>> # compute masked indices
>>> batch_size, raw_sequence_length = input_values.shape
>>> sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length)
>>> mask_time_indices = _compute_mask_indices((batch_size, sequence_length), mask_prob=0.2, mask_length=2, device=model.device)

>>> with torch.no_grad():
...     outputs = model(input_values, mask_time_indices=mask_time_indices)

>>> # compute cosine similarity between predicted (=projected_states) and target (=projected_quantized_states)
>>> cosine_sim = torch.cosine_similarity(
...     outputs.projected_states, outputs.projected_quantized_states, dim=-1
... )

>>> # show that cosine similarity is much higher than random
>>> assert cosine_sim[mask_time_indices].mean() > 0.5

>>> # for contrastive loss training model should be put into train mode
>>> model.train()
>>> loss = model(input_values, mask_time_indices=mask_time_indices).loss