SEW-D

Overview

SEW-D (Squeezed and Efficient Wav2Vec with Disentangled attention) was proposed in Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

The abstract from the paper is the following:

This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.

Tips:

  • SEW-D is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
  • SEWDForCTC is fine-tuned using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2CTCTokenizer.
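
To make the CTC decoding step in the second tip concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is not the Wav2Vec2CTCTokenizer implementation; the toy vocabulary, the helper name ctc_greedy_decode, and the choice of ID 0 as the blank token and "|" as the word delimiter are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Turn per-frame argmax IDs into text: collapse repeats, then drop blanks."""
    # Step 1: merge consecutive duplicate IDs (one emission per run of frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token, which only separates repeated characters.
    chars = [id_to_char[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: map the word delimiter to a space.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (assumption, not the checkpoint's real vocabulary).
vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "l", 5: "o"}

# Per-frame predictions; the blank (0) between the two runs of 4 keeps both "l"s.
frames = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1, 0]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

The blank between the two runs of ID 4 is what allows a doubled letter to survive the collapse step; without it, the two runs would merge into a single character.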

This model was contributed by anton-l.

SEWDConfig

class transformers.SEWDConfig

(
    vocab_size = 32, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12,
    intermediate_size = 3072, squeeze_factor = 2, max_position_embeddings = 512, position_buckets = 256,
    share_att_key = True, relative_attention = True, position_biased_input = False,
    pos_att_type = ('p2c', 'c2p'), norm_rel_ebd = 'layer_norm', hidden_act = 'gelu_python',
    hidden_dropout = 0.1, activation_dropout = 0.1, attention_dropout = 0.1, feat_proj_dropout = 0.0,
    final_dropout = 0.1, layerdrop = 0.1, initializer_range = 0.02,
    layer_norm_eps = 1e-07, feature_layer_norm_eps = 1e-05,
    feat_extract_norm = 'group', feat_extract_activation = 'gelu',
    conv_dim = (64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512),
    conv_stride = (5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1),
    conv_kernel = (10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1),
    conv_bias = False, num_conv_pos_embeddings = 128, num_conv_pos_embedding_groups = 16,
    apply_spec_augment = True, mask_time_prob = 0.05, mask_time_length = 10, mask_time_min_masks = 2,
    mask_feature_prob = 0.0, mask_feature_length = 10, mask_feature_min_masks = 0,
    ctc_loss_reduction = 'mean', ctc_zero_infinity = False,
    use_weighted_layer_sum = False, classifier_proj_size = 256,
    pad_token_id = 0, bos_token_id = 1, eos_token_id = 2,
    **kwargs
)

This is the configuration class to store the configuration of a SEWDModel. It is used to instantiate a SEW-D model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SEW-D asapp/sew-d-tiny-100k architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import SEWDModel, SEWDConfig

>>> # Initializing a SEW-D asapp/sew-d-tiny-100k style configuration
>>> configuration = SEWDConfig()

>>> # Initializing a model from the asapp/sew-d-tiny-100k style configuration
>>> model = SEWDModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
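
As a rough illustration of what the default conv_kernel and conv_stride values imply, the sketch below estimates how many feature frames the convolutional feature extractor produces for a raw waveform of a given length. It assumes unpadded 1D convolutions, so each layer maps a length L to floor((L - kernel) / stride) + 1, and a 16 kHz input; the helper name conv_output_length is hypothetical and not part of the library.

```python
from math import prod

# Default values copied from the SEWDConfig signature above.
conv_kernel = (10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1)
conv_stride = (5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)


def conv_output_length(num_samples, kernels=conv_kernel, strides=conv_stride):
    """Estimate the frame count after the conv stack (assumes no padding)."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


# Overall downsampling factor of the feature extractor.
total_stride = prod(conv_stride)

# One second of audio, assuming a 16 kHz sampling rate.
frames_per_second = conv_output_length(16000)

print(total_stride, frames_per_second)
```

Under these assumptions the stack downsamples by a factor of 320, which is roughly 50 frames per second of 16 kHz audio. The squeeze_factor setting applies inside the encoder and is not modeled here.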

SEWDModel

class transformers.SEWDModel

( config: SEWDConfig )

The bare SEW-D Model transformer outputting raw hidden-states without any specific head on top. SEW-D was proposed in Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values, attention_mask = None, mask_time_indices = None, output_attentions = None, output_hidden_states = None, return_dict = None ) → BaseModelOutput or tuple(torch.FloatTensor)

The SEWDModel forward method overrides the __call__ special method.

Although the forward pass is defined in this method, call the Module instance rather than forward() directly: calling the instance runs the pre- and post-processing steps, whereas calling forward() silently skips them.

Example:

>>> from transformers import Wav2Vec2Processor, SEWDModel
>>> from datasets import load_dataset

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = Wav2Vec2Processor.from_pretrained('asapp/sew-d-tiny-100k')
>>> model = SEWDModel.from_pretrained('asapp/sew-d-tiny-100k')

>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

SEWDForCTC

class transformers.SEWDForCTC

( config )

SEW-D Model with a language modeling head on top for Connectionist Temporal Classification (CTC). SEW-D was proposed in Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values, attention_mask = None, output_attentions = None, output_hidden_states = None, return_dict = None, labels = None ) → CausalLMOutput or tuple(torch.FloatTensor)

The SEWDForCTC forward method overrides the __call__ special method.

Although the forward pass is defined in this method, call the Module instance rather than forward() directly: calling the instance runs the pre- and post-processing steps, whereas calling forward() silently skips them.

Example:

>>> from transformers import Wav2Vec2Processor, SEWDForCTC
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = Wav2Vec2Processor.from_pretrained('asapp/sew-d-tiny-100k')
>>> model = SEWDForCTC.from_pretrained('asapp/sew-d-tiny-100k')

>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)

>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)

>>> # compute loss
>>> with processor.as_target_processor():
...     inputs["labels"] = processor(dataset[0]["text"], return_tensors="pt").input_ids

>>> loss = model(**inputs).loss

SEWDForSequenceClassification

class transformers.SEWDForSequenceClassification

( config )

SEW-D Model with a sequence classification head on top (a linear layer over the pooled output) for tasks like SUPERB Keyword Spotting.

SEW-D was proposed in Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
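
The idea of "a linear layer over the pooled output" can be sketched without any framework. The example below assumes that pooling means averaging the per-frame hidden states over the non-padded frames; that choice, the helper names mean_pool and linear, and the toy numbers are illustrative assumptions, not the library's implementation. The classifier_proj_size parameter in the configuration suggests the real model also applies a projection before the classifier, which is omitted here.

```python
def mean_pool(hidden_states, mask=None):
    """Average frame vectors over time, skipping frames where mask is 0."""
    if mask is None:
        mask = [1] * len(hidden_states)
    kept = [frame for frame, keep in zip(hidden_states, mask) if keep]
    hidden_size = len(hidden_states[0])
    return [sum(frame[j] for frame in kept) / len(kept) for j in range(hidden_size)]


def linear(vector, weight, bias):
    """Apply y = W x + b, with one row of W per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


# Toy hidden states: 3 frames, hidden size 2; the last frame is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(hidden_states, mask=[1, 1, 0])

# Toy classifier with 3 classes (weights are made up for the example).
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, -1.0]
logits = linear(pooled, weight, bias)

# The predicted class is the index of the largest logit.
predicted_class = max(range(len(logits)), key=logits.__getitem__)
print(pooled, logits, predicted_class)
```

Pooling collapses the variable-length time axis into a single fixed-size vector, which is why one linear layer is enough to produce a score per class for the whole utterance.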

forward

( input_values, attention_mask = None, output_attentions = None, output_hidden_states = None, return_dict = None, labels = None ) → SequenceClassifierOutput or tuple(torch.FloatTensor)

The SEWDForSequenceClassification forward method overrides the __call__ special method.

Although the forward pass is defined in this method, call the Module instance rather than forward() directly: calling the instance runs the pre- and post-processing steps, whereas calling forward() silently skips them.

Example:

>>> from transformers import Wav2Vec2FeatureExtractor, SEWDForSequenceClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('asapp/sew-d-tiny-100k')
>>> model = SEWDForSequenceClassification.from_pretrained('asapp/sew-d-tiny-100k')

>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> logits = model(**inputs).logits
>>> predicted_class_ids = torch.argmax(logits, dim=-1)
>>> # id2label is keyed by int, so convert the single-element tensor first
>>> predicted_label = model.config.id2label[predicted_class_ids.item()]

>>> # compute loss - target_label is e.g. "down"
>>> target_label = model.config.id2label[0]
>>> inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
>>> loss = model(**inputs).loss