M-CTC-T
This model is in maintenance mode only, so we wonβt accept any new PRs changing its code.
If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0.
You can do so by running the following command: pip install -U transformers==4.30.0
.
Overview
The M-CTC-T model was proposed in Pseudo-Labeling For Massively Multilingual Speech Recognition by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. The model is a 1B-param transformer encoder, with a CTC head over 8065 character labels and a language identification head over 60 language ID labels. It is trained on Common Voice (version 6.1, December 2020 release) and VoxPopuli. After training on Common Voice and VoxPopuli, the model is trained on Common Voice only. The labels are unnormalized character-level transcripts (punctuation and capitalization are not removed). The model takes as input Mel filterbank features from a 16Khz audio signal.
The abstract from the paper is the following:
Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised learning on a target language, generate pseudo-labels for that language, and train a final model using pseudo-labels for all languages, either from scratch or by fine-tuning. Experiments on the labeled Common Voice and unlabeled VoxPopuli datasets show that our recipe can yield a model with better performance for many languages that also transfers well to LibriSpeech.
This model was contributed by cwkeam. The original code can be found here.
Documentation resources
Tips:
- The PyTorch version of this model is only available in torch 1.9 and higher.
MCTCTConfig
class transformers.MCTCTConfig
< source >( vocab_size = 8065 hidden_size = 1536 num_hidden_layers = 36 intermediate_size = 6144 num_attention_heads = 4 attention_head_dim = 384 max_position_embeddings = 920 layer_norm_eps = 1e-05 layerdrop = 0.3 hidden_act = 'relu' initializer_range = 0.02 hidden_dropout_prob = 0.3 attention_probs_dropout_prob = 0.3 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 conv_glu_dim = 1 conv_dropout = 0.3 num_conv_layers = 1 conv_kernel = (7,) conv_stride = (3,) input_feat_per_channel = 80 input_channels = 1 conv_channels = None ctc_loss_reduction = 'sum' ctc_zero_infinity = False **kwargs )
Parameters
- vocab_size (
int
, optional, defaults to 8065) — Vocabulary size of the M-CTC-T model. Defines the number of different tokens that can be represented by theinputs_ids
passed when calling MCTCTModel. - hidden_size (
int
, optional, defaults to 1536) — Dimension of the encoder layers and the pooler layer. - num_hidden_layers (
int
, optional, defaults to 36) — Number of hidden layers in the Transformer encoder. - intermediate_size (
int
, optional, defaults to 6144) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder. - num_attention_heads (
int
, optional, defaults to 4) — Number of attention heads for each attention layer in the Transformer encoder. - attention_head_dim (
int
, optional, defaults to 384) — Dimensions of each attention head for each attention layer in the Transformer encoder. - max_position_embeddings (
int
, optional, defaults to 920) — The maximum sequence length that this model might ever be used with (after log-mel spectrogram extraction). - layer_norm_eps (
float
, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers. - layerdrop (
float
, optional, defaults to 0.3) — The probability of dropping an encoder layer during training. The default 0.3 value is used in the original implementation. - hidden_act (
str
orfunction
, optional, defaults to"relu"
) — The non-linear activation function (function or string) in the encoder and pooler. If string,"gelu"
,"relu"
,"selu"
and"gelu_new"
are supported. - initializer_range (
float
, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - hidden_dropout_prob (
float
, optional, defaults to 0.3) — The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. - attention_probs_dropout_prob (
float
, optional, defaults to 0.3) — The dropout ratio for the attention probabilities. - pad_token_id (
int
, optional, defaults to 1) — The tokenizer index of the pad token. - bos_token_id (
int
, optional, defaults to 0) — The tokenizer index of the bos token. - eos_token_id (
int
, optional, defaults to 2) — The tokenizer index of the eos token. - conv_glu_dim (
int
, optional, defaults to 1) — The dimension of the output of theConv1dSubsampler
layer in which GLU is applied on. Though the original Flashlight code uses the value of 2, here it’s adapted to 1 due to transposition differences. - conv_dropout (
int
, optional, defaults to 0.3) — The probability of randomly dropping theConv1dSubsampler
layer during training. - num_conv_layers (
int
, optional, defaults to 1) — Number of convolution layers before applying transformer encoder layers. - conv_kernel (
Sequence[int]
, optional, defaults to(7,)
) — The kernel size of the 1D convolution applied before transformer layers.len(conv_kernel)
must be equal tonum_conv_layers
. - conv_stride (
Sequence[int]
, optional, defaults to(3,)
) — The stride length of the 1D convolution applied before transformer layers.len(conv_stride)
must be equal tonum_conv_layers
. - input_feat_per_channel (
int
, optional, defaults to 80) — Feature dimensions of the channels of the input to the Conv1D layer. - input_channels (
int
, optional, defaults to 1) — Number of input channels of the input to the Conv1D layer. - conv_channels (
List[int]
, optional) — Channel sizes of intermediate Conv1D layers. - ctc_loss_reduction (
str
, optional, defaults to"sum"
) — Specifies the reduction to apply to the output oftorch.nn.CTCLoss
. Only relevant when training an instance of MCTCTForCTC. - ctc_zero_infinity (
bool
, optional, defaults toFalse
) — Whether to zero infinite losses and the associated gradients oftorch.nn.CTCLoss
. Infinite losses mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance of MCTCTForCTC.
This is the configuration class to store the configuration of a MCTCTModel. It is used to instantiate an M-CTC-T model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the M-CTC-T speechbrain/m-ctc-t-large architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import MCTCTConfig, MCTCTModel
>>> # Initializing a M-CTC-T mctct-large style configuration
>>> configuration = MCTCTConfig()
>>> # Initializing a model (with random weights) from the mctct-large style configuration
>>> model = MCTCTModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
MCTCTFeatureExtractor
class transformers.MCTCTFeatureExtractor
< source >( feature_size = 80 sampling_rate = 16000 padding_value = 0.0 hop_length = 10 win_length = 25 win_function = 'hamming_window' frame_signal_scale = 32768.0 preemphasis_coeff = 0.97 mel_floor = 1.0 normalize_means = True normalize_vars = True return_attention_mask = False **kwargs )
Parameters
- feature_size (
int
, defaults to 80) — The feature dimension of the extracted features. This is the number of mel_frequency - sampling_rate (
int
, defaults to 16000) — The sampling rate at which the audio files should be digitalized expressed in hertz (Hz). - padding_value (
float
, defaults to 0.0) — The value that is used to fill the padding values. - hop_length (
int
, defaults to 10) — Number of audio samples between windows. Otherwise referred to as “shift” in many papers. - win_length (
int
, defaults to 25) — Number of ms per window - win_function (
str
, defaults to"hamming_window"
) — Name for the window function used for windowing, must be accessible viatorch.{win_function}
- frame_signal_scale (
float
, defaults to 32768.0) — Constant multiplied in creating the frames before applying DFT. - preemphasis_coeff (
float
, defaults to 0.97) — Constant multiplied in applying Pre-emphasis before DFT. - mel_floor (
float
defaults to 1.0) — Minimum value of mel frequency banks. - normalize_means (
bool
, optional, defaults toTrue
) — Whether or not to zero-mean normalize the extracted features. - normalize_vars (
bool
, optional, defaults toTrue
) — Whether or not to unit-variance normalize the extracted features.
Constructs a M-CTC-T feature extractor.
This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. This code has been adapted from Flashlightβs C++ code. For more information about the implementation, one can refer to this notebook that takes the user step-by-step in the implementation.
__call__
< source >( raw_speech: typing.Union[numpy.ndarray, typing.List[float], typing.List[numpy.ndarray], typing.List[typing.List[float]]] padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False max_length: typing.Optional[int] = None truncation: bool = False pad_to_multiple_of: typing.Optional[int] = None return_attention_mask: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None sampling_rate: typing.Optional[int] = None **kwargs )
Parameters
- raw_speech (
torch.Tensor
,np.ndarray
,List[float]
,List[torch.Tensor]
,List[np.ndarray]
,List[List[float]]
) — The sequence or batch of sequences to be padded. Each sequence can be a tensor, a numpy array, a list of float values, a list of tensors, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not stereo, i.e. single float per timestep. - padding (
bool
,str
or PaddingStrategy, optional, defaults toFalse
) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:True
or'longest'
: Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).'max_length'
: Pad to a maximum length specified with the argumentmax_length
or to the maximum acceptable input length for the model if that argument is not provided.False
or'do_not_pad'
(default): No padding (i.e., can output a batch with sequences of different lengths).
- max_length (
int
, optional) — Maximum length of the returned list and optionally padding length (see above). - truncation (
bool
) — Activates truncation to cut input sequences longer than max_length to max_length. - pad_to_multiple_of (
int
, optional) — If set will pad the sequence to a multiple of the provided value.This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5
(Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128. - return_attention_mask (
bool
, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature_extractor’s default. - return_tensors (
str
or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:'tf'
: Return TensorFlowtf.constant
objects.'pt'
: Return PyTorchtorch.Tensor
objects.'np'
: Return Numpynp.ndarray
objects.
- sampling_rate (
int
, optional) — The sampling rate at which theraw_speech
input was sampled. It is strongly recommended to passsampling_rate
at the forward call to prevent silent errors. - padding_value (
float
, defaults to 0.0) —
Main method to featurize and prepare for the model one or several sequence(s). sequences. It returns the log-mel spectrogram of the input audio, as implemented in the original Flashlight MFSC feature extraction code.
MCTCTProcessor
class transformers.MCTCTProcessor
< source >( feature_extractor tokenizer )
Parameters
- feature_extractor (
MCTCTFeatureExtractor
) — An instance of MCTCTFeatureExtractor. The feature extractor is a required input. - tokenizer (
AutoTokenizer
) — An instance of AutoTokenizer. The tokenizer is a required input.
Constructs a MCTCT processor which wraps a MCTCT feature extractor and a MCTCT tokenizer into a single processor.
MCTCTProcessor offers all the functionalities of MCTCTFeatureExtractor and AutoTokenizer. See the call() and decode() for more information.
When used in normal mode, this method forwards all its arguments to MCTCTFeatureExtractorβs
call() and returns its output. If used in the context
as_target_processor()
this method forwards all its arguments to AutoTokenizerβs
__call__()
. Please refer to the doctsring of the above two methods for more information.
from_pretrained
< source >( pretrained_model_name_or_path: typing.Union[str, os.PathLike] cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[bool, str, NoneType] = None revision: str = 'main' **kwargs )
Parameters
- pretrained_model_name_or_path (
str
oros.PathLike
) — This can be either:- a string, the model id of a pretrained feature_extractor hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like
bert-base-uncased
, or namespaced under a user or organization name, likedbmdz/bert-base-german-cased
. - a path to a directory containing a feature extractor file saved using the
save_pretrained() method, e.g.,
./my_model_directory/
. - a path or url to a saved feature extractor JSON file, e.g.,
./my_model_directory/preprocessor_config.json
. **kwargs — Additional keyword arguments passed along to both from_pretrained() and~tokenization_utils_base.PreTrainedTokenizer.from_pretrained
.
- a string, the model id of a pretrained feature_extractor hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like
Instantiate a processor associated with a pretrained model.
This class method is simply calling the feature extractor
from_pretrained(), image processor
ImageProcessingMixin and the tokenizer
~tokenization_utils_base.PreTrainedTokenizer.from_pretrained
methods. Please refer to the docstrings of the
methods above for more information.
save_pretrained
< source >( save_directory push_to_hub: bool = False **kwargs )
Parameters
- save_directory (
str
oros.PathLike
) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist). - push_to_hub (
bool
, optional, defaults toFalse
) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to withrepo_id
(will default to the name ofsave_directory
in your namespace). - kwargs (
Dict[str, Any]
, optional) — Additional key word arguments passed along to the push_to_hub() method.
Saves the attributes of this processor (feature extractor, tokenizerβ¦) in the specified directory so that it can be reloaded using the from_pretrained() method.
This class method is simply calling save_pretrained() and save_pretrained(). Please refer to the docstrings of the methods above for more information.
This method forwards all its arguments to AutoTokenizerβs batch_decode(). Please refer to the docstring of this method for more information.
This method forwards all its arguments to AutoTokenizerβs decode(). Please refer to the docstring of this method for more information.
MCTCTModel
class transformers.MCTCTModel
< source >( config )
Parameters
- config (MCTCTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare M-CTC-T Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_features: Tensor attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) β transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
- input_features (
torch.LongTensor
of shape(batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.Indices can be obtained using Wav2Vec2CTCTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
- head_mask (
torch.FloatTensor
of shape(num_heads,)
or(num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (MCTCTConfig) and inputs.
-
last_hidden_state (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
) β Sequence of hidden-states at the output of the last layer of the model. -
hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) β Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) β Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
.Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The MCTCTModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, MCTCTModel
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> processor = AutoProcessor.from_pretrained("speechbrain/m-ctc-t-large")
>>> model = MCTCTModel.from_pretrained("speechbrain/m-ctc-t-large")
>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
[1, 195, 1536]
>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids
>>> # compute loss
>>> loss = model(**inputs).loss
MCTCTForCTC
class transformers.MCTCTForCTC
< source >( config )
Parameters
- config (MCTCTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
MCTCT Model with a language modeling
head on top for Connectionist Temporal Classification (CTC).
This model is a PyTorch torch.nn.Module sub-class. Use
it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
behavior.
forward
< source >( input_features: Tensor attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None labels: typing.Optional[torch.LongTensor] = None ) β transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
Parameters
- input_features (
torch.LongTensor
of shape({0})
) — Indices of input sequence tokens in the vocabulary.Indices can be obtained using Wav2Vec2CTCTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- attention_mask (
torch.FloatTensor
of shape({0})
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
- head_mask (
torch.FloatTensor
of shape(num_heads,)
or(num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - labels (
torch.LongTensor
of shape(batch_size, target_length)
, optional) — Labels for connectionist temporal classification. Note thattarget_length
has to be smaller or equal to the sequence length of the output logits. Indices are selected in[-100, 0, ..., config.vocab_size - 1]
. All labels set to-100
are ignored (masked), the loss is only computed for labels in[0, ..., config.vocab_size - 1]
.
Returns
transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (MCTCTConfig) and inputs.
-
loss (
torch.FloatTensor
of shape(1,)
, optional, returned whenlabels
is provided) β Language modeling loss (for next-token prediction). -
logits (
torch.FloatTensor
of shape(batch_size, sequence_length, config.vocab_size)
) β Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). -
hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) β Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) β Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
.Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The MCTCTForCTC forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, MCTCTForCTC
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> processor = AutoProcessor.from_pretrained("speechbrain/m-ctc-t-large")
>>> model = MCTCTForCTC.from_pretrained("speechbrain/m-ctc-t-large")
>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
"Mr. Quilter is the apostle of the middle classes, and we're glad to welcome his gospel."
>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids
>>> # compute loss
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
1885.65