Wav2Vec2Phoneme
Overview
The Wav2Vec2Phoneme model was proposed in Simple and Effective Zero-shot Cross-lingual Phoneme Recognition (Xu et al., 2021 by Qiantong Xu, Alexei Baevski, Michael Auli.
The abstract from the paper is the following:
Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
Tips:
- Wav2Vec2Phoneme uses the exact same architecture as Wav2Vec2
- Wav2Vec2Phoneme is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- Wav2Vec2Phoneme model was trained using connectionist temporal classification (CTC) so the model output has to be decoded using Wav2Vec2PhonemeCTCTokenizer.
- Wav2Vec2Phoneme can be fine-tuned on multiple language at once and decode unseen languages in a single forward pass to a sequence of phonemes
- By default the model outputs a sequence of phonemes. In order to transform the phonemes to a sequence of words one should make use of a dictionary and language model.
Relevant checkpoints can be found under https://huggingface.co/models?other=phoneme-recognition.
This model was contributed by patrickvonplaten
The original code can be found here.
Wav2Vec2Phoneme’s architecture is based on the Wav2Vec2 model, so one can refer to Wav2Vec2
’s documentation page except for the tokenizer.
Wav2Vec2PhonemeCTCTokenizer
class transformers.Wav2Vec2PhonemeCTCTokenizer
< source >( vocab_file bos_token = '<s>' eos_token = '</s>' unk_token = '<unk>' pad_token = '<pad>' phone_delimiter_token = ' ' word_delimiter_token = None do_phonemize = True phonemizer_lang = 'en-us' phonemizer_backend = 'espeak' **kwargs )
Parameters
-
vocab_file (
str
) — File containing the vocabulary. -
bos_token (
str
, optional, defaults to"<s>"
) — The beginning of sentence token. -
eos_token (
str
, optional, defaults to"</s>"
) — The end of sentence token. -
unk_token (
str
, optional, defaults to"<unk>"
) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. -
pad_token (
str
, optional, defaults to"<pad>"
) — The token used for padding, for example when batching sequences of different lengths. -
do_phonemize (
bool
, optional, defaults toTrue
) — Whether the tokenizer should phonetize the input or not. Only if a sequence of phonemes is passed to the tokenizer,do_phonemize
should be set toFalse
. -
phonemizer_lang (
str
, optional, defaults to"en-us"
) — The language of the phoneme set to which the tokenizer should phonetize the input text to. -
phonemizer_backend (
str
, optional. defaults to"espeak"
) — The backend phonetization library that shall be used by the phonemizer library. Defaults toespeak-ng
. See the phonemizer package. for more information.**kwargs — Additional keyword arguments passed along to PreTrainedTokenizer
Constructs a Wav2Vec2PhonemeCTC tokenizer.
This tokenizer inherits from PreTrainedTokenizer which contains some of the main methods. Users should refer to the superclass for more information regarding such methods.
__call__
< source >( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False max_length: typing.Optional[int] = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs ) → BatchEncoding
Parameters
-
text (
str
,List[str]
,List[List[str]]
, optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must setis_split_into_words=True
(to lift the ambiguity with a batch of sequences). -
text_pair (
str
,List[str]
,List[List[str]]
, optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must setis_split_into_words=True
(to lift the ambiguity with a batch of sequences). -
text_target (
str
,List[str]
,List[List[str]]
, optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must setis_split_into_words=True
(to lift the ambiguity with a batch of sequences). -
text_pair_target (
str
,List[str]
,List[List[str]]
, optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must setis_split_into_words=True
(to lift the ambiguity with a batch of sequences). -
add_special_tokens (
bool
, optional, defaults toTrue
) — Whether or not to encode the sequences with the special tokens relative to their model. -
padding (
bool
,str
or PaddingStrategy, optional, defaults toFalse
) — Activates and controls padding. Accepts the following values:True
or'longest'
: Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).'max_length'
: Pad to a maximum length specified with the argumentmax_length
or to the maximum acceptable input length for the model if that argument is not provided.False
or'do_not_pad'
(default): No padding (i.e., can output a batch with sequences of different lengths).
-
truncation (
bool
,str
or TruncationStrategy, optional, defaults toFalse
) — Activates and controls truncation. Accepts the following values:True
or'longest_first'
: Truncate to a maximum length specified with the argumentmax_length
or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.'only_first'
: Truncate to a maximum length specified with the argumentmax_length
or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.'only_second'
: Truncate to a maximum length specified with the argumentmax_length
or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.False
or'do_not_truncate'
(default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
-
max_length (
int
, optional) — Controls the maximum length to use by one of the truncation/padding parameters.If left unset or set to
None
, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated. -
stride (
int
, optional, defaults to 0) — If set to a number along withmax_length
, the overflowing tokens returned whenreturn_overflowing_tokens=True
will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens. -
is_split_into_words (
bool
, optional, defaults toFalse
) — Whether or not the input is already pre-tokenized (e.g., split into words). If set toTrue
, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification. -
pad_to_multiple_of (
int
, optional) — If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta). -
return_tensors (
str
or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:'tf'
: Return TensorFlowtf.constant
objects.'pt'
: Return PyTorchtorch.Tensor
objects.'np'
: Return Numpynp.ndarray
objects.
-
return_token_type_ids (
bool
, optional) — Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by thereturn_outputs
attribute. -
return_attention_mask (
bool
, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by thereturn_outputs
attribute. -
return_overflowing_tokens (
bool
, optional, defaults toFalse
) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided withtruncation_strategy = longest_first
orTrue
, an error is raised instead of returning overflowing tokens. -
return_special_tokens_mask (
bool
, optional, defaults toFalse
) — Whether or not to return special tokens mask information. -
return_offsets_mapping (
bool
, optional, defaults toFalse
) — Whether or not to return(char_start, char_end)
for each token.This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise
NotImplementedError
. -
return_length (
bool
, optional, defaults toFalse
) — Whether or not to return the lengths of the encoded inputs. -
verbose (
bool
, optional, defaults toTrue
) — Whether or not to print more information and warnings. **kwargs — passed to theself.tokenize()
method
Returns
A BatchEncoding with the following fields:
-
input_ids — List of token ids to be fed to a model.
-
token_type_ids — List of token type ids to be fed to a model (when
return_token_type_ids=True
or if “token_type_ids” is inself.model_input_names
). -
attention_mask — List of indices specifying which tokens should be attended to by the model (when
return_attention_mask=True
or if “attention_mask” is inself.model_input_names
). -
overflowing_tokens — List of overflowing tokens sequences (when a
max_length
is specified andreturn_overflowing_tokens=True
). -
num_truncated_tokens — Number of tokens truncated (when a
max_length
is specified andreturn_overflowing_tokens=True
). -
special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when
add_special_tokens=True
andreturn_special_tokens_mask=True
). -
length — The length of the inputs (when
return_length=True
)
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
batch_decode
< source >(
sequences: typing.Union[typing.List[int], typing.List[typing.List[int]], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')]
skip_special_tokens: bool = False
clean_up_tokenization_spaces: bool = True
output_char_offsets: bool = False
**kwargs
)
→
List[str]
or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput
Parameters
-
sequences (
Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]
) — List of tokenized input ids. Can be obtained using the__call__
method. -
skip_special_tokens (
bool
, optional, defaults toFalse
) — Whether or not to remove special tokens in the decoding. -
clean_up_tokenization_spaces (
bool
, optional, defaults toTrue
) — Whether or not to clean up the tokenization spaces. -
output_char_offsets (
bool
, optional, defaults toFalse
) — Whether or not to output character offsets. Character offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed characters.Please take a look at the Example of
~models.wav2vec2.tokenization_wav2vec2.decode
to better understand how to make use ofoutput_word_offsets
.~model.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode
works analogous with phonemes and batched output. - kwargs (additional keyword arguments, optional) — Will be passed to the underlying model specific decode method.
Returns
List[str]
or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput
The
decoded sentence. Will be a
~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput
when
output_char_offsets == True
.
Convert a list of lists of token ids into a list of strings by calling decode.
decode
< source >(
token_ids: typing.Union[int, typing.List[int], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')]
skip_special_tokens: bool = False
clean_up_tokenization_spaces: bool = True
output_char_offsets: bool = False
**kwargs
)
→
str
or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput
Parameters
-
token_ids (
Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]
) — List of tokenized input ids. Can be obtained using the__call__
method. -
skip_special_tokens (
bool
, optional, defaults toFalse
) — Whether or not to remove special tokens in the decoding. -
clean_up_tokenization_spaces (
bool
, optional, defaults toTrue
) — Whether or not to clean up the tokenization spaces. -
output_char_offsets (
bool
, optional, defaults toFalse
) — Whether or not to output character offsets. Character offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed characters.Please take a look at the Example of
~models.wav2vec2.tokenization_wav2vec2.decode
to better understand how to make use ofoutput_word_offsets
.~model.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode
works the same way with phonemes. - kwargs (additional keyword arguments, optional) — Will be passed to the underlying model specific decode method.
Returns
str
or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput
The decoded
sentence. Will be a ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput
when output_char_offsets == True
.
Converts a sequence of ids in a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.
Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))
.