WER stays at 1 when fine-tuning #3

by abramovi - opened

Hey @patrickvonplaten, is it possible to use this model for fine-tuning?

I used your example - https://huggingface.co/blog/fine-tune-wav2vec2-english - but just changed the model from wav2vec2-base to wav2vec2-base-960h.

The WER stays at 1.

Should I use the model's vocab for data prep?

What am I doing wrong?

I believe @sanchit-gandhi has experience training such models

Hey @abramovi!

This model takes the pre-trained checkpoint facebook/wav2vec2-base and fine-tunes it on 960h of data from the LibriSpeech ASR corpus. It is generally not advised to use this checkpoint for fine-tuning: it has a tokenizer and classification layer that are purpose-built for the LibriSpeech corpus. If training on a different dataset, you need to build the tokenizer from scratch and fine-tune the model on your dataset of choice.

You should follow the steps listed in the blog for building the tokenizer - the vocabulary in this checkpoint is purpose-built for the LibriSpeech corpus, not TIMIT.
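As a rough sketch of that tokenizer-building step (following the blog; the "timit_asr" dataset and its "text" column are just the blog's example - swap in your own data and column name):

import json
from datasets import load_dataset
from transformers import Wav2Vec2CTCTokenizer

# the blog's dataset - replace with your own dataset / text column
timit = load_dataset("timit_asr", split="train")

# collect every character that appears in the transcriptions
def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    return {"vocab": [list(set(all_text))]}

vocab = timit.map(extract_all_chars, batched=True, batch_size=-1,
                  remove_columns=timit.column_names)
vocab_dict = {v: k for k, v in enumerate(sorted(set(vocab["vocab"][0])))}

# CTC-specific tokens, as in the blog
vocab_dict["|"] = vocab_dict[" "]   # use "|" as the word delimiter
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")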

Hey @sanchit-gandhi
Thank you so much for your answer - reading between the lines, I understand that for fine-tuning on my own data it is better to use one of the base models (facebook/wav2vec2-base, facebook/wav2vec2-large).

Am I right?

Hey @abramovi!

That's right! Either of the pre-trained checkpoints (facebook/wav2vec2-base or facebook/wav2vec2-large-lv60) would be suitable for fine-tuning. The choice comes down to your constraints (runtime, memory, etc.). Hope that answers your question!
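A minimal sketch of loading a pre-trained checkpoint with a fresh CTC head sized to your own tokenizer (mirroring the blog; the tokenizer variable is assumed to be the one you built for your dataset):

from transformers import Wav2Vec2ForCTC

# the CTC head is randomly initialised and sized to the new vocabulary
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,   # tokenizer built for your own dataset
    vocab_size=len(tokenizer),
)

# keep the convolutional feature extractor frozen during fine-tuning, as in the blog
model.freeze_feature_extractor()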

Vishfeb27
edited Aug 9

Hey @sanchit-gandhi,
I have a query on a similar topic. Say I have the wav2vec2-base-960h model and I want it to also work for other accents, such as Arabic or Indian accents. If I fine-tune the base model only on Indian/Arabic-accented data, it won't work for US-accented speech (the 960h of LibriSpeech), right? So to tackle this, the only option seems to be fine-tuning wav2vec2-base-960h, but the WER stays at 1. My question is: should we take the base model and train it on 960h + the Indian-accent dataset + the Arabic-accent dataset, or is there an option to resume training from the checkpoint with the new dataset?

Just to cross-verify, I took wav2vec2-base-960h and used its own tokenizer to fine-tune on the TIMIT dataset, but the WER still stays at 1. Did I miss anything?

Hey @Vishfeb27

To clarify, you wish to fine-tune a model that handles both LibriSpeech ASR data (US speakers reading audiobooks aloud) and speakers with an Arabic/Indian accent? Or just the latter?

The checkpoint facebook/wav2vec2-base-960h is fine-tuned on 960h of LibriSpeech ASR data. Consequently, it already has a tokenizer built for the LibriSpeech ASR corpus, and this tokenizer's vocabulary is made up of upper-case characters. The blog fine-tunes a system on TIMIT using lower-case characters only. Comparing the two vocabularies, they are indeed the same, with the exception that the wav2vec2 tokenizer uses upper-case characters and the TIMIT vocabulary lower-case characters. You will need to match one to the other before fine-tuning. Assuming you are using the official wav2vec2 tokenizer from this repo, you can either:

  1. Set tokenizer.do_lower_case = True, which converts all input strings to upper case prior to tokenization (see the sketch after this list)
  2. Normalise all the training data text to upper case:
import re

# punctuation to strip from the transcriptions before fine-tuning
chars_to_ignore_regex = r'[,?.!\-;:"]'

def remove_special_characters(batch):
    # remove punctuation and upper-case the text to match the tokenizer's vocabulary
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"]).upper()
    return batch

timit = timit.map(remove_special_characters)
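For option 1, the flag can be passed when loading the tokenizer - something like this should work (a sketch, assuming you load the tokenizer from this checkpoint):

from transformers import Wav2Vec2CTCTokenizer

# override the stored config: lower-case input is upper-cased before tokenization
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(
    "facebook/wav2vec2-base-960h",
    do_lower_case=True,
)

print(tokenizer.tokenize("hello world"))  # now maps onto the upper-case vocabulary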

Hey @sanchit-gandhi,

Thanks for the solution for fine-tuning on TIMIT.

I wish to fine-tune a model that handles both LibriSpeech ASR data (US speakers reading audiobooks aloud) and speakers with an Arabic/Indian accent. Say I have Indian/Arabic-accented speech data from Common Voice and need the model to work well for US accents as well as Arabic/Indian accents. Do you suggest fine-tuning on the Arabic/Indian-accented data starting from the checkpoint facebook/wav2vec2-base-960h, since the vocab is the same, or does it need a pre-training step?

Hey @Vishfeb27,

If you can match the vocabularies for both datasets, it might work. Note that with Common Voice the text is cased and has punctuation, whereas for LibriSpeech the text is all upper-case and has no punctuation. You'll need to define/adapt your tokenizer accordingly. Either you build a tokenizer on the combined vocabulary of both datasets and fine-tune from the pre-trained wav2vec2-base checkpoint, or you remove all punctuation from Common Voice, upper-case the text so it matches the LibriSpeech vocabulary, and fine-tune wav2vec2-base-960h.
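If you go with the second option, the normalisation might look something like this (a sketch - it assumes your Common Voice split is already loaded as common_voice with the transcriptions in a "sentence" column; extend the punctuation set as needed for your data):

import re

chars_to_remove_regex = r'[,?.!\-;:"]'   # punctuation to strip - extend for your data

def normalise(batch):
    # remove punctuation and upper-case the text to match the LibriSpeech-style vocabulary
    batch["sentence"] = re.sub(chars_to_remove_regex, "", batch["sentence"]).upper()
    return batch

common_voice = common_voice.map(normalise)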

@sanchit-gandhi, thank you so much for your response. This helps!