WER stays at 1 when fine-tuning

#3
by abramovi - opened

Hey @patrickvonplaten, is it possible to use this model for fine-tuning?

I used your example - https://huggingface.co/blog/fine-tune-wav2vec2-english - and just changed the model from wav2vec2-base to wav2vec2-base-960h.

The WER stays at 1.

Should I use the model's vocab for data prep?

What am I doing wrong?

I believe @sanchit-gandhi has experience training such models

Hey @abramovi !

This model takes the pre-trained checkpoint facebook/wav2vec2-base and fine-tunes it on 960h of data from the LibriSpeech ASR corpus. It is generally not advised to use this checkpoint for fine-tuning: it has a tokenizer and classification layer that are purpose-built for the LibriSpeech corpus. If training on a different dataset, you need to build the tokenizer from scratch and fine-tune the model on your dataset of choice.

You should follow the steps listed in the blog for building the tokenizer - the vocabulary in this checkpoint is purpose-built for the LibriSpeech corpus, not TIMIT.
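
For reference, a rough sketch of that tokenizer-building step from the blog, applied to a generic dataset. `my_dataset` and its "text" column are placeholders for your own data, so adapt the names as needed:

import json
from transformers import Wav2Vec2CTCTokenizer

# collect every character that appears in the transcripts
def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    return {"vocab": [list(set(all_text))]}

vocabs = my_dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,
    keep_in_memory=True,
    remove_columns=my_dataset.column_names,
)

vocab_dict = {v: k for k, v in enumerate(sorted(set(vocabs["vocab"][0])))}

# use "|" as the word delimiter and add the CTC special tokens
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open("vocab.json", "w") as vocab_file:
    json.dump(vocab_dict, vocab_file)

tokenizer = Wav2Vec2CTCTokenizer(
    "./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)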

Hey @sanchit-gandhi
Thank you so much for your answer. Reading between the lines, I understand that for fine-tuning on my own data it is better to use one of the base models (facebook/wav2vec2-base, facebook/wav2vec2-large).

Am I right?

Hey @abramovi !

That's right! Either one of the pre-trained checkpoints (facebook/wav2vec2-base or facebook/wav2vec2-large-lv60) would be suitable for fine-tuning. The choice comes down to your constraints (runtime, memory, etc.). Hope that answers your question!

Hey @sanchit-gandhi ,
I have a query along similar lines. Say I have the wav2vec2-base-960h model and I want it to work for a different accent, such as an Arabic/Indian accent. If I fine-tune the base model on an Indian/Arabic voice dataset, then it won't work for the US accent (the 960h of LibriSpeech), right? To tackle that, the only option seems to be fine-tuning wav2vec2-base-960h, but the WER stays at 1. So my question is: should we take the base model and train it on 960h + an Indian accent dataset + an Arabic accent dataset, or is there an option to resume training on the new dataset from the checkpoint?

Just to cross-verify, I took wav2vec2-base-960h and used its own tokenizer to fine-tune on the TIMIT dataset, but the WER is still staying at 1. Did I miss anything?

Hey @Vishfeb27

To clarify, you wish to fine-tune a model that handles both: LibriSpeech ASR data (US speakers reading audiobooks aloud) and speakers with an Arabic/Indian accent? Or just the latter?

The checkpoint facebook/wav2vec2-base-960h is fine-tuned on 960h of LibriSpeech ASR data. Consequently, it already has a tokenizer built for the LibriSpeech ASR corpus. This tokenizer is built on upper-case characters. The blog fine-tunes a system on TIMIT using lower-case characters only. Comparing the two vocabularies, they are indeed the same, with the exception that the wav2vec2 tokenizer uses upper-case characters and the TIMIT vocabulary lower-case characters. You will need to match one to the other so that the casing of your training text is consistent with the tokenizer's vocabulary. Assuming you are using the official wav2vec2 tokenizer from this repo, you can either:

  1. Set tokenizer.do_lower_case = True (for this checkpoint's tokenizer, this converts all input strings to upper-case prior to tokenization, matching the vocabulary)
  2. Normalise all the training data text to upper case:
import re

# strip punctuation and upper-case the transcripts so they match the
# upper-case vocabulary of the pre-trained tokenizer
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def remove_special_characters(batch):
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"]).upper()
    return batch

timit = timit.map(remove_special_characters)
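
And for option 1, roughly the following; the expected behaviour here is the quirk mentioned above, where do_lower_case upper-cases the input to match this checkpoint's vocabulary:

from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(
    "facebook/wav2vec2-base-960h", do_lower_case=True
)

# lower-case input is mapped onto the upper-case vocabulary before tokenization
print(tokenizer.tokenize("hello world"))
# expected: ['H', 'E', 'L', 'L', 'O', '|', 'W', 'O', 'R', 'L', 'D']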

Hey @sanchit-gandhi ,

Thanks for the solution for fine-tuning on TIMIT.

I wish to fine-tune a model that handles both: LibriSpeech ASR data (US speakers reading audiobooks aloud) and speakers with an Arabic/Indian accent. Say I have Indian/Arabic-accented voice data from Common Voice and I need the model to work well for the US accent as well as the Arabic/Indian accent. Do you suggest fine-tuning on the Arabic/Indian accent data from the checkpoint facebook/wav2vec2-base-960h, since the vocab is the same, or does it need a pre-training step?

Hey @Vishfeb27 ,

If you can match the vocabularies for both datasets it might work. Note that with Common Voice the text is cased and has punctuation, whereas for LibriSpeech the text is all upper-cased and has no punctuation. You'll need to define/adapt your tokenizer accordingly. Either you build a tokenizer on the combined vocabulary of both datasets and fine-tune the wav2vec2-base model from scratch, or you remove all punctuation from Common Voice, upper-case the text to match the pre-trained vocabulary, and fine-tune wav2vec2-base-960h.
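
A rough sketch of that second route, assuming the standard Common Voice "sentence" column; the dataset/config names and the regex are placeholders to adapt to your copy of the data:

import re
from datasets import load_dataset

common_voice = load_dataset("common_voice", "en", split="train")

# remove punctuation and upper-case the transcripts so they share the
# LibriSpeech-style vocabulary of facebook/wav2vec2-base-960h
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”]'

def normalise(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

common_voice = common_voice.map(normalise)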

@sanchit-gandhi , Thank you so much for your response. This helps!

Hi @sanchit-gandhi ,

I used facebook/wav2vec2-base-960h and tested its performance on a custom voice recording. It worked decently (let's call this test 1). To improve the performance, I fine-tuned facebook/wav2vec2-base-960h on the same recording I tested on during test 1. Also, to make sure everything was working fine, I used the same data for training and testing. I expected the performance to improve since I trained on custom data. However, it deteriorated: it's not even as good as it was during test 1. In fact, it is now not predicting even one word correctly. I am not able to figure out what went wrong. Can you help me with this?

Hey @Rishilulla ! Do you have any additional details about the custom dataset, such as language, accent, and text formatting (upper or lower-case, with or without punctuation)? Given the model worked reasonably well in the zero-shot setting, I'm assuming you didn't create a new tokenizer, but rather leveraged the pre-trained one directly?

Hi @sanchit-gandhi !
I am using self-recorded English data just to test things out. After reading the comments here, I changed the text to upper case, and since I am following the https://huggingface.co/blog/fine-tune-wav2vec2-english article, I created a new tokenizer.
I realised I had not changed the number of epochs, which is why the performance was not great. I have increased it now and it's a work in progress.
Also, I wanted to understand some basics about fine-tuning. If I am fine-tuning 'facebook/wav2vec2-base-960h' and I think the only thing missing is domain-specific words, all I need to do is fine-tune the pre-trained model on data containing those domain-specific words, correct? After that, the domain-specific words would be handled by the custom data and everything else by 'facebook/wav2vec2-base-960h'. Is my understanding correct?

Hi @sanchit-gandhi ,
The training has completed and I have attached the results. They look promising, but when I evaluate on the test data I get a WER of 100%. Not sure where I am going wrong.

training_result.png

Hey @Rishilulla ! Thanks for the additional details and sorry about the late reply here!

Just a note on the tokenizer: if you want to directly leverage the pre-trained tokenizer, you should ensure that all of your training data is i) English, ii) upper-cased and iii) without punctuation. If you create a new tokenizer according to the aforementioned blog post, the format of your training data is unconstrained: you will build a new tokenizer from your training data, so the tokenizer will be matched to the format of your dataset implicitly. Therefore, your training data can take any form and your tokenizer will be built accordingly. However, it is recommended to single-case the data and optionally remove punctuation to improve the speech recognition performance of the Wav2Vec2 CTC model (this simplifies the task of speech recognition, thus improving downstream performance).
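
If you do go the pre-trained-tokenizer route, one quick sanity check (my own suggestion, with `my_dataset` as a placeholder for your data) is to confirm that every character in your transcripts is covered by the checkpoint's vocabulary:

from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-base-960h")
vocab_chars = set(tokenizer.get_vocab().keys())

# spaces are represented by the "|" word-delimiter token in this vocab
train_chars = set("".join(my_dataset["text"]).replace(" ", "|"))

print("characters missing from the pre-trained vocab:", train_chars - vocab_chars)
# anything printed here (e.g. lower-case letters or punctuation) will map to the unknown token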

For the best performance with fine-tuning, you should ensure that your training dataset is in domain with your test set. If there is domain specific vocabulary in your test set that you want your model to be able to handle, you should ensure that the training dataset is from the same distribution of data. The pre-trained facebook/wav2vec2-base-960h model is trained on the LibriSpeech 960h dataset, a corpus of narrated audio books. Thus, it performs well on data drawn from this distribution, but not necessarily on data from other distributions. This is why we need additional fine-tuning data to boost the model's performance on our specific domain.

It looks like you got some quite nice results from fine-tuning! What hyperparameters were you using? You might benefit from increasing the amount of regularisation if your dataset is small (cf. https://github.com/huggingface/transformers/blob/b210c83a78022226ce48402cd67d8c8da7afbd8d/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L556).
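
For reference, these are roughly the regularisation knobs that script exposes on the Wav2Vec2 config; the values below are illustrative placeholders rather than recommendations, and `processor` is assumed to be your Wav2Vec2Processor:

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.1,
    layerdrop=0.1,
    mask_time_prob=0.05,  # SpecAugment-style time masking
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)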

Do you have a reproducible code-snippet for your evaluation on test data? If you're able to share this I can take a look as to why the model is performing poorly! 🤗
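
Something along the lines of the blog's evaluation loop would be ideal, e.g. the following, with `model`, `processor` and `test_dataset` standing in for your own objects:

import torch
from datasets import load_metric

wer_metric = load_metric("wer")

def map_to_result(batch):
    with torch.no_grad():
        input_values = torch.tensor(batch["input_values"]).unsqueeze(0)
        logits = model(input_values).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_str"] = processor.batch_decode(pred_ids)[0]
    batch["text"] = processor.decode(batch["labels"], group_tokens=False)
    return batch

results = test_dataset.map(map_to_result, remove_columns=test_dataset.column_names)
print("Test WER: {:.3f}".format(
    wer_metric.compute(predictions=results["pred_str"], references=results["text"])
))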
