Hey @patrickvonplaten, is it possible to use this model for fine-tuning?
I followed your example - https://huggingface.co/blog/fine-tune-wav2vec2-english - but changed the model from wav2vec2-base to wav2vec2-base-960h.
The WER stays at 1.
Should I use the model's vocab for the data preparation?
What am I doing wrong?
This model takes the pre-trained checkpoint facebook/wav2vec2-base and fine-tunes it on 960h of data from the LibriSpeech ASR corpus. It is generally not advised to use this checkpoint for fine-tuning: it has a tokenizer and classification layer that are purpose-built for the LibriSpeech corpus. If training on a different dataset, you need to build the tokenizer from scratch and fine-tune the model on your dataset of choice.
You should follow the steps listed in the blog for building the tokenizer - the vocabulary in this checkpoint is purpose-built for the LibriSpeech corpus, not TIMIT.
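The vocabulary-building step from the blog can be sketched as follows (a minimal sketch; `transcriptions` is a placeholder for your own dataset's text column):

```python
import json

# Placeholder for your dataset's transcriptions - replace with your own text column.
transcriptions = ["she had your dark suit", "in greasy wash water all year"]

# Collect every unique character that occurs in the corpus.
vocab_chars = sorted(set("".join(transcriptions)))
vocab_dict = {c: i for i, c in enumerate(vocab_chars)}

# Replace the space character with the word delimiter token "|".
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

# Add unknown and CTC padding tokens at the end.
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)
```

The resulting `vocab.json` can then be loaded with `Wav2Vec2CTCTokenizer` as shown in the blog.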
I have a query on a similar aspect. Say I have the wav2vec2-base-960h model and I want it to work for a different accent, such as an Arabic/Indian accent. If I fine-tune the base model on Indian/Arabic voice data only, it won't work for the US accent (the 960h of LibriSpeech), right? So to tackle this, the only option is to fine-tune wav2vec2-base-960h, but the WER stays at 1. My question is: should we take the base model and train it on 960h + an Indian accent dataset + an Arabic accent dataset, or is there an option to resume training with a new dataset from the checkpoint?
Just to cross-verify, I took wav2vec2-base-960h and used its own tokenizer to fine-tune on the TIMIT dataset, but the WER is still staying at 1. Did I miss anything?
To clarify, you wish to fine-tune a model that handles both: LibriSpeech ASR data (US speakers reading audiobooks aloud) and speakers with an Arabic/Indian accent? Or just the latter?
The checkpoint facebook/wav2vec2-base-960h is fine-tuned on 960h of LibriSpeech ASR data. Consequently, it already has a tokenizer built for the LibriSpeech ASR corpus. This tokenizer is built on upper-case characters. The blog fine-tunes a system on TIMIT using lower-case characters only. Comparing the two vocabularies, they are indeed the same, with the exception that the wav2vec2 tokenizer uses upper-case characters and the TIMIT vocabulary lower-case characters. You will need to match one to the other to fine-tune the system. Assuming you are using the official wav2vec2 tokenizer from this repo, you can either:
- Set `tokenizer.do_lower_case = True` (despite the name, this converts all input strings to upper case prior to tokenization), or
- Normalise all the training data text to upper case:
```python
import re

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"]'

def remove_special_characters(batch):
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"]).upper()
    return batch

timit = timit.map(remove_special_characters)
```
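A quick way to diagnose this kind of casing mismatch is to check which characters in your training text are missing from the tokenizer's vocabulary (a sketch; `vocab` here is a stand-in for `tokenizer.get_vocab()` of the checkpoint):

```python
def missing_characters(texts, vocab):
    """Return characters in the training text that the vocabulary cannot
    represent - a common cause of WER stuck at 1.0 during fine-tuning."""
    # Spaces are represented as the word delimiter "|" in wav2vec2 vocabs.
    chars = set("".join(texts).replace(" ", "|"))
    return sorted(chars - set(vocab))

# Stand-in for the upper-case vocab shipped with facebook/wav2vec2-base-960h.
vocab = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | {"|", "'"}

print(missing_characters(["hello world"], vocab))          # every lower-case char is missing
print(missing_characters(["hello world".upper()], vocab))  # []
```

If this returns a non-empty list for your normalised training text, the tokenizer maps those characters to the unknown token and the model cannot learn them.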
Hey @sanchit-gandhi ,
Thanks for the solution for fine-tuning TIMIT.
I wish to fine-tune a model that handles both: LibriSpeech ASR data (US speakers reading audiobooks aloud) and speakers with an Arabic/Indian accent. Say I have the Indian/Arabic accent voice data from Common Voice and I need the model to work well for the US accent as well as the Arabic/Indian accent. Do you suggest fine-tuning on the Arabic/Indian accent data from the checkpoint facebook/wav2vec2-base-960h, since the vocab is the same, or does it need a pretraining step?
If you can match the vocabularies for both datasets it might work. Note that with Common Voice, the text is cased and has punctuation. For LibriSpeech, the text is all upper-cased and has no punctuation. You'll need to define/adapt your tokenizer accordingly. Either you build a tokenizer on the combined vocabulary of both datasets and fine-tune the wav2vec2-base model from scratch. Or, you remove all punctuation from Common Voice, upper-case the text to match the checkpoint's vocabulary, and fine-tune wav2vec2-base-960h.
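The second option can be sketched like this (assuming the Common Voice text lives in a `sentence` column, as it does in the HF Common Voice datasets; the regex mirrors the one from the blog, extended with Common Voice's typographic quotes):

```python
import re

# Strip punctuation and upper-case Common Voice text so it matches the
# LibriSpeech-style upper-case vocabulary of facebook/wav2vec2-base-960h.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”]'

def normalise(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

# common_voice = common_voice.map(normalise)
print(normalise({"sentence": "Hello, world!"}))  # {'sentence': 'HELLO WORLD'}
```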