How did you convert your HuBERTs to .pt formats?

#1
by NeuroDonu - opened

Hi. I was recently looking for a HuBERT that would be better than the basic one and came across many projects (like a HuBERT trained on the Dusha dataset, a Polish HuBERT, etc.), and recently saw yours. Can you share the code you used to save the fairseq model in .pt format?

UTTER - Unified Transcription and Translation for Extended Reality org

Hello,

Thanks for the interest. The technical report for mHuBERT-147 will be available soon, but in summary, it is a very compact and powerful model for downstream applications (new SOTA on the ML-SUPERB 1h leaderboard).
The script for converting fairseq to HF comes from the transformers library: https://github.com/huggingface/transformers/blob/main/src/transformers/models/hubert/convert_hubert_original_pytorch_checkpoint_to_pytorch.py
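For reference, a rough sketch of how the conversion function in that script can be called from Python - the paths are placeholders, and the function name and arguments are from memory, so double-check them against the script in your installed transformers version:

```python
# Sketch only: placeholder paths; verify the signature against the conversion
# script shipped with your transformers version.
from transformers.models.hubert.convert_hubert_original_pytorch_checkpoint_to_pytorch import (
    convert_hubert_checkpoint,
)

convert_hubert_checkpoint(
    checkpoint_path="checkpoint_best.pt",       # fairseq checkpoint (placeholder path)
    pytorch_dump_folder_path="mhubert-147-hf",   # output folder for the HF checkpoint (placeholder)
    config_path=None,                            # optional: a prepared HF config.json
    dict_path=None,                              # fairseq dictionary, only needed for fine-tuned models
    is_finetuned=False,                          # pre-trained model: no tokenizer/feature extractor saved
)
```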

Hey @mzboito ! Really strong results, and awesome to see transformers checkpoints already in the repo 🤗

In the conversion script, a tokenizer and a feature extractor should also have been saved (to the pytorch_dump_folder_path). The feature extractor is used to pre-process the audio inputs, and the tokenizer is used to map the argmax ids to character tokens. These objects are required to perform inference with the Transformers library, e.g. as per this code snippet. It could be the case that the flag is_finetuned was not passed to the script, which is required for the conversion of the tokenizer and feature extractor.
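Something along these lines - a minimal sketch with a placeholder repo id, assuming a fine-tuned CTC checkpoint whose feature extractor and tokenizer are on the Hub:

```python
import numpy as np
import torch
from transformers import HubertForCTC, Wav2Vec2Processor

# "your-org/your-finetuned-hubert" is a placeholder repo id for a fine-tuned CTC checkpoint
processor = Wav2Vec2Processor.from_pretrained("your-org/your-finetuned-hubert")
model = HubertForCTC.from_pretrained("your-org/your-finetuned-hubert")

# one second of dummy 16 kHz audio; in practice, load a real waveform (e.g. with torchaudio)
audio = np.random.randn(16_000).astype(np.float32)

# feature extractor: raw waveform -> model inputs
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# tokenizer: argmax ids -> character tokens -> transcription
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```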

If you have the script ready locally, it would be awesome to convert the tokenizer and feature extractor (e.g. by simply passing is_finetuned) and then push these to the repo! Once done, the Transformers integration should be complete, and we can add a code snippet to the model card that highlights Transformers usage (e.g. as with HuBERT large here).
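Pushing those two objects is just a couple of calls once they are saved locally - a sketch with a placeholder local path:

```python
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor

# "pytorch_dump_folder_path" is the output folder of the conversion script (placeholder)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("pytorch_dump_folder_path")
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("pytorch_dump_folder_path")

# push to the existing model repo on the Hub
feature_extractor.push_to_hub("utter-project/mHuBERT-147")
tokenizer.push_to_hub("utter-project/mHuBERT-147")
```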

Happy to double check that the Transformers integration works once these two objects are pushed!

UTTER - Unified Transcription and Translation for Extended Reality org

Hi Sanchit, thanks for the support!

I'm actually slightly confused. I indeed did not pass the flag is_finetuned to the conversion script, because the HuggingFace checkpoint I'm sharing is the foundation model itself, not a fine-tuned ASR model. Did I use the conversion script incorrectly?

Ah, that makes sense @mzboito - I was under the impression that this was the fine-tuned checkpoint. Would you consider sharing the fine-tuned CTC checkpoints too? I think they would be quite useful for the research community!

UTTER - Unified Transcription and Translation for Extended Reality org

I went back to my tweet and realized I called it a "multilingual model", which in retrospect is a bit confusing! Sorry! I added "pre-trained" to the model card to help clear up the confusion.
Regarding the CTC version that produced the nice results in the table... the problem is that I used the ML-SUPERB codebase for the benchmark, so it's not in HF format.
It would be great to have an implementation of ML-SUPERB on HF. 👀

Thanks for the clarifications @mzboito , makes sense! Regarding the ML-SUPERB codebase, was the one you used the ESPnet recipe? If so, the final format of your model is likely s3prl, for which we have a conversion script in Transformers. You should be able to pass this model id as the base model name (utter-project/mHuBERT-147) and the path to the s3prl checkpoint to update the projector and classifier weights. The script doesn't handle the tokenizer - do you know what format the tokenizer is in? Happy to provide some pointers to convert it based on this!
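As a rough sketch of what that conversion could look like - the script path, function name and argument names below are assumptions to verify against the transformers source, and the checkpoint/output paths are placeholders:

```python
# Assumed location and signature of the s3prl conversion script - verify against
# your transformers version before running. Paths are placeholders.
from transformers.models.hubert.convert_hubert_original_s3prl_checkpoint_to_pytorch import (
    convert_s3prl_checkpoint,
)

convert_s3prl_checkpoint(
    base_model_name="utter-project/mHuBERT-147",  # pre-trained base model on the Hub
    config_path=None,                             # optional: a prepared HF config.json
    checkpoint_path="s3prl_checkpoint.ckpt",      # s3prl checkpoint holding projector/classifier weights (placeholder)
    model_dump_path="mhubert-147-s3prl-hf",       # output folder for the converted model (placeholder)
)
```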

UTTER - Unified Transcription and Translation for Extended Reality org

Hi again Sanchit and Vaibhav.
Yes, indeed, everything was trained using their recipes.

Regarding the tokenizer, I don't know. I believe everything is char-based for the ML-SUPERB setup, but since I downloaded the data package already processed, I don't know which transformations were applied to the 143 languages. Moreover, nothing was saved in the experiments folder, where the checkpoint, decoding and evaluation outputs are available.
I believe the answer is probably somewhere in https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/asr.sh , but you will probably have an easier time just directly asking Jiatong Shi from ML-SUPERB about this.
By the way, this should not be a problem for the LID model, of course.

Meanwhile, I'll start the internal review process to release these checkpoints. 😅 Honestly, I did not expect people to be interested in this, since the overall performance of these ML-SUPERB downstream models is not as good as full fine-tuning.

Hey @mzboito ! Thanks again for the clarification! It makes sense regarding the ML-SUPERB checkpoint - if there are plans to release better models (i.e. fully fine-tuned variants), it's probably worth waiting for those and focusing efforts there!

mzboito changed discussion status to closed
