Can this be used with WhisperX?

by Dgoryeo - opened Jan 4, 2023

Jan 4, 2023

Hi,

WhisperX uses forced alignment with wav2vec2.0 to produce more accurate timestamps for Whisper transcription outputs. Currently Japanese is not supported by WhisperX. Can this model be used with WhisperX to add the language?

https://github.com/m-bain/whisperX

Thanks

jonatasgrosman

Owner Jan 4, 2023

Hi @Dgoryeo , I think Japanese is already supported by whisperx and it uses this exact model for that: https://github.com/m-bain/whisperX/blob/main/whisperx/transcribe.py#L31

Dgoryeo

Jan 5, 2023

Thanks for the quick reply. Much appreciated! I'll check it out right away.

Dgoryeo

Jan 5, 2023

Hi @jonatasgrosman , I noticed that you have fine-tuned the Whisper large model for a few languages. Do you have any plans to tune it for Japanese too? Thanks.

jonatasgrosman

Owner Jan 6, 2023

Hi @Dgoryeo , I plan to do that for other languages in the future too, but for now, I'm out of resources. It's 'cause these large Whisper models are costly to train. I only managed to train some large Whisper models thanks to @sanchit-gandhi , who gave me access to an A100 for a few days.

Dgoryeo

Jan 6, 2023

Can I give you access? I just signed up for free credit from GCP, but I think I have been given a T4-level quota. I can double-check, though.

sanchit-gandhi

Jan 13, 2023

Hey @Dgoryeo ! You can check out the leaderboard from the Whisper fine-tuning event to see the most performant fine-tuned models in Japanese: https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=ja&split=test

There are a couple of strong large-v2 checkpoints there that might suit your needs!

riken12

Oct 9, 2024

This model is downloaded when I use whisperX, and I can see it at "C:\Users\username\.cache\huggingface\hub\models--jonatasgrosman--wav2vec2-large-xlsr-53-japanese", but it's not getting used by whisperX. When I set the language to "en" it works and the segments contain whole words, but when I set the language to "ja" the words in the segment results are individual letters instead of words.
