Fine-tuning Whisper models for shorter audio segments

#26
by Malishevsky - opened

Hi all. My project needs to transcribe many short audio clips. Can I fine-tune the multilingual model for short clips of around 10 seconds? If not, can I train the model from scratch for this purpose? I would be grateful for any help and hints.

You can seed it with an input text that shows the style of what you are transcribing. That gives it prior context. Specify the language explicitly, as it needs more than 10 seconds to figure out the language on its own. Also consider combining the files into one, using pydub to paste a stop word in between, e.g. "poolboy" or some other word that does not occur in the text.
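A minimal sketch of the combine-with-a-stop-word idea. The reply suggests pydub, where concatenation is `clip_a + separator + clip_b` on `AudioSegment` objects; this version swaps in the standard-library `wave` module so that it runs without extra dependencies. The function name `concat_with_separator` and the separator recording (a short WAV of the spoken stop word) are assumptions made for illustration, and all files are assumed to be uncompressed WAV with the same channel count, sample width and sample rate.

```python
import wave


def concat_with_separator(clip_paths, separator_path, out_path):
    """Write clip1 + separator + clip2 + ... + clipN to out_path (WAV only)."""
    # Read the stop-word recording once; its format is the reference format.
    with wave.open(separator_path, "rb") as sep:
        fmt = tuple(sep.getparams()[:3])  # (channels, sample width, frame rate)
        sep_frames = sep.readframes(sep.getnframes())

    chunks = []
    for index, path in enumerate(clip_paths):
        with wave.open(path, "rb") as clip:
            if tuple(clip.getparams()[:3]) != fmt:
                raise ValueError(f"{path} does not match the separator's format")
            if index > 0:
                # The separator goes between clips, not before the first one.
                chunks.append(sep_frames)
            chunks.append(clip.readframes(clip.getnframes()))

    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(chunks))
```

After transcribing the combined file, splitting the text on the stop word should give back one piece per original clip, provided the word really never occurs in the speech.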

The model is closed as far as I know. With many files, run them in parallel on several GPUs, using the ordinary fast-whisper repo instead of JAX.
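One way the parallel run could look, as a rough sketch rather than a tested recipe: deal the files out across GPUs and let one process per GPU transcribe its share, passing the language and a style prompt explicitly as suggested above. The backend here is an assumption: it uses the `openai-whisper` package (`whisper.load_model` and `model.transcribe` with its `language` and `initial_prompt` options), not the repo named in the reply. The helper names, the `"small"` model size and the one-process-per-GPU layout are likewise illustrative.

```python
import os
from multiprocessing import get_context


def shard_round_robin(paths, n_shards):
    """Deal paths out into n_shards lists, like dealing cards."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    return [paths[i::n_shards] for i in range(n_shards)]


def transcribe_shard(gpu_index, paths, language, style_prompt):
    """Worker: transcribe one shard on one GPU (assumes openai-whisper)."""
    # Pin this process to a single GPU before any CUDA library is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    import whisper  # assumption: the openai-whisper package is installed

    model = whisper.load_model("small", device="cuda")
    results = []
    for path in paths:
        # An explicit language skips auto-detection, which is unreliable on
        # very short clips; initial_prompt supplies the prior context.
        out = model.transcribe(path, language=language, initial_prompt=style_prompt)
        results.append((path, out["text"]))
    return results


def transcribe_all(paths, n_gpus, language, style_prompt):
    """Run one worker process per GPU and return the results as one flat list."""
    shards = shard_round_robin(paths, n_gpus)
    jobs = [(i, shard, language, style_prompt) for i, shard in enumerate(shards)]
    # "spawn" gives every worker a fresh interpreter with no inherited CUDA state.
    with get_context("spawn").Pool(n_gpus) as pool:
        per_gpu = pool.starmap(transcribe_shard, jobs)
    return [item for shard in per_gpu for item in shard]
```

Because the pool uses the spawn start method, `transcribe_all` should be called from under an `if __name__ == "__main__":` guard in a script.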

That's all I know.
