Distinguishing between speech and non-speech

#74
by CarelessWhisperer - opened

Hello,

Whisper does a great job recognizing words. Sometimes it does too good a job: for audio files that contain only non-speech sounds (grunts, breathing, and other vocal efforts), it will still try to detect a word that is not there.

I was wondering if there is a way, through fine-tuning for example, to specialize this model in correctly detecting whether an audio file contains speech or not.

Any pointer in the right direction would be appreciated.

Thank you!

Hey @CarelessWhisperer ! If you don't want to do any fine-tuning, you could run a voice activity detection (VAD) model before the Whisper model to segment out the parts that contain active speech, see https://huggingface.co/pyannote/voice-activity-detection#🎹-voice-activity-detection
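If you just want to experiment with the idea before pulling in pyannote, here is a minimal energy-threshold sketch of the same concept. It is a crude stand-in, not a substitute for a trained VAD model, and the frame length and threshold values are arbitrary choices for illustration:

```python
import numpy as np

def energy_vad(audio: np.ndarray, sr: int, frame_ms: int = 25,
               threshold: float = 0.01):
    """Return (start, end) times in seconds for frames whose RMS energy
    exceeds the threshold. A crude stand-in for a real VAD model."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    segments = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms > threshold:
            segments.append((i * frame_len / sr, (i + 1) * frame_len / sr))
    return segments

# Synthetic check: 1 s of silence followed by 1 s of a 440 Hz tone
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
speech = energy_vad(audio, sr)
# All detected segments should lie in the second half of the clip
```

In practice you would only pass the detected segments on to Whisper; anything the VAD drops never gets a chance to be hallucinated as words.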

Otherwise, you can fine-tune Whisper on a dataset whose transcripts do not contain non-speech annotations; this way the model learns not to predict these tokens.
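If you go the fine-tuning route, one way to prepare such a dataset is to strip non-speech markers from the reference transcripts before training. A minimal sketch, assuming the common bracket/parenthesis/asterisk conventions like [laughs], (breathing), or *sighs* (check what your dataset actually uses):

```python
import re

# Matches annotations like [laughs], (breathing), *sighs* -- adjust the
# patterns to whatever convention your dataset's transcripts follow.
NON_SPEECH = re.compile(r"\[[^\]]*\]|\([^)]*\)|\*[^*]*\*")

def strip_non_speech(transcript: str) -> str:
    """Remove non-speech markers and collapse the leftover whitespace."""
    cleaned = NON_SPEECH.sub("", transcript)
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_non_speech("[breathing] hello (grunts) there *sighs*"))
# -> "hello there"
```

Applied over the whole training set (e.g. with datasets' map), this gives Whisper no non-speech targets to learn from.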

Actually, I've thought of an even better way than the previous two! What you can do is find the token ids for these non-speech sounds and add them to the suppress_tokens list in the generation config (see GenerationConfig.suppress_tokens and generation_config.json#L126). Any token in this list is suppressed during generation.

The tokens that are already suppressed are:

from transformers import GenerationConfig, WhisperProcessor

# Load the processor (for decoding token ids back to text) and the
# generation config (which holds the suppress_tokens list)
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")

# Decode the suppressed token ids back to strings to see what they are
suppress_tokens = processor.batch_decode(generation_config.suppress_tokens)
print(suppress_tokens)

Print Output:

['"',
 '#',
 '(',
 ')',
 '*',
 '+',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~',
 ' -',
 ' "',
 ' (',
 ' [',
 ' �',
 '>>',
 ' >>',
 '--',
 " '",
 ' ♪',
 ' --',
 ' *',
 ' :',
 ' /',
 ' <',
 '「',
 '」',
 '�',
 ' #',
 ' ♫',
 '♪',
 ' ]',
 ' +',
 ' =',
 ' -(',
 ' )',
 ' ♪♪',
 '))',
 ' @',
 ' {',
 ' ~',
 ' \\',
 ' >',
 ' ;',
 ' >>>',
 '♫',
 ' -[',
 ' ((',
 ' ("',
 '『',
 '』',
 ' |',
 ' ^',
 '---',
 ' 「',
 ' ♬',
 '♪♪',
 ' _',
 ' )))',
 ' `',
 '}}',
 ' ♪♪♪',
 ' ))',
 ' ---',
 ' ♩',
 '♬',
 ' <<',
 ' }',
 " ('",
 '<|startoftranscript|>',
 '<|startoflm|>',
 '<|startofprev|>',
 '<|nocaptions|>']

So you just need to find the token ids for these non-speech sounds and add them to generation_config.suppress_tokens:

generation_config.suppress_tokens.append(TOKEN_ID)

Or modify model.generation_config.suppress_tokens if you've loaded the model with from_pretrained.
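One small pitfall worth noting: suppress_tokens is a plain list of integer ids, so append single ids or use extend for several; append with a list would nest a list inside the list. The ids below are illustrative placeholders, not real Whisper vocabulary ids:

```python
# Stand-in for generation_config.suppress_tokens: a plain list of int ids
suppress_tokens = [1, 2, 7]

# Hypothetical ids for non-speech sounds; in practice look the real ones
# up via the Whisper tokenizer before adding them here
non_speech_ids = [31337, 42424]

# extend() adds each id individually; append(non_speech_ids) would
# nest the whole list as a single (invalid) element
suppress_tokens.extend(non_speech_ids)
print(suppress_tokens)
# -> [1, 2, 7, 31337, 42424]
```

After updating the list, pass the generation config to model.generate so the extra tokens are suppressed during decoding.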
