Distinguishing between speech and non-speech
Hello,
Whisper does a great job recognizing words. Sometimes it does almost too good a job: for audio files that contain non-speech (grunts, breathing and other vocal efforts), it will still try to detect a word that isn't there.
I was wondering if there is a way, for example through fine-tuning, to specialize this model in correctly detecting whether an audio file contains speech or not.
Any pointer in the right direction would be appreciated.
Thank you!
Hey @CarelessWhisperer ! If you don't want to do any fine-tuning, you could use a voice activity detection model before the Whisper model to segment out the parts where there's active speaker speech, see https://huggingface.co/pyannote/voice-activity-detection#🎹-voice-activity-detection
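For illustration, a minimal sketch of that VAD step (not part of the original suggestion; it assumes pyannote.audio is installed, that you have access to the gated pyannote model on the Hub, and "audio.wav" is just a placeholder file name):

from pyannote.audio import Pipeline

# Load the pretrained voice activity detection pipeline
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")

# Run VAD on the audio file; the result is an annotation of the speech regions
vad_result = vad("audio.wav")

# Each segment is a (start, end) time span that contains active speech
speech_regions = [(segment.start, segment.end) for segment in vad_result.get_timeline().support()]
print(speech_regions)

You would then only pass those regions (or the audio cropped to them) on to Whisper.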
Otherwise, you can fine-tune Whisper on a dataset that does not contain the non-speech annotations; this way it will learn not to predict these tokens.
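Purely as a sketch of that data clean-up (assuming the non-speech annotations appear in the transcripts as bracketed or parenthesised tags like "[breathing]" or "(grunts)"; the dataset name and "text" column are placeholders):

import re
from datasets import load_dataset

# Placeholder dataset; substitute your own fine-tuning corpus
dataset = load_dataset("my_asr_corpus", split="train")

def strip_non_speech(example):
    # Drop bracketed / parenthesised annotations such as "[breathing]" or "(grunts)"
    example["text"] = re.sub(r"\s*[\[\(][^\]\)]*[\]\)]", "", example["text"]).strip()
    return example

dataset = dataset.map(strip_non_speech)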
Actually, I've thought of an even better way than the previous two! What you can do is find out the tokens for these non-speech sounds and add them to the suppress_tokens list in the generation config (see GenerationConfig.suppress_tokens and generation_config.json#L126). If you do this, the tokens will be suppressed during generation.
The ones that are already suppressed by default are:

from transformers import WhisperProcessor, GenerationConfig

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")

# Decode the default suppressed token ids to see which strings they correspond to
suppress_tokens = processor.batch_decode(generation_config.suppress_tokens)
print(suppress_tokens)
Print Output:
['"',
'#',
'(',
')',
'*',
'+',
'/',
':',
';',
'<',
'=',
'>',
'@',
'[',
'\\',
']',
'^',
'_',
'`',
'{',
'|',
'}',
'~',
' -',
' "',
' (',
' [',
' �',
'>>',
' >>',
'--',
" '",
' ♪',
' --',
' *',
' :',
' /',
' <',
'「',
'」',
'�',
' #',
' ♫',
'♪',
' ]',
' +',
' =',
' -(',
' )',
' ♪♪',
'))',
' @',
' {',
' ~',
' \\',
' >',
' ;',
' >>>',
'♫',
' -[',
' ((',
' ("',
'『',
'』',
' |',
' ^',
'---',
' 「',
' ♬',
'♪♪',
' _',
' )))',
' `',
'}}',
' ♪♪♪',
' ))',
' ---',
' ♩',
'♬',
' <<',
' }',
" ('",
'<|startoftranscript|>',
'<|startoflm|>',
'<|startofprev|>',
'<|nocaptions|>']
So you just need to find the token ids for these non-speech sounds and add them to generation_config.suppress_tokens:

generation_config.suppress_tokens.append(TOKEN_ID)

Or to model.generation_config.suppress_tokens if you've loaded the model with from_pretrained.
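For example, to suppress a hypothetical annotation like " (grunting)" (the exact string is a placeholder; use whatever Whisper actually produces for your audio):

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Look up the token ids that make up the unwanted annotation (placeholder string)
unwanted_ids = processor.tokenizer(" (grunting)", add_special_tokens=False).input_ids

# Extend the existing suppress list; generate() will now give these ids -inf log-prob
model.generation_config.suppress_tokens.extend(unwanted_ids)

One caveat: suppressing the sub-word tokens of an annotation also suppresses them everywhere else, so this works best for tokens that only ever appear in the unwanted annotations.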
I am not sure that using a config file to suppress tokens during generation will necessarily prevent ANY token from being generated in response to non-speech audio. Suppressing some tokens simply makes the probability of a different token higher, doesn't it?
"suppress_tokens (List[int], optional) — A list of tokens that will be suppressed at generation. The SupressTokens logit processor will set their log probs to -inf so that they are not sampled."
https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.suppress_tokens
IF the "different token" selected is not a special token, the selected token will be whatever word token is closest, so he will still get an undesired token in response to non-speech sounds. Why activate the Whisper model at all if the tokens will be unwanted? So he may still be better off with a pre-filter such as "a voice activity detection model before the Whisper model to segment out the parts where there's active speaker speech" (https://huggingface.co/pyannote/voice-activity-detection#🎹-voice-activity-detection). Detecting silence and non-speech and using that to not activate the Whisper model at all, except when "there's active speaker speech" (which would also reduce Whisper compute), would be optimal?
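A rough sketch of that VAD-gated setup (my assumptions: pyannote.audio, librosa and transformers installed, "audio.wav" as a placeholder file, and whisper-large-v2 as just one choice of checkpoint):

import librosa
from pyannote.audio import Pipeline
from transformers import pipeline

# VAD decides which regions contain speech; Whisper is only run on those regions
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")

audio, sr = librosa.load("audio.wav", sr=16000)
speech_regions = vad("audio.wav").get_timeline().support()

for segment in speech_regions:
    chunk = audio[int(segment.start * sr): int(segment.end * sr)]
    # Non-speech regions never reach Whisper, saving compute and avoiding spurious words
    print(segment, asr({"raw": chunk, "sampling_rate": sr})["text"])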