Distinguishing between speech and non-speech
Hello,
Whisper does a great job recognizing words. Sometimes it does almost too good a job: for audio files that contain non-speech (grunts, breathing and other vocal efforts), it will still try to detect a word that isn't there.
I was wondering if there is a way, for example through fine-tuning, to specialize this model in correctly detecting whether an audio file contains speech or not.
Any pointer in the right direction would be appreciated.
Thank you!
Hey @CarelessWhisperer ! If you don't want to do any fine-tuning, you could use a voice activity detection model before the Whisper model to segment out the parts where there's active speaker speech, see https://huggingface.co/pyannote/voice-activity-detection#🎹-voice-activity-detection
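For illustration, a minimal sketch of that VAD step (not part of the original suggestion; it assumes pyannote.audio is installed, that you have access to the gated pyannote model on the Hub, and "audio.wav" is just a placeholder file name):

from pyannote.audio import Pipeline

# Load the pretrained voice activity detection pipeline
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")

# Run VAD on the audio file; the result is an annotation of the speech regions
vad_result = vad("audio.wav")

# Each segment is a (start, end) time span that contains active speech
speech_regions = [(segment.start, segment.end) for segment in vad_result.get_timeline().support()]
print(speech_regions)

You would then only pass those regions (or the audio cropped to them) on to Whisper.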
Otherwise, you can fine-tune Whisper on a dataset that does not contain the non-speech annotations; this way it will learn not to predict these tokens.
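Purely as a sketch of that data clean-up (assuming the non-speech annotations appear in the transcripts as bracketed or parenthesised tags like "[breathing]" or "(grunts)"; the dataset name and "text" column are placeholders):

import re
from datasets import load_dataset

# Placeholder dataset; substitute your own fine-tuning corpus
dataset = load_dataset("my_asr_corpus", split="train")

def strip_non_speech(example):
    # Drop bracketed / parenthesised annotations such as "[breathing]" or "(grunts)"
    example["text"] = re.sub(r"\s*[\[\(][^\]\)]*[\]\)]", "", example["text"]).strip()
    return example

dataset = dataset.map(strip_non_speech)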
Actually, I've thought of an even better way than the previous two! What you can do is find out the tokens for these non-speech sounds and add them to the suppress_tokens list in the generation config (see GenerationConfig.suppress_tokens and generation_config.json#L126). If you do this, the tokens will be suppressed during generation.
The ones that are already suppressed by default are:

from transformers import WhisperProcessor, GenerationConfig

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")

# Decode the default suppressed token ids to see which strings they correspond to
suppress_tokens = processor.batch_decode(generation_config.suppress_tokens)
print(suppress_tokens)
Print Output:
['"',
'#',
'(',
')',
'*',
'+',
'/',
':',
';',
'<',
'=',
'>',
'@',
'[',
'\\',
']',
'^',
'_',
'`',
'{',
'|',
'}',
'~',
' -',
' "',
' (',
' [',
' �',
'>>',
' >>',
'--',
" '",
' ♪',
' --',
' *',
' :',
' /',
' <',
'「',
'」',
'�',
' #',
' ♫',
'♪',
' ]',
' +',
' =',
' -(',
' )',
' ♪♪',
'))',
' @',
' {',
' ~',
' \\',
' >',
' ;',
' >>>',
'♫',
' -[',
' ((',
' ("',
'『',
'』',
' |',
' ^',
'---',
' 「',
' ♬',
'♪♪',
' _',
' )))',
' `',
'}}',
' ♪♪♪',
' ))',
' ---',
' ♩',
'♬',
' <<',
' }',
" ('",
'<|startoftranscript|>',
'<|startoflm|>',
'<|startofprev|>',
'<|nocaptions|>']
So you just need to find the token ids for these non-speech sounds and add them to generation_config.suppress_tokens:

generation_config.suppress_tokens.append(TOKEN_ID)

Or to model.generation_config.suppress_tokens if you've loaded the model with from_pretrained.
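For example, to suppress a hypothetical annotation like " (grunting)" (the exact string is a placeholder; use whatever Whisper actually produces for your audio):

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Look up the token ids that make up the unwanted annotation (placeholder string)
unwanted_ids = processor.tokenizer(" (grunting)", add_special_tokens=False).input_ids

# Extend the existing suppress list; generate() will now give these ids -inf log-prob
model.generation_config.suppress_tokens.extend(unwanted_ids)

One caveat: suppressing the sub-word tokens of an annotation also suppresses them everywhere else, so this works best for tokens that only ever appear in the unwanted annotations.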
I am not sure that using a config file to suppress tokens during generation will necessarily prevent ANY token from being generated in response to non-speech audio. Suppressing some tokens simply makes the probability of a different token higher, doesn't it?
"suppress_tokens (List[int], optional) — A list of tokens that will be suppressed at generation. The SupressTokens logit processor will set their log probs to -inf so that they are not sampled."
https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.suppress_tokens
IF the "different token" selected is not a special token, the selected token will be whatever word token is closest, so he will still get an undesired token in response to non-speech sounds. Why activate the Whisper model at all if the tokens will be unwanted? So he may still be better off with a pre-filter such as "a voice activity detection model before the Whisper model to segment out the parts where there's active speaker speech" (https://huggingface.co/pyannote/voice-activity-detection#🎹-voice-activity-detection). Detecting silence and non-speech and using that to not activate the Whisper model at all, except when "there's active speaker speech" (which would also reduce Whisper compute), would be optimal?
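A rough sketch of that VAD-gated setup (my assumptions: pyannote.audio, librosa and transformers installed, "audio.wav" as a placeholder file, and whisper-large-v2 as just one choice of checkpoint):

import librosa
from pyannote.audio import Pipeline
from transformers import pipeline

# VAD decides which regions contain speech; Whisper is only run on those regions
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")

audio, sr = librosa.load("audio.wav", sr=16000)
speech_regions = vad("audio.wav").get_timeline().support()

for segment in speech_regions:
    chunk = audio[int(segment.start * sr): int(segment.end * sr)]
    # Non-speech regions never reach Whisper, saving compute and avoiding spurious words
    print(segment, asr({"raw": chunk, "sampling_rate": sr})["text"])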