Why not implement a feature that allows you to use subs as reference files to get perfect timing?

#29
by RECCCQ - opened

(Note: I am not a programmer.) Similar to sub2srs, this would work something like this: you input your reference sub and the audio/video. Whisper then takes each timing from the reference subtitle file and builds a new subtitle from it, line by line: it listens only to the audio inside each timing from the referenced sub and transcribes just those parts.

This eliminates many of Whisper's weaknesses. Because it is handed already time-stamped audio from the reference sub, Whisper doesn't have to wade through a bunch of irrelevant noise and get confused, and it won't bundle a bunch of text together, since it isn't running on the whole video, only on the timings where the referenced sub is shown. In a way this would also make it faster, because it wouldn't have to scan the entire video for speech. Obviously this isn't a universal solution, since YouTube videos don't have subs, but it would be a good alternative for many shows. I am not a programmer, so feel free to enlighten me on whatever I've got wrong; I feel it would be really amazing if you could pull this off, though. Also, since some subs are not perfectly timed, it would help to add a delay option to shift them by a couple of milliseconds.
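To make the idea concrete, the workflow I'm imagining might look roughly like this in Python. This is only a sketch, not something I've tested: it assumes the third-party `srt`, `pydub` and `openai-whisper` packages, and the `delay_ms` argument stands in for the delay adjustment mentioned above.

```python
import datetime
import srt
import whisper
from pydub import AudioSegment

def transcribe_with_reference(audio_path, ref_srt_path, delay_ms=0, model_name="small"):
    """Use an existing SRT only for its timings: cut the audio at each cue
    and run Whisper on each cut separately."""
    model = whisper.load_model(model_name)
    audio = AudioSegment.from_file(audio_path)

    with open(ref_srt_path, encoding="utf-8") as f:
        cues = list(srt.parse(f.read()))

    out = []
    for i, cue in enumerate(cues, start=1):
        start_ms = int(cue.start.total_seconds() * 1000) + delay_ms
        end_ms = int(cue.end.total_seconds() * 1000) + delay_ms
        clip = audio[max(start_ms, 0):end_ms]

        # Whisper only ever sees this one cue's worth of audio, so it cannot
        # merge lines together or wander off into silence and background noise.
        clip.export("clip.wav", format="wav")
        text = model.transcribe("clip.wav")["text"].strip()

        out.append(srt.Subtitle(
            index=i,
            start=datetime.timedelta(milliseconds=start_ms),
            end=datetime.timedelta(milliseconds=end_ms),
            content=text,
        ))

    with open("output.srt", "w", encoding="utf-8") as f:
        f.write(srt.compose(out))
```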

RECCCQ changed discussion title from Why not implement a feature that allows you to use sub-reference files to get perfect timing? to Why not implement a feature that allows you to use subs as reference files to get perfect timing?
RECCCQ changed discussion status to closed

Ah, sorry for not responding earlier.

I see, so you essentially want to use an input SRT file for Voice Activity Detection, and only run Whisper on the sections where there is speech.

But Whisper-WebUI already supports this through the "VAD" option, which is enabled by default and uses Silero-VAD to detect the presence of speech. It effectively splits the audio into chunks based on where speech is and isn't present, and passes those chunks to Whisper for better timing accuracy and fewer hallucinations. Take a look at this figure to see how it works in practice:

[Figure: Chunks and sections — speech sections detected by VAD, grouped into chunks]

VAD detects the speech sections (marked as Speech #1, Speech #2, etc.), and Whisper-WebUI then groups these into chunks (Chunk #1, Chunk #2), which is what actually gets passed to Whisper. These chunks have a maximum duration, which avoids some of the timing issues you mention.
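Roughly, in code, the detection and grouping look something like this. This is a simplified sketch rather than the actual Whisper-WebUI implementation: the 30-second cap and the greedy grouping are just for illustration.

```python
import torch

SAMPLE_RATE = 16000
MAX_CHUNK_SEC = 30.0  # illustrative maximum chunk duration

# Load Silero-VAD and its helper functions from torch.hub
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("audio.wav", sampling_rate=SAMPLE_RATE)
# List of {"start": sample, "end": sample} sections (Speech #1, Speech #2, ...)
speech = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE)

chunks = []      # each chunk: (start_sec, end_sec) covering one or more speech sections
current = None
for sec in speech:
    start = sec["start"] / SAMPLE_RATE
    end = sec["end"] / SAMPLE_RATE
    if current is None:
        current = [start, end]
    elif end - current[0] <= MAX_CHUNK_SEC:
        current[1] = end              # extend the current chunk to cover this section
    else:
        chunks.append(tuple(current)) # chunk is full, start a new one
        current = [start, end]
if current is not None:
    chunks.append(tuple(current))

# Each chunk is what gets passed to Whisper, instead of the whole file.
print(chunks)
```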

Now, you could replace the VAD with an SRT file to get even more accurate speech timings, but it probably wouldn't make much of a difference in the end. In fact, it might be slightly worse: Silero-VAD produces much shorter speech sections than the subtitle lines in an SRT file, which makes it easier to pack those sections into larger chunks for Whisper.
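If you did want to try the SRT-as-VAD idea, the cues from an SRT file could be converted into the same kind of speech-timestamp list that Silero-VAD produces and fed into the same chunking step. Again just a sketch, assuming the `srt` package and the 16 kHz sample rate from the snippet above:

```python
import srt

def srt_to_speech_timestamps(srt_path, sample_rate=16000):
    """Turn SRT cues into Silero-style {"start", "end"} timestamps (in samples)."""
    with open(srt_path, encoding="utf-8") as f:
        cues = srt.parse(f.read())
    return [
        {"start": int(c.start.total_seconds() * sample_rate),
         "end": int(c.end.total_seconds() * sample_rate)}
        for c in cues
    ]

# speech = srt_to_speech_timestamps("reference.srt")  # in place of get_speech_timestamps(...)
```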

But if you have an SRT file - why do you need to run Whisper on the audio at all? For translation? You will get much better translations from a model designed for translation, or simply from ChatGPT or similar. For instance, I often use Whisper to transcribe a YouTube video and then paste the transcription into GPT-4 (via SubtitleEdit's Auto-translate, by copy-paste) to translate it into English. This yields much better translations than Whisper itself, which only has 1,550M parameters.
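If you wanted to script that last step instead of copy-pasting, something along these lines would work. This is just a sketch using the OpenAI Python client; the model name and the per-line prompt are illustrative choices, not exactly what I do via SubtitleEdit:

```python
import srt
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def translate_srt(srt_path, target_language="English"):
    """Translate each subtitle line with GPT-4, keeping the original timings."""
    with open(srt_path, encoding="utf-8") as f:
        cues = list(srt.parse(f.read()))

    for cue in cues:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Translate this subtitle line into {target_language}. "
                           f"Reply with the translation only:\n{cue.content}",
            }],
        )
        cue.content = resp.choices[0].message.content.strip()

    return srt.compose(cues)
```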
