Even when the speaker starts talking after 10 sec, Whisper makes the first timestamp start at sec 0. How could I change that?
#77, opened by romain130492
Hello,
I'm using Whisper. When I transcribe a video in which the speaker starts talking at sec 10, the first timestamp starts at sec 1 instead of sec 10.
Here is my config:
POST v1/audio/transcriptions
{
  model: "whisper-1",
  file: "...mp3",
  response_format: "srt",
  prompt: "Hello, welcome to my lecture"
}
Output:
1
00:00:01,000 --> 00:00:14,000
Why are there both successful and struggling entrepreneurs?
2
00:00:15,000 --> 00:00:23,000
Many customers prefer to watch videos to enjoy online content.
3
00:00:24,000 --> 00:00:32,000
an other sentences.
- I believe cue 1 should be 00:00:10,000 --> 00:00:14,000, since no one is talking at all for the first 10 sec.
- Also, for cue 3, the speaker starts talking again at sec 28, but I'm getting a timestamp that starts at sec 24. Whisper simply includes the silence in the timestamp.
Any idea how I could fix that, maybe using a prompt?
Thanks!
Hey @romain130492 - for this you can use word-level timestamps. See https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb and https://github.com/m-bain/whisperX
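To make the word-level idea concrete, here is a minimal sketch, not a definitive implementation. It assumes you already have segments in the shape the open-source `whisper` package returns when you call `transcribe(..., word_timestamps=True)`: each segment carries a `words` list with `start` and `end` in seconds. The helper names `srt_time` and `segments_to_srt` and the sample numbers are made up for illustration. Each cue is bounded by its first and last word, so leading silence is no longer counted.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Build SRT cues bounded by the first and last word of each segment."""
    cues = []
    for seg in segments:
        words = seg.get("words") or []
        # Fall back to the segment's own bounds if no word timings are present.
        start = words[0]["start"] if words else seg["start"]
        end = words[-1]["end"] if words else seg["end"]
        text = seg["text"].strip()
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


# Hand-written sample in the shape of result["segments"]; the timings are invented.
sample = [
    {
        "start": 1.0,
        "end": 14.0,
        "text": " Why are there both successful and struggling entrepreneurs?",
        "words": [
            {"word": " Why", "start": 10.0, "end": 10.4},
            {"word": " entrepreneurs?", "start": 13.2, "end": 14.0},
        ],
    },
]

print(segments_to_srt(sample))
```

With the open-source package you would feed it real data along the lines of `result = whisper.load_model("base").transcribe("audio.mp3", word_timestamps=True)` followed by `segments_to_srt(result["segments"])`, where the model size and the file name are placeholders. Word timings are estimates themselves, so treat the tightened cues as an approximation.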