Update README
- app.py +1 -1
- docs/options.md +23 -3
app.py
CHANGED
@@ -209,7 +209,7 @@ def create_ui(inputAudioMaxDuration, share=False, server_name: str = None):
 ui_description += " audio and is also a multi-task model that can perform multilingual speech recognition "
 ui_description += " as well as speech translation and language identification. "
 
-ui_description += "\n\n\n\nFor longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option."
+ui_description += "\n\n\n\nFor longer audio files (>10 minutes) not in English, it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option."
 
 if inputAudioMaxDuration > 0:
 ui_description += "\n\n" + "Max audio file length: " + str(inputAudioMaxDuration) + " s"
docs/options.md
CHANGED
@@ -33,11 +33,23 @@ the URL.
 Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
 
 ## Vad
+Using a VAD will improve the timing accuracy of each transcribed line, as well as prevent Whisper from getting into an infinite
+loop detecting the same sentence over and over again. The downside is that this may come at a cost to text accuracy, especially
+with regard to unique words or names that appear in the audio. You can compensate for this by increasing the prompt window.
+
+Note that English is very well handled by Whisper, and it's less susceptible to issues surrounding bad timings and infinite loops.
+So you may only need to use a VAD for other languages, such as Japanese, or when the audio is very long.
+
 * none
 * Run whisper on the entire audio input
 * silero-vad
-* Use Silero VAD to detect sections that contain speech, and run
-on the gaps between each speech section
+* Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Whisper is also run
+on the gaps between each speech section, by either expanding the section up to the max merge size, or running Whisper independently
+on the non-speech section.
+* silero-vad-expand-into-gaps
+* Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Each speech section will be expanded
+such that it covers any adjacent non-speech sections. For instance, if an audio file of one minute contains the speech sections
+00:00 - 00:10 (A) and 00:30 - 00:40 (B), the first section (A) will be expanded to 00:00 - 00:30, and (B) will be expanded to 00:30 - 01:00.
 * silero-vad-skip-gaps
 * As above, but sections that don't contain speech according to Silero will be skipped. This will be slightly faster, but
 may cause dialogue to be skipped.
@@ -55,4 +67,12 @@ Disables merging of adjacent speech sections if they are this number of seconds
 The number of seconds (floating point) to add to the beginning and end of each speech section. Setting this to a number
 larger than zero ensures that Whisper is more likely to correctly transcribe a sentence in the beginning of
 a speech section. However, this also increases the probability of Whisper assigning the wrong timestamp
-to each transcribed line. The default value is 1 second.
+to each transcribed line. The default value is 1 second.
+
+## VAD - Prompt Window (s)
+The text of a detected line will be included as a prompt to the next speech section, if the speech section starts at most this
+number of seconds after the line has finished. For instance, if a line ends at 10:00, and the next speech section starts at
+10:04, the line's text will be included if the prompt window is 4 seconds or more (10:04 - 10:00 = 4 seconds).
+
+Note that detected lines in gaps between speech sections will not be included in the prompt
+(if silero-vad or silero-vad-expand-into-gaps is used).
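The prompt-window rule described in the docs/options.md change comes down to one comparison between the end of a line and the start of the next speech section. The following is a minimal Python sketch of that rule, assuming all times are given in seconds. The function `include_in_prompt` and its parameter names are hypothetical and chosen for illustration; they are not taken from this repository.

```python
def include_in_prompt(line_end: float, next_section_start: float, prompt_window: float) -> bool:
    """Decide whether a finished line's text is passed as a prompt to the
    next speech section. Hypothetical helper, not the repository's code."""
    gap = next_section_start - line_end
    # The line qualifies when the next section starts at most
    # `prompt_window` seconds after the line ended. Rejecting negative
    # gaps (overlapping sections) is an assumption of this sketch.
    return 0 <= gap <= prompt_window


# The example from the documentation: a line ends at 10:00 and the next
# speech section starts at 10:04, so the gap is 4 seconds.
line_end = 10 * 60
next_start = 10 * 60 + 4
print(include_in_prompt(line_end, next_start, prompt_window=4))  # True
print(include_in_prompt(line_end, next_start, prompt_window=3))  # False
```

A window of 0 would, under this sketch, only carry a prompt across sections that start exactly where the previous line ended.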
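The silero-vad-expand-into-gaps description in the docs/options.md change can be illustrated with a short Python sketch. It assumes speech sections are `(start, end)` pairs in seconds and that each section is extended forward to the start of the next one, with the last section extended to the end of the audio. The function `expand_into_gaps` is a hypothetical name, and how a non-speech gap before the first section is handled is not stated in the documentation, so this sketch leaves it untouched.

```python
from typing import List, Tuple

Section = Tuple[float, float]  # (start, end) in seconds


def expand_into_gaps(sections: List[Section], total_duration: float) -> List[Section]:
    """Extend each speech section so it covers the non-speech gap that
    follows it. Hypothetical sketch, not the repository's implementation."""
    ordered = sorted(sections)
    expanded = []
    for index, (start, _end) in enumerate(ordered):
        if index + 1 < len(ordered):
            # Stretch up to the start of the next speech section.
            new_end = ordered[index + 1][0]
        else:
            # The last section stretches to the end of the audio.
            new_end = total_duration
        expanded.append((start, new_end))
    return expanded


# A one-minute file with speech at 00:00 - 00:10 (A) and 00:30 - 00:40 (B).
print(expand_into_gaps([(0, 10), (30, 40)], total_duration=60))
# [(0, 30), (30, 60)]
```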