Commit f5884f3 by aadnk
1 Parent(s): f1fe464

Update README

Files changed (2)
  1. app.py +1 -1
  2. docs/options.md +23 -3
app.py CHANGED
@@ -209,7 +209,7 @@ def create_ui(inputAudioMaxDuration, share=False, server_name: str = None):
     ui_description += " audio and is also a multi-task model that can perform multilingual speech recognition "
     ui_description += " as well as speech translation and language identification. "
 
-    ui_description += "\n\n\n\nFor longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option."
+    ui_description += "\n\n\n\nFor longer audio files (>10 minutes) not in English, it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option."
 
     if inputAudioMaxDuration > 0:
         ui_description += "\n\n" + "Max audio file length: " + str(inputAudioMaxDuration) + " s"
docs/options.md CHANGED
@@ -33,11 +33,23 @@ the URL.
 Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
 
 ## Vad
+Using a VAD will improve the timing accuracy of each transcribed line, as well as prevent Whisper from getting into an infinite
+loop detecting the same sentence over and over again. The downside is that this may come at a cost to text accuracy, especially
+with regard to unique words or names that appear in the audio. You can compensate for this by increasing the prompt window.
+
+Note that English is very well handled by Whisper, and it's less susceptible to issues surrounding bad timings and infinite loops.
+So you may only need to use a VAD for other languages, such as Japanese, or when the audio is very long.
+
 * none
   * Run whisper on the entire audio input
 * silero-vad
-  * Use Silero VAD to detect sections that contain speech, and run whisper on independently on each section. Whisper is also run
-    on the gaps between each speech section.
+  * Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Whisper is also run
+    on the gaps between each speech section, by either expanding the section up to the max merge size, or running Whisper independently
+    on the non-speech section.
+* silero-vad-expand-into-gaps
+  * Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Each speech section will be expanded
+    such that it covers any adjacent non-speech sections. For instance, if an audio file of one minute contains the speech sections
+    00:00 - 00:10 (A) and 00:30 - 00:40 (B), the first section (A) will be expanded to 00:00 - 00:30, and (B) will be expanded to 00:30 - 01:00.
 * silero-vad-skip-gaps
   * As above, but sections that don't contain speech according to Silero will be skipped. This will be slightly faster, but
     may cause dialogue to be skipped.
@@ -55,4 +67,12 @@ Disables merging of adjacent speech sections if they are this number of seconds
 The number of seconds (floating point) to add to the beginning and end of each speech section. Setting this to a number
 larger than zero ensures that Whisper is more likely to correctly transcribe a sentence in the beginning of
 a speech section. However, this also increases the probability of Whisper assigning the wrong timestamp
-to each transcribed line. The default value is 1 second.
+to each transcribed line. The default value is 1 second.
+
+## VAD - Prompt Window (s)
+The text of a detected line will be included as a prompt to the next speech section, if the speech section starts at most this
+number of seconds after the line has finished. For instance, if a line ends at 10:00, and the next speech section starts at
+10:04, the line's text will be included if the prompt window is 4 seconds or more (10:04 - 10:00 = 4 seconds).
+
+Note that detected lines in gaps between speech sections will not be included in the prompt
+(if silero-vad or silero-vad-expand-into-gaps is used).
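
The two timing rules documented in this commit (expanding speech sections into adjacent gaps, and the prompt window check) can be sketched as below. This is only an illustration of the documented behaviour, not the code in app.py: the function names `expand_into_gaps` and `include_in_prompt` are hypothetical, sections are assumed to be `(start, end)` pairs in seconds, and giving any leading silence to the first section is an assumption that the documentation does not state.

```python
from typing import List, Tuple

Section = Tuple[float, float]  # (start, end) in seconds


def expand_into_gaps(sections: List[Section], total_duration: float) -> List[Section]:
    """Stretch each speech section so it also covers the gap that follows it.

    Hypothetical helper. Assumption: leading silence goes to the first
    section and trailing silence to the last one.
    """
    if not sections:
        return []
    ordered = sorted(sections)
    expanded = []
    for i, (start, _end) in enumerate(ordered):
        new_start = 0.0 if i == 0 else start
        # A section ends where the next one starts, or at the end of the audio.
        new_end = ordered[i + 1][0] if i + 1 < len(ordered) else total_duration
        expanded.append((new_start, new_end))
    return expanded


def include_in_prompt(line_end: float, section_start: float, prompt_window: float) -> bool:
    """True if a line's text should be passed as a prompt to the next speech section.

    Hypothetical helper. Assumption: a section that starts before the line
    has finished never receives that line as a prompt.
    """
    gap = section_start - line_end
    return 0 <= gap <= prompt_window


if __name__ == "__main__":
    # The one-minute example from the Vad section: A = 00:00-00:10, B = 00:30-00:40.
    print(expand_into_gaps([(0.0, 10.0), (30.0, 40.0)], 60.0))
    # The prompt window example: line ends at 10:00, next section starts at 10:04.
    print(include_in_prompt(600.0, 604.0, 4.0))
```

With the values from the documentation, section A grows to cover 00:00 - 00:30 and B to cover 00:30 - 01:00, and a prompt window of 4 seconds is just large enough for the line ending at 10:00 to be included.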