# Standard Options
To transcribe or translate an audio file, you can either paste a URL from a website (any [website](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md) 
supported by yt-dlp will work, including YouTube), upload an audio file (choose "All Files (*.*)" 
in the file selector to select any file type, including video files), or use the microphone.

For longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option, especially if you are using the `large-v1` model. Note that `large-v2` is a lot more forgiving, but you may still want to use a VAD with a slightly higher "VAD - Max Merge Size (s)" (60 seconds or more).

## Model
Select the model that Whisper will use to transcribe the audio:

| Size      | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|-----------|------------|--------------------|--------------------|---------------|----------------|
| tiny      | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
| base      | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
| small     | 244 M      | small.en           | small              | ~2 GB         | ~6x            |
| medium    | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            |
| large     | 1550 M     | N/A                | large              | ~10 GB        | 1x             |
| large-v2  | 1550 M     | N/A                | large-v2           | ~10 GB        | 1x             |
| large-v3  | 1550 M     | N/A                | large-v3           | ~10 GB        | 1x             |
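
As a rough illustration of what happens under the hood, a model of the corresponding size can be loaded directly with the `openai-whisper` Python package (a minimal sketch; the UI's actual loading code may differ):

```python
import whisper

# Download (if necessary) and load the multilingual "medium" model.
# Use "medium.en" for the English-only variant.
model = whisper.load_model("medium")

result = model.transcribe("audio.mp3")
print(result["text"])
```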

## Language

Select the language, or leave it empty for Whisper to automatically detect it. 

Note that if the selected language and the language in the audio differ, Whisper may start to translate the audio to the selected 
language. For instance, if the audio is in English but you select Japanese, the model may translate the audio to Japanese.

## Inputs
The options "URL (YouTube, etc.)", "Upload Files" and "Microphone Input" allow you to send an audio input to the model.

### Multiple Files
Note that the UI will only process either the given URL or the uploaded files (including microphone input) - not both. 

However, you can upload multiple files either through the "Upload files" option, or as a playlist on YouTube. Each audio file will then be processed in turn, and the resulting SRT/VTT/Transcript will be made available in the "Download" section. When more than one file is processed, the UI will also generate an "All_Output" zip file containing all the text output files.
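
For reference, URL handling is based on yt-dlp. Downloading the audio of a playlist (or a single video) stand-alone looks roughly like this (a sketch; the format and output template shown are illustrative, not the UI's exact settings):

```python
import yt_dlp

# Download the best available audio track of every entry in the playlist to the
# current directory; each downloaded file can then be transcribed in turn.
options = {
    "format": "bestaudio/best",
    "outtmpl": "%(title)s.%(ext)s",
}
with yt_dlp.YoutubeDL(options) as ydl:
    ydl.download(["https://www.youtube.com/playlist?list=PLAYLIST_ID"])
```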

## Task
Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
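
Both the Language and Task options map directly onto Whisper's `transcribe` parameters. A minimal sketch using the `openai-whisper` package (the file name is a placeholder):

```python
import whisper

model = whisper.load_model("large-v2")

# Transcribe Japanese audio as Japanese text ...
result = model.transcribe("interview.mp3", language="ja", task="transcribe")

# ... or translate the same audio to English.
translated = model.transcribe("interview.mp3", language="ja", task="translate")
```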

## VAD
Using a VAD will improve the timing accuracy of each transcribed line, as well as prevent Whisper from getting stuck in an infinite
loop detecting the same sentence over and over again. The downside is that this may come at a cost to text accuracy, especially
with regard to unique words or names that appear in the audio. You can compensate for this by increasing the prompt window. 

Note that English is very well handled by Whisper, and it's less susceptible to issues surrounding bad timings and infinite loops. 
So you may only need to use a VAD for other languages, such as Japanese, or when the audio is very long.

* none
  * Run whisper on the entire audio input
* silero-vad
   * Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section (see the sketch after this list). Whisper is also run 
     on the gaps between each speech section, by either expanding the section up to the max merge size, or running Whisper independently 
     on the non-speech section.
* silero-vad-expand-into-gaps
   * Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Each speech section will be expanded
     such that it covers any adjacent non-speech sections. For instance, if an audio file of one minute contains the speech sections 
     00:00 - 00:10 (A) and 00:30 - 00:40 (B), the first section (A) will be expanded to 00:00 - 00:30, and (B) will be expanded to 00:30 - 01:00.
* silero-vad-skip-gaps
   * As above, but sections that don't contain speech according to Silero will be skipped. This will be slightly faster, but 
     may cause dialogue to be skipped.
* periodic-vad
   * Create sections of speech every 'VAD - Max Merge Size' seconds. This is very fast and simple, but will potentially break 
     a sentence or word in two.
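
A minimal sketch of the silero-vad idea (not the project's actual implementation): detect speech timestamps with Silero VAD, then run Whisper independently on each detected section.

```python
import torch
import whisper

# Load the Silero VAD model and its helper functions from torch hub.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SAMPLE_RATE = 16000
audio = read_audio("audio.wav", sampling_rate=SAMPLE_RATE)
sections = get_speech_timestamps(audio, vad_model, sampling_rate=SAMPLE_RATE)

# Run Whisper on each detected speech section.
asr = whisper.load_model("large-v2")
for section in sections:
    chunk = audio[section["start"]:section["end"]].numpy()
    result = asr.transcribe(chunk)
    print(section["start"] / SAMPLE_RATE, result["text"])
```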

## VAD - Merge Window
If set, any adjacent speech sections that are at most this number of seconds apart will be automatically merged.

## VAD - Max Merge Size (s)
Adjacent speech sections will not be merged if the resulting merged section would be longer than this number of seconds.
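
The interplay between the merge window and the max merge size can be sketched roughly as follows (illustrative logic, not the project's exact algorithm):

```python
def merge_sections(sections, merge_window=5.0, max_merge_size=30.0):
    """Merge adjacent speech sections (given as (start, end) tuples in seconds)
    that are at most `merge_window` seconds apart, as long as the merged
    section does not grow beyond `max_merge_size` seconds."""
    merged = []
    for start, end in sections:
        if (merged
                and start - merged[-1][1] <= merge_window
                and end - merged[-1][0] <= max_merge_size):
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# The first two sections are merged; the third starts a new section.
print(merge_sections([(0.0, 4.0), (6.0, 10.0), (45.0, 50.0)]))
# -> [(0.0, 10.0), (45.0, 50.0)]
```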

## VAD - Process Timeout (s)
This configures the number of seconds until a process is killed due to inactivity, freeing RAM and video memory. The default value is 30 minutes.

## VAD - Padding (s)
The number of seconds (floating point) to add to the beginning and end of each speech section. Setting this to a number
larger than zero ensures that Whisper is more likely to correctly transcribe a sentence in the beginning of 
a speech section. However, this also increases the probability of Whisper assigning the wrong timestamp 
to each transcribed line. The default value is 1 second.

## VAD - Prompt Window (s)
The text of a detected line will be included as a prompt to the next speech section, if the speech section starts at most this
number of seconds after the line has finished. For instance, if a line ends at 10:00, and the next speech section starts at
10:04, the line's text will be included if the prompt window is 4 seconds or more (10:04 - 10:00 = 4 seconds).

Note that detected lines in gaps between speech sections will not be included in the prompt 
(if silero-vad or silero-vad-expand-into-gaps is used).
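
In code, the decision is roughly the following (an illustrative sketch, not the exact implementation):

```python
def build_prompt(previous_line_end, previous_line_text, next_section_start,
                 prompt_window=3.0):
    """Return the previous line's text as a prompt for the next speech section
    if the gap between them is within the prompt window (all times in seconds)."""
    gap = next_section_start - previous_line_end
    if 0 <= gap <= prompt_window:
        return previous_line_text
    return None

# A line ending at 10:00 is reused as a prompt for a section starting at 10:04
# only if the prompt window is at least 4 seconds.
print(build_prompt(600.0, "previous line", 604.0, prompt_window=4.0))  # "previous line"
print(build_prompt(600.0, "previous line", 604.0, prompt_window=3.0))  # None
```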

## Diarization

If checked, Pyannote will be used to detect speakers in the audio, and label them as (SPEAKER 00), (SPEAKER 01), etc. 

This requires a HuggingFace API key to function, which can be supplied with the `--auth_token` command line option for the CLI,
set in the `config.json5` file for the GUI, or provided via the `HF_ACCESS_TOKEN` environment variable.

## Diarization - Speakers

The number of speakers to detect. If set to 0, Pyannote will attempt to detect the number of speakers automatically.
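
For reference, the underlying Pyannote pipeline can be used along these lines (a minimal sketch; the pipeline name, file name and speaker count are illustrative, and the UI's actual integration may differ):

```python
import os
from pyannote.audio import Pipeline

# The HuggingFace token can come from the HF_ACCESS_TOKEN environment variable,
# the --auth_token CLI option, or config.json5, as described above.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=os.environ["HF_ACCESS_TOKEN"],
)

# num_speakers fixes the speaker count; omitting it corresponds to automatic
# detection (the "0" setting in the UI).
diarization = pipeline("audio.wav", num_speakers=2)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```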

# Command Line Options

Both `app.py` and `cli.py` also accept command line options, such as the ability to enable parallel execution on multiple
CPU/GPU cores, the default model name/VAD and so on. Consult the README in the root folder for more information.

# Additional Options

In addition to the above, there's also a "Full" options interface that allows you to set all the options available in the Whisper 
model. The options are as follows:

## Initial Prompt
Optional text to provide as a prompt for the first 30-second window. Whisper will attempt to use this as a starting point for the transcription, but you can
also get creative and specify a style or format for the output of the transcription.

For instance, if you use the prompt "hello how is it going always use lowercase no punctuation goodbye one two three start stop i you me they", Whisper will 
be biased towards outputting lowercase letters and no punctuation, and may also be biased towards outputting the words in the prompt more often.
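
With the plain Whisper Python API this corresponds to the `initial_prompt` parameter (a sketch; the prompt and file name are only examples):

```python
import whisper

model = whisper.load_model("medium")

# Bias the output towards lowercase text without punctuation.
result = model.transcribe(
    "audio.mp3",
    initial_prompt="hello how is it going always use lowercase no punctuation goodbye",
)
print(result["text"])
```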

## Temperature
The temperature to use when sampling. Default is 0 (zero). A higher temperature will result in more random output, while a lower temperature will be more deterministic.

## Best Of - Non-zero temperature
The number of candidates to sample from when sampling with non-zero temperature. Default is 5.

## Beam Size - Zero temperature
The number of beams to use in beam search when sampling with zero temperature. Default is 5.

## Patience - Zero temperature
The patience value to use in beam search when sampling with zero temperature. As in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search.

## Length Penalty - Any temperature
The token length penalty coefficient (alpha) to use when sampling with any temperature. As in https://arxiv.org/abs/1609.08144, uses simple length normalization by default.

## Suppress Tokens - Comma-separated list of token IDs
A comma-separated list of token IDs to suppress during sampling. The default value of "-1" will suppress most special characters except common punctuation.

## Condition on previous text
If True, provide the previous output of the model as a prompt for the next window. Disabling this may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop.

## FP16
Whether to perform inference in fp16. True by default.
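
The options above map more or less one-to-one onto Whisper's decoding options. A sketch of how they can be passed to `transcribe` (the values shown are the defaults documented above):

```python
import whisper

model = whisper.load_model("medium")
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,                  # deterministic decoding
    best_of=5,                        # candidates when sampling with non-zero temperature
    beam_size=5,                      # beams for zero-temperature beam search
    patience=1.0,                     # 1.0 = conventional beam search
    length_penalty=None,              # None = simple length normalization
    suppress_tokens="-1",             # suppress most special characters
    condition_on_previous_text=True,  # feed previous output as a prompt
    fp16=True,                        # run inference in float16
)
```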

## Temperature increment on fallback
The amount to increase the temperature by when falling back because the decoding fails to meet either of the thresholds below. Default is 0.2.

## Compression ratio threshold
If the gzip compression ratio is higher than this value, treat the decoding as failed. Default is 2.4.

## Logprob threshold
If the average log probability is lower than this value, treat the decoding as failed. Default is -1.0.

## No speech threshold
If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence. Default is 0.6.
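
These thresholds drive Whisper's fallback loop. In terms of the `transcribe` call, they look roughly like this (default values shown; the temperature tuple is how a 0.2 increment on fallback is expressed):

```python
import numpy as np
import whisper

model = whisper.load_model("medium")
result = model.transcribe(
    "audio.mp3",
    # Start at temperature 0 and fall back in 0.2 increments whenever a window
    # fails one of the thresholds below.
    temperature=tuple(np.arange(0.0, 1.0 + 1e-6, 0.2)),
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
)
```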

## Diarization - Min Speakers

The minimum number of speakers for Pyannote to detect.

## Diarization - Max Speakers

The maximum number of speakers for Pyannote to detect.

## Repetition Penalty
- ctranslate2: repetition_penalty  
This parameter only takes effect in [faster-whisper (ctranslate2)](https://github.com/SYSTRAN/faster-whisper/issues/478).
Penalty applied to the score of previously generated tokens (set > 1 to penalize).

## No Repeat Ngram Size
- ctranslate2: no_repeat_ngram_size  
This parameter only takes effect in [faster-whisper (ctranslate2)](https://github.com/SYSTRAN/faster-whisper/issues/478).
Prevent repetitions of ngrams with this size (set 0 to disable).
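
With the faster-whisper backend, both values are passed straight to its `transcribe` method (a sketch; availability of the parameters depends on the installed faster-whisper version):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.mp3",
    repetition_penalty=1.1,   # > 1 penalizes previously generated tokens
    no_repeat_ngram_size=3,   # 0 disables the n-gram repetition block
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```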

## Whisper Filter options
**This is an experimental feature and may potentially filter out correct transcription results.**  

When enabled, this can effectively reduce Whisper hallucinations, especially with the large-v3 version of the Whisper model.  

Observations for transcriptions:
1. duration: calculated by subtracting start from end; a duration that is very short relative to the text length might indicate a hallucinated result.
1. segment_last: the last result of each segment during VAD transcription has a certain probability of being a hallucinated result.
1. avg_logprob: average log probability, ranging from logprob_threshold (default: -1) to 0; larger values are better. A value lower than -0.9 might suggest a poor result.
1. compression_ratio: gzip compression ratio, ranging from 0 to compression_ratio_threshold (default: 2.4); a higher positive value is preferable. If it is lower than 0.9, it might indicate a suboptimal result.
1. no_speech_prob: no_speech (<|nospeech|> token) probability, ranging from 0 to no_speech_threshold (default: 0.6); a smaller positive value is preferable. If it exceeds 0.1, it might suggest a suboptimal result.

Four sets of filtering conditions have been established, based on text length and duration, as well as the avg_logprob, compression_ratio, and no_speech_prob values returned by Whisper (see the sketch after this list).
1. avg_logprob < -0.9
1. (durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.5
1. (durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.07, compression_ratio < 0.9
1. (durationLen < 1.5 || segment_last), compression_ratio < 0.9, no_speech_prob > 0.1
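
A direct Python rendering of the four conditions above could look like this (an illustrative sketch of the documented rules, not the project's exact code):

```python
def is_probable_hallucination(text, duration, segment_last,
                              avg_logprob, compression_ratio, no_speech_prob):
    """Return True if a transcribed segment matches one of the four filter rules."""
    text_len = len(text.strip())
    short_or_last = duration < 1.5 or segment_last

    if avg_logprob < -0.9:
        return True
    if short_or_last and text_len > 5 and avg_logprob < -0.4 and no_speech_prob > 0.5:
        return True
    if (short_or_last and text_len > 5 and avg_logprob < -0.4
            and no_speech_prob > 0.07 and compression_ratio < 0.9):
        return True
    if short_or_last and compression_ratio < 0.9 and no_speech_prob > 0.1:
        return True
    return False
```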

## Translation - Batch Size
- transformers: batch_size  
When the pipeline uses a DataLoader (i.e. when passing a dataset, on GPU for a PyTorch model), the size of the batch to use; for inference this is not always beneficial.
- ctranslate2: max_batch_size  
The maximum batch size.

## Translation - No Repeat Ngram Size
- transformers: no_repeat_ngram_size  
Value that will be used by default in the generate method of the model for no_repeat_ngram_size. If set to int > 0, all ngrams of that size can only occur once.
- ctranslate2: no_repeat_ngram_size  
Prevent repetitions of ngrams with this size (set 0 to disable).

## Translation - Num Beams
- transformers: num_beams  
Number of beams for beam search that will be used by default in the generate method of the model. 1 means no beam search.
- ctranslate2: beam_size  
Beam size (1 for greedy search).
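
For a Hugging Face translation model, these options roughly correspond to the following (a sketch; the model name and values are illustrative):

```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="jpn_Jpan",
    tgt_lang="eng_Latn",
    batch_size=8,             # "Translation - Batch Size"
)

result = translator(
    ["こんにちは、お元気ですか?"],
    num_beams=2,              # "Translation - Num Beams"
    no_repeat_ngram_size=3,   # "Translation - No Repeat Ngram Size"
)
print(result[0]["translation_text"])
```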

## Translation - Torch Dtype float16
- transformers: torch_dtype=torch.float16  
Load the float32 translation model in float16 when the system has a supported GPU (reducing VRAM usage; not applicable to models that have already been quantized, such as CTranslate2, GPTQ, or GGUF models).

## Translation - Using Bitsandbytes
- transformers: load_in_8bit, load_in_4bit  
Load the float32 translation model as a mixed 8-bit or 4-bit quantized model when the system has a supported GPU (reducing VRAM usage; not applicable to models that have already been quantized, such as CTranslate2, GPTQ, or GGUF models).
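
Both memory-saving options correspond to standard `from_pretrained` arguments (a sketch; it assumes a CUDA GPU plus the `accelerate` and, for quantization, `bitsandbytes` packages, and the model name is only an example):

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Half-precision loading (roughly halves VRAM usage compared to float32) ...
model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M", torch_dtype=torch.float16, device_map="auto"
)

# ... or quantized loading via bitsandbytes (use one of the two flags, not both).
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M", load_in_8bit=True, device_map="auto"
)
```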