# Options
To transcribe or translate an audio file, you can copy a URL from a website (all [websites](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md)
supported by yt-dlp will work, including YouTube). Alternatively, upload an audio file (choose "All Files (*.*)"
in the file selector to select any file type, including video files) or use the microphone.
For longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option.
## Model
Select the model that Whisper will use to transcribe the audio:
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
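As a rough guide to reading the table, the sketch below picks the largest model whose approximate VRAM requirement fits a given budget. The `MODELS` list and the `pick_model` helper are hypothetical names used only for illustration and are not part of the UI; the VRAM figures are the approximate values from the table.

```python
# Rows from the table above: (size, has English-only variant, approx. VRAM in GB).
MODELS = [
    ("tiny", True, 1),
    ("base", True, 1),
    ("small", True, 2),
    ("medium", True, 5),
    ("large", False, 10),
]


def pick_model(vram_gb, english_only=False):
    """Return the largest model name that fits in vram_gb, or None if none fits.

    Hypothetical helper: falls back to the multilingual model when no
    English-only variant exists (as is the case for "large").
    """
    fitting = [(size, has_en) for size, has_en, need in MODELS if need <= vram_gb]
    if not fitting:
        return None
    size, has_en = fitting[-1]  # MODELS is ordered from smallest to largest
    return f"{size}.en" if english_only and has_en else size


print(pick_model(6))                     # medium
print(pick_model(1, english_only=True))  # base.en
```

Larger models are slower (see the relative speed column), so the largest model that fits is not always the best choice for long audio.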
## Language
Select the language, or leave it empty for Whisper to automatically detect it.
Note that if the selected language and the language in the audio differ, Whisper may start to translate the audio to the selected
language. For instance, if the audio is in English but you select Japanese, the model may translate the audio to Japanese.
## Inputs
The options "URL (YouTube, etc.)", "Upload Audio" and "Microphone Input" allow you to send an audio input to the model.
Note that the UI will only process the first valid input, i.e. if you both enter a URL and upload an audio file, it will only process
the URL.
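The "first valid input" rule can be sketched as a simple priority check. This is only an illustration: `select_input` and its parameter names are hypothetical, and the assumption that an upload takes priority over the microphone follows the order in which the options are listed above, not a documented guarantee.

```python
def select_input(url=None, upload_path=None, microphone_path=None):
    """Return (kind, value) for the first non-empty input, or None if all are empty.

    Assumed priority order: URL, then uploaded file, then microphone recording.
    """
    candidates = [
        ("url", url),
        ("upload", upload_path),
        ("microphone", microphone_path),
    ]
    for kind, value in candidates:
        if value and value.strip():
            return kind, value.strip()
    return None


# Both a URL and an upload are given: only the URL is used.
print(select_input(url="https://example.com/video", upload_path="talk.mp3"))
```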
## Task
Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
## VAD
Using a VAD will improve the timing accuracy of each transcribed line, and will prevent Whisper from getting stuck in an infinite
loop detecting the same sentence over and over again. The downside is that this may come at a cost to text accuracy, especially
with regard to unique words or names that appear in the audio. You can compensate for this by increasing the prompt window.
Note that English is very well handled by Whisper, and it is less susceptible to issues surrounding bad timings and infinite loops.
So you may only need to use a VAD for other languages, such as Japanese, or when the audio is very long.
* none
  * Run Whisper on the entire audio input.
* silero-vad
  * Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Whisper is also run
    on the gaps between each speech section, by either expanding the section up to the max merge size, or running Whisper independently
    on the non-speech section.
* silero-vad-expand-into-gaps
  * Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Each speech section will be expanded
    such that it covers any adjacent non-speech sections. For instance, if an audio file of one minute contains the speech sections
    00:00 - 00:10 (A) and 00:30 - 00:40 (B), the first section (A) will be expanded to 00:00 - 00:30, and (B) will be expanded to 00:30 - 01:00.
* silero-vad-skip-gaps
  * As above, but sections that don't contain speech according to Silero will be skipped. This will be slightly faster, but
    may cause dialogue to be skipped.
* periodic-vad
  * Create sections of speech every 'VAD - Max Merge Size' seconds. This is very fast and simple, but will potentially break
    a sentence or word in two.
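To make the section arithmetic concrete, here is a minimal sketch of the expansion used by silero-vad-expand-into-gaps and the fixed-size splitting used by periodic-vad. The function names are hypothetical, sections are plain `(start, end)` tuples in seconds, and the way gaps are assigned (each gap goes to the section before it, and the first section is pulled back to 0) is an assumption inferred from the one-minute example above, not taken from the actual implementation.

```python
def expand_into_gaps(sections, duration):
    """Stretch each (start, end) speech section over the gap that follows it.

    Assumption: the first section starts at 0, every section ends where the
    next one begins, and the last section runs to the end of the audio.
    """
    sections = sorted(sections)
    expanded = []
    for i, (start, _end) in enumerate(sections):
        new_start = 0.0 if i == 0 else start
        new_end = sections[i + 1][0] if i + 1 < len(sections) else duration
        expanded.append((new_start, new_end))
    return expanded


def periodic_sections(duration, period):
    """Split the audio into back-to-back sections of at most `period` seconds."""
    sections = []
    start = 0.0
    while start < duration:
        sections.append((start, min(start + period, duration)))
        start += period
    return sections


# The example from the list: A = 00:00-00:10, B = 00:30-00:40, one minute of audio.
print(expand_into_gaps([(0.0, 10.0), (30.0, 40.0)], 60.0))
# [(0.0, 30.0), (30.0, 60.0)]

print(periodic_sections(70.0, 30.0))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 70.0)]
```

Note how the periodic split ignores speech boundaries entirely, which is why it can cut a sentence or word in two.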
## VAD - Merge Window
If set, any adjacent speech sections that are at most this number of seconds apart will be automatically merged.
## VAD - Max Merge Size (s)
Adjacent speech sections will not be merged if the resulting section would be longer than this number of seconds.
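The interaction between the merge window and the max merge size can be sketched as follows. This is an illustrative reading of the two options, not the UI's actual code: `merge_sections` is a hypothetical helper, and it assumes that the size limit applies to the length of the section that would result from the merge.

```python
def merge_sections(sections, merge_window, max_merge_size):
    """Merge adjacent (start, end) sections that are close enough together.

    Two neighbours are merged when the gap between them is at most
    `merge_window` seconds and the merged section would be at most
    `max_merge_size` seconds long (assumed interpretation).
    """
    merged = []
    for start, end in sorted(sections):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            new_end = max(prev_end, end)
            if gap <= merge_window and (new_end - prev_start) <= max_merge_size:
                merged[-1] = (prev_start, new_end)
                continue
        merged.append((start, end))
    return merged


sections = [(0, 4), (6, 9), (20, 25)]

# 2-second gap is within the 5-second window; the 11-second gap is not.
print(merge_sections(sections, merge_window=5, max_merge_size=30))
# [(0, 9), (20, 25)]

# Same window, but a merged section of 9 seconds exceeds the 8-second limit.
print(merge_sections(sections, merge_window=5, max_merge_size=8))
# [(0, 4), (6, 9), (20, 25)]
```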
## VAD - Padding (s)
The number of seconds (floating point) to add to the beginning and end of each speech section. Setting this to a number
larger than zero makes it more likely that Whisper correctly transcribes a sentence at the beginning of
a speech section. However, this also increases the probability of Whisper assigning the wrong timestamp
to each transcribed line. The default value is 1 second.
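A minimal sketch of what padding does to the section boundaries is shown below. The `pad_sections` helper is hypothetical, and clamping the padded section to the start and end of the audio is an assumption made so that the result stays within the file.

```python
def pad_sections(sections, padding, duration):
    """Add `padding` seconds before and after each (start, end) section.

    Assumption: padded boundaries are clamped to the range [0, duration].
    """
    return [
        (max(0.0, start - padding), min(duration, end + padding))
        for start, end in sections
    ]


# With the default padding of 1 second on a 60-second file:
print(pad_sections([(0.5, 10.0), (30.0, 59.5)], padding=1.0, duration=60.0))
# [(0.0, 11.0), (29.0, 60.0)]
```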
## VAD - Prompt Window (s)
The text of a detected line will be included as a prompt to the next speech section, if the speech section starts at most this
number of seconds after the line has finished. For instance, if a line ends at 10:00, and the next speech section starts at
10:04, the line's text will be included if the prompt window is 4 seconds or more (10:04 - 10:00 = 4 seconds).
Note that detected lines in gaps between speech sections will not be included in the prompt
(if silero-vad or silero-vad-expand-into-gaps is used).
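The prompt window check described above amounts to a single comparison. The sketch below uses a hypothetical `include_in_prompt` helper with times given in seconds; it only illustrates the rule and is not the UI's actual code.

```python
def include_in_prompt(line_end, section_start, prompt_window):
    """Return True if a line ending at `line_end` should be used as a prompt
    for a speech section starting at `section_start` (all values in seconds)."""
    return (section_start - line_end) <= prompt_window


line_end = 10 * 60           # the line ends at 10:00
section_start = 10 * 60 + 4  # the next speech section starts at 10:04

print(include_in_prompt(line_end, section_start, prompt_window=4))  # True
print(include_in_prompt(line_end, section_start, prompt_window=3))  # False
```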