# Whisper-WebUI
A Gradio-based browser interface for [Whisper](https://github.com/openai/whisper). You can use it as an Easy Subtitle Generator!
![Whisper WebUI](https://github.com/jhj0517/Whisper-WebUI/blob/master/screenshot.png)
## Notebook
If you wish to try this on Colab, you can do so [here](https://colab.research.google.com/github/jhj0517/Whisper-WebUI/blob/master/notebook/whisper-webui.ipynb)!
# Features
- Select the Whisper implementation you want to use from:
- [openai/whisper](https://github.com/openai/whisper)
- [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper) (used by default)
- [Vaibhavs10/insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper)
- Generate subtitles from various sources, including:
- Files
- YouTube
- Microphone
- Currently supported subtitle formats:
- SRT
- WebVTT
- TXT (plain text without timestamps)
- Speech to Text Translation
- From other languages to English (this is Whisper's end-to-end speech-to-text translation feature).
- Text to Text Translation
- Translate subtitle files using Facebook NLLB models (see the sketch after this list)
- Translate subtitle files using DeepL API
- Pre-processing audio input with [Silero VAD](https://github.com/snakers4/silero-vad) (sketched below).
- Pre-processing audio input to separate BGM with [UVR](https://github.com/Anjok07/ultimatevocalremovergui).
- Post-processing with speaker diarization using the [pyannote](https://huggingface.co/pyannote/speaker-diarization-3.1) model (sketched below).
- To download the pyannote model, you need a Hugging Face token and must manually accept the terms on the pages below:
1. https://huggingface.co/pyannote/speaker-diarization-3.1
2. https://huggingface.co/pyannote/segmentation-3.0
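For illustration, here is a minimal standalone sketch of NLLB text translation using the Hugging Face `transformers` pipeline. This is an independent example, not the WebUI's own code; the model size and FLORES-200 language codes are placeholders you can swap.
```python
# Hedged sketch: translate one subtitle line with an NLLB model via
# transformers. Model choice and language codes are example values.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="kor_Hang",  # source language (Korean, FLORES-200 code)
    tgt_lang="eng_Latn",  # target language (English, FLORES-200 code)
)

print(translator("안녕하세요")[0]["translation_text"])
```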
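Similarly, a minimal sketch of what Silero VAD pre-processing looks like on its own, using the upstream `torch.hub` entry point (the file name is a placeholder; not the WebUI's own code):
```python
# Hedged sketch: detect speech segments with Silero VAD via torch.hub.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
# Each entry is a {"start": ..., "end": ...} dict in samples.
print(get_speech_timestamps(wav, model, sampling_rate=16000))
```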
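And a minimal sketch of pyannote speaker diarization, which is where the Hugging Face token from the steps above comes in (`hf_...` and the file name are placeholders):
```python
# Hedged sketch: run pyannote speaker diarization on an audio file.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # your Hugging Face token (terms accepted)
)

diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```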
# Installation and Running
- ## Running with Pinokio
The app can be run with [Pinokio](https://github.com/pinokiocomputer/pinokio).
1. Install [Pinokio Software](https://program.pinokio.computer/#/?id=install).
2. Open the software, search for Whisper-WebUI, and install it.
3. Start Whisper-WebUI and connect to `http://localhost:7860`.
- ## Running with Docker
1. Install and launch [Docker-Desktop](https://www.docker.com/products/docker-desktop/).
2. Git clone the repository
```sh
git clone https://github.com/jhj0517/Whisper-WebUI.git
```
3. Build the image (the image is about 7 GB)
```sh
docker compose build
```
4. Run the container
```sh
docker compose up
```
5. Connect to the WebUI with your browser at `http://localhost:7860`
If needed, update the [`docker-compose.yaml`](https://github.com/jhj0517/Whisper-WebUI/blob/master/docker-compose.yaml) to match your environment.
- ## Run Locally
### Prerequisite
To run this WebUI, you need `git`, `python` (version `3.10`–`3.12`), and `FFmpeg`. <br>
If you're not using an Nvidia GPU, or you're using a `CUDA` version other than 12.4, edit the [`requirements.txt`](https://github.com/jhj0517/Whisper-WebUI/blob/master/requirements.txt) to match your environment.
Please follow the links below to install the necessary software:
- git : [https://git-scm.com/downloads](https://git-scm.com/downloads)
- python : [https://www.python.org/downloads/](https://www.python.org/downloads/) **`3.10 ~ 3.12` is recommended.**
- FFmpeg : [https://ffmpeg.org/download.html](https://ffmpeg.org/download.html)
- CUDA : [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)
After installing FFmpeg, **make sure to add the `FFmpeg/bin` folder to your system PATH!**
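To sanity-check the prerequisites before installing, you can run a quick check like this (a standalone sketch, not part of the repository):
```python
# Hedged sketch: verify the Python version and that git/FFmpeg are on PATH.
import shutil
import sys

assert (3, 10) <= sys.version_info[:2] <= (3, 12), "Python 3.10-3.12 required"
for tool in ("git", "ffmpeg"):
    print(f"{tool}: {shutil.which(tool) or 'NOT FOUND - check your PATH'}")
```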
### Automatic Installation
1. git clone this repository
```shell
git clone https://github.com/jhj0517/Whisper-WebUI.git
```
2. Run `install.bat` or `install.sh` to install dependencies. (This creates a `venv` directory and installs the dependencies there.)
3. Start the WebUI with `start-webui.bat` or `start-webui.sh`. (This runs `python app.py` after activating the venv.)
You can also run the project with command-line arguments; see the [wiki](https://github.com/jhj0517/Whisper-WebUI/wiki/Command-Line-Arguments) for a guide to the available arguments.
# VRAM Usage
This project is integrated with [faster-whisper](https://github.com/SYSTRAN/faster-whisper) by default for better VRAM usage and transcription speed.
According to faster-whisper, the efficiency of the optimized Whisper model is as follows:
| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
|-------------------|-----------|-----------|-------|-----------------|-----------------|
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
If you want to use an implementation other than faster-whisper, use the `--whisper_type` arg with the repository name.<br>
Read the [wiki](https://github.com/jhj0517/Whisper-WebUI/wiki/Command-Line-Arguments) for more info about CLI args.
## Available models
This is the original VRAM usage table for the available Whisper models.
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~32x |
| base | 74 M | `base.en` | `base` | ~1 GB | ~16x |
| small | 244 M | `small.en` | `small` | ~2 GB | ~6x |
| medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x |
| large | 1550 M | N/A | `large` | ~10 GB | 1x |
`.en` models are English-only, and the cool thing is that you can use the `Translate to English` option with the multilingual (non-`.en`) models (see the sketch below)!
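For reference, the models above map directly onto faster-whisper's Python API; here is a minimal standalone sketch (file name and model size are placeholders), with `task="translate"` showing the `Translate to English` option:
```python
# Hedged sketch: transcribe (or translate) a file with faster-whisper.
# fp16 + beam size 5 matches the benchmark table above.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
# task="translate" is Whisper's built-in any-language -> English mode,
# available on the multilingual (non-.en) models.
segments, info = model.transcribe("audio.mp3", beam_size=5, task="translate")
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```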
## TODO
- [x] Add DeepL API translation
- [x] Add NLLB Model translation
- [x] Integrate with faster-whisper
- [x] Integrate with insanely-fast-whisper
- [x] Integrate with whisperX (only the speaker diarization part)
- [x] Add background music separation pre-processing with [UVR](https://github.com/Anjok07/ultimatevocalremovergui)
- [ ] Add FastAPI script
- [ ] Support real-time transcription for microphone
### Translation
Any PRs that add Japanese, Spanish, French, German, Chinese, or any other language translations to [translation.yaml](https://github.com/jhj0517/Whisper-WebUI/blob/master/configs/translation.yaml) would be greatly appreciated!