Upload 18 files
- .gitignore +15 -0
- README.md +144 -13
- Voice.py +159 -0
- app_state.py +7 -0
- diarize.py +75 -0
- dub_line.py +135 -0
- language_detection.py +13 -0
- loading subs pseudocode +20 -0
- main.py +0 -0
- requirements-linux310.txt +237 -0
- requirements-win-310.txt +0 -0
- requirements.txt +14 -0
- synth.py +33 -0
- test.py +12 -0
- utils.py +53 -0
- video.py +219 -0
- vocal_isolation.py +47 -0
- weeablind.py +163 -0
.gitignore
ADDED
@@ -0,0 +1,15 @@
venv
__pycache__
.venv
output/
*.mkv
*.wav
*.mp3
*.mp4
*.webm
pretrained_models
tmp
dist
build
*.spec
audio_cache

README.md
CHANGED
@@ -1,13 +1,144 @@
# Weeablind

A program to dub multi-lingual media and anime using modern AI speech synthesis, diarization, language identification, and voice cloning.

## Why

Many shows, movies, news segments, interviews, and videos will never receive proper dubs in other languages, and dubbing something from scratch can be an enormous undertaking. This presents a common accessibility hurdle for people with blindness, dyslexia, learning disabilities, or simply folks who don't enjoy reading subtitles. This program aims to create a pleasant alternative for folks facing these struggles.

This software is a product of war. My sister turned me onto my now-favorite comedy anime, "The Disastrous Life of Saiki K.", but Netflix never ordered a dub for the 2nd season. I'm blind and cannot and will not ever be able to read subtitles, but I MUST know how the story progresses! Netflix has forced my hand and I will bring AI-dubbed anime to the blind!

## How

This project relies on some rudimentary slapping together of state-of-the-art technologies. It uses numerous audio processing libraries and techniques to analyze and synthesize speech that tries to stay in line with the source video file. It primarily relies on FFmpeg and pydub for audio and video editing, Coqui TTS for speech synthesis, speechbrain for language identification, and pyannote.audio for speaker diarization.

You have the option of dubbing every subtitle in the video, setting the start and end times, dubbing only foreign-language content, or full-blown multi-speaker dubbing with speaking-rate and volume matching.

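At its core, the pipeline parses each subtitle line, synthesizes it, and overlays it onto the source audio at the subtitle's timestamp. Here's a minimal sketch of that idea (not the app's exact code path, which lives in `dub_line.py` and `video.py`; the file names below are placeholders):

```
import srt
from TTS.api import TTS
from pydub import AudioSegment

tts = TTS("tts_models/en/vctk/vits")          # the multi-speaker VITS model the app loads by default
audio = AudioSegment.from_file("video.mkv")   # pydub reads the audio track via ffmpeg

with open("video.srt", encoding="utf-8") as f:
    for line in srt.parse(f.read()):
        tts.tts_to_file(line.content, speaker="p326", file_path="line.wav")
        dub = AudioSegment.from_wav("line.wav")
        audio = audio.overlay(dub, position=line.start.total_seconds() * 1000)

audio.export("dub-track.wav", format="wav")
```

The real implementation also time-stretches each synthesized line with audiotsm so it fits the subtitle's duration and matches its volume to the original speaker.
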
## When?

This project is currently in what some might call alpha. The major, core functionality is in place, and it's possible to use by cloning the repo, but it's only starting to be ready for a first release. There are numerous optimizations, UX improvements, and refactors that need to happen before a first release candidate. Stay tuned for regular updates, and feel free to extend a hand with contributions, testing, or suggestions if this is something you're interested in.

## The Name

I had the idea to call the software Weeablind as a portmanteau of Weeaboo (someone a little too obsessed with anime) and blind. I might change it to something else in the future, like Blindtaku, DubHub, or something similar and catchier, because the software can be used for far more than just anime.

## Setup

There are currently no prebuilt binaries to download. This is something I am looking into, but many of these dependencies are not easy to bundle with something like PyInstaller.

The program works best on Linux, but will also run on Windows.

### System Prerequisites
You will need to install [FFmpeg](https://ffmpeg.org/download.html) on your system and make sure it's callable from the terminal or on your system PATH.

For using Coqui TTS, you will also need Espeak-NG, which you can get from your package manager on Linux or [here](https://github.com/espeak-ng/espeak-ng/releases) on Windows.

On Windows, pip requires MSVC Build Tools to build Coqui. You can install it here:
https://visualstudio.microsoft.com/visual-cpp-build-tools/

Coqui TTS and Pyannote diarization will both perform better if you have CUDA set up on your system to use your GPU. This should work out of the box on Linux, but getting it set up on Windows takes some doing. This [blog post](https://saturncloud.io/blog/how-to-run-mozilla-ttscoqui-tts-training-with-cuda-on-a-windows-system/) should walk you through the process. If you can't get it working, don't fret, you can still use them on your CPU.

The latest version of Python works on Linux, but Spleeter only works on 3.10 and Pyannote can be finicky with that too. 3.10 seems to work best on Windows; you can get it from the Microsoft Store.

### Setup from Source
To use the project, you'll need to clone the repository and install the dependencies in a virtual environment.

```
git clone https://github.com/FlorianEagox/weeablind.git
cd weeablind
python3.10 -m venv venv
# Windows
.\venv\Scripts\activate
# Linux
source ./venv/bin/activate
```
This project has a lot of dependencies, and pip can struggle with conflicts, so it's best to install from the lock file like this:
```
pip install -r requirements-win-310.txt --no-deps
```
You can try the regular requirements file, but it can take a heck of a long time and sometimes requires some rejiggering.

Installing the dependencies can take a hot minute and uses a lot of space (~8 GB).

If you don't need certain features, for instance language filtering, you can omit speechbrain from the requirements.

Once this is completed, you can run the program with

```
python weeablind.py
```

## Usage
Start by either selecting a video from your computer or pasting a link to a YouTube video and pressing enter. It should download the video and load the subs and audio.

### Loading a video
Once a video is loaded, you can preview the subtitles that will be dubbed. If the wrong language is loaded, or the wrong audio stream, switch to the Streams tab and select the correct ones.

### Cropping
You can specify a start and end time if you only need to dub a section of the video, for example to skip the opening theme and credits of a show. Use timecode syntax like 2:17 and press enter.

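Timecodes are read as `[hours:]minutes:seconds`, so `2:17` means 2 minutes and 17 seconds into the video. This mirrors `timecode_to_seconds` in `utils.py`:

```
def timecode_to_seconds(timecode):
    parts = list(map(float, timecode.split(':')))
    seconds = parts[-1]
    if len(parts) > 1:
        seconds += parts[-2] * 60    # minutes
    if len(parts) > 2:
        seconds += parts[-3] * 3600  # hours
    return seconds

timecode_to_seconds("2:17")  # 137.0
```
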
### Configuring Voices
By default, a "Sample" voice should be initialized. You can play around with different configurations and test the voice before dubbing with the "Sample Voice" button in the "Configure Voices" tab. When you have parameters you're happy with, clicking "Update Voices" will re-assign the voice to that slot. If you choose the SYSTEM TTS engine, the program will use Windows' SAPI5 narrator voices or Linux's espeak voices by default. This is extremely fast but sounds very robotic. Selecting Coqui gives you a TON of options to play around with, but you will be prompted to download often very heavy TTS models. VCTK/VITS is my favorite model to dub with, as it's very quick even on CPU, and there are hundreds of speakers to choose from; it is loaded by default. If you have run diarization, you can select different voices from the listbox and change their properties as well.

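If you want a feel for what the Coqui option does under the hood, this is roughly the equivalent of the default "Sample" voice (see `app_state.py` and `Voice.py`); the model is fetched on first use and the speaker IDs are VCTK speaker names:

```
from TTS.api import TTS

tts = TTS()                                            # no model loaded yet
print(tts.list_models())                               # all of Coqui's published models
tts.load_tts_model_by_name("tts_models/en/vctk/vits")  # the default multi-speaker model
print(tts.speakers)                                    # hundreds of VCTK speakers, e.g. "p326"
tts.tts_to_file("Testing the sample voice.", speaker="p326", file_path="sample.wav")
```
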
### Language Filtering
In the Subtitles tab, you can filter the subtitles to exclude lines spoken in your selected language so only the foreign-language lines get dubbed. This is useful for multi-lingual videos, but not for videos entirely in one language.

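Under the hood, each subtitle's audio snippet is classified with speechbrain's VoxLingua107 language-ID model (see `language_detection.py`), roughly like this:

```
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp"
)
signal = classifier.load_audio("video_snippet.wav")
prediction = classifier.classify_batch(signal)
print(prediction[3][0])  # e.g. "en: English"
```

Lines whose detected language matches the excluded language are dropped from the dubbing queue.
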
### Diarization
Running diarization will attempt to assign the correct speaker to all the subtitles and generate random voices for the total number of speakers detected. In the future, you'll be able to specify the diarization pipeline and the number of speakers if you know them ahead of time. Diarization is only useful for videos with multiple speakers, and its accuracy can vary massively.

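The diarization step in miniature (see `diarize.py`): pyannote segments the cropped audio into per-speaker time spans, writes them to an RTTM file, and each subtitle is then assigned the voice of the nearest speaker segment. The gated pyannote model requires a Hugging Face access token.

```
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = pipeline("cropped-audio.wav")
with open("video.rttm", "w") as rttm:
    diarization.write_rttm(rttm)  # each line holds a speaker label, start time, and duration
```
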
### Background Isolation
In the "Streams" tab, you can run vocal isolation, which will attempt to remove the vocals from your source audio track while retaining the background. If you're using a multi-lingual video and also running language filtering, you'll need to run the filtering first to keep the English (or whatever the source language is) vocals.

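Isolation uses Spleeter's two-stem model (see `vocal_isolation.py`), which splits the exported audio into a vocals file and an accompaniment (background) file:

```
from spleeter.separator import Separator

separator = Separator('spleeter:2stems')
separator.separate_to_file(
    "MyVideo-audio.wav", "./output/",
    filename_format='{filename}-{instrument}.{codec}',
)
# -> ./output/MyVideo-audio-vocals.wav and ./output/MyVideo-audio-accompaniment.wav
```

The accompaniment track is then used as the background that the dub gets mixed over.
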
### Dubbing
Once you've configured things how you like, you can press the big, JUICY run dubbing button. This can take a while to run. Once completed, you should have something like "MyVideo-dubbed.mkv" in the `output` directory. This is your finished video!

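The final step mixes the synthesized dub track with the original (or background-isolated) audio and remuxes it with the untouched video stream (see `Video.mix_av`); roughly:

```
import ffmpeg

video_in = ffmpeg.input("MyVideo.mkv")
dub_in = ffmpeg.input("MyVideo-dubtrack.wav").audio
mixed = ffmpeg.filter([video_in.audio, dub_in], "amix", duration="first", weights="1 1")
(
    ffmpeg
    .output(video_in["v"], mixed, "output/MyVideo-dubbed.mkv", vcodec="copy", acodec="aac")
    .run(overwrite_output=True)
)
```
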
## Things to do
- A better filtering system for language detection. Maybe inclusive and exclusive modes, or a confidence threshold
- Find some less copyrighted multi-lingual / non-English content to display demos publicly
- De-anglicize it so the user can select their target language instead of just English
- FIX PYDUB'S STUPID ARRAY DISTORTION so we don't have to perform 5 IO operations per dub!!!
- ~~run a vocal isolation / remover on the source audio to remove / mitigate the original speakers?~~
- ~~A proper setup guide for all platforms~~
- remove or fix the broken espeak implementation to be cross-platform
- ~~Uninitialized singletons for heavy models upon startup (e.g. only initialize pyannote/speechbrain pipelines when needed)~~
- Abstraction for singletons of Coqui voices using the same model to reduce memory footprint
- ~~GUI tab to list and select audio / subtitle streams w/ FFMPEG~~
- ~~Move the tabs into their own classes~~
- ~~Add labels and screen reader landmarks to all the controls~~
- ~~Single speaker or multi speaker control switch~~
- ~~Download YouTube video with Closed Captions~~
- ~~GUI to select start and end time for dubbing~~
- Throw up a Flask server on my website so you can try it with minimal features.
- Use OCR to generate subtitles for videos that don't have sub streams
- Use OCR for non-text based subtitles
- Make a cool logo?
- Learn how to package python programs as binaries to make releases
- ~~Remove the copyrighted content from this repo (sorry not sorry TV Tokyo)~~
- Save and import config files for later
- ~~Support for all subtitle formats~~
- Maybe slap in an ASR library for videos without subtitles?
- Maybe support for magnet URLs or the arrLib to pirate media (who knows???)

### Diarization
- Filter subtitles by the selected voice from the listbox
- Select from multiple diarization models / pipelines
- Optimize audio tracks for diarization by isolating lines of speech based on subtitle timings
- Investigate Diart?

### TTS

- ~~Rework the speed control to use PyDub to speed up audio.~~
- ~~match the volume of the speaker to TTS~~
- Checkbox to remove sequential subtitle entries and entries that are tiny, e.g. "nom" "nom" "nom" "nom"
- investigate voice conversion?
- Build an asynchronous queue of operations to perform
- Started - Asynchronous GUI for Coqui model downloads
- Add support for MyCroft Mimic 3
- Add support for PiperTTS

### Cloning
- Create a cloning mode to select subtitles and export them to a dataset or wav compilation for Coqui XTTS
- Use diaries and subtitles to isolate and build training datasets
- Build a tool to streamline the manual creation of datasets

###### (oh god that's literally so many things, the scope of this has gotten so big, how will this ever become a thing)

Voice.py
ADDED
@@ -0,0 +1,159 @@
from enum import Enum, auto
import abc
import os
import threading
from time import sleep
from TTS.api import TTS
from TTS.utils import manage
import pyttsx3
from espeakng import ESpeakNG
import numpy as np
from torch.cuda import is_available
import time

class Voice(abc.ABC):
    class VoiceType(Enum):
        ESPEAK = "ESpeak"
        COQUI = "Coqui TTS"
        SYSTEM = "System Voices"

    def __new__(cls, voice_type, init_args=[], name="Unnamed"):
        if cls is Voice:
            if voice_type == cls.VoiceType.ESPEAK:
                return super().__new__(ESpeakVoice)
            elif voice_type == cls.VoiceType.COQUI:
                return super().__new__(CoquiVoice)
            elif voice_type == cls.VoiceType.SYSTEM:
                return super().__new__(SystemVoice)
        else:
            return super().__new__(cls)

    def __init__(self, voice_type, init_args=[], name="Unnamed"):
        self.voice_type = voice_type
        self.name = name
        self.voice_option = None

    @abc.abstractmethod
    def speak(self, text, file_name):
        pass

    def set_speed(self, speed):
        pass

    @abc.abstractmethod
    def set_voice_params(self, voice=None, pitch=None):
        pass

    @abc.abstractmethod
    def list_voice_options(self):
        pass

    def calibrate_rate(self):
        output_path = './output/calibration.wav'
        calibration_phrase_long = "In the early morning light, a vibrant scene unfolds as the quick brown fox jumps gracefully over the lazy dog. The fox's russet fur glistens in the sun, and its swift movements captivate onlookers. With a leap of agility, it soars through the air, showcasing its remarkable prowess. Meanwhile, the dog, relaxed and unperturbed, watches with half-closed eyes, acknowledging the fox's spirited display. The surrounding nature seems to hold its breath, enchanted by this charming spectacle. The gentle rustling of leaves and the distant chirping of birds provide a soothing soundtrack to this magical moment. The two animals, one lively and the other laid-back, showcase the beautiful harmony of nature, an ageless dance that continues to mesmerize all who witness it."
        calibration_phrase_chair = "A chair is a piece of furniture with a raised surface used to sit on, commonly for use by one person. Chairs are most often supported by four legs and have a back; however, a chair can have three legs or could have a different shape. A chair without a back or arm rests is a stool, or when raised up, a bar stool."
        calibration_phrase = "Hello? Testing, testing. Is.. is this thing on? Ah! Hello Gordon! I'm... assuming that's your real name... You wouldn't lie to us. Would you? Well... You finally did it! You survived the resonance cascade! You brought us all to hell and back, alive! You made it to the ultimate birthday bash at the end of the world! You beat the video game! And... now I imagine you'll... shut it down. Move on with your life. Onwards and upwards, ay Gordon? I don't.. know... how much longer I have to send this to you so I'll try to keep it brief. Not my specialty. Perhaps this is presumptuous of me but... Must this really be the end of our time together? Perhaps you could take the science team's data, transfer us somewhere else, hmm? Now... it doesn't have to be Super Punch-Out for the Super Nintendo Entertainment System. Maybe a USB drive, or a spare floppy disk. You could take us with you! We could see the world! We could... I'm getting a little ahead of myself, surely. Welp! The option's always there! You changed our lives, Gordon. I'd like to think it was for the better. And I don't know what's going to happen to us once you exit the game for good. But I know we'll never forget you. I hope you won't forget us. Well... This is where I get off. Goodbye Gordon!"
        self.speak(calibration_phrase, output_path)

def get_wpm(words, duration):
    return (len(words.split(' ')) / duration * 60)

class ESpeakVoice(Voice):
    def __init__(self, init_args=[], name="Unnamed"):
        super().__init__(Voice.VoiceType.ESPEAK, init_args, name)
        self.voice = ESpeakNG()  # espeak-ng wrapper backing this voice; this backend is still rough around the edges
        self.set_voice_params(init_args)

    def speak(self, text, file_name):
        self.voice.synth_wav(text, file_name)

    def set_speed(self, speed):
        # current_speaker.set_speed(60*int((len(text.split(' ')) / (sub.end.total_seconds() - sub.start.total_seconds()))))
        self.voice.speed = speed

    def set_voice_params(self, voice=None, pitch=None):
        if voice:
            self.voice.voice = voice
        if pitch:
            self.voice.pitch = pitch

    def list_voice_options(self):
        # Optionally, you can return available voice options for ESpeak here
        pass

class CoquiVoice(Voice):
    def __init__(self, init_args=None, name="Coqui Voice"):
        super().__init__(Voice.VoiceType.COQUI, init_args, name)
        self.voice = TTS().to('cuda' if is_available() else 'cpu')
        self.langs = ["All Languages"] + list({lang.split("/")[1] for lang in self.voice.list_models()})
        self.langs.sort()
        self.selected_lang = 'en'
        self.is_multispeaker = False
        self.speaker = None
        self.speaker_wav = None

    def speak(self, text, file_path=None):
        if file_path:
            return self.voice.tts_to_file(
                text,
                file_path=file_path,
                speaker=self.speaker,
                language='en' if self.voice.is_multi_lingual else None,
                speaker_wav=self.speaker_wav
            )
        else:
            return np.array(self.voice.tts(
                text,
                speaker=self.speaker,
                language='en' if self.voice.is_multi_lingual else None
            ))

    def set_voice_params(self, voice=None, speaker=None, speaker_wav=None, progress=None):
        if voice and voice != self.voice_option:
            if progress:
                progress(0, "downloading")
                download_thread = threading.Thread(target=self.voice.load_tts_model_by_name, args=(voice,))
                download_thread.start()
                while download_thread.is_alive():
                    # I'll remove this check if they accept my PR c:
                    bar = manage.tqdm_progress if hasattr(manage, "tqdm_progress") else None
                    if bar:
                        progress_value = int(100*(bar.n / bar.total))
                        progress(progress_value, "downloading")
                    time.sleep(0.25) # Adjust the interval as needed
                progress(-1, "done!")
            else:
                self.voice.load_tts_model_by_name(voice)
            self.voice_option = self.voice.model_name
            self.is_multispeaker = self.voice.is_multi_speaker
        self.speaker = speaker

    def list_voice_options(self):
        return self.voice.list_models()

    def is_model_downloaded(self, model_name):
        return os.path.exists(os.path.join(self.voice.manager.output_prefix, self.voice.manager._set_model_item(model_name)[1]))

    def list_speakers(self):
        return self.voice.speakers if self.voice.is_multi_speaker else []

class SystemVoice(Voice):
    def __init__(self, init_args=[], name="Unnamed"):
        super().__init__(Voice.VoiceType.SYSTEM, init_args, name)
        self.voice = pyttsx3.init()
        self.voice_option = self.voice.getProperty('voice')

    def speak(self, text, file_name):
        self.voice.save_to_file(text, file_name)
        self.voice.runAndWait()
        return file_name

    def set_speed(self, speed):
        self.voice.setProperty('rate', speed)

    def set_voice_params(self, voice=None, pitch=None):
        if voice:
            self.voice.setProperty('voice', voice)
            self.voice_option = self.voice.getProperty('voice')

    def list_voice_options(self):
        return [voice.name for voice in self.voice.getProperty('voices')]

app_state.py
ADDED
@@ -0,0 +1,7 @@
from Voice import Voice

video = None
speakers = [Voice(Voice.VoiceType.COQUI, name="Sample")]
speakers[0].set_voice_params('tts_models/en/vctk/vits', 'p326') # p340
current_speaker = speakers[0]
sample_speaker = current_speaker

diarize.py
ADDED
@@ -0,0 +1,75 @@
# This file contains all functions related to diarizing a video, including optimization and processing a speech diary (RTTM file).
# These functions use a functional approach, as I didn't want to group them and bloat the video class with such specific functions.
# Perhaps going forward I should abstract diary entries as their own objects, similar to dub_line, but I haven't decided yet, as diaries might be useful for voice cloning as well.

import app_state
import utils
from Voice import Voice
from pyannote.audio import Pipeline
import torchaudio.transforms as T
import torchaudio
import random

pipeline = None

# Read RTTM files generated by Pyannote into an array containing the speaker, start, and end of their speech in the audio
def load_diary(file):
    diary = []
    with open(file, 'r', encoding='utf-8') as diary_file:
        for line in diary_file.read().strip().split('\n'):
            line_values = line.split(' ')
            diary.append([line_values[7], float(line_values[3]), float(line_values[4])])
    total_speakers = len(set(line[0] for line in diary))
    app_state.speakers = initialize_speakers(total_speakers)
    return diary

# Time-shift the speech diary to be in line with the start time
def update_diary_timing(diary, start_time):
    return [[int(line[0].split('_')[1]), line[1] + start_time, line[2]] for line in diary]

def initialize_speakers(speaker_count):
    speakers = []
    speaker_options = app_state.sample_speaker.list_speakers()
    for i in range(speaker_count):
        speakers.append(Voice(Voice.VoiceType.COQUI, f"Voice {i}"))
        speakers[i].set_voice_params('tts_models/en/vctk/vits', random.choice(speaker_options))
    return speakers

def find_nearest_speaker(diary, sub):
    return diary[
        utils.find_nearest(
            [diary_entry[1] for diary_entry in diary],
            sub.start
        )
    ][0]


def optimize_audio_diarization(video):
    crop = video.crop_audio(True)
    waveform, sample_rate = torchaudio.load(crop)
    # Apply noise reduction
    noise_reduce = T.Vad(sample_rate=sample_rate)
    clean_waveform = noise_reduce(waveform)

    # Normalize audio
    normalize = T.Resample(orig_freq=sample_rate, new_freq=sample_rate)
    normalized_waveform = normalize(clean_waveform)

    return normalized_waveform, sample_rate

def run_diarization(video):
    global pipeline # Probably should move this to app state?
    if not pipeline:
        pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0", use_auth_token="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")
        import torch
        pipeline.to(torch.device("cuda"))
    output = utils.get_output_path(video.file, ".rttm")
    optimized, sample_rate = optimize_audio_diarization(video)
    diarization = pipeline({"waveform": optimized, "sample_rate": sample_rate})
    with open(output, "w") as rttm:
        diarization.write_rttm(rttm)
    diary = load_diary(output)
    diary = update_diary_timing(diary, video.start_time)
    for sub in video.subs_adjusted:
        sub.voice = find_nearest_speaker(diary, sub)

dub_line.py
ADDED
@@ -0,0 +1,135 @@
from dataclasses import dataclass
from Voice import Voice
import ffmpeg
import utils
import app_state
import srt
from re import compile, sub as substitute
from pydub import AudioSegment
from audiotsm import wsola
from audiotsm.io.wav import WavReader, WavWriter
from audiotsm.io.array import ArrayReader, ArrayWriter
from speechbrain.pretrained import EncoderClassifier
import numpy as np
from language_detection import detect_language
remove_xml = compile(r'<[^>]+>|\{[^}]+\}')
language_identifier_model = None # EncoderClassifier.from_hparams(source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp")

@dataclass
class DubbedLine:
    start: float
    end: float
    text: str
    index: int
    voice: int = 0
    language: str = ""

    # This is highly inefficient as it writes and reads the same file many times
    def dub_line_file(self, match_volume=True, output=False):
        output_path = utils.get_output_path(str(self.index), '.wav', path='files')
        tts_audio = app_state.speakers[self.voice].speak(self.text, output_path)
        rate_adjusted = self.match_rate(tts_audio, self.end-self.start)
        segment = AudioSegment.from_wav(rate_adjusted)
        if match_volume:
            segment = self.match_volume(app_state.video.get_snippet(self.start, self.end), segment)
        if output:
            segment.export(output_path, format='wav')
        return segment

    # This should ideally be a much more efficient way to dub.
    # All functions should pass around numpy arrays rather than reading and writing files. For some reason, though, it gives distorted results.
    def dub_line_ram(self, output=True):
        output_path = utils.get_output_path(str(self.index), '.wav', path='files')
        tts_audio = app_state.speakers[self.voice].speak(self.text)
        rate_adjusted = self.match_rate_ram(tts_audio, self.end-self.start)
        data = rate_adjusted / np.max(np.abs(rate_adjusted))
        # This causes some kind of wacky audio distortion we NEED to fix ;C
        audio_as_int = (data * (2**15)).astype(np.int16).tobytes()
        segment = AudioSegment(
            audio_as_int,
            frame_rate=22050,
            sample_width=2,
            channels=1
        )
        if output:
            segment.export(output_path, format='wav')
        return segment

    def match_rate(self, target_path, source_duration, destination_path=None, clamp_min=0, clamp_max=4):
        if destination_path is None:
            destination_path = target_path.split('.')[0] + '-timeshift.wav'
        duration = float(ffmpeg.probe(target_path)["format"]["duration"])
        rate = duration*1/source_duration
        rate = np.clip(rate, clamp_min, clamp_max)
        with WavReader(target_path) as reader:
            with WavWriter(destination_path, reader.channels, reader.samplerate) as writer:
                tsm = wsola(reader.channels, speed=rate)
                tsm.run(reader, writer)
        return destination_path

    def match_rate_ram(self, target, source_duration, outpath=None, clamp_min=0.8, clamp_max=2.5):
        num_samples = len(target)
        target = target.reshape(1, num_samples)
        duration = num_samples / 22050
        rate = duration*1/source_duration
        rate = np.clip(rate, clamp_min, clamp_max)
        reader = ArrayReader(target)
        tsm = wsola(reader.channels, speed=rate)
        if not outpath:
            rate_adjusted = ArrayWriter(channels=1)
            tsm.run(reader, rate_adjusted)
            return rate_adjusted.data
        else:
            rate_adjusted = WavWriter(outpath, 1, 22050)
            tsm.run(reader, rate_adjusted)
            rate_adjusted.close()
            return outpath

    def match_volume(self, source_snippet, target):
        # ratio = source_snippet.rms / (target.rms | 1)
        ratio = source_snippet.dBFS - target.dBFS
        # adjusted_audio = target.apply_gain(ratio)
        adjusted_audio = target + ratio
        return adjusted_audio
        # adjusted_audio.export(output_path, format="wav")

    def get_language(self, source_snippet):
        if not self.language:
            self.language = detect_language(source_snippet)
        return self.language


def filter_junk(subs, minimum_duration=0.1, remove_repeats=True):
    filtered = []
    previous = ""
    for sub in subs:
        if (sub.end - sub.start) > minimum_duration:
            if sub.text != previous:
                filtered.append(sub)
            previous = sub.text
    return filtered

# This function is designed to handle two cases:
# 1. We just have a path to an SRT that we want to import.
# 2. We have a file containing subs that isn't an SRT (a video file, a VTT, whatever).
#    In that case, we must extract or convert the subs to SRT, and then read it in (export then import).
def load_subs(import_path="", extract_subs_path=False, filter=True):
    if extract_subs_path: # For importing an external subtitles file
        (
            ffmpeg
            .input(extract_subs_path)
            .output(import_path)
            .global_args('-loglevel', 'error')
            .run(overwrite_output=True)
        )
    with open(import_path, "r", encoding="utf-8") as f:
        original_subs = list(srt.parse(f.read()))
    return filter_junk([
        DubbedLine(
            sub.start.total_seconds(),
            sub.end.total_seconds(),
            substitute(remove_xml, '', sub.content),
            sub.index
        )
        for sub in original_subs
    ])

language_detection.py
ADDED
@@ -0,0 +1,13 @@
# This is used to detect the spoken language in an audio file.
# I wanted to abstract it into its own file, just like vocal isolation & diarization.
from speechbrain.pretrained import EncoderClassifier

language_identifier_model = None

def detect_language(file):
    global language_identifier_model
    if not language_identifier_model:
        language_identifier_model = EncoderClassifier.from_hparams(source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp") #, run_opts={"device":"cuda"})
    signal = language_identifier_model.load_audio(file)
    prediction = language_identifier_model.classify_batch(signal)
    return prediction[3][0].split(' ')[1]

loading subs pseudocode
ADDED
@@ -0,0 +1,20 @@
user loads video
    video is file:
        file has subs?
            load the first subs
            display all subs
            user selects new subs:
                load subs with given stream index
    video is YT link:
        download all subs (if any)
        subs?
            display the subs
            user selects subs (vtt)
                convert the subs to srt
                load subs
    there are no subs!?!?!:
        This is the spooky zone
        offer to upload a subtitle file?

        offer to attempt video OCR???
        attempt ASR + Translation? This would be fucking insane don't do this please don't add this feature this is literally impossible, right???

main.py
ADDED
File without changes
requirements-linux310.txt
ADDED
@@ -0,0 +1,237 @@
absl-py==2.0.0
accelerate==0.24.1
aiohttp==3.8.6
aiosignal==1.3.1
alembic==1.12.1
antlr4-python3-runtime==4.9.3
anyascii==0.3.2
anyio==3.7.1
appdirs==1.4.4
asteroid-filterbanks==0.4.0
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.1.0
audioread==3.0.1
audiotsm==0.1.2
Babel==2.13.1
bangla==0.0.2
blinker==1.7.0
bnnumerizer==0.0.2
bnunicodenormalizer==0.1.6
Brotli==1.1.0
cachetools==5.3.2
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.2
clean-fid==0.1.35
click==7.1.2
clip-anytorch==2.5.2
colorama==0.4.6
coloredlogs==15.0.1
colorlog==6.7.0
contourpy==1.1.1
coqpit==0.0.17
cycler==0.12.1
Cython==0.29.30
dateparser==1.1.8
decorator==5.1.1
docker-pycreds==0.4.0
docopt==0.6.2
einops==0.6.1
encodec==0.1.1
exceptiongroup==1.1.3
ffmpeg-python==0.2.0
filelock==3.13.1
Flask==2.3.3
flatbuffers==1.12
fonttools==4.43.1
frozenlist==1.4.0
fsspec==2023.6.0
ftfy==6.1.1
future==0.18.3
g2pkk==0.1.2
gast==0.4.0
gitdb==4.0.11
GitPython==3.1.40
google-auth==2.23.4
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
greenlet==3.0.1
grpcio==1.59.2
gruut==2.2.3
gruut-ipa==0.13.0
gruut-lang-de==2.0.0
gruut-lang-en==2.0.0
gruut-lang-es==2.0.0
gruut-lang-fr==2.0.2
h11==0.12.0
h2==4.1.0
h5py==3.10.0
hpack==4.0.0
httpcore==0.13.7
httpx==0.19.0
huggingface-hub==0.18.0
humanfriendly==10.0
hyperframe==6.0.1
HyperPyYAML==1.2.2
idna==3.4
imageio==2.31.6
inflect==5.6.2
itsdangerous==2.1.2
jamo==0.4.1
jieba==0.42.1
Jinja2==3.1.2
joblib==1.3.2
jsonlines==1.2.0
jsonmerge==1.9.2
jsonschema==4.19.2
jsonschema-specifications==2023.7.1
julius==0.2.7
k-diffusion==0.0.16
keras==2.9.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.5
kornia==0.7.0
lazy_loader==0.3
libclang==16.0.6
librosa==0.10.0
lightning==2.1.0
lightning-utilities==0.9.0
llvmlite==0.40.1
Mako==1.2.4
Markdown==3.5.1
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.7.3
mdurl==0.1.2
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
mutagen==1.47.0
networkx==2.8.8
nltk==3.8.1
norbert==0.2.1
num2words==0.5.13
numba==0.57.0
numpy==1.22.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.52
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
omegaconf==2.3.0
onnxruntime-gpu==1.16.1
opt-einsum==3.3.0
optuna==3.4.0
packaging==23.1
pandas==1.5.3
pathtools==0.1.2
Pillow==10.0.1
platformdirs==3.11.0
pooch==1.8.0
primePy==1.3
protobuf==3.20.3
psutil==5.9.6
py-espeak-ng==0.1.8
pyannote.audio==3.0.1
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
pyannote.pipeline==3.0.1
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pycryptodomex==3.19.0
pydub==0.25.1
Pygments==2.16.1
pynndescent==0.5.10
pyparsing==3.1.1
pypinyin==0.49.0
pysbd==0.3.4
python-crfsuite==0.9.9
python-dateutil==2.8.2
pytorch-lightning==2.1.0
pytorch-metric-learning==2.3.0
pyttsx3==2.90
pytz==2023.3.post1
PyYAML==6.0.1
referencing==0.30.2
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
resize-right==0.0.2
rfc3986==1.5.0
rich==13.6.0
rpds-py==0.10.6
rsa==4.9
ruamel.yaml==0.18.4
ruamel.yaml.clib==0.2.8
safetensors==0.4.0
scikit-image==0.22.0
scikit-learn==1.3.0
scipy==1.11.3
semver==3.0.2
sentencepiece==0.1.99
sentry-sdk==1.34.0
setproctitle==1.3.3
shellingham==1.5.4
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
sortedcontainers==2.4.0
soundfile==0.12.1
soxr==0.3.7
speechbrain==0.5.15
spleeter==2.4.0
SQLAlchemy==2.0.23
srt==3.5.3
sympy==1.12
tabulate==0.9.0
tbb==2021.10.0
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6.2.2
tensorflow==2.9.3
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.34.0
termcolor==2.3.0
threadpoolctl==3.2.0
tifffile==2023.9.26
tokenizers==0.13.3
torch==2.1.0
torch-audiomentations==0.11.0
torch-pitch-shift==1.2.4
torchaudio==2.1.0
torchdiffeq==0.2.3
torchmetrics==1.2.0
torchsde==0.2.6
torchvision==0.16.0
tqdm==4.64.1
trainer==0.0.31
trampoline==0.1.2
transformers==4.33.3
triton==2.1.0
TTS==0.19.1
typer==0.3.2
typing_extensions==4.8.0
tzlocal==5.2
umap-learn==0.5.4
Unidecode==1.3.7
urllib3==2.0.7
wandb==0.15.12
wcwidth==0.2.9
websockets==12.0
Werkzeug==3.0.1
wrapt==1.15.0
wxPython==4.2.1
yarl==1.9.2
yt-dlp==2023.10.13

requirements-win-310.txt
ADDED
Binary file (8.92 kB)

requirements.txt
ADDED
@@ -0,0 +1,14 @@
tts # <-- Coqui TTS engine
pyannote.audio
ffmpeg-python
srt
py-espeak-ng
pydub
# pyAudio # <--- Needed on Windows, breaks on Linux
-f https://extras.wxpython.org/wxPython4/extras/linux/gtk3/ubuntu-22.04
wxpython
pyttsx3 # <-- System TTS engine
yt-dlp # <-- Downloading YT vids
audiotsm # <-- Audio timestretching
speechbrain # <-- Audio Language Identification
spleeter # <-- Vocal / Background isolation

synth.py
ADDED
@@ -0,0 +1,33 @@
# Formerly the prototypical file, synth. Now it's just a graveyard of functions that may never return?
from pydub import AudioSegment


import concurrent.futures
from utils import get_output_path


# This function was intended to run with multiprocessing, but Coqui won't play nice with that.
def dub_task(sub, i):
    print(f"{i}/{len(subs_adjusted)}")
    try:
        return dub_line_ram(sub)
        # empty_audio = empty_audio.overlay(line, sub.start*1000)
    except Exception as e:
        print(e)
        with open(f"output/errors/{i}-rip.txt", 'w') as f:
            f.write(str(e))  # stringify so write() doesn't choke on the exception object
        # total_errors += 1

# This may be used for multithreading?
def combine_segments():
    empty_audio = AudioSegment.silent(total_duration * 1000, frame_rate=22050)
    total_errors = 0
    for sub in subs_adjusted:
        print(f"{sub.index}/{len(subs_adjusted)}")
        try:
            segment = AudioSegment.from_file(f'output/files/{sub.index}.wav')
            empty_audio = empty_audio.overlay(segment, sub.start*1000)
        except:
            total_errors += 1
    empty_audio.export('new.wav')
    print(total_errors)

test.py
ADDED
@@ -0,0 +1,12 @@
# This file is just a quick script for whatever I'm testing at the time, it's not really important

# testing XTTS / VC models

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to('cuda')

# generate speech by cloning a voice using default settings
tts.tts_to_file(text="Welcome to DougDoug, where we solve problems that no one has",
    file_path="/media/tessa/SATA SSD1/AI MODELS/cloning/output/doug.wav",
    speaker_wav="/media/tessa/SATA SSD1/AI MODELS/cloning/doug.wav",
    language="en")

utils.py
ADDED
@@ -0,0 +1,53 @@
import os.path
import app_state
import numpy as np
from pydub.playback import play
from pydub import AudioSegment
from torch.cuda import is_available

APP_NAME = "WeeaBlind"
test_video_name = "./output/download.webm"
default_sample_path = "./output/sample.wav"
test_start_time = 94
test_end_time = 1324
gpu_detected = is_available()

def create_output_dir():
    path = './output/files'
    if not os.path.exists(path):
        os.makedirs(path)

def get_output_path(input, suffix, prefix='', path=''):
    filename = os.path.basename(input)
    filename_without_extension = os.path.splitext(filename)[0]
    return os.path.join(os.path.dirname(os.path.abspath(__file__)), 'output', path, f"{prefix}{filename_without_extension}{suffix}")

def timecode_to_seconds(timecode):
    parts = list(map(float, timecode.split(':')))
    seconds = parts[-1]
    if len(parts) > 1:
        seconds += parts[-2] * 60
    if len(parts) > 2:
        seconds += parts[-3] * 3600
    return seconds

def seconds_to_timecode(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    timecode = ""
    if hours:
        timecode += f"{hours}:"
    if minutes:
        timecode += f"{minutes}:"
    timecode = f"{timecode}{seconds:05.2f}"
    return timecode

# Finds the closest element in an array to the given value
def find_nearest(array, value):
    return (np.abs(np.asarray(array) - value)).argmin()

def sampleVoice(text, output=default_sample_path):
    play(AudioSegment.from_file(app_state.sample_speaker.speak(text, output)))

snippet_export_path = get_output_path("video_snippet", "wav")

video.py
ADDED
@@ -0,0 +1,219 @@
1 |
+
"""
|
2 |
+
The Video class represents a reference to a video from either a file or web link. This class should implement the ncessary info to dub a video.
|
3 |
+
"""
|
4 |
+
|
5 |
+
from io import StringIO
|
6 |
+
import time
|
7 |
+
import ffmpeg
|
8 |
+
from yt_dlp import YoutubeDL
|
9 |
+
import utils
|
10 |
+
from pydub import AudioSegment
|
11 |
+
from dub_line import load_subs
|
12 |
+
import json
|
13 |
+
import numpy as np
|
14 |
+
import librosa
|
15 |
+
import soundfile as sf
|
16 |
+
|
17 |
+
class Video:
|
18 |
+
def __init__(self, video_URL, loading_progress_hook=print):
|
19 |
+
self.start_time = self.end_time = 0
|
20 |
+
self.downloaded = False
|
21 |
+
self.subs = self.subs_adjusted = self.subs_removed = []
|
22 |
+
self.background_track = self.vocal_track = None
|
23 |
+
self.speech_diary = self.speech_diary_adjusted = None
|
24 |
+
self.load_video(video_URL, loading_progress_hook)
|
25 |
+
|
26 |
+
|
27 |
+
# This is responsible for loading the app's audio and subtitles from a video file or YT link
|
28 |
+
def load_video(self, video_path, progress_hook=print):
|
29 |
+
sub_path = ""
|
30 |
+
if video_path.startswith("http"):
|
31 |
+
self.downloaded = True
|
32 |
+
try:
|
33 |
+
video_path, sub_path, self.yt_sub_streams = self.download_video(video_path, progress_hook)
|
34 |
+
except: return
|
35 |
+
progress_hook({"status":"complete"})
|
36 |
+
else:
|
37 |
+
self.downloaded = False
|
38 |
+
self.file = video_path
|
39 |
+
if not (self.downloaded and not sub_path):
|
40 |
+
try:
|
41 |
+
self.subs = self.subs_adjusted = load_subs(utils.get_output_path(self.file, '.srt'), sub_path or video_path)
|
42 |
+
except:
|
43 |
+
progress_hook({"status": "subless"})
|
44 |
+
self.audio = AudioSegment.from_file(video_path)
|
45 |
+
self.duration = float(ffmpeg.probe(video_path)["format"]["duration"])
|
46 |
+
if self.subs:
|
47 |
+
self.update_time(0, self.duration)
|
48 |
+
|
49 |
+
def download_video(self, link, progress_hook=print):
|
50 |
+
options = {
|
51 |
+
'outtmpl': 'output/%(id)s.%(ext)s',
|
52 |
+
'writesubtitles': True,
|
53 |
+
"subtitleslangs": ["all"],
|
54 |
+
"progress_hooks": (progress_hook,)
|
55 |
+
}
|
56 |
+
try:
|
57 |
+
with YoutubeDL(options) as ydl:
|
58 |
+
info = ydl.extract_info(link)
|
59 |
+
return ydl.prepare_filename(info), list(info["subtitles"].values())[0][-1]["filepath"] if info["subtitles"] else None, info["subtitles"]
|
60 |
+
except Exception as e:
|
61 |
+
print('AHHH\n',e,'\nAHHHHHH')
|
62 |
+
progress_hook({"status": "error", "error": e})
|
63 |
+
raise e
|
64 |
+
|
65 |
+
|
66 |
+
def update_time(self, start, end):
|
67 |
+
self.start_time = start
|
68 |
+
self.end_time = end
|
69 |
+
# clamp the subs to the crop time specified
|
70 |
+
start_line = utils.find_nearest([sub.start for sub in self.subs], start)
|
71 |
+
end_line = utils.find_nearest([sub.start for sub in self.subs], end)
|
72 |
+
self.subs_adjusted = self.subs[start_line:end_line]
|
73 |
+
if self.speech_diary:
|
74 |
+
self.update_diary_timing()
|
75 |
+
|
76 |
+
def list_streams(self):
|
77 |
+
probe = ffmpeg.probe(self.file)["streams"]
|
78 |
+
if self.downloaded:
|
79 |
+
subs = [{"name": stream[-1]['name'], "stream": stream[-1]['filepath']} for stream in self.yt_sub_streams.values()]
|
80 |
+
else:
|
81 |
+
subs = [{"name": stream['tags'].get('language', 'unknown'), "stream": stream['index']} for stream in probe if stream["codec_type"] == "subtitle"]
|
82 |
+
return {
|
83 |
+
"audio": [stream for stream in probe if stream["codec_type"] == "audio"],
|
84 |
+
"subs": subs
|
85 |
+
}
|
86 |
+
|
87 |
+
def get_snippet(self, start, end):
|
88 |
+
return self.audio[start*1000:end*1000]
|
89 |
+
|
90 |
+
# Crops the video's audio segment to reduce memory size
|
91 |
+
def crop_audio(self, isolated_vocals):
|
92 |
+
# ffmpeg -i .\saiki.mkv -vn -ss 84 -to 1325 crop.wav
|
93 |
+
source_file = self.vocal_track if isolated_vocals and self.vocal_track else self.file
|
94 |
+
output = utils.get_output_path(source_file, "-crop.wav")
|
95 |
+
(
|
96 |
+
ffmpeg
|
97 |
+
.input(self.file, ss=self.start_time, to=self.end_time)
|
98 |
+
.output(output)
|
99 |
+
.global_args('-loglevel', 'error')
|
100 |
+
.global_args('-vn')
|
101 |
+
.run(overwrite_output=True)
|
102 |
+
)
|
103 |
+
return output
|
104 |
+
|
105 |
+
def filter_multilingual_subtiles(self, progress_hook=print, exclusion="English"):
|
106 |
+
multi_lingual_subs = []
|
107 |
+
removed_subs = []
|
108 |
+
# Speechbrain is being a lil bitch about this path on Windows all of the sudden
|
109 |
+
snippet_path = "video_snippet.wav" # utils.get_output_path('video_snippet', '.wav')
|
110 |
+
for i, sub in enumerate(self.subs_adjusted):
|
111 |
+
self.get_snippet(sub.start, sub.end).export(snippet_path, format="wav")
|
112 |
+
if sub.get_language(snippet_path) != exclusion:
|
113 |
+
multi_lingual_subs.append(sub)
|
114 |
+
else:
|
115 |
+
removed_subs.append(sub)
|
116 |
+
progress_hook(i, f"{i}/{len(self.subs_adjusted)}: {sub.text}")
|
117 |
+
self.subs_adjusted = multi_lingual_subs
|
118 |
+
self.subs_removed = removed_subs
|
119 |
+
progress_hook(-1, "done")
|
120 |
+
|
121 |
+
# This funxion is is used to only get the snippets of the audio that appear in subs_adjusted after language filtration or cropping, irregardless of the vocal splitting.
|
122 |
+
# This should be called AFTER filter multilingual and BEFORE vocal isolation. Not useful yet
|
123 |
+
# OKAY THERE HAS TO BE A FASTER WAY TO DO THIS X_X
|
124 |
+
|
125 |
+
# def isolate_subs(self):
|
126 |
+
# base = AudioSegment.silent(duration=self.duration*1000, frame_rate=self.audio.frame_rate, channels=self.audio.channels, frame_width=self.audio.frame_width)
|
127 |
+
# samples = np.array(base.get_array_of_samples())
|
128 |
+
# frame_rate = base.frame_rate
|
129 |
+
|
130 |
+
# for sub in self.subs_adjusted:
|
131 |
+
# copy = np.array(self.get_snippet(sub.start, sub.end).get_array_of_samples())
|
132 |
+
# start_sample = int(sub.start * frame_rate)
|
133 |
+
# end_sample = int(sub.end * frame_rate)
|
134 |
+
|
135 |
+
# # Ensure that the copy array has the same length as the region to replace
|
136 |
+
# copy = copy[:end_sample - start_sample] # Trim if necessary
|
137 |
+
|
138 |
+
# samples[start_sample:end_sample] = copy
|
139 |
+
|
140 |
+
# return AudioSegment(
|
141 |
+
# samples.tobytes(),
|
142 |
+
# frame_rate=frame_rate,
|
143 |
+
# sample_width=base.sample_width, # Adjust sample_width as needed (2 bytes for int16)
|
144 |
+
# channels=base.channels
|
145 |
+
# )
|
146 |
+
|
147 |
+
def isolate_subs(self, subs):
|
148 |
+
empty_audio = AudioSegment.silent(self.duration * 1000, frame_rate=self.audio.frame_rate)
|
149 |
+
empty_audio = self.audio
|
150 |
+
first_sub = subs[0]
|
151 |
+
empty_audio = empty_audio[0:first_sub.start].silent((first_sub.end-first_sub.start)*1000)
|
152 |
+
for i, sub in enumerate(subs[:-1]):
|
153 |
+
print(sub.text)
|
154 |
+
empty_audio = empty_audio[sub.end:subs[i+1].start].silent((subs[i+1].start-sub.end)*1000, frame_rate=empty_audio.frame_rate, channels=empty_audio.channels, sample_width=empty_audio.sample_width, frame_width=empty_audio.frame_width)
|
155 |
+
|
156 |
+
return empty_audio
|
157 |
+
|
158 |
+
def run_dubbing(self, progress_hook=None):
|
159 |
+
total_errors = 0
|
160 |
+
operation_start_time = time.process_time()
|
161 |
+
empty_audio = AudioSegment.silent(self.duration * 1000, frame_rate=22050)
|
162 |
+
status = ""
|
163 |
+
# with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
|
164 |
+
# tasks = [pool.submit(dub_task, sub, i) for i, sub in enumerate(subs_adjusted)]
|
165 |
+
# for future in concurrent.futures.as_completed(tasks):
|
166 |
+
# pass
|
167 |
+
for i, sub in enumerate(self.subs_adjusted):
|
168 |
+
status = f"{i}/{len(self.subs_adjusted)}"
|
169 |
+
progress_hook(i, f"{status}: {sub.text}")
|
170 |
+
try:
|
171 |
+
line = sub.dub_line_file(False)
|
172 |
+
empty_audio = empty_audio.overlay(line, sub.start*1000)
|
173 |
+
except Exception as e:
|
174 |
+
print(e)
|
175 |
+
total_errors += 1
|
176 |
+
self.dub_track = empty_audio.export(utils.get_output_path(self.file, '-dubtrack.wav'), format="wav").name
|
177 |
+
progress_hook(i+1, "Mixing New Audio")
|
178 |
+
self.mix_av(mixing_ratio=1)
|
179 |
+
progress_hook(-1)
|
180 |
+
print(f"TOTAL TIME TAKEN: {time.process_time() - operation_start_time}")
|
181 |
+
# print(total_errors)
|
182 |
+
|
183 |
+
	# Runs an ffmpeg command to combine the original video with the dub track, mixing the two
	# audio streams at a specific loudness ratio for the dubtrack (the subtitle stream mapping
	# is currently commented out below).
	def mix_av(self, mixing_ratio=1, dubtrack=None, output_path=None):
		# Default argument values can't reference self, so resolve them here.
		if not dubtrack: dubtrack = self.dub_track
		if not output_path: output_path = utils.get_output_path(self.file, '-dubbed.mkv')

		input_video = ffmpeg.input(self.file)
		input_audio = input_video.audio
		if self.background_track:
			input_audio = ffmpeg.input(self.background_track)
		input_dub = ffmpeg.input(dubtrack).audio

		mixed_audio = ffmpeg.filter([input_audio, input_dub], 'amix', duration='first', weights=f"1 {mixing_ratio}")

		output = (
			# input_video['s']
			ffmpeg.output(input_video['v'], mixed_audio, output_path, vcodec="copy", acodec="aac")
			.global_args('-loglevel', 'error')
			.global_args('-shortest')
		)
		ffmpeg.run(output, overwrite_output=True)

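	# For reference, the filter graph built above corresponds roughly to a command like the
	# following (hypothetical paths, a single source audio stream, and mixing_ratio=1):
	#
	# ffmpeg -i input.mkv -i output/input-dubtrack.wav \
	# 	-filter_complex "[0:a][1:a]amix=duration=first:weights=1 1[mixed]" \
	# 	-map 0:v -map "[mixed]" -c:v copy -c:a aac -shortest output/input-dubbed.mkv
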
	# Change the subs to either a file or a different stream from the video file
	def change_subs(self, stream_index=-1):
		if self.downloaded:
			sub_path = list(self.yt_sub_streams.values())[stream_index][-1]['filepath']
			self.subs = self.subs_adjusted = load_subs(utils.get_output_path(sub_path, '.srt'), sub_path)
		else:
			# ffmpeg -i output.mkv -map 0:s:1 frick.srt
			sub_path = utils.get_output_path(self.file, '.srt')
			ffmpeg.input(self.file).output(sub_path, map=f"0:s:{stream_index}").run(overwrite_output=True)
			self.subs = self.subs_adjusted = load_subs(sub_path)

	def change_audio(self, stream_index=-1):
		audio_path = utils.get_output_path(self.file, f"-{stream_index}.wav")
		ffmpeg.input(self.file).output(audio_path, map=f"0:a:{stream_index}").run(overwrite_output=True)
		self.audio = AudioSegment.from_file(audio_path)
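Taken together, these methods are what the GUI drives when dubbing a video. A rough, hypothetical sketch of driving Video headlessly (it assumes the constructor signature used in weeablind.py, a path plus a progress callback, and that the default streams are acceptable):

from video import Video

vid = Video("my_show.mkv", print)   # hypothetical local file
vid.change_subs(stream_index=0)     # extract and load a subtitle stream to dub from
vid.change_audio(stream_index=0)    # pick the source audio stream
vid.run_dubbing(progress_hook=lambda i, text="": print(i, text))
# run_dubbing exports the -dubtrack.wav and calls mix_av() to write the -dubbed.mkv output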
vocal_isolation.py
ADDED
@@ -0,0 +1,47 @@
from spleeter.separator import Separator
from spleeter.audio import adapter
from pydub import AudioSegment
import numpy as np
import utils

separator = None # Separator('spleeter:2stems')

# I don't have any clue on how to make this work yet, just ignore for now. Ideally we'd never
# have to serialize the audio to wav and then re-read it, but alas, bad implementations of PCM
# will be the death of me.
def seperate_ram(video):
	audio_loader = adapter.AudioAdapter.default()
	sample_rate = 44100
	audio = video.audio
	# arr = np.array(audio.get_array_of_samples(), dtype=np.float32).reshape((-1, audio.channels)) / (
	# 	1 << (8 * audio.sample_width - 1)), audio.frame_rate
	arr = np.array(audio.get_array_of_samples())
	audio, _ = audio_loader.load_waveform(arr)
	# waveform, _ = audio_loader.load('/path/to/audio/file', sample_rate=sample_rate)

	print("base audio\n", audio, "\n")
	# Perform the separation:
	# prediction = separator.separate(audio)

def seperate_file(video, isolate_subs=True):
	global separator
	if not separator:
		separator = Separator('spleeter:2stems')
	source_audio_path = utils.get_output_path(video.file, '-audio.wav')
	isolated_path = utils.get_output_path(video.file, '-isolate.wav')
	separator.separate_to_file(
		(video.audio).export(source_audio_path, format="wav").name,
		'./output/',
		filename_format='{filename}-{instrument}.{codec}'
	)
	# separator.separate_to_file(
	# 	video.isolate_subs().export(source_audio_path, format="wav").name,
	# 	'./output/',
	# 	filename_format='{filename}-{instrument}.{codec}'
	# )
	background_track = utils.get_output_path(source_audio_path, '-accompaniment.wav')
	# If we removed primary language subs from a multilingual video, we'll need to add them back to the background.
	if video.subs_removed:
		background = AudioSegment.from_file(background_track)
		for sub in video.subs_removed:
			background = background.overlay(video.get_snippet(sub.start, sub.end), int(sub.start*1000))
		background.export(background_track, format="wav")
	video.background_track = background_track
	video.vocal_track = utils.get_output_path(isolated_path, '-vocals.wav')
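A rough sketch of how this module is intended to be used (hypothetical; it assumes a Video has already been loaded the way weeablind.py does):

import vocal_isolation
from video import Video

vid = Video("my_show.mkv", print)    # hypothetical local file
vocal_isolation.seperate_file(vid)   # writes the spleeter stems under ./output/
# vid.background_track now points at the accompaniment stem, which mix_av() will
# mix with the dub track in place of the original audio.
print(vid.background_track, vid.vocal_track)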
weeablind.py
ADDED
@@ -0,0 +1,163 @@
import wx
import wx.adv
from Voice import Voice
from pydub import AudioSegment
from pydub.playback import play
from tabs.ConfigureVoiceTab import ConfigureVoiceTab
from tabs.SubtitlesTab import SubtitlesTab
from tabs.ListStreams import ListStreamsTab
import threading
import utils
from video import Video
import app_state
import json

class GUI(wx.Panel):
	def __init__(self, parent):
		super().__init__(parent)

		# Labels
		lbl_title = wx.StaticText(self, label="WeeaBlind")
		lbl_GPU = wx.StaticText(self, label=f"GPU Detected? {utils.gpu_detected}")
		lbl_GPU.SetForegroundColour((0, 255, 0) if utils.gpu_detected else (255, 0, 0))
		lbl_main_file = wx.StaticText(self, label="Choose a video file or link to a YouTube video:")
		lbl_start_time = wx.StaticText(self, label="Start Time:")
		lbl_end_time = wx.StaticText(self, label="End Time:")

		# Controls
		btn_choose_file = wx.Button(self, label="Choose File")
		btn_choose_file.Bind(wx.EVT_BUTTON, self.open_file)

		self.txt_main_file = wx.TextCtrl(self, style=wx.TE_PROCESS_ENTER, value=utils.test_video_name)
		self.txt_main_file.Bind(wx.EVT_TEXT_ENTER, lambda event: self.load_video(self.txt_main_file.Value))

		self.txt_start = wx.TextCtrl(self, style=wx.TE_PROCESS_ENTER, value=utils.seconds_to_timecode(0))
		self.txt_end = wx.TextCtrl(self, style=wx.TE_PROCESS_ENTER, value=utils.seconds_to_timecode(0))
		self.txt_start.Bind(wx.EVT_TEXT_ENTER, self.change_crop_time)
		self.txt_end.Bind(wx.EVT_TEXT_ENTER, self.change_crop_time)

		self.chk_match_volume = wx.CheckBox(self, label="Match Speaker Volume")
		self.chk_match_volume.SetValue(True)

		self.lb_voices = wx.ListBox(self, choices=[speaker.name for speaker in app_state.speakers])
		self.lb_voices.Bind(wx.EVT_LISTBOX, self.on_voice_change)
		self.lb_voices.Select(0)

		tab_control = wx.Notebook(self)
		self.tab_voice_config = ConfigureVoiceTab(tab_control, self)
		tab_control.AddPage(self.tab_voice_config, "Configure Voices")
		self.tab_subtitles = SubtitlesTab(tab_control, self)
		tab_control.AddPage(self.tab_subtitles, "Subtitles")
		self.streams_tab = ListStreamsTab(tab_control, self)
		tab_control.AddPage(self.streams_tab, "Video Streams")
		btn_run_dub = wx.Button(self, label="Run Dubbing!")
		btn_run_dub.Bind(wx.EVT_BUTTON, self.run_dub)
		sizer = wx.GridBagSizer(vgap=5, hgap=5)

		sizer.Add(lbl_title, pos=(0, 0), span=(1, 2), flag=wx.CENTER | wx.ALL, border=5)
		sizer.Add(lbl_GPU, pos=(0, 3), span=(1, 1), flag=wx.CENTER | wx.ALL, border=5)
		sizer.Add(lbl_main_file, pos=(2, 0), span=(1, 2), flag=wx.LEFT | wx.TOP, border=5)
		sizer.Add(self.txt_main_file, pos=(3, 0), span=(1, 2), flag=wx.EXPAND | wx.LEFT | wx.RIGHT | wx.BOTTOM, border=5)
		sizer.Add(btn_choose_file, pos=(3, 2), span=(1, 1), flag=wx.ALIGN_RIGHT | wx.RIGHT | wx.BOTTOM, border=5)
		sizer.Add(lbl_start_time, pos=(4, 0), flag=wx.LEFT | wx.TOP, border=5)
		sizer.Add(self.txt_start, pos=(4, 1), flag=wx.TOP | wx.RIGHT, border=5)
		sizer.Add(lbl_end_time, pos=(5, 0), flag=wx.LEFT | wx.TOP, border=5)
		sizer.Add(self.txt_end, pos=(5, 1), flag=wx.TOP | wx.RIGHT, border=5)
		sizer.Add(self.chk_match_volume, pos=(6, 0), span=(1, 2), flag=wx.LEFT | wx.TOP, border=5)
		sizer.Add(self.lb_voices, pos=(7, 0), span=(1, 1), flag=wx.EXPAND | wx.LEFT | wx.TOP, border=5)
		sizer.Add(tab_control, pos=(7, 1), span=(1, 3), flag=wx.EXPAND | wx.ALL, border=5)
		sizer.Add(btn_run_dub, pos=(9, 2), span=(1, 1), flag=wx.ALIGN_RIGHT | wx.RIGHT | wx.BOTTOM, border=5)
		sizer.AddGrowableCol(1)
		self.tab_voice_config.update_voice_fields(None)

		self.SetSizerAndFit(sizer)

	def open_file(self, event):
		dlg = wx.FileDialog(
			frame, message="Choose a file",
			wildcard="*.*",
			style=wx.FD_OPEN | wx.FD_CHANGE_DIR
		)
		if dlg.ShowModal() == wx.ID_OK:
			self.load_video(dlg.GetPath())
		dlg.Destroy()

	def load_video(self, video_path):
		def update_ui():
			self.txt_main_file.Value = app_state.video.file
			self.txt_start.SetValue(utils.seconds_to_timecode(app_state.video.start_time))
			self.txt_end.SetValue(utils.seconds_to_timecode(app_state.video.end_time))
			self.tab_subtitles.create_entries()

		def initialize_video(progress=True):
			app_state.video = Video(video_path, update_progress if progress else print)
			wx.CallAfter(update_ui)
			wx.CallAfter(self.streams_tab.populate_streams, app_state.video.list_streams())

		if video_path.startswith("http"):
			dialog = wx.ProgressDialog("Downloading Video", "Download starting", 100, self)

			def update_progress(progress=None):
				status = progress['status'] if progress else "waiting"
				total = progress.get("fragment_count", progress.get("total_bytes", 0))
				if status == "downloading" and total:
					completed = progress.get("fragment_index", progress.get("downloaded_bytes", 1))
					percent_complete = int(100 * (completed / total))
					wx.CallAfter(dialog.Update, percent_complete, f"{status}: {percent_complete}% \n {progress['info_dict'].get('fulltitle', '')}")
				elif status == "complete":
					if dialog:
						wx.CallAfter(dialog.Destroy)
				elif status == "error":
					wx.CallAfter(wx.MessageBox,
						f"Failed to download video with the following Error:\n {str(progress['error'])}",
						"Error",
						wx.ICON_ERROR
					)
					update_progress({"status": "complete"})

			threading.Thread(target=initialize_video).start()
		else:
			initialize_video(False)

	def change_crop_time(self, event):
		app_state.video.update_time(
			utils.timecode_to_seconds(self.txt_start.Value),
			utils.timecode_to_seconds(self.txt_end.Value)
		)
		self.tab_subtitles.create_entries()

	def update_voices_list(self):
		self.lb_voices.Set([speaker.name for speaker in app_state.speakers])
		self.lb_voices.Select(self.lb_voices.Strings.index(app_state.current_speaker.name))

	def on_voice_change(self, event):
		app_state.current_speaker = app_state.speakers[self.lb_voices.GetSelection()]
		app_state.sample_speaker = app_state.current_speaker
		self.tab_voice_config.update_voice_fields(event)

	def run_dub(self, event):
		progress_dialog = wx.ProgressDialog(
			"Dubbing Progress",
			"Starting...",
			maximum=len(app_state.video.subs_adjusted) + 1, # +1 for combining phase
			parent=self,
			style=wx.PD_APP_MODAL | wx.PD_AUTO_HIDE
		)
		dub_thread = None
		def update_progress(i, text=""):
			# run_dubbing passes -1 once the dubbed video has been mixed
			if i == -1:
				return wx.CallAfter(progress_dialog.Destroy)
			wx.CallAfter(progress_dialog.Update, i, text)

		dub_thread = threading.Thread(target=app_state.video.run_dubbing, args=(update_progress,))
		dub_thread.start()

if __name__ == '__main__':
	utils.create_output_dir()
	app = wx.App(False)
	frame = wx.Frame(None, wx.ID_ANY, utils.APP_NAME, size=(800, 800))
	frame.Center()
	gui = GUI(frame)
	frame.Show()
	app.MainLoop()