Track .wav files with Git LFS and include previous changes
Files changed:

- .gitattributes +1 -0
- README.md +8 -12
- SAMPLES.md +20 -0
- samples/ground-truth.wav +3 -0
- samples/no-slm-discriminator.wav +3 -0
- samples/with-slm-discriminator.wav +3 -0
.gitattributes CHANGED

```diff
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
```
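The added `*.wav` rule routes any `.wav` file through Git LFS. As a quick sanity check of which paths a set of `.gitattributes` patterns would capture, the patterns can be matched with Python's `fnmatch` (a simplification: real gitattributes matching has extra rules for `**` and directory-anchored patterns, so treat this as a sketch, not the Git algorithm):

```python
from fnmatch import fnmatch

# LFS patterns taken from the updated .gitattributes (simplified matching;
# gitattributes semantics for '**' and anchored patterns are richer)
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "*.wav"]

def tracked_by_lfs(path: str) -> bool:
    """Return True if the path's basename matches any LFS pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pat) for pat in LFS_PATTERNS)

print(tracked_by_lfs("samples/ground-truth.wav"))  # True
print(tracked_by_lfs("README.md"))                 # False
```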
README.md CHANGED

````diff
@@ -10,20 +10,22 @@ tags:
 
 # LJSpeech Finetuned StyleTTS 2
 
-This repository
+This repository hosts checkpoints of a StyleTTS2 model specifically adapted for high-quality single-speaker speech synthesis using the LJSpeech dataset. StyleTTS2 is a state-of-the-art text-to-speech model known for its expressive and natural-sounding voice synthesis achieved through a style diffusion mechanism.
+
+Our finetuning process began with a robust multispeaker StyleTTS2 model, pretrained by the original authors on the extensive LibriTTS dataset for 20 epochs. This base model provides a strong foundation in learning general speech characteristics. We then specialized this model by finetuning it on the LJSpeech dataset, which comprises approximately 1 hour of speech data (around 1,000 audio samples) from a single speaker. This targeted finetuning for 50 epochs allows the model to capture the unique voice characteristics and nuances of the LJSpeech speaker. The methodology employed here demonstrates a transferable approach: StyleTTS2 can be effectively adapted to generate speech in virtually any voice, provided sufficient audio samples are available for finetuning.
 
 ## Checkpoint Details
 
 This repository includes checkpoints from two separate finetuning runs, located in the following subdirectories:
 
-* **`no-slm-discriminator`**:
+* **`no-slm-discriminator`**: These checkpoints resulted from a finetuning run where the Speech Language Model (WavLM) was intentionally excluded as a discriminator in the style diffusion process. This decision was made due to Out-of-Memory (OOM) errors encountered on a single NVIDIA RTX 3090. Despite this modification, the finetuning proceeded successfully, taking approximately 9 hours, 23 minutes, and 54 seconds on the aforementioned hardware. Checkpoints are available at 5-epoch intervals, ranging from `epoch_2nd_00004.pth` to `epoch_2nd_00049.pth`.
 
-* **`with-slm-discriminator`**:
+* **`with-slm-discriminator`**: This set of checkpoints comes from a finetuning run that utilized the Speech Language Model (WavLM) as a discriminator, aligning with the default StyleTTS2 configuration. This integration leverages the powerful representations of WavLM to guide the style diffusion process, potentially leading to enhanced speech naturalness. This more computationally intensive run took approximately 2 days and 18 hours to complete on a single NVIDIA RTX 3090. Similar to the other run, checkpoints are provided every 5 epochs, from `epoch_2nd_00004.pth` to `epoch_2nd_00049.pth`.
 
 ## Training Details
 
-* **Base Model:**
-* **Finetuning Dataset:** LJSpeech (1 hour subset)
+* **Base Model:** StyleTTS2 (pretrained on LibriTTS for 20 epochs)
+* **Finetuning Dataset:** LJSpeech (1 hour subset, ~1k samples)
 * **Number of Epochs:** 50
 * **Hardware (Run 1 - No SLM):** 1 x NVIDIA RTX 3090
 * **Hardware (Run 2 - With SLM):** 1 x NVIDIA RTX 3090
@@ -32,18 +34,12 @@ This repository includes checkpoints from two separate finetuning runs, located
 
 ## Usage
 
-To
+To leverage these finetuned StyleTTS 2 checkpoints, ensure you have the original StyleTTS2 codebase properly set up. The provided checkpoints can then be loaded using the framework's designated loading mechanisms, often involving configuration files that specify the model architecture and training parameters. Below is a general Python example illustrating how you might load a checkpoint. Remember to adjust the file paths according to your local setup and the specific loading functions provided by the StyleTTS 2 implementation.
 
 ```python
 import torch
 
 # Example for loading a checkpoint (adjust paths as needed)
-checkpoint_path_no_slm = "huggingface_hub::ibrazebra/lj-speech-finetuned-styletts2/no-slm-discriminator/epoch_2nd_00049.pth"
-config_path_no_slm = "huggingface_hub::ibrazebra/lj-speech-finetuned-styletts2/no-slm-discriminator/config_ft.yml"
-
-checkpoint_no_slm = torch.hub.load_state_dict_from_url(checkpoint_path_no_slm)
-# You would then load this state dictionary into your StyleTTS 2 model
-
 checkpoint_path_with_slm = "huggingface_hub::ibrazebra/lj-speech-finetuned-styletts2/with-slm-discriminator/epoch_2nd_00049.pth"
 config_path_with_slm = "huggingface_hub::ibrazebra/lj-speech-finetuned-styletts2/with-slm-discriminator/config_ft.yml"
 
````
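One caveat worth noting about the README's usage snippet: the `huggingface_hub::repo/path` strings are repo-relative identifiers, not URLs, so passing them straight to `torch.hub.load_state_dict_from_url` would fail. Hugging Face serves raw repo files from its standard `resolve` endpoint, and a real URL can be built from the repo id and file path. The helper below is a hypothetical sketch (the function name and structure are mine, not part of the repo or any library):

```python
def hf_resolve_url(repo_id: str, path_in_repo: str, revision: str = "main") -> str:
    """Build the raw-file URL Hugging Face serves for a file in a model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{path_in_repo}"

url = hf_resolve_url(
    "ibrazebra/lj-speech-finetuned-styletts2",
    "with-slm-discriminator/epoch_2nd_00049.pth",
)
print(url)

# With a real URL in hand, the checkpoint could then be fetched, e.g.:
#   import torch
#   state = torch.hub.load_state_dict_from_url(url, map_location="cpu")
# and the resulting state dict loaded into a StyleTTS 2 model built
# from the matching config_ft.yml.
```

Alternatively, `huggingface_hub.hf_hub_download(repo_id=..., filename=...)` downloads the file to a local cache path, which can then be passed to `torch.load`.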
SAMPLES.md ADDED

````diff
@@ -0,0 +1,20 @@
+### LJSpeech Female Speaker (GROUND TRUTH)
+<audio controls><source src="https://huggingface.co/ibrazebra/lj-speech-finetuned-styletts2/tree/main/samples/ground-truth.wav" type="audio/wav"></audio>
+> An old lady handed me a pamphlet about saving the Rosenbergs. I looked at that paper and I still remember it for some reason, I don't know why. End quote.
+```
+ɐn ˈoʊld lˈeɪdi hˈændᵻd mˌiː ɐ pˈæmflɪt ɐbˌaʊt sˈeɪvɪŋ ðə ɹˈoʊzənbˌɜːɡz . aɪ lˈʊkt æt ðæt pˈeɪpɚ ænd aɪ stˈɪl ɹᵻmˈɛmbɚɹ ɪt fɔːɹ sˌʌm ɹˈiːzən , aɪ dˈoʊnt nˈoʊ wˈaɪ . ˈɛnd kwˈoʊt .
+```
+
+### LJSpeech Female Speaker (FINETUNED MODEL: WITH-SLM-DISCRIMINATOR)
+<audio controls><source src="https://huggingface.co/ibrazebra/lj-speech-finetuned-styletts2/tree/main/samples/with-slm-discriminator.wav" type="audio/wav"></audio>
+> An old lady handed me a pamphlet about saving the Rosenbergs. I looked at that paper and I still remember it for some reason, I don't know why. End quote.
+```
+ɐn ˈoʊld lˈeɪdi hˈændᵻd mˌiː ɐ pˈæmflɪt ɐbˌaʊt sˈeɪvɪŋ ðə ɹˈoʊzənbˌɜːɡz . aɪ lˈʊkt æt ðæt pˈeɪpɚ ænd aɪ stˈɪl ɹᵻmˈɛmbɚɹ ɪt fɔːɹ sˌʌm ɹˈiːzən , aɪ dˈoʊnt nˈoʊ wˈaɪ . ˈɛnd kwˈoʊt .
+```
+
+### LJSpeech Female Speaker (FINETUNED MODEL: NO-SLM-DISCRIMINATOR)
+<audio controls><source src="https://huggingface.co/ibrazebra/lj-speech-finetuned-styletts2/tree/main/samples/no-slm-discriminator.wav" type="audio/wav"></audio>
+> An old lady handed me a pamphlet about saving the Rosenbergs. I looked at that paper and I still remember it for some reason, I don't know why. End quote.
+```
+ɐn ˈoʊld lˈeɪdi hˈændᵻd mˌiː ɐ pˈæmflɪt ɐbˌaʊt sˈeɪvɪŋ ðə ɹˈoʊzənbˌɜːɡz . aɪ lˈʊkt æt ðæt pˈeɪpɚ ænd aɪ stˈɪl ɹᵻmˈɛmbɚɹ ɪt fɔːɹ sˌʌm ɹˈiːzən , aɪ dˈoʊnt nˈoʊ wˈaɪ . ˈɛnd kwˈoʊt .
+```
````
samples/ground-truth.wav ADDED

```diff
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6cc0293255cc59a4f3dec5b3f45ab44f479836938c39d4c057bfd3cee06ffdb2
+size 448996
```
samples/no-slm-discriminator.wav ADDED

```diff
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:372ef2433f9bf11178a014ded560aa698eda4fb23b463130ca6fe99e8b224ce4
+size 461944
```
samples/with-slm-discriminator.wav ADDED

```diff
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8ac174a0fecbe5f9876f160f904ab4d59a29d076d880e92dc2711a27cae69c58
+size 493144
```
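Note that the three `.wav` entries committed here are Git LFS pointer files, not audio: each stores the pointer spec version, a SHA-256 object id, and the byte size of the actual file, which LFS fetches on checkout. The format is simple `key value` lines, so a minimal parser is easy to sketch (the real spec permits additional keys; this handles only the fields shown above):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# Pointer content of samples/with-slm-discriminator.wav from this commit
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8ac174a0fecbe5f9876f160f904ab4d59a29d076d880e92dc2711a27cae69c58
size 493144
"""

info = parse_lfs_pointer(pointer)
print(info["oid"])        # the sha256:... object id
print(int(info["size"]))  # 493144
```

Running this on any of the three pointers recovers the object id and the true file size without downloading the audio itself.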