Commit 6669277 by ibrazebra · 1 parent: 63ee85c

Track .wav files with Git LFS and include previous changes
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
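The new attribute line routes `.wav` files through the LFS filter, alongside the patterns already present. As a rough illustration (not Git's exact gitattributes matcher; `fnmatch` only approximates its glob semantics), one can check which paths the listed patterns would capture:

```python
from fnmatch import fnmatch

# Patterns from the .gitattributes hunk above, including the newly added *.wav.
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "*.wav"]

def tracked_by_lfs(path: str) -> bool:
    """Approximate check: does any LFS pattern match the path's basename?"""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pattern) for pattern in LFS_PATTERNS)

print(tracked_by_lfs("samples/ground-truth.wav"))  # matches *.wav after this commit
```

In practice this line is exactly what `git lfs track "*.wav"` writes into `.gitattributes`.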
README.md CHANGED
@@ -10,20 +10,22 @@ tags:
 
 # LJSpeech Finetuned StyleTTS 2
 
-This repository contains checkpoints of a StyleTTS 2 model finetuned on the LJSpeech dataset (approximately 1 hour of speech data, around 1k samples) for 50 epochs.
+This repository hosts checkpoints of a StyleTTS 2 model adapted for high-quality single-speaker speech synthesis on the LJSpeech dataset. StyleTTS 2 is a state-of-the-art text-to-speech model whose expressive, natural-sounding synthesis comes from a style diffusion mechanism.
+
+Finetuning started from the multispeaker StyleTTS 2 model pretrained by the original authors on LibriTTS for 20 epochs, which supplies a strong foundation of general speech characteristics. That base model was then finetuned for 50 epochs on LJSpeech, roughly 1 hour of speech (about 1,000 audio samples) from a single speaker, so that it captures this speaker's voice and delivery. The approach transfers: given sufficient audio samples, StyleTTS 2 can be adapted to virtually any voice.
 
 ## Checkpoint Details
 
 This repository includes checkpoints from two separate finetuning runs, located in the following subdirectories:
 
-* **`no-slm-discriminator`**: Contains checkpoints from the finetuning run where the Speech Language Model (WavLM) was **not** used as a discriminator due to Out-of-Memory (OOM) challenges on a single NVIDIA RTX 3090. This finetuning took approximately 9 hours 23 minutes and 54 seconds on a single RTX 3090. Checkpoints are available every 5 epochs from `epoch_2nd_00004.pth` to `epoch_2nd_00049.pth`.
+* **`no-slm-discriminator`**: Checkpoints from the run where the Speech Language Model (WavLM) was excluded as a discriminator in the style diffusion process, to avoid Out-of-Memory (OOM) errors on a single NVIDIA RTX 3090. The run still completed successfully, taking approximately 9 hours, 23 minutes, and 54 seconds on that hardware. Checkpoints are saved at 5-epoch intervals, from `epoch_2nd_00004.pth` to `epoch_2nd_00049.pth`.
 
-* **`with-slm-discriminator`**: Contains checkpoints from the finetuning run where the Speech Language Model (WavLM) **was** used as a discriminator to the style diffusion process. This finetuning took approximately 2 days and 18 hours on a single NVIDIA RTX 3090 (default configuration). Checkpoints are available every 5 epochs from `epoch_2nd_00004.pth` to `epoch_2nd_00049.pth`.
+* **`with-slm-discriminator`**: Checkpoints from the run that kept WavLM as a discriminator, matching the default StyleTTS 2 configuration. WavLM's representations guide the style diffusion process and can improve speech naturalness, at a much higher compute cost: this run took approximately 2 days and 18 hours on a single NVIDIA RTX 3090. As in the other run, checkpoints are saved every 5 epochs, from `epoch_2nd_00004.pth` to `epoch_2nd_00049.pth`.
 
 ## Training Details
 
-* **Base Model:** StyleTTS 2
-* **Finetuning Dataset:** LJSpeech (1 hour subset)
+* **Base Model:** StyleTTS 2 (pretrained on LibriTTS for 20 epochs)
+* **Finetuning Dataset:** LJSpeech (1 hour subset, ~1k samples)
 * **Number of Epochs:** 50
 * **Hardware (Run 1 - No SLM):** 1 x NVIDIA RTX 3090
 * **Hardware (Run 2 - With SLM):** 1 x NVIDIA RTX 3090
@@ -32,18 +34,12 @@ This repository includes checkpoints from two separate finetuning runs, located
 
 ## Usage
 
-To use these checkpoints, you will need to have the StyleTTS 2 codebase set up. You can then load the checkpoints using the provided configuration files. Here's a general example (you might need to adjust it based on the specific loading mechanisms in StyleTTS 2):
+To use these finetuned checkpoints, first set up the original StyleTTS 2 codebase. Each checkpoint ships with a configuration file (`config_ft.yml`) describing the model architecture and training parameters; load them through the framework's own loading functions. The general Python example below will need its paths adjusted to your local setup and to the specific loading mechanisms in the StyleTTS 2 implementation.
 
 ```python
 import torch
 
 # Example for loading a checkpoint (adjust paths as needed)
-checkpoint_path_no_slm = "huggingface_hub::ibrazebra/lj-speech-finetuned-styletts2/no-slm-discriminator/epoch_2nd_00049.pth"
-config_path_no_slm = "huggingface_hub::ibrazebra/lj-speech-finetuned-styletts2/no-slm-discriminator/config_ft.yml"
-
-checkpoint_no_slm = torch.hub.load_state_dict_from_url(checkpoint_path_no_slm)
-# You would then load this state dictionary into your StyleTTS 2 model
-
 checkpoint_path_with_slm = "huggingface_hub::ibrazebra/lj-speech-finetuned-styletts2/with-slm-discriminator/epoch_2nd_00049.pth"
 config_path_with_slm = "huggingface_hub::ibrazebra/lj-speech-finetuned-styletts2/with-slm-discriminator/config_ft.yml"
 
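The README states that each run saves a checkpoint every 5 epochs, from `epoch_2nd_00004.pth` through `epoch_2nd_00049.pth`. As a small sanity check of that naming scheme, the expected filenames in either subdirectory can be enumerated:

```python
# Enumerate the checkpoint filenames described in the README:
# one every 5 epochs, from epoch index 4 through 49, zero-padded to 5 digits.
def checkpoint_names(first: int = 4, last: int = 49, step: int = 5) -> list[str]:
    return [f"epoch_2nd_{epoch:05d}.pth" for epoch in range(first, last + 1, step)]

names = checkpoint_names()
print(len(names))      # 10 checkpoints per run
print(names[0], names[-1])
```

This yields 10 checkpoints per run, bracketed by `epoch_2nd_00004.pth` and `epoch_2nd_00049.pth`, matching the README's description.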
SAMPLES.md ADDED
@@ -0,0 +1,20 @@
+### LJSpeech Female Speaker (GROUND TRUTH)
+<audio controls><source src="https://huggingface.co/ibrazebra/lj-speech-finetuned-styletts2/tree/main/samples/ground-truth.wav" type="audio/wav"></audio>
+> An old lady handed me a pamphlet about saving the Rosenbergs. I looked at that paper and I still remember it for some reason, I don't know why. End quote.
+```
+ɐn ˈoʊld lˈeɪdi hˈændᵻd mˌiː ɐ pˈæmflɪt ɐbˌaʊt sˈeɪvɪŋ ðə ɹˈoʊzənbˌɜːɡz . aɪ lˈʊkt æt ðæt pˈeɪpɚ ænd aɪ stˈɪl ɹᵻmˈɛmbɚɹ ɪt fɔːɹ sˌʌm ɹˈiːzən , aɪ dˈoʊnt nˈoʊ wˈaɪ . ˈɛnd kwˈoʊt .
+```
+
+### LJSpeech Female Speaker (FINETUNED MODEL: WITH-SLM-DISCRIMINATOR)
+<audio controls><source src="https://huggingface.co/ibrazebra/lj-speech-finetuned-styletts2/tree/main/samples/with-slm-discriminator.wav" type="audio/wav"></audio>
+> An old lady handed me a pamphlet about saving the Rosenbergs. I looked at that paper and I still remember it for some reason, I don't know why. End quote.
+```
+ɐn ˈoʊld lˈeɪdi hˈændᵻd mˌiː ɐ pˈæmflɪt ɐbˌaʊt sˈeɪvɪŋ ðə ɹˈoʊzənbˌɜːɡz . aɪ lˈʊkt æt ðæt pˈeɪpɚ ænd aɪ stˈɪl ɹᵻmˈɛmbɚɹ ɪt fɔːɹ sˌʌm ɹˈiːzən , aɪ dˈoʊnt nˈoʊ wˈaɪ . ˈɛnd kwˈoʊt .
+```
+
+### LJSpeech Female Speaker (FINETUNED MODEL: NO-SLM-DISCRIMINATOR)
+<audio controls><source src="https://huggingface.co/ibrazebra/lj-speech-finetuned-styletts2/tree/main/samples/no-slm-discriminator.wav" type="audio/wav"></audio>
+> An old lady handed me a pamphlet about saving the Rosenbergs. I looked at that paper and I still remember it for some reason, I don't know why. End quote.
+```
+ɐn ˈoʊld lˈeɪdi hˈændᵻd mˌiː ɐ pˈæmflɪt ɐbˌaʊt sˈeɪvɪŋ ðə ɹˈoʊzənbˌɜːɡz . aɪ lˈʊkt æt ðæt pˈeɪpɚ ænd aɪ stˈɪl ɹᵻmˈɛmbɚɹ ɪt fɔːɹ sˌʌm ɹˈiːzən , aɪ dˈoʊnt nˈoʊ wˈaɪ . ˈɛnd kwˈoʊt .
+```
samples/ground-truth.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6cc0293255cc59a4f3dec5b3f45ab44f479836938c39d4c057bfd3cee06ffdb2
+size 448996
samples/no-slm-discriminator.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:372ef2433f9bf11178a014ded560aa698eda4fb23b463130ca6fe99e8b224ce4
+size 461944
samples/with-slm-discriminator.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8ac174a0fecbe5f9876f160f904ab4d59a29d076d880e92dc2711a27cae69c58
+size 493144
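The three `.wav` additions above are Git LFS pointer files, not audio: each is a short text stub recording the object's `version`, `oid` (a SHA-256 of the real file), and `size` in bytes, while the audio itself lives in LFS storage. A minimal sketch of reading such a pointer (key and value separated by a single space, per the LFS pointer format; the sample text is copied from `samples/ground-truth.wav` above):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6cc0293255cc59a4f3dec5b3f45ab44f479836938c39d4c057bfd3cee06ffdb2
size 448996
"""
info = parse_lfs_pointer(pointer)
print(info["oid"], int(info["size"]))  # hash and byte size of the real .wav
```

This is why the `*.wav` rule had to be added to `.gitattributes` first: without it, the ~450 KB audio files would have been committed directly into the Git history instead of as pointers.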