JackismyShephard committed on
Commit 93b2b84
1 Parent(s): 02b2725

train without enhancement


train without enhancement again

README.md CHANGED
@@ -10,82 +10,72 @@ datasets:
  model-index:
  - name: speecht5_tts-finetuned-nst-da
    results: []
- metrics:
- - mse
- pipeline_tag: text-to-speech
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
  # speecht5_tts-finetuned-nst-da

  This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the NST Danish ASR Database dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.3738
+ - Loss: 0.3298

  ## Model description

- Given that Danish is a low-resource language, not many open-source implementations of a Danish text-to-speech synthesizer are available online. As of writing, the only other implementations available on 🤗 are [facebook/seamless-streaming](https://huggingface.co/facebook/seamless-streaming) and [audo/seamless-m4t-v2-large](https://huggingface.co/audo/seamless-m4t-v2-large). This model was developed to provide a simpler alternative that still performs reasonably well, both in terms of output quality and inference time. Additionally, unlike the aforementioned models, this model has an associated Space on 🤗 at [JackismyShephard/danish-speech-synthesis](https://huggingface.co/spaces/JackismyShephard/danish-speech-synthesis), which provides an easy interface for Danish text-to-speech synthesis as well as optional speech enhancement.
+ More information needed

  ## Intended uses & limitations

- The model is intended for Danish text-to-speech synthesis.
-
- The model does not recognize special symbols such as "æ", "ø" and "å", as it uses the default tokenizer of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts). The model performs best on short to medium-length input text and expects the input to contain no more than 600 vocabulary tokens. Additionally, for best performance the model should be given a Danish speaker embedding, ideally generated from an audio clip in the training split of [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using [speechbrain/spkrec-xvect-voxceleb](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb).
-
- The output of the model is a log-mel spectrogram, which should be converted to a waveform using [microsoft/speecht5_hifigan](https://huggingface.co/microsoft/speecht5_hifigan). For higher-quality output the resulting waveform can be enhanced using [ResembleAI/resemble-enhance](https://huggingface.co/ResembleAI/resemble-enhance).
-
- An example script showing how to use the model for inference can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/finetuned_nst-da-inference.ipynb).
-
+ More information needed

  ## Training and evaluation data

- The model was trained and evaluated on [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using MSE as both loss and metric.
+ More information needed

  ## Training procedure

- The script used for training the model can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/finetuned-nst-da-training.ipynb).
-
  ### Training hyperparameters

  The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 32
- - eval_batch_size: 32
+ - learning_rate: 5e-05
+ - train_batch_size: 16
+ - eval_batch_size: 16
  - seed: 42
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
  - lr_scheduler_warmup_ratio: 0.1
  - num_epochs: 20
- - mixed_precision_training: Native AMP

  ### Training results

- | Training Loss | Epoch | Step  | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 0.463         | 1.0   | 4715  | 0.4169          |
- | 0.4302        | 2.0   | 9430  | 0.3963          |
- | 0.447         | 3.0   | 14145 | 0.3883          |
- | 0.4283        | 4.0   | 18860 | 0.3847          |
- | 0.394         | 5.0   | 23575 | 0.3830          |
- | 0.3934        | 6.0   | 28290 | 0.3812          |
- | 0.3928        | 7.0   | 33005 | 0.3795          |
- | 0.4123        | 8.0   | 37720 | 0.3781          |
- | 0.3851        | 9.0   | 42435 | 0.3785          |
- | 0.4234        | 10.0  | 47150 | 0.3783          |
- | 0.3781        | 11.0  | 51865 | 0.3759          |
- | 0.3951        | 12.0  | 56580 | 0.3782          |
- | 0.4073        | 13.0  | 61295 | 0.3757          |
- | 0.4278        | 14.0  | 66010 | 0.3768          |
- | 0.4172        | 15.0  | 70725 | 0.3747          |
- | 0.3854        | 16.0  | 75440 | 0.3753          |
- | 0.4876        | 17.0  | 80155 | 0.3741          |
- | 0.432         | 18.0  | 84870 | 0.3738          |
- | 0.4435        | 19.0  | 89585 | 0.3745          |
- | 0.4255        | 20.0  | 94300 | 0.3739          |
+ | Training Loss | Epoch | Step   | Validation Loss |
+ |:-------------:|:-----:|:------:|:---------------:|
+ | 0.3762        | 1.0   | 9429   | 0.3670          |
+ | 0.3596        | 2.0   | 18858  | 0.3577          |
+ | 0.3498        | 3.0   | 28287  | 0.3535          |
+ | 0.3356        | 4.0   | 37716  | 0.3414          |
+ | 0.3405        | 5.0   | 47145  | 0.3378          |
+ | 0.3312        | 6.0   | 56574  | 0.3397          |
+ | 0.3326        | 7.0   | 66003  | 0.3377          |
+ | 0.3299        | 8.0   | 75432  | 0.3384          |
+ | 0.3279        | 9.0   | 84861  | 0.3363          |
+ | 0.3203        | 10.0  | 94290  | 0.3335          |
+ | 0.3235        | 11.0  | 103719 | 0.3367          |
+ | 0.3188        | 12.0  | 113148 | 0.3365          |
+ | 0.3141        | 13.0  | 122577 | 0.3324          |
+ | 0.3176        | 14.0  | 132006 | 0.3345          |
+ | 0.3221        | 15.0  | 141435 | 0.3331          |
+ | 0.3157        | 16.0  | 150864 | 0.3317          |
+ | 0.314         | 17.0  | 160293 | 0.3298          |
+ | 0.3164        | 18.0  | 169722 | 0.3316          |
+ | 0.3172        | 19.0  | 179151 | 0.3315          |
+ | 0.3179        | 20.0  | 188580 | 0.3318          |


  ### Framework versions

- - Transformers 4.37.0.dev0
- - Pytorch 2.1.2+cu118
- - Datasets 2.15.0
- - Tokenizers 0.15.0
+ - Transformers 4.37.2
+ - Pytorch 2.1.1+cu121
+ - Datasets 2.17.0
+ - Tokenizers 0.15.2
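The removed "Intended uses & limitations" text above describes the full inference recipe: tokenize the Danish text, condition the model on an x-vector speaker embedding, and decode the resulting log-mel spectrogram with the HiFi-GAN vocoder. A minimal sketch of that recipe is given below; it assumes the checkpoint is published as `JackismyShephard/speecht5_tts-finetuned-nst-da` and uses a zero placeholder instead of a real speaker embedding, so the linked inference notebook remains the authoritative example.

```python
# Minimal sketch of the inference recipe described in the card (not the
# official example). The repo id below is assumed; the zero speaker
# embedding is a placeholder and should be replaced with a 512-dim x-vector
# from a Danish speaker (speechbrain/spkrec-xvect-voxceleb) for usable audio.
import torch
import soundfile as sf
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

checkpoint = "JackismyShephard/speecht5_tts-finetuned-nst-da"  # assumed repo id
processor = SpeechT5Processor.from_pretrained(checkpoint)
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# The default SpeechT5 tokenizer has no "æ", "ø" or "å", so such characters
# are silently dropped from the input text.
inputs = processor(text="Det er en rigtig god dag i dag.", return_tensors="pt")

speaker_embeddings = torch.zeros(1, 512)  # placeholder x-vector

with torch.no_grad():
    speech = model.generate_speech(
        inputs["input_ids"], speaker_embeddings, vocoder=vocoder
    )

# SpeechT5 with the HiFi-GAN vocoder produces 16 kHz audio.
sf.write("output.wav", speech.numpy(), samplerate=16000)
```

As the removed card text notes, the waveform can optionally be post-processed with ResembleAI/resemble-enhance for higher-quality output.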
 
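The hyperparameter list in the diff above also maps onto a `Seq2SeqTrainingArguments` configuration in a fairly direct way. The sketch below is a hedged reconstruction, not the author's training script (that script is the notebook linked in the removed card text): only the learning rate, batch sizes, seed, scheduler, warmup ratio and epoch count come from the card, while the remaining arguments are illustrative guesses.

```python
# Hedged reconstruction of the listed hyperparameters as Seq2SeqTrainingArguments.
# Values marked "from the card" are taken from the README diff above; the rest
# (output_dir, evaluation/save cadence) are guesses for illustration only.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-finetuned-nst-da",  # guess
    learning_rate=5e-5,                  # from the card
    per_device_train_batch_size=16,      # from the card
    per_device_eval_batch_size=16,       # from the card
    seed=42,                             # from the card
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the transformers
    # default optimizer, matching the optimizer line in the card.
    lr_scheduler_type="linear",          # from the card
    warmup_ratio=0.1,                    # from the card
    num_train_epochs=20,                 # from the card
    evaluation_strategy="epoch",         # guess, based on the per-epoch eval losses
    save_strategy="epoch",               # guess
)
```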
config.json CHANGED
@@ -64,7 +64,6 @@
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.05,
- "max_length": 1876,
  "max_speech_positions": 1876,
  "max_text_positions": 600,
  "model_type": "speecht5",
@@ -85,8 +84,8 @@
  "speech_decoder_prenet_layers": 2,
  "speech_decoder_prenet_units": 256,
  "torch_dtype": "float32",
- "transformers_version": "4.37.0.dev0",
- "use_cache": false,
+ "transformers_version": "4.37.2",
+ "use_cache": true,
  "use_guided_attention_loss": true,
  "vocab_size": 81
  }
generation_config.json CHANGED
@@ -5,5 +5,5 @@
  "eos_token_id": 2,
  "max_length": 1876,
  "pad_token_id": 1,
- "transformers_version": "4.37.0.dev0"
+ "transformers_version": "4.37.2"
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6fd7115b5fd692e5b4e818cec73302c5178be23b813a6666edfc2e62bb6b3365
+ oid sha256:dff4359a3c72168158595b8f34cbf6aa51f98f47264b71477a6115de94707c2f
  size 577789320
runs/Feb11_23-38-55_CDT-DESKTOP-LINUX/events.out.tfevents.1707691135.CDT-DESKTOP-LINUX.8611.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a7e3f7e969bf2372a4f78976f5ae3f4ad2aee2adc0d9ded3a68ec5239264a31
+ size 1216780
runs/Feb11_23-38-55_CDT-DESKTOP-LINUX/events.out.tfevents.1707729176.CDT-DESKTOP-LINUX.8611.1 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f52f16751e8076e74d5e9235c61b66c3dc1c1735dcbe5d0c18fbfd617fe9b69c
+ size 364
runs/Feb15_01-39-31_CDT-DESKTOP-LINUX/events.out.tfevents.1707957572.CDT-DESKTOP-LINUX.16926.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2d28a273a43910b25316ff0f395da24002c14da259b4735e0418ff9b09bcc234
+ size 1216806
runs/Feb15_01-39-31_CDT-DESKTOP-LINUX/events.out.tfevents.1707996353.CDT-DESKTOP-LINUX.16926.1 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:716a966597a0b5daf1778ec9ac0fb24885d945fb74b8286facfe19cd079a00b1
+ size 364
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2ab472663086f916e33cf2c755647d5168c72207604e204f721d54f6d82a7734
+ oid sha256:7826ab175be8e113f8950ea58524622eadd9e37fa5d3f1e7b3d7377d06b01a48
  size 4920