JackismyShephard committed on
Commit 93b2b84
1 Parent(s): 02b2725

train without enhancement


train without enhancement again

README.md CHANGED
@@ -10,82 +10,72 @@ datasets:
  model-index:
  - name: speecht5_tts-finetuned-nst-da
    results: []
- metrics:
- - mse
- pipeline_tag: text-to-speech
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
  # speecht5_tts-finetuned-nst-da

  This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the NST Danish ASR Database dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.3738
+ - Loss: 0.3298

  ## Model description

- Given that Danish is a low-resource language, not many open-source implementations of a Danish text-to-speech synthesizer are available online. As of writing, the only other implementations available on 🤗 are [facebook/seamless-streaming](https://huggingface.co/facebook/seamless-streaming) and [audo/seamless-m4t-v2-large](https://huggingface.co/audo/seamless-m4t-v2-large). This model was developed to provide a simpler alternative that still performs reasonably well, both in terms of output quality and inference time. Additionally, unlike the aforementioned models, this model has an associated Space on 🤗 at [JackismyShephard/danish-speech-synthesis](https://huggingface.co/spaces/JackismyShephard/danish-speech-synthesis), which provides an easy interface for Danish text-to-speech synthesis as well as optional speech enhancement.
+ More information needed

  ## Intended uses & limitations

- The model is intended for Danish text-to-speech synthesis.
-
- The model does not recognize special symbols such as "æ", "ø" and "å", as it uses the default tokenizer of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts). The model performs best on short to medium-length input text and expects the input to contain no more than 600 vocabulary tokens. Additionally, for best performance the model should be given a Danish speaker embedding, ideally generated from an audio clip in the training split of [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using [speechbrain/spkrec-xvect-voxceleb](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb).
-
- The output of the model is a log-mel spectrogram, which should be converted to a waveform using [microsoft/speecht5_hifigan](https://huggingface.co/microsoft/speecht5_hifigan). For higher-quality output the resulting waveform can be enhanced using [ResembleAI/resemble-enhance](https://huggingface.co/ResembleAI/resemble-enhance).
-
- An example script showing how to use the model for inference can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/finetuned_nst-da-inference.ipynb).
-
+ More information needed

  ## Training and evaluation data

- The model was trained and evaluated on [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using MSE as both loss and metric.
+ More information needed

  ## Training procedure

- The script used for training the model can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/finetuned-nst-da-training.ipynb).
-
  ### Training hyperparameters

  The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 32
- - eval_batch_size: 32
+ - learning_rate: 5e-05
+ - train_batch_size: 16
+ - eval_batch_size: 16
  - seed: 42
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
  - lr_scheduler_warmup_ratio: 0.1
  - num_epochs: 20
- - mixed_precision_training: Native AMP

  ### Training results

- | Training Loss | Epoch | Step  | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 0.463         | 1.0   | 4715  | 0.4169          |
- | 0.4302        | 2.0   | 9430  | 0.3963          |
- | 0.447         | 3.0   | 14145 | 0.3883          |
- | 0.4283        | 4.0   | 18860 | 0.3847          |
- | 0.394         | 5.0   | 23575 | 0.3830          |
- | 0.3934        | 6.0   | 28290 | 0.3812          |
- | 0.3928        | 7.0   | 33005 | 0.3795          |
- | 0.4123        | 8.0   | 37720 | 0.3781          |
- | 0.3851        | 9.0   | 42435 | 0.3785          |
- | 0.4234        | 10.0  | 47150 | 0.3783          |
- | 0.3781        | 11.0  | 51865 | 0.3759          |
- | 0.3951        | 12.0  | 56580 | 0.3782          |
- | 0.4073        | 13.0  | 61295 | 0.3757          |
- | 0.4278        | 14.0  | 66010 | 0.3768          |
- | 0.4172        | 15.0  | 70725 | 0.3747          |
- | 0.3854        | 16.0  | 75440 | 0.3753          |
- | 0.4876        | 17.0  | 80155 | 0.3741          |
- | 0.432         | 18.0  | 84870 | 0.3738          |
- | 0.4435        | 19.0  | 89585 | 0.3745          |
- | 0.4255        | 20.0  | 94300 | 0.3739          |
+ | Training Loss | Epoch | Step   | Validation Loss |
+ |:-------------:|:-----:|:------:|:---------------:|
+ | 0.3762        | 1.0   | 9429   | 0.3670          |
+ | 0.3596        | 2.0   | 18858  | 0.3577          |
+ | 0.3498        | 3.0   | 28287  | 0.3535          |
+ | 0.3356        | 4.0   | 37716  | 0.3414          |
+ | 0.3405        | 5.0   | 47145  | 0.3378          |
+ | 0.3312        | 6.0   | 56574  | 0.3397          |
+ | 0.3326        | 7.0   | 66003  | 0.3377          |
+ | 0.3299        | 8.0   | 75432  | 0.3384          |
+ | 0.3279        | 9.0   | 84861  | 0.3363          |
+ | 0.3203        | 10.0  | 94290  | 0.3335          |
+ | 0.3235        | 11.0  | 103719 | 0.3367          |
+ | 0.3188        | 12.0  | 113148 | 0.3365          |
+ | 0.3141        | 13.0  | 122577 | 0.3324          |
+ | 0.3176        | 14.0  | 132006 | 0.3345          |
+ | 0.3221        | 15.0  | 141435 | 0.3331          |
+ | 0.3157        | 16.0  | 150864 | 0.3317          |
+ | 0.314         | 17.0  | 160293 | 0.3298          |
+ | 0.3164        | 18.0  | 169722 | 0.3316          |
+ | 0.3172        | 19.0  | 179151 | 0.3315          |
+ | 0.3179        | 20.0  | 188580 | 0.3318          |


  ### Framework versions

- - Transformers 4.37.0.dev0
- - Pytorch 2.1.2+cu118
- - Datasets 2.15.0
- - Tokenizers 0.15.0
+ - Transformers 4.37.2
+ - Pytorch 2.1.1+cu121
+ - Datasets 2.17.0
+ - Tokenizers 0.15.2
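The removed "Intended uses & limitations" text above describes the full inference recipe: tokenize the Danish text, condition the model on an x-vector speaker embedding, and decode the resulting log-mel spectrogram with the HiFi-GAN vocoder. A minimal sketch of that recipe is given below; it assumes the checkpoint is published as `JackismyShephard/speecht5_tts-finetuned-nst-da` and uses a zero placeholder instead of a real speaker embedding, so the linked inference notebook remains the authoritative example.

```python
# Minimal sketch of the inference recipe described in the card (not the
# official example). The repo id below is assumed; the zero speaker
# embedding is a placeholder and should be replaced with a 512-dim x-vector
# from a Danish speaker (speechbrain/spkrec-xvect-voxceleb) for usable audio.
import torch
import soundfile as sf
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

checkpoint = "JackismyShephard/speecht5_tts-finetuned-nst-da"  # assumed repo id
processor = SpeechT5Processor.from_pretrained(checkpoint)
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# The default SpeechT5 tokenizer has no "æ", "ø" or "å", so such characters
# are silently dropped from the input text.
inputs = processor(text="Det er en rigtig god dag i dag.", return_tensors="pt")

speaker_embeddings = torch.zeros(1, 512)  # placeholder x-vector

with torch.no_grad():
    speech = model.generate_speech(
        inputs["input_ids"], speaker_embeddings, vocoder=vocoder
    )

# SpeechT5 with the HiFi-GAN vocoder produces 16 kHz audio.
sf.write("output.wav", speech.numpy(), samplerate=16000)
```

As the removed card text notes, the waveform can optionally be post-processed with ResembleAI/resemble-enhance for higher-quality output.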
 
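The hyperparameter list in the diff above also maps onto a `Seq2SeqTrainingArguments` configuration in a fairly direct way. The sketch below is a hedged reconstruction, not the author's training script (that script is the notebook linked in the removed card text): only the learning rate, batch sizes, seed, scheduler, warmup ratio and epoch count come from the card, while the remaining arguments are illustrative guesses.

```python
# Hedged reconstruction of the listed hyperparameters as Seq2SeqTrainingArguments.
# Values marked "from the card" are taken from the README diff above; the rest
# (output_dir, evaluation/save cadence) are guesses for illustration only.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-finetuned-nst-da",  # guess
    learning_rate=5e-5,                  # from the card
    per_device_train_batch_size=16,      # from the card
    per_device_eval_batch_size=16,       # from the card
    seed=42,                             # from the card
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the transformers
    # default optimizer, matching the optimizer line in the card.
    lr_scheduler_type="linear",          # from the card
    warmup_ratio=0.1,                    # from the card
    num_train_epochs=20,                 # from the card
    evaluation_strategy="epoch",         # guess, based on the per-epoch eval losses
    save_strategy="epoch",               # guess
)
```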
config.json CHANGED
@@ -64,7 +64,6 @@
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.05,
- "max_length": 1876,
  "max_speech_positions": 1876,
  "max_text_positions": 600,
  "model_type": "speecht5",
@@ -85,8 +84,8 @@
  "speech_decoder_prenet_layers": 2,
  "speech_decoder_prenet_units": 256,
  "torch_dtype": "float32",
- "transformers_version": "4.37.0.dev0",
- "use_cache": false,
+ "transformers_version": "4.37.2",
+ "use_cache": true,
  "use_guided_attention_loss": true,
  "vocab_size": 81
  }
generation_config.json CHANGED
@@ -5,5 +5,5 @@
  "eos_token_id": 2,
  "max_length": 1876,
  "pad_token_id": 1,
- "transformers_version": "4.37.0.dev0"
+ "transformers_version": "4.37.2"
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6fd7115b5fd692e5b4e818cec73302c5178be23b813a6666edfc2e62bb6b3365
+ oid sha256:dff4359a3c72168158595b8f34cbf6aa51f98f47264b71477a6115de94707c2f
  size 577789320
runs/Feb11_23-38-55_CDT-DESKTOP-LINUX/events.out.tfevents.1707691135.CDT-DESKTOP-LINUX.8611.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a7e3f7e969bf2372a4f78976f5ae3f4ad2aee2adc0d9ded3a68ec5239264a31
+ size 1216780
runs/Feb11_23-38-55_CDT-DESKTOP-LINUX/events.out.tfevents.1707729176.CDT-DESKTOP-LINUX.8611.1 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f52f16751e8076e74d5e9235c61b66c3dc1c1735dcbe5d0c18fbfd617fe9b69c
+ size 364
runs/Feb15_01-39-31_CDT-DESKTOP-LINUX/events.out.tfevents.1707957572.CDT-DESKTOP-LINUX.16926.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2d28a273a43910b25316ff0f395da24002c14da259b4735e0418ff9b09bcc234
+ size 1216806
runs/Feb15_01-39-31_CDT-DESKTOP-LINUX/events.out.tfevents.1707996353.CDT-DESKTOP-LINUX.16926.1 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:716a966597a0b5daf1778ec9ac0fb24885d945fb74b8286facfe19cd079a00b1
+ size 364
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2ab472663086f916e33cf2c755647d5168c72207604e204f721d54f6d82a7734
+ oid sha256:7826ab175be8e113f8950ea58524622eadd9e37fa5d3f1e7b3d7377d06b01a48
  size 4920