---
language: sw
license: cc-by-sa-4.0
tags:
- tensorflowtts
- audio
- text-to-speech
- text-to-mel
inference: false
datasets:
- bookbot/OpenBible_Swahili
---
# LightSpeech MFA SW v1
LightSpeech MFA SW v1 is a text-to-mel-spectrogram model based on the [LightSpeech](https://arxiv.org/abs/2102.04040) architecture. This model was trained from scratch on a real audio dataset. The list of real speakers includes:
- sw-KE-OpenBible
We trained an acoustic Swahili model on our speech corpus using [Montreal Forced Aligner v2.0.0](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) and used it as the duration extractor. That model, and consequently our model, uses the IPA phone set for Swahili. We used [gruut](https://github.com/rhasspy/gruut) for phonemization purposes. We followed these [steps](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/mfa_extraction) to perform duration extraction.
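For illustration, phonemization with gruut looks roughly like the sketch below. This is a minimal example rather than the exact preprocessing pipeline used for training; it assumes the gruut Swahili language pack is installed (e.g. `pip install gruut[sw]`), and the sample sentence is arbitrary.

```py
from gruut import sentences

text = "Habari ya asubuhi"

# Phonemize Swahili text into IPA phonemes, word by word.
for sentence in sentences(text, lang="sw"):
    for word in sentence:
        if word.phonemes:
            print(word.text, word.phonemes)
```
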
This model was trained using the [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) framework. All training was done on a Scaleway RENDER-S VM with a Tesla P100 GPU. All necessary training scripts can be found in this [GitHub fork](https://github.com/bookbot-hive/TensorFlowTTS), as well as the [training metrics](https://huggingface.co/bookbot/lightspeech-mfa-sw-v1/tensorboard) logged via TensorBoard.
## Model
| Model                   | Config                                                                             | SR (kHz) | Mel range (Hz) | FFT / Hop / Win (pt) | #steps |
| ----------------------- | ---------------------------------------------------------------------------------- | -------- | -------------- | -------------------- | ------ |
| `lightspeech-mfa-sw-v1` | [Link](https://huggingface.co/bookbot/lightspeech-mfa-sw-v1/blob/main/config.yml)  | 44.1     | 20-11025       | 2048 / 512 / None    | 200K   |
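To put these numbers in perspective, the sampling rate and hop size together determine how many mel frames the model predicts per second of audio. A back-of-the-envelope sketch (illustrative arithmetic only, not part of the training code):

```py
sample_rate = 44_100  # Hz, from the table above
hop_size = 512        # samples between consecutive mel frames
num_mels = 80         # mel bins, from the network architecture config below

frames_per_second = sample_rate / hop_size  # ~86.13 mel frames per second
# A 3-second utterance therefore corresponds to roughly 258 frames of 80 mel bins.
print(round(3 * frames_per_second), num_mels)
```
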
## Training Procedure
<details>
<summary>Feature Extraction Setting</summary>
hop_size: 512 # Hop size.
format: "npy"
</details>
<details>
<summary>Network Architecture Setting</summary>
model_type: lightspeech
lightspeech_params:
dataset: "swahiliipa"
n_speakers: 1
encoder_hidden_size: 256
encoder_num_hidden_layers: 3
encoder_num_attention_heads: 2
encoder_attention_head_size: 16
encoder_intermediate_size: 1024
encoder_intermediate_kernel_size:
- 5
- 25
- 13
- 9
encoder_hidden_act: "mish"
decoder_hidden_size: 256
decoder_num_hidden_layers: 3
decoder_num_attention_heads: 2
decoder_attention_head_size: 16
decoder_intermediate_size: 1024
decoder_intermediate_kernel_size:
- 17
- 21
- 9
- 13
decoder_hidden_act: "mish"
variant_prediction_num_conv_layers: 2
variant_predictor_filter: 256
variant_predictor_kernel_size: 3
variant_predictor_dropout_rate: 0.5
num_mels: 80
hidden_dropout_prob: 0.2
attention_probs_dropout_prob: 0.1
max_position_embeddings: 2048
initializer_range: 0.02
output_attentions: False
output_hidden_states: False
</details>
<details>
<summary>Data Loader Setting</summary>
batch_size: 8 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
eval_batch_size: 16
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets whose mel_length <= 32
is_shuffle: true # shuffle dataset after each epoch.
</details>
<details>
<summary>Optimizer & Scheduler Setting</summary>
optimizer_params:
initial_learning_rate: 0.0001
end_learning_rate: 0.00005
decay_steps: 150000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001
gradient_accumulation_steps: 2
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
                     # separated by |. If var_train_expr is null, all variables are trained.
</details>
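The values above imply a warmup-then-decay schedule: roughly 4,000 warmup steps (2% of the 200K training steps) followed by a polynomial decay from 1e-4 down to 5e-5 over 150K steps. The sketch below approximates that shape with stock Keras utilities; it is an illustration of the settings, not the exact scheduler implemented in TensorFlowTTS.

```py
import tensorflow as tf

train_max_steps = 200_000
warmup_steps = int(0.02 * train_max_steps)  # warmup_proportion: 0.02 -> 4,000 steps

# Polynomial decay from initial_learning_rate to end_learning_rate over decay_steps.
decay = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-4,
    decay_steps=150_000,
    end_learning_rate=5e-5,
)

def learning_rate(step: int) -> float:
    """Approximate learning rate at a given global step."""
    if step < warmup_steps:
        return 1e-4 * step / warmup_steps     # linear warmup
    return float(decay(step - warmup_steps))  # polynomial decay afterwards
```
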
<details>
<summary>Interval Setting</summary>
train_max_steps: 200000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 5000 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.
delay_f0_energy_steps: 3 # 2 steps use LR outputs only, then 1 step uses LR + F0 + Energy.
</details>
<details>
<summary>Other Setting</summary>
num_save_intermediate_results: 1 # Number of batch to be saved as intermediate results.
</details>
## How to Use
```py
import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel, AutoProcessor

# Load the pretrained LightSpeech model and its matching text processor.
lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v1")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v1")

text, speaker_name = "Hello World", "sw-KE-OpenBible"

# Convert the input text into a sequence of phoneme IDs.
input_ids = processor.text_to_sequence(text)

# Generate the mel spectrogram; speed, F0, and energy ratios of 1.0
# keep the predicted prosody unchanged.
mel, duration_outputs, _ = lightspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor(
        [processor.speakers_map[speaker_name]], dtype=tf.int32
    ),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
```
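The snippet above only produces a mel spectrogram; a separately trained vocoder is needed to turn it into a waveform. A minimal sketch, assuming a TensorFlowTTS-compatible vocoder trained on the same 44.1 kHz / 80-mel configuration is available (the checkpoint name below is a placeholder, not a published model):

```py
import soundfile as sf
from tensorflow_tts.inference import TFAutoModel

# Placeholder checkpoint name; substitute a vocoder matching this model's
# 44.1 kHz, 80-bin mel configuration.
vocoder = TFAutoModel.from_pretrained("bookbot/compatible-vocoder")
audio = vocoder.inference(mel)[0, :, 0]
sf.write("output.wav", audio.numpy(), 44100)
```
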
## Disclaimer
Do consider the biases present in the training dataset, as they may carry over into this model's outputs.
## Authors
LightSpeech MFA SW v1 was trained and evaluated by [David Samuel Setiawan](https://davidsamuell.github.io/) and [Wilson Wongso](https://wilsonwongso.dev/). All computation and development were done on Scaleway.
## Framework versions
- TensorFlowTTS 1.8
- TensorFlow 2.7.0