# Mozilla TTS on CPU: Real-Time Speech Synthesis with TFLite

**These models are converted from released [PyTorch models](https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing) using our TF utilities provided in Mozilla TTS.**

#### **Notebook Details**
These TFLite models were generated for TF 2.3rc0; for other TensorFlow versions you might need to regenerate them.
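
Once the libraries from the setup section are installed, the pinned version can be checked before loading the models. The snippet below is a minimal sketch: the helper names `installed_tf_version` and `same_minor` are hypothetical, and comparing only the major.minor part is a heuristic assumed here, not a documented TFLite compatibility rule.

```python
from importlib import metadata

# Version pinned by the pip install command in this notebook.
EXPECTED_TF = "2.3.0rc0"


def installed_tf_version():
    """Return the installed tensorflow version string, or None if it is absent."""
    try:
        return metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        return None


def same_minor(installed, expected=EXPECTED_TF):
    """Heuristic (assumption): treat versions with equal major.minor as matching."""
    if installed is None:
        return False
    return installed.split(".")[:2] == expected.split(".")[:2]


version = installed_tf_version()
if same_minor(version):
    print("TensorFlow {} matches the expected {} series.".format(version, EXPECTED_TF))
else:
    print("TensorFlow {} found; the models were generated for {}.".format(version, EXPECTED_TF))
```

If the check reports a mismatch and the models fail to load, regenerating them for the installed version is the likely fix.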

TFLite optimizations degrade the TTS model's performance, and for the same
reason we do not apply any optimization to the vocoder model. If you want to
keep the quality, consider regenerating the TTS TFLite model without optimizations.

Models optimized with TFLite can be slow on a regular CPU since they are optimized
specifically for lower-end systems.

---



#### **Model Details** 
We use the Tacotron2 and MultiBand-Melgan models and the LJSpeech dataset.

Tacotron2 is trained using [Double Decoder Consistency](https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/) (DDC) for only 130K steps (3 days) on a single GPU.

MultiBand-Melgan is trained for 1.45M steps with real spectrograms.

Note that the performance of both models can be improved with more training.


### Download TF Models and configs

In [None]:
!gdown --id 17PYXCmTe0el_SLTwznrt3vOArNGMGo5v -O tts_model.tflite
!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json

Downloading...
From: https://drive.google.com/uc?id=17PYXCmTe0el_SLTwznrt3vOArNGMGo5v
To: /content/tts_model.tflite
30.1MB [00:00, 36.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc
To: /content/config.json
100% 9.53k/9.53k [00:00<00:00, 7.38MB/s]


In [None]:
!gdown --id 1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO -O vocoder_model.tflite
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy

Downloading...
From: https://drive.google.com/uc?id=1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO
To: /content/vocoder_model.tflite
10.2MB [00:00, 16.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu
To: /content/config_vocoder.json
100% 6.76k/6.76k [00:00<00:00, 11.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU
To: /content/scale_stats.npy
100% 10.5k/10.5k [00:00<00:00, 16.6MB/s]
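
Before moving on, it can help to confirm that all five downloads landed in the working directory, since the audio processor later reads `./scale_stats.npy`. The following is a small sketch using only the standard library; the helper name `missing_files` is hypothetical, and the file names are taken from the `gdown` commands above.

```python
from pathlib import Path

# File names taken from the gdown commands above.
EXPECTED_FILES = [
    "tts_model.tflite",
    "config.json",
    "vocoder_model.tflite",
    "config_vocoder.json",
    "scale_stats.npy",
]


def missing_files(directory=".", names=EXPECTED_FILES):
    """Return the expected files that are absent or empty in `directory`."""
    root = Path(directory)
    return [
        name
        for name in names
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


missing = missing_files()
if missing:
    print("Missing or empty downloads:", ", ".join(missing))
else:
    print("All model files are present.")
```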


### Setup Libraries

In [None]:
# espeak is needed for character-to-phoneme conversion
! sudo apt-get install espeak

Reading package lists... Done
Building dependency tree 
Reading state information... Done
The following package was automatically installed and is no longer required:
 libnvidia-common-440
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
 espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
 espeak espeak-data libespeak1 libportaudio2 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 35 not upgraded.
Need to get 1,219 kB of archives.
After this operation, 3,031 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libportaudio2 amd64 19.6.0-1 [64.6 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsonic0 amd64 0.2.0-6 [13.4 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 espeak-data amd64 1.48.04+dfsg-5 [934 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libespeak1 amd64 1.48.04+dfsg-5 [145 kB]
Get:5

In [None]:
!git clone https://github.com/mozilla/TTS

Cloning into 'TTS'...
remote: Enumerating objects: 107, done.
remote: Counting objects: 100% (107/107), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 7252 (delta 51), reused 68 (delta 28), pack-reused 7145
Receiving objects: 100% (7252/7252), 115.36 MiB | 11.38 MiB/s, done.
Resolving deltas: 100% (4892/4892), done.


In [None]:
%cd TTS
!git checkout c7296b3
!pip install -r requirements.txt
!python setup.py install
!pip install tensorflow==2.3.0rc0
%cd ..

/content/TTS
Note: checking out 'c7296b3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

 git checkout -b 

HEAD is now at c7296b3 add module requirement
Collecting Unidecode>=0.4.20
 Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
 |████████████████████████████████| 245kB 2.7MB/s
Collecting tensorboardX
 Downloading https://files.pythonhosted.org/packages/af/0c/4f41bcd45db376e6fe5c619c01100e9b7531c55791b7244815bac6eac32c/tensorboardX-2.1-py2.py3-none-any.whl (308kB)
 |████████████████████████████████| 317kB 11.6MB/s
Collecting soundfile

### Define TTS function

In [None]:
def run_vocoder(mel_spec):
    """Convert a mel spectrogram to a waveform with the TFLite vocoder."""
    # add a batch dimension
    vocoder_inputs = mel_spec[None, :, :]
    # get input details
    input_details = vocoder_model.get_input_details()
    # resize the input tensor to this spectrogram's shape, then reallocate
    vocoder_model.resize_tensor_input(input_details[0]['index'], vocoder_inputs.shape)
    vocoder_model.allocate_tensors()
    vocoder_model.set_tensor(input_details[0]['index'], vocoder_inputs)
    # run the model
    vocoder_model.invoke()
    # collect outputs
    output_details = vocoder_model.get_output_details()
    waveform = vocoder_model.get_tensor(output_details[0]['index'])
    return waveform


def tts(model, text, CONFIG, ap):
    """Synthesize `text`, print timing metrics, and play the audio."""
    t_1 = time.time()
    # use_cuda and speaker_id are notebook-level globals set in the cells below
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,
        backend='tflite')
    # run the vocoder on the postnet mel spectrogram
    waveform = run_vocoder(mel_postnet_spec.T)
    waveform = waveform[0, 0]
    # read the clock once so that all three metrics share the same run time
    run_time = time.time() - t_1
    rtf = run_time / (len(waveform) / ap.sample_rate)
    tps = run_time / len(waveform)
    print(waveform.shape)
    print(" > Run-time: {}".format(run_time))
    print(" > Real-time factor: {}".format(rtf))
    print(" > Time per step: {}".format(tps))
    IPython.display.display(IPython.display.Audio(waveform, rate=CONFIG.audio['sample_rate']))
    return alignment, mel_postnet_spec, stop_tokens, waveform

### Load TF Models

In [None]:
import os
import torch
import time
import IPython

from TTS.tf.utils.tflite import load_tflite_model
from TTS.tf.utils.io import load_checkpoint
from TTS.utils.io import load_config
from TTS.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.synthesis import synthesis

In [None]:
# runtime settings
use_cuda = False

In [None]:
# model paths
TTS_MODEL = "tts_model.tflite"
TTS_CONFIG = "config.json"
VOCODER_MODEL = "vocoder_model.tflite"
VOCODER_CONFIG = "config_vocoder.json"

In [None]:
# load configs
TTS_CONFIG = load_config(TTS_CONFIG)
VOCODER_CONFIG = load_config(VOCODER_CONFIG)

In [None]:
# load the audio processor
ap = AudioProcessor(**TTS_CONFIG.audio) 

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > hop_length:256
 | > win_length:1024
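
The `hop_length` and `sample_rate` values above determine how spectrogram frames map to audio samples. The sketch below works through that arithmetic under the assumption that the vocoder emits exactly `hop_length` samples per mel frame. The helper names are hypothetical, and the frame count of 726 is chosen only because 726 × 256 equals the 185856-sample waveform printed in the inference section.

```python
HOP_LENGTH = 256      # from the audio processor settings above
SAMPLE_RATE = 22050   # from the audio processor settings above


def frames_to_samples(n_frames, hop_length=HOP_LENGTH):
    """Number of audio samples for n_frames, assuming hop_length samples per frame."""
    return n_frames * hop_length


def samples_to_seconds(n_samples, sample_rate=SAMPLE_RATE):
    """Duration in seconds of n_samples at the given sample rate."""
    return n_samples / sample_rate


n_frames = 726  # illustrative frame count (assumption, see lead-in)
n_samples = frames_to_samples(n_frames)
print("samples:", n_samples)
print("duration: {:.2f} s".format(samples_to_seconds(n_samples)))
print("one frame: {:.2f} ms".format(1000 * samples_to_seconds(HOP_LENGTH)))
```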


In [None]:
# LOAD TTS MODEL
# multi speaker 
speaker_id = None
speakers = []

# load the models
model = load_tflite_model(TTS_MODEL)
vocoder_model = load_tflite_model(VOCODER_MODEL)

### Run Inference

In [None]:
sentence = "Bill got in the habit of asking himself “Is that thought true?” and if he wasn’t absolutely certain it was, he just let it go."
align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, ap)

(185856,)
 > Run-time: 3.8069238662719727
 > Real-time factor: 0.45162849859449977
 > Time per step: 2.048206938938661e-05
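
The metrics printed above can be reproduced by hand from the run time, the waveform length, and the 22050 Hz sample rate. The snippet below is a worked example of that arithmetic, with hypothetical helper names: the real-time factor is the run time divided by the duration of the generated audio (values below 1 mean faster than real time), and the time per step is the run time divided by the number of samples.

```python
def real_time_factor(run_time_s, n_samples, sample_rate):
    """Run time divided by the duration of the generated audio."""
    audio_seconds = n_samples / sample_rate
    return run_time_s / audio_seconds


def time_per_step(run_time_s, n_samples):
    """Average run time spent per generated audio sample."""
    return run_time_s / n_samples


# Values copied from the output above.
run_time_s = 3.8069238662719727
n_samples = 185856
sample_rate = 22050

audio_seconds = n_samples / sample_rate
rtf = real_time_factor(run_time_s, n_samples, sample_rate)
tps = time_per_step(run_time_s, n_samples)

print("audio length: {:.2f} s".format(audio_seconds))
print("real-time factor: {:.3f}".format(rtf))
print("time per step: {:.3e} s".format(tps))
```

Here roughly 8.4 seconds of audio were produced in about 3.8 seconds, giving a real-time factor of about 0.45. Multiplying the time per step by the sample rate gives the same factor.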
