Grapheme-based statistical parametric synthesizer for Kinyarwanda

A grapheme-based approach was chosen because such approaches give acceptable performance for low-resource languages. This model was trained on approximately 5 hours of Kinyarwanda audio with corresponding transcriptions; no further language-specific information was provided. The Festvox suite of tools was used to build the model, and the Flite engine was used to generate a small, portable executable for it. Currently, this model can only be run on Linux.

Model description

To build the voice, we needed to map graphemes to their corresponding phonemes. In this work, we used the UniTran-based approach to build the voice. The graphemes are converted to Unicode code points, which are then converted to a guessed phonetic transcription in X-SAMPA. For each resulting phoneme, we use an HMM model from the Clustergen framework to obtain important features. These features are then used to train a random forest (20 decision trees) to predict spectral features. The resulting voice achieves a mel-cepstral distortion (MCD) of 5.03.
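For readers unfamiliar with the MCD figure, the following is a minimal sketch of how mel-cepstral distortion is commonly computed. It assumes the reference and synthesized mel-cepstral frames are already time-aligned and that the energy coefficient has been dropped by the caller. The function name and input layout are illustrative assumptions; this is not necessarily the exact procedure Festvox used to produce the score above.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames (hypothetical helper).

    Each frame is a sequence of mel-cepstral coefficients. Assumption:
    the frames are already aligned one-to-one and the energy (0th)
    coefficient has been removed beforehand.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("expected two non-empty, equally long frame lists")

    # Standard scaling constant: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_error)

    return total / len(ref_frames)
```

Lower values mean the synthesized spectra are closer to the reference recordings.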

Limitations and Recommendations

The voice lacks crispness and in some cases ignores tonal information, which is indispensable in Kinyarwanda. We believe that with a large corpus of linguistic information the voice would sound more natural.

Usage

To convert the contents of a text file to a WAV file, run:

./flite_du_kin_tts -f kinyarwanda.txt kinyarwanda.wav

To synthesize text passed directly on the command line, run:

./flite_du_kin_tts -t "Muraho Rwanda" kinyarwanda.wav
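If the voice needs to be called from a script, the two commands above can be wrapped in a few lines of Python. This is only a sketch: the helper names are hypothetical, and it assumes the flite_du_kin_tts executable sits in the current working directory on a Linux machine.

```python
import subprocess


def build_flite_command(source, wav_path, from_file=False,
                        binary="./flite_du_kin_tts"):
    """Build the argument list for the commands shown above.

    `source` is a path to a text file when from_file is True (-f),
    otherwise the text to speak (-t). The default binary path is an
    assumption about where the executable lives.
    """
    flag = "-f" if from_file else "-t"
    return [binary, flag, source, wav_path]


def synthesize(source, wav_path, from_file=False,
               binary="./flite_du_kin_tts"):
    """Run the synthesizer and return the path of the WAV file."""
    command = build_flite_command(source, wav_path, from_file, binary)
    # check=True raises CalledProcessError if the binary exits with an error
    subprocess.run(command, check=True)
    return wav_path
```

For example, synthesize("Muraho Rwanda", "kinyarwanda.wav") runs the second command shown above, and synthesize("kinyarwanda.txt", "kinyarwanda.wav", from_file=True) runs the first.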