Emotional Profile
Phenomenon of obtaining Affective TTS voices when using 4x speed
styles in StyleTTS2.
- This phenomenon is utilised in https://github.com/audeering/shift.
- StyleTTS2 sounds more expressive if given 4x-speed audio as style rather than actual natural speech at 1x speed
Audio examples (players are available on the project page):

- Mimic-3 English (Harvard) / StyleTTS2 with Mimic-3 English style
- Mimic-3 English 4x (Harvard) / StyleTTS2 with Mimic-3 English 4x style
- Human (EmoDB) and Human (LibriSpeech) / StyleTTS2 with EmoDB and LibriSpeech styles
- Mimic-3 Foreign (Harvard) / StyleTTS2 with Mimic-3 Foreign style
- Mimic-3 Foreign 4x (Harvard) / StyleTTS2 with Mimic-3 Foreign 4x style
Character Error Rate
Naturalness MOS and CER (%) of the style audio itself and of the corresponding StyleTTS2 output, using human speech (EmoDB) as style vs synthetic speech (Mimic-3) as style.
| Style Audio | MOS (style) | CER % (style) | MOS (StyleTTS2) | CER % (StyleTTS2) |
|---|---|---|---|---|
| Mimic-3 English 1x speed | 2.9 | 0.92 | 3.6 | 0.72 |
| Mimic-3 English 4x speed | 3.7 | 12.90 | 4.1 | 0.59 |
| Mimic-3 Foreign 1x speed | 1.7 | 62.63 | 3.4 | 0.85 |
| Mimic-3 Foreign 4x speed | 2.7 | 82.67 | 4.0 | 1.15 |
| EmoDB | 4.7 | 77.81 | 4.0 | 1.15 |
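CER of this kind can be estimated by transcribing the audio with an ASR model and scoring it against the reference Harvard text. A minimal sketch, assuming Whisper and jiwer are installed; the file layout, model size, and text normalization are illustrative, not the exact setup behind the table above:

```python
# Sketch: estimate CER by transcribing synthesized audio with an ASR model
# (here Whisper) and comparing against the reference Harvard sentences.
# Paths and model choice are assumptions for illustration only.
import glob
import jiwer
import whisper

asr = whisper.load_model("medium.en")

wavs = sorted(glob.glob("styletts2_out/*.wav"))             # one file per sentence
texts = open("harvard_sentences.txt").read().splitlines()   # same order as wavs

references, hypotheses = [], []
for wav, text in zip(wavs, texts):
    hypotheses.append(asr.transcribe(wav)["text"].strip().lower())
    references.append(text.strip().lower())

print(f"CER = {100 * jiwer.cer(references, hypotheses):.2f}%")
```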
Emotional Profile of StyleTTS2 using TTS styles
Our aim is to compare the naturalness of StyleTTS2 using a synthetic style vs a natural style. For this, we apply Speech Emotion Recognition (SER) to observe the emotional profile of the StyleTTS2 output over the duration of the 720 Harvard sentences. We use two publicly available SER detectors: one for Arousal, Dominance, and Valence (A/D/V), wav2small, and one for categorical emotions, WavLM MSP. The WavLM model outputs probabilities of the emotional categories Happy, Anger, Sad, Fear, and Disgust, among others. Arousal indicates voice excitement, Dominance shows how imposing a voice is, and Valence reveals negativity / positivity.
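The profile itself is just the per-sentence SER output traced over the sentence index. A minimal sketch of that loop is below; `predict_adv` is a placeholder for any A/D/V detector (e.g. wav2small), and the output directory, smoothing window, and dummy return values are assumptions for illustration:

```python
# Sketch: trace the emotional profile of StyleTTS2 output over the 720
# Harvard sentences.  predict_adv() is a placeholder for a real A/D/V SER
# model such as wav2small; the dummy values keep the sketch runnable.
import glob
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt

def predict_adv(signal, sampling_rate):
    """Placeholder: return (arousal, dominance, valence) in [0, 1].
    Replace the body with the real SER model call."""
    return 0.5, 0.5, 0.5

profile = []
for wav in sorted(glob.glob("styletts2_out/*.wav")):   # one file per Harvard sentence
    signal, sr = sf.read(wav)
    profile.append(predict_adv(signal, sr))

adv = np.array(profile)                                        # shape: (n_sentences, 3)
valence = adv[:, 2]
smoothed = np.convolve(valence, np.ones(9) / 9, mode="same")   # 9-sentence moving average

plt.plot(valence, alpha=0.3, label="valence (per sentence)")
plt.plot(smoothed, label="valence (smoothed)")
plt.xlabel("Harvard sentence index")
plt.ylabel("valence")
plt.legend()
plt.savefig("valence_profile.png")
```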
As the generator of synthetic speech styles, we use the Mimic-3 TTS system, which provides 134 English voices and 204 foreign voices. Our StyleTTS2 pipeline, along with pre-generated styles, is also available at https://audeering.github.io/shift/.
Visualizations
We synthesize the 720 Harvard sentences (in standard order) via StyleTTS2, using five different choices of style audio:
Mimic-3 English Style of 1x or 4x speed
We use styles at 1x or 4x speed to observe their effect on prosody manipulation. The speed change is generated natively by Mimic-3, avoiding post-processing artefacts. We synthesize a different style for each Harvard sentence, using a different Mimic-3 voice. Audio examples are given above, and a code sketch of the pipeline follows.
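A minimal sketch of this pipeline, assuming the Mimic-3 CLI is installed and that the `compute_style` / `inference` helpers from the StyleTTS2 demo notebook are already defined in scope; the voice name, length-scale value, and file names are illustrative:

```python
# Sketch: generate a 4x-speed Mimic-3 style clip and use it as the acoustic
# style for StyleTTS2.  Assumes the Mimic-3 CLI and the compute_style() /
# inference() helpers from the StyleTTS2 demo notebook are available.
import subprocess
import soundfile as sf

style_text = "The birch canoe slid on the smooth planks."   # a Harvard sentence
voice = "en_US/vctk_low#p239"                                # illustrative Mimic-3 voice

# --length-scale 0.25 makes Mimic-3 speak roughly 4x faster, natively,
# so no post-processing of the audio is needed.
with open("style_4x.wav", "wb") as f:
    subprocess.run(
        ["mimic3", "--voice", voice, "--length-scale", "0.25", style_text],
        check=True, stdout=f)

# Feed the accelerated clip to StyleTTS2 as the style reference.
ref_s = compute_style("style_4x.wav")
out = inference("The birch canoe slid on the smooth planks.",
                ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1)
sf.write("styletts2_out.wav", out, 24000)    # StyleTTS2 outputs 24 kHz audio
```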
Mimic-3 Foreign Style of 1x or 4x speed
The foreign voices of Mimic-3 can pronounce the Harvard sentences, although with an accent. We use them as styles from diverse languages not seen during the training of StyleTTS2. Again, we generate foreign styles at 1x and 4x speed.
Natural Speech style
In both figures, the grey shadow indicates the emotional profile of StyleTTS2 using natural speech from EmoDB as style. EmoDB is a corpus of (acted), highly expressive, noise-free natural speech.
Figures
Notice the increase in emotion probabilities when feeding 4x-speed style .wav files, generated by speeding up Mimic-3 TTS, to StyleTTS2.
The figures above show the probabilities of emotions as well as the levels of Arousal / Dominance / Valence detected at the output of StyleTTS2 over the course of the 720 Harvard sentences, for the different styles.
The emotion probabilities appear similar across styles because the text is the same: text sentiment overwhelms SER detectors (does emotion recognition look at the text or at the voice?).
Higher MOS via TTS than natural speech: StyleTTS2 achieves MOS = 4.1 and a lower CER = 0.59% by using the Mimic-3 English 4x-speed TTS style instead of the natural speech styles EmoDB or LibriSpeech. Subtle differences between the left and right figures, such as the rise of valence, show the tonality variation brought by the use of the 4x-speed style. A different style yields a slightly different duration, causing a misalignment of the grey and blue lines. The high MOS is proportional to high Valence and Happy probability (blue line).
StyleTTS2 is not affected by the intelligibility of the style: it achieves a very low CER = 1.15% for natural speech and an even lower CER = 0.59% for Mimic-3 English 4x speed. Disgust / Anger is diminished when using Mimic-3 4x-speed styles. Valence and Happiness for StyleTTS2 via Mimic-3 (blue line at 4x speed) are almost always higher than for StyleTTS2 via EmoDB (natural speech), irrespective of the language.
The actual words in the style audio do not affect the valence of StyleTTS2; however, extra punctuation placed in the style text, such as "...!!!;", makes Mimic-3 produce unnatural noises like "hah/hiss/scratch" sounds. When those noises are fed to StyleTTS2, they trigger it to generate audible backchannels, such as sighs and breaths, that are pleasant to hear. Listen to the audios above! We run StyleTTS2 using the default style embedding and pitch-curve calculation. Mimic-3 styles are also synthesized using the default settings of Mimic-3, except for the variation of speed to 1x / 4x.
Conclusion
We discovered that synthetic speech styles amplify valence and the Happy emotion in the output of StyleTTS2. We also found that an accelerated (4x) Mimic-3 synthetic speech style increases the MOS to 4.1, compared with the MOS of 4.0 achieved using a human speech style.