Emotional Profile

#2
by dkounadis - opened

This post documents a phenomenon: StyleTTS2 produces affective TTS voices when it is given 4x speed audio as style.

  • This phenomenon is utilised in https://github.com/audeering/shift.
  • StyleTTS2 sounds cooler if given 4x speed audio as style rather than natural speech at 1x speed.

Style

StyleTTS2

Mimic-3 English

Mimic-3 English - (Harvard)

StyleTTS2 - (Mimic-3 English)

Mimic-3 English 4x

Mimic-3 English 4x - (Harvard)

StyleTTS2 - (Mimic-3 English 4x)

Human - (EmoDB)

Human - (LibriSpeech)

StyleTTS2 - (EmoDB)

StyleTTS2 - (LibriSpeech)

Mimic-3 Foreign

Mimic-3 Foreign - (Harvard)

StyleTTS2 - (Mimic-3 Foreign)

Mimic-3 Foreign 4x

Mimic-3 Foreign 4x - (Harvard)

StyleTTS2 - (Mimic-3 Foreign 4x)

Character Error Rate

Naturalness MOS and CER (%) for StyleTTS2 having human speech (EmoDB) as style vs having synthetic speech (Mimic-3) as style.

NMOS / CER (%)

| Style | Style audio MOS | Style audio CER (%) | StyleTTS2 MOS | StyleTTS2 CER (%) |
|---|---|---|---|---|
| Mimic-3 English 1x speed | 2.9 | 0.92 | 3.6 | 0.72 |
| Mimic-3 English 4x speed | 3.7 | 12.90 | 4.1 | 0.59 |
| Mimic-3 Foreign 1x speed | 1.7 | 62.63 | 3.4 | 0.85 |
| Mimic-3 Foreign 4x speed | 2.7 | 82.67 | 4.0 | 1.15 |
| EmoDB | 4.7 | 77.81 | 4.0 | 1.15 |
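The post does not say which ASR system or text normalization produced the CER columns. As a reminder of what the metric measures, here is a minimal sketch of the standard definition: character-level edit distance between the reference text and a transcript, divided by the reference length. The function names and the lowercase/whitespace normalization are assumptions for illustration, not the author's pipeline.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis needs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    ref = " ".join(reference.lower().split())
    hyp = " ".join(hypothesis.lower().split())
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)
```

Because the denominator is the reference length, a transcript with many inserted characters can push CER above 100%.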

Emotional Profile of StyleTTS2 using TTS styles

Our aim is to compare the naturalness of StyleTTS2 when it uses a synthetic style vs a natural style. For this, we apply Speech Emotion Recognition (SER) to observe the emotional profile of the StyleTTS2 output over the duration of the 720 Harvard sentences. We use two publicly available SER detectors: one for Arousal, Dominance, and Valence (A/D/V), wav2small, and one for categorical emotions, WavLM MSP. The WavLM detector outputs probabilities for the emotion categories Happy, Anger, Sad, Fear, and Disgust, among others. Arousal indicates voice excitement, Dominance shows how imposing a voice is, and Valence ranges from negative to positive.

As the generator of synthetic speech styles, we use the Mimic-3 TTS system, which provides 134 English voices and 204 Foreign voices. Our Artificial StyleTTS2, along with pre-generated styles, is also available at https://audeering.github.io/shift/.

Visualizations

We synthesize the 720 Harvard sentences (in standard order) via StyleTTS2, using five different choices of style audio:

Mimic-3 English Style of 1x or 4x speed

We use styles of 1x or 4x speed to see their effect on prosody manipulation. The speed-up is produced by Mimic-3 itself, which avoids post-processing artefacts. We synthesize a different style for each Harvard sentence, using a different Mimic-3 voice each time. Audio examples are given above.
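One way such styles could be requested is sketched below, assuming the `mimic3` command-line tool and its `--length-scale` option (values below 1 give faster speech). The mapping length scale = 1 / speed, the helper names, and the voice key in the usage note are illustrative assumptions and are not taken from this post. The sketch only builds the command and does not run it.

```python
from typing import List, Sequence


def voice_for_sentence(index: int, voices: Sequence[str]) -> str:
    """Pick a voice for sentence `index`, cycling when voices run out."""
    if not voices:
        raise ValueError("need at least one voice")
    return voices[index % len(voices)]


def mimic3_style_command(text: str, voice: str, speed: float = 4.0) -> List[str]:
    """Build (but do not execute) a mimic3 call for a style at a given speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Assumption: 1x speed corresponds to a length scale of 1.0,
    # so 4x speed corresponds to a length scale of 0.25.
    length_scale = 1.0 / speed
    return ["mimic3", "--voice", voice,
            "--length-scale", f"{length_scale:.2f}", text]
```

For example, `mimic3_style_command("The birch canoe slid on the smooth planks.", "en_US/vctk_low")` returns an argument list that could be passed to `subprocess.run`, with the WAV output redirected to a file.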

Mimic-3 Foreign Style of 1x or 4x speed

The Foreign voices of Mimic-3 can pronounce the Harvard sentences, although with an accent. We use them as styles from diverse languages not seen during the training of StyleTTS2. Again, we generate foreign styles at 1x and 4x speed.

Natural Speech style

In both figures, the grey shadow indicates the emotional profile of StyleTTS2 using natural speech from EmoDB as style. EmoDB is a corpus of (acted) highly expressive, noise-free natural speech.

Figures

Notice the increase in emotion probabilities when StyleTTS2 is fed a 4x speed style .wav, generated by speeding up Mimic-3 TTS.

fig_english_WIN=40_HOP=10_HFdisc.png

fig_foreign_WIN=40_HOP=10_HFdisc.png

Figures above show the probability of emotions as well as the level of Arousal/Dominance/Valence detected at the output of StyleTTS2 over the course of 720 Harvard sentences, with different styles.
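The figure filenames contain WIN=40 and HOP=10. The post does not define these values; one plausible reading is a sliding-window average over the sequence of SER scores. The sketch below illustrates that reading only, and the function name and the choice of units are assumptions.

```python
from typing import List, Sequence


def windowed_mean(scores: Sequence[float], win: int = 40, hop: int = 10) -> List[float]:
    """Average `scores` over windows of length `win`, advancing by `hop`."""
    if win <= 0 or hop <= 0:
        raise ValueError("win and hop must be positive")
    means = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(scores) - win + 1, hop):
        window = scores[start:start + win]
        means.append(sum(window) / len(window))
    return means
```

Under this reading, a sequence of 720 scores would give 69 smoothed points per curve, since the window start positions run from 0 to 680 in steps of 10.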

Emotion probabilities appear similar across styles because the text is the same and text sentiment overwhelms the SER detectors. See "Emotion Looks at Text or Voice?".

  • Higher MOS via TTS than natural speech: StyleTTS2 achieves a higher MOS = 4.1 and a lower CER = 0.59% with the Mimic-3 English 4x speed TTS style than with a natural speech style from EmoDB or LibriSpeech.
  • Subtle differences between the left and right panels of each figure, such as the rise of valence, show the tonality variation brought by the 4x speed style.
  • Different styles yield slightly different durations, causing a misalignment of the grey and blue lines.
  • The high MOS goes together with high Valence and a high Happy probability (blue line).
  • StyleTTS2 is not affected by the intelligibility of the style: it achieves a very low CER = 1.15% with natural speech and an even lower CER = 0.59% with Mimic-3 English 4x speed.
  • Disgust / Anger is diminished when using Mimic-3 4x speed styles.
  • Valence and Happiness for StyleTTS2 via Mimic-3 (blue line, at 4x speed) are almost always higher than for StyleTTS2 via EmoDB (natural speech), irrespective of the language.
  • The actual words in the style audio do not affect the valence of StyleTTS2. However, extra punctuation placed in the style text, such as "...!!!;", makes Mimic-3 produce unnatural noises like "hah/hiss/scratch" sounds. When those noises are fed to StyleTTS2, they trigger it to generate audible backchannels, such as sighs and breaths, that are pleasant to hear. Listen to the audio examples above!
  • We run StyleTTS2 using the default style embedding and pitch curve calculation. Mimic-3 styles are also synthesized with the default Mimic-3 settings, except for the speed, which is set to 1x or 4x.


CONCLUSION

We discovered that synthetic speech styles amplify valence and the Happy emotion in the output of StyleTTS2. We also found that an accelerated Mimic-3 synthetic speech style raises MOS to 4.1, compared with the MOS of 4.0 achieved with a human speech style.
