Emotional Profile
Phenomenon of obtaining Affective TTS voices when using 4x speed
styles in StyleTTS2.
- This phenomenon is utilised in https://github.com/audeering/shift.
- StyleTTS2 sounds more expressive if given 4x-speed audio as style rather than actual natural speech at 1x speed
Audio examples (players are available on the project page):

- Mimic-3 English (Harvard) / StyleTTS2 with Mimic-3 English style
- Mimic-3 English 4x (Harvard) / StyleTTS2 with Mimic-3 English 4x style
- Human (EmoDB) and Human (LibriSpeech) / StyleTTS2 with EmoDB and LibriSpeech styles
- Mimic-3 Foreign (Harvard) / StyleTTS2 with Mimic-3 Foreign style
- Mimic-3 Foreign 4x (Harvard) / StyleTTS2 with Mimic-3 Foreign 4x style
Character Error Rate
Naturalness MOS and CER (%) of the style audio itself and of the corresponding StyleTTS2 output, using human speech (EmoDB) as style vs synthetic speech (Mimic-3) as style.
| Style Audio | MOS (style) | CER % (style) | MOS (StyleTTS2) | CER % (StyleTTS2) |
|---|---|---|---|---|
| Mimic-3 English 1x speed | 2.9 | 0.92 | 3.6 | 0.72 |
| Mimic-3 English 4x speed | 3.7 | 12.90 | 4.1 | 0.59 |
| Mimic-3 Foreign 1x speed | 1.7 | 62.63 | 3.4 | 0.85 |
| Mimic-3 Foreign 4x speed | 2.7 | 82.67 | 4.0 | 1.15 |
| EmoDB | 4.7 | 77.81 | 4.0 | 1.15 |
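CER of this kind can be estimated by transcribing the audio with an ASR model and scoring it against the reference Harvard text. A minimal sketch, assuming Whisper and jiwer are installed; the file layout, model size, and text normalization are illustrative, not the exact setup behind the table above:

```python
# Sketch: estimate CER by transcribing synthesized audio with an ASR model
# (here Whisper) and comparing against the reference Harvard sentences.
# Paths and model choice are assumptions for illustration only.
import glob
import jiwer
import whisper

asr = whisper.load_model("medium.en")

wavs = sorted(glob.glob("styletts2_out/*.wav"))             # one file per sentence
texts = open("harvard_sentences.txt").read().splitlines()   # same order as wavs

references, hypotheses = [], []
for wav, text in zip(wavs, texts):
    hypotheses.append(asr.transcribe(wav)["text"].strip().lower())
    references.append(text.strip().lower())

print(f"CER = {100 * jiwer.cer(references, hypotheses):.2f}%")
```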
Emotional Profile of StyleTTS2 using TTS styles
Our aim is to compare the naturalness of StyleTTS2 using a synthetic style vs a natural style. For this, we apply Speech Emotion Recognition (SER) to observe the emotional profile of the StyleTTS2 output over the duration of the 720 Harvard sentences. We use two publicly available SER detectors: one for Arousal, Dominance, and Valence (A/D/V), wav2small, and one for categorical emotions, WavLM MSP. The WavLM model outputs probabilities of the emotional categories Happy, Anger, Sad, Fear, and Disgust, among others. Arousal indicates voice excitement, Dominance shows how imposing a voice is, and Valence reveals negativity / positivity.
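The profile itself is just the per-sentence SER output traced over the sentence index. A minimal sketch of that loop is below; `predict_adv` is a placeholder for any A/D/V detector (e.g. wav2small), and the output directory, smoothing window, and dummy return values are assumptions for illustration:

```python
# Sketch: trace the emotional profile of StyleTTS2 output over the 720
# Harvard sentences.  predict_adv() is a placeholder for a real A/D/V SER
# model such as wav2small; the dummy values keep the sketch runnable.
import glob
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt

def predict_adv(signal, sampling_rate):
    """Placeholder: return (arousal, dominance, valence) in [0, 1].
    Replace the body with the real SER model call."""
    return 0.5, 0.5, 0.5

profile = []
for wav in sorted(glob.glob("styletts2_out/*.wav")):   # one file per Harvard sentence
    signal, sr = sf.read(wav)
    profile.append(predict_adv(signal, sr))

adv = np.array(profile)                                        # shape: (n_sentences, 3)
valence = adv[:, 2]
smoothed = np.convolve(valence, np.ones(9) / 9, mode="same")   # 9-sentence moving average

plt.plot(valence, alpha=0.3, label="valence (per sentence)")
plt.plot(smoothed, label="valence (smoothed)")
plt.xlabel("Harvard sentence index")
plt.ylabel("valence")
plt.legend()
plt.savefig("valence_profile.png")
```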
As the generator of synthetic speech styles, we use the Mimic-3 TTS system, which provides 134 English voices and 204 foreign voices. Our StyleTTS2 pipeline, along with pre-generated styles, is also available at https://audeering.github.io/shift/.
Visualizations
We synthesize the 720 Harvard sentences (in standard order) via StyleTTS2, using five different choices of style audio:
Mimic-3 English Style of 1x or 4x speed
We use styles at 1x or 4x speed to observe their effect on prosody manipulation. The speed change is generated natively by Mimic-3, avoiding post-processing artefacts. We synthesize a different style for each Harvard sentence, using a different Mimic-3 voice. Audio examples are given above, and a code sketch of the pipeline follows.
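A minimal sketch of this pipeline, assuming the Mimic-3 CLI is installed and that the `compute_style` / `inference` helpers from the StyleTTS2 demo notebook are already defined in scope; the voice name, length-scale value, and file names are illustrative:

```python
# Sketch: generate a 4x-speed Mimic-3 style clip and use it as the acoustic
# style for StyleTTS2.  Assumes the Mimic-3 CLI and the compute_style() /
# inference() helpers from the StyleTTS2 demo notebook are available.
import subprocess
import soundfile as sf

style_text = "The birch canoe slid on the smooth planks."   # a Harvard sentence
voice = "en_US/vctk_low#p239"                                # illustrative Mimic-3 voice

# --length-scale 0.25 makes Mimic-3 speak roughly 4x faster, natively,
# so no post-processing of the audio is needed.
with open("style_4x.wav", "wb") as f:
    subprocess.run(
        ["mimic3", "--voice", voice, "--length-scale", "0.25", style_text],
        check=True, stdout=f)

# Feed the accelerated clip to StyleTTS2 as the style reference.
ref_s = compute_style("style_4x.wav")
out = inference("The birch canoe slid on the smooth planks.",
                ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1)
sf.write("styletts2_out.wav", out, 24000)    # StyleTTS2 outputs 24 kHz audio
```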
Mimic-3 Foreign Style of 1x or 4x speed
The foreign voices of Mimic-3 can pronounce the Harvard sentences, although with an accent. We use them as styles from diverse languages not seen during the training of StyleTTS2. Again, we generate foreign styles at 1x and 4x speed.
Natural Speech style
In both figures, the grey shadow indicates the emotional profile of StyleTTS2 using natural speech from EmoDB as style. EmoDB is a corpus of (acted), highly expressive, noise-free natural speech.
Figures
Notice the increase in emotion probabilities when feeding 4x-speed style .wav files, generated by speeding up Mimic-3 TTS, to StyleTTS2.
The figures above show the probabilities of emotions as well as the levels of Arousal / Dominance / Valence detected at the output of StyleTTS2 over the course of the 720 Harvard sentences, for the different styles.
The emotion probabilities appear similar across styles because the text is the same: text sentiment overwhelms SER detectors (does emotion recognition look at the text or at the voice?).
Higher MOS via TTS than natural speech: StyleTTS2 achieves MOS = 4.1 and a lower CER = 0.59% by using the Mimic-3 English 4x-speed TTS style instead of the natural speech styles EmoDB or LibriSpeech. Subtle differences between the left and right figures, such as the rise of valence, show the tonality variation brought by the use of the 4x-speed style. A different style yields a slightly different duration, causing a misalignment of the grey and blue lines. The high MOS is proportional to high Valence and Happy probability (blue line).
StyleTTS2 is not affected by the intelligibility of the style: it achieves a very low CER = 1.15% for natural speech and an even lower CER = 0.59% for Mimic-3 English 4x speed. Disgust / Anger is diminished when using Mimic-3 4x-speed styles. Valence and Happiness for StyleTTS2 via Mimic-3 (blue line at 4x speed) are almost always higher than for StyleTTS2 via EmoDB (natural speech), irrespective of the language.
The actual words in the style audio do not affect the valence of StyleTTS2; however, extra punctuation placed in the style text, such as "...!!!;", makes Mimic-3 produce unnatural noises like "hah/hiss/scratch" sounds. When those noises are fed to StyleTTS2, they trigger it to generate audible backchannels, such as sighs and breaths, that are pleasant to hear. Listen to the audios above! We run StyleTTS2 using the default style embedding and pitch-curve calculation. Mimic-3 styles are also synthesized using the default settings of Mimic-3, except for the variation of speed to 1x / 4x.
Conclusion
We discovered that synthetic speech styles amplify valence and the Happy emotion in the output of StyleTTS2. We also found that an accelerated (4x) Mimic-3 synthetic speech style increases the MOS to 4.1, compared with the MOS of 4.0 achieved using a human speech style.