|
<!DOCTYPE html> |
|
<html lang="en"> |
|
|
|
<head> |
|
<meta charset="UTF-8" /> |
|
<meta name="viewport" content="width=device-width, initial-scale=1.0" /> |
|
<title>MiniMax-Speech Tech Report | Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder</title> |
|
<meta name="description" |
|
content="MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech." /> |
|
<meta name="keywords" content="MiniMax-Speech,text-to-speech,TTS,zero-shot voice cloning,speaker encoder,Flow-VAE" /> |
|
<meta property="og:title" |
|
content="MiniMax-Speech Tech Report | Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder" /> |
|
<meta property="og:url" content="https://huggingface.co/spaces/MiniMaxAI/MiniMax-Speech-Tech-Report" /> |
|
<meta property="og:description" |
|
content="MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech." /> |
|
<meta property="og:type" content="website" /> |
|
|
|
<link rel="stylesheet" href="style.css" /> |
|
</head> |
|
|
|
<body id="top" class="text-justify"> |
|
<header |
|
style="background-image: url('assets/images/header-bg.jpeg'); background-size: cover; background-position: center; padding: 1rem 0; border-radius: 1rem;"> |
|
<h1>MiniMax-Speech</h1> |
|
<h4 style="font-size: 1.3rem; line-height: 1; text-align: center;">Intrinsic Zero-Shot Text-to-Speech |
|
with a |
|
Learnable Speaker |
|
Encoder</h4> |
|
<p class="author"> |
|
MiniMax Team <span class="date">May 2025</span><br /> |
|
<a style="font-size: 1.1rem;" target="_blank" |
|
href="https://huggingface.co/spaces/MiniMaxAI/MiniMax-Speech-Tech-Report/blob/main/MiniMax_Speech.pdf">[Tech |
|
Report]</a> |
|
</p> |
|
</header> |
|
|
|
<div class="abstract"> |
|
<h2>Abstract</h2> |
|
<p style="text-align: left;"> |
|
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates |
|
high-quality |
|
speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio |
|
without |
|
requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre |
|
consistent with |
|
the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high |
|
similarity to |
|
the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed |
|
Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and |
|
subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning |
|
metrics |
|
(Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. |
|
Another |
|
key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, |
|
is its |
|
extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion |
|
control |
|
via LoRA; text to voice (T2V) by synthesizing timbre features directly from a text description; and professional |
|
voice |
|
cloning (PVC) by fine-tuning timbre features with additional data. Visit |
|
<a href="https://www.minimax.io/audio">MiniMax Audio</a> to |
|
explore our powerful TTS features. |
|
</p> |
|
</div> |
|
|
|
<nav role="navigation" class="toc"> |
|
<h2>Contents</h2> |
|
<ol> |
|
<li> |
|
<a href="#architecture-overview">Architecture Overview</a> |
|
</li> |
|
<li> |
|
<a href="#expressiveness-demonstrations">Expressiveness Demonstrations</a> |
|
<ol> |
|
<li><a href="#showcase-with-high-versatility">Showcase with High Versatility</a></li> |
|
<li><a href="#showcase-with-multiple-generation-attempts">Showcase with Multiple Generation Attempts</a></li> |
|
</ol> |
|
</li> |
|
<li><a href="#zero-shot-vs-one-shot-demonstrations">Zero-Shot vs. One-Shot Demonstrations</a></li> |
|
<li><a href="#multilingual-and-cross-lingual-capabilities-demonstrations">Multilingual and Cross-Lingual |
|
Capabilities Demonstrations</a></li> |
|
<li><a href="#flow-vae-vs-vae-comparisons">Flow-VAE vs. VAE Comparisons</a></li> |
|
<li><a href="#professional-voice-clone-pvc-demonstrations">Professional Voice Clone (PVC) Demonstrations</a></li> |
|
<li><a href="#emotion-control-demonstrations">Emotion Control Demonstrations</a></li> |
|
<li><a href="#text-prompted-voice-generation-demonstrations">Text-Prompted Voice Generation Demonstrations</a> |
|
</li> |
|
<li><a href="#comparison-of-voice-naturalness">Comparison of Voice |
|
Naturalness with Previous-Generation Products</a></li> |
|
</ol> |
|
</nav> |
|
|
|
<main> |
|
<article> |
|
<div class="article-block"> |
|
<h2 id="architecture-overview">Architecture Overview</h2> |
|
<figure> |
|
<img src="assets/images/system-overview.jpg" loading="lazy" alt="System Architecture" width="100%" |
|
height="auto" /> |
|
<figcaption> |
|
An overview of the architecture of MiniMax-Speech. |
|
</figcaption> |
|
</figure> |
|
</div> |
|
|
|
<div class="article-block"> |
|
<h2 id="expressiveness-demonstrations">Expressiveness Demonstrations</h2> |
|
<h3 id="showcase-with-high-versatility">Showcase with High Versatility</h3> |
|
<div class="scroll-wrapper"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col" style="width: 40%;">Description</th> |
|
<th scope="col" style="width: 30%; text-align: center;">Source Audio</th> |
|
<th scope="col" style="width: 30%; text-align: center;">Generated Audio</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
A Compelling and Persuasive Speaker Voice |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Marketing_Voice_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Compelling%20and%20Persuasive.wav" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
A Clear and Explanatory Voice with Broad Emotional Dynamics Across Different Texts |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Science_Voice_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Explanatory%20Broad%20Emotional.wav" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
Another Explanatory Voice with Supernatural Prosody, <br> |
|
Featuring Distinct Ethnic and Age Characteristics |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Sociology_Sourse.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Explanatory%20Supernatural%20Prosody.MP3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
A Warm and Magnetic Voice that Brings Comfort |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Warm%20and%20Magnetic_Sourse.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Warm%20and%20Magnetic.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
An ASMR Whispering Voice with Generated Breathing and Sound Effects |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Breathy%20ASMR_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Breathy%20ASMR.MP3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
A Robotic Voice with Rich Bass Resonance and Spatial Presence |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Lucky%20Robot_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Lucky%20Robot.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
A Sardonic Mature Female Voice |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Onee-san_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Onee-san.wav" controls></audio> |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
|
|
<h3 id="showcase-with-multiple-generation-attempts">Showcase with Multiple Generation Attempts, Post-Processing |
|
Audio Effects and Added Sound Effects</h3> |
|
<div class="scroll-wrapper"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col" style="width: 50%;">Description</th> |
|
<th scope="col" style="width: 50%; text-align: center;">Generated Audio</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
A Husky Male Voice: From Soft Murmur to Excitement to Anger, then to Whispers |
|
</td> |
|
<td> |
|
<audio class="audio-lg" src="assets/audios/Murmur-Excitement-Anger-%20Whispers.MP3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
An Angry Female Voice: From Soft Murmur to Rage to Reminiscence, then to Weeping |
|
</td> |
|
<td> |
|
<audio class="audio-lg" src="assets/audios/Neutral-Rage-Reminiscence-Weeping.MP3" controls></audio> |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
</div> |
|
|
|
<div class="article-block"> |
|
<h2 id="zero-shot-vs-one-shot-demonstrations">Zero-Shot vs. One-Shot Demonstrations</h2> |
|
<p> |
|
Zero-shot cloning maintains speaker identity while generating more natural emotions, pauses, and other expressive |
|
features based on the text content, whereas one-shot cloning adheres more strictly to the speaker characteristics |
|
(prosody, speech rate, emotions, etc.) demonstrated in the audio prompt. The audio prompt is the additional input |
|
that one-shot cloning receives compared to zero-shot; see the technical report for details. |
|
</p> |
|
<div class="scroll-wrapper" style="margin-top: 2rem;"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col">Source Audio</th> |
|
<th scope="col">Text</th> |
|
<th scope="col">Zero-Shot Version</th> |
|
<th scope="col">One-Shot Version</th> |
|
<th scope="col">ElevenLabs Multilingual_v2</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_Prompt.WAV" controls></audio> |
|
</td> |
|
<td> |
|
命运就算颠沛流离,<br> |
|
命运就算曲折离奇,<br> |
|
命运就算恐吓着你,<br> |
|
做人没趣味。<br> |
|
别流泪,心酸,更不应舍弃。<br> |
|
我愿能,一生永远陪伴你。 |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_ZeroShot.mp3" controls></audio> |
|
Preserving Distinctive Voice<br> |
|
Timbre and Expressive <br> |
|
Prosody with Regularized <br> |
|
Pausing and Speech Rate |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_Oneshot.mp3" controls></audio> |
|
Better Reproduction of<br> |
|
Prompt's Exaggerated Speech<br> |
|
Rate and Characteristic<br> |
|
Phrase-Initial Pauses |
|
</td> |
|
<td> |
|
Cantonese not supported |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_Prompt.WAV" controls></audio> |
|
</td> |
|
<td> |
|
你们这些躲在道德高地的懦夫,<br> |
|
敢承认自己对本我的恐惧吗?<br> |
|
回答我!嗯?你回答我!<br> |
|
Look in my eyes!<br> |
|
老子写梦的解析时<br> |
|
你们还在玩泥巴,<br> |
|
我精神分析引论每个字母都能<br> |
|
刺穿文明社会的虚伪面具,<br> |
|
我解剖潜意识就像<br> |
|
外科医生划开皮肤。<br> |
|
是不是啊?说话! |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_ZeroShot.mp3" controls></audio> |
|
Capable of Generating<br> |
|
Relatively Calmer Emotions<br> |
|
while Preserving Voice<br> |
|
Identity |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_OneShot.mp3" controls></audio> |
|
Consistently Reproducing the<br> |
|
Angry Emotion from Prompt<br> |
|
in Every Utterance |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/ElevenLabs_Breaking%20Down%20Mandarin.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_Prompt.MP3" controls></audio> |
|
</td> |
|
<td> |
|
Would you believe what happened at the<br> |
|
grocery store today? My goodness! The<br> |
|
avocados were on sale - half price! Half<br> |
|
price! I bought twenty of them! |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_ZeroShot.MP3" controls></audio> |
|
Effectively follows textual cues<br> |
|
for both longer and shorter<br> |
|
inter-sentence pauses |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_OneShot.MP3" controls></audio> |
|
Better reproduces the<br> |
|
exaggerated high pitch<br> |
|
characteristic of anime voices<br> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/ElevenLabs_Quirky%20Female%20English.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_Prompt.MP3" controls></audio> |
|
</td> |
|
<td> |
|
Oh my gosh, like, I literally can't believe<br> |
|
what just happened! Um, so basically, I was,<br> |
|
you know, just sitting there in class,<br> |
|
right? And then, ugh, this totally weird<br> |
|
thing happened - like, seriously weird! Wait,<br> |
|
wait... Should I even be talking about this?<br> |
|
Ugh, whatever. |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_ZeroShot.MP3" |
|
controls></audio> |
|
Effectively follows textual cues<br> |
|
for both longer and shorter<br> |
|
inter-sentence pauses |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_OneShot.MP3" controls></audio> |
|
Better reproduces the<br> |
|
exaggerated high pitch<br> |
|
characteristic of anime voices<br> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/ElevenLabs_Neurotic%20Teenage%20English.mp3" |
|
controls></audio> |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
</div> |
|
|
|
<div class="article-block"> |
|
<h2 id="multilingual-and-cross-lingual-capabilities-demonstrations">Multilingual and Cross-Lingual Capabilities |
|
Demonstrations</h2> |
|
<p>Speech-02-HD maintains high naturalness in less common languages while demonstrating significant advantages |
|
in |
|
Standard |
|
Chinese pronunciation accuracy.</p> |
|
<div class="scroll-wrapper" style="margin-top: 2rem;"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col">Languages</th> |
|
<th scope="col">Source Audio</th> |
|
<th scope="col">Text</th> |
|
<th scope="col">MiniMax<br>Speech_02_HD</th> |
|
<th scope="col">ElevenLabs<br>Multilingual_v2</th> |
|
<th scope="col">OpenAI<br>TTS_1_HD<br>(*not cloned voice)</th> |
|
</tr> |
|
|
|
<tr class="border-bottom-thin"> |
|
<th>Thai</th> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Thai_Male_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
สวัสดีค่ะ วันนี้อากาศดีมากเลย<br> |
|
คุณจะไปทานอาหารกลางวันที่ไหนคะ<br> |
|
ฉันกำลังคิดว่าจะไปร้านอาหารไทยแถวนี้<br> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Thai.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Thai not fully supported |
|
<audio class="audio-sm" src="assets/audios/ElevenLabs_Thai.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/OpenAI_Thai.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
|
|
<tr class="border-bottom-thin"> |
|
<th>Vietnamese</th> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Vietnamese_Female_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
Tôi đang đọc một cuốn sách rất hay về lịch sử Việt Nam.<br> |
|
Những câu chuyện về văn hóa truyền<br> |
|
thống thật sự rất thú vị.<br> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Vietnamese.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Vietnamese not fully supported |
|
<audio class="audio-sm" src="assets/audios/ElevenLabs_Vietnamese.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/OpenAI_Vietnamese.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
|
|
<tr class="border-bottom-thin"> |
|
<th>Czech</th> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Czech_Female_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
Ranní mlha se pomalu zvedá nad řekou,<br> |
|
zatímco první paprsky slunce prosvítají mezi stromy.<br> |
|
Ptáci začínají svůj ranní koncert.<br> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Czech.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/ElevenLabs_Czech.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/OpenAI_Czech.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
|
|
<tr class="border-bottom-thin"> |
|
<th>Polish</th> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Polish_Male_Sourse.wav" controls></audio> |
|
</td> |
|
<td> |
|
Młoda sowa siedzi cicho na gałęzi sosny,<br> |
|
obserwując leśną polanę w świetle księżyca.<br> |
|
Wiatr delikatnie porusza liśćmi drzew.<br> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Polish.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/ElevenLabs_Polish.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/OpenAI_Polish.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
|
|
<tr class="border-bottom-thin"> |
|
<th>Japanese</th> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Japanese_DominantMan_Sourse.mp3" controls></audio> |
|
</td> |
|
<td> |
|
電車が遅延している影響で、渋谷駅がとても混雑<br> |
|
しています。次の山手線は約10分後に到着<br> |
|
予定です。お急ぎのお客様は、他の路線も<br> |
|
ご利用ください。 |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Japanese.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/ElevenLabs_Japanese_Dominant_Man.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/OpenAI_Japanese.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
<p style="margin-top: 4rem;">Speech-02-HD has superior performance in zero-shot cross-lingual scenarios.</p> |
|
<div class="scroll-wrapper" style="margin-top: 2rem;"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col">Original Language</th> |
|
<th scope="col">Source Audio</th> |
|
<th scope="col">Mixed Language</th> |
|
<th scope="col">Text</th> |
|
<th scope="col">MiniMax<br>Speech_02_HD</th> |
|
<th scope="col">ElevenLabs<br>Multilingual_v2</th> |
|
<th scope="col">OpenAI<br>TTS_1_HD<br>(*not cloned voice)</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td>English</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Wong_Sourse.mp3" controls></audio> |
|
</td> |
|
<td>English + Mandarin</td> |
|
<td> |
|
Kiddo! Come come come, 学如逆水行舟,不进则退。<br> |
|
I see you're using AI tools already - so smart!<br> |
|
But eh, cannot just rely on tools only lah!<br> |
|
The future belongs to those who can work alongside AI,<br> |
|
not those scared of it. |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/English-Mandarin.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/ElevenLabs_English-Mandarin.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/OpenAI_English-Mandarin.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td>Mandarin</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/ShiBanYu_Sourse.mp3" controls></audio> |
|
</td> |
|
<td>Mandarin + Cantonese</td> |
|
<td> |
|
老铁啊,多谢晒你送我呢本,广州话正音字典,咁好嘢喎!<br> |
|
我呢个大老爷们儿学广州话真系好难㗎!成日都分唔清声调啊。<br> |
|
嗱,而家有咗呢本书,什么都好啦。 |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Mandarin-Cantonese.wav" controls></audio> |
|
</td> |
|
<td> |
|
Cantonese not supported |
|
</td> |
|
<td> |
|
Cantonese not supported |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td>Mandarin</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/ShuanQ_Sourse.mp3" controls></audio> |
|
</td> |
|
<td>Mandarin + English</td> |
|
<td> |
|
The people said, 桂林's scenery is the first under heaven.<br> |
|
Yet in my opinion, 阳朔 scenery is better than 桂林。<br> |
|
群峰倒影山浮水,无水无山不入神。 |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Mandarin-English.WAV" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/ElevenLabs_Mandarin-English.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/OpenAI_Mandarin-English.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td>English</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/CoCo_Sourse.mp3" controls></audio> |
|
</td> |
|
<td>English + Spanish</td> |
|
<td> |
|
Mi abuelita always told me "el que persevera, alcanza".<br> |
|
If you persevere, you'll achieve your dreams!<br> |
|
Guess what! They choose me to play the lead role in our BIG show! |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/English-Spanish.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/ElevenLabs_English-Spanish.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/OpenAI_English-Spanish.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td>Japanese</td> |
|
<td> |
|
<audio class="audio-sm" src="assets/audios/Powerful_Girl_Sourse.mp3" controls></audio> |
|
</td> |
|
<td>Japanese + Korean</td> |
|
<td> |
|
最近の天気予報によりますと、今週末は桜の開花に最適<br> |
|
な気温になる予定です。<br> |
|
東京都内の各公園では花見客で賑わうことが予想されますが、<br> |
|
서울에서도 벚꽃이 피기 시작했다고 하네요.<br> |
|
이번 주말에는 여의도 공원에서 벚꽃 축제가 열린다고 하니<br> |
|
많은 분들이 찾아오실 것 같습니다. |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Japanese-Korean.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/ElevenLabs_Japanese-Korean.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/OpenAI_Japanese-Korean.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
<p>*OpenAI TTS does not currently support voice cloning; its samples are included as a naturalness reference for |
|
comparative listening.</p> |
|
</div> |
|
|
|
<div class="article-block"> |
|
<h2 id="flow-vae-vs-vae-comparisons">Flow-VAE vs. VAE Comparisons</h2> |
|
<p>Flow-VAE is less likely to produce the following instabilities.</p> |
|
<div class="scroll-wrapper" style="margin-top: 2rem;"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col" style="text-align: center;">Source Audio</th> |
|
<th scope="col" style="text-align: center;">Flow-VAE</th> |
|
<th scope="col" style="text-align: center;">VAE</th> |
|
<th scope="col" style="text-align: center;">Differences</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td style="width: 25%"> |
|
<audio src="assets/audios/Condition1.wav" controls></audio> |
|
</td> |
|
<td style="width: 25%"> |
|
<audio src="assets/audios/FlowVAE1.wav" controls></audio> |
|
</td> |
|
<td style="width: 25%"> |
|
<audio src="assets/audios/VAE1.wav" controls></audio> |
|
</td> |
|
<td> |
|
Flow-VAE reproduces more continuous<br> |
|
and natural reverberation |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio src="assets/audios/Condition2.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio src="assets/audios/FlowVAE2.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio src="assets/audios/VAE2.wav" controls></audio> |
|
</td> |
|
<td> |
|
VAE introduces unwanted<br> |
|
high-frequency components |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<audio src="assets/audios/Conditon3.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio src="assets/audios/FlowVAE3.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio src="assets/audios/VAE3.wav" controls></audio> |
|
</td> |
|
<td> |
|
VAE produces electronic-sounding<br> |
|
artifacts at the beginning |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
</div> |
|
|
|
<div class="article-block"> |
|
<h2 id="professional-voice-clone-pvc-demonstrations">Professional Voice Clone (PVC) Demonstrations</h2> |
|
<p>For more complex dialectal accents and tonal characteristics, PVC can reproduce these features while |
|
maintaining high |
|
naturalness based on the text content.</p> |
|
<div class="scroll-wrapper" style="margin-top: 2rem;"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col" style="text-align: center;">Source Audio</th> |
|
<th scope="col" style="text-align: center;">Zero-Shot</th> |
|
<th scope="col" style="text-align: center;">PVC</th> |
|
<th scope="col" style="text-align: center;">Differences</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td style="width: 25%"> |
|
<audio src="assets/audios/JosephBrodsky_Source.wav" controls></audio> |
|
</td> |
|
<td style="width: 25%"> |
|
<audio src="assets/audios/JosephBrodsky_Fast.mp3" controls></audio> |
|
</td> |
|
<td style="width: 25%"> |
|
<audio src="assets/audios/JosephBrodsky_PVC.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Like the Zero-Shot version, the PVC<br> |
|
version has rising sentence-final intonation,<br> |
|
but distinctively sustains this<br> |
|
elevated pitch instead of the typical<br> |
|
pitch declination found in common<br> |
|
declarative sentences |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio src="assets/audios/TianJin_Source.wav" controls></audio> |
|
</td> |
|
<td> |
|
<audio src="assets/audios/TianJin_Fast.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio src="assets/audios/TianJin_PVC.mp3" controls></audio> |
|
</td> |
|
<td> |
|
With additional data, the model not only<br> |
|
reproduces the speaker's voice characteristics<br> |
|
but also accurately captures more<br> |
|
dialectal features |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
</div> |
|
|
|
<div class="article-block"> |
|
<h2 id="emotion-control-demonstrations">Emotion Control Demonstrations</h2> |
|
<h3>Source Audio for Refreshing Young Man</h3> |
|
<audio src="assets/audios/Mandarin_Refreshing_Young_Man_Sourse.mp3" controls></audio> |
|
<h3>DEMO</h3> |
|
<div class="scroll-wrapper"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col">Neutral</th> |
|
<th scope="col" style="min-width: 120px;">Emotion</th> |
|
<th scope="col">Text</th> |
|
<th scope="col">Emotion Control Audio</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Neutral1.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Surprised |
|
</td> |
|
<td> |
|
天哪!我完全没想到会在这里遇见你,<br> |
|
都过去这么多年了,你一点都没变! |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Surprised.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Neutral2.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Disgusted |
|
</td> |
|
<td> |
|
这个地方实在太脏乱了,到处都是垃圾和难闻的气味儿,<br> |
|
我一秒钟都不想多待。 |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Disgusted.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Neutral3.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Fearful |
|
</td> |
|
<td> |
|
深夜回家的路上,我清楚地听见身后有脚步声在跟着我,<br> |
|
可是回头却什么都看不见。 |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Fearful.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Neutral4.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Angry |
|
</td> |
|
<td> |
|
我付出了这么多,换来的却是这样的背叛!<br> |
|
你怎么可以这样对待我的信任! |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Angry.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Neutral5.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Sad |
|
</td> |
|
<td> |
|
躺在床上翻来覆去,心里压着说不出的难过和沮丧,<br> |
|
昨天晚上又失眠了。 |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Sad.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Neutral6.mp3" controls></audio> |
|
</td> |
|
<td> |
|
Happy |
|
</td> |
|
<td> |
|
和好朋友一起在院子里烧烤,聊着有趣的故事,<br> |
|
享受着美食和欢乐的时光。 |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Happy.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
</div> |
|
|
|
<div class="article-block"> |
|
<h2 id="text-prompted-voice-generation-demonstrations">Text-Prompted Voice Generation Demonstrations</h2> |
|
<div class="scroll-wrapper"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col">Prompt</th> |
|
<th scope="col">Text</th> |
|
<th scope="col" style="text-align: center;">Audio</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
男性中年声音,说中文,音色浑厚醇厚,带有自然的磁性,<br> |
|
语速偏慢,音量适中,音调偏低沉。声音整体给人沉稳可靠的感觉,<br> |
|
在深度访谈场景中表现出专业性和亲和力,音质清晰,吐字规整有力。 |
|
</td> |
|
<td> |
|
在这个安静的夜晚,让我们一起走进《人生笔记》这本书。<br> |
|
作者用平实的文字记录下生活中的点点滴滴,<br> |
|
让我们看到平凡中的真善美。<br> |
|
今天,我们先来读第一章:'生活的痕迹'...... |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/深度访谈男中年.wav" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
说中文的女青年,音色偏甜美,语速比较快,<br> |
|
说话时带着一种轻快的感觉,整体音调较高,像是在直播带货,<br> |
|
整体氛围比较活跃,声音清晰,听起来很有亲和力。 |
|
</td> |
|
<td> |
|
亲爱的宝宝们,等了好久的神仙面霜终于到货啦!<br> |
|
你们看这个包装是不是超级精致?<br> |
|
我自己已经用了一个月了,效果真的绝绝子!<br> |
|
而且这次活动价真的太划算了,错过真的会后悔的哦~ |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/直播带货女青年.wav" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
中国男性声音,听着像是青年,音色清亮,语速比较快,<br> |
|
说话很有激情,像是在解说比赛,声音中带着紧张和兴奋的感觉。 |
|
</td> |
|
<td> |
|
漂亮!这个进攻太精彩了!张伟突破防线,<br> |
|
一个漂亮的转身,球传到禁区,王超跟上,射门!<br> |
|
球进了!难以置信的精彩配合,现场观众都沸腾了! |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/体育解说男青年.wav" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
中国女青年的声音,音色清脆,说话速度偏快,语调活泼,<br> |
|
像是在做游戏直播,声音中带着愉快的感觉,整体音调较高,<br> |
|
整体氛围比较轻松。 |
|
</td> |
|
<td> |
|
啊!这里有个宝箱!让我们看看里面是什么~<br> |
|
哇!是传说中的紫色装备!运气也太好了吧!<br> |
|
谢谢小伙伴们的打赏,我们继续往前探索...... |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/游戏主播女青年.wav" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
English-speaking female voice, sounding relatively young,<br> |
|
with a sweet and pleasant tone. Speaking at a moderate pace<br> |
|
with a touch of energy, similar to someone narrating a<br> |
|
beauty/makeup tutorial video. The overall atmosphere is<br> |
|
relaxed and cheerful. |
|
</td> |
|
<td> |
|
Hi everyone! Today I'll be sharing a soft, romantic<br> |
|
makeup look that's perfect for dates. Many of you have <br> |
|
been asking how to apply this eyeshadow naturally - the<br> |
|
key is using gentle techniques. Let's go through the<br> |
|
steps together... |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/美妆女博主.wav" controls></audio> |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
English-speaking middle-aged male voice, slightly husky, <br> |
|
speaking at a moderate-to-slow pace with a deep tone. Like<br> |
|
someone telling an old story, conveying a nostalgic feeling,<br> |
|
with a relaxed and composed manner of speaking. |
|
</td> |
|
<td> |
|
That was back in the late 1970s. I remember when our <br> |
|
village first got electricity - everyone was so excited. <br> |
|
In the evenings, people would bring their stools and <br> |
|
gather under the big banyan tree by the village committee <br> |
|
office to watch movies projected on the wall. Even now, <br> |
|
thinking back to those moments still fills me with warmth. |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/回忆男中年.wav" controls></audio> |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
</div> |
|
|
|
<div class="article-block"> |
|
<h2 id="comparison-of-voice-naturalness">Comparison of Voice Naturalness |
|
with Previous-Generation Products</h2> |
|
<p>The new model demonstrates significant advantages in naturalness compared to previous-generation products.</p> |
|
<h3 style="margin-top: 2rem;">Source Audio for Radiant_Girl</h3> |
|
<audio src="assets/audios/English_Radiant_Girl_Sourse.wav" controls></audio> |
|
<h3>DEMO</h3> |
|
<div class="scroll-wrapper"> |
|
<table style="width: 100%;"> |
|
<tbody> |
|
<tr class="border-bottom-thin"> |
|
<th scope="col">Text</th> |
|
<th scope="col" style="text-align: center;">MiniMax<br>Speech_02_HD</th> |
|
<th scope="col" style="text-align: center;">Microsoft<br>Azure TTS</th> |
|
<th scope="col" style="text-align: center;">AWS<br>Polly</th> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
I sat alone in the empty room, staring at the old photographs,<br> |
|
wondering how everything could change so quickly,<br> |
|
how a lifetime of memories could fade away just like that. |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Radiant_Girl_1.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Emma_1.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Joanna_1.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
<tr class="border-bottom-thin"> |
|
<td> |
|
The moment I held my acceptance letter, my heart burst with joy - <br> |
|
all those sleepless nights finally paid off, and I couldn't stop<br> |
|
dancing around the room, calling everyone I knew to share this amazing news! |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Radiant_Girl_2.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Emma_2.mp3" controls></audio> |
|
</td> |
|
<td> |
|
<audio class="audio-md" src="assets/audios/Joanna_2.mp3" controls></audio> |
|
</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
</div> |
|
</article> |
|
</main> |
|
|
|
<script> |
|
MathJax = { |
|
tex: { |
|
inlineMath: [['$', '$'],], |
|
}, |
|
} |
|
|
|
// Guard with optional chaining: this page may not contain a #dark-mode-toggle element. |
|
const darkModeToggle = document.getElementById('dark-mode-toggle') |
|
darkModeToggle?.addEventListener('click', () => { |
|
document.body.classList.toggle('latex-dark') |
|
}) |
|
</script> |
|
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script> |
|
</body> |
|
|
|
</html> |