AlexK-PL committed on
Commit
6b18a93
1 Parent(s): 46d8db5

Update General Model Description

Updating Matcha description. Also adding Vocos model description

Files changed (1)
  1. about.md +11 -6
about.md CHANGED
@@ -18,13 +18,18 @@ Here you'll be able to find all the information regarding our model, which has b

  ## General Model Description

- **Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS.
- On the one hand, the encoder part is based on a text encoder and a phoneme duration prediction. Together, they predict averaged acoustic features.
- On the other hand, the decoder has essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), which is based on the Transformer architecture.
- In the latter, by replacing 2D CNNs by 1D CNNs, a large reduction in memory consumption and fast synthesis is achieved.
+ **Matcha-TTS** is a non-autoregressive encoder-decoder model designed for fast acoustic modelling in TTS.
+ The encoder part processes input sequences of phonemes and, together with a phoneme duration predictor, outputs averaged acoustic features. The decoder,
+ which is essentially a U-Net backbone based on the Transformer architecture, predicts the refined spectrogram.
+ The model is trained with optimal-transport conditional flow matching.
+ This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps.
+
+ **Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
+ Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
+ Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform.
+ The goal of this model is to provide an alternative to HiFi-GAN that is faster and compatible with the acoustic output of several TTS models.
+ This version is tailored for the Catalan language, as it was trained only on Catalan speech datasets.

- **Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
- This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.

  ## Adaptation to Catalan

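Reviewer note: the claim that OT-CFM training yields an ODE-based decoder needing few synthesis steps can be illustrated with a toy sketch. The code below is not the Matcha-TTS implementation. It assumes the standard OT-CFM straight-line path between a noise sample `x0` and a data sample `x1`, uses an illustrative `SIGMA_MIN` value, and replaces the trained decoder with a hypothetical stand-in (`field`) that returns the exact conditional velocity. All function names are made up for this example.

```python
import random

SIGMA_MIN = 1e-4  # assumed small noise floor; value chosen only for illustration


def ot_cfm_point(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point at time t on the straight OT-CFM path from noise x0 towards data x1."""
    return [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def ot_cfm_target(x0, x1, sigma_min=SIGMA_MIN):
    """Velocity of that path (constant in t); a network would be regressed onto it."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]


def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for one mel-spectrogram frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample


def field(x, t):
    """Hypothetical stand-in for the trained decoder: the exact conditional field."""
    return ot_cfm_target(x0, x1)


few = euler_sample(field, x0, n_steps=2)
many = euler_sample(field, x0, n_steps=50)
```

Because the path is a straight line with constant velocity, 2 Euler steps and 50 Euler steps land on the same endpoint in this toy setting. A learned vector field is only approximately straight, so a real decoder still needs several steps, but this is the intuition behind the "fewer synthesis steps" statement.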