metadata
license: other

VALL-E Korean Model

Introduction

The VALL-E Korean model is an implementation of the VALL-E architecture for the Korean language. It is a zero-shot text-to-speech synthesizer that generates natural-sounding speech from Korean text input. It uses the espeak text phonemizer with the language='ko' option to convert text to phonemes, and the EnCodec audio tokenizer from Facebook Research's EnCodec repository to convert audio to discrete tokens.
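
As a rough illustration of the phonemization step, the sketch below calls espeak with language='ko' through the `phonemizer` Python package. The choice of that package, the helper name `korean_phonemes`, and the `strip=True` setting are assumptions made for this example; the model's own preprocessing may differ. It needs `phonemizer` and an espeak or espeak-ng installation with Korean support.

```python
def korean_phonemes(text: str, language: str = "ko") -> str:
    """Convert text to a phoneme string using the espeak backend.

    Hypothetical helper for illustration; not taken from the model's code.
    """
    # Imported lazily so this module can be loaded even when the
    # phonemizer package (an assumed dependency) is not installed.
    from phonemizer import phonemize

    # strip=True drops the trailing separators from the output string.
    return phonemize(text, language=language, backend="espeak", strip=True)
```

Calling `korean_phonemes("안녕하세요")` returns the phoneme string that espeak produces for the greeting. The exact symbols depend on the installed espeak version.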

Model Details

  • Architecture: The VALL-E Korean model consists of both ar (autoregressive) and nar (non-autoregressive) models.
  • Hidden Dimensions: The model has a hidden dimension of 1024.
  • Transformer Layers: It comprises 12 transformer layers.
  • Attention Heads: Each layer has 16 attention heads.
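
The sizes listed above can be collected into a small configuration object to derive the per-head dimension and a rough parameter count. This is a minimal sketch: the class name `TransformerConfig` and the feed-forward expansion factor of 4 are assumptions, since the model card does not state the feed-forward width, and the estimate ignores embeddings, biases, and layer norms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    hidden_dim: int = 1024   # from the model card
    num_layers: int = 12     # from the model card
    num_heads: int = 16      # from the model card
    ffn_multiplier: int = 4  # ASSUMPTION: not stated in the model card

    @property
    def head_dim(self) -> int:
        """Width of each attention head."""
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        return self.hidden_dim // self.num_heads

    def approx_layer_params(self) -> int:
        """Weight count of one layer, ignoring biases and layer norms."""
        d = self.hidden_dim
        attention = 4 * d * d                         # Q, K, V, output projections
        feed_forward = 2 * d * (self.ffn_multiplier * d)  # up and down projections
        return attention + feed_forward

    def approx_stack_params(self) -> int:
        """Weight count of the full layer stack under the same simplifications."""
        return self.num_layers * self.approx_layer_params()


cfg = TransformerConfig()
print(cfg.head_dim)               # 64
print(cfg.approx_stack_params())  # 150994944
```

Under these assumptions each transformer stack holds roughly 151 million weights. If the ar and nar models both use the listed sizes, that figure applies to each of them separately.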

Training Data

The training data for the VALL-E Korean model consists of approximately 2000 hours of Korean audio-text pairs. This dataset was sourced from AI-Hub 한국인 대화음성 (Korean Conversational Speech).
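
To give a sense of scale, the arithmetic below converts 2000 hours of audio into a count of EnCodec tokens. It assumes the 24 kHz EnCodec model (75 frames per second) with 8 codebooks, the setting used in the original VALL-E paper. The model card does not state which EnCodec configuration was used, so treat the result as an estimate only. The function name `acoustic_token_budget` is hypothetical.

```python
def acoustic_token_budget(hours: float,
                          frame_rate_hz: int = 75,  # ASSUMPTION: 24 kHz EnCodec
                          num_codebooks: int = 8):  # ASSUMPTION: 8 codebooks
    """Return (frames, total codes) for a given amount of audio."""
    seconds = hours * 3600
    frames = seconds * frame_rate_hz   # one frame per codebook step
    codes = frames * num_codebooks     # every frame carries one code per codebook
    return frames, codes


frames, codes = acoustic_token_budget(2000)
print(frames)  # 540000000
print(codes)   # 4320000000
```

Under these assumptions the corpus corresponds to about 540 million frames, or about 4.3 billion codes across all codebooks.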

Example Usage

For an example of how to use the VALL-E Korean model, you can refer to the provided Google Colab notebook: tester_colab.ipynb. This notebook demonstrates how to perform text-to-speech synthesis using the model. Additionally, the example incorporates the vocos decoder from Plachtaa's VALL-E repository.

References

For more information and details on using the model, please refer to the provided references and resources.