	---
	license: other
	---
	# VALL-E Korean Model

	## Introduction

	The VALL-E Korean model is an implementation of the VALL-E architecture for the Korean language. It is a zero-shot text-to-speech synthesizer that generates natural-sounding Korean speech from text input. Its components include the espeak text phonemizer, used with the language='ko' option, and the EnCodec audio tokenizer from [Facebook Research's EnCodec repository](https://github.com/facebookresearch/encodec).
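As a rough illustration of the two preprocessing components named above, the sketch below phonemizes Korean text through espeak and tokenizes audio with EnCodec. Several details are assumptions that this model card does not state: the use of the `phonemizer` Python package as the espeak front end, the 24 kHz EnCodec model, and the 6 kbps bandwidth (8 codebooks at 75 frames per second, as in the VALL-E paper). The helper names are hypothetical.

```python
import math
from typing import Tuple

# Assumed EnCodec settings (not stated in this model card): the 24 kHz model
# at 6 kbps, which yields 8 codebooks at 75 frames per second.
SAMPLE_RATE = 24_000
HOP_LENGTH = 320      # 24000 / 320 = 75 frames per second
NUM_CODEBOOKS = 8


def expected_code_shape(duration_s: float) -> Tuple[int, int]:
    """Approximate (codebooks, frames) for a clip of the given length."""
    frames = math.ceil(duration_s * SAMPLE_RATE / HOP_LENGTH)
    return NUM_CODEBOOKS, frames


def phonemize_korean(text: str) -> str:
    """Phonemize Korean text with espeak (needs phonemizer and espeak-ng)."""
    from phonemizer import phonemize  # lazy import: optional dependency
    return phonemize(text, language="ko", backend="espeak", strip=True)


def encode_audio(path: str):
    """Tokenize an audio file into EnCodec codes of shape [1, codebooks, frames]."""
    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)  # assumption: 6 kbps
    wav, sr = torchaudio.load(path)
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)
    with torch.no_grad():
        frames = model.encode(wav.unsqueeze(0))
    # Each element of `frames` is a (codes, scale) pair; join them along time.
    return torch.cat([codes for codes, _scale in frames], dim=-1)
```

Under these assumptions, `expected_code_shape(10.0)` gives `(8, 750)`, so a 10-second prompt corresponds to roughly 750 frames per codebook.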

	## Model Details

	- Architecture: The VALL-E Korean model consists of two models, an autoregressive (ar) model and a non-autoregressive (nar) model.
	- Hidden Dimensions: The model has a hidden dimension of 1024.
	- Transformer Layers: It comprises 12 transformer layers.
	- Attention Heads: Each layer has 16 attention heads.
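The hyperparameters listed above can be collected in a small config object, which makes the per-head dimension and a rough size estimate easy to derive. This is a sketch only: the class name is hypothetical, the 4x feed-forward width is an assumption not stated in this model card, and the estimate covers a single transformer stack while ignoring biases, layer norms, and embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValleKoreanConfig:
    # Values taken from the model card.
    hidden_dim: int = 1024
    num_layers: int = 12
    num_heads: int = 16
    # Assumption: feed-forward width is 4x the hidden size (not in the card).
    ffn_multiplier: int = 4

    @property
    def head_dim(self) -> int:
        """Dimension handled by each attention head."""
        return self.hidden_dim // self.num_heads

    def approx_transformer_params(self) -> int:
        """Weight-matrix parameters in one stack (no biases, norms, embeddings)."""
        attention = 4 * self.hidden_dim ** 2  # Q, K, V, and output projections
        feed_forward = 2 * self.ffn_multiplier * self.hidden_dim ** 2
        return self.num_layers * (attention + feed_forward)


config = ValleKoreanConfig()
print(config.head_dim)                     # 64
print(config.approx_transformer_params())  # 150994944
```

If the ar and nar models each use these settings, the estimate of about 151M weights would apply to each of them separately.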

	## Training Data

	The training data for the VALL-E Korean model consists of approximately 2000 hours of Korean audio-text pairs. This dataset was sourced from [AI-Hub 한국인 대화음성](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=130).

	## Example Usage

	For an example of how to use the VALL-E Korean model, see the provided Google Colab notebook: [tester_colab.ipynb](https://huggingface.co/LearnItAnyway/vall-e_korean/blob/main/tester_colab.ipynb). The notebook demonstrates text-to-speech synthesis with the model. The example also uses the vocos decoder, following [Plachtaa's VALL-E-X repository](https://github.com/Plachtaa/VALL-E-X).
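For readers who want to see what the vocos decoding step can look like outside the notebook, here is a minimal sketch based on the public Vocos usage for EnCodec tokens. It is not taken from the notebook, whose code may differ. The pretrained checkpoint name `charactr/vocos-encodec-24khz` and the 6 kbps bandwidth are assumptions, and the helper names are hypothetical.

```python
# Bandwidths (kbps) supported by the pretrained Vocos EnCodec decoder,
# listed in the order of their bandwidth ids.
VOCOS_BANDWIDTHS = [1.5, 3.0, 6.0, 12.0]


def bandwidth_id_for(kbps: float) -> int:
    """Map a bandwidth in kbps to the index that Vocos expects."""
    return VOCOS_BANDWIDTHS.index(kbps)


def decode_with_vocos(codes, kbps: float = 6.0):
    """Decode EnCodec codes of shape [codebooks, frames] into a waveform tensor."""
    import torch
    from vocos import Vocos  # lazy import: optional dependency

    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
    features = vocos.codes_to_features(codes)
    bandwidth_id = torch.tensor([bandwidth_id_for(kbps)])
    return vocos.decode(features, bandwidth_id=bandwidth_id)
```

The codes passed to `decode_with_vocos` would be the tokens produced by the ar and nar models for the target utterance; at 6 kbps that means 8 codebooks per frame.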

	## References

	- [Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/abs/2301.02111)
	- [VALL-E Repository by lifeiteng](https://github.com/lifeiteng/vall-e)
	- [Enhuiz's VALL-E Repository](https://github.com/enhuiz/vall-e)
	- [VALL-E-X Repository by Plachtaa](https://github.com/Plachtaa/VALL-E-X)
	- [Vocos](https://github.com/charactr-platform/vocos)

	For more information and details on using the model, please refer to the provided references and resources.

	# Update

	We also trained the model on an 8k dataset from [AI Hub](https://www.aihub.or.kr); the resulting model is uploaded as v1.
	The v1 model performs better when the audio source is clean (e.g., voice-source), but it may not work well when the audio source is of poor quality.
	Therefore, both v0 and v1 are maintained.