---
license: apache-2.0
language:
- zh
library_name: transformers.js
---

# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

VITS is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.

## Model Details

- **Languages:** Chinese
- **Dataset:** THCHS-30
- **Speakers:** 44
- **Training hours:** 48

## Usage

Using this checkpoint from Hugging Face Transformers:

```py
from transformers import VitsModel, VitsTokenizer
from pypinyin import lazy_pinyin, Style
import torch

model = VitsModel.from_pretrained("BricksDisplay/vits-cmn")
tokenizer = VitsTokenizer.from_pretrained("BricksDisplay/vits-cmn")

# Convert the Chinese text to pinyin with tone marks before tokenizing
text = "中文"
payload = ''.join(lazy_pinyin(text, style=Style.TONE, tone_sandhi=True))

inputs = tokenizer(payload, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs, speaker_id=0)

from IPython.display import Audio
Audio(output.waveform[0].numpy(), rate=16000)
```

Using this checkpoint from Transformers.js:

```js
import { pipeline } from '@xenova/transformers';
import { pinyin } from 'pinyin-pro'; // convert the input to pinyin; here we use `pinyin-pro`

const synthesizer = await pipeline('text-to-audio', 'BricksDisplay/vits-cmn', { quantized: false });

console.log(await synthesizer(pinyin("中文")));
// {
//   audio: Float32Array(?) [ ... ],
//   sampling_rate: 16000
// }
```

Note: the Transformers.js (ONNX) version does not support `speaker_id`; it is fixed to 0.
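
Because the ONNX path fixes the speaker, speaker selection has to happen on the Python side. Below is a minimal sketch, not part of the original card, that reuses `model`, `inputs`, and the 16 kHz sampling rate from the Transformers example above, renders the same text with a few different speaker IDs, and writes each result to a WAV file via `scipy`; the filenames are arbitrary, and it assumes speaker IDs are 0-indexed up to the 44 speakers listed in Model Details.

```py
import numpy as np
from scipy.io import wavfile

# Render the same input with several speakers and save each take.
# Assumption: speaker IDs are 0-indexed, so valid values are 0..43 for this checkpoint.
for speaker_id in range(3):
    with torch.no_grad():
        output = model(**inputs, speaker_id=speaker_id)
    # output.waveform has shape (batch, samples); take the first item as float32
    waveform = output.waveform[0].numpy().astype(np.float32)
    # Write a 16 kHz float32 WAV file (filename is arbitrary)
    wavfile.write(f"vits-cmn-speaker{speaker_id}.wav", rate=16000, data=waveform)
```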