
inference without voice cloning

#9
by gqd - opened

Hey

  • All the examples show how to produce output by passing a reference speaker voice
  • I'm wondering if it's possible to fine-tune on a speaker voice and then run inference without passing a reference sample, to reduce latency?

Thx

gqd changed discussion title from usage without voice cloning to inference without voice cloning
Coqui.ai org

Once you have calculated the latents, you can pass the same latents to every inference call thereafter; that reduces inference time.
Please check the code at https://huggingface.co/spaces/coqui/xtts/blob/main/app.py#L233
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=speaker_wav, gpt_cond_len=30, max_ref_length=60)
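
A minimal sketch of that pattern, assuming a locally downloaded coqui/XTTS-v2 checkpoint; the paths, the reference file name, and the `torch.save` caching are placeholders added for illustration, not part of the linked app.py:

```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load a local XTTS-v2 checkpoint (paths are placeholders).
config = XttsConfig()
config.load_json("xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts_v2/", eval=True)
model.cuda()

# Compute the conditioning latents once from a reference clip,
# mirroring the get_conditioning_latents call in app.py above.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference_speaker.wav", gpt_cond_len=30, max_ref_length=60
)

# Optionally cache them to disk so later runs skip this step entirely
# (my own addition; the linked Space does not necessarily do this).
torch.save(
    {"gpt": gpt_cond_latent, "spk": speaker_embedding},
    "speaker_latents.pt",
)

# Every subsequent request reuses the same latents; the reference
# audio is never read again, which is where the latency saving comes from.
latents = torch.load("speaker_latents.pt")
out = model.inference(
    "This sentence is synthesized without re-reading the reference wav.",
    "en",
    latents["gpt"],
    latents["spk"],
)
# out["wav"] holds the synthesized waveform.
```

Whether you cache in memory or on disk is a deployment choice; either way, get_conditioning_latents is the only step that needs the reference audio, so reusing its outputs removes that cost from each request.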

Hey @gorkemgoknar

Is it possible to fine-tune coqui/XTTS-v2 and run inference entirely as a single-speaker model, to remove the additional latency of using the latents?

Or would that not make much of a difference compared to using the precomputed latents as you suggested?

Thx
