Inference speed

#2 opened by andreasrath

Many thanks for releasing such an amazing model!

Would appreciate it if you could share some numbers regarding inference speed and GPU memory requirements for inference, for example on an A100, H100, T4, or consumer graphics cards.

Best, Andreas

Hey @andreasrath ! Awesome to see you're interested in running the model 🤗 We've not focussed on inference speed yet, since there are a bunch of optimisations we still need to make that should give us free speed-ups (flash attention, torch compile). We'll release these inference + training optimisations as part of the v1 release in a few weeks' time, where we'll also include some benchmarks on inference speed and memory requirements.

In the meantime, we'd be more than happy to help you run some inference speed benchmarks using the v0.1 Mini model, if that's of interest. We could use a similar methodology to this Bark example by @ylacombe . You should be able to load the Parler-TTS model and then drop it into the benchmarking code. The only update you need to make is removing the args fine_temperature=0.4, coarse_temperature=0.8 from the call to .generate.
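
For reference, a minimal timing loop along those lines could look something like the sketch below. It's just an illustrative setup (arbitrary prompt, description, and number of runs), not the full Bark benchmarking methodology: we do a warm-up generation first, then average the latency of a few subsequent calls to .generate.

import time

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the v0.1 Mini checkpoint and its tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A female speaker delivers her words quite expressively, with clear audio quality."
prompt = "Hey, how are you doing today?"

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Warm-up run: the first call pays one-off costs that shouldn't count towards the benchmark
_ = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)

# Time a handful of generations and report the average latency
num_runs = 5
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.time()
for _ in range(num_runs):
    _ = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
if torch.cuda.is_available():
    torch.cuda.synchronize()
print(f"Average latency over {num_runs} runs: {(time.time() - start) / num_runs:.2f} s")

Note that generation is sampled, so the output length (and therefore the latency) varies between runs; averaging over several runs smooths this out somewhat.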

In terms of GPU requirements, Parler-TTS is built on top of Hugging Face Transformers, which places no constraints on the GPU. The only requirements are that you're running Python 3.6+ and PyTorch 1.1.0+. I ran the example code snippet below on an 80GB A100 GPU using PyTorch 2.2.2, and memory peaked at 3.6GB for fp32 inference and 1.8GB for fp16 inference, so we can be pretty confident the model will run with batch size 1 on any hardware with 8GB of VRAM or above.

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Run on GPU in fp16 if available, otherwise fall back to CPU in fp32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1", torch_dtype=torch_dtype).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

# Tokenize the voice description and the transcript prompt
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the audio and save it to disk (cast back to fp32 so soundfile can write the array)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.to(torch.float32).cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
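
To reproduce the peak-memory measurement, one rough approach (a sketch of the methodology, reusing model, input_ids, and prompt_input_ids from the snippet above) is to reset the CUDA peak-memory counter before generating and read it back afterwards:

if torch.cuda.is_available():
    # Clear any previously recorded peak so the reading reflects this generation (plus the resident model weights)
    torch.cuda.reset_peak_memory_stats(device)
    _ = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"Peak GPU memory: {peak_gb:.2f} GB")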

Hope that helps!

Here are my benchmarks! Pretty impressive for a v0.1 model. Looking forward to v1.0.

[two benchmark screenshots attached]

@sanchit-gandhi many thanks for the information on the model and the example code!

@ctranslate2-4you very impressive! Many thanks for sharing your benchmarking here! Would you be able to attach the generated audio samples here as well? Which texts did you use?

It would also be interesting to see how it compares to Glow-TTS/HiFi-GAN or VITS, although that would be a somewhat unfair comparison, since those models do not generate a completely new voice but rather were trained on audio of a specific voice.
