Inference speed

#2
by andreasrath - opened

Many thanks for releasing such an amazing model!

I would appreciate it if you could share some numbers on inference speed and GPU memory requirements, for example on an A100, H100 or T4, or on consumer graphics cards.

Best, Andreas

Hey @andreasrath ! Awesome to see you're interested in running the model 🤗 We've not focussed on inference speed yet, since there are a bunch of optimisations that we still need to make that should give us free speed-ups (flash attention, torch compile). We'll release these inference and training optimisations as part of the v1 release in a few weeks' time, where we'll also include some benchmarks on inference speed and memory requirements.

In the meantime, we'd be more than happy to help you run some inference speed benchmarks using the v0.1 Mini model, if that's of interest. We could use a similar methodology to this Bark example by @ylacombe . You should be able to load the Parler-TTS model and then drop it into the benchmarking code. The only update you need to make is to remove the arguments fine_temperature = 0.4 and coarse_temperature = 0.8 from the call to .generate.

In terms of GPU requirements, Parler-TTS is built on top of Hugging Face Transformers, which places no constraints on the GPU. The only requirements are Python 3.6+ and PyTorch 1.1.0+. I ran the example code snippet below on an 80GB A100 GPU using PyTorch 2.2.2, and memory peaked at 3.6GB for fp32 inference and 1.8GB for fp16 inference, so we can be pretty confident the model will run with batch size 1 on any hardware with 8GB of VRAM or above.

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1", torch_dtype=torch_dtype).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
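
For anyone who wants to reproduce latency and peak-memory numbers like the ones above, here is a minimal timing sketch. The `benchmark` helper and its parameters are hypothetical names, not part of Parler-TTS or the Bark benchmarking code; the only PyTorch calls used are `torch.cuda.synchronize`, `torch.cuda.reset_peak_memory_stats` and `torch.cuda.max_memory_allocated`. It assumes peak *allocated* memory is the number of interest, which may differ from what `nvidia-smi` reports.

```python
import statistics
import time

try:
    import torch
except ImportError:  # the timing part still works without PyTorch
    torch = None


def benchmark(fn, n_warmup=1, n_runs=3):
    """Time fn() over n_runs and, on CUDA, report peak allocated memory in GB."""
    use_cuda = torch is not None and torch.cuda.is_available()

    # Warm-up runs absorb one-off costs (kernel loading, cache allocation).
    for _ in range(n_warmup):
        fn()

    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()

    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        if use_cuda:
            # CUDA kernels run asynchronously, so wait before stopping the clock.
            torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)

    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if use_cuda else None
    return {
        "mean_s": statistics.mean(timings),
        "min_s": min(timings),
        "peak_gb": peak_gb,
    }
```

With the snippet above loaded, it could be called as `benchmark(lambda: model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids))`. Since generation length varies between runs, timings are only comparable when normalised by the length of the generated audio.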

Hope that helps!

Here are my benchmarks! Pretty impressive for a v0.1 model. Looking forward to v1.0.

image.png

image.png

@sanchit-gandhi many thanks for the information on the model and the example code!

@ctranslate2-4you very impressive! Many thanks for sharing your benchmarking here! Would you be able to attach the generated audio files as well? Which texts did you use?

It would also be interesting to see how it compares to Glow-TTS/HiFi-GAN or VITS, although that is a slightly unfair comparison, since those models do not generate a completely new voice but were trained on audio of a specific voice.

Hi all,
The model is super cool. Thank you for your efforts.
This may be of interest to someone: I built a Docker image with the model and ran it on my Mac M3 Max (maxed-out config). The standard phrase

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

takes 31 seconds to generate.

Again, in case it helps anyone, the Dockerfile is

FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt ./
# Install system packages in one layer so the apt cache cleanup actually shrinks the image
RUN apt-get update \
    && apt-get install -y git libgomp1 libsndfile1 \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

The service code is

from fastapi import FastAPI, HTTPException
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
from fastapi.responses import StreamingResponse
import soundfile as sf
import torch
from pydantic import BaseModel
import uuid

app = FastAPI()

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

class TTSRequest(BaseModel):
    prompt: str
    description: str



@app.post("/generate-audio/")
async def generate_audio(request: TTSRequest):
    try:
        input_ids = tokenizer(request.description, return_tensors="pt").input_ids.to(device)
        prompt_input_ids = tokenizer(request.prompt, return_tensors="pt").input_ids.to(device)
        generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
        audio_arr = generation.cpu().numpy().squeeze()
        filename = f"parler_tts_out_{uuid.uuid4()}.wav"
        sf.write(filename, audio_arr, model.config.sampling_rate)

        # stream audio file in response
        return StreamingResponse(open(filename, "rb"), media_type="audio/wav")

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error generating audio: {str(e)}")

It takes a few minutes to download the model on the first run.
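
To try the service once the container is up, a small client sketch using only the standard library might look like the following. The endpoint path and port come from the code and Dockerfile above; the `localhost` base URL, the helper names `build_tts_request` and `fetch_audio`, and the output path are assumptions for illustration.

```python
import json
import urllib.request


def build_tts_request(prompt, description, base_url="http://localhost:8000"):
    """Build a POST request matching the TTSRequest body of the service."""
    payload = json.dumps({"prompt": prompt, "description": description}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate-audio/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_audio(prompt, description, out_path="out.wav", base_url="http://localhost:8000"):
    """Send the request and save the returned WAV bytes to out_path."""
    request = build_tts_request(prompt, description, base_url)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
    return out_path
```

Calling `fetch_audio("Hey, how are you doing today?", "A female speaker ...")` should then write the generated audio to `out.wav`, assuming the container's port 8000 is published on the host.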

@GrigoriiA
Shouldn’t this be ‘mps’ instead of ‘cuda’ for Apple silicon?

… run it on my Mac M3 Max (maxed out config).
device = "cuda:0" if torch.cuda.is_available() else "cpu"

@cansa
I run this in Docker, and AFAIK Docker doesn't allow Metal passthrough. If you know otherwise, please share a link on how to do this. Thanks!
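
For anyone running natively on macOS rather than inside Docker, a device-selection sketch that also considers MPS could look like this. The function names are hypothetical; `torch.backends.mps.is_available()` is only present in more recent PyTorch releases, hence the `getattr` guard. Whether every operation used by Parler-TTS is supported on MPS is not something this thread confirms, so treat it as a starting point rather than a tested setup.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if cuda_ok:
        return "cuda:0"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for the available backends and pick one."""
    import torch  # imported lazily so pick_device can be used without PyTorch

    # Older PyTorch versions have no torch.backends.mps attribute.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)
```

In the service code, `device = detect_device()` would then replace the CUDA-or-CPU one-liner. Inside a Linux container on a Mac this still resolves to "cpu", which matches the behaviour described above.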
