Trying to hack together a voice cloning demo...

#2
by sherlock1199 - opened

I've been trying to create my own custom embeddings using speechbrain/spkrec-xvect-voxceleb

signal, fs = torchaudio.load('morgan.wav')
embeddings = classifier.encode_batch(signal)

and generating audio using:

speech = model.generate_speech(inputs["input_ids"], embeddings[0], vocoder=vocoder)

but the output is garbled. Is there an intermediate step I'm missing?

So I managed to get non-garbled output after resampling my wav file and converting it to mono. Now to figure out how to improve the quality of the voice reproduction.

sherlock1199 changed discussion status to closed
sherlock1199 changed discussion status to open

Hi, thanks for your interest.

According to the signature of model.generate_speech, src_tokens is a required argument.
We therefore recommend calling it with explicit keyword arguments, as follows:

speech = model.generate_speech(src_tokens=inputs["input_ids"], spkembs=embeddings[0], ...)

Feel free to ask any additional questions.

Does inputs["input_ids"] hold text tokens (words)? It looks like it might be a waveform instead.

mechanicalsea changed discussion status to closed
