It sounds cool, in a very "AI going insane" sort of way, but I must be doing something wrong. In the audio below, the TTS module is explaining the difference between sheep and goats. Around 10 seconds in, the glitching starts. I'm wondering whether this has to do with how I've set the model up. Here is the code I've used:
    #setup
    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts", max_new_tokens=256)
    ttsmodel = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    # load xvector containing speaker's voice characteristics from a dataset
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embeddings = torch.tensor(embeddings_dataset["xvector"]).unsqueeze(0)
    ...
    inputs = processor(text=txt, return_tensors="pt", max_new_tokens=256)
    speech = ttsmodel.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    ...
    #Process what is in the variable "txt", which has at most 30 characters:
See the newest/top post on here, where I posted some code. I find the model trips over itself when it is given too much input, and/or it runs into tensor errors. I am processing a prompt.txt file in chunks and outputting an individual file for each chunk. In the end these are actually easier to deal with, because they are split up and anything that needs redoing can be regenerated on its own. Stitch them together with another application: I'm going to use Audacity, but you could do this in a Python script (and convert the WAV file to MP3 or another format...).
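For anyone who wants to try the chunk-and-stitch approach in Python, here is a rough sketch of the two non-model parts. It is not the code from my other post: `split_into_chunks` and `stitch_wavs` are names I made up for illustration, and the 200-character limit is an arbitrary guess, so tune it until the glitching stops. It only uses the standard library and assumes every chunk was saved as a WAV file with the same sample rate, channel count and sample width.

```python
import re
import wave


def split_into_chunks(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence ends.

    max_chars=200 is an illustrative default, not a documented model limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars  # no space to break on, cut mid-word
            piece, sentence = sentence[:cut].strip(), sentence[cut:].strip()
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def stitch_wavs(paths, out_path):
    """Concatenate WAV files that share one audio format into out_path."""
    if not paths:
        raise ValueError("no input files given")
    params, frames = None, []
    for path in paths:
        with wave.open(path, "rb") as wav:
            fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format")
            frames.append(wav.readframes(wav.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setnchannels(params[0])
        out.setsampwidth(params[1])
        out.setframerate(params[2])
        for data in frames:
            out.writeframes(data)
```

The idea would be to read prompt.txt, pass its contents to `split_into_chunks`, run each chunk through the processor and `generate_speech` as in the question, save each result to its own numbered WAV file, and finally call `stitch_wavs` on the list of files in order. Keeping the per-chunk files around means a single bad chunk can be regenerated without redoing the rest.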