Model breaks for input texts longer than ~250-500 characters (silence + gibberish output)
Dear @Thomcles ,
Thank you very much for this model. I really appreciate your work for the Persian language.
I have encountered a few issues and wanted to let you know:
When the text is longer than about 250 characters, the audio goes silent partway through and the rest of the text is not spoken.
When it exceeds 500 characters, the output is completely broken and sounds like another language.
When the model reaches a period ("."), it stops reading the rest of the text.
I noticed that when I use a comma ("،") instead of a period ("."), the model can handle longer texts and gives a proper output.
Thank you again for your great effort.
Best regards,
Mahdi
Hi Mahdi,
Thanks for the feedback!
The trained model (an LLM) starts hallucinating when the context gets too long, mainly because I capped the maximum sequence length to keep training manageable and forgot to increase it.
I think the best solution if you want to avoid this is to chunk the text you want to convert to speech, and then concatenate the different audio clips.
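Here is a rough sketch of that chunk-and-concatenate approach, in case it helps. It is not tested against this model: `synthesize` is a hypothetical callable standing in for whatever function turns a string into a list of audio samples, and `max_chars` and `pause_s` are assumed values you would tune. It puts each sentence in its own chunk, since the model reportedly stops at a period, and splits any over-long sentence on whitespace.

```python
import re
from typing import Callable, List

# Split after sentence-ending punctuation, including the Persian question mark.
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Return one chunk per sentence; sentences over max_chars are split on spaces."""
    chunks: List[str] = []
    for sentence in _SENTENCE_END.split(text.strip()):
        buf = ""
        for word in sentence.split():
            candidate = f"{buf} {word}".strip()
            if len(candidate) > max_chars and buf:
                chunks.append(buf)
                buf = word  # a single word longer than max_chars is kept whole
            else:
                buf = candidate
        if buf:
            chunks.append(buf)
    return chunks


def synthesize_long(
    text: str,
    synthesize: Callable[[str], List[float]],  # hypothetical TTS call
    sample_rate: int,
    max_chars: int = 200,
    pause_s: float = 0.25,
) -> List[float]:
    """Synthesize each chunk separately and join the clips with a short silence."""
    silence = [0.0] * int(sample_rate * pause_s)
    audio: List[float] = []
    for i, chunk in enumerate(chunk_text(text, max_chars)):
        if i > 0:
            audio.extend(silence)
        audio.extend(synthesize(chunk))
    return audio
```

Keeping `max_chars` well under the ~250 characters where the silence starts should leave some margin. Plain lists are used here only to keep the sketch dependency-free; with real audio you would likely concatenate arrays instead.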
If you’re up for it, you could always fine-tune the model on long context windows, much like the teams behind Qwen TTS, Moss TTS, or Vibevoice did. That should solve the problem and let you generate very long audio clips without any issues.
However, your remark about commas is interesting: why would that reduce errors? I think it probably has to do with the fact that the training data mostly consists of complete sentences, so a period is associated with the end of a sentence and the model likely treats it as a cue to stop generating.
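If you want to keep using the comma workaround in the meantime, it could be automated with something like the sketch below. This only reflects what Mahdi observed, not a verified fix, and `soften_periods` is a hypothetical helper name. It swaps periods that are followed by more text for the Persian comma and leaves the final period and decimals alone; the pauses may well sound different from real sentence breaks.

```python
import re

PERSIAN_COMMA = "،"


def soften_periods(text: str) -> str:
    """Replace sentence-internal periods with the Persian comma."""
    # A period is treated as sentence-internal only when whitespace and more
    # text follow it, so decimals like 3.14 and the closing period are kept.
    return re.sub(r"\.(?=\s+\S)", PERSIAN_COMMA, text)
```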