Increase the output time?

#1
by phamhuyhung - opened

Hello, is there a way to increase the output time? Currently, I can only generate about 30 seconds.

Hi, there's a limit on max input length here: https://huggingface.co/spaces/ntt123/Vietnam-male-voice-TTS/blob/64347565e275d8ac02b3055ea4c03f2ea368585d/app.py#L139
Remove it if you want to generate longer clips.

It appears that the resources available on Hugging Face are restricted, which in turn prevents the generation of longer audio segments.
Would it be possible for you to provide more comprehensive guidelines regarding the installation process on a computer and the step-by-step procedures for data preparation, model training, and other related tasks? This way, individuals without programming backgrounds like myself could potentially carry out these tasks on our own computers.
Finally, thank you for initiating and sharing this remarkable project with the community.

Hi, please refer to the project at NTT123/light-speed for detailed information regarding data preparation, model training, etc.

I genuinely apologize, but as someone new to programming, I couldn't follow any of the instructions on GitHub, not even the initial computer setup step. I hope that when you have the time, you could rewrite the instructions step by step so that people like me can continue to contribute to the project.

I deleted this code from app.py and rebuilt the Docker image:

if len(text) > 500:
    text = text[:500]

The app runs for about 30 seconds and then crashes when I input a 5000-word text. When I run Docker locally (http://localhost:7860), the container shuts down, and I haven't found any error logs in the container. With an 800-word text, no errors occur. Please help. Thank you so much!

Hi, the program likely crashes on long clips because of an out-of-memory error.
To avoid this, generate shorter clips at the sentence or paragraph level and then combine them into one long clip. This also matches the training data, which uses 5-10 second clips.
Here is a template: long clip = short clip 1 + 400 ms silence + short clip 2 + 400 ms silence + short clip 3 + ...
You can use a Python audio-manipulation library such as pydub for this; see the pydub API documentation.
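The template above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in the demo. It assumes 16-bit mono audio, and the names split_sentences, join_clips, synthesize_long, the tts callable, and the 16000 Hz sample rate are all hypothetical placeholders; substitute whatever the model actually produces.

```python
import re
from array import array

# Assumed sample rate for illustration; use the model's real output rate.
SAMPLE_RATE = 16000


def split_sentences(text):
    """Split text into sentence-sized pieces after '.', '!', '?' or a newline."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def join_clips(clips, sample_rate=SAMPLE_RATE, silence_ms=400):
    """Concatenate 16-bit mono clips with silence_ms of silence between them."""
    gap = array("h", [0] * (sample_rate * silence_ms // 1000))
    out = array("h")
    for i, clip in enumerate(clips):
        if i > 0:
            out.extend(gap)  # silence only between clips, not before the first
        out.extend(clip)
    return out


def synthesize_long(text, tts, sample_rate=SAMPLE_RATE):
    """Synthesize each sentence separately, then join the short clips.

    tts is a hypothetical callable mapping one sentence to an
    array('h') of samples; it stands in for the real model call.
    """
    clips = [tts(sentence) for sentence in split_sentences(text)]
    return join_clips(clips, sample_rate)
```

Because each sentence is synthesized on its own, peak memory depends on the longest sentence rather than on the whole text. With pydub, the same joining step could presumably be written with AudioSegment.silent(duration=400) and the + operator on AudioSegment objects.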

Hi @baobao01 @phamhuyhung, I've added a feature to generate long clips in the demo. Thank you for your comments.

Thank you very much. I provided an input of 10000 words and the application worked perfectly.

@baobao01 Hello, since the app developer is quite busy, could you please provide me with a way to contact you so I can ask about how to install the app on localhost like you did?

@ntt123 Hi, I found a bug after you added the long-clip feature to the demo. When generating a paragraph of about 60 seconds, part of the text is skipped and then the reading continues.

@phamhuyhung You should install it on your localhost for the most accurate results, as the app running on this space is only for demonstration purposes and may be affected by hardware limitations of the free account or network connectivity issues. Please contact me via thanh.bao@outlook.com for assistance.

@baobao01 Hi, I have sent you an email. I hope you can spare some time to help me. Thank you very much.

phamhuyhung changed discussion status to closed
