• Uses the XTTS streaming backend now. It can stream on its own, but for a better user experience (backend lag causes gaps) we need to combine chunks before serving.
  • There is an AUDIO_WAIT_MODIFIER environment variable (default 0.9, tuned for a T4 GPU); it can be set to 1.0 for a faster GPU. Until Gradio supports real byte streaming (with a calculated length), this is the only workaround: if we do not wait for the audio to finish playing, the client switches to the next chunk too fast.
  • The final merged audio is also provided, so mobile users can listen (autoplay does not work on mobile due to browser security restrictions).
  • The system message is changeable now, so Mistral can act however we like.
  • There is an example of direct voice streaming (DIRECT_STREAM=1), but it produces lag because Mistral is also streaming on the backend.
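A minimal sketch of how the demo could read these two tuning knobs; the variable names and defaults (0.9 for a T4, direct streaming off) come from this PR, but the exact parsing code here is an assumption:

```python
import os

# Hypothetical sketch: read the tuning knobs described in this PR.
# AUDIO_WAIT_MODIFIER scales how long we wait for a chunk to play out;
# DIRECT_STREAM toggles continuous per-yield streaming.
AUDIO_WAIT_MODIFIER = float(os.environ.get("AUDIO_WAIT_MODIFIER", "0.9"))
DIRECT_STREAM = os.environ.get("DIRECT_STREAM", "0") == "1"
```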
gorkemgoknar changed pull request status to open

For an A10 or faster GPU, AUDIO_WAIT_MODIFIER can be set to 1, but use 0.9 for a T4 (or other Turing-based GPUs).
DIRECT_STREAM=1 uses continuous streaming as in xtts-streaming, but the audio is choppy: Mistral is streaming too, and once the first XTTS yield is done it goes back into the Mistral loop to complete the sentence.
I opted for DIRECT_STREAM=0, as the results are fairly good. AUDIO_WAIT_MODIFIER=1 works fine but adds too much delay between sentences, so 0.9 seems like the sweet spot for a T4.
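The waiting behaviour described above could look roughly like this; this is a hedged sketch, not the actual demo code, and the 24 kHz sample rate and function name `serve_chunks` are assumptions for illustration:

```python
import time

def serve_chunks(chunks, sample_rate=24000, wait_modifier=0.9):
    """Hypothetical sketch: yield each audio chunk, then sleep roughly
    its playback length scaled by AUDIO_WAIT_MODIFIER, so the client is
    not handed the next chunk before the current one finishes playing."""
    for samples in chunks:
        yield samples
        playback_seconds = len(samples) / sample_rate
        time.sleep(playback_seconds * wait_modifier)
```

With wait_modifier below 1.0 (e.g. 0.9 on a T4), the sleep is slightly shorter than the chunk's playback time, compensating for the extra latency the slower GPU adds between chunks.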

Coqui.ai org

Thanks for this!

ylacombe changed pull request status to merged
