Generate synthesized speech from text and an audio reference
Interact with a multimodal chatbot using text and audio