Generate descriptions from images and text prompts
Generate synthesized speech from text and a reference audio clip
Interact with a multimodal chatbot using text and audio