Engage in multi-modal conversations with images and videos
Interact with images and text using Qwen-VL-Max
VLMEvalKit Evaluation Results Collection
Generate audio from text with multiple language support
Create virtual try-ons for clothing on images of people
Generate music from text and melody descriptions