VLMEvalKit Evaluation Results Collection
Generate high-fidelity audio from input audio waveforms
Video captioning/tracking