When Vision Speaks for Sound (Collection)
Data and model for When Vision Speaks for Sound. Includes SFT and DPO training data, evaluation data, and trained checkpoints. • 6 items • Updated 1 day ago