Whisper models fine-tuned for audio captioning instead of speech recognition. These models aim to briefly describe what happens in the audio scene.