arxiv:2307.03183

Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Published on Jul 6, 2023

· Submitted by

akhaliq on Jul 7, 2023

Upvote

Authors:

Yuan Gong ,

Sameer Khurana ,

Abstract

In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2307.03183 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2307.03183 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.