Abstract

Language models (LMs) have made significant advancements, but their ability to incorporate voice input from human speech remains limited, often requiring audio-to-text transcription and resulting in the loss of vital vocal characteristics. To address this challenge, we train a transformer-based language model capable of accepting audio inputs alongside preceding conversations or prompts, enabling predictions for subsequent utterances. In addition to utilizing publicly available language model data, we collect a dataset of 3K hours of audio from the web, creating audio-text pairs representing the ensuing conversation. Additionally, we augment the training data by converting publicly available vocal characteristic labels (e.g., sentiment, gender) associated with the audio into language-based descriptions, enhancing the model's understanding of vocal nuances. Our findings demonstrate the model's capacity to perceive and comprehend audio content, generating meaningful responses grounded in auditory information. This work illuminates the potential of language models to facilitate audio-based interactions, bridging the gap between textual and vocal communication.
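The card does not describe the fusion architecture in detail. As a rough illustration only, the sketch below shows one common way such a model can accept audio alongside a text prompt: frame-level audio features are projected into the language model's embedding space and concatenated with the prompt's token embeddings before the transformer backbone. All class names, dimensions, and the generic backbone here are assumptions for illustration, not the released model.

```python
import torch
import torch.nn as nn


class AudioTextLM(nn.Module):
    """Illustrative sketch (not the released implementation): audio frames are
    projected into the text embedding space and prepended to the prompt tokens
    before a transformer backbone predicts the next utterance."""

    def __init__(self, vocab_size=32000, d_model=512, audio_dim=80,
                 n_heads=8, n_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project frame-level audio features (e.g., log-mel frames) into the
        # model dimension so they behave like "soft tokens".
        self.audio_proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats, text_ids):
        # audio_feats: (batch, n_frames, audio_dim); text_ids: (batch, n_tokens)
        audio_tok = self.audio_proj(audio_feats)
        text_tok = self.token_emb(text_ids)
        x = torch.cat([audio_tok, text_tok], dim=1)
        # Causal mask so each position attends only to earlier positions.
        seq = x.size(1)
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=mask)
        # Next-token logits over the text positions only.
        return self.lm_head(h[:, audio_tok.size(1):])


model = AudioTextLM()
logits = model(torch.randn(2, 100, 80), torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 32000])
```

Under this framing, the label-derived descriptions mentioned in the abstract (e.g., sentiment or gender rendered as natural-language sentences) would presumably enter through the ordinary text path, requiring no architectural change.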
