Model Overview
Description:
Audio Flamingo is a novel audio language model capable of
- understanding audio,
- quickly adapting to unseen tasks via in-context learning and retrieval, and
- understanding and responding to multi-turn dialogues.
We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art results on several benchmarks.
This model is intended for non-commercial, research-only use.
Reference(s):
- Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
- Project Page
- Demo Website
Model Architecture:
Architecture Type: Transformer
Network Architecture: Audio Flamingo
Audio Flamingo uses a Flamingo-style architecture with a frozen audio feature extractor, trainable transformation layers and gated cross-attention dense (xattn-dense) layers, and language model layers.
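As a rough, illustrative sketch of the xattn-dense design, the PyTorch block below shows one gated cross-attention dense layer in which text hidden states attend to audio embeddings. The layer names, dimensions, and zero-initialized tanh gates are assumptions for illustration and do not reproduce the released implementation.

```python
# Minimal sketch of a Flamingo-style gated cross-attention dense (xattn-dense) layer.
# All names, dimensions, and initialization choices are illustrative assumptions.
import torch
import torch.nn as nn

class GatedXAttnDenseLayer(nn.Module):
    """Gated cross-attention + feed-forward block placed alongside the LM layers."""
    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Zero-initialized tanh gates so training starts from the base LM's behavior.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, audio_embeds: torch.Tensor) -> torch.Tensor:
        # Text tokens cross-attend to audio embeddings, then pass through a gated FFN.
        attn_out, _ = self.xattn(text_hidden, audio_embeds, audio_embeds)
        text_hidden = text_hidden + torch.tanh(self.attn_gate) * attn_out
        text_hidden = text_hidden + torch.tanh(self.ffw_gate) * self.ffw(text_hidden)
        return text_hidden

# Example: 16 text tokens cross-attending to 32 audio embedding frames.
text = torch.randn(2, 16, 1024)    # (batch, text_len, d_model)
audio = torch.randn(2, 32, 1024)   # (batch, audio_frames, d_model)
print(GatedXAttnDenseLayer()(text, audio).shape)  # torch.Size([2, 16, 1024])
```

In a Flamingo-style stack, blocks like this are interleaved with the language model layers, while the audio embeddings come from the frozen feature extractor followed by the trainable transformation layers.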
Input:
Input Type(s): Audio, Text
Input Format(s): Audio (WAV/MP3/FLAC), Text (string)
Input Parameters: None
Maximum Audio Input Length: 33.25 seconds
Maximum Text Input Length: 512 tokens (see the preprocessing sketch below)
Output:
Output Type: Text
Output Format: String
Output Parameters: None
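To make the input limits above concrete, here is a minimal preprocessing sketch. The 16 kHz sampling rate and the GPT-2 tokenizer are placeholders for illustration, not the model's actual feature-extractor rate or tokenizer.

```python
# Illustrative input preparation that respects the limits listed above.
# The sampling rate and tokenizer below are placeholder assumptions.
import librosa
from transformers import AutoTokenizer

MAX_AUDIO_SECONDS = 33.25   # maximum audio input length from this card
MAX_TEXT_TOKENS = 512       # maximum text input length from this card
SAMPLE_RATE = 16_000        # placeholder; use the model's feature-extractor rate

def load_audio(path: str):
    # librosa decodes WAV/MP3/FLAC and resamples to the requested rate.
    wav, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    return wav[: int(MAX_AUDIO_SECONDS * SAMPLE_RATE)]  # truncate clips longer than 33.25 s

def prepare_text(prompt: str):
    # Placeholder tokenizer for illustration; the released model defines its own.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    return tokenizer(prompt, truncation=True, max_length=MAX_TEXT_TOKENS, return_tensors="pt")

audio = load_audio("example.wav")   # any WAV/MP3/FLAC file
text = prepare_text("What sounds can be heard in this clip?")
```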
Software Integration:
Runtime Engine(s): PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
Preferred/Supported Operating System(s):
- Linux
Model Version(s):
- v1.0
Training, Testing, and Evaluation Datasets:
Training Dataset:
Audio Flamingo is trained on publicly available datasets released under various licenses, the most restrictive being non-commercial/research-only. The training data covers diverse audio types, including speech, environmental sounds, and music.
- OpenAQA
- Laion630K
- LP-MusicCaps
- SoundDescs
- WavCaps
- AudioSet
- AudioSet Strong Labeled
- WavText5K
- MSP-Podcast
- ClothoAQA
- Clotho-v2
- MACS
- FSD50k
- CochlScene
- NonSpeech 7k
- Chime-home
- Sonyc-UST
- Emov-DB
- JL-Corpus
- Tess
- OMGEmotion
- MELD
- MusicAVQA
- MusicQA
- MusicCaps
- NSynth
- MTG-Jamendo
- MusDB-HQ
- FMA
For all of these datasets, the data collection method is [human]. For OpenAQA, Laion630K, LP-MusicCaps, WavCaps, and MusicQA, the data labeling method is [synthetic]; for the rest, the data labeling method is [human].
Evaluation Dataset:
Audio Flamingo is evaluated on the test splits of the following datasets.
- ClothoAQA
- MusicAVQA
- Clotho-v2
- FSD50k
- CochlScene
- NonSpeech 7k
- NSynth
- AudioCaps
- CREMA-D
- Ravdess
- US8K
- GTZAN
- Medley-solos-DB
For all of these datasets, the data collection method is [human] and the data labeling method is [human].
Inference:
Engine: HuggingFace Transformers
Test Hardware: NVIDIA A100 80GB
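As a quick sanity check for the setup above, the sketch below verifies that the runtime matches the stack listed on this card (Linux, PyTorch, HuggingFace Transformers, and an NVIDIA Ampere or Hopper GPU such as the A100 80GB used for testing). Exact version requirements are not specified here, so the checks are illustrative only.

```python
# Illustrative environment check against the supported stack on this card.
import platform
import torch
import transformers

print("OS:", platform.system())                  # Linux is the supported OS
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name} (SM {major}.{minor}, {props.total_memory / 2**30:.0f} GiB)")
    if major not in (8, 9):
        print("Warning: Ampere (SM 8.x) or Hopper (SM 9.x) GPUs are the listed targets.")
else:
    print("Warning: no CUDA GPU detected; an NVIDIA GPU is expected for inference.")
```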