Model Overview
Description:
Audio Flamingo is a novel audio language model capable of
- understanding audio,
- quickly adapting to unseen tasks via in-context learning and retrieval, and
- understanding and responding to multi-turn dialogues.
We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art results on several benchmarks.
This model is intended for non-commercial, research-only use.
Reference(s):
- Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
- Project Page
- Demo Website
Model Architecture:
Architecture Type: Transformer
Network Architecture: Audio Flamingo
Audio Flamingo uses a Flamingo-style architecture with a frozen audio feature extractor, trainable transformation layers and gated cross-attention dense (xattn-dense) layers, and language model layers.
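As a rough, illustrative sketch of the xattn-dense design, the PyTorch block below shows one gated cross-attention dense layer in which text hidden states attend to audio embeddings. The layer names, dimensions, and zero-initialized tanh gates are assumptions for illustration and do not reproduce the released implementation.

```python
# Minimal sketch of a Flamingo-style gated cross-attention dense (xattn-dense) layer.
# All names, dimensions, and initialization choices are illustrative assumptions.
import torch
import torch.nn as nn

class GatedXAttnDenseLayer(nn.Module):
    """Gated cross-attention + feed-forward block placed alongside the LM layers."""
    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Zero-initialized tanh gates so training starts from the base LM's behavior.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, audio_embeds: torch.Tensor) -> torch.Tensor:
        # Text tokens cross-attend to audio embeddings, then pass through a gated FFN.
        attn_out, _ = self.xattn(text_hidden, audio_embeds, audio_embeds)
        text_hidden = text_hidden + torch.tanh(self.attn_gate) * attn_out
        text_hidden = text_hidden + torch.tanh(self.ffw_gate) * self.ffw(text_hidden)
        return text_hidden

# Example: 16 text tokens cross-attending to 32 audio embedding frames.
text = torch.randn(2, 16, 1024)    # (batch, text_len, d_model)
audio = torch.randn(2, 32, 1024)   # (batch, audio_frames, d_model)
print(GatedXAttnDenseLayer()(text, audio).shape)  # torch.Size([2, 16, 1024])
```

In a Flamingo-style stack, blocks like this are interleaved with the language model layers, while the audio embeddings come from the frozen feature extractor followed by the trainable transformation layers.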
Input:
Input Type(s): Audio, Text
Input Format(s): Audio (WAV/MP3/FLAC), Text (string)
Input Parameters: None
Maximum Audio Input Length: 33.25 seconds
Maximum Text Input Length: 512 tokens (see the preprocessing sketch below)
Output:
Output Type: Text
Output Format: String
Output Parameters: None
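To make the input limits above concrete, here is a minimal preprocessing sketch. The 16 kHz sampling rate and the GPT-2 tokenizer are placeholders for illustration, not the model's actual feature-extractor rate or tokenizer.

```python
# Illustrative input preparation that respects the limits listed above.
# The sampling rate and tokenizer below are placeholder assumptions.
import librosa
from transformers import AutoTokenizer

MAX_AUDIO_SECONDS = 33.25   # maximum audio input length from this card
MAX_TEXT_TOKENS = 512       # maximum text input length from this card
SAMPLE_RATE = 16_000        # placeholder; use the model's feature-extractor rate

def load_audio(path: str):
    # librosa decodes WAV/MP3/FLAC and resamples to the requested rate.
    wav, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    return wav[: int(MAX_AUDIO_SECONDS * SAMPLE_RATE)]  # truncate clips longer than 33.25 s

def prepare_text(prompt: str):
    # Placeholder tokenizer for illustration; the released model defines its own.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    return tokenizer(prompt, truncation=True, max_length=MAX_TEXT_TOKENS, return_tensors="pt")

audio = load_audio("example.wav")   # any WAV/MP3/FLAC file
text = prepare_text("What sounds can be heard in this clip?")
```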
Software Integration:
Runtime Engine(s): PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
Preferred/Supported Operating System(s):
- Linux
Model Version(s):
- v1.0
Training, Testing, and Evaluation Datasets:
Training Dataset:
Audio Flamingo is trained on publicly available datasets released under various licenses, the most restrictive being non-commercial/research-only. The training data covers diverse audio types, including speech, environmental sounds, and music.
- OpenAQA
- Laion630K
- LP-MusicCaps
- SoundDescs
- WavCaps
- AudioSet
- AudioSet Strong Labeled
- WavText5K
- MSP-Podcast
- ClothoAQA
- Clotho-v2
- MACS
- FSD50k
- CochlScene
- NonSpeech 7k
- Chime-home
- Sonyc-UST
- Emov-DB
- JL-Corpus
- Tess
- OMGEmotion
- MELD
- MusicAVQA
- MusicQA
- MusicCaps
- NSynth
- MTG-Jamendo
- MusDB-HQ
- FMA
For all of these datasets, the data collection method is [human]. For OpenAQA, Laion630K, LP-MusicCaps, WavCaps, and MusicQA, the data labeling method is [synthetic]; for the rest, the data labeling method is [human].
Evaluation Dataset:
Audio Flamingo is evaluated on the test splits of the following datasets.
- ClothoAQA
- MusicAVQA
- Clotho-v2
- FSD50k
- CochlScene
- NonSpeech 7k
- NSynth
- AudioCaps
- CREMA-D
- Ravdess
- US8K
- GTZAN
- Medley-solos-DB
For all of these datasets, the data collection method is [human] and the data labeling method is [human].
Inference:
Engine: HuggingFace Transformers
Test Hardware: NVIDIA A100 80GB
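As a quick sanity check for the setup above, the sketch below verifies that the runtime matches the stack listed on this card (Linux, PyTorch, HuggingFace Transformers, and an NVIDIA Ampere or Hopper GPU such as the A100 80GB used for testing). Exact version requirements are not specified here, so the checks are illustrative only.

```python
# Illustrative environment check against the supported stack on this card.
import platform
import torch
import transformers

print("OS:", platform.system())                  # Linux is the supported OS
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name} (SM {major}.{minor}, {props.total_memory / 2**30:.0f} GiB)")
    if major not in (8, 9):
        print("Warning: Ampere (SM 8.x) or Hopper (SM 9.x) GPUs are the listed targets.")
else:
    print("Warning: no CUDA GPU detected; an NVIDIA GPU is expected for inference.")
```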