
Whisper Multitask Analyzer

A transformer encoder-decoder model for automatic audio captioning. Unlike speech-to-text transcription, captioning describes the content and features of audio clips.

  • Model, codebase & card adapted from: MU-NLPC/whisper-small-audio-captioning
  • Model type: Whisper encoder-decoder transformer
  • Language(s) (NLP): en
  • License: cc-by-4.0
  • Parent Model: openai/whisper-small

Usage

The model expects an audio clip (up to 30 s) as input to the encoder and information about the caption style as a forced prefix to the decoder. The forced prefix is an integer that maps to a task; the mapping is defined in the model config and can be retrieved through the model's task_mapping and named_task_mapping attributes, as shown in the minimal example below.

The tag mapping of the current model is:

Task      ID   Description
tags      0    General descriptions; can include genres and features.
genre     1    Estimated musical genres.
mood      2    Estimated emotional feeling.
movement  3    Estimated audio pace and expression.
theme     4    Estimated audio usage (not very accurate).

Minimal example:

import librosa
import transformers

# WhisperForAudioCaptioning is a custom class distributed in the model
# repository (see the note below the example); make it importable before running.

# Load model, tokenizer and feature extractor
checkpoint = "DionTimmer/whisper-small-multitask-analyzer"
model = WhisperForAudioCaptioning.from_pretrained(checkpoint)
tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(checkpoint)

# Load and preprocess audio
input_file = "..."  # path to an audio clip (up to 30 s)
audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

# Mappings by ID
print(model.task_mapping) # {0: 'tags', 1: 'genre', 2: 'mood', 3: 'movement', 4: 'theme'}

# Inverted
print(model.named_task_mapping) # {'tags': 0, 'genre': 1, 'mood': 2, 'movement': 3, 'theme': 4}

# Prepare caption style
style_prefix = f"{model.named_task_mapping['tags']}: "
style_prefix_tokens = tokenizer("", text_target=style_prefix, return_tensors="pt", add_special_tokens=False).labels

# Generate caption
model.eval()
outputs = model.generate(
    inputs=features.to(model.device),
    forced_ac_decoder_ids=style_prefix_tokens,
    max_length=100,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Example output: 0: advertising, beautiful, beauty, bright, cinematic, commercial, corporate, emotional, epic, film, heroic, hopeful, inspiration, inspirational, inspiring, love, love story, movie, orchestra, orchestral, piano, positive, presentation, romantic, sentimental
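The generated caption starts with the numeric task prefix that was forced into the decoder. A small post-processing sketch that strips the prefix and splits the comma-separated tags into a list (parse_caption is an illustrative helper, not part of the model API):

def parse_caption(caption: str) -> list[str]:
    # Drop the leading "<task id>: " prefix, then split on commas.
    _, _, body = caption.partition(": ")
    return [tag.strip() for tag in body.split(",")]

print(parse_caption("0: advertising, beautiful, bright"))
# ['advertising', 'beautiful', 'bright']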

WhisperTokenizer must be initialized with language="en" and task="transcribe".
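Because the forced prefix selects the task, all caption styles from the table above can be generated from the same encoder features in one loop. A sketch reusing the model, tokenizer, and features from the minimal example:

for task_id, task_name in model.task_mapping.items():
    # Force the decoder to start with the numeric prefix for this task.
    style_prefix_tokens = tokenizer(
        "", text_target=f"{task_id}: ", return_tensors="pt", add_special_tokens=False
    ).labels
    outputs = model.generate(
        inputs=features.to(model.device),
        forced_ac_decoder_ids=style_prefix_tokens,
        max_length=100,
    )
    print(task_name, "->", tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])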

The model class WhisperForAudioCaptioning can be found in the git repository or here on the HuggingFace Hub in the model repository. The class overrides the default Whisper generate method to support forcing a decoder prefix.
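For intuition, here is a hypothetical sketch of how such an override can work: the forced prefix tokens are placed right after the decoder start token, so generation continues from the chosen style prefix. The actual class in the model repository may differ in detail.

import torch
from transformers import WhisperForConditionalGeneration

class WhisperForAudioCaptioningSketch(WhisperForConditionalGeneration):
    def generate(self, inputs=None, forced_ac_decoder_ids=None, **kwargs):
        if forced_ac_decoder_ids is not None:
            # Build decoder_input_ids = [decoder start token, style prefix tokens];
            # generate() then continues decoding after the forced prefix.
            batch_size = forced_ac_decoder_ids.shape[0]
            start = torch.full(
                (batch_size, 1),
                self.config.decoder_start_token_id,
                dtype=torch.long,
                device=forced_ac_decoder_ids.device,
            )
            kwargs["decoder_input_ids"] = torch.cat([start, forced_ac_decoder_ids], dim=1)
        return super().generate(inputs=inputs, **kwargs)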

Licence

The model weights are published under the non-commercial licence CC BY-NC 4.0, as the model was fine-tuned on a dataset available for non-commercial use only.
