---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
---
|
|
|
**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced + unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).
|
|
|
The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide audio files of any length and sample rate: resampling and padding/cropping are included in the code snippet below.
|
|
|
The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
|
|
|
Two methods can also be used to get scene embeddings (a single vector per file) and frame-level embeddings, see below.
The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
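
For reference, here is a minimal sketch of that pooling, assuming frame-level embeddings of shape `[batch, 768, time, freq]` as returned by `forward_frame_embeddings` below; summing the two time poolings is one common way of combining them, but the exact combination used by the model lives in the repo:

```python
import torch

# Dummy frame-level embeddings, shape [batch, 768, time, freq]
# (in practice: model.forward_frame_embeddings(waveform), see below)
frame_emb = torch.randn(1, 768, 31, 7)

x = frame_emb.mean(dim=3)                        # mean pooling over the frequency dim -> [batch, 768, time]
scene_emb = x.mean(dim=2) + x.max(dim=2).values  # mean + max pooling over the time dim -> [batch, 768]
print(scene_emb.shape)  # torch.Size([1, 768])
```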
|
|
|
|
|
# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

You can pip install it:

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```
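
To quickly check the install, you can try importing the model class used in the examples below:

```python
# Sanity check: the package and model class should be importable after installation
from audioset_convnext_inf.pytorch.convnext import ConvNeXt

print(ConvNeXt)
```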
|
|
|
# Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).
|
|
|
```python
import os

import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags

model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')

print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

if "cuda" in str(device):
    model = model.to(device)
```

Output:
```
# params: 28222767
```
|
|
|
## Inference: get logits and probabilities
|
|
|
```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling rate from %d to %d Hz" % (sample_rate_, sample_rate))
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
    )

if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

# Load the AudioSet label names (class_labels_indices.csv is expected in the current directory)
current_dir = os.getcwd()
lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(os.path.join(current_dir, "class_labels_indices.csv"))

threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("\nPredicted labels using activity threshold 0.25:\n")
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f" % (ix_to_lb[l], probs[0, l]))
```
|
|
|
Output:
```
Resampling rate from 44100 to 32000 Hz
Padding waveform

Inference on 254906__tpellegrini__cavaco1.wav

logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])

Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```
|
|
|
Technically, it's neither a mandolin nor a ukulele, but the ukulele's Brazilian cousin, the cavaquinho!
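
If you prefer a fixed number of predictions rather than a probability threshold, here is a minimal top-k sketch reusing `probs` and `ix_to_lb` from the snippet above:

```python
import torch

# Print the 5 most probable AudioSet classes for the clip
top_probs, top_ixs = torch.topk(probs[0], k=5)
for p, ix in zip(top_probs.cpu().tolist(), top_ixs.cpu().tolist()):
    print("%s: %.3f" % (ix_to_lb[ix], p))
```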
|
|
|
|
|
## Get audio scene embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```

Output:
```
Scene embedding, shape: torch.Size([1, 768])
```
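
As a usage example, scene embeddings can be compared across recordings, e.g. with cosine similarity. A minimal sketch, assuming `emb_a` and `emb_b` are scene embeddings obtained as above on two different files:

```python
import torch
import torch.nn.functional as F

# Dummy scene embeddings of shape [1, 768]; in practice, each comes from
# model.forward_scene_embeddings(waveform) on a different recording
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)
print("cosine similarity:", similarity.item())
```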
|
|
|
## Get frame-level embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```

Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
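
The time axis has 31 steps for a 10-second input, i.e. roughly 0.32 s per frame. Assuming the frames evenly cover the clip, a small sketch to attach approximate time stamps to each frame:

```python
import numpy as np

n_frames = 31          # time dimension of the frame-level embeddings above
clip_duration = 10.0   # model input length in seconds

# Approximate center time of each frame, assuming frames evenly cover the clip
frame_times = (np.arange(n_frames) + 0.5) * clip_duration / n_frames
print(frame_times[:4])  # approx [0.16 0.48 0.81 1.13]
```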
|
|
|
# Zenodo

The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
|
|
|
|
|
# Citation

[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564

```bibtex
@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}
```
|
|
|
|
|
|