---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
---
**ConvNeXt-Tiny-AT** is a CNN model for audio tagging, trained on **AudioSet** (balanced and unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).
The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file: the code snippet below includes resampling and padding/cropping.
The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
Two additional methods return scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is computed from the frame-level embeddings by mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
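The pooling described above can be sketched in a few lines. This is a minimal illustration on a random NumPy array, not the repository's code: it assumes the frame-level embeddings are laid out as (batch, channels, time, frequency), borrows the `[1, 768, 31, 7]` shape reported later in this README, and assumes that "mean pooling + max pooling" means the element-wise sum of the two pooled vectors. The function name `pool_scene_embedding` is hypothetical.

```python
import numpy as np

# Stand-in for frame-level embeddings; real ones come from the model.
# Assumed layout: (batch, channels, time, frequency).
rng = np.random.default_rng(0)
frame_embeddings = rng.standard_normal((1, 768, 31, 7)).astype(np.float32)


def pool_scene_embedding(frames):
    """Reduce (batch, channels, time, freq) embeddings to (batch, channels)."""
    x = frames.mean(axis=3)   # mean pooling over the frequency dimension
    x_mean = x.mean(axis=2)   # mean pooling over the time dimension
    x_max = x.max(axis=2)     # max pooling over the time dimension
    return x_mean + x_max     # assumed combination: element-wise sum


scene_embedding = pool_scene_embedding(frame_embeddings)
print("scene embedding shape:", scene_embedding.shape)
```

In practice, `model.forward_scene_embeddings` returns the scene embedding directly, so this sketch is only meant to make the pooling order concrete.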
# Install
This code is based on our repo: https://github.com/topel/audioset-convnext-inf
You can pip install it:
```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```
# Usage
Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).
```python
import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF
from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags
model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')
print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

if "cuda" in str(device):
    model = model.to(device)
```
Output:
```
# params: 28222767
```
## Inference: get logits and probabilities
To run the following, first download `254906__tpellegrini__cavaco1.wav` and `class_labels_indices.csv` from this repository.
```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"

current_dir = os.getcwd()
AUDIO_FPATH = os.path.join(current_dir, AUDIO_FNAME)

print("\nInference on " + AUDIO_FNAME + "\n")

# torchaudio.load returns a (channels, num_samples) tensor and the file's sample rate.
waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling rate from %d to %d Hz" % (sample_rate_, sample_rate))
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
    )

# Zero-pad or crop the waveform to exactly 10 s.
if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(os.path.join(current_dir, "class_labels_indices.csv"))

# Keep every tag whose probability exceeds the activity threshold.
threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]

print("\nPredicted labels using activity threshold %.2f:\n" % threshold)
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f" % (ix_to_lb[l], probs[0, l]))
```
Output:
```
Inference on 254906__tpellegrini__cavaco1.wav
Resampling rate from 44100 to 32000 Hz
Padding waveform
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:
[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```
Technically speaking, it's neither a mandolin nor a ukulele, but a Brazilian cousin, the cavaquinho!
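As an alternative to a fixed activity threshold, the most probable tags can be listed by rank. The sketch below is illustrative only: `top_k_tags` is a hypothetical helper, and the logits and label names are toy values rather than model output. It also shows the sigmoid relation between logits and probabilities mentioned in the snippet above.

```python
import numpy as np


def top_k_tags(probs, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(labels[i], float(probs[i])) for i in order]


# Toy values standing in for the model's 527 logits and the AudioSet label names.
toy_logits = np.array([2.0, -1.0, 0.5, -3.0])
toy_probs = 1.0 / (1.0 + np.exp(-toy_logits))  # sigmoid maps logits to probabilities
toy_labels = ["tag_a", "tag_b", "tag_c", "tag_d"]

for label, prob in top_k_tags(toy_probs, toy_labels, k=2):
    print("%s: %.3f" % (label, prob))
```

With the real model, the call would look like `top_k_tags(probs[0].cpu().numpy(), ix_to_lb, k=5)`, reusing `probs` and `ix_to_lb` from the inference snippet.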
## Get audio scene embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```
Output:
```
Scene embedding, shape: torch.Size([1, 768])
```
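One common use of such fixed-size vectors is comparing two recordings. The following is a minimal sketch, assuming two scene embeddings of shape `(1, 768)` have already been extracted and converted to NumPy arrays; the random arrays and the `cosine_similarity` helper are placeholders for illustration, not part of the repository.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two embeddings, flattened to 1-D vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


# Placeholder embeddings; with the model, these would be
# model.forward_scene_embeddings(waveform).cpu().numpy() for two different files.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal((1, 768))
emb_b = emb_a + 0.1 * rng.standard_normal((1, 768))  # slightly perturbed copy

print("similarity: %.3f" % cosine_similarity(emb_a, emb_b))
```

Values close to 1 indicate embeddings pointing in nearly the same direction; whether that matches perceived acoustic similarity depends on the task and should be checked on your own data.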
## Get frame-level embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```
Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
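The shape above gives a rough idea of the temporal resolution of the frame-level embeddings. The sketch below assumes that dimension 2 (size 31) is the time axis and that the 31 frames evenly cover the 10-second input; it ignores receptive-field overlap between frames, so the intervals are approximate. `frame_to_interval` is a hypothetical helper.

```python
clip_duration_s = 10.0  # input duration used in this README
num_time_frames = 31    # size of dimension 2 in the reported shape (assumed to be time)

# Approximate duration covered by one embedding frame.
frame_hop_s = clip_duration_s / num_time_frames


def frame_to_interval(frame_index):
    """Return the approximate (start, end) time in seconds of an embedding frame."""
    start = frame_index * frame_hop_s
    return start, start + frame_hop_s


print("seconds per frame: %.3f" % frame_hop_s)
print("frame 0 covers %.2f to %.2f s" % frame_to_interval(0))
print("frame 30 covers %.2f to %.2f s" % frame_to_interval(30))
```

Under these assumptions, each frame spans roughly a third of a second.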
# Zenodo
The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
# Citation
[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)
Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564
```bibtex
@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}
```