---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
---

**ConvNeXt-Tiny-AT** is an audio tagging CNN model trained on **AudioSet** (balanced + unbalanced subsets). It reaches 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide audio files of any duration and sample rate: resampling and padding/cropping are included in the code snippet below.

The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).

Two methods are also available to get scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
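
For reference, the sketch below reproduces this pooling from the frame-level embeddings. It is a minimal illustration, assuming the frame-level output has shape `(batch, channels, time, frequency)` as returned by `forward_frame_embeddings` (see below), and that the mean- and max-pooled time vectors are combined by summation; check the repository code for the exact operation.

```python
import torch

# Placeholder frame-level embeddings, assumed shape (batch, channels, time, frequency),
# e.g. (1, 768, 31, 7) as produced by model.forward_frame_embeddings(waveform)
frame_emb = torch.randn(1, 768, 31, 7)

x = frame_emb.mean(dim=3)                  # mean pooling over the frequency dim -> (1, 768, 31)
scene_emb = x.mean(dim=2) + x.amax(dim=2)  # mean + max pooling over the time dim -> (1, 768)
print(scene_emb.shape)                     # torch.Size([1, 768])
```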


# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

You can pip install it:

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```

# Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels). 

```python
import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags

model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')

print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)
# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```

Output:
```
# params: 28222767
```

## Inference: get logits and probabilities

To run the following, first download `254906__tpellegrini__cavaco1.wav` and `class_labels_indices.csv` from this repository.
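
If you prefer to fetch them programmatically, the snippet below uses `huggingface_hub` to download both files from this model repository (a convenience sketch; the returned paths can be used in place of the ones built from the current directory below).

```python
from huggingface_hub import hf_hub_download

# Download the example clip and the AudioSet label map from this repository
audio_path = hf_hub_download(repo_id="topel/ConvNeXt-Tiny-AT", filename="254906__tpellegrini__cavaco1.wav")
csv_path = hf_hub_download(repo_id="topel/ConvNeXt-Tiny-AT", filename="class_labels_indices.csv")
```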

```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"

current_dir = os.getcwd()
AUDIO_FPATH = os.path.join(current_dir, AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling from %d to 32000 Hz"%sample_rate_)
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
        )

if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0,missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length: 
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(os.path.join(current_dir, "class_labels_indices.csv"))

threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("\nPredicted labels using activity threshold 0.25:\n")
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f"%(ix_to_lb[l], probs[0,l]))
```

Output:
```
Resampling from 44100 to 32000 Hz
Padding waveform

Inference on 254906__tpellegrini__cavaco1.wav

logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])

Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```

Technically speaking, it's neither a Mandolin nor a Ukulele, but a Brazilian cousin, the cavaquinho!
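
If you prefer not to rely on a fixed activity threshold, you can also inspect the top-k classes directly. The sketch below reuses `probs` and `ix_to_lb` from the snippet above:

```python
# Top-5 classes by probability, reusing probs and ix_to_lb from the previous snippet
topk_probs, topk_idx = torch.topk(probs[0], k=5)
for p, ix in zip(topk_probs.cpu().tolist(), topk_idx.cpu().tolist()):
    print("%s: %.3f" % (ix_to_lb[ix], p))
```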


## Get audio scene embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```

Output:
```
Scene embedding, shape: torch.Size([1, 768])
```
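
Scene embeddings can be used, for example, to compare clips. The sketch below computes the cosine similarity between two hypothetical waveforms `waveform_a` and `waveform_b`, each preprocessed (resampled, padded/cropped) as in the inference snippet above:

```python
with torch.no_grad():
    model.eval()
    emb_a = model.forward_scene_embeddings(waveform_a)  # (1, 768)
    emb_b = model.forward_scene_embeddings(waveform_b)  # (1, 768)

# Cosine similarity between the two 768-dim scene embeddings
similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b)
print("Cosine similarity:", similarity.item())
```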

## Get frame-level embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```

Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
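
The dimensions are assumed to be `(batch, channels, time, frequency)`. If you need one vector per time frame, e.g. for a downstream sequence model, you can pool only the frequency axis, as in this sketch:

```python
# output: frame-level embeddings, assumed shape (batch, channels, time, frequency)
per_frame = output.mean(dim=3)          # (1, 768, 31): one 768-dim vector per time frame
per_frame = per_frame.permute(0, 2, 1)  # (1, 31, 768): time-major layout
print(per_frame.shape)
```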

# Zenodo

The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1


# Citation

[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564

```bibtex
@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}
```