---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
---

**ConvNeXt-Tiny-AT** is a CNN for audio tagging, trained on **AudioSet** (balanced + unbalanced subsets). It reaches 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model expects 10-second audio clips sampled at 32 kHz as input.

It outputs logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).

Two additional methods return scene embeddings (a single vector per file) and frame-level embeddings; see below.

The scene embedding is computed from the frame-level embeddings by mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
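
The pooling described above can be sketched in NumPy as follows. This is a minimal illustration, not the code from the repo: it assumes the frame-level embeddings are laid out as (batch, channels, time, frequency), and that "mean pooling + max pooling" means the sum of the two pooled vectors. The function name `scene_embedding` is hypothetical.

```python
import numpy as np


def scene_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
    """Pool (batch, channels, time, freq) embeddings down to (batch, channels)."""
    # Assumption: the last axis is frequency, axis 2 is time.
    x = frame_embeddings.mean(axis=3)  # mean over frequency -> (batch, channels, time)
    # Assumption: "mean pooling + max pooling" is the sum of both pooled vectors.
    return x.mean(axis=2) + x.max(axis=2)  # pool over time -> (batch, channels)


# Random stand-in with the frame-level shape reported further down this card.
rng = np.random.default_rng(0)
frames = rng.standard_normal((1, 768, 31, 7)).astype(np.float32)
scene = scene_embedding(frames)
print(scene.shape)  # (1, 768)
```

Under these assumptions, the result has the same shape as the scene embedding returned by the model, one 768-dimensional vector per file.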

# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```

# Usage

The example below instantiates our model, `convnext_tiny_471mAP.pth`:

```python
import os

import numpy as np
import torch
import torchaudio

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags

# Load the pretrained weights on CPU first
model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')

# Count the trainable parameters
print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Move the model to the GPU when one is available
if "cuda" in str(device):
    model = model.to(device)
```

Output:
```
# params: 28222767
```

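
Since the model expects 10-second clips at 32 kHz, a clip of a different length may need to be adjusted before inference. This card does not say how shorter or longer clips should be handled, so the sketch below is only one possible approach: it assumes zero-padding at the end for short clips and truncation for long ones. The helper `fit_to_target_length` is hypothetical and is not part of `audioset_convnext_inf`. Resampling is not covered here; `torchaudio.functional.resample` is one option when a file is not at 32 kHz.

```python
import numpy as np

SAMPLE_RATE = 32_000
TARGET_LENGTH = 10 * SAMPLE_RATE  # 10 s at 32 kHz


def fit_to_target_length(waveform: np.ndarray, target_length: int = TARGET_LENGTH) -> np.ndarray:
    """Zero-pad or truncate a (channels, samples) waveform along its last axis."""
    n_samples = waveform.shape[-1]
    if n_samples >= target_length:
        # Assumption: keep the first 10 s of longer clips.
        return waveform[..., :target_length]
    # Assumption: pad shorter clips with trailing zeros (silence).
    pad_width = [(0, 0)] * (waveform.ndim - 1) + [(0, target_length - n_samples)]
    return np.pad(waveform, pad_width)


short_clip = np.ones((1, 5 * SAMPLE_RATE), dtype=np.float32)   # 5 s
long_clip = np.ones((1, 12 * SAMPLE_RATE), dtype=np.float32)   # 12 s
print(fit_to_target_length(short_clip).shape)  # (1, 320000)
print(fit_to_target_length(long_clip).shape)   # (1, 320000)
```

The same logic applies to a `torch.Tensor` loaded with `torchaudio.load`, for instance by converting the padded array back with `torch.from_numpy`.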
## Inference: get logits and probabilities

```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("ERROR: sampling rate not 32k Hz", sample_rate_)

waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

# class_labels_indices.csv is read from the current working directory
current_dir = os.getcwd()
lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(os.path.join(current_dir, "class_labels_indices.csv"))

# Keep the tags whose probability exceeds the activity threshold
threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("\nPredicted labels using activity threshold 0.25:\n")
# print(sample_labels)
for ix in sample_labels:
    print("%s: %.3f" % (ix_to_lb[ix], probs[0, ix]))
```

Output:
```
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])

Predicted labels using activity threshold 0.25:

Speech: 0.626
Music: 0.842
Musical instrument: 0.362
Plucked string instrument: 0.307
Ukulele: 0.703
Inside, small room: 0.305
```

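
The predicted labels above are listed in class-index order. As a possible variation, the sketch below applies the sigmoid to logits, keeps the tags above the threshold, and sorts them by decreasing probability. It is a standalone NumPy illustration: the helper names `sigmoid` and `select_tags`, the four-entry label list, and the logit values are all made up for the example and do not come from the model.

```python
import numpy as np


def sigmoid(logits):
    """Map logits to probabilities, as in probs = torch.sigmoid(logits)."""
    return 1.0 / (1.0 + np.exp(-logits))


def select_tags(probs, labels, threshold=0.25):
    """Return (label, probability) pairs above threshold, most probable first."""
    active = np.flatnonzero(probs > threshold)   # indices above the threshold
    order = active[np.argsort(-probs[active])]   # sort them by decreasing probability
    return [(labels[i], float(probs[i])) for i in order]


labels = ["Speech", "Music", "Ukulele", "Dog"]  # toy list, not the 527 AudioSet tags
logits = np.array([0.5, 1.7, 0.9, -3.0])        # made-up logits for illustration
probs = sigmoid(logits)

for name, prob in select_tags(probs, labels):
    print(f"{name}: {prob:.3f}")
```

With the real model, `probs` would be `output["clipwise_output"][0]` moved to the CPU and converted to NumPy, and `labels` would be the `ix_to_lb` mapping.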
## Get audio scene embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```

Output:
```
Scene embedding, shape: torch.Size([1, 768])
```

## Get frame-level embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```

Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```

# Zenodo

The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1

A second checkpoint is available there as well: `convnext_tiny_465mAP_BL_AC_70kit.pth`

The second model is useful for audio captioning on the AudioCaps dataset without training-data bias. It was trained the same way as the current model, for audio tagging on AudioSet, except that the AudioCaps files were removed from the AudioSet development set.

# Citation

[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564

```bibtex
@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}
```