---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
---
**ConvNeXt-Tiny-AT** is an audio tagging CNN, trained on **AudioSet** (balanced + unbalanced subsets). It reaches 0.471 mAP on the AudioSet test set.
The model expects 10-second audio clips sampled at 32 kHz as input.
It outputs logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
Two additional methods are provided to extract scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings by mean pooling over the frequency dimension, followed by mean pooling combined with max pooling over the time dimension, as sketched below.
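For intuition, the pooling just described can be written in a few lines of PyTorch. This is a minimal sketch, not the model's internal implementation: `frame_emb` stands for the `[batch, 768, time, freq]` tensor returned by `forward_frame_embeddings`, and the mean/max combination is shown as a sum, which is a common audio-tagging choice and an assumption here.
```python
import torch

# Sketch of the pooling described above (illustrative, not the model's own code).
# frame_emb: [batch, channels, time, freq], e.g. [1, 768, 31, 7] for a 10-s clip.
frame_emb = torch.randn(1, 768, 31, 7)

x = frame_emb.mean(dim=3)                  # mean pooling over the frequency dim -> [1, 768, 31]
scene_emb = x.mean(dim=2) + x.amax(dim=2)  # mean + max pooling over the time dim -> [1, 768]
print(scene_emb.shape)                     # torch.Size([1, 768])
```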
# Install
This code is based on our repo: https://github.com/topel/audioset-convnext-inf
```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```
# Usage
Below is an example of how to instantiate our model from the `convnext_tiny_471mAP.pth` checkpoint.
```python
import os
import numpy as np
import torch
import torchaudio
from audioset_convnext_inf.pytorch.convnext import ConvNeXt
model = ConvNeXt.from_pretrained(
    "topel/ConvNeXt-Tiny-AT",
    map_location="cpu",
    use_auth_token=None,  # or use_auth_token="ACCESS_TOKEN_GOES_HERE" if a token is required
)
print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

if "cuda" in str(device):
    model = model.to(device)
```
Output:
```
# params: 28222767
```
## Inference: get logits and probabilities
```python
sample_rate = 32000
audio_target_length = 10 * sample_rate # 10 s
AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)
waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("ERROR: sampling rate is not 32 kHz:", sample_rate_)
waveform = waveform.to(device)
print("\nInference on " + AUDIO_FNAME + "\n")
with torch.no_grad():
    model.eval()
    output = model(waveform)
logits = output["clipwise_logits"]
print("logits size:", logits.size())
probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())
threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("Predicted labels using activity threshold 0.25:\n")
print(sample_labels)
```
Output:
```
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:
[ 0 137 138 139 151 506]
```
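To map these indices to human-readable class names, one option is the official AudioSet `class_labels_indices.csv` file (available from the AudioSet website linked above). A minimal sketch, assuming the CSV has been downloaded locally and has the usual `index,mid,display_name` columns:
```python
import csv

# Sketch: map the predicted label indices to AudioSet class names.
# class_labels_indices.csv is assumed to be in the working directory.
with open("class_labels_indices.csv", newline="") as f:
    index_to_name = {int(row["index"]): row["display_name"] for row in csv.DictReader(f)}

print([index_to_name[i] for i in sample_labels])
```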
## Get audio scene embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)
print("\nScene embedding, shape:", output.size())
```
Output:
```
Scene embedding, shape: torch.Size([1, 768])
```
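Scene embeddings are fixed-size representations and can be used directly for retrieval or clustering. A minimal sketch comparing two clips with cosine similarity; `waveform2` is a hypothetical second 10-s, 32 kHz clip loaded in the same way as `waveform`:
```python
import torch.nn.functional as F

# Sketch: compare two audio clips via the cosine similarity of their scene embeddings.
# waveform2 is a hypothetical second clip, loaded and moved to `device` as above.
with torch.no_grad():
    emb1 = model.forward_scene_embeddings(waveform)
    emb2 = model.forward_scene_embeddings(waveform2)

similarity = F.cosine_similarity(emb1, emb2)  # shape: [1]
print("Cosine similarity:", similarity.item())
```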
## Get frame-level embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)
print("\nFrame-level embeddings, shape:", output.size())
```
Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
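The frame-level output is a `[batch, 768, time, freq]` feature map. For downstream sequence models (e.g. sound event detection or audio captioning decoders), one common, though not the only, choice is to average out the frequency bins and transpose to a `[batch, time, 768]` sequence; a sketch using the `output` tensor from above:
```python
# Sketch: turn the [batch, 768, time, freq] map into a [batch, time, 768] sequence
# by mean-pooling the frequency bins (an illustrative choice, not prescribed by the model).
seq = output.mean(dim=3).permute(0, 2, 1).contiguous()
print(seq.shape)  # torch.Size([1, 31, 768])
```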
# Zenodo
The checkpoint is also available on Zenodo (https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1), together with a second checkpoint, convnext_tiny_465mAP_BL_AC_70kit.pth.
The second model is useful for audio captioning experiments on the AudioCaps dataset without training-data bias: it was trained in the same way as the current model, for audio tagging on AudioSet, but the AudioCaps files were removed from the AudioSet development set.
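If you prefer to fetch the file directly from Zenodo rather than the Hugging Face Hub, a minimal sketch for downloading and inspecting the checkpoint follows; the internal layout of the `.pth` file (plain state dict vs. a wrapping dict) is not documented here, so adapt the loading code to what the inspection prints:
```python
import torch

# Sketch: download the Zenodo checkpoint and inspect its contents.
url = "https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1"
torch.hub.download_url_to_file(url, "convnext_tiny_471mAP.pth")

ckpt = torch.load("convnext_tiny_471mAP.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # top-level keys, to see how the weights are stored
```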
# Citation
Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564
```bibtex
@inproceedings{pellegrini23_interspeech,
author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={4169--4173},
doi={10.21437/Interspeech.2023-1564}
}
```