---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
extra_gated_prompt: "The collected information will help acquire a better knowledge of who is using our audio event tools. If relevant, please cite our Interspeech 2023 paper."
extra_gated_fields:
  Company/university: text
  Website: text
---
ConvNeXt-Tiny-AT is an audio tagging CNN, trained on AudioSet (balanced and unbalanced subsets). It reaches 0.471 mAP on the AudioSet test set.

The model expects 10-second audio clips sampled at 32 kHz as input.
It outputs logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
Two methods can also be used to get scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings: mean pooling is applied over the frequency dimension, followed by mean pooling combined with max pooling over the time dimension.
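
For reference, here is a minimal sketch of that pooling scheme, assuming the frame-level tensor is laid out as `(batch, channels, time, frequency)` (as returned by `forward_frame_embeddings` below) and that the two time poolings are summed, which is consistent with the 768-dimensional scene embedding shown later:

```python
import torch

def scene_embedding_from_frames(frames: torch.Tensor) -> torch.Tensor:
    """Collapse (batch, channels, time, freq) frame-level embeddings
    into a single vector per clip."""
    x = frames.mean(dim=3)             # mean pooling over frequency -> (batch, channels, time)
    x = x.mean(dim=2) + x.amax(dim=2)  # mean + max pooling over time -> (batch, channels)
    return x
```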

# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

Note that the checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1


```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```

# Usage

Below is an example of how to instantiate our model from the convnext_tiny_471mAP.pth checkpoint:

```python
# 1. visit hf.co/topel/ConvNeXt-Tiny-AT and accept user conditions
# 2. visit hf.co/settings/tokens to create an access token
# 3. instantiate pretrained model

import os
import numpy as np
import torch
import torchaudio

from audioset_convnext_inf.pytorch.convnext import ConvNeXt

model = ConvNeXt.from_pretrained(
    "topel/ConvNeXt-Tiny-AT",
    map_location="cpu",
    use_auth_token="ACCESS_TOKEN_GOES_HERE",
)

print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```

Output:
```
# params: 28222767
```

## Inference: get logits and probabilities

```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    raise ValueError(f"Expected a 32 kHz sampling rate, got {sample_rate_} Hz")

waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

threshold = 0.25
sample_labels = np.where(probs[0].cpu().numpy() > threshold)[0]
print(f"Predicted labels using activity threshold {threshold}:\n")
print(sample_labels)
```

Output:
```
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:

[  0 137 138 139 151 506]
```
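
To map these indices to human-readable tags, you can look them up in Google's published AudioSet class list (a hedged sketch; the CSV below is the official AudioSet label map, not part of this repository):

```python
import csv
import urllib.request

# Official AudioSet label map: columns are index, mid, display_name
LABELS_URL = "http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv"

with urllib.request.urlopen(LABELS_URL) as f:
    reader = csv.DictReader(line.decode("utf-8") for line in f)
    index_to_name = {int(row["index"]): row["display_name"] for row in reader}

# sample_labels comes from the inference snippet above; index 0, for instance, is "Speech"
print([index_to_name[i] for i in sample_labels])
```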



## Get audio scene embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```

Output:
```
Scene embedding, shape: torch.Size([1, 768])
```
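
Scene embeddings are convenient for clip-level similarity or retrieval. A minimal sketch, assuming a second clip `waveform_b` has been loaded and moved to the device the same way as `waveform` (the variable name is hypothetical):

```python
import torch.nn.functional as F

with torch.no_grad():
    emb_a = model.forward_scene_embeddings(waveform)
    emb_b = model.forward_scene_embeddings(waveform_b)

# Cosine similarity between the two 768-dimensional clip embeddings
print("Cosine similarity:", F.cosine_similarity(emb_a, emb_b).item())
```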

## Get frame-level embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```

Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
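
The four dimensions are presumably (batch, channels, time, frequency); with the network's 32x downsampling, a 10-second clip yields 31 time frames and 7 frequency bins. To get one 768-dimensional vector per time frame, e.g. for sound event detection experiments, you can pool over frequency only (a sketch under that layout assumption):

```python
# (1, 768, 31, 7) -> (1, 31, 768): one embedding per time frame
per_frame = output.mean(dim=3).transpose(1, 2)
print("Per-frame embeddings, shape:", per_frame.size())
```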


## Citation

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564

```bibtex
@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}
```