topel committed
Commit 93056c3
1 Parent(s): e3808a4

Update README.md

Files changed (1): README.md (+37 -17)

README.md CHANGED
 
**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced+unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file: resampling and padding/cropping are included in the code snippet below.

The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).

Two methods can also be used to get scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings: mean pooling is applied over the frequency dim, followed by mean pooling + max pooling over the time dim.
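As a rough illustration, this pooling scheme can be sketched as follows; the (batch, channels, time, freq) layout and the summing of the mean- and max-pooled time vectors are assumptions, not the package's actual code:

```python
import torch

# Dummy frame-level embeddings: (batch, channels, time, freq),
# matching the torch.Size([1, 768, 31, 7]) shape reported for this model.
frame_embs = torch.randn(1, 768, 31, 7)

x = frame_embs.mean(dim=3)                 # mean pooling over the frequency dim -> (1, 768, 31)
scene_emb = x.mean(dim=2) + x.amax(dim=2)  # mean pooling + max pooling over the time dim -> (1, 768)

print(scene_emb.shape)  # torch.Size([1, 768])
```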
 
 
Install the package for inference with `pip install git+https://github.com/topel/audioset-convnext-inf@pip-install`.

# Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene- and frame-level).
```python
import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags
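
# --- A hedged sketch: the checkpoint-loading lines are elided in this diff.
# `ConvNeXt.from_pretrained` and its arguments are assumptions; check the full
# README or the audioset-convnext-inf package for the exact lines. It defines
# the `model` and `device` used below.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location="cpu")
model = model.to(device)
model.eval()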
 
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling from %d to 32000 Hz" % sample_rate_)
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
    )

if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")
 
```

Output:
```
Inference on 254906__tpellegrini__cavaco1.wav

Resampling from 44100 to 32000 Hz
Padding waveform
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])

Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```

Technically, it is neither a Mandolin nor a Ukulele, but the Ukulele's Brazilian cousin, the cavaquinho!

## Get audio scene embeddings
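Continuing the snippet above, the two embedding calls can be sketched as follows; the method names `forward_scene_embeddings` and `forward_frame_embeddings` are assumptions to be checked against the audioset-convnext-inf package, while the frame-level shape is the one reported for this model:

```python
with torch.no_grad():
    scene_embs = model.forward_scene_embeddings(waveform)  # assumed method name
    frame_embs = model.forward_frame_embeddings(waveform)  # assumed method name

print("Scene embeddings, shape:", scene_embs.shape)        # one 768-dim vector per file
print("Frame-level embeddings, shape:", frame_embs.shape)  # torch.Size([1, 768, 31, 7])
```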
 
The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1

# Citation