Update readme

README.md

extra_gated_fields:
I plan to use this model for (task, type of audio data, etc): text
---

**ConvNeXt-Tiny-AT** is an audio tagging CNN model trained on **AudioSet** (balanced + unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model expects 10-second audio files sampled at 32 kHz as input.
It provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
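Since tagging is multi-label, per-class probabilities come from a sigmoid over the 527 logits rather than a softmax. A minimal sketch, with random logits standing in for a real forward pass:

```python
import torch

# Random logits stand in for the model's output here;
# the real model returns one logit per AudioSet class.
logits = torch.randn(1, 527)

# Sigmoid gives independent per-class probabilities (multi-label tagging)
probs = torch.sigmoid(logits)

# Top-5 most probable tags (indices into the 527 AudioSet labels)
top5 = probs.topk(5, dim=-1)
print(top5.indices)
```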
Two methods are also available to get scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
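That pooling scheme can be sketched as follows. The tensor shape and the choice of summing the two time-pooled vectors are illustrative assumptions, not the repo's exact code:

```python
import torch

# Hypothetical frame-level embeddings: (batch, channels, time, freq).
# The actual shapes come from the model's frame-embedding method;
# these numbers are placeholders for illustration.
frame_embs = torch.randn(1, 768, 31, 4)

# Mean pooling over the frequency dimension
x = frame_embs.mean(dim=3)                 # (1, 768, 31)

# Mean pooling + max pooling over the time dimension
# (combined by summation here, as an assumption)
scene_emb = x.mean(dim=2) + x.amax(dim=2)  # (1, 768)
print(scene_emb.shape)
```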

# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```

Below is an example of how to instantiate our model convnext_tiny_471mAP.pth:
```python
# 1. visit hf.co/topel/ConvNeXt-Tiny-AT and accept user conditions
# 2. visit hf.co/settings/tokens to create an access token
# 3. instantiate pretrained model

import os

import numpy as np
import torch
```
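Before running inference, the input must match what the model expects: 10 seconds of audio at 32 kHz. A minimal sketch of preparing such a tensor, with a synthetic sine wave standing in for a real audio file (in practice you would load one with e.g. torchaudio or soundfile):

```python
import torch

# The model expects 10 s of audio sampled at 32 kHz.
sample_rate = 32000
duration_s = 10

# Synthetic 440 Hz sine wave as a placeholder input;
# shape (1, 320000) = (channels, samples)
t = torch.arange(duration_s * sample_rate) / sample_rate
waveform = torch.sin(2 * torch.pi * 440.0 * t).unsqueeze(0)
print(waveform.shape)
```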

The second model is useful to perform audio captioning on the AudioCaps dataset.

# Citation

[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173. doi: 10.21437/Interspeech.2023-1564

```bibtex