Update readme

README.md

extra_gated_fields:
I plan to use this model for (task, type of audio data, etc): text
---

**ConvNeXt-Tiny-AT** is an audio tagging CNN model trained on **AudioSet** (balanced + unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model expects 10-second audio files sampled at 32 kHz as input.
It provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
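Since tagging is multi-label, per-class probabilities come from a sigmoid over the 527 logits rather than a softmax. A minimal sketch, with random logits standing in for a real forward pass:

```python
import torch

# Random logits stand in for the model's output here;
# the real model returns one logit per AudioSet class.
logits = torch.randn(1, 527)

# Sigmoid gives independent per-class probabilities (multi-label tagging)
probs = torch.sigmoid(logits)

# Top-5 most probable tags (indices into the 527 AudioSet labels)
top5 = probs.topk(5, dim=-1)
print(top5.indices)
```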
Two methods are also available to get scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
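That pooling scheme can be sketched as follows. The tensor shape and the choice of summing the two time-pooled vectors are illustrative assumptions, not the repo's exact code:

```python
import torch

# Hypothetical frame-level embeddings: (batch, channels, time, freq).
# The actual shapes come from the model's frame-embedding method;
# these numbers are placeholders for illustration.
frame_embs = torch.randn(1, 768, 31, 4)

# Mean pooling over the frequency dimension
x = frame_embs.mean(dim=3)                 # (1, 768, 31)

# Mean pooling + max pooling over the time dimension
# (combined by summation here, as an assumption)
scene_emb = x.mean(dim=2) + x.amax(dim=2)  # (1, 768)
print(scene_emb.shape)
```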

# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```

Below is an example of how to instantiate our model convnext_tiny_471mAP.pth:
```python
# 1. visit hf.co/topel/ConvNeXt-Tiny-AT and accept user conditions
# 2. visit hf.co/settings/tokens to create an access token
# 3. instantiate pretrained model

import os

import numpy as np
import torch
```
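Before running inference, the input must match what the model expects: 10 seconds of audio at 32 kHz. A minimal sketch of preparing such a tensor, with a synthetic sine wave standing in for a real audio file (in practice you would load one with e.g. torchaudio or soundfile):

```python
import torch

# The model expects 10 s of audio sampled at 32 kHz.
sample_rate = 32000
duration_s = 10

# Synthetic 440 Hz sine wave as a placeholder input;
# shape (1, 320000) = (channels, samples)
t = torch.arange(duration_s * sample_rate) / sample_rate
waveform = torch.sin(2 * torch.pi * 440.0 * t).unsqueeze(0)
print(waveform.shape)
```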

The second model is useful to perform audio captioning on the AudioCaps dataset.

# Citation

[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173. doi: 10.21437/Interspeech.2023-1564

```bibtex