topel committed on
Commit a0b8f57
1 Parent(s): 9040c27

Update Readme

Files changed (1)
  1. README.md +11 -14
README.md CHANGED
@@ -6,13 +6,9 @@ tags:
- audio embeddings
- convnext-audio
- audioset
- inference: false
- extra_gated_prompt: "The collected information will help acquire a better knowledge of who is using our audio event tools. If relevant, please cite our Interspeech 2023 paper."
- extra_gated_fields:
- Company/university: text
- Website: text
---
- ConvNeXt-Tiny-AT is an audio tagging CNN model, trained on AudioSet (balanced+unbalanced subsets). It reached 0.471 mAP on the test set.
+
+ **ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced+unbalanced subsets). It reached 0.471 mAP on the test set.

The model expects as input audio files of duration 10 seconds, and sample rate 32kHz.
It provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
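The README above pins the input format: 10-second clips sampled at 32 kHz, scored against the 527 AudioSet classes. A rough sketch of shaping an arbitrary audio file into that format, assuming torchaudio is available; the file name, the mono downmix and the zero-padding policy are illustrative choices, not requirements stated in the card:

```python
import torch
import torchaudio

TARGET_SR = 32000                 # sample rate the model expects
TARGET_LEN = 10 * TARGET_SR       # 10-second clips

# "example.wav" is a placeholder; any file readable by torchaudio works.
waveform, sr = torchaudio.load("example.wav")        # [channels, num_samples]
waveform = waveform.mean(dim=0, keepdim=True)        # downmix to mono (assumed policy)
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)

# Zero-pad or crop to exactly 10 s (policy assumed, not specified in the card).
n = waveform.shape[1]
if n < TARGET_LEN:
    waveform = torch.nn.functional.pad(waveform, (0, TARGET_LEN - n))
else:
    waveform = waveform[:, :TARGET_LEN]

print(waveform.shape)  # torch.Size([1, 320000])
```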
@@ -23,8 +19,6 @@ The scene embedding is obtained from the frame-level embeddings, on which mean p

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

- Note that the checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
-

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
@@ -35,10 +29,6 @@ pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
Below is an example of how to instantiate our model convnext_tiny_471mAP.pth

```python
- # 1. visit hf.co/topel/ConvNeXt-Tiny-AT and accept user conditions
- # 2. visit hf.co/settings/tokens to create an access token
- # 3. instantiate pretrained model
-
import os
import numpy as np
import torch
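# --- Illustration only, not part of the original README snippet above. ---
# The checkpoint named in this section, convnext_tiny_471mAP.pth, is also hosted
# on Zenodo (the URL quoted in this README). One way to fetch it and peek at its
# contents with plain PyTorch; the local filename is arbitrary and the key layout
# of the checkpoint is not documented in this diff, so we only list it.
CKPT_URL = "https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1"
CKPT_PATH = "convnext_tiny_471mAP.pth"

if not os.path.exists(CKPT_PATH):
    torch.hub.download_url_to_file(CKPT_URL, CKPT_PATH)

checkpoint = torch.load(CKPT_PATH, map_location="cpu")
print(type(checkpoint))
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys())[:5])  # inspect the first few top-level keys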
@@ -69,7 +59,6 @@ Output:
## Inference: get logits and probabilities

```python
-
sample_rate = 32000
audio_target_length = 10 * sample_rate # 10 s
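# --- Illustration only, not part of the original README snippet above. ---
# Assuming `model` is the ConvNeXt instantiated earlier, `waveform` is a tensor of
# shape [1, audio_target_length] at 32 kHz, and a plain forward pass returns raw
# logits of shape [1, 527]; the released code may expose a different interface, so
# treat the call below as a placeholder.
with torch.no_grad():
    logits = model(waveform)

# AudioSet tagging is multi-label, so independent sigmoids are one common way to
# turn logits into per-class probabilities (the model card says both are provided).
probs = torch.sigmoid(logits)

top = torch.topk(probs[0], k=5)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"class index {idx}: probability {p:.3f}")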
 
@@ -140,8 +129,16 @@ Output:
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```

+ # Zenodo
+
+ The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
+
+ Together with a second checkpoint: convnext_tiny_465mAP_BL_AC_70kit.pth
+
+ The second model is useful to perform audio captioning on the AudioCaps dataset without training data biases. It was trained the same way as the current model, for audio tagging on AudioSet, but the files from AudioCaps were removed from the AudioSet development set.
+

- ## Citation
+ # Citation

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564
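As noted in a hunk header above, the scene embedding is obtained by mean pooling the frame-level embeddings, whose shape is reported as torch.Size([1, 768, 31, 7]). A small sketch of that pooling step, assuming the tensor is laid out as [batch, channels, time, frequency] (the axis order is an assumption, not stated in the diff):

```python
import torch

# Dummy frame-level embeddings with the shape reported in the README output.
frame_embeddings = torch.randn(1, 768, 31, 7)   # [batch, channels, time, frequency] (assumed)

# Mean pooling over the two spatial axes collapses the frames into one clip-level vector.
scene_embedding = frame_embeddings.mean(dim=(2, 3))

print(scene_embedding.shape)  # torch.Size([1, 768])
```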
 
 
144