---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
---

**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced + unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file: resampling and padding/cropping are included in the code snippet below.

The model outputs logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).

Two methods are also provided to get scene embeddings (a single vector per file) and frame-level embeddings, see below. The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.

# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

You can pip install it:

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```

# Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).

```python
import os

import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags

model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')

print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

if "cuda" in str(device):
    model = model.to(device)
```

Output:
```
# params: 28222767
```

## Inference: get logits and probabilities

To run the following, first download `254906__tpellegrini__cavaco1.wav` and `class_labels_indices.csv` from this repository.
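If you prefer to fetch them programmatically, here is a minimal sketch using `huggingface_hub` (it assumes both files sit at the root of this model repository):

```python
from huggingface_hub import hf_hub_download

# Download the demo audio file and the AudioSet label map into the current directory
for fname in ["254906__tpellegrini__cavaco1.wav", "class_labels_indices.csv"]:
    hf_hub_download(repo_id="topel/ConvNeXt-Tiny-AT", filename=fname, local_dir=".")
```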
```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
current_dir = os.getcwd()
AUDIO_FPATH = os.path.join(current_dir, AUDIO_FNAME)

# Load the audio file and resample it to 32 kHz if needed
waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling from %d to 32000 Hz" % sample_rate_)
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
    )

# Pad or crop to exactly 10 s
if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(
    os.path.join(current_dir, "class_labels_indices.csv")
)

# Keep the classes whose probability exceeds the activity threshold
threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("\nPredicted labels using activity threshold 0.25:\n")
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f" % (ix_to_lb[l], probs[0, l]))
```

Output:
```
Inference on 254906__tpellegrini__cavaco1.wav

Resampling from 44100 to 32000 Hz
Padding waveform
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])

Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```

Technically speaking, it is neither a Mandolin nor a Ukulele, but a Brazilian cousin, the cavaquinho!

## Get audio scene embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```

Output:
```
Scene embedding, shape: torch.Size([1, 768])
```

## Get frame-level embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```

Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```

# Zenodo

The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1 (a minimal download sketch is given after the citation below).

# Citation

[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564

```bibtex
@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}
```
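As a complement to the Zenodo link above, here is a minimal sketch for downloading that checkpoint with `torch.hub`. The key layout of the `.pth` file is an assumption (state_dict stored directly, or nested under a `"model"` key), and the `from_pretrained` call shown in the Usage section remains the simplest way to instantiate the model.

```python
import torch

# Fetch the Zenodo checkpoint into the torch hub cache
ckpt_url = "https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1"
checkpoint = torch.hub.load_state_dict_from_url(
    ckpt_url, map_location="cpu", file_name="convnext_tiny_471mAP.pth"
)

# Assumption: the weights may be nested under a "model" key
state_dict = checkpoint["model"] if "model" in checkpoint else checkpoint
print(len(state_dict), "entries in the state_dict")
```

The resulting `state_dict` can then be loaded into a `ConvNeXt` instance built with matching hyperparameters via `model.load_state_dict(state_dict)`.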