---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
---
|
|
|
**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced + unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).
|
|
|
The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide audio files of any length and sample rate: resampling and padding/cropping are included in the code snippet below.
|
|
|
The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
|
|
|
Two methods can also be used to get scene embeddings (a single vector per file) and frame-level embeddings, see below.
The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
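
For reference, here is a minimal sketch of that pooling, assuming frame-level embeddings of shape `[batch, 768, time, freq]` as returned by `forward_frame_embeddings` below; summing the two time poolings is one common way of combining them, but the exact combination used by the model lives in the repo:

```python
import torch

# Dummy frame-level embeddings, shape [batch, 768, time, freq]
# (in practice: model.forward_frame_embeddings(waveform), see below)
frame_emb = torch.randn(1, 768, 31, 7)

x = frame_emb.mean(dim=3)                        # mean pooling over the frequency dim -> [batch, 768, time]
scene_emb = x.mean(dim=2) + x.max(dim=2).values  # mean + max pooling over the time dim -> [batch, 768]
print(scene_emb.shape)  # torch.Size([1, 768])
```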
|
|
|
|
|
# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

You can pip install it:

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```
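
To quickly check the install, you can try importing the model class used in the examples below:

```python
# Sanity check: the package and model class should be importable after installation
from audioset_convnext_inf.pytorch.convnext import ConvNeXt

print(ConvNeXt)
```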
|
|
|
# Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).
|
|
|
```python
import os

import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags

model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')

print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

if "cuda" in str(device):
    model = model.to(device)
```

Output:
```
# params: 28222767
```
|
|
|
## Inference: get logits and probabilities
|
|
|
```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling rate from %d to %d Hz" % (sample_rate_, sample_rate))
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
    )

if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

# Load the AudioSet label names (class_labels_indices.csv is expected in the current directory)
current_dir = os.getcwd()
lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(os.path.join(current_dir, "class_labels_indices.csv"))

threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("\nPredicted labels using activity threshold 0.25:\n")
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f" % (ix_to_lb[l], probs[0, l]))
```
|
|
|
Output:
```
Resampling rate from 44100 to 32000 Hz
Padding waveform

Inference on 254906__tpellegrini__cavaco1.wav

logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])

Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```
|
|
|
Technically, it's neither a mandolin nor a ukulele, but the ukulele's Brazilian cousin, the cavaquinho!
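
If you prefer a fixed number of predictions rather than a probability threshold, here is a minimal top-k sketch reusing `probs` and `ix_to_lb` from the snippet above:

```python
import torch

# Print the 5 most probable AudioSet classes for the clip
top_probs, top_ixs = torch.topk(probs[0], k=5)
for p, ix in zip(top_probs.cpu().tolist(), top_ixs.cpu().tolist()):
    print("%s: %.3f" % (ix_to_lb[ix], p))
```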
|
|
|
|
|
## Get audio scene embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```

Output:
```
Scene embedding, shape: torch.Size([1, 768])
```
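
As a usage example, scene embeddings can be compared across recordings, e.g. with cosine similarity. A minimal sketch, assuming `emb_a` and `emb_b` are scene embeddings obtained as above on two different files:

```python
import torch
import torch.nn.functional as F

# Dummy scene embeddings of shape [1, 768]; in practice, each comes from
# model.forward_scene_embeddings(waveform) on a different recording
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)
print("cosine similarity:", similarity.item())
```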
|
|
|
## Get frame-level embeddings

```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```

Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
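
The time axis has 31 steps for a 10-second input, i.e. roughly 0.32 s per frame. Assuming the frames evenly cover the clip, a small sketch to attach approximate time stamps to each frame:

```python
import numpy as np

n_frames = 31          # time dimension of the frame-level embeddings above
clip_duration = 10.0   # model input length in seconds

# Approximate center time of each frame, assuming frames evenly cover the clip
frame_times = (np.arange(n_frames) + 0.5) * clip_duration / n_frames
print(frame_times[:4])  # approx [0.16 0.48 0.81 1.13]
```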
|
|
|
# Zenodo

The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
|
|
|
|
|
# Citation

[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564

```bibtex
@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}
```
|
|
|
|
|
|