---
license: cc-by-4.0
datasets:
- ilyassmoummad/Xeno-Canto-6s-16khz
pipeline_tag: feature-extraction
tags:
- Bioacoustics
- pytorch
---
# ProtoCLR

This repository contains a CvT-13 ([Convolutional Vision Transformer](https://arxiv.org/abs/2103.15808)) model trained from scratch on the [Xeno-Canto dataset](https://huggingface.co/datasets/ilyassmoummad/Xeno-Canto-6s-16khz), specifically on 6-second audio segments sampled at 16 kHz. The model is trained on Mel spectrograms of bird sounds using the [ProtoCLR (Prototypical Contrastive Loss)](https://arxiv.org/abs/2409.08589) objective for 300 epochs and can be used as a feature extractor for bird audio classification and related tasks.
## Files

- `cvt.py`: Defines the CvT-13 model architecture.
- `protoclr.pth`: Pre-trained model weights for ProtoCLR.
- `config/`: Configuration files for the CvT-13 setup.
- `melspectrogram.py`: Contains the `MelSpectrogramProcessor` class, which converts audio waveforms into Mel spectrograms, the input format the model expects.
## Setup

1. **Clone this repository**:

Clone the repository and navigate into the project directory:

```bash
git clone https://huggingface.co/ilyassmoummad/ProtoCLR
cd ProtoCLR/
```

2. **Install dependencies**:

Install the required Python packages (including `torch`) listed in `requirements.txt`:

```bash
pip install -r requirements.txt
```
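Alternatively, if you only need the pre-trained checkpoint, it can be downloaded programmatically with the `huggingface_hub` client. This is an optional shortcut rather than part of the official setup, and assumes `huggingface_hub` is installed (`pip install huggingface_hub`):

```python
from huggingface_hub import hf_hub_download

# Download protoclr.pth from this model repository into the local Hugging Face cache
checkpoint_path = hf_hub_download(repo_id="ilyassmoummad/ProtoCLR", filename="protoclr.pth")
print(checkpoint_path)
```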
## Usage

1. **Prepare the Audio**:

To ensure compatibility with the model, apply the following preprocessing steps to your audio files (a minimal preparation sketch is given at the end of this section):

- **Mono Channel (Mandatory)**: If the audio has multiple channels, convert it to a single mono channel by averaging the channels.
- **Sample Rate (Mandatory)**: Resample the audio to 16 kHz.
- **Padding (Recommended)**: For audio files shorter than 6 seconds, pad with zeros or repeat the audio until it reaches a length of 6 seconds.
- **Chunking (Recommended)**: For audio files longer than 6 seconds, split them into consecutive 6-second chunks and process each chunk separately.

2. **Process the Audio**:

Use the `MelSpectrogramProcessor` (from `melspectrogram.py`) to transform the prepared audio into a Mel spectrogram, the input format the model expects; the Example Code section below shows the full pipeline.
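The waveform preparation described in step 1 can be sketched as follows, assuming `torchaudio` is available for loading and resampling; the helper names (`prepare_waveform`, `chunk_waveform`) are illustrative and not part of this repository:

```python
import torch
import torchaudio


def prepare_waveform(file_path, sample_rate=16000, segment_seconds=6):
    """Load an audio file, convert it to mono, resample to 16 kHz, and zero-pad to at least 6 s."""
    waveform, sr = torchaudio.load(file_path)  # (channels, samples)
    waveform = waveform.mean(dim=0)            # average channels -> mono
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=sample_rate)
    target_len = sample_rate * segment_seconds
    if waveform.shape[0] < target_len:         # pad short clips with zeros
        waveform = torch.nn.functional.pad(waveform, (0, target_len - waveform.shape[0]))
    return waveform


def chunk_waveform(waveform, sample_rate=16000, segment_seconds=6):
    """Split a long waveform into consecutive 6-second chunks (the last chunk is zero-padded)."""
    target_len = sample_rate * segment_seconds
    chunks = list(torch.split(waveform, target_len))
    chunks[-1] = torch.nn.functional.pad(chunks[-1], (0, target_len - chunks[-1].shape[0]))
    return torch.stack(chunks)                 # (num_chunks, target_len)
```

Each prepared waveform (or each 6-second chunk) can then be passed through the `MelSpectrogramProcessor` and the model exactly as in the example below.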
## Example Code

The following example demonstrates loading the model, processing an audio file, and running inference:

```python
import torch
import torchaudio  # used here to load and resample audio

from cvt import cvt13  # model architecture
from melspectrogram import MelSpectrogramProcessor  # Mel spectrogram preprocessing

# Initialize the preprocessor and model
preprocessor = MelSpectrogramProcessor()
model = cvt13()

# Load ONE set of pre-trained weights. This repository provides protoclr.pth;
# the other checkpoints come from the same project and are shown for reference.

# Cross-Entropy:
# model.load_state_dict(torch.load("ce.pth", map_location="cpu")['encoder'])

# SimCLR (self-supervised contrastive learning):
# model.load_state_dict(torch.load("simclr.pth", map_location="cpu"))

# SupCon (supervised contrastive learning):
# model.load_state_dict(torch.load("supcon.pth", map_location="cpu"))

# ProtoCLR (supervised contrastive learning using prototypes):
model.load_state_dict(torch.load("protoclr.pth", map_location="cpu"))

# Optional: move the model to GPU for faster processing, e.g. model = model.to('cuda')
model.eval()


def load_waveform(file_path, sample_rate=16000):
    # Replace this with your own audio loading function if you prefer.
    # Here, torchaudio loads the file, averages the channels to mono, and resamples to 16 kHz.
    waveform, sr = torchaudio.load(file_path)
    waveform = waveform.mean(dim=0)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=sample_rate)
    return waveform


waveform = load_waveform("path/to/audio.wav")  # load your audio file here

# The waveform should be sampled at 16 kHz and padded/chunked to 6 s as described above
input_tensor = preprocessor.process(waveform).unsqueeze(0)  # add batch dimension

# Run the model on the preprocessed audio
with torch.no_grad():
    output = model(input_tensor)
    print("Model output shape:", output.shape)
```
## Model Performance Comparison

The following table presents the classification accuracy (%) of various models on one-shot and five-shot bird sound classification tasks, evaluated across different [soundscape datasets](https://zenodo.org/records/13994373).

| Model | Model Size | PER | NES | UHH | HSN | SSW | SNE | Mean |
|---------------------------|------------|-------------|-------------|-------------|-------------|-------------|-------------|-------|
| Random Guessing | - | 0.75 | 1.12 | 3.70 | 5.26 | 1.04 | 1.78 | 2.22 |
| | | | | | | | | |
| **1-Shot Classification** | | | | | | | | |
| BirdAVES-biox-base | 90M | 7.41±1.0 | 26.4±2.3 | 13.2±3.1 | 9.84±3.5 | 8.74±0.6 | 14.1±3.1 | 13.2 |
| BirdAVES-bioxn-large | 300M | 7.59±0.8 | 27.2±3.6 | 13.7±2.9 | 12.5±3.6 | 10.0±1.4 | 14.5±3.2 | 14.2 |
| BioLingual | 28M | 6.21±1.1 | 37.5±2.9 | 17.8±3.5 | 17.6±5.1 | 22.5±4.0 | 26.4±3.4 | 21.3 |
| Perch | 80M | 9.10±5.3 | 42.4±4.9 | 19.8±5.0 | 26.7±9.8 | 22.3±3.3 | 29.1±5.9 | 24.9 |
| CE (Ours) | 19M | 9.55±1.5 | 41.3±3.6 | 19.7±4.7 | 25.2±5.7 | 17.8±1.4 | 31.5±5.4 | 24.2 |
| SimCLR (Ours) | 19M | 7.85±1.1 | 31.2±2.4 | 14.9±2.9 | 19.0±3.8 | 10.6±1.1 | 24.0±4.1 | 17.9 |
| SupCon (Ours) | 19M | 8.53±1.1 | 39.8±6.0 | 18.8±3.0 | 20.4±6.9 | 12.6±1.6 | 23.2±3.1 | 20.5 |
| ProtoCLR (Ours) | 19M | 9.23±1.6 | 38.6±5.1 | 18.4±2.3 | 21.2±7.3 | 15.5±2.3 | 25.8±5.2 | 21.4 |
| | | | | | | | | |
| **5-Shot Classification** | | | | | | | | |
| BirdAVES-biox-base | 90M | 11.6±0.8 | 39.7±1.8 | 22.5±2.4 | 22.1±3.3 | 16.1±1.7 | 28.3±2.3 | 23.3 |
| BirdAVES-bioxn-large | 300M | 15.0±0.9 | 42.6±2.7 | 23.7±3.8 | 28.4±2.4 | 18.3±1.8 | 27.3±2.3 | 25.8 |
| BioLingual | 28M | 13.6±1.3 | 65.2±1.4 | 31.0±2.9 | 34.3±3.5 | 43.9±0.9 | 49.9±2.3 | 39.6 |
| Perch | 80M | 21.2±1.2 | 71.7±1.5 | 39.5±3.0 | 52.5±5.9 | 48.0±1.9 | 59.7±1.8 | 48.7 |
| CE (Ours) | 19M | 21.4±1.3 | 69.2±1.8 | 35.6±3.4 | 48.2±5.5 | 39.9±1.1 | 57.5±2.3 | 45.3 |
| SimCLR (Ours) | 19M | 15.4±1.0 | 54.0±1.8 | 23.0±2.3 | 32.8±4.0 | 22.0±1.2 | 40.7±2.4 | 31.3 |
| SupCon (Ours) | 19M | 17.2±1.3 | 64.6±2.4 | 34.1±2.9 | 42.5±2.9 | 30.8±0.8 | 48.1±2.4 | 39.5 |
| ProtoCLR (Ours) | 19M | 19.2±1.1 | 67.9±2.8 | 36.1±4.3 | 48.0±4.3 | 34.6±2.3 | 48.6±2.8 | 42.4 |

For additional details, please see the [pre-print on arXiv](https://arxiv.org/abs/2409.08589) and the [official GitHub repository](https://github.com/ilyassmoummad/ProtoCLR).
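As a rough illustration of how such few-shot evaluation works (this is not the paper's evaluation code), a nearest-prototype classifier can be built on top of the extracted embeddings: average the embeddings of the labelled support clips per class, then assign each query clip to the class whose prototype is most similar. A minimal sketch, assuming `model` is loaded as in the example above, that the inputs are already preprocessed Mel spectrograms, and that the model returns one embedding vector per input:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def nearest_prototype_accuracy(model, support_x, support_y, query_x, query_y):
    """Few-shot evaluation sketch: class prototypes are mean embeddings of the support set."""
    support_emb = F.normalize(model(support_x), dim=-1)  # (num_support, dim)
    query_emb = F.normalize(model(query_x), dim=-1)      # (num_query, dim)

    classes = support_y.unique()
    prototypes = torch.stack([support_emb[support_y == c].mean(dim=0) for c in classes])
    prototypes = F.normalize(prototypes, dim=-1)          # (num_classes, dim)

    sims = query_emb @ prototypes.T                       # cosine similarity to each prototype
    preds = classes[sims.argmax(dim=-1)]
    return (preds == query_y).float().mean().item()
```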
## Citation

If you use our model in your research, please cite the following paper:

```bibtex
@misc{moummad2024dirlbs,
      title={Domain-Invariant Representation Learning of Bird Sounds},
      author={Ilyass Moummad and Romain Serizel and Emmanouil Benetos and Nicolas Farrugia},
      year={2024},
      eprint={2409.08589},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2409.08589},
}
```