---
license: cc-by-4.0
datasets:
- ilyassmoummad/Xeno-Canto-6s-16khz
pipeline_tag: feature-extraction
tags:
- Bioacoustics
- pytorch
---
# ProtoCLR

This repository contains a CvT-13 [Convolutional Vision Transformer](https://arxiv.org/abs/2103.15808) model trained from scratch on the [Xeno-Canto dataset](https://huggingface.co/datasets/ilyassmoummad/Xeno-Canto-6s-16khz), specifically on 6-second audio segments sampled at 16 kHz. The model is trained on Mel spectrograms of bird sounds using ProtoCLR [(Prototypical Contrastive Loss)](https://arxiv.org/abs/2409.08589) for 300 epochs and can be used as a feature extractor for bird audio classification and related tasks.

## Files

- `cvt.py`: Defines the CvT-13 model architecture.
- `protoclr.pth`: Pre-trained model weights for ProtoCLR.
- `config/`: Configuration files for CvT-13 setup.
- `melspectrogram.py`: Contains the `MelSpectrogramProcessor` class, which converts audio waveforms into Mel spectrograms, the input format expected by the model.

## Setup

1. **Clone this repository**:
    Clone the repository and navigate into the project directory:
    ```bash
    git clone https://huggingface.co/ilyassmoummad/ProtoCLR
    cd ProtoCLR/
    ```

2. **Install dependencies**:
    Install the required Python packages (`torch` and the other dependencies listed in `requirements.txt`):
    ```bash
    pip install -r requirements.txt
    ```

## Usage

1. **Prepare the Audio**:  
   To ensure compatibility with the model, apply the following preprocessing steps to your audio files (a minimal sketch of these steps is shown after this list):  
   - **Mono Channel (Mandatory)**:  
     If the audio has multiple channels, convert it to a single mono channel by averaging the channels.  
   - **Sample Rate (Mandatory)**:  
     Resample the audio to 16 kHz.  
   - **Padding (Recommended)**:  
     For audio files shorter than 6 seconds, pad with zeros or repeat the audio until it reaches a length of 6 seconds.  
   - **Chunking (Recommended)**:  
     For audio files longer than 6 seconds, split them into 6-second chunks and process each chunk separately, since the model was trained on 6-second inputs.

2. **Process the Audio**:  
   Use the `MelSpectrogramProcessor` (from `melspectrogram.py`) to transform the prepared audio into a Mel spectrogram, a format suitable for model input, as demonstrated in the following example.
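
As a rough illustration of these preparation steps, here is a minimal sketch using `torchaudio`. The helper name `prepare_audio`, the use of zero-padding rather than repetition, and dropping the trailing remainder when chunking are choices made for this example, not requirements of the repository.

```python
import torch
import torchaudio

TARGET_SR = 16000              # the model expects 16 kHz audio
SEGMENT_LEN = 6 * TARGET_SR    # 6-second segments

def prepare_audio(file_path: str) -> torch.Tensor:
    """Return a (num_chunks, SEGMENT_LEN) tensor of mono, 16 kHz, 6-second chunks."""
    waveform, sr = torchaudio.load(file_path)          # (channels, samples)

    # Mono channel: average the channels if there is more than one
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Sample rate: resample to 16 kHz if needed
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

    waveform = waveform.squeeze(0)                      # (samples,)

    # Padding: zero-pad clips shorter than 6 seconds (repeating the audio also works)
    if waveform.shape[0] < SEGMENT_LEN:
        waveform = torch.nn.functional.pad(waveform, (0, SEGMENT_LEN - waveform.shape[0]))

    # Chunking: split longer clips into consecutive 6-second segments
    num_chunks = waveform.shape[0] // SEGMENT_LEN
    return waveform[: num_chunks * SEGMENT_LEN].reshape(num_chunks, SEGMENT_LEN)
```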

## Example Code

The following example demonstrates loading, processing, and running inference on an audio file:

```python
import torch
from cvt import cvt13  # Import model architecture
from melspectrogram import MelSpectrogramProcessor  # Import Mel spectrogram processor

# Initialize the preprocessor and model
preprocessor = MelSpectrogramProcessor()
model = cvt13()

# Load pre-trained weights (choose exactly one checkpoint; each call below overwrites the previous one)

# Weights trained with Cross-Entropy (the encoder is stored under the 'encoder' key):
# model.load_state_dict(torch.load("ce.pth", map_location="cpu")['encoder'])

# Weights trained with SimCLR (self-supervised contrastive learning):
# model.load_state_dict(torch.load("simclr.pth", map_location="cpu"))

# Weights trained with SupCon (supervised contrastive learning):
# model.load_state_dict(torch.load("supcon.pth", map_location="cpu"))

# Weights trained with ProtoCLR (supervised contrastive learning with prototypes):
model.load_state_dict(torch.load("protoclr.pth", map_location="cpu"))

# Optionally move the model to GPU for faster processing, e.g. model = model.to('cuda')
model.eval()

# Load and preprocess a sample audio waveform
def load_waveform(file_path):
    # Example implementation using torchaudio (an assumption of this sketch);
    # replace it with your own audio loading function if you prefer
    import torchaudio
    waveform, sr = torchaudio.load(file_path)      # (channels, samples)
    waveform = waveform.mean(dim=0)                # convert to mono by averaging channels
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)  # resample to 16 kHz
    return waveform

waveform = load_waveform("path/to/audio.wav")  # Load your audio file here

# Ensure waveform is sampled at 16 kHz, then pad/chunk as needed for 6s length
input_tensor = preprocessor.process(waveform).unsqueeze(0)  # Add batch dimension

# Run the model on the preprocessed audio
with torch.no_grad():
    output = model(input_tensor)
    print("Model output shape:", output.shape)
```
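
For recordings longer than 6 seconds, one simple pattern is to embed each 6-second chunk separately and treat the results as a batch. The sketch below reuses the hypothetical `prepare_audio` helper from the Usage section and assumes, as in the example above, that `MelSpectrogramProcessor.process` takes a single 1-D waveform; how to aggregate the per-chunk features (e.g. averaging) is left to your application.

```python
# Embed a longer recording chunk by chunk (sketch; prepare_audio is the hypothetical helper above)
chunks = prepare_audio("path/to/long_recording.wav")   # (num_chunks, 6 * 16000)

with torch.no_grad():
    # Convert each 6 s chunk to a Mel spectrogram and stack them into one batch
    batch = torch.stack([preprocessor.process(chunk) for chunk in chunks])
    embeddings = model(batch)                           # one feature vector per chunk

print("Per-chunk embedding shape:", embeddings.shape)
```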

## Model Performance Comparison
The following table presents the classification accuracy (in %, mean ± standard deviation) of various models on one-shot and five-shot bird sound classification tasks, evaluated across different [soundscape datasets](https://zenodo.org/records/13994373).

| Model                     | Model Size | PER         | NES         | UHH         | HSN         | SSW         | SNE         | Mean  |
|---------------------------|------------|-------------|-------------|-------------|-------------|-------------|-------------|-------|
| Random Guessing           | -          | 0.75        | 1.12        | 3.70        | 5.26        | 1.04        | 1.78        | 2.22  |
|                           |            |             |             |             |             |             |             |       |
| **1-Shot Classification** |            |             |             |             |             |             |             |       |
| BirdAVES-biox-base        | 90M        | 7.41±1.0    | 26.4±2.3    | 13.2±3.1    | 9.84±3.5    | 8.74±0.6    | 14.1±3.1    | 13.2  |
| BirdAVES-bioxn-large      | 300M       | 7.59±0.8    | 27.2±3.6    | 13.7±2.9    | 12.5±3.6    | 10.0±1.4    | 14.5±3.2    | 14.2  |
| BioLingual                | 28M        | 6.21±1.1    | 37.5±2.9    | 17.8±3.5    | 17.6±5.1    | 22.5±4.0    | 26.4±3.4    | 21.3  |
| Perch                     | 80M        | 9.10±5.3    | 42.4±4.9    | 19.8±5.0    | 26.7±9.8    | 22.3±3.3    | 29.1±5.9    | 24.9  |
| CE (Ours)                 | 19M        | 9.55±1.5    | 41.3±3.6    | 19.7±4.7    | 25.2±5.7    | 17.8±1.4    | 31.5±5.4    | 24.2  |
| SimCLR (Ours)             | 19M        | 7.85±1.1    | 31.2±2.4    | 14.9±2.9    | 19.0±3.8    | 10.6±1.1    | 24.0±4.1    | 17.9  |
| SupCon (Ours)             | 19M        | 8.53±1.1    | 39.8±6.0    | 18.8±3.0    | 20.4±6.9    | 12.6±1.6    | 23.2±3.1    | 20.5  |
| ProtoCLR (Ours)           | 19M        | 9.23±1.6    | 38.6±5.1    | 18.4±2.3    | 21.2±7.3    | 15.5±2.3    | 25.8±5.2    | 21.4  |
|                           |            |             |             |             |             |             |             |       |
| **5-Shot Classification** |            |             |             |             |             |             |             |       |
| BirdAVES-biox-base        | 90M        | 11.6±0.8    | 39.7±1.8    | 22.5±2.4    | 22.1±3.3    | 16.1±1.7    | 28.3±2.3    | 23.3  |
| BirdAVES-bioxn-large      | 300M       | 15.0±0.9    | 42.6±2.7    | 23.7±3.8    | 28.4±2.4    | 18.3±1.8    | 27.3±2.3    | 25.8  |
| BioLingual                | 28M        | 13.6±1.3    | 65.2±1.4    | 31.0±2.9    | 34.3±3.5    | 43.9±0.9    | 49.9±2.3    | 39.6  |
| Perch                     | 80M        | 21.2±1.2    | 71.7±1.5    | 39.5±3.0    | 52.5±5.9    | 48.0±1.9    | 59.7±1.8    | 48.7  |
| CE (Ours)                 | 19M        | 21.4±1.3    | 69.2±1.8    | 35.6±3.4    | 48.2±5.5    | 39.9±1.1    | 57.5±2.3    | 45.3  |
| SimCLR (Ours)             | 19M        | 15.4±1.0    | 54.0±1.8    | 23.0±2.3    | 32.8±4.0    | 22.0±1.2    | 40.7±2.4    | 31.3  |
| SupCon (Ours)             | 19M        | 17.2±1.3    | 64.6±2.4    | 34.1±2.9    | 42.5±2.9    | 30.8±0.8    | 48.1±2.4    | 39.5  |
| ProtoCLR (Ours)           | 19M        | 19.2±1.1    | 67.9±2.8    | 36.1±4.3    | 48.0±4.3    | 34.6±2.3    | 48.6±2.8    | 42.4  |
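
As a rough illustration of how a k-shot evaluation with a frozen feature extractor can be carried out, the sketch below builds one prototype per class from the labelled support embeddings and assigns each query to the nearest prototype. It assumes cosine similarity on L2-normalized embeddings, which is a choice for this example; the exact protocol behind the numbers above is described in the paper and the official repository.

```python
import torch
import torch.nn.functional as F

def nearest_prototype_accuracy(support_emb, support_labels, query_emb, query_labels):
    """Few-shot evaluation sketch: one prototype per class, nearest-prototype assignment.

    support_emb: (num_support, dim) embeddings of the labelled support examples
    query_emb:   (num_query, dim) embeddings to classify
    """
    support_emb = F.normalize(support_emb, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)

    classes = support_labels.unique()
    # Class prototype = mean of that class's support embeddings
    prototypes = F.normalize(
        torch.stack([support_emb[support_labels == c].mean(dim=0) for c in classes]), dim=-1
    )

    # Assign each query to the class whose prototype is most similar (cosine similarity)
    predictions = classes[(query_emb @ prototypes.T).argmax(dim=1)]
    return (predictions == query_labels).float().mean().item()
```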

For additional details, please see the [pre-print on arXiv](https://arxiv.org/abs/2409.08589) and the [official GitHub repository](https://github.com/ilyassmoummad/ProtoCLR).

## Citation

If you use our model in your research, please cite the following paper:

```bibtex
@misc{moummad2024dirlbs,
      title={Domain-Invariant Representation Learning of Bird Sounds}, 
      author={Ilyass Moummad and Romain Serizel and Emmanouil Benetos and Nicolas Farrugia},
      year={2024},
      eprint={2409.08589},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2409.08589}, 
}
```