---
license: apache-2.0
pipeline_tag: audio-classification
tags:
- music
- code
---

# 0Shot1Shot V0.1

![DALL·E 2024-05-17 11.22.28 - A sleek, modern image representing a drum sample sorting model. The design should incorporate elements of music, such as drum icons or waveforms, and .webp](https://cdn-uploads.huggingface.co/production/uploads/643de7b937ca9134b58a83c8/zr_mhV2T68cLmLGAHqsBo.webp)
[Image generated using DALL-E 3]

This is my first swing at training a one-shot drum sample audio classification model for use in automagically sorting audio samples in a non-invasive way.

The dataset is one of my own creation, and isn't... amazing. It's called CYPAL 1SHOT v1 and contains around 4,500 one-shots from my personal collection, divided into six categories: 808, Clap, Closed Hat, Kick, Open Hat, and Snare.


![Figure_2.png](https://cdn-uploads.huggingface.co/production/uploads/643de7b937ca9134b58a83c8/gf1qdEPWTvjuYwMVhbKIq.png)


The dataset prep script converts audio files into spectrograms for deep learning: it validates each file, resamples it, and removes silence. Spectrograms are generated with Librosa, validated, and saved as NumPy arrays, with augmentation via added noise and transformations. A DataLoader with a custom sampler then batches the spectrograms efficiently.

The training script trains a ResNet-based audio classifier on the spectrogram data, using Optuna for hyperparameter optimization (fifty trials of 50 epochs each), then evaluates the model on a test set and logs the results. The resulting model consists of an initial convolutional layer followed by four residual blocks with increasing channel counts (64, 128, 256, 512); each block contains two convolutional layers with batch normalization and ReLU activation. The network then applies global average pooling, a fully connected layer, and a dropout layer, ending with a final fully connected layer with softmax activation for classification. Finally, training continues for fifty more epochs under the best hyperparameters Optuna found, using weighted sampling, to ensure convergence.

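The architecture above can be sketched in PyTorch along these lines. The block layout (two 3x3 convs with BN and ReLU per block, channels 64 → 128 → 256 → 512, global average pooling, FC + dropout + final FC with softmax) follows the description; the kernel sizes, strides, hidden FC width of 256, and 0.5 dropout rate are my assumptions, not the trained model's actual values.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two conv layers with batch norm and ReLU, plus a skip connection."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection so the skip connection matches the new shape
        self.skip = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))


class DrumResNet(nn.Module):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.stem = nn.Sequential(  # initial convolutional layer
            nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(  # four residual blocks, channels doubling
            ResidualBlock(64, 64),
            ResidualBlock(64, 128, stride=2),
            ResidualBlock(128, 256, stride=2),
            ResidualBlock(256, 512, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.head = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
            # Softmax per the description; training would typically use raw
            # logits with CrossEntropyLoss instead.
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        x = self.blocks(self.stem(x))
        return self.head(self.pool(x).flatten(1))
```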
Included is a sample sorting script that converts audio files to spectrograms and classifies them with the trained model. Classifications above 90% confidence are used to copy the files into labeled folders, and the process is managed through a Tkinter interface that allows folder selection and displays a progress bar during sorting.
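The confidence-gated copy logic might look something like this (the Tkinter UI and spectrogram conversion are omitted). `pick_label` and `sort_file` are illustrative names of my own; the 0.9 threshold and the copy-don't-move behavior come from the description above.

```python
import os
import shutil

CLASSES = ["808", "Clap", "Closed Hat", "Kick", "Open Hat", "Snare"]


def pick_label(probs, labels=CLASSES, threshold=0.9):
    """Return the predicted class only when the model is confident enough."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else None


def sort_file(path, probs, out_dir):
    """Copy (not move) a sample into a labeled folder -- non-invasive sorting."""
    label = pick_label(probs)
    if label is None:
        return None  # below the confidence bar; leave the file unsorted
    dest = os.path.join(out_dir, label)
    os.makedirs(dest, exist_ok=True)
    shutil.copy2(path, dest)
    return dest
```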

V1 wants:

more samples, a cleaner dataset, more categories (Crash, Ride, Rimshot, Tom, risers, fades, snaps, FX), and higher accuracy (currently around 87%, I think, even considering abysmal clap performance)

V2 wants:

Melodic samples, instrument one-shots (for keygroups/pitched usage), breaks, loops, alternative percussion (bongos, congas, timps, shakers, rattles), Foley, and soundscapes.