Intro

The Classical and Ethnic Vocal Style Classification model distinguishes between classical and ethnic vocal styles. All audio samples are sung by professional vocalists. The backbone network is first pretrained in the computer vision (CV) domain and then fine-tuned on an audio dataset of four categories that has been pre-processed into spectrograms.

Pre-training on CV tasks gives the network general image features, and fine-tuning adapts them to spectrogram inputs and to the subtle differences between classical and ethnic vocal styles. The dataset, which comprises samples from classical and various ethnic singing traditions, lets the model capture patterns associated with each vocal style. Using spectrograms as input lets the model analyze both the temporal and the frequency components of the audio signal.

The model has potential applications in the music industry and in cultural preservation, where it can be used to sort vocal performances into these two broad styles. Its reliance on backbones pretrained for computer vision also illustrates that neural networks trained in one domain can be adapted to another.
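To illustrate why a spectrogram exposes both time and frequency structure, here is a minimal sketch that builds a plain linear-frequency magnitude spectrogram with NumPy. It is not the pre-processing used for this model, which works with Mel, CQT and Chroma representations; the function name, frame length, hop size and test tone are all assumptions chosen for illustration.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Return the STFT magnitude with shape (frame_len // 2 + 1, n_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One real FFT per frame; transpose so rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


# Hypothetical input: one second of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = magnitude_spectrogram(tone)
# Average over time and find the strongest frequency bin.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 256
print(spec.shape, peak_hz)
```

Each column of `spec` is one short time slice and each row is one frequency band, so the array can be treated as a single-channel image, which is what makes a CV backbone applicable.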

Demo (inference code)

https://huggingface.co/spaces/ccmusic-database/bel_canto

Usage

from huggingface_hub import snapshot_download
# Download the model repository to the local cache and get its local path
model_dir = snapshot_download("ccmusic-database/bel_canto")
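Once the snapshot is downloaded, it can help to see what the directory contains before loading anything. The sketch below counts files by extension. The helper name `summarize_snapshot` and the demo file names are hypothetical, and nothing here assumes a particular layout for this repository; the demo runs on a temporary directory so it works offline.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_snapshot(model_dir):
    """Count the files under model_dir by lowercase extension ('' if none)."""
    files = [p for p in Path(model_dir).rglob("*") if p.is_file()]
    return Counter(p.suffix.lower() for p in files)


# Offline demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "README.md").write_text("demo")
    (Path(tmp) / "sub" / "weights.pt").write_bytes(b"\x00")
    demo_counts = summarize_snapshot(tmp)

print(dict(demo_counts))
```

In practice the same call would be `summarize_snapshot(model_dir)` with the path returned by `snapshot_download`.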

Maintenance

GIT_LFS_SKIP_SMUDGE=1 git clone git@hf.co:ccmusic-database/bel_canto
cd bel_canto

Results

Backbone	Mel	CQT	Chroma
Swin-S	0.928	0.936	0.787
Swin-T	0.906	0.863	0.731
AlexNet	0.919	0.920	0.746
ConvNeXt-T	0.895	0.925	0.714
GoogleNet	0.948	0.921	0.739
MNASNet1.3	0.931	0.931	0.765
SqueezeNet1.1	0.923	0.914	0.685
Average	0.921	0.916	0.738
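The Average row can be reproduced from the per-backbone scores, and the same data shows which backbone scores highest for each spectrogram type. The sketch below is a consistency check only: the scores are copied from the table above, and the variable names are chosen for illustration.

```python
from statistics import mean

columns = ("Mel", "CQT", "Chroma")
# Scores copied from the results table, in column order.
scores = {
    "Swin-S": (0.928, 0.936, 0.787),
    "Swin-T": (0.906, 0.863, 0.731),
    "AlexNet": (0.919, 0.920, 0.746),
    "ConvNeXt-T": (0.895, 0.925, 0.714),
    "GoogleNet": (0.948, 0.921, 0.739),
    "MNASNet1.3": (0.931, 0.931, 0.765),
    "SqueezeNet1.1": (0.923, 0.914, 0.685),
}

# Column means, rounded to the three decimals used in the table.
averages = {
    name: round(mean(row[i] for row in scores.values()), 3)
    for i, name in enumerate(columns)
}
# Highest-scoring backbone for each spectrogram type.
best = {
    name: max(scores, key=lambda backbone: scores[backbone][i])
    for i, name in enumerate(columns)
}

print(averages)  # {'Mel': 0.921, 'CQT': 0.916, 'Chroma': 0.738}
print(best)      # {'Mel': 'GoogleNet', 'CQT': 'Swin-S', 'Chroma': 'Swin-S'}
```

The computed means match the Average row. The highest single score in the table is GoogleNet on Mel spectrograms (0.948), and Chroma is the weakest input for every backbone.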

Best Result

Figures: loss curve; training and validation accuracy; confusion matrix.

Dataset

https://huggingface.co/datasets/ccmusic-database/bel_canto

Mirror

https://www.modelscope.cn/models/ccmusic-database/bel_canto

Evaluation

https://github.com/monetjoe/ccmusic_eval

Cite

@article{Zhou-2025,
  author  = {Monan Zhou and Shenyang Xu and Zhaorui Liu and Zhaowen Wang and Feng Yu and Wei Li and Baoqiang Han},
  title   = {CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research},
  journal = {Transactions of the International Society for Music Information Retrieval},
  volume  = {8},
  number  = {1},
  pages   = {22--38},
  month   = {Mar},
  year    = {2025},
  url     = {https://doi.org/10.5334/tismir.194},
  doi     = {10.5334/tismir.194}
}
