	---
	language: en
	license: apache-2.0
	tags:
	- audio-classification
	- generated_from_trainer
	metrics:
	- accuracy
	- f1
	---

	# Distil Audio Spectrogram Transformer AudioSet

	Distil Audio Spectrogram Transformer AudioSet is an audio classification model based on the [Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) architecture. This model is a distilled version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) on the [AudioSet](https://research.google.com/audioset/download.html) dataset.

	This model was trained using the Hugging Face Transformers library with PyTorch. All training was done on a Google Compute Engine VM with an NVIDIA A100 GPU. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/distil-ast-audioset/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/distil-ast-audioset/tensorboard) logged via TensorBoard.
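As a minimal usage sketch (not taken from the original training scripts), the checkpoint can presumably be loaded through the generic `transformers` audio classification classes. The helper name `classify_waveform`, the 16 kHz default sampling rate, and the sigmoid (multi-label) scoring are assumptions made for illustration; check the feature extractor's configuration for the sampling rate it actually expects.

```python
MODEL_ID = "bookbot/distil-ast-audioset"


def classify_waveform(waveform, sampling_rate=16_000, top_k=5):
    """Return the top_k (label, score) pairs for a mono waveform.

    `waveform` is expected to be a 1-D float array (for example a NumPy
    array) sampled at `sampling_rate` Hz. Hypothetical helper, not part
    of the model repository.
    """
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies or any network access.
    import torch
    from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

    extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
    model = AutoModelForAudioClassification.from_pretrained(MODEL_ID)
    model.eval()

    # Convert the raw waveform into the spectrogram features the model consumes.
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0]

    # Assumption: AudioSet is multi-label, so each class is scored
    # independently with a sigmoid rather than a softmax.
    scores = torch.sigmoid(logits)
    top_scores, top_indices = scores.topk(top_k)
    return [
        (model.config.id2label[index.item()], score.item())
        for score, index in zip(top_scores, top_indices)
    ]
```

Calling `classify_waveform` on, say, one second of 16 kHz audio downloads the checkpoint on first use and returns a list of `(label, score)` pairs sorted by descending score.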

	## Model

	| Model | #params | Arch. | Training/Validation data |
	| --------------------- | ------- | ----------------------------- | ------------------------ |
	| `distil-ast-audioset` | 44M | Audio Spectrogram Transformer | AudioSet |

	## Evaluation Results

	The model achieves the following results on evaluation:

	| Model | F1 | ROC AUC | Accuracy | mAP |
	| ------------------- | ------ | ------- | -------- | ------ |
	| Distil-AST AudioSet | 0.4876 | 0.7140 | 0.0714 | 0.4743 |
	| AST AudioSet | 0.4989 | 0.6905 | 0.1247 | 0.5603 |
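To make the comparison in the table easier to read, the sketch below computes how much of each score the distilled model retains. It assumes the "AST AudioSet" row refers to the original MIT checkpoint named in the introduction; the function name `retention` is illustrative only.

```python
# Scores copied from the evaluation table above.
distilled = {"F1": 0.4876, "ROC AUC": 0.7140, "Accuracy": 0.0714, "mAP": 0.4743}
reference = {"F1": 0.4989, "ROC AUC": 0.6905, "Accuracy": 0.1247, "mAP": 0.5603}


def retention(student, teacher):
    """Return the student/teacher ratio per metric (1.0 means parity)."""
    return {name: student[name] / teacher[name] for name in student}


ratios = retention(distilled, reference)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of the AST AudioSet score")
```

Under that assumption, the distilled model keeps roughly 98% of the F1 score and 85% of the mAP, scores slightly higher on ROC AUC, and reaches a little over half of the accuracy.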

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:

	- `learning_rate`: 3e-05
	- `train_batch_size`: 32
	- `eval_batch_size`: 32
	- `seed`: 0
	- `gradient_accumulation_steps`: 4
	- `total_train_batch_size`: 128
	- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
	- `lr_scheduler_type`: linear
	- `lr_scheduler_warmup_ratio`: 0.1
	- `num_epochs`: 10.0
	- `mixed_precision_training`: Native AMP
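The hyperparameters above imply an effective batch size and a learning-rate curve that can be worked out by hand. The sketch below does the arithmetic, assuming a single GPU, a total of 1530 optimizer steps (the final step listed in the training results), and a warmup length equal to the warmup ratio times the total steps, rounded up. The function `lr_at_step` is a hypothetical re-implementation of a linear warmup and decay schedule, not the training code itself.

```python
import math

LEARNING_RATE = 3e-5
TRAIN_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 4
WARMUP_RATIO = 0.1
TOTAL_STEPS = 1530  # final optimizer step reported in the training results

# Examples consumed per optimizer update (single GPU assumed).
effective_batch_size = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS

# Assumption: warmup length is the ratio applied to the total step count.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at_step(step):
    """Linear warmup from 0 to the peak rate, then linear decay back to 0."""
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / max(1, TOTAL_STEPS - warmup_steps)


print(effective_batch_size, warmup_steps)
for step in (0, warmup_steps, TOTAL_STEPS):
    print(step, lr_at_step(step))
```

With these values, the effective batch size matches the listed `total_train_batch_size` of 128, and the warmup phase spans 153 steps, which coincides with the first epoch in the training results.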

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | F1 | ROC AUC | Accuracy | mAP |
	| :-----------: | :---: | :---: | :-------------: | :----: | :-----: | :------: | :----: |
	| 1.5521 | 1.0 | 153 | 0.7759 | 0.3929 | 0.6789 | 0.0209 | 0.3394 |
	| 0.7088 | 2.0 | 306 | 0.5183 | 0.4480 | 0.7162 | 0.0349 | 0.4047 |
	| 0.484 | 3.0 | 459 | 0.4342 | 0.4673 | 0.7241 | 0.0447 | 0.4348 |
	| 0.369 | 4.0 | 612 | 0.3847 | 0.4777 | 0.7332 | 0.0504 | 0.4463 |
	| 0.2943 | 5.0 | 765 | 0.3587 | 0.4838 | 0.7284 | 0.0572 | 0.4556 |
	| 0.2446 | 6.0 | 918 | 0.3415 | 0.4875 | 0.7296 | 0.0608 | 0.4628 |
	| 0.2099 | 7.0 | 1071 | 0.3273 | 0.4896 | 0.7246 | 0.0648 | 0.4682 |
	| 0.186 | 8.0 | 1224 | 0.3140 | 0.4888 | 0.7171 | 0.0689 | 0.4711 |
	| 0.1693 | 9.0 | 1377 | 0.3101 | 0.4887 | 0.7157 | 0.0703 | 0.4741 |
	| 0.1582 | 10.0 | 1530 | 0.3063 | 0.4876 | 0.7140 | 0.0714 | 0.4743 |
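The per-epoch metrics do not all peak at the same point. The short sketch below scans the rows of the table above and reports the best epoch for each column; the names `ROWS`, `COLUMNS`, and `best_epochs` are illustrative and not part of the training scripts.

```python
# (epoch, validation loss, F1, ROC AUC, accuracy, mAP) copied from the table above.
ROWS = [
    (1, 0.7759, 0.3929, 0.6789, 0.0209, 0.3394),
    (2, 0.5183, 0.4480, 0.7162, 0.0349, 0.4047),
    (3, 0.4342, 0.4673, 0.7241, 0.0447, 0.4348),
    (4, 0.3847, 0.4777, 0.7332, 0.0504, 0.4463),
    (5, 0.3587, 0.4838, 0.7284, 0.0572, 0.4556),
    (6, 0.3415, 0.4875, 0.7296, 0.0608, 0.4628),
    (7, 0.3273, 0.4896, 0.7246, 0.0648, 0.4682),
    (8, 0.3140, 0.4888, 0.7171, 0.0689, 0.4711),
    (9, 0.3101, 0.4887, 0.7157, 0.0703, 0.4741),
    (10, 0.3063, 0.4876, 0.7140, 0.0714, 0.4743),
]
COLUMNS = ["validation_loss", "f1", "roc_auc", "accuracy", "map"]


def best_epochs(rows):
    """Map each metric to the epoch where it is best (lowest loss, highest score)."""
    best = {}
    for offset, name in enumerate(COLUMNS, start=1):
        pick = min if name == "validation_loss" else max
        best[name] = pick(rows, key=lambda row: row[offset])[0]
    return best


best = best_epochs(ROWS)
print(best)
```

Validation loss, accuracy, and mAP are best at the final epoch, while F1 peaks at epoch 7 and ROC AUC at epoch 4. The reported evaluation results match the epoch 10 row.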

	## Disclaimer

	Biases present in the pre-training datasets may carry over into this model's results; take them into account when using it.

	## Authors

	Distil Audio Spectrogram Transformer AudioSet was trained and evaluated by [Ananto Joyoadikusumo](https://anantoj.github.io), [David Samuel Setiawan](https://davidsamuell.github.io/), and [Wilson Wongso](https://wilsonwongso.dev/). All computation and development were done on Google Cloud.

	## Framework versions

	- Transformers 4.27.0.dev0
	- PyTorch 1.13.1+cu117
	- Datasets 2.10.0
	- Tokenizers 0.13.2