---
language: en
license: apache-2.0
tags:
  - audio-classification
  - generated_from_trainer
metrics:
  - accuracy
  - f1
---

# Distil Audio Spectrogram Transformer AudioSet

Distil Audio Spectrogram Transformer AudioSet is an audio classification model based on the [Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) architecture. It is a distilled version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593), trained on the [AudioSet](https://research.google.com/audioset/download.html) dataset.
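
The snippet below is a minimal usage sketch via the `transformers` audio-classification pipeline; the audio file path and `top_k` value are placeholders rather than part of the original card.

```python
# Minimal usage sketch: classify an audio clip with the distilled AST model.
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="bookbot/distil-ast-audioset",
)

# "speech.wav" is a placeholder path; the pipeline handles decoding and resampling.
predictions = classifier("speech.wav", top_k=5)
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```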

This model was trained using the Hugging Face Transformers framework in PyTorch. All training was done on a Google Compute Engine VM with a Tesla A100 GPU. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/distil-ast-audioset/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/distil-ast-audioset/tensorboard) logged via TensorBoard.

## Model

| Model                 | #params | Arch.                         | Training/Validation data |
| --------------------- | ------- | ----------------------------- | ------------------------ |
| `distil-ast-audioset` | 44M     | Audio Spectrogram Transformer | AudioSet                 |

## Evaluation Results

The model achieves the following results on the evaluation set, compared against the original (teacher) AST model:

| Model               | F1     | ROC AUC | Accuracy | mAP    |
| ------------------- | ------ | ------- | -------- | ------ |
| Distil-AST AudioSet | 0.4876 | 0.7140  | 0.0714   | 0.4743 |
| AST AudioSet        | 0.4989 | 0.6905  | 0.1247   | 0.5603 |
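
AudioSet classification is multi-label, so these metrics are computed over per-class probabilities. The sketch below shows one common way to derive them; the 0.5 sigmoid threshold and the averaging modes are assumptions and may differ from the actual evaluation scripts.

```python
# Hedged sketch of a multi-label evaluation (threshold and averaging are assumptions).
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    roc_auc_score,
)

def evaluate(logits: np.ndarray, labels: np.ndarray, threshold: float = 0.5) -> dict:
    """logits, labels: arrays of shape (num_examples, num_classes)."""
    probs = 1.0 / (1.0 + np.exp(-logits))    # sigmoid over each class independently
    preds = (probs >= threshold).astype(int)
    return {
        "f1": f1_score(labels, preds, average="micro"),
        "roc_auc": roc_auc_score(labels, probs, average="micro"),
        # exact-match (subset) accuracy: every label of a clip must be correct,
        # which is why this number is much lower than the other metrics
        "accuracy": accuracy_score(labels, preds),
        "mAP": average_precision_score(labels, probs, average="macro"),
    }
```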

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 3e-05
- `train_batch_size`: 32
- `eval_batch_size`: 32
- `seed`: 0
- `gradient_accumulation_steps`: 4
- `total_train_batch_size`: 128
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_ratio`: 0.1
- `num_epochs`: 10.0
- `mixed_precision_training`: Native AMP
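
For reference, these settings map roughly onto Hugging Face `TrainingArguments` as sketched below; the output directory is a placeholder, and any argument not listed above (e.g. Adam's betas and epsilon, which match the library defaults) is an assumption rather than the exact invocation used.

```python
# Hedged sketch of the training configuration; not the exact original invocation.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distil-ast-audioset",  # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,     # effective batch size: 32 * 4 = 128
    num_train_epochs=10.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=0,
    fp16=True,                         # "Native AMP" mixed-precision training
)
```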

### Training results

| Training Loss | Epoch | Step  | Validation Loss |   F1   | ROC AUC | Accuracy |  mAP   |
| :-----------: | :---: | :---: | :-------------: | :----: | :-----: | :------: | :----: |
|    1.5521     |  1.0  |  153  |     0.7759      | 0.3929 | 0.6789  |  0.0209  | 0.3394 |
|    0.7088     |  2.0  |  306  |     0.5183      | 0.4480 | 0.7162  |  0.0349  | 0.4047 |
|     0.484     |  3.0  |  459  |     0.4342      | 0.4673 | 0.7241  |  0.0447  | 0.4348 |
|     0.369     |  4.0  |  612  |     0.3847      | 0.4777 | 0.7332  |  0.0504  | 0.4463 |
|    0.2943     |  5.0  |  765  |     0.3587      | 0.4838 | 0.7284  |  0.0572  | 0.4556 |
|    0.2446     |  6.0  |  918  |     0.3415      | 0.4875 | 0.7296  |  0.0608  | 0.4628 |
|    0.2099     |  7.0  | 1071  |     0.3273      | 0.4896 | 0.7246  |  0.0648  | 0.4682 |
|     0.186     |  8.0  | 1224  |     0.3140      | 0.4888 | 0.7171  |  0.0689  | 0.4711 |
|    0.1693     |  9.0  | 1377  |     0.3101      | 0.4887 | 0.7157  |  0.0703  | 0.4741 |
|    0.1582     | 10.0  | 1530  |     0.3063      | 0.4876 | 0.7140  |  0.0714  | 0.4743 |

## Disclaimer

Do consider the biases of the pre-training datasets, which may carry over into this model's predictions.

## Authors

Distil Audio Spectrogram Transformer AudioSet was trained and evaluated by [Ananto Joyoadikusumo](https://anantoj.github.io), [David Samuel Setiawan](https://davidsamuell.github.io/), and [Wilson Wongso](https://wilsonwongso.dev/). All computation and development were done on Google Cloud.

## Framework versions

- Transformers 4.27.0.dev0
- PyTorch 1.13.1+cu117
- Datasets 2.10.0
- Tokenizers 0.13.2