Model card for vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

A Vision Transformer (ViT) for audio. Pretrained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) method and fine-tuned on AudioSet-20k.

This is a port of the AudioMAE ViT-B/16 weights for use with timm. The naming convention follows that of timm's other ViT models.
See the original repo here: https://github.com/facebookresearch/AudioMAE
For the AudioSet-2M pre-trained checkpoint (without AudioSet-20k fine-tuning), see https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m

Model Details

Model Type: Audio classification / feature backbone
Papers:
- Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405
Pretrain Dataset: AudioSet-2M
Original: https://github.com/facebookresearch/AudioMAE

Model Usage

Audio Classification and Embeddings

import timm
import torch
import torch.nn.functional as F
from torchaudio.compliance import kaldi

# NOTE: for timm<0.9.11, you also need to pass `global_pool='avg'`
# if only embeddings are needed, pass `num_classes=0`
model = timm.create_model("hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k", pretrained=True)
model = model.eval()

MEAN = -4.2677393
STD = 4.5689974

audio = torch.randn(1, 10 * 16_000)  # dummy 10 s mono clip; real audio must be sampled at 16 kHz
melspec = kaldi.fbank(audio, htk_compat=True, window_type="hanning", num_mel_bins=128)  # shape (n_frames, 128)

# AudioMAE only accepts 1024-frame input
if melspec.shape[0] < 1024:
    melspec = F.pad(melspec, (0, 0, 0, 1024 - melspec.shape[0]))
else:
    melspec = melspec[:1024]
melspec = (melspec - MEAN) / (STD * 2)

melspec = melspec.view(1, 1, 1024, 128)  # add batch dim and channel dim
output = model(melspec)

# for classification
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

# for embeddings (model created with `num_classes=0`)
output  # shape (1, 768)
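
The padding branch in the snippet above runs for a 10-second clip. The following sketch shows the arithmetic behind that, assuming `kaldi.fbank` keeps its default 25 ms window, 10 ms shift and `snip_edges=True` (the snippet does not override them). The helper names `num_fbank_frames` and `pad_or_crop` are illustrative only and are not part of timm or torchaudio.

```python
# Frame-count arithmetic for Kaldi-style filterbank features.
# Assumption: 25 ms window, 10 ms shift, only fully covered windows kept.

SAMPLE_RATE = 16_000      # Hz, as required by the usage snippet
FRAME_LENGTH_MS = 25.0    # assumed default window length
FRAME_SHIFT_MS = 10.0     # assumed default frame shift
TARGET_FRAMES = 1024      # fixed number of frames the model accepts
NUM_MEL_BINS = 128        # mel bins used in the usage snippet
PATCH_SIZE = 16           # from "patch16" in the model name


def num_fbank_frames(num_samples):
    """Number of frames when only fully covered windows are kept."""
    window = int(SAMPLE_RATE * FRAME_LENGTH_MS / 1000)  # 400 samples
    shift = int(SAMPLE_RATE * FRAME_SHIFT_MS / 1000)    # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def pad_or_crop(num_frames):
    """Return (frames_kept, frames_padded) needed to reach TARGET_FRAMES."""
    kept = min(num_frames, TARGET_FRAMES)
    return kept, TARGET_FRAMES - kept


frames_10s = num_fbank_frames(10 * SAMPLE_RATE)
kept, padded = pad_or_crop(frames_10s)
# Patch grid over the (time, mel) spectrogram
patch_grid = (TARGET_FRAMES // PATCH_SIZE, NUM_MEL_BINS // PATCH_SIZE)

print(frames_10s, kept, padded)                   # 998 998 26
print(patch_grid, patch_grid[0] * patch_grid[1])  # (64, 8) 512
```

Under these assumptions a 10-second clip yields 998 frames, so 26 zero frames are appended before normalization. Clips longer than about 10.24 seconds are truncated to the first 1024 frames.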

Citation

@inproceedings{huang2022amae,
  title = {Masked Autoencoders that Listen},
  author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year = {2022}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
