gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

Audio Classification

gaunernst committed on Nov 16, 2023

Commit a4c61e8 • 1 Parent(s): 0c9aeaf

Update README.md

Files changed (1)

README.md +88 -0

@@ -1,3 +1,91 @@
 ---
 license: cc-by-4.0
+library_name: timm
 ---
+# Model card for vit_base_patch16_1024_128.audiomae_as2m_ft_as20k
+This is a port of the AudioMAE ViT-B/16 weights for use with `timm`. The naming convention follows that of other `timm` ViT models.
+See the original repo here: https://github.com/facebookresearch/AudioMAE
+A Vision Transformer (ViT) for audio, pretrained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) method and fine-tuned on AudioSet-20k.
+## Model Details
+- **Model Type:** Audio classification / feature backbone
+- **Papers:**
+  - Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405
+- **Pretrain Dataset:** AudioSet-2M
+- **Original:** https://github.com/facebookresearch/AudioMAE
+## Model Usage
+### Audio Classification
+```python
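+import timm
+import torch
+import torch.nn.functional as F
+from torchaudio.compliance import kaldi
+# the 'hf_hub:' prefix tells timm to load the weights from the Hugging Face Hub
+model = timm.create_model('hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k', pretrained=True)
+model = model.eval()
+# normalization statistics as used in the original AudioMAE repo (assumed to apply to this port)
+MEAN, STD = -4.2677393, 4.5689974
+# placeholder input: 10 s of mono audio at 16 kHz, shape (channels, samples); replace with a real waveform
+audio = torch.randn(1, 10 * 16_000)
+melspec = kaldi.fbank(audio, htk_compat=True, window_type='hanning', num_mel_bins=128)  # (n_frames, 128)
+# the model expects exactly 1024 frames: zero-pad short clips, truncate long ones
+melspec = F.pad(melspec, (0, 0, 0, max(0, 1024 - melspec.shape[0])))[:1024]
+melspec = melspec.view(1, 1, 1024, 128)  # add batch and channel dimensions
+melspec = (melspec - MEAN) / (STD * 2)
+output = model(melspec)  # logits, one row per clip
+top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)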
+```
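The `1024_128` in the model name suggests an input of 1024 time frames by 128 mel bins. As a rough check of how much audio that covers, the sketch below counts Kaldi-style frames for a clip. It assumes a 25 ms window, a 10 ms shift, and no edge padding, which to our knowledge are the defaults of `torchaudio.compliance.kaldi.fbank`; the helper name `num_fbank_frames` is hypothetical and not part of any library.

```python
def num_fbank_frames(num_samples: int, sample_rate: int = 16_000,
                     frame_length_ms: float = 25.0,
                     frame_shift_ms: float = 10.0) -> int:
    """Count frames for Kaldi-style framing without edge padding.

    Hypothetical helper; the window and shift values are assumptions.
    """
    window = int(sample_rate * frame_length_ms / 1000)  # samples per frame
    shift = int(sample_rate * frame_shift_ms / 1000)    # samples between frame starts
    if num_samples < window:
        return 0  # clip is too short to hold a single frame
    return 1 + (num_samples - window) // shift


frames = num_fbank_frames(10 * 16_000)  # a 10-second clip at 16 kHz
print(frames, 1024 - frames)  # frames produced, padding needed to reach 1024
```

Under these assumptions a 10-second clip falls slightly short of 1024 frames, so it would need zero-padding along the time axis before being passed to the model.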
+### Audio Embeddings
+```python
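+import timm
+import torch
+import torch.nn.functional as F
+from torchaudio.compliance import kaldi
+# the 'hf_hub:' prefix tells timm to load the weights from the Hugging Face Hub
+model = timm.create_model(
+    'hf_hub:gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k',
+    pretrained=True,
+    num_classes=0,  # remove classifier nn.Linear
+)
+model = model.eval()
+# normalization statistics as used in the original AudioMAE repo (assumed to apply to this port)
+MEAN, STD = -4.2677393, 4.5689974
+# placeholder input: 10 s of mono audio at 16 kHz, shape (channels, samples); replace with a real waveform
+audio = torch.randn(1, 10 * 16_000)
+melspec = kaldi.fbank(audio, htk_compat=True, window_type='hanning', num_mel_bins=128)  # (n_frames, 128)
+# the model expects exactly 1024 frames: zero-pad short clips, truncate long ones
+melspec = F.pad(melspec, (0, 0, 0, max(0, 1024 - melspec.shape[0])))[:1024]
+melspec = melspec.view(1, 1, 1024, 128)  # add batch and channel dimensions
+melspec = (melspec - MEAN) / (STD * 2)
+output = model(melspec)  # output is (batch_size, num_features) shaped tensor
+# or equivalently (without needing to set num_classes=0)
+output = model.forward_features(melspec)
+# output is unpooled, a (1, 513, 768) shaped tensor (64 x 8 patches + 1 class token)
+output = model.forward_head(output, pre_logits=True)
+# output is a (1, num_features) shaped tensor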
+```
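The unpooled output of `forward_features` carries one token per patch plus any extra tokens. A minimal sketch of that arithmetic follows, assuming a 1024×128 input, non-overlapping 16×16 patches, and a single class token; these are inferred from the model name and standard ViT conventions rather than stated in this commit, and `num_tokens` is a hypothetical helper.

```python
def num_tokens(height: int, width: int, patch: int = 16,
               extra_tokens: int = 1) -> int:
    """Sequence length of a ViT with non-overlapping square patches.

    Hypothetical helper; extra_tokens=1 assumes a single class token.
    """
    if height % patch or width % patch:
        raise ValueError("input size must be divisible by the patch size")
    return (height // patch) * (width // patch) + extra_tokens


print(num_tokens(1024, 128))  # 64 * 8 patches + 1 class token
print(num_tokens(224, 224))   # 14 * 14 patches + 1 class token
```

With these assumptions the spectrogram input yields 513 tokens, compared with 197 for a 224×224 image, so the unpooled tensor would have shape `(1, 513, 768)`.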
+## Citation
+```bibtex
+@inproceedings{huang2022amae,
+  title = {Masked Autoencoders that Listen},
+  author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
+  booktitle = {NeurIPS},
+  year = {2022}
+}
+```
+```bibtex
+@misc{rw2019timm,
+  author = {Ross Wightman},
+  title = {PyTorch Image Models},
+  year = {2019},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  doi = {10.5281/zenodo.4414861},
+  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
+}
+```