gaunernst committed
Commit a4c61e8 · 1 Parent(s): 0c9aeaf

Update README.md

Files changed (1): README.md +88 -0

README.md CHANGED
---
license: cc-by-4.0
library_name: timm
---

# Model card for vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

This is a port of the AudioMAE ViT-B/16 weights for use with `timm`. The naming convention follows other `timm` ViT models: patch size 16, input size 1024×128 (time frames × Mel bins).

See the original repo here: https://github.com/facebookresearch/AudioMAE

A Vision Transformer (ViT) for audio. It takes 128-bin log-Mel spectrograms with 1024 time frames as input. Pretrained on AudioSet-2M with the self-supervised Masked Autoencoder (MAE) method, then fine-tuned on AudioSet-20k.

## Model Details
- **Model Type:** Audio classification / feature backbone
- **Papers:**
  - Masked Autoencoders that Listen: https://arxiv.org/abs/2207.06405
- **Pretrain Dataset:** AudioSet-2M
- **Original:** https://github.com/facebookresearch/AudioMAE

## Model Usage
### Audio Classification
```python
import timm
import torch
import torch.nn.functional as F
import torchaudio

# load an audio clip ('sample.wav' is a placeholder - point it at your own file) and resample to 16 kHz
waveform, sr = torchaudio.load('sample.wav')
waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform - waveform.mean()

model = timm.create_model('gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k', pretrained=True)
model = model.eval()

# 128-bin log-Mel (Kaldi fbank) features, following the preprocessing in the original AudioMAE repo
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=16000, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10,
)

# pad or crop to 1024 frames, then normalize with the AudioSet stats used by the original repo
fbank = F.pad(fbank, (0, 0, 0, max(0, 1024 - fbank.shape[0])))[:1024]
fbank = (fbank - (-4.2677393)) / (2 * 4.5689974)

output = model(fbank.unsqueeze(0).unsqueeze(0))  # add batch and channel dims -> (1, 1, 1024, 128)

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```
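
Class indices follow the 527-class AudioSet ontology. A minimal sketch for mapping the top-5 indices to human-readable labels, assuming the classifier head keeps the standard index order from the AudioSet `class_labels_indices.csv` release:

```python
import csv
from urllib.request import urlopen

# AudioSet label map (columns: index, mid, display_name)
LABEL_URL = 'http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv'
rows = csv.DictReader(urlopen(LABEL_URL).read().decode('utf-8').splitlines())
index_to_label = {int(r['index']): r['display_name'] for r in rows}

# print the top-5 predictions from the classification example above
for prob, idx in zip(top5_probabilities[0].tolist(), top5_class_indices[0].tolist()):
    print(f'{index_to_label[idx]}: {prob:.1f}%')
```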

### Audio Embeddings
```python
import timm

model = timm.create_model(
    'gaunernst/vit_base_patch16_1024_128.audiomae_as2m_ft_as20k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# prepare `fbank` exactly as in the Audio Classification example above
# (Kaldi fbank -> pad/crop to 1024 frames -> AudioSet normalization)

output = model(fbank.unsqueeze(0).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(fbank.unsqueeze(0).unsqueeze(0))
# output is unpooled, a (1, 513, 768) shaped tensor (512 patch tokens + class token)

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
```
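
The pooled embedding can be used for clip-to-clip similarity or retrieval. A minimal sketch, assuming two clips were preprocessed into `fbank_a` and `fbank_b` as shown above and `model` was created with `num_classes=0`:

```python
import torch.nn.functional as F

# (1, num_features) embeddings for two preprocessed clips
emb_a = model(fbank_a.unsqueeze(0).unsqueeze(0))
emb_b = model(fbank_b.unsqueeze(0).unsqueeze(0))

# cosine similarity between the two embeddings, shape (1,), values in [-1, 1]
similarity = F.cosine_similarity(emb_a, emb_b)
print(similarity.item())
```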

## Citation
```bibtex
@inproceedings{huang2022amae,
  title = {Masked Autoencoders that Listen},
  author = {Huang, Po-Yao and Xu, Hu and Li, Juncheng and Baevski, Alexei and Auli, Michael and Galuba, Wojciech and Metze, Florian and Feichtenhofer, Christoph},
  booktitle = {NeurIPS},
  year = {2022}
}
```
```bibtex
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
```