---
license: mit
inference: false
tags:
- music
---

# Introduction to our model series

The development log of our Music Audio Pre-training (m-a-p) model family:
- 17/03/2023: we released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
- 14/03/2023: we retrained the MERT-v0 model on open-source-only music data and released it as [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
- 29/12/2022: a music understanding model, [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), trained with the **MLM** paradigm, which performs better on downstream tasks.
- 29/10/2022: a pre-trained MIR model, [music2vec](https://huggingface.co/m-a-p/music2vec-v1), trained with the **BYOL** paradigm.

The table below gives a quick comparison for choosing a model:

| Name | Pre-train Paradigm | Training Data (hours) | Pre-train Context (seconds) | Model Size | Transformer Layer-Dimension | Feature Rate | Sample Rate | Release Date |
| ------------------------------------------------------------ | ------------------ | --------------------- | --------------------------- | ---------- | --------------------------- | ------------ | ----------- | ------------ |
| [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M) | MLM | 160K | 5 | 330M | 24-1024 | 75 Hz | 24 kHz | 17/03/2023 |
| [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) | MLM | 20K | 5 | 95M | 12-768 | 75 Hz | 24 kHz | 17/03/2023 |
| [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public) | MLM | 900 | 5 | 95M | 12-768 | 50 Hz | 16 kHz | 14/03/2023 |
| [MERT-v0](https://huggingface.co/m-a-p/MERT-v0) | MLM | 1000 | 5 | 95M | 12-768 | 50 Hz | 16 kHz | 29/12/2022 |
| [music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1) | BYOL | 1000 | 30 | 95M | 12-768 | 50 Hz | 16 kHz | 30/10/2022 |

## Explanation

The m-a-p models share a similar architecture; the most distinguishing difference is the pre-training paradigm. Beyond that, there are a few technical configuration details to know before using them:

- **Model Size**: the number of parameters that will be loaded into memory. Please select a size that fits your hardware.
- **Transformer Layer-Dimension**: the number of transformer layers and the corresponding feature dimension that the model outputs. This is highlighted because features extracted from **different layers can perform differently depending on the task**.
- **Feature Rate**: the number of feature frames the model outputs for 1 second of audio input.
- **Sample Rate**: the sampling rate of the audio the model was trained on.

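As a rough illustration of how these configuration values translate into numbers, here is a minimal sketch. The helper names are hypothetical and not part of any m-a-p or `transformers` API. The frame count is only approximate, since the exact value depends on the model's convolutional front end. The hidden-state count assumes one representation per transformer layer plus the input embeddings, which is consistent with the 13 representations reported for a 12-layer model in the usage snippet. The memory estimate covers the weights only and assumes 32-bit floats.

```python
def expected_frames(duration_s: float, feature_rate_hz: int) -> int:
    """Approximate number of feature frames for a clip of the given length."""
    return int(duration_s * feature_rate_hz)


def num_hidden_states(transformer_layers: int) -> int:
    """Assumed count of representations: input embeddings plus one per layer."""
    return transformer_layers + 1


def weight_memory_mb(num_params: float, bytes_per_param: int = 4) -> float:
    """Memory taken by the weights alone, assuming 32-bit floats by default."""
    return num_params * bytes_per_param / 1e6


print(expected_frames(5, 75))   # about 375 frames for a 5-second clip at 75 Hz
print(num_hidden_states(12))    # 13
print(weight_memory_mb(95e6))   # 380.0
```

Activations, optimizer state, and batch size add to the weight footprint, so treat the last number as a lower bound when matching a model to your hardware.
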
# Introduction to MERT-v1

Compared to MERT-v0, we introduce several changes in MERT-v1 pre-training:

- Change the pseudo labels to 8 codebooks from [encodec](https://github.com/facebookresearch/encodec), which potentially have higher quality and enable our model to support music generation.
- MLM prediction with in-batch noise mixture.
- Train at a higher audio sampling rate (24 kHz).
- Train on more audio data (up to 160 thousand hours).
- More available model sizes: 95M and 330M.

More details will be provided in our forthcoming paper.

# Model Usage

```python
from transformers import Wav2Vec2Processor
from transformers import AutoModel
import torch
from torch import nn
from datasets import load_dataset

# load demo audio and set up the processor
# note: this demo clip is 16 kHz speech, while MERT-v1 was trained on 24 kHz audio;
# for your own data, resample to the model's sample rate first
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")

# load our model weights
# pinning a revision is recommended for security reasons; the hash may be updated
commit_hash = 'bccff5376fc07235d88954b43e5cd739fbc0796b'
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True, revision=commit_hash)

# the audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape: there are 13 layers of representation
# each layer performs differently on different downstream tasks, so choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layer, 292 timestep, 768 feature_dim]

# for utterance-level classification tasks, you can simply average the representation over time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# you can even use a learnable weighted average over the layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```

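The `nn.Conv1d` aggregator in the snippet above learns unconstrained per-layer weights plus a bias. One possible alternative, sketched here as an assumption rather than as the authors' method, is to normalize the layer weights with a softmax so that they stay positive and sum to one. The class name `LayerWeightedAverage` is hypothetical, and a random tensor stands in for `time_reduced_hidden_states` so the example runs without downloading the model.

```python
import torch
from torch import nn


class LayerWeightedAverage(nn.Module):
    """Softmax-normalized learnable average over the layer axis (illustrative)."""

    def __init__(self, num_layers: int = 13):
        super().__init__()
        # One scalar per layer; zeros give a uniform average at initialization.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_layers, feature_dim], e.g. time-averaged features
        weights = torch.softmax(self.layer_logits, dim=0)
        return (weights.unsqueeze(-1) * hidden_states).sum(dim=0)


# Random stand-in with the same shape as time_reduced_hidden_states: [13, 768].
dummy_features = torch.randn(13, 768)
pooled = LayerWeightedAverage(num_layers=13)(dummy_features)
print(pooled.shape)  # torch.Size([768])
```

After training on a downstream task, the softmax weights can be read directly as an indication of which layers the task relies on, which is one way to do the empirical layer selection mentioned in the comments above.
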
# Citation

```bibtex
@article{li2022large,
  title={Large-Scale Pretrained Model for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  year={2022}
}
```