reach-vb HF staff JadeCopet committed on
Commit
53d5f64
1 Parent(s): 7a5a4bc

Update README.md (#4)

- Update README.md (3f6490272222cfb73063269e9af1120763dac37b)
- Update README.md (13c7faf27e8c30d49ea425ca3f28f353333bd5f2)


Co-authored-by: Jade Copet <JadeCopet@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +18 -8
README.md CHANGED
@@ -65,6 +65,7 @@ for idx, one_wav in enumerate(wav):
65
  audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
66
  ```
67
 
 
68
  ## Model details
69
 
70
  **Organization developing the model:** The FAIR team of Meta AI.
@@ -75,9 +76,9 @@ for idx, one_wav in enumerate(wav):
75
 
76
  **Model type:** MusicGen consists of an EnCodec model for audio tokenization, an auto-regressive language model based on the transformer architecture for music modeling. The model comes in different sizes: 300M, 1.5B and 3.3B parameters ; and two variants: a model trained for text-to-music generation task and a model trained for melody-guided music generation.
77
 
78
- **Paper or resources for more information:** More information can be found in the paper [Simple and Controllable Music Generation][https://arxiv.org/abs/2306.05284].
79
 
80
- **Citation details**:
81
  ```
82
  @misc{copet2023simple,
83
  title={Simple and Controllable Music Generation},
@@ -89,7 +90,7 @@ for idx, one_wav in enumerate(wav):
89
  }
90
  ```
91
 
92
- **License** Code is released under MIT, model weights are released under CC-BY-NC 4.0.
93
 
94
  **Where to send questions or comments about the model:** Questions and comments about MusicGen can be sent via the [Github repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue.
95
 
@@ -101,7 +102,7 @@ for idx, one_wav in enumerate(wav):
101
 
102
  **Primary intended users:** The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateur seeking to better understand those models.
103
 
104
- **Out-of-scope use cases** The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
105
 
106
  ## Metrics
107
 
@@ -127,17 +128,26 @@ The model was evaluated on the [MusicCaps benchmark](https://www.kaggle.com/data
127
 
128
  ## Training datasets
129
 
130
- The model was trained using the following sources: the [Meta Music Initiative Sound Collection](https://www.fb.com/sound), [Shutterstock music collection](https://www.shutterstock.com/music) and the [Pond5 music collection](https://www.pond5.com/). See the paper for more details about the training set and corresponding preprocessing.
 
 
 
 
131
 
132
- ## Quantitative analysis
 
 
 
 
 
133
 
134
- More information can be found in the paper [Simple and Controllable Music Generation][arxiv], in the Experimental Setup section.
135
 
136
  ## Limitations and biases
137
 
138
  **Data:** The data sources used to train the model are created by music professionals and covered by legal agreements with the right holders. The model is trained on 20K hours of data, we believe that scaling the model on larger datasets can further improve the performance of the model.
139
 
140
- **Mitigations:** All vocals have been removed from the data source using a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs). The model is therefore not able to produce vocals.
141
 
142
  **Limitations:**
143
 
 
65
  audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
66
  ```
67
 
68
+
69
  ## Model details
70
 
71
  **Organization developing the model:** The FAIR team of Meta AI.
 
76
 
77
  **Model type:** MusicGen consists of an EnCodec model for audio tokenization, an auto-regressive language model based on the transformer architecture for music modeling. The model comes in different sizes: 300M, 1.5B and 3.3B parameters ; and two variants: a model trained for text-to-music generation task and a model trained for melody-guided music generation.
78
 
79
+ **Paper or resources for more information:** More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284).
80
 
81
+ **Citation details:**
82
  ```
83
  @misc{copet2023simple,
84
  title={Simple and Controllable Music Generation},
 
90
  }
91
  ```
92
 
93
+ **License:** Code is released under MIT, model weights are released under CC-BY-NC 4.0.
94
 
95
  **Where to send questions or comments about the model:** Questions and comments about MusicGen can be sent via the [Github repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue.
96
 
 
102
 
103
  **Primary intended users:** The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateur seeking to better understand those models.
104
 
105
+ **Out-of-scope use cases:** The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
106
 
107
  ## Metrics
108
 
 
128
 
129
  ## Training datasets
130
 
131
+ The model was trained on licensed data using the following sources: the [Meta Music Initiative Sound Collection](https://www.fb.com/sound), [Shutterstock music collection](https://www.shutterstock.com/music) and the [Pond5 music collection](https://www.pond5.com/). See the paper for more details about the training set and corresponding preprocessing.
132
+
133
+ ## Evaluation results
134
+
135
+ Below are the objective metrics obtained on MusicCaps with the released model. Note that for the publicly released models, we had all the datasets go through a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs), in order to keep only the instrumental part. This explains the difference in objective metrics with the models used in the paper.
136
 
137
+ | Model | Frechet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
138
+ |---|---|---|---|---|
139
+ | facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
140
+ | facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
141
+ | facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
142
+ | **facebook/musicgen-melody** | 4.93 | 1.41 | 0.27 | 0.44 |
143
 
144
+ More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284), in the Results section.
145
 
146
  ## Limitations and biases
147
 
148
  **Data:** The data sources used to train the model are created by music professionals and covered by legal agreements with the right holders. The model is trained on 20K hours of data, we believe that scaling the model on larger datasets can further improve the performance of the model.
149
 
150
+ **Mitigations:** Vocals have been removed from the data source using corresponding tags, and then using a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs).
151
 
152
  **Limitations:**
153