dorienh committed
Commit 598c444 • 1 parent: 6ab2eaa

Update README.md

Files changed (1)
  1. README.md +139 -139
README.md CHANGED

---
license: apache-2.0
datasets:
- amaai-lab/MidiCaps
tags:
- music
- text-to-music
- symbolic-music
---

<div align="center">

# Text2midi: Generating Symbolic Music from Captions

[Demo](https://huggingface.co/spaces/amaai-lab/text2midi) | [Model](https://huggingface.co/amaai-lab/text2midi) | [Website and Examples](https://github.com/AMAAI-Lab/text2midi) | [Paper](https://arxiv.org/abs/TBD) | [Dataset](https://huggingface.co/datasets/amaai-lab/MidiCaps)

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/amaai-lab/text2midi)
</div>

**text2midi** is the first end-to-end model for generating MIDI files from textual descriptions. By leveraging pretrained large language models and a powerful autoregressive transformer decoder, **text2midi** allows users to create symbolic music that aligns with detailed textual prompts, including musical attributes like chords, tempo, and style.

🔥 Live demo available on [Hugging Face Spaces](https://huggingface.co/spaces/amaai-lab/text2midi).

<div align="center">
  <img src="text2midi_architecture.jpg" width="500"/>
</div>

## Quickstart Guide

Generate symbolic music from a text prompt:

```python
import os
import pickle

import torch
import torch.nn as nn
from transformers import T5Tokenizer

from model.transformer_model import Transformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
artifact_folder = 'artifacts'

# Load the tokenizer dictionary (a pickled miditok REMI tokenizer)
tokenizer_filepath = os.path.join(artifact_folder, "vocab_remi.pkl")
with open(tokenizer_filepath, "rb") as f:
    r_tokenizer = pickle.load(f)

# Get the vocab size
vocab_size = len(r_tokenizer)
print("Vocab size: ", vocab_size)

# Build the decoder and load the pretrained weights
model = Transformer(vocab_size, 768, 8, 5000, 18, 1024, False, 8, device=device)
model.load_state_dict(torch.load(os.path.join(artifact_folder, 'pytorch_model_140.bin'), map_location=device))
model.eval()

# FLAN-T5 tokenizer for encoding the text prompt
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

src = "A melodic electronic song with ambient elements, featuring piano, acoustic guitar, alto saxophone, string ensemble, and electric bass. Set in G minor with a 4/4 time signature, it moves at a lively Presto tempo. The composition evokes a blend of relaxation and darkness, with hints of happiness and a meditative quality."
inputs = tokenizer(src, return_tensors='pt', padding=True, truncation=True)
input_ids = nn.utils.rnn.pad_sequence(inputs.input_ids, batch_first=True, padding_value=0).to(device)
attention_mask = nn.utils.rnn.pad_sequence(inputs.attention_mask, batch_first=True, padding_value=0).to(device)

# Autoregressively generate REMI tokens, then decode them to MIDI
output = model.generate(input_ids, attention_mask, max_len=2000, temperature=1.0)
output_list = output[0].tolist()
generated_midi = r_tokenizer.decode(output_list)
generated_midi.dump_midi("output.mid")

# post_processing is a helper from the text2midi repository
post_processing("output.mid", "output.mid")
```
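
This writes the generated piece to `output.mid`. The snippet assumes the pretrained checkpoint (`pytorch_model_140.bin`) and the REMI vocabulary (`vocab_remi.pkl`) sit in a local `artifacts` folder, and that `post_processing` is imported from the repository's utilities.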

## Installation
```bash
git clone https://github.com/AMAAI-Lab/text2midi
cd text2midi
pip install -r requirements.txt
```

## Datasets
The MidiCaps dataset is a large-scale dataset of 168k MIDI files paired with rich text captions. These captions contain musical attributes such as key, tempo, style, and mood, making it ideal for text-to-MIDI generation tasks.
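
For a quick look at the data, here is a minimal sketch using the Hugging Face `datasets` library; the `caption` column name is an assumption about the dataset schema, so verify it against `ds.column_names` first:

```python
from datasets import load_dataset

# Download the MidiCaps metadata from the Hugging Face Hub
ds = load_dataset("amaai-lab/MidiCaps", split="train")

# Column names are assumptions; print them before relying on any field
print(ds.column_names)
print(ds[0]["caption"])  # hypothetical field holding the text caption
```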

## Results of the Listening Study

Each question is rated on a Likert scale from 1 (very bad) to 7 (very good). The table shows the average ratings per question for each group of participants.

| **Question**     | **General Audience (MidiCaps)** | **General Audience (text2midi)** | **Music Experts (MidiCaps)** | **Music Experts (text2midi)** |
|------------------|---------------------------------|----------------------------------|------------------------------|-------------------------------|
| Overall matching | 5.17                            | 4.12                             | 5.29                         | 4.05                          |
| Genre matching   | 5.22                            | 4.29                             | 5.31                         | 4.29                          |
| Mood matching    | 5.24                            | 4.10                             | 5.44                         | 4.26                          |
| Key matching     | 4.72                            | 4.24                             | 4.63                         | 4.05                          |
| Chord matching   | 4.65                            | 4.23                             | 4.05                         | 4.06                          |
| Tempo matching   | 4.72                            | 4.48                             | 5.15                         | 4.90                          |

## Objective Evaluations

| Metric    | text2midi | MidiCaps | MuseCoco |
|-----------|-----------|----------|----------|
| CR ↑      | 2.156     | 3.4326   | 2.1288   |
| CLAP ↑    | 0.2204    | 0.2593   | 0.2158   |
| TB (%) ↑  | 34.03     | -        | 21.71    |
| TBT (%) ↑ | 66.9      | -        | 54.63    |
| CK (%) ↑  | 15.36     | -        | 13.70    |
| CKD (%) ↑ | 15.80     | -        | 14.59    |

**Note**:
- CR = Compression ratio
- CLAP = CLAP score
- TB = Tempo Bin
- TBT = Tempo Bin with Tolerance
- CK = Correct Key
- CKD = Correct Key with Duplicates
- ↑ = Higher score is better.
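
As a rough illustration of the two tempo metrics, the sketch below checks whether a generated tempo lands in the same bin as the caption tempo (TB), or within one neighboring bin (TBT). The bin edges are hypothetical placeholders, not the binning used in the paper's evaluation:

```python
import bisect

# Hypothetical tempo bin edges in BPM; the evaluation's exact binning may differ.
TEMPO_BIN_EDGES = [40, 60, 70, 90, 110, 140, 160, 210]

def tempo_bin(bpm: float) -> int:
    """Return the index of the bin that a BPM value falls into."""
    return bisect.bisect_right(TEMPO_BIN_EDGES, bpm)

def tempo_match(pred_bpm: float, ref_bpm: float, tolerance: int = 0) -> bool:
    """TB uses tolerance=0; TBT allows adjacent bins with tolerance=1."""
    return abs(tempo_bin(pred_bpm) - tempo_bin(ref_bpm)) <= tolerance

print(tempo_match(118, 125))               # TB: same bin?
print(tempo_match(118, 125, tolerance=1))  # TBT: same or adjacent bin?
```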

## Training
To train text2midi, we recommend using `accelerate` for multi-GPU support. First, configure `accelerate` by running:
```bash
accelerate config
```

Then, use the following command to start training:
```bash
accelerate launch train.py \
  --encoder_model="google/flan-t5-large" \
  --decoder_model="configs/transformer_decoder_config.json" \
  --dataset_name="amaai-lab/MidiCaps" \
  --pretrain_dataset="amaai-lab/SymphonyNet" \
  --batch_size=16 \
  --learning_rate=1e-4 \
  --epochs=40
```

## Citation
If you use text2midi in your research, please cite:
```bibtex
@misc{bhandari2025text2midi,
      title={text2midi: Generating Symbolic Music from Captions},
      author={Keshav Bhandari and Abhinaba Roy and Kyra Wang and Geeta Puri and Simon Colton and Dorien Herremans},
      year={2025},
      eprint={2311.08355},
      archivePrefix={arXiv},
}
```