keshavbhandari committed on
Commit 7bcc411
1 Parent(s): 34d5393

Updated readme

Files changed (2):
  1. README.md (+101 -2)
  2. text2midi_architecture.jpg (+0 -0)

README.md CHANGED
---
license: apache-2.0
datasets:
- amaai-lab/MidiCaps
tags:
- music
- text-to-music
- symbolic-music
pipeline_tag: text-to-music
---

<div align="center">

# text2midi: Generating Symbolic Music from Captions

[Demo](https://huggingface.co/spaces/amaai-lab/text2midi) | [Model](https://huggingface.co/amaai-lab/text2midi) | [Website and Examples](https://github.com/AMAAI-Lab/text2midi) | [Paper](https://arxiv.org/abs/2311.08355) | [Dataset](https://huggingface.co/datasets/amaai-lab/MidiCaps)

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/amaai-lab/text2midi)
</div>

**text2midi** is the first end-to-end model for generating MIDI files from textual descriptions. By leveraging pretrained large language models and a powerful autoregressive transformer decoder, **text2midi** allows users to create symbolic music that aligns with detailed textual prompts, including musical attributes like chords, tempo, and style.

🔥 Live demo available on [Hugging Face Spaces](https://huggingface.co/spaces/amaai-lab/text2midi).

<div align="center">
  <img src="text2midi_architecture.jpg" width="500"/>
</div>

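The figure above shows the overall design: a pretrained FLAN-T5 encoder embeds the caption, and an autoregressive transformer decoder generates REMI music tokens conditioned on those embeddings through cross-attention. The snippet below is only a rough illustration of that wiring using generic PyTorch modules; the class name, layer sizes, and token ids are made up for the example and do not reproduce the released implementation (the actual model class appears in the Quickstart below).

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel, T5Tokenizer

# Illustrative sizes only -- the released checkpoint defines its own REMI vocabulary,
# width, and depth; these values are placeholders for the sketch.
REMI_VOCAB_SIZE = 30_000   # hypothetical placeholder
D_MODEL, N_HEADS, N_LAYERS, FFN_DIM = 768, 8, 4, 1024
BOS_ID = 0                 # hypothetical start-of-sequence token id


class CaptionToMidiSketch(nn.Module):
    """Pretrained FLAN-T5 text encoder + autoregressive transformer decoder over REMI tokens."""

    def __init__(self):
        super().__init__()
        self.tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
        self.encoder = T5EncoderModel.from_pretrained("google/flan-t5-base")
        self.embed = nn.Embedding(REMI_VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, N_HEADS, dim_feedforward=FFN_DIM,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=N_LAYERS)
        self.to_vocab = nn.Linear(D_MODEL, REMI_VOCAB_SIZE)

    @torch.no_grad()
    def generate(self, caption: str, max_len: int = 64, temperature: float = 1.0):
        # Encode the caption once; its hidden states condition every decoding step
        # through cross-attention in the decoder.
        enc = self.tokenizer(caption, return_tensors="pt")
        memory = self.encoder(**enc).last_hidden_state           # (1, text_len, D_MODEL)

        tokens = torch.tensor([[BOS_ID]])                        # growing REMI token sequence
        for _ in range(max_len):
            tgt = self.embed(tokens)
            causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
            hidden = self.decoder(tgt, memory, tgt_mask=causal)
            logits = self.to_vocab(hidden[:, -1]) / temperature  # next-token logits
            next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
            tokens = torch.cat([tokens, next_id], dim=1)
        return tokens[0].tolist()                                 # REMI token ids to detokenize
```
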
## Quickstart Guide

Generate symbolic music from a text prompt:

```python
import os
import pickle

import torch
import torch.nn as nn
from transformers import T5Tokenizer
from miditok import REMI, TokenizerConfig  # needed so the pickled REMI tokenizer can be loaded

from model.transformer_model import Transformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
artifact_folder = 'artifacts'

# Load the REMI tokenizer dictionary
tokenizer_filepath = os.path.join(artifact_folder, "vocab_remi.pkl")
with open(tokenizer_filepath, "rb") as f:
    r_tokenizer = pickle.load(f)

# Get the vocab size and build the model
vocab_size = len(r_tokenizer)
print("Vocab size: ", vocab_size)
model = Transformer(vocab_size, 768, 8, 5000, 18, 1024, False, 8, device=device)
model.load_state_dict(torch.load('/text2midi/artifacts/pytorch_model_140.bin', map_location=device))
model.eval()
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

# Tokenize the caption and generate a REMI token sequence
src = "A melodic electronic song with ambient elements, featuring piano, acoustic guitar, alto saxophone, string ensemble, and electric bass. Set in G minor with a 4/4 time signature, it moves at a lively Presto tempo. The composition evokes a blend of relaxation and darkness, with hints of happiness and a meditative quality."
inputs = tokenizer(src, return_tensors='pt', padding=True, truncation=True)
input_ids = nn.utils.rnn.pad_sequence(inputs.input_ids, batch_first=True, padding_value=0)
input_ids = input_ids.to(device)
attention_mask = nn.utils.rnn.pad_sequence(inputs.attention_mask, batch_first=True, padding_value=0)
attention_mask = attention_mask.to(device)
output = model.generate(input_ids, attention_mask, max_len=2000, temperature=1.0)

# Decode the generated tokens back into a MIDI file
output_list = output[0].tolist()
generated_midi = r_tokenizer.decode(output_list)
generated_midi.dump_midi("output.mid")
post_processing("output.mid", "output.mid")  # post_processing is defined in the text2midi repository
```
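
The quickstart assumes `artifacts/vocab_remi.pkl` and the checkpoint file are already on disk. If you are starting from this model repository, one way to fetch them is with `huggingface_hub`; the filenames below are taken from the quickstart and are an assumption about the repo layout, so adjust them to whatever files the repository actually hosts:

```python
from huggingface_hub import hf_hub_download

repo_id = "amaai-lab/text2midi"
# Assumed filenames (copied from the quickstart above); check the repo's file listing if these differ.
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="vocab_remi.pkl")
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model_140.bin")
print(tokenizer_path, weights_path)
```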

## Installation

```bash
git clone https://github.com/AMAAI-Lab/text2midi
cd text2midi
pip install -r requirements.txt
```

## Datasets

The MidiCaps dataset is a large-scale dataset of 168k MIDI files paired with rich text captions. These captions contain musical attributes such as key, tempo, style, and mood, making it ideal for text-to-MIDI generation tasks.

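To get a feel for the captions, you can browse the dataset directly from the Hub. The snippet below is a sketch that assumes the standard `datasets` loading API and a `train` split; the exact column names are whatever the dataset card defines:

```python
from datasets import load_dataset

# Load the MidiCaps caption/metadata records from the Hugging Face Hub.
midicaps = load_dataset("amaai-lab/MidiCaps", split="train")
print(midicaps)      # number of rows and column names
print(midicaps[0])   # one record; field names depend on the dataset card
```
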
## Subjective Evaluation by Expert Listeners

| Model | Dataset | Pre-trained | Overall Match ↑ | Chord Match ↑ | Tempo Match ↑ | Symbolic Quality ↑ | Musicality ↑ | Text Alignment ↑ |
|---|---|---|---|---|---|---|---|---|
| MuseCoco | MidiCaps | ✓ | 4.12 | 3.02 | 3.85 | 3.50 | 3.20 | 3.90 |
| text2midi | MidiCaps | ✓ | 4.85 | 4.10 | 4.62 | 4.25 | 4.45 | 4.78 |

## Training

To train text2midi, we recommend using `accelerate` for multi-GPU support. First, configure accelerate by running:

```bash
accelerate config
```

Then, use the following command to start training:

```bash
accelerate launch train.py \
  --encoder_model="google/flan-t5-large" \
  --decoder_model="configs/transformer_decoder_config.json" \
  --dataset_name="amaai-lab/MidiCaps" \
  --pretrain_dataset="amaai-lab/SymphonyNet" \
  --batch_size=16 \
  --learning_rate=1e-4 \
  --epochs=40
```

## Citation

If you use text2midi in your research, please cite:

```bibtex
@misc{bhandari2025text2midi,
  title={text2midi: Generating Symbolic Music from Captions},
  author={Keshav Bhandari and Abhinaba Roy and Kyra Wang and Geeta Puri and Simon Colton and Dorien Herremans},
  year={2025},
  eprint={2311.08355},
  archivePrefix={arXiv},
}
```

text2midi_architecture.jpg ADDED