---
language:
- en
license: cc-by-nc-4.0
library_name: transformers
tags:
- music
- art
---

# Model Card for MusiLingo-musicqa-v1

## Model Details

### Model Description

The model consists of a music encoder ```MERT-v1-330M```, a natural language decoder ```vicuna-7b-delta-v0```, and a linear projection layer between the two. This MusiLingo checkpoint was developed on the MusicQA dataset and can answer instructions about raw music audio, such as queries about tempo, emotion, genre, tags, or subjective impressions. You can use the MusicQA dataset for the demo below. For the implementation of MusicQA, please refer to our [GitHub repo](https://github.com/zihaod/MusiLingo/blob/main/musilingo/datasets/datasets/musicqa_dataset.py).
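For orientation, here is a minimal sketch of how such an encoder, projection layer, and decoder can be wired together. It is illustrative only: the hidden sizes, the single `nn.Linear` projection, and the `encode_audio` helper are assumptions made for exposition, not the exact MusiLingo implementation (see the GitHub repo for the real code).

```python
# Illustrative sketch only: hidden sizes and the single linear projection are
# assumptions; the actual MusiLingo modules live in the GitHub repo.
import torch
import torch.nn as nn
from transformers import AutoModel, Wav2Vec2FeatureExtractor

MERT_DIM = 1024    # MERT-v1-330M hidden size (assumed)
LLAMA_DIM = 4096   # Vicuna-7B / LLaMA-7B hidden size (assumed)

# Frozen music encoder plus the projection that maps its features into the
# decoder's embedding space.
mert = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True).eval()
mert_processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M",
                                                          trust_remote_code=True)
audio_proj = nn.Linear(MERT_DIM, LLAMA_DIM)

def encode_audio(waveform, sampling_rate=24000):
    """Map raw audio (MERT-v1 expects 24 kHz) to embeddings in the decoder's input space."""
    inputs = mert_processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = mert(**inputs).last_hidden_state   # (1, frames, MERT_DIM)
    return audio_proj(hidden)                       # (1, frames, LLAMA_DIM)
```

The projected audio embeddings are then prepended, together with the instruction prompt, to the decoder's token embeddings; this is what the `answer` helper in the demo below does via `inputs_embeds`.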
### Model Sources

- **Repository:** [GitHub repo](https://github.com/zihaod/MusiLingo)
- **Paper:** [MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response](https://arxiv.org/abs/2309.08730)

## Getting Started

```python
from tqdm.auto import tqdm

import torch
from torch.utils.data import DataLoader
from transformers import (AutoModel, StoppingCriteria, StoppingCriteriaList,
                          Wav2Vec2FeatureExtractor)

# MusicQADataset is defined in the MusiLingo repo
# (musilingo/datasets/datasets/musicqa_dataset.py, linked above).
from musilingo.datasets.datasets.musicqa_dataset import MusicQADataset


class StoppingCriteriaSub(StoppingCriteria):
    """Stop generation once one of the given token sequences has been produced."""

    def __init__(self, stops=[], encounters=1):
        super().__init__()
        self.stops = stops

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        for stop in self.stops:
            if torch.all((stop == input_ids[0][-len(stop):])).item():
                return True
        return False


def answer(self, samples, stopping, max_new_tokens=300, num_beams=1, min_length=1,
           top_p=0.5, repetition_penalty=1.0, length_penalty=1, temperature=0.1,
           max_length=2000):
    # Written in method style: pass the MusiLingo model as the first argument.
    audio = samples["audio"].cuda()
    audio_embeds, atts_audio = self.encode_audio(audio)
    if 'instruction_input' in samples:  # instruction dataset
        instruction_prompt = []
        for instruction in samples['instruction_input']:
            prompt = '<Audio><AudioHere></Audio> ' + instruction
            instruction_prompt.append(self.prompt_template.format(prompt))
        audio_embeds, atts_audio = self.instruction_prompt_wrap(
            audio_embeds, atts_audio, instruction_prompt)
    self.llama_tokenizer.padding_side = "right"
    batch_size = audio_embeds.shape[0]
    bos = torch.ones([batch_size, 1], dtype=torch.long,
                     device=torch.device('cuda')) * self.llama_tokenizer.bos_token_id
    bos_embeds = self.llama_model.model.embed_tokens(bos)
    atts_bos = atts_audio[:, :1]
    inputs_embeds = torch.cat([bos_embeds, audio_embeds], dim=1)
    attention_mask = torch.cat([atts_bos, atts_audio], dim=1)
    outputs = self.llama_model.generate(
        inputs_embeds=inputs_embeds,
        max_new_tokens=max_new_tokens,
        stopping_criteria=stopping,
        num_beams=num_beams,
        do_sample=True,
        min_length=min_length,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        length_penalty=length_penalty,
        temperature=temperature,
    )
    output_token = outputs[0]
    if output_token[0] == 0:  # the model might output an unknown token at the beginning; remove it
        output_token = output_token[1:]
    if output_token[0] == 1:  # if there is a start token at the beginning, remove it
        output_token = output_token[1:]
    output_text = self.llama_tokenizer.decode(output_token, add_special_tokens=False)
    output_text = output_text.split('###')[0]  # remove the stop sign '###'
    output_text = output_text.split('Assistant:')[-1].strip()
    return output_text


processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M",
                                                     trust_remote_code=True)

# `path` should point to the directory that contains the MusicQA data.
ds = MusicQADataset(processor, f'{path}/data/music_data', 'Eval')
dl = DataLoader(
    ds,
    batch_size=1,
    num_workers=0,
    pin_memory=True,
    shuffle=False,
    drop_last=True,
    collate_fn=ds.collater
)

# Token ids of the '###' stop string under the LLaMA tokenizer.
stopping = StoppingCriteriaList([StoppingCriteriaSub([torch.tensor([835]).cuda(),
                                                      torch.tensor([2277, 29937]).cuda()])])

model_musicqa = AutoModel.from_pretrained("m-a-p/MusiLingo-musicqa-v1",
                                          trust_remote_code=True)

for idx, sample in tqdm(enumerate(dl)):
    ans = answer(model_musicqa.model, sample, stopping, length_penalty=100, temperature=0.1)
    txt = sample['text_input'][0]
    print(txt)
    print(ans)
```

# Citing This Work

If you find the work useful for your research, please consider citing it using the following BibTeX entry:

```
@inproceedings{deng2024musilingo,
  title={MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response},
  author={Deng, Zihao and Ma, Yinghao and Liu, Yudong and Guo, Rongchen and Zhang, Ge and Chen, Wenhu and Huang, Wenhao and Benetos, Emmanouil},
  booktitle={Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024)},
  year={2024},
  organization={Association for Computational Linguistics}
}
```