	---
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- audio2text
	- music2text
	- musicllm
	- music foundation model
	license: cc
	---
	<img src="https://huggingface.co/UniMus/OpenJMLA/resolve/main/UniMus_logo_0.png" alt="drawing" width="256"/>

	# UniMus Project: OpenJMLA


	<br>
	<p>
	&nbsp;<a href="https://arxiv.org/pdf/2310.10159.pdf">Reimplementation of JMLA</a>
	</p>
	<br>

	Music tagging is the task of predicting tags for music recordings.
	However, previous music tagging research primarily focuses on closed-set tagging tasks, which cannot be generalized to new tags.
	In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem.
	The JMLA model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by Falcon7B.
	We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings.
	We introduce dense attention connections between encoder and decoder layers to improve the information flow between them.
	We collect a large-scale music and description dataset from the internet.
	We propose to use ChatGPT to convert the raw descriptions into formalized and diverse descriptions to train the JMLA models.
	Our proposed JMLA system achieves a zero-shot audio tagging accuracy of 64.82% on the GTZAN dataset, outperforming previous zero-shot systems, and achieves results comparable to previous systems on the FMA and MagnaTagATune datasets.
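
The idea behind the resampler mentioned above can be illustrated with a minimal sketch: a fixed number of latent query vectors cross-attend to however many audio frames a clip has, so the output length depends only on the number of latents. This is not the OpenJMLA implementation. It is a single attention step in plain Python with no learned projections, and the names `resample`, `DIM`, and `NUM_LATENTS` as well as all sizes are assumptions chosen for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def resample(frames, latents):
    """Cross-attend a fixed set of latent queries to a variable-length input.

    frames:  T x D list of frame embeddings (T may differ per clip).
    latents: M x D list of query vectors (learned in a real model).
    Returns an M x D list, so the output length never depends on T.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in latents:
        # Scaled dot-product score between this query and every frame.
        scores = [scale * sum(q * f for q, f in zip(query, frame)) for frame in frames]
        weights = softmax(scores)
        # Attention-weighted average of the frames, one value per feature.
        output.append(
            [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
        )
    return output


rng = random.Random(0)
DIM, NUM_LATENTS = 8, 4  # illustrative sizes, not the model's real ones
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

# Two clips of different lengths, filled with random stand-in embeddings.
short_clip = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]
long_clip = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(300)]

short_out = resample(short_clip, latents)
long_out = resample(long_clip, latents)
print(len(short_out), len(short_out[0]))
print(len(long_out), len(long_out[0]))
```

Both clips come out with the same shape even though their frame counts differ, which is what lets a language decoder consume audio of any duration through a fixed-size interface.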


	## Requirements
	* conda create --name SpectPrompt python=3.9
	* pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
	* pip install transformers datasets librosa einops_exts einops mmcls peft ipdb torchlibrosa
	* pip install -U openmim
	* mim install mmcv==1.7.1
	<br>
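
After installing, a quick check like the following sketch can confirm that the packages are importable in the active environment. It assumes that each package's import name matches its pip name, which may not hold in every version, and the helper `missing_modules` is a hypothetical name introduced here.

```python
from importlib.util import find_spec

# Import names assumed to match the pip package names listed above.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "transformers", "datasets",
    "librosa", "einops", "einops_exts", "mmcls", "mmcv", "peft", "torchlibrosa",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the modules can be located. It does not verify versions such as `mmcv==1.7.1`.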

	## Quickstart
	Below, we provide simple examples to show how to use OpenJMLA with 🤗 Transformers.

	#### 🤗 Transformers

	To run inference with OpenJMLA, you only need a few lines of code, as demonstrated below.

	```python
	from transformers import AutoModel, AutoTokenizer
	import torch
	import numpy as np

	model = AutoModel.from_pretrained('UniMus/OpenJMLA', trust_remote_code=True)
	device = model.device
	# the input audio is expected at a 16 kHz sample rate
	music_path = '/path/to/music.wav'
	# 1. get logmelspectrogram
	# get the file wav_to_mel.py from https://github.com/taugastcn/SpectPrompt.git
	from wav_to_mel import wav_to_mel
	lms = wav_to_mel(music_path)

	import os
	from torch.nn.utils.rnn import pad_sequence
	import random
	# get the file transforms.py from https://github.com/taugastcn/SpectPrompt.git
	from transforms import Normalize, SpecRandomCrop, SpecPadding, SpecRepeat
	transforms = [
	    Normalize(-4.5, 4.5),
	    SpecRandomCrop(target_len=2992),
	    SpecPadding(target_len=2992),
	    SpecRepeat(),
	]
	lms = lms.numpy()
	# apply each transform in order
	for trans in transforms:
	    lms = trans(lms)

	# 2. template of input
	input_dic = dict()
	input_dic['filenames'] = [music_path.split('/')[-1]]
	input_dic['ans_crds'] = [0]
	input_dic['audio_crds'] = [0]
	input_dic['attention_mask'] = torch.tensor([[1, 1, 1, 1, 1]]).to(device)
	input_dic['input_ids'] = torch.tensor([[1, 694, 5777, 683, 13]]).to(device)
	input_dic['spectrogram'] = torch.from_numpy(lms).unsqueeze(dim=0).to(device)
	# 3. generation
	model.eval()
	gen_ids = model.forward_test(input_dic)
	gen_text = model.neck.tokenizer.batch_decode(gen_ids.clip(0))
	# 4. Post-processing
	# Given that the training data may contain biases, the generated texts might need some straightforward post-processing to ensure accuracy.
	# In future versions, we will enhance the quality of the data.
	# batch_decode returns a list of strings; take the single item of this batch
	gen_text = gen_text[0].split('<s>')[-1].split('\n')[0].strip()
	gen_text = gen_text.replace(' in Chinese','')
	gen_text = gen_text.replace(' Chinese','')
	print(gen_text)
	```
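
The post-processing in step 4 can also be wrapped in a small helper so that it applies to every item of a decoded batch, since `batch_decode` returns a list of strings. The sketch below mirrors the same string operations; the function names `clean_caption` and `clean_batch` and the sample input are hypothetical, not part of OpenJMLA.

```python
def clean_caption(decoded):
    """Apply the step-4 cleanup to one decoded string."""
    # Keep the text after the last '<s>' marker, up to the first newline.
    text = decoded.split('<s>')[-1].split('\n')[0].strip()
    # Remove the phrases that step 4 strips from the output.
    for phrase in (' in Chinese', ' Chinese'):
        text = text.replace(phrase, '')
    return text


def clean_batch(decoded_batch):
    """Clean every decoded string in a batch."""
    return [clean_caption(item) for item in decoded_batch]


# Made-up decoder output, for illustration only.
sample = ['<s> Genre: pop in Chinese\nextra line']
print(clean_batch(sample))  # ['Genre: pop']
```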

	## Example

	### music:
	https://www.youtube.com/watch?v=Q_yuO8UNGmY

	### caption:
	Instruments: Vocals, piano, strings
	Genre: pop
	Theme: Heartbreak.
	Mood: Melancholy.
	Era: Contemporary.
	Tempo: Fast
	Best scene: A small, dimly lit bar. The melancholy mood of this song will complement the stage-inspired melody.
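
A caption in this `Key: value` layout is easy to turn into structured tags. The following is a minimal sketch that assumes one field per line, as in the example above; generated captions may not always follow that layout, and `parse_caption` is a hypothetical helper.

```python
def parse_caption(caption):
    """Split 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in caption.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(':')
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


caption = """Instruments: Vocals, piano, strings
Genre: pop
Mood: Melancholy.
Tempo: Fast"""

tags = parse_caption(caption)
instruments = tags['Instruments'].split(', ')
print(tags['Genre'])  # pop
print(instruments)    # ['Vocals', 'piano', 'strings']
```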

	## Citation
	If you find our paper and code useful in your research, please consider starring the repository and citing the paper:

	```BibTeX
	@article{JMLA,
	title={JOINT MUSIC AND LANGUAGE ATTENTION MODELS FOR ZERO-SHOT MUSIC TAGGING},
	author={Xingjian Du and Zhesong Yu and Jiaju Lin and Bilei Zhu and Qiuqiang Kong},
	journal={arXiv preprint arXiv:2310.10159},
	year={2023}
	}
	```
	<br>