Clarification on Accepted Audio File Formats for MERT-v1-95M Model

#1
by dedai - opened

I am currently exploring the MERT-v1-95M model for conducting multiple music understanding tasks. I appreciate the work done on this model and its potential applications in music information retrieval.

However, I am encountering some difficulties with the audio file input. Specifically, I am unsure about the accepted audio file formats for this model. I have tried inputting files in various formats including WAV, AAC, and MP3, but I have not been successful in getting the model to process these files. Interestingly, an AAC file that was recorded on my phone did work, but I would like to test the model with a wider range of audio files, including those in WAV format.

The 'Record from Microphone' feature is functioning correctly, but I am having trouble with the 'Add music audio file' feature. When I attempt to upload an audio file, I receive an error message, but the message does not provide specific details about the nature of the error. This has made it difficult for me to diagnose the issue and determine the appropriate steps to resolve it.

Could you please provide some guidance on this matter? Specifically, I would like to know:

  • What audio file formats are accepted by the MERT-v1-95M model?
  • Are there any specific requirements or constraints for the audio files (e.g., bit rate, sample rate, number of channels)?
  • Could the error message be made more informative to help users diagnose issues with their audio files?

I believe this information would be beneficial not only for me but also for other users of this model. Thank you in advance for your assistance.

Best regards,

Elliott
elliott@iamdedeye.com

Multimodal Art Projection org

Hi Elliott,

Thanks for trying out our demo and the suggestions.

Theoretically, all audio formats supported by torchaudio.load() can be used in the demo. These include, but are not limited to, WAV, AMB, MP3, and FLAC.
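As a rough illustration of how a file could be checked locally before uploading, here is a minimal sketch that wraps torchaudio.load() and reports a readable reason when loading fails. The helper name load_audio_or_explain is hypothetical and is not part of the demo's code. Which formats actually decode depends on the audio backend installed alongside torchaudio.

```python
from pathlib import Path


def load_audio_or_explain(path):
    """Return (waveform, sample_rate), or raise ValueError with a readable reason.

    Hypothetical helper for local debugging; not taken from the demo.
    """
    p = Path(path)
    if not p.is_file():
        raise ValueError(f"{p}: file not found")
    if p.stat().st_size == 0:
        raise ValueError(f"{p}: file is empty")

    # Imported lazily so the basic checks above run even without torchaudio.
    import torchaudio

    try:
        waveform, sample_rate = torchaudio.load(str(p))
    except Exception as exc:
        # Surface the underlying decoder error instead of a generic failure.
        raise ValueError(
            f"{p}: could not decode audio ({type(exc).__name__}: {exc})"
        ) from exc
    return waveform, sample_rate
```

Calling load_audio_or_explain("clip.wav") on a file that fails in the demo should at least show whether the problem is the file itself or something specific to the hosted Space.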

Due to the hardware limitations of the machine hosting our demo (2 CPUs and 16 GB RAM), uploading long audio files may produce an error.

Unfortunately, we cannot fix this quickly, since our team consists entirely of volunteer researchers.

We recommend testing with audio clips shorter than 30 seconds, or using the live mode, if you are trying the Music Descriptor demo hosted online on Hugging Face Spaces.
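For anyone who wants to cut a longer recording down to the recommended length before uploading, here is a minimal sketch using only Python's standard wave module. It assumes an uncompressed PCM WAV file; other formats (MP3, AAC, FLAC) would need a different tool. The function name trim_wav and the file names are hypothetical.

```python
import wave


def trim_wav(src_path, dst_path, max_seconds=30.0):
    """Copy at most the first `max_seconds` of a PCM WAV file to `dst_path`.

    Returns the number of audio frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        max_frames = int(max_seconds * src.getframerate())
        frames = src.readframes(min(max_frames, src.getnframes()))

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width and sample rate from the source.
        # The frame count in the header is corrected when the data is written.
        dst.setparams(params)
        dst.writeframes(frames)

    return len(frames) // (params.sampwidth * params.nchannels)
```

For example, trim_wav("song.wav", "song_30s.wav") writes a copy that contains no more than the first 30 seconds of the input.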

We expect to resolve this issue in the future by applying for more community-supported GPU resources or by using a different audio encoding strategy.

For now, if you want to run the demo directly on longer audio, you can:

  • Clone this Space with git clone https://huggingface.co/spaces/m-a-p/Music-Descriptor and deploy the demo on your own, more powerful machine, following the official instructions. The code automatically uses a GPU for inference if torch.cuda.is_available() detects one.
  • Develop your own application with the MERT models, if you have machine learning experience.
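The GPU check mentioned in the first option can be sketched roughly as follows. This is only an illustration of the torch.cuda.is_available() pattern, not the Space's actual code; the helper name pick_device is hypothetical, and the fallback to "cpu" when PyTorch is not installed is an assumption added so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to run on a GPU anyway.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Inference would run on: {pick_device()}")
```

In a local deployment, the returned string would typically be passed to .to(device) on both the model and the input tensors.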

(And yes, I've added this to the README. Thanks again for the suggestions.)

Dear Yizhilll,

Thank you for your prompt and detailed response. I appreciate the clarification on the audio formats supported by the MERT-v1-95M model and the constraints due to hardware limitations.

I have followed your recommendation and tested with audio files less than 30 seconds in length, but I am still encountering the same error. The error occurs regardless of the audio file format (WAV, AAC, MP3, etc.) and its duration.

I understand that the team is composed of volunteer researchers and that certain issues may take time to resolve. I appreciate the work you are doing and the challenges involved in maintaining such a project.

In the meantime, I will consider the options you suggested, such as cloning the space and deploying the demo on my own machine, or developing my own application with the MERT models.

I have been following your work for a while now and I am very interested in the potential applications of the MERT model. I have a particular application in mind that I believe could benefit greatly from this model. I would appreciate it if you could reach out to me directly at elliott@iamdedeye.com to discuss this further.

Again, I appreciate your assistance and look forward to any updates or improvements to the Music Descriptor demo.

Best regards,

Elliott
