This new version has been crafted with transparency in mind,
so you can understand the process of translating an image to a musical equivalent.
How does it work under the hood? 🤔
First, we get a very literal caption from microsoft/kosmos-2-patch14-224; this caption is then given to an LLM agent (currently HuggingFaceH4/zephyr-7b-beta), whose task is to translate the image caption into a musical and inspirational prompt for the next step.
Once we have a nice musical text from the LLM, we can send it to the text-to-music model of your choice:
MAGNet, MusicGen, AudioLDM-2, Riffusion or Mustango
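
To make the flow above concrete, here is a minimal Python sketch of the three steps (caption, LLM prompt, music). It is not the actual code of this Space: `ImageToMusicPipeline`, its three callables and the stub functions are hypothetical names, and the real model calls are replaced by placeholders you would swap for your own.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Model names as listed above; how each one is called is left to the caller.
MUSIC_MODELS = ("MAGNet", "MusicGen", "AudioLDM-2", "Riffusion", "Mustango")


@dataclass
class ImageToMusicPipeline:
    """Hypothetical wrapper chaining the three steps described in the text."""

    caption_image: Callable[[str], str]        # image path -> literal caption
    write_music_prompt: Callable[[str], str]   # caption -> musical prompt
    generate_music: Callable[[str, str], str]  # (model name, prompt) -> audio path

    def run(self, image_path: str, model: str) -> Dict[str, str]:
        if model not in MUSIC_MODELS:
            raise ValueError(f"Unknown music model: {model!r}")
        caption = self.caption_image(image_path)
        prompt = self.write_music_prompt(caption)
        audio = self.generate_music(model, prompt)
        # Every intermediate result is returned, so each step stays visible.
        return {"caption": caption, "prompt": prompt, "audio": audio}


# Stand-in stubs (assumptions): replace them with real model calls.
def fake_caption(image_path: str) -> str:
    return f"literal caption for {image_path}"


def fake_llm(caption: str) -> str:
    return f"musical prompt inspired by: {caption}"


def fake_music(model: str, prompt: str) -> str:
    return f"{model.lower()}_output.wav"


pipeline = ImageToMusicPipeline(fake_caption, fake_llm, fake_music)
result = pipeline.run("photo.jpg", "MusicGen")
```

Returning the caption and the prompt along with the audio keeps every intermediate step inspectable, which is the point of this version.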
Unlike the previous version of Image to Music, which used the Mubert API and could output curious and obscure combinations, this one only uses open-source models available on the Hub, called via the Gradio API.
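
For the curious, calling a model hosted on the Hub through the Gradio API can look roughly like the sketch below, using the `gradio_client` package. The helper `call_space`, the Space ID and the `/predict` endpoint name are assumptions for illustration; each Space exposes its own endpoints and parameters, so check its "Use via API" page first.

```python
from typing import Any, Callable, Optional


def call_space(
    space_id: str,
    prompt: str,
    api_name: str = "/predict",  # assumption: the real endpoint name varies per Space
    client_factory: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Send a text prompt to a Gradio Space and return whatever it answers."""
    if client_factory is None:
        # Imported lazily so the helper can be tried out with a fake client.
        from gradio_client import Client  # pip install gradio_client

        client_factory = Client
    client = client_factory(space_id)
    return client.predict(prompt, api_name=api_name)


# Hypothetical usage (placeholder Space ID, needs network access):
# audio = call_space("some-user/some-text-to-music-space", "dreamy lo-fi piano, slow tempo")
```

The `client_factory` argument is only there so the helper can be exercised offline; in normal use you would leave it out.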
Also, I expect the resulting music to match the atmosphere of the input image more closely, thanks to the LLM agent step.
Pro tip: you can adjust the inspirational prompt to match your expectations, depending on the model you chose and its specific behavior 👌
Try it, explore different models and tell me which one is your favorite 🤗