Get audio embeddings

#6 opened by epinnock

Is it possible to use this to generate audio embeddings? If so, is there any good documentation on this?

Yes - you can extract hidden states from the model by passing output_hidden_states=True to the forward call, see https://huggingface.co/docs/transformers/main/en/model_doc/musicgen#transformers.MusicgenForConditionalGeneration.forward
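
For example, with the forward call directly - a rough sketch based on the forward docstring example, where the dummy decoder_input_ids are just one pad token per codebook:

import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
)

# The forward pass needs decoder inputs - here a single pad token per codebook,
# as in the docs example
pad_token_id = model.generation_config.pad_token_id
decoder_input_ids = (
    torch.ones((inputs.input_ids.shape[0] * model.decoder.num_codebooks, 1), dtype=torch.long)
    * pad_token_id
)

with torch.no_grad():
    outputs = model(**inputs, decoder_input_ids=decoder_input_ids, output_hidden_states=True)

# Text-encoder (T5) hidden states: tuple with one tensor per layer
print(outputs.encoder_hidden_states[-1].shape)
# Decoder (MusicGen LM) hidden states: tuple with one tensor per layer
print(outputs.decoder_hidden_states[-1].shape)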

Or alternatively, pass it to the generate method:

from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

generated_outputs = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256,
    output_hidden_states=True,
    return_dict_in_generate=True,
)
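
The hidden states are then returned on the generate output (assuming the standard generate output fields for encoder-decoder models): generated_outputs.encoder_hidden_states should hold the text-encoder layers, and generated_outputs.decoder_hidden_states should be a tuple over generation steps, each itself a tuple with one tensor per decoder layer. For example:

# Last decoder layer at the last generation step (assumes the standard nesting
# of the generate outputs: steps -> layers -> tensors)
last_hidden = generated_outputs.decoder_hidden_states[-1][-1]
print(last_hidden.shape)

Note that for MusicGen the sequences field of the generate output should contain the decoded audio values rather than token ids.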

Which embeddings in particular are you interested in? The audio codes from EnCodec, or the hidden states from MusicGen?
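
If it's the EnCodec codes you're after, you can also pass an audio clip through the model's audio encoder. A minimal sketch - here waveform is a placeholder for a 1-D NumPy array at the EnCodec sampling rate (32 kHz for MusicGen):

import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# `waveform`: placeholder for your audio, e.g. loaded with soundfile/librosa at 32 kHz
audio_inputs = processor(
    audio=waveform,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
)

with torch.no_grad():
    encoded = model.audio_encoder.encode(audio_inputs["input_values"])

# Discrete EnCodec audio codes (shape roughly: chunks x batch x codebooks x frames)
print(encoded.audio_codes.shape)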
