Confusion about the use of the Encodec model

#4
by xtluo - opened

In your published paper, the Encodec model is used as the final acoustic teacher, but the pseudocode is:

y_VQ = embedding(x_acoustic_labels) 
z = MERT(x_noised)
loss_acoustic = Cross_Entropy(z[mask_idx], y_VQ[mask_idx])

So, my questions are:

  1. How is the cross-entropy calculated between z and the embedding y_VQ?
  2. I couldn't find any Encodec-related information in your open-sourced code.
Reply from the Multimodal Art Projection org:
  1. You may refer to our open-sourced code: https://github.com/yizhilll/MERT

  2. The codec codes (discrete tokens) are pre-extracted from the audio and used directly as labels.
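
To make that concrete, below is a minimal sketch of what "pre-extracted codes used directly as labels" can look like: the targets are the discrete Encodec codebook indices, and the cross-entropy is a standard classification loss between logits over each codebook and those indices. This is only one possible reading of the pseudocode; the file name, hidden size, projection head, and masking are illustrative assumptions, not taken from the MERT repository.

```python
# Minimal sketch of "pre-extract codec codes, use them directly as labels".
# Assumes the open-source `encodec` package; everything MERT-specific below
# (hidden size, projection head, masking) is an illustrative stand-in.
import torch
import torch.nn.functional as F
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# 1) Offline step: extract discrete codebook indices from the raw audio.
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)                      # 8 residual codebooks at 6 kbps

wav, sr = torchaudio.load("example.wav")             # hypothetical input file
wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)
with torch.no_grad():
    frames = codec.encode(wav.unsqueeze(0))          # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)    # (1, n_q, T) integer indices

# 2) Training step: masked frames are classified against those indices, so the
#    cross-entropy is over the codebook vocabulary rather than between embeddings.
n_q, vocab = codes.shape[1], 1024                    # Encodec uses 1024 entries per codebook
T, hidden = codes.shape[-1], 768                     # assumed transformer width
z = torch.randn(1, T, hidden)                        # stand-in for the model's frame outputs
mask_idx = torch.randint(0, T, (100,))               # stand-in masked positions

head = torch.nn.Linear(hidden, n_q * vocab)          # one classifier per codebook
logits = head(z[:, mask_idx]).reshape(-1, n_q, vocab)  # (n_masked, n_q, vocab)
targets = codes[0].transpose(0, 1)[mask_idx]           # (n_masked, n_q)

loss_acoustic = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
```

Under that reading, y_VQ in the pseudocode refers to the quantizer's discrete codes, which is why an ordinary classification cross-entropy applies.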
