Could you share your inference setting?
I used the "tango-full-ft-audiocaps" checkpoint for inference and got poor evaluation scores (FD 28, FAD 2.3, KL 2.0, etc.). My inference settings are "num_steps 200, guide 3.0, no seed".
Could you use this script instead? https://github.com/declare-lab/tango/blob/master/inference_hf.py
The sampling rate should be 16 kHz.
Yes, I just used that script, but the results are still not good. Should the ground truth files be resampled from 32 kHz to 16 kHz?
Yes! You need to resample everything to 16 kHz. See this issue on GitHub: https://github.com/declare-lab/tango/issues/28
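For anyone landing here, a minimal sketch of one way to resample a folder of reference files to 16 kHz. It assumes `ffmpeg` is installed and on your PATH; the helper names `build_ffmpeg_cmd` and `resample_dir` and the folder layout are hypothetical, not part of the tango repo.

```python
import subprocess
from pathlib import Path

TARGET_SR = 16000  # sampling rate stated in this thread


def build_ffmpeg_cmd(src, dst, sample_rate=TARGET_SR):
    """Return an ffmpeg command that resamples src and writes it to dst."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it already exists
        "-i", str(src),           # input file
        "-ar", str(sample_rate),  # output sampling rate in Hz
        str(dst),
    ]


def resample_dir(src_dir, dst_dir, sample_rate=TARGET_SR):
    """Resample every .wav in src_dir into dst_dir (hypothetical layout)."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.wav")):
        cmd = build_ffmpeg_cmd(src, dst_dir / src.name, sample_rate)
        subprocess.run(cmd, check=True)  # raises if ffmpeg fails
```

Calling `resample_dir("reference_32k", "reference_16k")` (placeholder folder names) would write 16 kHz copies next to the originals without touching them.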
Thanks! After resampling the reference files to 16 kHz, I got a better FD (19.5) and KL_softmax (1.148), but FAD got worse, going from 2.3 to 51.021. Could you give some advice on how to fix it?
I just changed the reference audio encoding from pcm_f32le to pcm_s16le so that it matches the original 32 kHz reference audio. The FAD decreased from 54 to 2.7, but it is still not very good. orz
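Since the encoding mismatch made such a difference, it may be worth checking that the reference and generated files agree on sampling rate and sample format before computing the metrics. Below is a rough sketch that reads the `fmt ` chunk of a WAV header using only the standard library. The function name `wav_format` is hypothetical, and it assumes plain RIFF/WAVE files with a `fmt ` chunk of at least 16 bytes.

```python
import struct

# Common WAVE format tags: 1 = integer PCM, 3 = IEEE float, 0xFFFE = extensible
FORMAT_NAMES = {1: "pcm", 3: "float", 0xFFFE: "extensible"}


def wav_format(path):
    """Return (format_name, channels, sample_rate, bits_per_sample)."""
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError(f"{path} is not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError(f"{path} has no fmt chunk")
            chunk_id, size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                tag, channels, rate, _, _, bits = struct.unpack(
                    "<HHIIHH", f.read(16)
                )
                return FORMAT_NAMES.get(tag, hex(tag)), channels, rate, bits
            f.seek(size + (size & 1), 1)  # skip chunk; chunks are padded to even length
```

If a reference file reports something like `("float", 1, 16000, 32)` while a generated file reports `("pcm", 1, 16000, 16)`, the two sets are encoded differently, which is the kind of mismatch described above.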