Could you share your inference setting?

by AowoA - opened Sep 18, 2023

Sep 18, 2023

I used the "tango-full-ft-audiocaps" checkpoint for inference and got bad evaluation scores (FD 28, FAD 2.3, KL 2.0, etc.). My inference settings are "num_steps 200, guide 3.0, no seed".

soujanyaporia

Deep Cognition and Language Research (DeCLaRe) Lab org Sep 20, 2023

Could you use this script instead? https://github.com/declare-lab/tango/blob/master/inference_hf.py
The sampling rate should be 16 kHz.

AowoA

Sep 21, 2023

Yes, I just used that script, but the results are still not good. Should the ground truth files be resampled from 32 kHz to 16 kHz?

soujanyaporia

Deep Cognition and Language Research (DeCLaRe) Lab org Sep 21, 2023

Yes! You need to resample everything to 16 kHz. See this issue on GitHub: https://github.com/declare-lab/tango/issues/28
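
For readers following along, here is a minimal sketch of what resampling a 32 kHz reference clip to 16 kHz could look like. It assumes NumPy and SciPy are installed and uses a synthetic tone in place of a real reference file; the helper name `resample_audio` is hypothetical, and this is not necessarily the method the Tango authors or the linked issue use.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def resample_audio(samples: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a 1-D waveform from orig_sr to target_sr (hypothetical helper)."""
    if orig_sr == target_sr:
        return samples
    # Reduce the rate ratio to the smallest integer up/down factors.
    divisor = gcd(orig_sr, target_sr)
    up, down = target_sr // divisor, orig_sr // divisor
    # resample_poly low-pass filters the signal while changing the rate,
    # which avoids the aliasing that naive sample dropping would cause.
    return resample_poly(samples, up, down)


# One second of a 440 Hz tone at 32 kHz, standing in for a reference clip.
t = np.arange(32000) / 32000
clip_32k = np.sin(2 * np.pi * 440 * t).astype(np.float32)
clip_16k = resample_audio(clip_32k, orig_sr=32000)
print(len(clip_32k), len(clip_16k))
```

Whatever tool is used, the same resampling should be applied to both the reference and the generated audio so that the metrics compare like with like.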

AowoA

Sep 26, 2023

Thanks! After resampling the reference files to 16 kHz, I got a better FD (19.5) and KL_softmax (1.148), but FAD got worse, going from 2.3 to 51.021. Could you give some advice on how to fix it?

AowoA

Sep 27, 2023

I just changed the encoding of the resampled reference audio from pcm_f32le to pcm_s16le, so that it matches the original 32 kHz reference audio. The FAD decreased from 54 to 2.7, but that is still not very good. orz
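
As an illustration of the encoding change described above, the following sketch converts floating-point samples to signed 16-bit little-endian PCM and writes them as a 16 kHz mono WAV using only the Python standard library. The function names and the scaling by 32767 are assumptions made for this example; the tool actually used in the post is not stated and may scale or dither differently.

```python
import io
import math
import struct
import wave


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range values before scaling
        ints.append(int(round(x * 32767)))  # assumed scale factor
    return struct.pack("<%dh" % len(ints), *ints)


def write_wav_pcm16(target, samples, sample_rate=16000):
    """Write mono 16-bit PCM audio to a path or a binary file-like object."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(float_to_pcm16(samples))


# One second of a 440 Hz tone at 16 kHz, written to an in-memory buffer.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
buffer = io.BytesIO()
write_wav_pcm16(buffer, tone)

# Read the header back to confirm the sample width and rate.
with wave.open(io.BytesIO(buffer.getvalue()), "rb") as wav:
    print(wav.getsampwidth(), wav.getframerate(), wav.getnframes())
```

Checking the sample width and rate of both the reference and the generated files this way is a quick sanity test before running the evaluation.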
