The model outputs almost only female predictions

#11
by yovelcohen1 - opened

Hi all, I'm testing out the model a bit. After trying a few dozen audio segments, it seems the model almost always outputs a female prediction, and the logits look almost binary every time:

[[ 3.2411015 -3.3607018]]
[[ 3.2428334 -3.3618562]]
[[ 3.2423477 -3.3617458]]
[[ 3.2432578 -3.3629394]]

I tested it mostly on extracted segments of about 20 seconds from various movies. The audio was high quality (48,000 Hz sample rate), so I also resampled it to 16,000 Hz and segmented it into 5-second pieces, but the results consistently predict female whether the audio contains only males or a mix of both.
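For reference, the resampling and chunking step looks roughly like this (a sketch; the path is a placeholder):

from pydub import AudioSegment

# Resample a 48 kHz movie extract to the model's 16 kHz rate
# and cut it into 5-second pieces (pydub slices by milliseconds).
seg = AudioSegment.from_file("movie_extract.wav")  # placeholder path
seg = seg.set_frame_rate(16_000)
chunk_ms = 5_000
chunks = [seg[i:i + chunk_ms] for i in range(0, len(seg), chunk_ms)]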

Hello, thank you for your interest.

It is indeed unusual that the logits consistently favor female predictions in your case. One aspect to consider is the characteristics of the audio samples you are testing. Ensure that these samples do not contain excessive noise, reverb, or effects that could result from a significant distance between the speaker and the microphone. This model was trained on a clean subset of LibriSpeech, which is known for its clarity and controlled recording conditions. If your audio samples are noisy, I recommend using a denoiser or speech enhancer before running them through the model. This adjustment could help achieve more accurate predictions.

@alefiury Thanks for the tip, appreciate it. I tried cleaning the audio with the noisereduce library before running it through the model.

Now I don't get strictly female results, but the results are still pretty bad. I'm trying to figure out where it could be coming from; maybe it's how I load the data?

import numpy as np
import torch
import noisereduce as nr
from pydub import AudioSegment

# test_files is a list of pathlib.Path objects; feature_extractor and model
# are set up as in the model card.
samples, results, sample_rate, clean_audio = dict(), dict(), 16_000, True

# Load each file and resample it to the model's 16 kHz rate.
for file in test_files:
    seg = AudioSegment.from_file(str(file))
    seg = seg.set_frame_rate(sample_rate)
    samples[file.name.lower()] = np.array(seg.get_array_of_samples())

for filename, audio_input in samples.items():
    with torch.no_grad():
        # Optional spectral-gating denoise before feature extraction.
        cleaned_audio = (
            nr.reduce_noise(y=audio_input, sr=sample_rate, device='cpu', use_torch=True)
            if clean_audio
            else audio_input
        )
        inputs = feature_extractor(
            raw_speech=cleaned_audio,
            return_tensors="pt",
            sampling_rate=sample_rate,
        )
        results[filename] = model(inputs['input_values'].to(torch.double))

Out of 24 predictions (12 files, 6 female and 6 male, each run once on the cleaned audio and once on the raw audio), 19/24 ended up wrong:

{'female_1.wav': {'cleanAudio': [-3.276392655487466, 3.382957059469654],
                  'rawAudio': [3.181107905931074, -3.2946958548209757]},
 'female_2.wav': {'cleanAudio': [-3.2962496381793716, 3.4013841456865954],
                  'rawAudio': [-3.297591932268639, 3.4013036314370604]},
 'female_3.wav': {'cleanAudio': [-3.2940059382126, 3.4020490165389132],
                  'rawAudio': [-3.272081541200108, 3.3783438936112145]},
 'female_4.wav': {'cleanAudio': [-3.296302234093971, 3.4016465666453133],
                  'rawAudio': [-3.276005731086416, 3.3813277557153296]},
 'female_5.wav': {'cleanAudio': [-3.295011947765837, 3.400797250029279],
                  'rawAudio': [-3.2947238807450465, 3.400360321466281]},
 'female_6.wav': {'cleanAudio': [-3.2958203606677934, 3.403502080536814],
                  'rawAudio': [-3.285688745521438, 3.391436539778566]},
 'male_1.wav': {'cleanAudio': [-3.2938195235142347, 3.3994886687900894],
                'rawAudio': [-3.216829509466939, 3.3276760362324396]},
 'male_2.wav': {'cleanAudio': [3.2408053253693865, -3.3608319708014247],
                'rawAudio': [3.243136054686257, -3.364560018280158]},
 'male_3.wav': {'cleanAudio': [3.240225165902827, -3.3578235974318122],
                'rawAudio': [3.2465453746553092, -3.365712258937916]},
 'male_4.wav': {'cleanAudio': [2.6519421093098527, -2.7854817282895104],
                'rawAudio': [3.24838837358642, -3.364284071410005]},
 'male_5.wav': {'cleanAudio': [-3.295762335912377, 3.4021156334188074],
                'rawAudio': [3.174643379044341, -3.278296053341783]},
 'male_6.wav': {'cleanAudio': [-3.295762335912377, 3.4021156334188074],
                'rawAudio': [3.174643379044341, -3.278296053341783]}}

Your method of data loading seems correct, but you might consider loading your files with torchaudio or librosa instead of AudioSegment, just to check, and compare the results. Additionally, if your audio files contain a lot of noise, the noise-reduction process itself could introduce unwanted artifacts, potentially hurting the model's performance even more. In such cases, I suggest opting for more robust denoisers that use neural networks rather than relying solely on DSP-based ones. For instance, you could experiment with Denoiser, FullSubNet-plus, or Adobe's tool, Adobe Enhancer. Testing each of these might help you determine which one most effectively reduces noise while minimizing artifacts.
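As a rough sketch of the torchaudio route (assuming the feature_extractor and model objects from the model card; whether id2label is populated depends on the checkpoint):

import torch
import torchaudio

# Load, downmix to mono, and resample to the model's 16 kHz rate.
waveform, sr = torchaudio.load("sample.wav")  # placeholder path
waveform = waveform.mean(dim=0)
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = feature_extractor(
    raw_speech=waveform.numpy(),
    sampling_rate=16_000,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(inputs["input_values"]).logits
# Map the argmax index to a label name, if the checkpoint defines id2label.
print(model.config.id2label[int(logits.argmax(dim=-1))])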

Also, keep in mind that the model in this repository was trained primarily on clean, prepared speech (mostly read speech), and it may struggle to generalize to in-the-wild data with considerable noise or extensive prosodic variation. I am actually considering training a more robust version of the model on more diverse data.
