Has anyone successfully gotten the inference code for this model to run?

#3
by apobec3f - opened

I tried to write up the inference code based on the base model it was fine-tuned from, but no success :(

Kind of...

import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# load the model and feature extractor on CPU
modelw = Wav2Vec2ForSequenceClassification.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech').to('cpu')
processor = Wav2Vec2FeatureExtractor.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')

# waveform is assumed to be a 1-D array of 16 kHz mono samples
sound_array = np.array(waveform)
input_values = processor(sound_array, sampling_rate=16000, padding='longest', return_tensors='pt').input_values
with torch.no_grad():
    result = modelw(input_values).logits
probs = list(result.detach().numpy()[0])

This works and produces reasonable results.

However, for certain files/segments there seem to be memory leaks.

I am not sure what causes it.

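In case it helps, a workaround I have been sketching (not verified to fix the leak) is to run inference on fixed-length chunks instead of the whole waveform, so a single long file never produces a huge input tensor. The chunk length and the helper name below are arbitrary choices of mine:

import torch
import numpy as np

CHUNK_SECONDS = 30      # arbitrary; shorter chunks bound peak memory further
SAMPLE_RATE = 16000

def chunked_logits(sound_array, processor, model):
    chunk = CHUNK_SECONDS * SAMPLE_RATE
    all_logits = []
    for start in range(0, len(sound_array), chunk):
        piece = sound_array[start:start + chunk]
        inputs = processor(piece, sampling_rate=SAMPLE_RATE, return_tensors='pt').input_values
        with torch.no_grad():
            all_logits.append(model(inputs).logits)
    # average the per-chunk logits for a single file-level prediction
    return torch.cat(all_logits).mean(dim=0, keepdim=True)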

I tried your code with something like this:

import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
from scipy.io.wavfile import read

modelw = Wav2Vec2ForSequenceClassification.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')
processor = Wav2Vec2FeatureExtractor.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')

audio_pth = 'your_test_audio_path.wav'
# read() returns (sample_rate, data); the file must already be 16 kHz mono
sample_rate, data = read(audio_pth)
sound_array = np.array(data, dtype=float)
input_values = processor(sound_array, sampling_rate=16000, padding='longest', return_tensors='pt').input_values
with torch.no_grad():
    logits = modelw(input_values).logits
    prob = torch.nn.functional.softmax(logits, dim=1)

Note that I used prob = torch.nn.functional.softmax(logits, dim=1) on the last line to obtain the probabilities, which seems to make sense based on a few test audios I tried.

Indeed, the following for instance:

with torch.no_grad():
    logits = modelw(input_values).logits
    prob = torch.nn.functional.softmax(logits, dim=1).tolist()[0][0]

Seems to return the probability of the speaker being female.

Whereas:

prob = torch.nn.functional.softmax(logits, dim=1).tolist()[0][1]

Returns the probability of a male.

As this is a binary classifier, the two probabilities sum to 1.
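If you would rather not hard-code the indices, the label names can also be read from the model config (assuming this checkpoint populates id2label; modelw and input_values are from the snippet above):

import torch

with torch.no_grad():
    logits = modelw(input_values).logits
probs = torch.nn.functional.softmax(logits, dim=1)[0]
pred = int(probs.argmax())
print(modelw.config.id2label[pred], float(probs[pred]))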

Sorry, I just found that on some audios it makes the wrong prediction, while the inference API does not:
This one, for example, is predicted as male, while on the inference API it is female (correct).
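One guess on my side (not confirmed): the inference API presumably resamples uploads, while scipy's read does no resampling, so a file that is not already 16 kHz would be fed to the model at the wrong rate. Loading with librosa instead would rule that out, since it resamples on load:

import librosa

# librosa returns float32 mono audio resampled to the requested rate
sound_array, sr = librosa.load(audio_pth, sr=16000, mono=True)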

For me it works fine so far with some 40 speakers.

After diarisation, I have concatenated the longest non-overlapping segments corresponding to each speaker into separate WAV files (the preprocessing involves voice isolation with demucs, normalisation with pydub and conversion to 16 kHz mono WAV with pysox). I am also removing silences with pysox as that helps for other tasks but so far it does not seem to have a noticeable effect for gender attribution.
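For reference, the pydub and pysox steps look roughly like this (the file names and silence parameters here are illustrative defaults, not the exact values from my pipeline):

from pydub import AudioSegment
from pydub.effects import normalize
import sox

# normalise loudness with pydub
seg = normalize(AudioSegment.from_wav('speaker_vocals.wav'))
seg.export('speaker_norm.wav', format='wav')

# convert to 16 kHz mono and strip silences with pysox
tfm = sox.Transformer()
tfm.convert(samplerate=16000, n_channels=1)
tfm.silence(location=0)
tfm.build('speaker_norm.wav', 'speaker_16k_mono.wav')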

With this preprocessing, the model seems to work well.

Previously I had also experienced some issues, including the inability to process certain segments (huge memory usage and never-ending processing) as well as gender misassignment.

I'm starting to feel like you are me from another universe, because we have been doing exactly the same preprocessing steps! Wow!
Glad you got it working fine, I will look a bit closer at my pipeline.
Also, an off-topic question: the official demucs interface is really hard to use; for example, it always writes the results to files in a predefined folder. It also has memory issues when processing large audio snippets (possibly due to inefficient segmentation and batching).
Have you had any luck getting it to work reliably? :D

You can check the initial preprocessing there:

https://github.com/mirix/approaches-to-diarisation/tree/main

Namely:

import shlex
import demucs.separate

# separate the vocals from the rest using the mdx_extra model
demucs.separate.main(shlex.split('--two-stems vocals -n mdx_extra ' + 'samples/' + name + ' -o tmp'))
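If I remember correctly, demucs then writes the vocals stem to tmp/mdx_extra/<track name>/vocals.wav (and the accompaniment to no_vocals.wav), which you can read back for the next step.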

Thanks man! Appreciate that!!

alefiury changed discussion status to closed
