How to run pre-trained model on local audio file?

#2
by risau - opened

Hi! I am not sure how to use the model. I am trying to use it in a pipeline, but this code

from transformers import AutoProcessor, AutoModelForAudioClassification
processor = AutoProcessor.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")

gives this error:
OSError: Can't load tokenizer for 'ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.

I too am confused. I've used many other Hugging Face models, and this one needs a working example of applying the pretrained model to a local audio file.

So I hacked around with this model all morning, and here is a script showing how I THINK it is supposed to work. I don't feel confident that every step is accurate, but since the source doesn't seem to be available on GitHub and there's no public example, this is a start. Hopefully the author or someone else will correct my work and we'll all be better off. I can't believe this is the most downloaded audio-emotion model on Hugging Face at the moment (though "downloads" include all the people who try the online demo too).

(Honorable mention to GitHub Copilot, which filled in some of the code where documentation was lacking!)

import torch
from transformers import AutoProcessor, AutoModelForAudioClassification, Wav2Vec2FeatureExtractor
import numpy as np
from pydub import AudioSegment

# https://github.com/ehcalabres/EMOVoice
# the preprocessor was derived from https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english
# processor1 = AutoProcessor.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
# ^^^ no processor can be loaded for this model (see the tokenizer error above), but the `feature_extractor` below works in its place
model1 = AutoModelForAudioClassification.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")

def predict_emotion(audio_file):
    if not audio_file:
        # I fetched some samples with known emotions from here: https://www.fesliyanstudios.com/royalty-free-sound-effects-download/poeple-crying-252
        audio_file = 'mp3/dude-crying.mp3'
    sound = AudioSegment.from_file(audio_file)
    # resample to 16 kHz and downmix to mono, so stereo samples are not interleaved
    sound = sound.set_frame_rate(16000).set_channels(1)
    sound_array = np.array(sound.get_array_of_samples(), dtype=np.float32)
    # this model is VERY SLOW, so it is best to pass in short sections (10 s or less)
    # that contain emotional words from the transcript.
    # how to truncate the input -- this was necessary even with very short audio files
    # test = inputs.input_values.float()[:, :100000]

    # named `inputs` to avoid shadowing the built-in `input`
    inputs = feature_extractor(
        raw_speech=sound_array,
        sampling_rate=16000,
        padding=True,
        return_tensors="pt")

    # inference only, so gradient tracking is not needed
    with torch.no_grad():
        result = model1(inputs.input_values.float())
    # making sense of the result 
    id2label = {
        "0": "angry",
        "1": "calm",
        "2": "disgust",
        "3": "fearful",
        "4": "happy",
        "5": "neutral",
        "6": "sad",
        "7": "surprised"
    }
    interp = dict(zip(id2label.values(), list(round(float(i),4) for i in result[0][0])))
    return interp

{'angry': 0.0389, 'calm': 0.0323, 'disgust': -0.0222, 'fearful': -0.1644, 'happy': -0.0891, 'neutral': 0.0672, 'sad': 0.0889, 'surprised': 0.053}

(the sample was of a dude crying, and "sad" is the highest scoring match, so it worked)
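The numbers above are the model's raw output scores (logits), which is why some of them are negative. A minimal sketch of turning them into probabilities with a softmax, using only the values posted above; the `softmax` helper is written here for illustration and is not part of the original script:

```python
import math

# Raw scores copied from the output posted above.
logits = {'angry': 0.0389, 'calm': 0.0323, 'disgust': -0.0222, 'fearful': -0.1644,
          'happy': -0.0891, 'neutral': 0.0672, 'sad': 0.0889, 'surprised': 0.053}


def softmax(scores):
    """Turn a dict of raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


probs = softmax(logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With logits this close to zero, every class ends up near 1/8, so "sad" wins by only a small margin.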

My audio test file. Seems to be an actor, not an authentically sad person, but the classifier should still pick up the signals.
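The comments in the script suggest feeding the model short sections of roughly 10 s or less. A minimal sketch of splitting a sample array into such sections before calling the feature extractor; `split_into_chunks` and its parameter names are hypothetical helpers, not something from the script above:

```python
def split_into_chunks(samples, sampling_rate=16000, max_seconds=10):
    """Split a 1-D sequence of audio samples into pieces of at most max_seconds."""
    chunk_size = int(sampling_rate * max_seconds)
    if chunk_size <= 0:
        raise ValueError("sampling_rate * max_seconds must be positive")
    # Slicing works for Python lists and NumPy arrays alike.
    return [samples[start:start + chunk_size]
            for start in range(0, len(samples), chunk_size)]


# Toy example: 35 samples at 2 samples per second is 17.5 s of "audio".
chunks = split_into_chunks(list(range(35)), sampling_rate=2, max_seconds=10)
print([len(chunk) for chunk in chunks])
```

Each chunk could then be passed through `feature_extractor` and the model separately, at the cost of losing context across chunk boundaries.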

Hello @marcmaxmeister
When I try your code with the sample "dude crying" I get totally different results:
{'angry': -0.0665,
'calm': 0.1155,
'disgust': 0.0035,
'fearful': 0.0059,
'happy': -0.061,
'neutral': -0.08,
'sad': -0.0665,
'surprised': 0.009}

Can you try again and see whether you get my results, or still the good results with "sad" being the highest?

Sure. I copied the code from this page and reran it on my local copy of dude-crying.mp3 as well as a downloaded copy of the WAV file, just in case the site modified the bitrate or something. Got the same result each time, but they did not match my previous run or your test run. Image attached.

image.png

Anyone notice the warnings when loading the pretrained model?

Some weights of the model checkpoint at ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition were not used when initializing Wav2Vec2ForSequenceClassification: ['classifier.output.bias', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.output.weight']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition and are newly initialized: ['projector.bias', 'classifier.bias', 'projector.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Does it mean the model is actually outputting random predictions?
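One way to read the warning is as a set difference between the parameter names stored in the checkpoint and the names the model class expects. The sketch below reproduces that comparison with the names from the warning; `diff_keys` is a hypothetical helper, and the single `wav2vec2.placeholder.weight` entry is a made-up stand-in for the backbone tensors that both sides share:

```python
def diff_keys(checkpoint_keys, model_keys):
    """Return (unexpected, missing) parameter names, each sorted."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    unexpected = sorted(checkpoint_keys - model_keys)  # in the file, not used by the model
    missing = sorted(model_keys - checkpoint_keys)     # needed by the model, not in the file
    return unexpected, missing


shared = ["wav2vec2.placeholder.weight"]  # hypothetical name, for illustration only
checkpoint_keys = shared + ["classifier.dense.weight", "classifier.dense.bias",
                            "classifier.output.weight", "classifier.output.bias"]
model_keys = shared + ["projector.weight", "projector.bias",
                       "classifier.weight", "classifier.bias"]

unexpected, missing = diff_keys(checkpoint_keys, model_keys)
print("not used:", unexpected)
print("newly initialized:", missing)
```

The "newly initialized" names are the whole classification head, and those tensors get random values at load time. That would be consistent with the different scores reported for the same audio file earlier in this thread.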

I met this issue too. Did you solve this problem?

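Well, I think I solved it. The mismatch comes from the layer names: the checkpoint stores the head as classifier.dense and classifier.output, while the model expects projector and classifier. I manually set the values of the randomly initialized layers.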

How did you manually set the values of the randomly initialized layers?

I have the same question. How did you set that?

This is how I did it. Please let me know if it works on your data! @Zkli @qadoor @huttersadan

import torch
from torch import nn
from transformers import AutoModelForAudioClassification

# loads from a local copy of the model repository
model = AutoModelForAudioClassification.from_pretrained("wav2vec2-lg-xlsr-en-speech-emotion-recognition")

# recreate the two head layers that were randomly initialized
model.projector = nn.Linear(1024, 1024, bias=True)
model.classifier = nn.Linear(1024, 8, bias=True)

torch_state_dict = torch.load('/content/wav2vec2-lg-xlsr-en-speech-emotion-recognition/pytorch_model.bin', map_location=torch.device('cpu'))

# copy the trained head weights, which the checkpoint stores under different names
model.projector.weight.data = torch_state_dict['classifier.dense.weight']
model.projector.bias.data = torch_state_dict['classifier.dense.bias']

model.classifier.weight.data = torch_state_dict['classifier.output.weight']
model.classifier.bias.data = torch_state_dict['classifier.output.bias']
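The same idea can be written as a key rename on the state dict instead of four manual assignments. This is only a sketch: `RENAME` and `rename_head_keys` are hypothetical names, the mapping is taken from the assignments above, and the demo uses placeholder strings rather than real tensors.

```python
# Checkpoint prefix -> prefix the model class expects (from the assignments above).
RENAME = {
    "classifier.dense.": "projector.",
    "classifier.output.": "classifier.",
}


def rename_head_keys(state_dict):
    """Return a copy of state_dict with the head's parameter names remapped."""
    renamed = {}
    for key, value in state_dict.items():
        for old_prefix, new_prefix in RENAME.items():
            if key.startswith(old_prefix):
                key = new_prefix + key[len(old_prefix):]
                break  # rename each key at most once
        renamed[key] = value
    return renamed


# Placeholder values stand in for tensors in this demo.
demo = {
    "classifier.dense.weight": "W1", "classifier.dense.bias": "b1",
    "classifier.output.weight": "W2", "classifier.output.bias": "b2",
}
print(rename_head_keys(demo))

# With real weights, the result could be passed to the model, e.g.
#   model.load_state_dict(rename_head_keys(torch_state_dict), strict=False)
# This has not been verified against this particular checkpoint.
```

Either approach only copies weights. Whether the resulting forward pass matches how the head was originally trained is not confirmed anywhere in this thread, so it is worth checking predictions on clips with known labels.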

I was able to use the pipeline (from transformers) by passing it the feature extractor and the model, but I only get the top five scoring classes. Is there a way to get all eight?

[Edit: I had to add return_all_scores=True when instantiating the pipeline.]
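As an alternative to the flag mentioned in the edit, the audio-classification pipeline also accepts a `top_k` argument when it is called, which controls how many labels come back. A hedged sketch, assuming a reasonably recent transformers version; `classify_all_labels` is a hypothetical wrapper, and the usage lines are left as comments because they need the model download:

```python
def classify_all_labels(classifier, audio_path, num_labels=8):
    """Call an audio-classification pipeline and request a score for every label."""
    # top_k is passed at call time; the pipeline's default returns only five labels.
    return classifier(audio_path, top_k=num_labels)


# Usage (not run here; assumes `model1` and `feature_extractor` from the script
# earlier in this thread, plus ffmpeg so the pipeline can decode a file path):
#
#   from transformers import pipeline
#   classifier = pipeline("audio-classification",
#                         model=model1, feature_extractor=feature_extractor)
#   scores = classify_all_labels(classifier, "mp3/dude-crying.mp3")
```

The pipeline returns a list of `{"label": ..., "score": ...}` dicts, so with `top_k=8` all eight emotions should be present.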
