HuBERT fine-tuned on the DUSHA dataset for speech emotion recognition in Russian

The pre-trained base model is facebook/hubert-large-ls960-ft

The DUSHA dataset used can be found here

Fine-tuning

Fine-tuned in Google Colab on a Pro account with an A100 GPU

Froze all layers except the projector, the classifier, and all 24 HubertEncoderLayerStableLayerNorm layers
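The freezing scheme described above can be sketched as a small helper that toggles requires_grad by parameter name. This is a minimal sketch, not the original training code: the helper name freeze_except is hypothetical, and the prefixes are an assumption about how the parameters of HubertForSequenceClassification are named (check them against model.named_parameters() before relying on them).

```python
# Assumed parameter-name prefixes for the parts that stay trainable:
# the projector, the classifier, and the transformer encoder layers.
TRAINABLE_PREFIXES = ("projector.", "classifier.", "hubert.encoder.layers.")


def freeze_except(model, trainable_prefixes=TRAINABLE_PREFIXES):
    """Freeze every parameter whose name does not start with one of the prefixes.

    Works on any object exposing named_parameters() the way torch.nn.Module does.
    Returns the names of the parameters left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in model.named_parameters():
        # True keeps the parameter trainable, False freezes it
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

Calling freeze_except(model) on the loaded model and printing the returned names is a quick way to confirm that only the intended parameters will receive gradients.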

Used half of the training set

Training parameters

  • 2 epochs
  • train batch size = 8
  • eval batch size = 8
  • gradient accumulation steps = 4
  • learning rate = 5e-5, with no warmup and no decay
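For reference, the parameters above can be written as a plain dictionary. This is a sketch under assumptions: the key names follow transformers.TrainingArguments, which the original run may or may not have used, and "no decay" is interpreted here as a constant learning-rate schedule. The helper effective_batch_size is hypothetical and assumes a single GPU.

```python
# Hyperparameters from the list above, keyed by TrainingArguments names
hyperparameters = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-5,
    "lr_scheduler_type": "constant",  # assumption: "no decay" means a constant schedule
    "warmup_steps": 0,                # no warmup
}


def effective_batch_size(params, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return (
        params["per_device_train_batch_size"]
        * params["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(hyperparameters))  # 32
```

With gradient accumulation, each optimizer step therefore sees 8 × 4 = 32 samples on one device, even though only 8 are held in memory at a time.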

Metrics

Achieved on the test set:

  • accuracy = 0.86
  • balanced accuracy = 0.76
  • macro F1 score = 0.81

This improves accuracy and F1 score compared to the dataset baseline.
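To make the three metrics concrete, here is a minimal pure-Python sketch of how they are usually defined: accuracy is the share of correct predictions, balanced accuracy is the mean per-class recall, and macro F1 is the unweighted mean of per-class F1 scores. The function name classification_metrics and the toy labels are illustrative assumptions, not the evaluation code used for the numbers above.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (accuracy, balanced accuracy, macro F1) for two label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    accuracy = sum(tp.values()) / len(y_true)

    recalls, f1_scores = [], []
    for label in sorted(set(y_true) | set(y_pred)):
        support = tp[label] + fn[label]    # samples whose true class is label
        predicted = tp[label] + fp[label]  # samples predicted as label
        recall = tp[label] / support if support else 0.0
        precision = tp[label] / predicted if predicted else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        if support:  # balanced accuracy averages over classes present in y_true
            recalls.append(recall)

    balanced_accuracy = sum(recalls) / len(recalls)
    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, balanced_accuracy, macro_f1


# Toy example with an over-represented class 0
acc, bal_acc, f1 = classification_metrics(
    [0, 0, 0, 0, 1, 1, 2, 2],
    [0, 0, 0, 0, 1, 0, 2, 1],
)
```

On imbalanced data, balanced accuracy and macro F1 sit below plain accuracy when the minority classes are recognized less reliably, which is consistent with the gap between 0.86 and 0.76 reported here.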

Usage

from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor
import torchaudio
import torch

# Load the feature extractor of the base model and the fine-tuned classifier
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForSequenceClassification.from_pretrained("xbgoose/hubert-speech-emotion-recognition-russian-dusha-finetuned")
num2emotion = {0: 'neutral', 1: 'angry', 2: 'positive', 3: 'sad', 4: 'other'}

filepath = "path/to/audio.wav"

# Load the audio and resample it to 16 kHz
waveform, sample_rate = torchaudio.load(filepath, normalize=True)
transform = torchaudio.transforms.Resample(sample_rate, 16000)
waveform = transform(waveform)

inputs = feature_extractor(
    waveform,
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True,
    max_length=16000 * 10,  # 10 seconds of audio at 16 kHz
    truncation=True,
)

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(inputs['input_values'][0]).logits
predictions = torch.argmax(logits, dim=-1)
predicted_emotion = num2emotion[predictions.numpy()[0]]
print(predicted_emotion)
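The usage code reports only the top class. If a score per emotion is wanted, the logits can be passed through a softmax. The following is a minimal pure-Python sketch of that step; the function name emotion_probabilities is hypothetical, and the label mapping is copied from num2emotion above.

```python
import math

NUM2EMOTION = {0: 'neutral', 1: 'angry', 2: 'positive', 3: 'sad', 4: 'other'}


def emotion_probabilities(logits):
    """Convert one row of five logits into a dict of emotion -> probability."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return {NUM2EMOTION[i]: e / total for i, e in enumerate(exps)}


# Illustrative logits, not real model output
probs = emotion_probabilities([2.0, 0.5, 0.1, -1.0, -2.0])
top_emotion = max(probs, key=probs.get)
```

With the tensors from the usage code, the same result can be obtained by calling emotion_probabilities(logits[0].tolist()), or directly in PyTorch with torch.softmax(logits, dim=-1).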