fix/model_load_projector_classifier_issue
Fixes a model-loading issue by reassigning the weights and biases of the classifier and projector to match the layout expected by Transformers v5.
Independent confirmation that this fix matters. On stock transformers >=4.51, AutoModelForAudioClassification.from_pretrained(...) on this checkpoint silently re-initializes the classifier head:
classifier.output.bias | UNEXPECTED
classifier.output.weight | UNEXPECTED
classifier.dense.bias | UNEXPECTED
classifier.dense.weight | UNEXPECTED
projector.weight | MISSING (newly initialized)
projector.bias | MISSING (newly initialized)
classifier.weight | MISSING (newly initialized)
classifier.bias | MISSING (newly initialized)
No exception is raised, so users get garbage predictions and tend to conclude the model itself is poor. On RAVDESS Actor_01 (8 clips, one per emotion), the unfixed load gave 1/8 correct, which is exactly chance level for eight classes.
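A minimal sketch of how a caller could turn this silent failure into a loud one. It assumes `from_pretrained(..., output_loading_info=True)` returns a dict with `missing_keys`, `unexpected_keys` and `mismatched_keys` on your transformers version; the helper names `head_loading_problems` and `load_checked` are hypothetical, not part of any library.

```python
# Assumption: head parameters live under these two prefixes, as in the load report above.
HEAD_PREFIXES = ("projector.", "classifier.")


def head_loading_problems(loading_info):
    """Return head-related keys that were missing, unexpected, or mismatched."""
    problems = {}
    for kind in ("missing_keys", "unexpected_keys", "mismatched_keys"):
        names = []
        for entry in loading_info.get(kind, []):
            # mismatched entries may be tuples such as (key, checkpoint_shape, model_shape)
            name = entry[0] if isinstance(entry, tuple) else entry
            if name.startswith(HEAD_PREFIXES):
                names.append(name)
        if names:
            problems[kind] = sorted(names)
    return problems


def load_checked(name):
    """Load the classifier and raise if any head weight was not taken from the checkpoint."""
    # Imported lazily so the pure-Python helper above works without transformers installed.
    from transformers import AutoModelForAudioClassification

    model, info = AutoModelForAudioClassification.from_pretrained(
        name, output_loading_info=True
    )
    problems = head_loading_problems(info)
    if problems:
        raise RuntimeError(f"classification head not loaded from checkpoint: {problems}")
    return model
```

Calling `load_checked("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")` on an affected install should then fail immediately instead of returning a model with a randomly initialized head.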
One additional point worth flagging in the fix: the original training head was pool → Linear(1024,1024) → tanh → Linear(1024,8). The stock Wav2Vec2ForSequenceClassification forward in transformers v5 does not apply tanh between projector and classifier. If the remapping just renames the weights, accuracy will be measurably below the original.
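To make the "just renames the weights" case concrete, here is a sketch of a rename-only remapping. The key pairs are inferred from the load report above (`classifier.dense.*` to `projector.*`, `classifier.output.*` to `classifier.*`); whether the PR does exactly this is an assumption, and `remap_head_keys` is a hypothetical helper.

```python
# Assumed mapping, inferred from the UNEXPECTED / MISSING keys in the load report.
HEAD_KEY_MAP = {
    "classifier.dense.weight": "projector.weight",
    "classifier.dense.bias": "projector.bias",
    "classifier.output.weight": "classifier.weight",
    "classifier.output.bias": "classifier.bias",
}


def remap_head_keys(state_dict):
    """Return a copy of state_dict with the old head keys renamed; other keys are untouched."""
    return {HEAD_KEY_MAP.get(key, key): value for key, value in state_dict.items()}
```

A remapping like this makes the weights load, but it only changes names: the stock forward still runs projector then classifier with no `tanh` in between, so the logits will differ from those of the original head.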
Reproducible reference forward that preserves the activation:
import torch


class EhcalabresSER(torch.nn.Module):
    LABELS = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]

    def __init__(self, name="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"):
        super().__init__()
        from transformers import AutoFeatureExtractor, Wav2Vec2Model
        from huggingface_hub import hf_hub_download
        from safetensors.torch import load_file

        self.feat = AutoFeatureExtractor.from_pretrained(name)
        self.encoder = Wav2Vec2Model.from_pretrained(name)

        # Rebuild the original head: Linear(h, h) -> tanh -> Linear(h, num_labels)
        h = self.encoder.config.hidden_size
        self.dense = torch.nn.Linear(h, h)
        self.output = torch.nn.Linear(h, len(self.LABELS))

        # Copy the head weights straight from the checkpoint under their original key names
        sd = load_file(hf_hub_download(name, "model.safetensors"))
        self.dense.weight.data.copy_(sd["classifier.dense.weight"])
        self.dense.bias.data.copy_(sd["classifier.dense.bias"])
        self.output.weight.data.copy_(sd["classifier.output.weight"])
        self.output.bias.data.copy_(sd["classifier.output.bias"])
        self.eval()

    @torch.no_grad()
    def forward(self, y, sr=16000):
        # y: raw waveform sampled at sr Hz
        x = self.feat(y, sampling_rate=sr, return_tensors="pt", padding=True)
        feats = self.encoder(**x).last_hidden_state
        # Mean-pool over time, then dense -> tanh -> output
        h = torch.tanh(self.dense(feats.mean(dim=1)))
        return self.output(h)
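The forward above returns raw logits in the order of `LABELS`. A small sketch of turning one clip's logits into a label and a probability follows; `decode_logits` is a hypothetical helper, written in plain Python so it does not depend on torch.

```python
import math

# Same label order as EhcalabresSER.LABELS above.
LABELS = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]


def decode_logits(logits, labels=LABELS):
    """Return (label, probability) for one clip, using a numerically stable softmax."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]  # subtract the max to avoid overflow
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

With a model instance `ser = EhcalabresSER()` and a 16 kHz waveform `y`, this would be called as `decode_logits(ser(y)[0].tolist())`.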
With this forward, on RAVDESS (24 actors, 1440 clips) on CPU: 91.6% top-1 / 0.91 macro-F1, mean inference 746 ms per ~3.7s clip (RTF 0.20). Numbers are consistent with the originally reported model performance.
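For anyone re-running the evaluation, the three figures can be recomputed roughly as follows. This is a sketch with hypothetical helper names, not the script used for the numbers above; RTF is taken to mean processing time divided by audio duration, which matches 0.746 s / 3.7 s ≈ 0.20.

```python
def top1_accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


# Check against the timing quoted above: 746 ms per ~3.7 s clip.
rtf = real_time_factor(0.746, 3.7)
```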
Two suggestions for the maintainer:
- Either merge a fix that preserves the tanh (custom subclass or post-hoc head) or add inference code to the model card that does so explicitly, so users don't fall into the silent-failure path.
- Add a one-line warning in the README about the transformers version compatibility; many users will hit this without knowing why their accuracy is poor.
Happy to upstream the reference forward as a code snippet or a tiny model.py shim if useful.