fix/model_load_projector_classifier_issue

#12
by entfane - opened
No description provided.

Fixed a model-loading issue. Reassigned the classifier and projector weights and biases to match Transformers v5.

entfane changed pull request status to open

Independent confirmation that this fix matters. On stock transformers >=4.51, AutoModelForAudioClassification.from_pretrained(...) on this checkpoint silently re-initializes the classifier head:

classifier.output.bias   | UNEXPECTED
classifier.output.weight | UNEXPECTED
classifier.dense.bias    | UNEXPECTED
classifier.dense.weight  | UNEXPECTED
projector.weight         | MISSING (newly initialized)
projector.bias           | MISSING (newly initialized)
classifier.weight        | MISSING (newly initialized)
classifier.bias          | MISSING (newly initialized)

No exception is raised, so users get garbage predictions and tend to conclude the model itself is poor. On RAVDESS Actor_01 (8 clips, one per emotion), the unfixed load gave 1/8 correct — essentially chance.
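One way to turn this silent failure into a loud one is to compare the checkpoint's head keys against the keys the model class expects before trusting any prediction. The sketch below is framework-free and only illustrates the check: `diff_head_keys` and `assert_head_loaded` are hypothetical helper names, and the two key lists are copied from the load report above. In a real script they would come from the safetensors file and from `model.state_dict()`.

```python
def diff_head_keys(checkpoint_keys, model_keys, prefixes=("classifier.", "projector.")):
    """Return (missing, unexpected) head keys, mirroring the load report."""
    ckpt = {k for k in checkpoint_keys if k.startswith(prefixes)}
    model = {k for k in model_keys if k.startswith(prefixes)}
    missing = sorted(model - ckpt)      # expected by the model, absent from the checkpoint
    unexpected = sorted(ckpt - model)   # present in the checkpoint, ignored by the model
    return missing, unexpected


def assert_head_loaded(checkpoint_keys, model_keys):
    """Raise instead of silently continuing with a re-initialized head."""
    missing, unexpected = diff_head_keys(checkpoint_keys, model_keys)
    if missing or unexpected:
        raise RuntimeError(
            f"classification head not loaded: missing={missing}, unexpected={unexpected}"
        )


# Key names taken from the load report above.
CHECKPOINT_KEYS = [
    "classifier.dense.weight", "classifier.dense.bias",
    "classifier.output.weight", "classifier.output.bias",
]
MODEL_KEYS = [
    "projector.weight", "projector.bias",
    "classifier.weight", "classifier.bias",
]

missing, unexpected = diff_head_keys(CHECKPOINT_KEYS, MODEL_KEYS)
```

Calling `assert_head_loaded(CHECKPOINT_KEYS, MODEL_KEYS)` on these lists raises, which is the behavior a user would want in place of garbage predictions.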

One additional point worth flagging in the fix: the original training head was pool → Linear(1024,1024) → tanh → Linear(1024,8). The stock Wav2Vec2ForSequenceClassification forward in transformers v5 does not apply tanh between projector and classifier. If the remapping just renames the weights, accuracy will be measurably below the original.
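For concreteness, a rename-only remapping could look like the sketch below. The mapping is an assumption inferred from the layer shapes (the 1024 to 1024 `dense` layer onto `projector`, the 1024 to 8 `output` layer onto `classifier`), not taken from the PR diff, and `remap_head_keys` is a hypothetical helper name. It makes the weights load, but it cannot restore the tanh, because the activation lives in the forward pass rather than in the state dict.

```python
# Assumed mapping from the original head names to the stock
# Wav2Vec2ForSequenceClassification names (inferred, not confirmed by the PR).
HEAD_KEY_MAP = {
    "classifier.dense.weight": "projector.weight",
    "classifier.dense.bias": "projector.bias",
    "classifier.output.weight": "classifier.weight",
    "classifier.output.bias": "classifier.bias",
}


def remap_head_keys(state_dict):
    """Return a new dict with the head keys renamed; all other keys pass through."""
    return {HEAD_KEY_MAP.get(key, key): value for key, value in state_dict.items()}


# Toy state dict with placeholder values standing in for tensors.
# "encoder.example.weight" is a made-up non-head key for illustration.
toy = {
    "encoder.example.weight": 0,
    "classifier.dense.weight": 1,
    "classifier.dense.bias": 2,
    "classifier.output.weight": 3,
    "classifier.output.bias": 4,
}
remapped = remap_head_keys(toy)
```

After this step the stock forward computes the two linear layers back to back with no nonlinearity between them, which is why the rename alone is not equivalent to the trained head.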

Reproducible reference forward that preserves the activation:

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoFeatureExtractor, Wav2Vec2Model


class EhcalabresSER(torch.nn.Module):
    LABELS = ["angry","calm","disgust","fearful","happy","neutral","sad","surprised"]

    def __init__(self, name="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"):
        super().__init__()
        self.feat = AutoFeatureExtractor.from_pretrained(name)
        self.encoder = Wav2Vec2Model.from_pretrained(name)
        h = self.encoder.config.hidden_size
        # Rebuild the original head: Linear(h, h) -> tanh -> Linear(h, 8).
        self.dense = torch.nn.Linear(h, h)
        self.output = torch.nn.Linear(h, len(self.LABELS))
        # Copy the head weights from the checkpoint under their original key names.
        sd = load_file(hf_hub_download(name, "model.safetensors"))
        self.dense.weight.data.copy_(sd["classifier.dense.weight"])
        self.dense.bias.data.copy_(sd["classifier.dense.bias"])
        self.output.weight.data.copy_(sd["classifier.output.weight"])
        self.output.bias.data.copy_(sd["classifier.output.bias"])
        self.eval()

    @torch.no_grad()
    def forward(self, y, sr=16000):
        # y: raw waveform sampled at sr Hz; returns logits ordered as in LABELS.
        x = self.feat(y, sampling_rate=sr, return_tensors="pt", padding=True)
        feats = self.encoder(**x).last_hidden_state
        # Mean-pool over time, then apply the tanh that the stock forward omits.
        h = torch.tanh(self.dense(feats.mean(dim=1)))
        return self.output(h)

With this forward, on RAVDESS (24 actors, 1440 clips) on CPU: 91.6% top-1 / 0.91 macro-F1, mean inference 746 ms per ~3.7s clip (RTF 0.20). Numbers are consistent with the originally reported model performance.
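For anyone checking the timing figure, the real-time factor is just mean inference time divided by mean clip duration. A minimal sketch using the rounded numbers quoted above (`real_time_factor` is a hypothetical helper name):

```python
def real_time_factor(inference_seconds, audio_seconds):
    """RTF below 1.0 means inference runs faster than real time."""
    return inference_seconds / audio_seconds


# 746 ms mean inference on a ~3.7 s clip, as reported above.
rtf = real_time_factor(0.746, 3.7)
print(f"RTF = {rtf:.2f}")  # RTF = 0.20
```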

Two suggestions for the maintainer:

  1. Either merge a fix that preserves the tanh (custom subclass or post-hoc head) or add inference code to the model card that does so explicitly, so users don't fall into the silent-failure path.
  2. Add a one-line warning in the README about the transformers version compatibility — many users will hit this without knowing why their accuracy is poor.

Happy to upstream the reference forward as a code snippet or a tiny model.py shim if useful.

