Flattery Prediction from Speech

This Wav2Vec2 model was fine-tuned to predict flattery from speech in English earnings calls. It was introduced in "This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach", which was accepted at INTERSPEECH 2024. If you are looking for the text-based classifier (based on RoBERTa) introduced in the paper, please see here.

Model Details

Model Description

This is a (further) fine-tuned variant of a Wav2Vec2 model for Speech Emotion Recognition in MSP. It was trained on a dataset of single sentences uttered in business calls, each labeled for flattery in a binary manner. The training set comprised 7167 sentences, and a further 1878 sentences were used as the development set. For more details, please refer to the paper, especially Section 2 for the dataset, Section 3.2.2 for the training procedure, and Section 4.2 for the results. The checkpoint provided here was trained without further pruning the model. It achieves Unweighted Average Recall (UAR) values of .8001 and .8084 on the development and test partitions, respectively.
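For readers unfamiliar with the metric, UAR is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The following is a minimal sketch of that computation on made-up toy labels; the function name and the data are illustrative assumptions and are not taken from the paper's evaluation code.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class has equal weight."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy labels (made up for illustration): 1 = flattery, 0 = no flattery
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR={uar:.2f}, accuracy={accuracy:.2f}")  # UAR=0.75, accuracy=0.90
```

In this toy case, missing one of the two positive samples barely affects accuracy but lowers UAR noticeably, which is why UAR is a common choice when one class is much rarer than the other.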

Model Sources

Repository: [More Information Needed]
Paper [optional]: [More Information Needed]

Usage

The following snippet illustrates the usage of the model.

import torch
import librosa
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# initialize the model and its feature extractor
checkpoint = "chrlukas/flattery_prediction_speech"
processor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# predict flattery in a sentence
example_file = 'example.wav'
# audio must be resampled to 16 kHz; librosa.load does this when sr=16000
y, _ = librosa.load(example_file, sr=16000)
inp = processor(y, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    logits = model(**inp).logits
# sigmoid maps the single output logit to a flattery probability
prediction = torch.sigmoid(logits).item()
flattery = prediction >= 0.5
print(f'Flattery detected? {flattery}')
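The decision rule in the snippet (sigmoid followed by a 0.5 cut-off) can also be applied to several logits at once. The sketch below shows this in plain Python without loading the model; the helper name `flattery_decisions`, the adjustable `threshold` parameter, and the example logit values are assumptions made for illustration, and only the 0.5 default comes from the snippet above.

```python
import math


def flattery_decisions(logits, threshold=0.5):
    """Map raw single-logit outputs to (probability, decision) pairs."""
    results = []
    for z in logits:
        prob = 1.0 / (1.0 + math.exp(-z))  # sigmoid of the raw logit
        results.append((prob, prob >= threshold))
    return results


# Hypothetical logits, e.g. collected from several sentences
example_logits = [-2.0, 0.0, 1.5]
decisions = flattery_decisions(example_logits)
for prob, is_flattery in decisions:
    print(f"p={prob:.3f} flattery={is_flattery}")
```

With the default threshold, a logit of 0 or above counts as flattery, since the sigmoid of 0 is exactly 0.5. Raising the threshold trades recall for precision; whether that is desirable depends on the application and is not evaluated here.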

Bias, Risks, and Limitations

The model is trained on a highly domain-specific dataset sourced from earnings calls, i.e., typically conversations between business analysts and CEOs of US-American companies. Hence, it cannot be expected to generalize well to other domains and contexts. Moreover, the majority of speakers (162/178) in the training dataset are male. However, we found this to have rather little impact on the model's performance for held-out female speakers (cf. Section 4.4 in the paper).

Citation

BibTeX:

TODO