metadata
license: apache-2.0
language:
- en
base_model:
- ntu-spml/distilhubert
pipeline_tag: audio-classification
library_name: transformers
Emotion Detection From Speech
This model is a fine-tuned version of DistilHuBERT that classifies emotions from audio input.
Approach
- Dataset: The RAVDESS dataset, comprising 1,440 audio files across 8 emotion labels: calm, happy, sad, angry, fearful, surprise, neutral, and disgust.
- Model Fine-Tuning: The DistilHuBERT model was fine-tuned for 7 epochs with a learning rate of 5e-5, achieving an accuracy of 98% on the test dataset.
Data Preprocessing
- Sampling Rate: Audio files were resampled to 16 kHz to match the model's expected input rate.
- Padding: Audio clips shorter than 30 seconds were zero-padded to that length.
- Train-Test Split: 80% of the samples were used for training, and 20% for testing.
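The padding and split steps above can be sketched roughly as follows. This is a minimal illustration using NumPy, not the author's preprocessing code: the helper names, the truncation of clips longer than 30 seconds, the shuffling, and the random seed are all assumptions. Resampling is left out here and would normally be handled by an audio library before this step.

```python
import numpy as np

TARGET_SR = 16_000                     # sampling rate stated in this model card
MAX_SECONDS = 30                       # clip length stated in this model card
MAX_SAMPLES = TARGET_SR * MAX_SECONDS  # 480,000 samples per clip


def pad_or_truncate(waveform, max_samples=MAX_SAMPLES):
    """Zero-pad a 1-D waveform on the right up to max_samples.

    Truncating longer clips is an assumption; the card only mentions padding.
    """
    waveform = waveform[:max_samples]
    padding = max_samples - len(waveform)
    return np.pad(waveform, (0, padding), mode="constant")


def train_test_indices(n_items, test_fraction=0.2, seed=0):
    """Shuffle item indices and split them (seed and shuffling are assumptions)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_test = int(round(n_items * test_fraction))
    return order[n_test:], order[:n_test]


# Synthetic 3-second clip standing in for a real recording.
clip = np.ones(TARGET_SR * 3, dtype=np.float32)
padded = pad_or_truncate(clip)
train_idx, test_idx = train_test_indices(1440)
print(padded.shape, len(train_idx), len(test_idx))
```

With 1,440 files, an 80/20 split leaves 1,152 training samples and 288 test samples.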
Model Architecture
- DistilHuBERT: A lightweight variant of HuBERT, fine-tuned for emotion classification.
- Fine-Tuning Setup:
  - Optimizer: AdamW
  - Loss Function: Cross-Entropy
  - Learning Rate: 5e-5
  - Warm-up Ratio: 0.1
  - Epochs: 7
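To make the warm-up ratio concrete, the sketch below computes how the learning rate might evolve over training. It assumes a batch size of 8 and a linear warm-up followed by linear decay; neither is stated in this card, so treat the resulting step counts as illustrative only.

```python
import math

BASE_LR = 5e-5        # from the setup above
WARMUP_RATIO = 0.1    # from the setup above
EPOCHS = 7            # from the setup above
TRAIN_SAMPLES = 1152  # 80% of 1,440 files
BATCH_SIZE = 8        # assumption: not stated in the card


def lr_at(step, total_steps, warmup_steps, base_lr=BASE_LR):
    """Linear warm-up to base_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(f"total steps: {total_steps}, warm-up steps: {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step, total_steps, warmup_steps))
```

Under these assumptions the learning rate climbs from zero to 5e-5 over roughly the first tenth of training and then decays back to zero by the final step.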
Results
- Accuracy: 0.98 on the test dataset
- Loss: 0.10 on the test dataset
Usage
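from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("audio-classification", model="BilalHasan/distilhubert-finetuned-ravdess")
path_to_your_audio = "path/to/your_audio.wav"  # replace with your own audio file
emotion = pipe(path_to_your_audio)  # list of {"label", "score"} predictions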
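The audio-classification pipeline returns a list of dictionaries, each with a `label` and a `score`. A small helper for picking the most likely emotion might look like the sketch below; `top_emotion` is a hypothetical name, and the example predictions are placeholder values for illustration, not real model output.

```python
def top_emotion(predictions):
    """Return the (label, score) pair with the highest score."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


# Placeholder values for illustration only, not real model output.
example_predictions = [
    {"label": "happy", "score": 0.7},
    {"label": "calm", "score": 0.2},
    {"label": "sad", "score": 0.1},
]

label, score = top_emotion(example_predictions)
print(f"{label} ({score:.2f})")
```

In practice you would pass the value returned by the pipeline call in place of `example_predictions`.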
Demo
You can access the live demo of the app on Hugging Face Spaces.