AST Fine-Tuned Model for Emotion Classification
This is an Audio Spectrogram Transformer (AST) model fine-tuned to classify emotions in speech audio. It was fine-tuned on the CREMA-D dataset across six emotional categories, starting from MIT's AST checkpoint pre-trained on AudioSet.
Model Details
- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
- Fine-Tuned Dataset: CREMA-D
- Architecture: Audio Spectrogram Transformer (AST)
- Model Type: Single-label classification
- Input Features: Log-Mel Spectrograms (128 mel bins)
- Output Classes:
- ANG: Anger
- DIS: Disgust
- FEA: Fear
- HAP: Happiness
- NEU: Neutral
- SAD: Sadness
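For reference, the resulting label mapping should look like the sketch below. The index order is an assumption and should be verified against `model.config.id2label`:

```python
# Assumed id2label mapping (verify against model.config.id2label)
id2label = {
    0: "ANG",  # Anger
    1: "DIS",  # Disgust
    2: "FEA",  # Fear
    3: "HAP",  # Happiness
    4: "NEU",  # Neutral
    5: "SAD",  # Sadness
}
```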
Model Configuration
- Hidden Size: 768
- Number of Attention Heads: 12
- Number of Hidden Layers: 12
- Patch Size: 16
- Maximum Length: 1024 (spectrogram time frames)
- Dropout Probability: 0.0
- Activation Function: GELU (Gaussian Error Linear Unit)
- Optimizer: Adam
- Learning Rate: 1e-4
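These values map directly onto the `ASTConfig` class in transformers. A minimal sketch of rebuilding the architecture from them (for illustration only; the optimizer and learning rate are training settings, not part of the config):

```python
from transformers import ASTConfig, ASTForAudioClassification

# Rebuild the architecture from the values listed above.
# num_mel_bins comes from the Input Features entry; num_labels from the six classes.
config = ASTConfig(
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    patch_size=16,
    max_length=1024,
    num_mel_bins=128,
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    hidden_act="gelu",
    num_labels=6,
)
model = ASTForAudioClassification(config)  # randomly initialized, for illustration
```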
Training Details
- Dataset: CREMA-D (Emotion-Labeled Speech Data)
- Data Augmentation:
- Noise injection
- Time shifting
- Speed perturbation
- Fine-Tuning Epochs: 5
- Batch Size: 16
- Learning Rate Scheduler: Linear decay
- Best Validation Accuracy: 60.71%
- Best Checkpoint: ./results/checkpoint-1119
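The hyperparameters above correspond to a standard Hugging Face Trainer run. The sketch below shows how such a setup might look; it is an illustration, not the original training script, and `train_ds`/`eval_ds` stand in for preprocessed CREMA-D splits (augmentation omitted):

```python
import numpy as np
import evaluate
from transformers import ASTForAudioClassification, Trainer, TrainingArguments

# Start from the pre-trained base model with a fresh 6-way classification head
model = ASTForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    num_labels=6,
    ignore_mismatched_sizes=True,  # the AudioSet head has 527 classes
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # Top-1 accuracy over the validation split
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

# train_ds / eval_ds are hypothetical preprocessed CREMA-D splits
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    lr_scheduler_type="linear",  # linear decay
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
```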
How to Use
Load the Model
```python
from transformers import AutoModelForAudioClassification, AutoProcessor
import librosa
import torch

# Load the model and processor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio as a waveform at the 16 kHz sampling rate the model expects
# (the processor takes a waveform array, not a file path)
waveform, _ = librosa.load("path_to_audio.wav", sr=16000)

# Convert the waveform into log-mel spectrogram features
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```
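To inspect the scores for all six emotions rather than just the top prediction, apply a softmax over the logits:

```python
# Per-class probabilities for the same inputs as above
probs = torch.softmax(outputs.logits, dim=-1).squeeze()
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```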
Metrics
Validation Results
- Best Validation Accuracy: 60.71%
- Validation Loss: 1.1126
Evaluation Details
- Eval Dataset: CREMA-D test split
- Batch Size: 16
- Number of Steps: 94 (≈1,500 clips at batch size 16)
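With a Trainer configured as sketched under Training Details, these figures correspond to a plain evaluation pass over the test split:

```python
# 94 steps at batch size 16 covers roughly 1,500 test clips
metrics = trainer.evaluate(eval_dataset=eval_ds)
print(metrics["eval_loss"], metrics["eval_accuracy"])
```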
Limitations
- The model was trained on CREMA-D, which consists of acted emotional speech from English-speaking actors. It may not generalize well to other accents, speaking styles, languages, or spontaneous (non-acted) speech.
- The best validation accuracy is 60.71%, which leaves substantial room for improvement before real-world deployment.
Acknowledgments
This work is based on the Audio Spectrogram Transformer (AST) model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.
License
The model is shared under the MIT License. Refer to the licensing details in the repository.
Citation
If you use this model in your work, please cite:
```bibtex
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title  = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year   = {2024},
  url    = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
```
Contact
For questions, reach out to forwarder1121@naver.com.