Fine-Tuned MiniLM for GoEmotions Sentiment Analysis
This repository contains a fine-tuned version of Microsoft's MiniLM-v2 model, optimized for sentiment analysis on the GoEmotions dataset. At just 90 MB, the model is well suited to memory-constrained environments.
The model classifies text into the following emotional/sentiment categories:
- anger
- approval
- confusion
- disappointment
- disapproval
- gratitude
- joy
- sadness
- neutral
Together, these nine categories cover most of the sentiments commonly expressed in a sentence. This also makes the model useful for validating other sentiment analysis models.
Label mapping when using inference:
{
  "LABEL_0": "anger",
  "LABEL_1": "approval",
  "LABEL_2": "confusion",
  "LABEL_3": "disappointment",
  "LABEL_4": "disapproval",
  "LABEL_5": "gratitude",
  "LABEL_6": "joy",
  "LABEL_7": "sadness",
  "LABEL_8": "neutral"
}
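When predictions come back as generic `LABEL_n` strings, they can be translated with a small lookup. The following is a minimal sketch that assumes the index order shown in the mapping above; `SENTIMENTS` and `to_sentiment` are hypothetical names used for illustration and are not part of this repository.

```python
# Class names in the index order given by the label mapping above.
SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]


def to_sentiment(label: str) -> str:
    """Translate a generic label such as 'LABEL_3' into its sentiment name.

    Hypothetical helper for illustration; raises ValueError on unknown labels.
    """
    prefix, _, index = label.partition("_")
    if prefix != "LABEL" or not index.isdecimal() or int(index) >= len(SENTIMENTS):
        raise ValueError(f"unexpected label: {label!r}")
    return SENTIMENTS[int(index)]


print(to_sentiment("LABEL_3"))  # disappointment
```

Raising on an unknown label, rather than returning a default, makes a mismatch between the model's configuration and this list visible early.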
Why MiniLM?
MiniLM is a distilled version of larger language models like BERT and RoBERTa. It strikes a remarkable balance between performance and efficiency:
- Reduced Size: MiniLM is significantly smaller than its parent models, making it faster to load and deploy, especially in resource-constrained environments.
- Comparable Performance: Despite its compact size, MiniLM maintains surprisingly high accuracy on various natural language processing (NLP) tasks, including sentiment analysis.
- Distillation: MiniLM is trained to mimic the self-attention behavior of a larger teacher model, so it retains much of the teacher's knowledge in a smaller network, which makes it practical for real-world applications.
GoEmotions Dataset
google-research-datasets/go_emotions
The GoEmotions dataset is a valuable resource for sentiment analysis. It consists of about 58,000 Reddit comments labeled with 27 emotion categories plus neutral; this model uses the nine emotional/sentiment classes listed above. The dataset's diverse expressions of emotion make it a good choice for training a versatile sentiment analysis model.
Training Procedure
- Data Preprocessing: The GoEmotions dataset was preprocessed to ensure consistency and remove noise.
- Tokenizer: The MiniLM-v2 tokenizer was used to convert text into numerical representations suitable for the model.
- Fine-Tuning: The MiniLM-v2 model was fine-tuned on the GoEmotions dataset using a standard training loop. The model's parameters were adjusted to optimize its performance on sentiment classification.
- Evaluation: The fine-tuned model was evaluated on a held-out test set to measure its accuracy and generalization capabilities.
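The evaluation step above does not state which metrics were used. As a rough sketch, assuming integer class ids from 0 to 8 and assuming accuracy and macro-averaged F1 as the metrics (both are assumptions, not something this card confirms), the scores could be computed as follows. The function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, num_classes=9):
    """Unweighted mean of per-class F1 over all classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A class with no true or predicted examples contributes 0 here.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy labels for illustration only; these are not results from this model.
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred, num_classes=3))
```

Macro F1 weights every class equally, which can be more informative than accuracy when some classes, such as neutral, are much more frequent than others.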
How to Use This Model
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
required_sentiments = ['anger', 'approval', 'confusion', 'disappointment', 'disapproval', 'gratitude', 'joy', 'sadness', 'neutral']
model = AutoModelForSequenceClassification.from_pretrained('./saved_model')
tokenizer = AutoTokenizer.from_pretrained('./saved_model')
text = "How can you be so careless"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding='max_length', max_length=128)
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1).item()
# Map the label to sentiment
label_mapping = {idx: sentiment for idx, sentiment in enumerate(required_sentiments)}
predicted_sentiment = label_mapping[predictions]
print(f'Text: {text}')
print(f'Predicted Sentiment: {predicted_sentiment}')
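The snippet above reports only the single most likely class. If a confidence score or a ranked list is also wanted, the logits can be passed through a softmax. The following is a minimal sketch in plain Python; `rank_sentiments` is a hypothetical helper, and the example logits are made up for illustration (in practice they could come from `outputs.logits[0].tolist()`).

```python
import math

# Class names in the same index order as required_sentiments above.
SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]


def rank_sentiments(logits, top_k=3):
    """Return the top_k (sentiment, probability) pairs from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(SENTIMENTS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up logits for illustration only; not real model output.
example_logits = [2.0, 0.1, 0.3, 1.2, 1.5, -1.0, -0.8, 0.4, 0.0]
for name, prob in rank_sentiments(example_logits):
    print(f"{name}: {prob:.3f}")
```

Looking at the top few classes rather than only the argmax can help with borderline inputs, for example text that sits between disappointment and disapproval.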