XLM-RoBERTa Hinglish Sentiment Analysis
A fine-tuned XLM-RoBERTa model for sentiment classification of Hinglish text, the code-mixed Hindi-English language used by hundreds of millions of Indians online.
Most sentiment models are trained on clean English or formal Hindi. They fail badly on Hinglish because it is neither; it is a fluid mix of both, written in Roman script, full of slang, abbreviations, and cultural references. This model is trained specifically to handle that.
Model details
| Property | Value |
|---|---|
| Base model | FacebookAI/xlm-roberta-base |
| Fine-tuned on | Self-annotated Hinglish YouTube comments |
| Task | 3-class sentiment classification |
| Classes | Negative (0), Neutral (1), Positive (2) |
| Training samples | ~2,500 |
| Test samples | ~638 |
Performance
Evaluated on a held-out test set of Hinglish YouTube comments.
| Model | Weighted F1 |
|---|---|
| VADER (baseline) | 0.39 |
| XLM-RoBERTa (this model) | 0.67 |
VADER scores only 0.39 on Hinglish because its English lexicon has no entries for Roman-script Hindi words like acha, bahut, zabardast, and no rules for mixed constructions like "yaar this song is too good na?". XLM-RoBERTa's multilingual pretraining gives it a stronger foundation for handling such text.
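The card does not say how VADER's output was mapped onto the three classes. As a minimal sketch, assuming the conventional cutoffs of ±0.05 on VADER's compound score (the function name `compound_to_class` and the `threshold` parameter are hypothetical), the mapping could look like this:

```python
# Class ids follow the "Classes" row of the model details table.
NEGATIVE, NEUTRAL, POSITIVE = 0, 1, 2


def compound_to_class(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a 3-class label id.

    The +/-0.05 cutoffs are the commonly used VADER defaults; whether the
    baseline in this card used them is an assumption.
    """
    if compound >= threshold:
        return POSITIVE
    if compound <= -threshold:
        return NEGATIVE
    return NEUTRAL


# A comment with no words in VADER's lexicon gets a compound score of 0.0,
# so it falls into the Neutral class regardless of its actual sentiment.
print(compound_to_class(0.0))
```

Under this mapping, any comment made up entirely of Roman-script Hindi words would be scored as Neutral, which is one plausible reason for the low baseline.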
Per-class results
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Negative | 0.66 | 0.77 | 0.71 |
| Neutral | 0.60 | 0.48 | 0.53 |
| Positive | 0.70 | 0.64 | 0.67 |
Neutral is the hardest class (F1 0.53, with recall of only 0.48), as neutral comments are often ambiguous even for human annotators.
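As a quick consistency check, the F1 column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The sketch below uses only the numbers from the table; the helper name `f1_score` is chosen here for illustration.

```python
# Precision and recall per class, copied from the per-class results table.
per_class = {
    "Negative": (0.66, 0.77),
    "Neutral": (0.60, 0.48),
    "Positive": (0.70, 0.64),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_class = {
    name: round(f1_score(p, r), 2) for name, (p, r) in per_class.items()
}
print(f1_by_class)
# {'Negative': 0.71, 'Neutral': 0.53, 'Positive': 0.67}
```

The recomputed values match the table. The overall weighted F1 cannot be reproduced the same way, because it also depends on the number of test samples per class, which the card does not list.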
Dataset
The training data is a self-annotated dataset of 3,000+ Hinglish YouTube comments scraped from Indian music, entertainment, and pop culture videos. Comments were manually labeled as Positive, Negative, or Neutral.
Dataset: shae2977/hinglish-youtube-sentiment-dataset
Usage
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
import torch

model_name = "shae2977/xlm-roberta-hinglish-sentiment-analysis"
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
model = XLMRobertaForSequenceClassification.from_pretrained(model_name)
model.eval()

labels = {0: "Negative", 1: "Neutral", 2: "Positive"}

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    predicted = torch.argmax(outputs.logits, dim=1).item()
    return labels[predicted]

predict("yaar ye song bahut zabardast hai")  # Positive
predict("bilkul bekar tha")  # Negative
predict("theek tha, kuch khaas nahi")  # Neutral
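The `predict` function above returns only the top label. If a confidence score is also useful, the three logits can be passed through a softmax. The following is a minimal sketch in plain Python, so it runs without the model; the helper names and the example logits are hypothetical, and in practice the input would be something like `outputs.logits[0].tolist()`.

```python
import math

LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for illustration only, not real model output.
example_logits = [-1.2, 0.3, 2.1]
label, confidence = label_with_confidence(example_logits)
print(label, round(confidence, 2))
```

Softmax probabilities from a fine-tuned classifier are not necessarily well calibrated, so the score is best read as a relative ranking signal rather than a true probability.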
Training details
| Hyperparameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Batch size | 16 |
| Max sequence length | 128 |
| Optimizer | AdamW |
| Epochs | 7 |
| Random seed | 42 |
| Hardware | NVIDIA T4 (Google Colab) |
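For a rough sense of training length, the table can be combined with the approximate training set size from the model details. This sketch assumes one optimizer step per batch (no gradient accumulation) and that the final partial batch of each epoch is kept; neither is stated in the card, and the figure of 2,500 samples is itself approximate.

```python
import math

# Values copied from the training details table.
hyperparams = {
    "learning_rate": 1e-5,
    "batch_size": 16,
    "max_seq_length": 128,
    "epochs": 7,
    "seed": 42,
}

# Approximate training set size from the model details table (~2,500).
train_samples = 2500

# Assumption: no gradient accumulation, last partial batch not dropped.
steps_per_epoch = math.ceil(train_samples / hyperparams["batch_size"])
total_steps = steps_per_epoch * hyperparams["epochs"]

print(steps_per_epoch, total_steps)
```

Under these assumptions the run comes to roughly 157 steps per epoch and about 1,100 optimizer steps in total.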
Limitations
- Trained on YouTube comments from Indian entertainment content; may not generalize well to other domains such as politics or sports
- Neutral class performance is weaker than Positive/Negative
- Dataset size is limited (~3,000 samples); a larger annotated dataset would likely improve performance
- An overall weighted F1 of 0.67 means a substantial share of comments will still be misclassified
Author
Built by shae2977