HinSpam: Hinglish Spam & Ham Classifier
HinSpam is a binary text classifier trained to detect spam messages in Hinglish, the code-mixed Hindi-English variety dominant across Indian SMS, WhatsApp, and online messaging platforms. It distinguishes between spam (unsolicited promotional, scam, or phishing messages) and ham (genuine, organic conversation), with strong performance on the kinds of deceptive, urgent-language spam that targets Indian mobile users.
Model Description
| Property | Details |
|---|---|
| Task | Binary Text Classification |
| Label 0 | ham: legitimate, non-spam message |
| Label 1 | spam: unsolicited, scam, or phishing message |
| Language | Hinglish (Hindi-English code-mixed, Roman script) |
| Domain | SMS, WhatsApp, online messaging |
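The label mapping in the table above could be applied at inference time roughly as follows. This is a minimal sketch, not the author's documented usage: it assumes the checkpoint at `Keshav0av/HinSpam` loads with the `transformers` text-classification pipeline and that its raw labels are either `ham`/`spam` or end in the class index (for example `LABEL_1`). Neither is stated in this card. The helper names `to_name` and `classify` are hypothetical.

```python
# Mapping taken from the table above: 0 = ham, 1 = spam.
ID2NAME = {0: "ham", 1: "spam"}


def to_name(raw_label: str) -> str:
    """Map a raw label such as 'LABEL_1', '1', or 'spam' to 'ham' or 'spam'."""
    text = raw_label.strip().lower()
    if text in ID2NAME.values():
        return text
    # Assumption: otherwise the label ends in the class index, e.g. 'LABEL_1'.
    return ID2NAME[int(text.rsplit("_", 1)[-1])]


def classify(messages, model_id="Keshav0av/HinSpam"):
    """Return (label_name, score) pairs for an iterable of messages."""
    # Imported lazily so to_name() works without transformers installed.
    from transformers import pipeline

    # Assumption: the repository is compatible with this pipeline task.
    clf = pipeline("text-classification", model=model_id)
    return [(to_name(r["label"]), r["score"]) for r in clf(list(messages))]
```

Calling `classify(["Aaj bahut bore ho raha hu yaar"])` downloads the checkpoint on first use, so it needs network access and the `transformers` package.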
Key Features
Hinglish-Native Spam Detection
Most spam classifiers are trained on English or pure Hindi corpora and fail on the code-mixed reality of Indian messaging. HinSpam is trained natively on Hinglish, capturing the mixed-language patterns used both by real users and by scammers targeting Indian audiences.
Phishing & Scam Pattern Recognition
The model is specifically tuned to catch common Indian spam tropes:
- Fake lottery and prize claims, e.g., "50,000 rupey jeete hain, abhi claim karein"
- OTP forwarding scams: requests to share OTPs with "managers" or "bank officials"
- Malicious links: suspicious URLs embedded in prize or account-alert messages
- Urgency language: pressure tactics like "abhi click karein" or "sirf aaj ke liye"
- Impersonation: messages posing as banks, government schemes, or delivery services
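To make the surface cues behind these tropes concrete, here is a small keyword baseline. It is explicitly not how HinSpam works (the card describes a trained classifier); it is a hand-written regex sketch, with pattern names and word lists chosen for illustration from the phrases quoted above. A baseline like this can serve as a sanity check or a point of comparison for the model.

```python
import re

# Hypothetical cue patterns built from phrases quoted in this card.
CUES = {
    "otp_request": re.compile(r"\botp\b", re.IGNORECASE),
    "link": re.compile(r"(https?://|www\.)\S+", re.IGNORECASE),
    "urgency": re.compile(r"\b(abhi|sirf aaj)\b", re.IGNORECASE),
    "prize_claim": re.compile(r"\b(lottery|winner|jeete|claim)\b", re.IGNORECASE),
}


def spam_cues(message: str) -> list:
    """Return the names of all cue patterns that match the message."""
    return [name for name, pattern in CUES.items() if pattern.search(message)]
```

Words such as "abhi" also appear in ordinary conversation, so keyword rules of this kind tend to over-flag casual Hinglish, which is the gap a trained classifier is meant to close.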
Low False-Positive Rate on Casual Hinglish
Everyday Hinglish conversation (complaints, plans, banter) is often informal and noisy. HinSpam maintains high precision on ham, avoiding the over-flagging that plagues generic spam filters on non-English text.
Performance Metrics
Evaluated on a held-out test set of Hinglish messages:
| Metric | Score |
|---|---|
| Accuracy | 0.9530 |
| F1 Score | 0.9490 |
| Precision | 0.9340 |
| Recall | 0.9650 |
| MCC (Matthews Correlation Coefficient) | 0.9060 |
Recall = 0.9650: The model catches about 96.5% of the spam in the test set, making it effective as a first-pass filter in messaging pipelines.
MCC = 0.9060: A balanced metric that uses all four cells of the confusion matrix, indicating reliable classification across both classes, even under mild label imbalance.
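For readers who want to reproduce metrics of this kind on their own test set, the standard definitions can be sketched as below, with spam as the positive class. The confusion-matrix counts in the usage note are made up for illustration; the card does not publish the counts behind the scores above, and `binary_metrics` is a hypothetical helper name.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary metrics from confusion-matrix counts.

    Positive class = spam (label 1). Assumes every denominator is non-zero,
    i.e. both classes occur and both are predicted at least once.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }
```

For example, `binary_metrics(tp=90, fp=10, fn=5, tn=95)` uses illustrative counts only. The same F1 formula applied to the reported precision and recall gives a value within about 0.001 of the reported F1, so the table is internally consistent up to rounding.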
Intended Use
HinSpam is designed for:
- SMS and messaging spam filters for Indian telecom and app platforms
- WhatsApp / chat moderation tools targeting scam and phishing messages in Hinglish
- Financial fraud prevention: catching OTP-forwarding and fake prize scams before they reach users
- Research on code-mixed spam detection and NLP for Indian languages
Training Data
The model was trained on a curated dataset of Hinglish messages covering a wide range of spam and ham examples:
- Label 0 (ham): Casual conversation, personal messages, everyday Hinglish banter
- Label 1 (spam): Lottery scams, OTP phishing, prize claim fraud, malicious links, fake bank alerts
Example Data Points
"Tera naya job kaisa chal raha hai boss kaisa hai wahan ka theek hai na.", ham
"Aapke account mein scratch winner se jeete hue 50000 rupey credit krene ke liye apna otp diye gaye number par forward kre", spam
"Bhai tu kab sudhrega hamesha late aane ki aadat hai teri toh pakki.", ham
"500000 jitne ke liye is link par click kre: www.maha-winner-india.in", spam
"Aaj bahut bore ho raha hu yaar kuch karne ka plan bata theek sa.", ham
"Aapke account mein lottery se jeete hue 100000 rupey credit krene ke liye apna otp manager ko forward kre aur claim karein", spam
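Rows in the `"text", label` form shown above can be turned into `(text, label_id)` pairs with the standard `csv` module. This is a sketch under the assumption that the dataset is stored in exactly this quoted, comma-separated layout, which the card does not confirm; `parse_examples` and `SAMPLE` are hypothetical names, and the label ids follow the 0 = ham, 1 = spam convention used in this card.

```python
import csv
import io

# Label ids as defined in this card: 0 = ham, 1 = spam.
LABEL2ID = {"ham": 0, "spam": 1}


def parse_examples(raw: str) -> list:
    """Parse lines of the form '"message text", label' into (text, id) pairs."""
    # skipinitialspace drops the space that follows the comma before the label.
    reader = csv.reader(io.StringIO(raw), skipinitialspace=True)
    rows = []
    for record in reader:
        if len(record) != 2:
            continue  # skip blank or malformed lines
        text, label = record
        rows.append((text.strip(), LABEL2ID[label.strip().lower()]))
    return rows


# Two of the example data points above, copied verbatim.
SAMPLE = '''"Bhai tu kab sudhrega hamesha late aane ki aadat hai teri toh pakki.", ham
"500000 jitne ke liye is link par click kre: www.maha-winner-india.in", spam
'''
```

`parse_examples(SAMPLE)` yields one pair per line, with the spellings in the messages left untouched since they are part of the data.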
Related Models
- HinTox: Hinglish hate speech and abusive language detector with obfuscation robustness (leetspeak, misspellings)
Citation
If you use HinSpam in your research or product, please cite:
@misc{hinspam2025,
title = {HinSpam: Hinglish Spam Detection for Code-Mixed Indian Messaging},
year = {2026},
note = {HuggingFace Model Hub},
url = {https://huggingface.co/Keshav0av/HinSpam}
}
Contact
For questions, feedback, or collaboration, open an issue on the model repository or reach out via HuggingFace.