nahiar committed on
Commit
582daa9
·
verified ·
1 Parent(s): c8fe68d

Update README.md

Files changed (1)
  1. README.md +190 -189
README.md CHANGED
@@ -1,190 +1,191 @@
1
- ---
2
- language:
3
- - id
4
- - eng
5
- library_name: transformers
6
- pipeline_tag: text-classification
7
- tags:
8
- - text-classification
9
- - spam-detection
10
- - indonesian
11
- - multilingual
12
- - xlm-roberta
13
- - social-media
14
- license: apache-2.0
15
- metrics:
16
- - accuracy
17
- - f1
18
- base_model:
19
- - FacebookAI/xlm-roberta-base
20
- ---
21
-
22
- # Spam Detection for Social Media Text
23
- **Multilingual Indonesian & English | XLM-RoBERTa**
24
-
25
- This model is a fine-tuned **XLM-RoBERTa** designed to detect **Spam vs Ham** content in social media text.
26
- It supports **Indonesian** and **English Languages**, making it suitable for multi-platform moderation use cases such as Twitter/X, Instagram, TikTok, Facebook, and online forums.
27
-
28
- ---
29
-
30
- ## ✨ Key Features
31
-
32
- - ✅ Spam vs Ham classification
33
- - 🌏 Multilingual support (Indonesian & English)
34
- - 🧠 Based on **XLM-RoBERTa (multilingual transformer)**
35
- - ⚡ Ready-to-use with Hugging Face `pipeline`
36
- - 📊 Strong performance on noisy social media text
37
-
38
- ---
39
-
40
- ## 🌍 Supported Languages
41
-
42
- - 🇮🇩 Bahasa Indonesia
43
- - 🇬🇧 English
44
-
45
- ---
46
-
47
- ## 🧪 Model Performance
48
-
49
- | Metric | Score |
50
- |---------------------|--------|
51
- | Accuracy | 0.9645 |
52
- | F1 (Macro) | 0.9639 |
53
- | F1 (Weighted) | 0.9700 |
54
- | Precision | 0.9700 |
55
- | Recall | 0.9600 |
56
- | Training Loss | 0.0637 |
57
- | Validation Loss | 0.1242 |
58
-
59
- > Evaluated on held-out validation data with balanced spam/ham distribution.
60
-
61
- ---
62
-
63
- ## 🚀 Quick Start
64
-
65
- ### Installation
66
- ```bash
67
- pip install transformers torch
68
- ````
69
-
70
- ### Single Prediction
71
-
72
- ```python
73
- from transformers import pipeline
74
-
75
- classifier = pipeline(
76
- task="text-classification",
77
- model="nahiar/spam-detection-xlm-roberta-v1"
78
- )
79
-
80
- result = classifier("PASTI DIJAMIN WDP 100%")
81
- print(result)
82
- ```
83
-
84
- **Output**
85
-
86
- ```python
87
- [{'label': 'LABEL_1', 'score': 0.9876}]
88
- ```
89
-
90
- ### Label Mapping
91
-
92
- ```text
93
- LABEL_0 → SPAM
94
- LABEL_1 → HAM
95
- ```
96
-
97
- ---
98
-
99
- ## 📦 Batch Inference Example
100
-
101
- ```python
102
- "texts": [
103
- "साइबर हमले के बाद JLR का बड़ा बयान - जानें कंपनी ने क्या कहा | Tata Motors के शेयर पर दिखेगा असर?
104
-
105
- #TataMotors #JLR #CyberAttack
106
-
107
- https://t.co/6WlGS77UUp",
108
- "Kita sudah Ready skrg ini bagi yang memerlukan jasa pemulihan akun & Hapus All akun
109
-
110
- Lacak lokasi / sadap wa / Hack Akun / Revengeporn - korban pemerasan vcs / terror
111
-
112
- TIKTOK,GMAIL,TWITER,TELEGRAM,
113
- FACEBOOK,INSTAGRAM
114
- #revengeporn #zonauangᅠᅠᅠ
115
- ☎️ https://t.co/K0AbW08qnU https://t.co/4IpWNA7a0z",
116
- "💥Slot Gacor Hari ini Rute303
117
- 💥Jaminan Jackpot Maxwin malam ini
118
-
119
- LINK SLOT GACOR HARI INI : https://t.co/QvxjCAnt8o
120
-
121
- Tags:
122
- Jumbo #timsekop Jumat gratis ongkir Like Crazy PSIM https://t.co/ukuRdlvgGA"
123
- ]
124
-
125
- results = classifier(texts)
126
-
127
- for text, result in zip(texts, results):
128
- print(f"{text} -> {result['label']} ({result['score']:.4f})")
129
- ```
130
-
131
- ---
132
-
133
- ## 🏗️ Training Configuration
134
-
135
- | Parameter | Value |
136
- | ------------------ | ---------------- |
137
- | Base Model | xlm-roberta-base |
138
- | Training Samples | 22,243 |
139
- | Validation Samples | 5,561 |
140
- | Epochs | 3 |
141
- | Learning Rate | 2e-5 |
142
- | Batch Size | 16 |
143
- | Training Date | 2026-01-21 |
144
-
145
- ---
146
-
147
- ## 🎯 Intended Use Cases
148
-
149
- * Social media spam moderation
150
- * Comment & post filtering
151
- * Content quality control
152
- * Pre-filtering for sentiment or topic analysis pipelines
153
-
154
- ---
155
-
156
- ## ⚠️ Limitations
157
-
158
- * Binary classification only (Spam / Ham)
159
- * Not optimized for non-social-media formal text
160
- * Performance may degrade on very short or ambiguous messages
161
-
162
- ---
163
-
164
- ## 📜 License
165
-
166
- Released under the **Apache 2.0 License**.
167
- Free for commercial and research use.
168
-
169
- ---
170
-
171
- ## 📚 Citation
172
-
173
- If you use this model in your work, please cite:
174
-
175
- ```bibtex
176
- @misc{djunaedi2026spam,
177
- author = {AI/ML Engineer ADS Digital Partner},
178
- title = {Spam Detection for Social Media Text},
179
- year = {2025},
180
- publisher = {Hugging Face},
181
- url = {https://huggingface.co/nahiar/spam-detection-xlm-roberta-v1}
182
- }
183
- ```
184
-
185
- ---
186
-
187
- ## 🙌 Acknowledgements
188
-
189
- * Hugging Face Transformers
 
190
  * Facebook AI Research — XLM-RoBERTa
 
1
+ ---
2
+ language:
3
+ - id
4
+ - eng
5
+ library_name: transformers
6
+ pipeline_tag: text-classification
7
+ tags:
8
+ - text-classification
9
+ - sentiment-analysis
10
+ - indonesian
11
+ - multilingual
12
+ - xlm-roberta
13
+ - social-media
14
+ license: apache-2.0
15
+ metrics:
16
+ - accuracy
17
+ - f1
18
+ base_model:
19
+ - FacebookAI/xlm-roberta-base
20
+ ---
21
+
22
+ # Sentiment Analysis for Social Media Text
23
+ **Multilingual Indonesian & English | XLM-RoBERTa**
24
+
25
+ This model is a fine-tuned **XLM-RoBERTa-Base** designed to classify social media text as **Positive, Neutral, or Negative** sentiment.
26
+ It supports **Indonesian** and **English**, making it suitable for multi-platform moderation use cases such as Twitter/X, Instagram, TikTok, Facebook, and online forums.
27
+
28
+ ---
29
+
30
+ ## ✨ Key Features
31
+
32
+ - ✅ Positive, Neutral, and Negative sentiment classification
33
+ - 🌏 Multilingual support (Indonesian & English)
34
+ - 🧠 Based on **XLM-RoBERTa (multilingual transformer)**
35
+ - ⚡ Ready-to-use with Hugging Face `pipeline`
36
+ - 📊 Strong performance on noisy social media text
37
+
38
+ ---
39
+
40
+ ## 🌍 Supported Languages
41
+
42
+ - 🇮🇩 Bahasa Indonesia
43
+ - 🇬🇧 English
44
+
45
+ ---
46
+
47
+ ## 🧪 Model Performance
48
+
49
+ | Metric | Score |
50
+ |---------------------|--------|
51
+ | Accuracy | 0.8527 |
52
+ | F1 (Macro) | 0.8525 |
53
+ | F1 (Weighted) | 0.8525 |
54
+ | Precision | 0.8500 |
55
+ | Recall | 0.8500 |
56
+ | Training Loss | 0.2759 |
57
+ | Validation Loss | 0.4368 |
58
+
59
+ > Evaluated on held-out validation data with balanced sentiment distribution.
60
+
61
+ ---
62
+
63
+ ## 🚀 Quick Start
64
+
65
+ ### Installation
66
+ ```bash
67
+ pip install transformers torch
68
+ ````
69
+
70
+ ### Single Prediction
71
+
72
+ ```python
73
+ from transformers import pipeline
74
+
75
+ classifier = pipeline(
76
+ task="text-classification",
77
+ model="nahiar/sentiment-analysis-v2"
78
+ )
79
+
80
+ result = classifier("PASTI DIJAMIN WDP 100%")
81
+ print(result)
82
+ ```
83
+
84
+ **Output**
85
+
86
+ ```python
87
+ [{'label': 'LABEL_1', 'score': 0.9876}]
88
+ ```
89
+
90
+ ### Label Mapping
91
+
92
+ ```text
93
+ LABEL_0 → NEUTRAL
94
+ LABEL_1 → POSITIVE
95
+ LABEL_2 → NEGATIVE
96
+ ```
97
+
98
+ ---
99
+
100
+ ## 📦 Batch Inference Example
101
+
102
+ ```python
103
+ texts = [
104
+ """साइबर हमले के बाद JLR का बड़ा बयान - जानें कंपनी ने क्या कहा | Tata Motors के शेयर पर दिखेगा असर?
105
+
106
+ #TataMotors #JLR #CyberAttack
107
+
108
+ https://t.co/6WlGS77UUp""",
109
+ """Kita sudah Ready skrg ini bagi yang memerlukan jasa pemulihan akun & Hapus All akun
110
+
111
+ Lacak lokasi / sadap wa / Hack Akun / Revengeporn - korban pemerasan vcs / terror
112
+
113
+ TIKTOK,GMAIL,TWITER,TELEGRAM,
114
+ FACEBOOK,INSTAGRAM
115
+ #revengeporn #zonauangᅠᅠᅠ
116
+ ☎️ https://t.co/K0AbW08qnU https://t.co/4IpWNA7a0z""",
117
+ """💥Slot Gacor Hari ini Rute303
118
+ 💥Jaminan Jackpot Maxwin malam ini
119
+
120
+ LINK SLOT GACOR HARI INI : https://t.co/QvxjCAnt8o
121
+
122
+ Tags:
123
+ Jumbo #timsekop Jumat gratis ongkir Like Crazy PSIM https://t.co/ukuRdlvgGA"""
124
+ ]
125
+
126
+ results = classifier(texts)
127
+
128
+ for text, result in zip(texts, results):
129
+     print(f"{text} -> {result['label']} ({result['score']:.4f})")
130
+ ```
131
+
132
+ ---
133
+
134
+ ## 🏗️ Training Configuration
135
+
136
+ | Parameter | Value |
137
+ | ------------------ | ---------------- |
138
+ | Base Model | xlm-roberta-base |
139
+ | Training Samples | 19,200 |
140
+ | Validation Samples | 4,800 |
141
+ | Epochs | 3 |
142
+ | Learning Rate | 1e-5 |
143
+ | Batch Size | 16 |
144
+ | Training Date | 2026-02-05 |
145
+
146
+ ---
147
+
148
+ ## 🎯 Intended Use Cases
149
+
150
+ * Social media sentiment analysis
151
+ * Comment & post filtering
152
+ * Content quality control
153
+
154
+ ---
155
+
156
+ ## ⚠️ Limitations
157
+
158
+ * Three-class classification only (Positive, Negative, Neutral)
159
+ * Not optimized for non-social-media formal text
160
+ * Performance may degrade on very short or ambiguous messages
161
+ * The model may still produce biased predictions
162
+
163
+ ---
164
+
165
+ ## 📜 License
166
+
167
+ Released under the **Apache 2.0 License**.
168
+ Free for commercial and research use.
169
+
170
+ ---
171
+
172
+ ## 📚 Citation
173
+
174
+ If you use this model in your work, please cite:
175
+
176
+ ```bibtex
177
+ @misc{djunaedi2026sentiment,
178
+ author = {AI/ML Engineer ADS Digital Partner},
179
+ title = {Sentiment Analysis for Social Media Text},
180
+ year = {2026},
181
+ publisher = {Hugging Face},
182
+ url = {https://huggingface.co/nahiar/spam-detection-v2}
183
+ }
184
+ ```
185
+
186
+ ---
187
+
188
+ ## 🙌 Acknowledgements
189
+
190
+ * Hugging Face Transformers
191
  * Facebook AI Research — XLM-RoBERTa
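
The Label Mapping section added in this commit lists raw `LABEL_n` ids, so callers have to translate pipeline output themselves. A minimal sketch of that translation follows, assuming the mapping shown in the diff is accurate. `LABEL_NAMES`, `readable`, and `classify` are hypothetical helper names, `LABEL_1` is spelled `POSITIVE` in English here, and the two sample texts are illustrative inputs that do not come from the README.

```python
from typing import Dict, List

# Mapping taken from the README's "Label Mapping" section (added side of the diff).
# Assumption: the published model really emits these three LABEL_n ids.
LABEL_NAMES: Dict[str, str] = {
    "LABEL_0": "NEUTRAL",
    "LABEL_1": "POSITIVE",
    "LABEL_2": "NEGATIVE",
}


def readable(results: List[dict]) -> List[dict]:
    """Replace raw LABEL_n ids with sentiment names; unknown ids pass through unchanged."""
    return [
        {"label": LABEL_NAMES.get(r["label"], r["label"]), "score": r["score"]}
        for r in results
    ]


def classify(texts: List[str]) -> List[dict]:
    """Run the text-classification pipeline on a list of texts and rename the labels."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    classifier = pipeline(
        task="text-classification",
        model="nahiar/sentiment-analysis-v2",  # model id as written in the README
    )
    return readable(classifier(texts))


if __name__ == "__main__":
    # Illustrative inputs only, one Indonesian and one English.
    texts = [
        "Pelayanannya cepat dan ramah, terima kasih!",
        "The app keeps crashing after the update.",
    ]
    for text, result in zip(texts, classify(texts)):
        print(f"{text} -> {result['label']} ({result['score']:.4f})")
```

The pipeline import sits inside `classify` so that `readable` can be reused and checked without downloading the model.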