TF-IDF Machine Learning Baselines for Indonesian Sentiment Classification

Three scikit-learn baseline models for sentiment classification of Indonesian e-commerce reviews from Tokopedia.

📊 Models Summary

Model	Accuracy	Macro F1	Weighted F1	Inference (ms)
TF-IDF + Logistic Regression	94.36%	0.5164	0.9575	~10
TF-IDF + SVM ⭐ BEST	97.60%	0.5506	0.9740	~15
TF-IDF + Naive Bayes	97.53%	0.3292	0.9634	~5

Train/Test Split Details

Total Test Samples: 13,067
Total Training Samples: 52,268
Dataset: Tokopedia Product Reviews 2025
Split Strategy: 80% train, 20% test (stratified)

Models Details

1️⃣ Logistic Regression

File: tfidf_logreg.joblib
Accuracy: 94.36%
Macro F1: 0.5164
Weighted F1: 0.9575
Inference Speed: ~10ms per sample (CPU)
Model Size: ~1.2 MB
Use Case: Baseline comparison, interpretability

2️⃣ Support Vector Machine (SVM) ⭐ RECOMMENDED FOR PRODUCTION

File: tfidf_svm.joblib
Accuracy: 97.60%
Macro F1: 0.5506
Weighted F1: 0.9740
Inference Speed: ~15ms per sample (CPU)
Model Size: ~1.5 MB
Use Case: Production deployment (highest accuracy and macro F1 of the three, at a moderate inference cost)

3️⃣ Multinomial Naive Bayes

File: tfidf_nb.joblib
Accuracy: 97.53%
Macro F1: 0.3292
Weighted F1: 0.9634
Inference Speed: ~5ms per sample (CPU) - FASTEST
Model Size: ~1.2 MB
Use Case: Real-time inference, probabilistic outputs, resource-constrained environments
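Note: Macro F1 is much lower than for the other two models because Naive Bayes is sensitive to class imbalance; the training configuration applies balanced class weights only to Logistic Regression and SVM.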
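
The gap between accuracy and macro F1 in the figures above is typical of imbalanced data. The sketch below uses made-up labels (not the Tokopedia test set) and a degenerate classifier that always predicts the majority class, to show how the three metrics diverge. The class proportions are an assumption chosen for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical, heavily imbalanced labels: 0=Negatif, 1=Netral, 2=Positif.
y_true = [2] * 96 + [0] * 2 + [1] * 2

# A degenerate classifier that always predicts the majority class.
y_pred = [2] * 100

accuracy = accuracy_score(y_true, y_pred)
# Macro F1 averages per-class F1 with equal weight per class.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Weighted F1 averages per-class F1 weighted by class support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"Accuracy:    {accuracy:.4f}")
print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

On this toy data accuracy is 0.96 and weighted F1 is about 0.94, while macro F1 is only about 0.33, because two of the three classes score zero. A macro F1 close to one third on a three-class task can therefore be a sign that a model rarely predicts the minority classes, so per-class metrics are worth checking alongside accuracy.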

🔧 Feature Engineering

TF-IDF Vectorization (Applied to all 3 models)

Word N-grams: 1-3 (unigram, bigram, trigram)
Character N-grams: 2-4 (morphological patterns in Indonesian)
Max Features: 100,000 total (50k word + 50k character)
Normalization: L2
Sublinear TF: Enabled (term frequency is replaced with 1 + log(tf), so repeated terms count less)
Min DF: 1 (consider all features)
Max DF: 1.0 (no upper limit)

Example Features

Word n-grams: 'produk', 'produk bagus', 'bagus sekali'
Char n-grams: 'pr', 'pro', 'prod', 'rod', 'odu'

📚 Class Labels

0: Negatif (Negative)
1: Netral (Neutral)
2: Positif (Positive)

💻 Usage

Load Logistic Regression Model

import joblib

# Load model
model = joblib.load('tfidf_logreg.joblib')

# Make prediction
review = ['Produk sangat bagus dan cepat sampai!']
prediction = model.predict(review)
confidence = model.predict_proba(review)

labels = {0: 'Negatif', 1: 'Netral', 2: 'Positif'}
print(f"Sentiment: {labels[prediction[0]]}")
print(f"Confidence: {confidence[0].max():.2%}")

Load SVM Model

import joblib

# Load SVM model
model = joblib.load('tfidf_svm.joblib')

# Make prediction
review = ['Barang rusak, tidak puas']
prediction = model.predict(review)
# decision_function returns per-class margins, not probabilities
scores = model.decision_function(review)

labels = {0: 'Negatif', 1: 'Netral', 2: 'Positif'}
print(f"Predicted: {labels[prediction[0]]}")
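
The values returned by decision_function are signed margins, one per class, and are not calibrated probabilities. A minimal sketch of how they might be inspected is shown below. The helper rank_classes and the example score row are hypothetical, and the sketch assumes the model returns one score per class in label order.

```python
import numpy as np

labels = {0: "Negatif", 1: "Netral", 2: "Positif"}

def rank_classes(score_row):
    """Return (label, score) pairs sorted from highest to lowest margin."""
    order = np.argsort(score_row)[::-1]
    return [(labels[int(i)], float(score_row[i])) for i in order]

# Hypothetical values shaped like one row of model.decision_function(...).
example_scores = np.array([1.3, -0.4, -0.9])

ranking = rank_classes(example_scores)
# Gap between the best and second-best class: a rough confidence signal.
top_margin = ranking[0][1] - ranking[1][1]

for name, score in ranking:
    print(f"{name}: {score:+.2f}")
print(f"Margin over runner-up: {top_margin:.2f}")
```

A larger margin over the runner-up suggests a more clear-cut prediction. If real probabilities are needed, one option is to wrap the classifier in scikit-learn's CalibratedClassifierCV at training time.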

Load Naive Bayes Model

import joblib

# Load NB model
model = joblib.load('tfidf_nb.joblib')

# Make prediction (Naive Bayes has probabilistic outputs)
prediction = model.predict(['Produk OK, sesuai harga'])
probabilities = model.predict_proba(['Produk OK, sesuai harga'])

labels = {0: 'Negatif', 1: 'Netral', 2: 'Positif'}
print(f"Sentiment: {labels[prediction[0]]}")
print(f"Class Probabilities: {dict(zip(labels.values(), probabilities[0]))}")

Batch Processing

import joblib

model = joblib.load('tfidf_svm.joblib')  # Use best model (SVM)

reviews = [
    'Produk bagus, recommend!',
    'Lumayan lah',
    'Jelek banget, mau komplain'
]

predictions = model.predict(reviews)
labels = {0: 'Negatif', 1: 'Netral', 2: 'Positif'}

for review, pred in zip(reviews, predictions):
    print(f"'{review}' → {labels[pred]}")
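
To check the per-sample inference figures on your own hardware, a simple timing helper along these lines may be enough. The function mean_latency_ms is a hypothetical helper, and fake_predict stands in for model.predict so the sketch runs without the .joblib file.

```python
import time

def mean_latency_ms(predict_fn, samples, repeats=5):
    """Average wall-clock time per sample, in milliseconds, over several repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(samples)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / (repeats * len(samples))

# Stand-in for model.predict; replace with the loaded model's predict method.
def fake_predict(batch):
    return [2 for _ in batch]

reviews = [
    'Produk bagus, recommend!',
    'Lumayan lah',
    'Jelek banget, mau komplain'
]

latency = mean_latency_ms(fake_predict, reviews)
print(f"{latency:.4f} ms per sample")
```

Batch prediction usually amortizes vectorization overhead, so per-sample latency measured on a batch tends to be lower than latency measured one review at a time.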

✨ Advantages

✅ Fast Inference - 5-15ms per sample on CPU
✅ Small Model Size - Only ~1.2-1.5 MB each (vs 475 MB for the transformer)
✅ No GPU Required - Works on any CPU-only system
✅ Interpretable - Can extract feature importance (LR, NB)
✅ Production-Ready - Easy deployment and ONNX conversion
✅ High Accuracy - 97.60% for the SVM model
✅ Multiple Options - Choose based on the speed vs accuracy trade-off

📊 Model Comparison

Aspect	LogReg	SVM	NB	IndoBERT
Accuracy	94.36%	97.60%	97.53%	88.70%
Macro F1	0.5164	0.5506	0.3292	0.5088
Inference Speed	~10ms	~15ms	~5ms	~500ms
Model Size	1.2MB	1.5MB	1.2MB	475MB
GPU Required	❌	❌	❌	❌*
Interpretable	✅ High	⚠️ Medium	✅ High	❌ Low
Semantic Understanding	❌ Low	❌ Low	❌ Low	✅ High

*IndoBERT can run on CPU, but it is very slow there (~1-2 s per sample)

Recommendation by Use Case

Production Deployment: Use SVM (best accuracy + reasonable speed)
Real-time Requirements: Use Naive Bayes (fastest inference)
Explainability Needed: Use Logistic Regression (most interpretable)
Complex Semantics: Use IndoBERT (see separate model)

⚙️ Training Configuration

Vectorization

Parameter	Value
Vectorizer	TfidfVectorizer
Min DF (minimum document frequency)	1
Max DF (maximum document frequency)	1.0
Max Features	100,000
Norm	L2
Sublinear TF	True
Word N-grams	(1, 3)
Char N-grams	(2, 4)

Logistic Regression

Parameter	Value
Max Iterations	2000
Regularization	L2
C (inverse regularization strength)	0.5
Class Weight	balanced
Solver	lbfgs

Support Vector Machine

Parameter	Value
Kernel	linear
C (inverse regularization strength)	0.5
Class Weight	balanced
Dual	False
Max Iterations	1000

Naive Bayes

Parameter	Value
Alpha (Laplace smoothing)	1.0
Fit Prior	True
Class Prior	None (automatic)
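
The tables above can be translated into scikit-learn estimators roughly as follows. This is a sketch under assumptions, not the original training script: the use of LinearSVC (suggested by the Dual parameter), the FeatureUnion layout, the "char" analyzer, and the toy training data are all illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

def make_features():
    """Build a fresh word + char TF-IDF union (assumed layout, 50k features each)."""
    shared = dict(sublinear_tf=True, norm="l2", min_df=1, max_df=1.0,
                  max_features=50_000)
    return FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3), **shared)),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 4), **shared)),
    ])

classifiers = {
    # L2 regularization is the scikit-learn default for both linear models.
    "logreg": LogisticRegression(C=0.5, class_weight="balanced",
                                 solver="lbfgs", max_iter=2000),
    "svm": LinearSVC(C=0.5, class_weight="balanced", dual=False, max_iter=1000),
    "nb": MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None),
}

pipelines = {name: make_pipeline(make_features(), clf)
             for name, clf in classifiers.items()}

# Toy data for a smoke test only: 0=Negatif, 1=Netral, 2=Positif.
texts = ["produk bagus", "bagus sekali", "barang rusak",
         "rusak parah", "lumayan saja", "biasa saja"]
y = [2, 2, 0, 0, 1, 1]

predictions = {}
for name, pipe in pipelines.items():
    pipe.fit(texts, y)
    predictions[name] = int(pipe.predict(["produk bagus"])[0])
print(predictions)
```

Each pipeline accepts raw strings, which matches how the saved models are called in the usage examples. A fitted pipeline can be saved with joblib.dump(pipe, 'name.joblib').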

Dataset Information

Source: Tokopedia Product Reviews 2025
Total Samples: 65,335
Train/Test Split: 80/20 (stratified)
Language: Indonesian
Domain: E-commerce product sentiment
Train Set: 52,268 samples
Test Set: 13,067 samples
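
A stratified 80/20 split like the one described above can be produced with train_test_split. The sketch below uses a small synthetic corpus in place of the real reviews; the random_state value and the class proportions are assumptions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review corpus: 0=Negatif, 1=Netral, 2=Positif.
texts = [f"ulasan {i}" for i in range(100)]
labels = [0] * 10 + [1] * 10 + [2] * 80

# stratify=labels keeps class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42,
)

print(len(X_train), len(X_test))
print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

Stratification matters most for the minority classes: without it, a random split could leave very few Negatif or Netral reviews in the test set and make macro F1 unstable.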

Limitations

May struggle with sarcasm or complex linguistic patterns
Limited context understanding (bag-of-words approach)
Domain-specific (trained on Tokopedia reviews)
Works best for relatively short reviews
No multilingual support

Citation

@misc{kelompok10_baseline_2026,
  title={TF-IDF Tokopedia Sentiment Classifier - Baseline Models},
  author={Kelompok 10},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/kelompok-10-NLP-SD-2026/tfidf-sentiment-baseline}}
}

Authors

Team: Kelompok 10
Institution: PBA 2026
Repository: https://github.com/zeeyachan/pba2026-kelompok10

License

MIT License - See LICENSE file for details

Recommendation: For production systems with CPU constraints, use one of these baseline models. For deeper semantic understanding and edge cases, combine it with the IndoBERT model in an ensemble.
