🇮🇩 Indonesian Emotion Classification using IndoBERT

GitHub Hugging Face

Multiclass emotion classification for Indonesian text using IndoBERT and Hugging Face Transformers. This project was developed as part of the NoLimit Indonesia Data Scientist Hiring Test.

Overview

The objective of this project is to classify Indonesian text into five emotion categories:

  • 😡 Anger
  • 😨 Fear
  • 😊 Happy
  • ❤️ Love
  • 😢 Sadness

The model is fine-tuned from IndoBERT Base P1 and evaluated using standard classification metrics including Accuracy, Precision, Recall, and F1 Score.


Dataset

Source

Indonesian Twitter Emotion Dataset:

https://www.kaggle.com/datasets/dennisherdi/indonesian-twitter-emotion

Emotion Labels

Label Description
anger Anger emotion
fear Fear or anxiety
happy Happiness or joy
love Affection or love
sadness Sadness

Data Split

Split Percentage
Train 70%
Validation 15%
Test 15%

Project Structure

.
├── dataset/
│   ├── emotion_dataset.csv
│   └── slang_dict.csv
│
├── models/
│   └── indobert-emotion-classification/
│       ├── config.json
│       ├── model.safetensors
│       ├── tokenizer.json
│       ├── tokenizer_config.json
│       └── training_args.bin
│
├── notebooks/
│   ├── emotion_classification.ipynb
│   └── word_dictionary.xlsx
│
├── results/
├── requirements.txt
└── README.md

Workflow

Project Workflow


Methodology

1. Data Preprocessing

Several preprocessing techniques were applied to improve text quality before training:

  • Lowercasing
  • URL removal
  • Mention removal
  • Hashtag normalization
  • Repeated character normalization
  • Indonesian slang normalization
  • Whitespace normalization

Example

Input:

GAKKK suka banget sama pelayanan ini!!!

Output:

tidak suka banget sama pelayanan ini!!!

2. Tokenization

Tokenizer used:

indobenchmark/indobert-base-p1

3. Model Fine-Tuning

Base model:

indobenchmark/indobert-base-p1

Framework:

  • PyTorch
  • Hugging Face Transformers

Training configuration:

Parameter Value
Learning Rate 1e-5
Batch Size 8
Epochs 15
Weight Decay 0.01
Warmup Ratio 0.1
FP16 Enabled
Evaluation Metric Macro F1

Results

Test Performance

Metric Score
Accuracy 0.74
Macro Precision 0.76
Macro Recall 0.75
Macro F1 Score 0.75
Weighted F1 Score 0.74

Key Findings

  • IndoBERT achieved a Macro F1 Score of 0.75 on the test dataset.
  • Text normalization helped improve model robustness against informal Indonesian social media language.
  • The model performed well across all emotion categories with balanced precision and recall.
  • Fear-related tweets remain the most challenging category due to semantic overlap with sadness and anxiety expressions.

Example Predictions

Text Predicted Emotion
aku sangat senang hari ini happy
aku kecewa dengan hasilnya sadness
aku takut menghadapi ujian besok fear
aku sangat menyayanginya love
aku marah dengan pelayanan ini anger

Trained Model

The fine-tuned model is available in:

models/indobert-emotion-classification/

The model directory contains:

  • config.json
  • model.safetensors
  • tokenizer.json
  • tokenizer_config.json
  • training_args.bin

Installation

Clone the repository:

git clone https://github.com/kevin-sandy/indobert-emotion-classification-twitter.git
cd indobert-emotion-classification-twitter

Install dependencies:

pip install -r requirements.txt

Running the Project

Launch Jupyter Notebook:

jupyter notebook

Open:

notebooks/emotion_classification.ipynb

Run all cells sequentially to reproduce the complete workflow:

  1. Data Loading
  2. Exploratory Data Analysis (EDA)
  3. Data Preprocessing
  4. Label Encoding
  5. Data Splitting
  6. Tokenization
  7. IndoBERT Fine-Tuning
  8. Model Evaluation
  9. Inference

Inference

Load the trained model using Hugging Face Transformers:

from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="./models/indobert-emotion-classification"
)

classifier("aku sangat senang hari ini")

Example output:

[
    {
        "label": "happy",
        "score": 0.95
    }
]

Technologies

  • Python
  • PyTorch
  • Hugging Face Transformers
  • Scikit-learn
  • Pandas
  • NumPy
  • Matplotlib
  • Jupyter Notebook

Author

Kevin Sandy Dimpos Manurung

Project submitted for the Data Scientist Hiring Test.

Downloads last month
-
Safetensors
Model size
0.1B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for kevin-sandy/indobert-emotion-classification-twitter

Finetuned
(135)
this model

Space using kevin-sandy/indobert-emotion-classification-twitter 1