# IndoBERTweet Multiclass with Balanced Augmented Dataset Model

## Overview

This repository contains a fine-tuned IndoBERTweet model for Indonesian text classification. The model was trained and evaluated on a balanced dataset covering eight labels: Politics, Social Culture, Defense and Security, Ideology, Economy, Natural Resources, Demography, and Geography.
## Dataset Information

### Before Augmentation/Balancing
| Label | Count |
|---|---|
| Politics | 2972 |
| Social Culture | 587 |
| Defense and Security | 400 |
| Ideology | 400 |
| Economy | 367 |
| Natural Resources | 192 |
| Demography | 62 |
| Geography | 20 |
### After Balancing
| Label | Count |
|---|---|
| Politics | 2969 |
| Demography | 427 |
| Social Culture | 422 |
| Ideology | 343 |
| Defense and Security | 331 |
| Economy | 309 |
| Natural Resources | 156 |
| Geography | 133 |
## Label Encoding
| Encoded | Label |
|---|---|
| 0 | Demography |
| 1 | Economy |
| 2 | Geography |
| 3 | Ideology |
| 4 | Defense and Security |
| 5 | Politics |
| 6 | Social Culture |
| 7 | Natural Resources |
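In code, this encoding corresponds to the following mapping (a plain-Python sketch built from the table above; whether the hosted model's config already embeds these names as `id2label` is not stated here):

```python
# Integer-to-label mapping taken from the encoding table above
id2label = {
    0: "Demography",
    1: "Economy",
    2: "Geography",
    3: "Ideology",
    4: "Defense and Security",
    5: "Politics",
    6: "Social Culture",
    7: "Natural Resources",
}

# Inverse mapping, useful when encoding labels for training
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[5], label2id["Geography"])  # Politics 2
```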
## Data Split
- Train Size: 4326 samples (85%)
- Test Size: 764 samples (15%)
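These sizes are consistent with an 85/15 split of the 5,090 balanced samples. A quick arithmetic check (the round-the-test-fraction-up convention is an assumption, matching what e.g. scikit-learn's `train_test_split` does with `test_size=0.15`):

```python
import math

# Total samples after balancing (sum of the "After Balancing" counts)
TOTAL = 5090

# Assuming the test fraction is rounded up to a whole sample count
test_size = math.ceil(TOTAL * 0.15)   # 764
train_size = TOTAL - test_size        # 4326

print(train_size, test_size)  # 4326 764
```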
## Model Training Log
**Epoch 1/4**
- Train Loss: 1.0651 | Train Accuracy: 0.6700
- Test Loss: 0.8339 | Test Accuracy: 0.7313

**Epoch 2/4**
- Train Loss: 0.6496 | Train Accuracy: 0.7879
- Test Loss: 0.6988 | Test Accuracy: 0.7717

**Epoch 3/4**
- Train Loss: 0.4223 | Train Accuracy: 0.8736
- Test Loss: 0.7308 | Test Accuracy: 0.7704

**Epoch 4/4**
- Train Loss: 0.2764 | Train Accuracy: 0.9150
- Test Loss: 0.7615 | Test Accuracy: 0.7826

Training completed.
## Model Evaluation
- Precision Score: 0.7836
- Recall Score: 0.7827
- F1 Score: 0.7820
### Classification Report
| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Demography | 0.90 | 0.94 | 0.92 | 64 |
| Economy | 0.70 | 0.67 | 0.69 | 46 |
| Geography | 0.95 | 0.90 | 0.92 | 20 |
| Ideology | 0.72 | 0.56 | 0.63 | 52 |
| Defense and Security | 0.73 | 0.66 | 0.69 | 50 |
| Politics | 0.84 | 0.86 | 0.85 | 446 |
| Social Culture | 0.43 | 0.48 | 0.45 | 63 |
| Natural Resources | 0.61 | 0.61 | 0.61 | 23 |
- Accuracy Score: 0.7827
- Balanced Accuracy Score: 0.7091
- Macro Average: 0.74 (Precision), 0.71 (Recall), 0.72 (F1-Score)
- Weighted Average: 0.78 (Precision), 0.78 (Recall), 0.78 (F1-Score)
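As a sanity check, the macro and weighted F1 averages above can be recomputed directly from the per-class F1 scores and supports in the classification report:

```python
# Per-class (F1, support) pairs copied from the classification report above
report = {
    "Demography": (0.92, 64),
    "Economy": (0.69, 46),
    "Geography": (0.92, 20),
    "Ideology": (0.63, 52),
    "Defense and Security": (0.69, 50),
    "Politics": (0.85, 446),
    "Social Culture": (0.45, 63),
    "Natural Resources": (0.61, 23),
}

total_support = sum(s for _, s in report.values())  # 764, the test-set size

# Macro average: unweighted mean over classes
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: mean weighted by class support
weighted_f1 = sum(f1 * s for f1, s in report.values()) / total_support

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.72 0.78
```

The gap between the two reflects the imbalance that remains in the test set: Politics (446 samples) dominates the weighted average, while the macro average gives weak classes like Social Culture equal weight.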
## How to Use the Model

To use this model, load it with the `transformers` library from Hugging Face.
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/Trained-indobertweet-balanced-dataset")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/Trained-indobertweet-balanced-dataset")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/Trained-indobertweet-balanced-dataset")
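If the hosted model returns generic class ids (e.g. `LABEL_5`) rather than names, raw logits can be decoded with the encoding table above. A minimal post-processing sketch in plain Python (the logit values below are made up for illustration):

```python
import math

# Mapping from the Label Encoding table above
id2label = {0: "Demography", 1: "Economy", 2: "Geography", 3: "Ideology",
            4: "Defense and Security", 5: "Politics", 6: "Social Culture",
            7: "Natural Resources"}

def decode(logits):
    """Softmax over the eight class logits, then map the argmax to its label."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Hypothetical logits for one tweet; index 5 (Politics) is the largest
label, prob = decode([-1.2, 0.3, -0.8, 0.1, -0.5, 2.7, 0.4, -1.0])
print(label)  # Politics
```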
## Conclusion

This IndoBERTweet model is fine-tuned on a balanced dataset to improve its performance across categories. It reaches a weighted F1 score of about 0.78 on the held-out test set, making it suitable for a variety of Indonesian-language text classification tasks.

Feel free to use and contribute to this repository. For any issues or suggestions, please open an issue on the repository or contact the maintainer.