# IndoBERTweet Multiclass with Balanced Augmented Dataset Model

## Overview

This repository contains a fine-tuned IndoBERTweet model for Indonesian text classification. The model was trained and evaluated on a balanced dataset covering eight labels: Politics, Social Culture, Defense and Security, Ideology, Economy, Natural Resources, Demography, and Geography.
## Dataset Information

### Before Augmentation/Balancing
| Label | Count |
|---|---|
| Politics | 2972 |
| Social Culture | 587 |
| Defense and Security | 400 |
| Ideology | 400 |
| Economy | 367 |
| Natural Resources | 192 |
| Demography | 62 |
| Geography | 20 |
### After Balancing
| Label | Count |
|---|---|
| Politics | 2969 |
| Demography | 427 |
| Social Culture | 422 |
| Ideology | 343 |
| Defense and Security | 331 |
| Economy | 309 |
| Natural Resources | 156 |
| Geography | 133 |
## Label Encoding
| Encoded | Label |
|---|---|
| 0 | Demography |
| 1 | Economy |
| 2 | Geography |
| 3 | Ideology |
| 4 | Defense and Security |
| 5 | Politics |
| 6 | Social Culture |
| 7 | Natural Resources |
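In code, this encoding corresponds to the following mapping (a plain-Python sketch built from the table above; whether the hosted model's config already embeds these names as `id2label` is not stated here):

```python
# Integer-to-label mapping taken from the encoding table above
id2label = {
    0: "Demography",
    1: "Economy",
    2: "Geography",
    3: "Ideology",
    4: "Defense and Security",
    5: "Politics",
    6: "Social Culture",
    7: "Natural Resources",
}

# Inverse mapping, useful when encoding labels for training
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[5], label2id["Geography"])  # Politics 2
```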
## Data Split
- Train Size: 4326 samples (85%)
- Test Size: 764 samples (15%)
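These sizes are consistent with an 85/15 split of the 5,090 balanced samples. A quick arithmetic check (the round-the-test-fraction-up convention is an assumption, matching what e.g. scikit-learn's `train_test_split` does with `test_size=0.15`):

```python
import math

# Total samples after balancing (sum of the "After Balancing" counts)
TOTAL = 5090

# Assuming the test fraction is rounded up to a whole sample count
test_size = math.ceil(TOTAL * 0.15)   # 764
train_size = TOTAL - test_size        # 4326

print(train_size, test_size)  # 4326 764
```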
## Model Training Log
**Epoch 1/4**
- Train Loss: 1.0651 | Train Accuracy: 0.6700
- Test Loss: 0.8339 | Test Accuracy: 0.7313

**Epoch 2/4**
- Train Loss: 0.6496 | Train Accuracy: 0.7879
- Test Loss: 0.6988 | Test Accuracy: 0.7717

**Epoch 3/4**
- Train Loss: 0.4223 | Train Accuracy: 0.8736
- Test Loss: 0.7308 | Test Accuracy: 0.7704

**Epoch 4/4**
- Train Loss: 0.2764 | Train Accuracy: 0.9150
- Test Loss: 0.7615 | Test Accuracy: 0.7826

Training completed.
## Model Evaluation
- Precision Score: 0.7836
- Recall Score: 0.7827
- F1 Score: 0.7820
### Classification Report
| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Demography | 0.90 | 0.94 | 0.92 | 64 |
| Economy | 0.70 | 0.67 | 0.69 | 46 |
| Geography | 0.95 | 0.90 | 0.92 | 20 |
| Ideology | 0.72 | 0.56 | 0.63 | 52 |
| Defense and Security | 0.73 | 0.66 | 0.69 | 50 |
| Politics | 0.84 | 0.86 | 0.85 | 446 |
| Social Culture | 0.43 | 0.48 | 0.45 | 63 |
| Natural Resources | 0.61 | 0.61 | 0.61 | 23 |
- Accuracy Score: 0.7827
- Balanced Accuracy Score: 0.7091
- Macro Average: 0.74 (Precision), 0.71 (Recall), 0.72 (F1-Score)
- Weighted Average: 0.78 (Precision), 0.78 (Recall), 0.78 (F1-Score)
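As a sanity check, the macro and weighted F1 averages above can be recomputed directly from the per-class F1 scores and supports in the classification report:

```python
# Per-class (F1, support) pairs copied from the classification report above
report = {
    "Demography": (0.92, 64),
    "Economy": (0.69, 46),
    "Geography": (0.92, 20),
    "Ideology": (0.63, 52),
    "Defense and Security": (0.69, 50),
    "Politics": (0.85, 446),
    "Social Culture": (0.45, 63),
    "Natural Resources": (0.61, 23),
}

total_support = sum(s for _, s in report.values())  # 764, the test-set size

# Macro average: unweighted mean over classes
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: mean weighted by class support
weighted_f1 = sum(f1 * s for f1, s in report.values()) / total_support

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.72 0.78
```

The gap between the two reflects the imbalance that remains in the test set: Politics (446 samples) dominates the weighted average, while the macro average gives weak classes like Social Culture equal weight.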
## How to Use the Model

To use this model, load it with the `transformers` library from Hugging Face.
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/Trained-indobertweet-balanced-dataset")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/Trained-indobertweet-balanced-dataset")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/Trained-indobertweet-balanced-dataset")
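If the hosted model returns generic class ids (e.g. `LABEL_5`) rather than names, raw logits can be decoded with the encoding table above. A minimal post-processing sketch in plain Python (the logit values below are made up for illustration):

```python
import math

# Mapping from the Label Encoding table above
id2label = {0: "Demography", 1: "Economy", 2: "Geography", 3: "Ideology",
            4: "Defense and Security", 5: "Politics", 6: "Social Culture",
            7: "Natural Resources"}

def decode(logits):
    """Softmax over the eight class logits, then map the argmax to its label."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Hypothetical logits for one tweet; index 5 (Politics) is the largest
label, prob = decode([-1.2, 0.3, -0.8, 0.1, -0.5, 2.7, 0.4, -1.0])
print(label)  # Politics
```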
## Conclusion

This IndoBERTweet model is fine-tuned on a balanced dataset to improve its performance across categories. It reaches a weighted F1 score of about 0.78 on the held-out test set, making it suitable for a variety of Indonesian-language text classification tasks.

Feel free to use and contribute to this repository. For any issues or suggestions, please open an issue on the repository or contact the maintainer.