---
license: mit
language:
- id
- en
metrics:
- accuracy
- recall
- precision
pipeline_tag: text-classification
tags:
- election
- multiclass
- Balanced Accuracy
---

# IndoBERTweet Multiclass with Balanced Augmented Dataset Model

## Overview

This repository contains a fine-tuned IndoBERTweet model for text classification. The model was trained and evaluated on a balanced dataset covering eight labels: Politics, Social Culture, Defense and Security, Ideology, Economy, Natural Resources, Demography, and Geography.

## Dataset Information

### Before Augmentation/Balancing

| Label                | Count |
|----------------------|-------|
| Politics             | 2972  |
| Social Culture       | 587   |
| Defense and Security | 400   |
| Ideology             | 400   |
| Economy              | 367   |
| Natural Resources    | 192   |
| Demography           | 62    |
| Geography            | 20    |

### After Balancing

| Label                | Count |
|----------------------|-------|
| Politics             | 2969  |
| Demography           | 427   |
| Social Culture       | 422   |
| Ideology             | 343   |
| Defense and Security | 331   |
| Economy              | 309   |
| Natural Resources    | 156   |
| Geography            | 133   |

### Label Encoding

| Encoded | Label                |
|---------|----------------------|
| 0       | Demography           |
| 1       | Economy              |
| 2       | Geography            |
| 3       | Ideology             |
| 4       | Defense and Security |
| 5       | Politics             |
| 6       | Social Culture       |
| 7       | Natural Resources    |

## Data Split

- **Train Size**: 4326 samples (85%)
- **Test Size**: 764 samples (15%)

## Model Training Log

**Epoch 1/4**
- Train Loss: 1.0651 | Train Accuracy: 0.6700
- Test Loss: 0.8339 | Test Accuracy: 0.7313

**Epoch 2/4**
- Train Loss: 0.6496 | Train Accuracy: 0.7879
- Test Loss: 0.6988 | Test Accuracy: 0.7717

**Epoch 3/4**
- Train Loss: 0.4223 | Train Accuracy: 0.8736
- Test Loss: 0.7308 | Test Accuracy: 0.7704

**Epoch 4/4**
- Train Loss: 0.2764 | Train Accuracy: 0.9150
- Test Loss: 0.7615 | Test Accuracy: 0.7826

**Training Completed**

## Model Evaluation

- **Precision Score**: 0.7836
- **Recall Score**: 0.7827
- **F1 Score**: 0.7820

### Classification Report

| Label                | Precision | Recall | F1-Score | Support |
|----------------------|-----------|--------|----------|---------|
| Demography           | 0.90      | 0.94   | 0.92     | 64      |
| Economy              | 0.70      | 0.67   | 0.69     | 46      |
| Geography            | 0.95      | 0.90   | 0.92     | 20      |
| Ideology             | 0.72      | 0.56   | 0.63     | 52      |
| Defense and Security | 0.73      | 0.66   | 0.69     | 50      |
| Politics             | 0.84      | 0.86   | 0.85     | 446     |
| Social Culture       | 0.43      | 0.48   | 0.45     | 63      |
| Natural Resources    | 0.61      | 0.61   | 0.61     | 23      |

- **Accuracy Score**: 0.7827
- **Balanced Accuracy Score**: 0.7091
- **Macro Average**: 0.74 (Precision), 0.71 (Recall), 0.72 (F1-Score)
- **Weighted Average**: 0.78 (Precision), 0.78 (Recall), 0.78 (F1-Score)

## How to Use the Model

You can load this model with the `transformers` library from Hugging Face:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/Trained-indobertweet-balanced-dataset")

# Or load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/Trained-indobertweet-balanced-dataset")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/Trained-indobertweet-balanced-dataset")
```

## Conclusion

This IndoBERTweet model is fine-tuned on a balanced dataset to improve performance across all eight categories, reaching a test accuracy of 0.78 and a balanced accuracy of 0.71. It is suitable for a variety of Indonesian-language text classification tasks.

Feel free to use and contribute to this repository. For any issues or suggestions, please open an issue on the repository or contact the maintainer.
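If the model's `config.json` does not carry human-readable class names, the pipeline will return generic labels such as `LABEL_5`. A minimal sketch for translating those back into category names, using the encoding table above (the `decode_label` helper is hypothetical, not part of this repository):

```python
# Mapping from encoded class index to category name, per the Label Encoding table.
ID2LABEL = {
    0: "Demography",
    1: "Economy",
    2: "Geography",
    3: "Ideology",
    4: "Defense and Security",
    5: "Politics",
    6: "Social Culture",
    7: "Natural Resources",
}

def decode_label(pipeline_label: str) -> str:
    """Translate a pipeline output such as 'LABEL_5' into its category name."""
    index = int(pipeline_label.rsplit("_", 1)[-1])  # take the trailing number
    return ID2LABEL[index]
```

For example, `decode_label("LABEL_5")` returns `"Politics"`. The same dictionary can be passed as `id2label` to `AutoModelForSequenceClassification.from_pretrained` so the pipeline emits readable labels directly.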