Election Tweets Classification Model

This repository contains a fine-tuned version of the indolem/indobertweet-base-uncased model for classifying election-related tweets. The model assigns each tweet to one of eight topic classes, which supports analysis of public opinion and discourse during election periods.

Classes

The model classifies tweets into the following categories (number of samples in the dataset in parentheses):

  1. Politik (2972 samples)
  2. Sosial Budaya (425 samples)
  3. Ideologi (343 samples)
  4. Pertahanan dan Keamanan (331 samples)
  5. Ekonomi (310 samples)
  6. Sumber Daya Alam (157 samples)
  7. Demografi (61 samples)
  8. Geografi (20 samples)

The classes are encoded as the following integer labels:

Encoded  Label
0        Demografi
1        Ekonomi
2        Geografi
3        Ideologi
4        Pertahanan dan Keamanan
5        Politik
6        Sosial Budaya
7        Sumber Daya Alam
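
The integer codes listed above coincide with the alphabetical order of the class names. The sketch below rebuilds the mapping on that basis. The names `CLASS_NAMES`, `LABEL2ID`, `ID2LABEL`, and `decode` are illustrative, and the alphabetical rule is an observation about the table, not a confirmed detail of how the labels were encoded.

```python
# Class names as listed in the "Classes" section.
CLASS_NAMES = [
    "Politik",
    "Sosial Budaya",
    "Ideologi",
    "Pertahanan dan Keamanan",
    "Ekonomi",
    "Sumber Daya Alam",
    "Demografi",
    "Geografi",
]

# Assumption: codes follow alphabetical order, which matches the table.
LABEL2ID = {name: idx for idx, name in enumerate(sorted(CLASS_NAMES))}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def decode(label_id: int) -> str:
    """Return the class name for an integer label."""
    return ID2LABEL[label_id]


print(decode(5))  # Politik
```

A mapping like this is useful when the model returns integer class indices and a readable class name is needed.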

Libraries Used

The following libraries were used for data processing, model training, and evaluation:

  • Data processing: numpy, pandas, re, string, random
  • Visualization: matplotlib.pyplot, seaborn, tqdm, plotly.graph_objs, plotly.express, plotly.figure_factory
  • Word cloud generation: PIL, wordcloud
  • NLP: nltk, nlp_id, Sastrawi, tweet-preprocessor
  • Machine Learning: tensorflow, keras, sklearn, transformers, torch

Data Preparation

Data Split

The dataset was split into training, validation, and test sets with the following proportions:

  • Training Set: 85% (3925 samples)
  • Validation Set: 10% (463 samples)
  • Test Set: 5% (231 samples)
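
A split with these sizes can be sketched as follows. This is a minimal illustration using only the standard library: the function name `split_indices` and the seed are hypothetical, and the source does not say whether the original split was stratified or which tool produced it.

```python
import random


def split_indices(n_total, n_val, n_test, seed=42):
    """Shuffle indices 0..n_total-1 and cut them into train/val/test lists."""
    if n_val + n_test >= n_total:
        raise ValueError("validation and test sets must leave room for training data")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# Sizes taken from the list above: 3925 + 463 + 231 = 4619 samples in total.
train_idx, val_idx, test_idx = split_indices(4619, n_val=463, n_test=231)
print(len(train_idx), len(val_idx), len(test_idx))  # 3925 463 231
```

With a class as small as Geografi (20 samples), a plain random split can leave very few examples of that class in the test set, so a stratified split may be worth considering.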

Training Details

  • Epochs: 3
  • Batch Size: 32
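
These settings imply the number of optimizer steps, given the 3925 training samples reported under Data Split. The short calculation below assumes the last, partially filled batch of each epoch is kept. That is an assumption, since the source does not describe the data loader.

```python
import math

TRAIN_SAMPLES = 3925  # from the Data Split section
BATCH_SIZE = 32
EPOCHS = 3

# Assumption: the final partial batch is not dropped.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch, total_steps)  # 123 369
```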

Training Results

Epoch  Train Loss  Train Accuracy  Validation Loss  Validation Accuracy
1      0.9382      0.7167          0.7518           0.7671
2      0.5741      0.8229          0.7081           0.7931
3      0.3541      0.8958          0.7473           0.7953
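
The figures above can be read programmatically. The sketch below copies the values from the table (the variable names are illustrative) and reports the best epoch by each validation metric, together with the gap between training and validation accuracy.

```python
# Values copied from the training results table.
history = [
    {"epoch": 1, "train_loss": 0.9382, "train_acc": 0.7167, "val_loss": 0.7518, "val_acc": 0.7671},
    {"epoch": 2, "train_loss": 0.5741, "train_acc": 0.8229, "val_loss": 0.7081, "val_acc": 0.7931},
    {"epoch": 3, "train_loss": 0.3541, "train_acc": 0.8958, "val_loss": 0.7473, "val_acc": 0.7953},
]

best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_acc = max(history, key=lambda row: row["val_acc"])

# Train-minus-validation accuracy per epoch; a growing gap suggests overfitting.
gaps = {row["epoch"]: round(row["train_acc"] - row["val_acc"], 4) for row in history}

print("lowest validation loss at epoch", best_by_loss["epoch"])
print("highest validation accuracy at epoch", best_by_acc["epoch"])
print("accuracy gap per epoch:", gaps)
```

Validation loss is lowest at epoch 2 and rises at epoch 3 while training loss keeps falling, and the accuracy gap widens to about 0.10. This pattern is commonly an early sign of overfitting, so training for more epochs would not necessarily help.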

Model Architecture

The model is built using the TensorFlow and Keras libraries and employs the following architecture:

  • Embedding Layer: Converts input tokens into dense vectors of fixed size.
  • LSTM Layers: Bidirectional LSTM layers capture dependencies in the text data.
  • Dense Layers: Fully connected layers for classification.
  • Dropout Layers: Prevent overfitting by randomly dropping units during training.
  • Batch Normalization: Normalizes activations of the previous layer.

Usage

Installation

To use the model, ensure you have the required libraries installed. You can install them using pip:

pip install transformers

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/tweets-election-classification")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/tweets-election-classification")

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/tweets-election-classification")
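
Once the pipeline is created, it can be called on a list of tweets. The sketch below wraps that call and converts the result into readable class names. It rests on two assumptions: the helper names `readable_label` and `classify` are illustrative, and the model configuration may expose generic names such as `LABEL_5`, in which case they are mapped through the label encoding given in the Classes section. If the configuration already stores the class names, they are passed through unchanged.

```python
# Label encoding from the "Classes" section.
ID2LABEL = {
    0: "Demografi",
    1: "Ekonomi",
    2: "Geografi",
    3: "Ideologi",
    4: "Pertahanan dan Keamanan",
    5: "Politik",
    6: "Sosial Budaya",
    7: "Sumber Daya Alam",
}


def readable_label(raw_label: str) -> str:
    """Map generic 'LABEL_<n>' names to class names; pass other labels through."""
    if raw_label.startswith("LABEL_"):
        return ID2LABEL[int(raw_label.split("_", 1)[1])]
    return raw_label


def classify(texts):
    """Classify tweets; downloads the model on first use."""
    # Imported lazily: requires transformers plus a backend such as torch.
    from transformers import pipeline

    pipe = pipeline("text-classification", model="Rendika/tweets-election-classification")
    return [(readable_label(r["label"]), r["score"]) for r in pipe(list(texts))]


# Example call (requires network access to fetch the model):
# for label, score in classify(["contoh tweet tentang pemilu"]):
#     print(label, round(score, 4))
```

Applying the same cleaning steps used for training before calling `classify` is likely to give predictions closer to the reported results.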

Data Cleaning

The data was cleaned using the following steps:

  1. Converted text to lowercase.
  2. Removed 'RT'.
  3. Removed links.
  4. Removed patterns like '[RE ...]'.
  5. Removed patterns like '@ ... ='.
  6. Removed non-ASCII characters (including emojis).
  7. Removed punctuation (excluding '#').
  8. Removed excessive whitespace.
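
The steps above can be sketched with the standard library as follows. The regular expressions are assumptions: the source names the patterns only loosely ('[RE ...]', '@ ... ='), so the exact expressions used by the author may differ. The function name `clean_tweet` is illustrative.

```python
import re
import string

# Step 7 removes every ASCII punctuation character except '#'.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("#", ""))


def clean_tweet(text: str) -> str:
    """Apply the eight cleaning steps in the order listed above."""
    text = text.lower()                                     # 1. lowercase
    text = re.sub(r"\brt\b", " ", text)                     # 2. 'RT' (lowercased by step 1)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # 3. links
    text = re.sub(r"\[re [^\]]*\]", " ", text)              # 4. '[RE ...]' (assumed pattern)
    text = re.sub(r"@[^=]*=", " ", text)                    # 5. '@ ... =' (assumed pattern)
    text = text.encode("ascii", "ignore").decode("ascii")   # 6. non-ASCII, including emojis
    text = text.translate(_PUNCT_TABLE)                     # 7. punctuation except '#'
    return re.sub(r"\s+", " ", text).strip()                # 8. excess whitespace


print(clean_tweet("RT @Budi = Pemilu 2024 damai! https://t.co/abc #Pemilu2024"))
# pemilu 2024 damai #pemilu2024
```

Because lowercasing comes first, the later patterns match the lowercase forms 'rt' and '[re ...]'. Hashtags survive because '#' is excluded from the punctuation table.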

Sample Code

Here's a sample code snippet to load and use the model:

from tensorflow.keras.models import load_model
import pandas as pd

# Load the trained model (placeholder path)
model = load_model('path_to_your_model.h5')

# Preprocess new data
def preprocess_text(text):
    # Include your text preprocessing steps here (see "Data Cleaning").
    # Lowercasing and whitespace normalization serve as a minimal stand-in
    # so that the function returns a string instead of None.
    return " ".join(text.lower().split())

# Example usage
new_tweets = pd.Series(["Your new tweet text here"])
preprocessed_tweets = new_tweets.apply(preprocess_text)
# Tokenize and pad sequences as done during training
# ...

# Predict the class. model.predict expects the tokenized, padded
# sequences from the step above, not raw strings.
predictions = model.predict(preprocessed_tweets)
predicted_classes = predictions.argmax(axis=-1)

Evaluation

The model was evaluated using the following metrics:

  • Precision: Share of the predictions for a class that are correct.
  • Recall: Share of a class's true instances that the model finds.
  • F1 Score: Harmonic mean of precision and recall.
  • Accuracy: Share of all tweets that are classified correctly.
  • Balanced Accuracy: Mean of the per-class recall values, which compensates for class imbalance.
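
The metrics listed above can be computed from true and predicted labels as in the following sketch. It uses only the standard library, and the function name `classification_metrics` is illustrative. The source does not state which library or averaging scheme was used, so per-class values are returned without averaging.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Per-class precision/recall/F1 plus accuracy and balanced accuracy."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    true_labels = set(y_true)
    return {
        "per_class": per_class,
        "accuracy": sum(tp.values()) / len(y_true),
        # Balanced accuracy: mean recall over the classes present in y_true.
        "balanced_accuracy": sum(per_class[c]["recall"] for c in true_labels) / len(true_labels),
    }


# Toy example with the encoded labels 5 (Politik), 1 (Ekonomi), 2 (Geografi).
result = classification_metrics([5, 5, 5, 5, 1, 1, 2], [5, 5, 5, 5, 5, 1, 5])
print(round(result["accuracy"], 4), result["balanced_accuracy"])  # 0.7143 0.5
```

In the toy example the majority class is always predicted correctly, so accuracy looks acceptable, while balanced accuracy drops to 0.5 because the two smaller classes are mostly missed. The same effect is plausible for this dataset, where Politik makes up most of the samples.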

Conclusion

This fine-tuned model classifies election-related tweets into eight topic categories, reaching a validation accuracy of 0.7953 after three epochs. It can be used to analyze trends in public discourse during election periods and to support decisions that depend on them.

License

This project is licensed under the MIT License.

Contact

For any questions or feedback, please contact rendikarendi96@gmail.com.

Model size: 111M params (Safetensors, tensor type F32)