---
license: mit
language:
- en
- id
metrics:
- accuracy
pipeline_tag: text-classification
---

# Election Tweets Classification Model

This repository contains a fine-tuned version of the ***indolem/indobertweet-base-uncased*** model for classifying tweets related to election topics. The model has been trained to categorize tweets into eight distinct classes, which makes it possible to see which topics dominate public discourse during election periods.

## Classes

The model classifies tweets into the following categories:

1. **Politik** (2972 samples)
2. **Sosial Budaya** (425 samples)
3. **Ideologi** (343 samples)
4. **Pertahanan dan Keamanan** (331 samples)
5. **Ekonomi** (310 samples)
6. **Sumber Daya Alam** (157 samples)
7. **Demografi** (61 samples)
8. **Geografi** (20 samples)

| Encoded | Label                       |
|:---------:|:---------------------------:|
| 0       | Demografi                   |
| 1       | Ekonomi                     |
| 2       | Geografi                    |
| 3       | Ideologi                    |
| 4       | Pertahanan dan Keamanan     |
| 5       | Politik                     |
| 6       | Sosial Budaya               |
| 7       | Sumber Daya Alam            |

## Libraries Used

The following libraries were used for data processing, model training, and evaluation:

- Data processing: `numpy`, `pandas`, `re`, `string`, `random`
- Visualization: `matplotlib.pyplot`, `seaborn`, `tqdm`, `plotly.graph_objs`, `plotly.express`, `plotly.figure_factory`
- Word cloud generation: `PIL`, `wordcloud`
- NLP: `nltk`, `nlp_id`, `Sastrawi`, `tweet-preprocessor`
- Machine Learning: `tensorflow`, `keras`, `sklearn`, `transformers`, `torch`

## Data Preparation

### Data Split

The dataset of 4619 tweets was split into training, validation, and test sets with the following proportions:

- **Training Set**: 85% (3925 samples)
- **Validation Set**: 10% (463 samples)
- **Test Set**: 5% (231 samples)

### Training Details

- **Epochs**: 3
- **Batch Size**: 32

### Training Results

| Epoch | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy |
|-------|------------|----------------|-----------------|---------------------|
| 1     | 0.9382     | 0.7167         | 0.7518          | 0.7671              |
| 2     | 0.5741     | 0.8229         | 0.7081          | 0.7931              |
| 3     | 0.3541     | 0.8958         | 0.7473          | 0.7953              |

## Model Architecture

The model is built using the TensorFlow and Keras libraries and employs the following architecture:

- **Embedding Layer**: Converts input tokens into dense vectors of fixed size.
- **LSTM Layers**: Bidirectional LSTM layers capture dependencies in the text data.
- **Dense Layers**: Fully connected layers for classification.
- **Dropout Layers**: Prevent overfitting by randomly dropping units during training.
- **Batch Normalization**: Normalizes activations of the previous layer.

## Usage

### Installation

To use the model, ensure you have the required libraries installed. You can install them using pip:

```bash
pip install transformers
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/tweets-election-classification")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/tweets-election-classification")
```

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/tweets-election-classification")
```

### Data Cleaning

The data was cleaned using the following steps:

1. Converted text to lowercase.
2. Removed 'RT'.
3. Removed links.
4. Removed patterns like '[RE ...]'.
5. Removed patterns like '@ ... ='.
6. Removed non-ASCII characters (including emojis).
7. Removed punctuation (excluding '#').
8. Removed excessive whitespace.

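
The eight cleaning steps above can be sketched as a single function. This is a minimal illustration, not the exact code used to prepare the training data: the function name `clean_tweet` and every regular expression in it are assumptions chosen to match the step descriptions, so adjust them if your data uses different markers.

```python
import re
import string

# Punctuation to strip: everything in string.punctuation except '#'.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("#", ""))


def clean_tweet(text: str) -> str:
    """Apply the eight cleaning steps in order (all patterns are assumptions)."""
    text = text.lower()                                     # 1. lowercase
    text = re.sub(r"\brt\b", " ", text)                     # 2. remove the 'RT' marker
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # 3. remove links
    text = re.sub(r"\[re\b[^\]]*\]", " ", text)             # 4. remove '[RE ...]'
    text = re.sub(r"@[^=]*=", " ", text)                    # 5. remove '@ ... ='
    text = text.encode("ascii", "ignore").decode("ascii")   # 6. drop non-ASCII (emojis)
    text = text.translate(_PUNCT_TABLE)                     # 7. drop punctuation, keep '#'
    return re.sub(r"\s+", " ", text).strip()                # 8. collapse whitespace


if __name__ == "__main__":
    raw = "RT [RE anies] @user = Pemilu 2024 #Pilpres https://t.co/abc  seru!!!"
    print(clean_tweet(raw))  # pemilu 2024 #pilpres seru
```

Because the text is lowercased first, the later patterns match `rt` and `[re ...]` in lowercase. Applying the same function to new tweets before inference keeps the input distribution close to what the model saw during training.
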
### Sample Code

Here's a sample code snippet to load and use the model:

```python
import tensorflow as tf
from tensorflow.keras.models import load_model
import pandas as pd

# Load the trained model
model = load_model('path_to_your_model.h5')

# Preprocess new data
def preprocess_text(text):
    # Include your text preprocessing steps here.
    # Minimal placeholder so the function returns a string instead of None.
    return text.lower().strip()

# Example usage
new_tweets = pd.Series(["Your new tweet text here"])
preprocessed_tweets = new_tweets.apply(preprocess_text)

# Tokenize and pad sequences as done during training
# ...

# Predict the class
predictions = model.predict(preprocessed_tweets)
predicted_classes = predictions.argmax(axis=-1)
```

## Evaluation

The model was evaluated using the following metrics:

- **Precision**: The fraction of predictions for a class that are correct.
- **Recall**: The fraction of a class's actual instances that the model finds.
- **F1 Score**: Harmonic mean of precision and recall.
- **Accuracy**: The fraction of all tweets that are classified correctly.
- **Balanced Accuracy**: The mean of per-class recall, which compensates for class imbalance.

## Conclusion

This fine-tuned model classifies election-related tweets into eight topic categories. It can be used to analyze public discourse and trends during election periods, supporting better understanding and decision-making.

## License

This project is licensed under the MIT License.

## Contact

For any questions or feedback, please contact me at rendikarendi96@gmail.com.