|
--- |
|
license: mit |
|
language: |
|
- en |
|
- id |
|
metrics: |
|
- accuracy |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# Election Tweets Classification Model |
|
|
|
This repository contains a fine-tuned version of the ***indolem/indobertweet-base-uncased*** model for classifying tweets related to election topics. The model has been trained to categorize tweets into eight distinct classes, providing insight into public opinion and discourse during election periods.
|
|
|
## Classes |
|
|
|
The model classifies tweets into the following categories: |
|
1. **Politik** (2972 samples) |
|
2. **Sosial Budaya** (425 samples) |
|
3. **Ideologi** (343 samples) |
|
4. **Pertahanan dan Keamanan** (331 samples) |
|
5. **Ekonomi** (310 samples) |
|
6. **Sumber Daya Alam** (157 samples) |
|
7. **Demografi** (61 samples) |
|
8. **Geografi** (20 samples) |
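
The distribution above is heavily skewed: **Politik** alone accounts for roughly 64% of the 4,619 samples, while **Geografi** has only 20. One common way to compensate for this is inverse-frequency class weighting; the sketch below is illustrative, not necessarily what was used for this model:

```python
# Class distribution, taken from the list above
class_counts = {
    "Politik": 2972, "Sosial Budaya": 425, "Ideologi": 343,
    "Pertahanan dan Keamanan": 331, "Ekonomi": 310,
    "Sumber Daya Alam": 157, "Demografi": 61, "Geografi": 20,
}

total = sum(class_counts.values())  # 4619 tweets in total

# Inverse-frequency weights: rare classes get proportionally larger weights
class_weights = {label: total / (len(class_counts) * n)
                 for label, n in class_counts.items()}
```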
|
|
|
## Libraries Used |
|
|
|
The following libraries were used for data processing, model training, and evaluation: |
|
|
|
- Data processing: `numpy`, `pandas`, `re`, `string`, `random` |
|
- Visualization: `matplotlib.pyplot`, `seaborn`, `tqdm`, `plotly.graph_objs`, `plotly.express`, `plotly.figure_factory` |
|
- Word cloud generation: `PIL`, `wordcloud` |
|
- NLP: `nltk`, `nlp_id`, `Sastrawi`, `tweet-preprocessor` |
|
- Machine Learning: `tensorflow`, `keras`, `sklearn`, `transformers`, `torch` |
|
|
|
## Data Preparation |
|
|
|
### Data Split |
|
The dataset was split into training, validation, and test sets with the following proportions: |
|
|
|
- **Training Set**: 85% (3925 samples) |
|
- **Validation Set**: 10% (463 samples) |
|
- **Test Set**: 5% (231 samples) |
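
These proportions can be reproduced with a two-stage `train_test_split`: first carve off the 85% training set, then split the remainder 2:1 into validation and test. The placeholder data and `random_state` below are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split

# Stand-ins for the real tweets and labels (4,619 samples total)
texts = [f"tweet {i}" for i in range(4619)]
labels = [i % 8 for i in range(4619)]

# Stage 1: 85% training, 15% remainder
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, train_size=0.85, random_state=42)

# Stage 2: split the remainder 2:1 into validation (10%) and test (5%)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```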
|
|
|
### Training Details |
|
- **Epochs**: 3 |
|
- **Batch Size**: 32 |
|
|
|
### Training Results |
|
|
|
| Epoch | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | |
|
|-------|------------|----------------|-----------------|---------------------| |
|
| 1 | 0.9382 | 0.7167 | 0.7518 | 0.7671 | |
|
| 2 | 0.5741 | 0.8229 | 0.7081 | 0.7931 | |
|
| 3 | 0.3541 | 0.8958 | 0.7473 | 0.7953 | |
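
Note that validation loss bottoms out at epoch 2 and rises again at epoch 3 while validation accuracy barely improves, an early sign of overfitting; selecting the checkpoint by validation loss can be sketched as:

```python
# Validation metrics copied from the table above
val_loss = [0.7518, 0.7081, 0.7473]
val_acc = [0.7671, 0.7931, 0.7953]

# Pick the epoch (1-indexed) with the lowest validation loss
best_epoch = val_loss.index(min(val_loss)) + 1
print(best_epoch)  # epoch 2
```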
|
|
|
## Model Architecture |
|
|
|
The model is built using the TensorFlow and Keras libraries and employs the following architecture: |
|
|
|
- **Embedding Layer**: Converts input tokens into dense vectors of fixed size. |
|
- **LSTM Layers**: Bidirectional LSTM layers capture dependencies in the text data. |
|
- **Dense Layers**: Fully connected layers for classification. |
|
- **Dropout Layers**: Prevent overfitting by randomly dropping units during training. |
|
- **Batch Normalization**: Normalizes activations of the previous layer. |
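
A minimal Keras sketch of the architecture described above; the vocabulary size, sequence length, and layer widths are assumptions for illustration, not the actual training configuration:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 30000  # assumed vocabulary size
MAX_LEN = 64        # assumed maximum sequence length
NUM_CLASSES = 8     # the eight tweet categories

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),        # tokens -> dense vectors
    layers.Bidirectional(layers.LSTM(64)),    # bidirectional sequence encoder
    layers.Dense(64, activation="relu"),      # fully connected layer
    layers.BatchNormalization(),              # normalize activations
    layers.Dropout(0.5),                      # regularization
    layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities
])
```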
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
To use the model, ensure you have the required libraries installed. You can install them using pip: |
|
|
|
```bash |
|
pip install numpy pandas matplotlib seaborn plotly pillow wordcloud nltk tensorflow keras scikit-learn transformers torch tqdm Sastrawi tweet-preprocessor nlp-id
|
``` |
|
|
|
### Data Cleaning |
|
|
|
The data was cleaned using the following steps: |
|
1. Converted text to lowercase. |
|
2. Removed 'RT'. |
|
3. Removed links. |
|
4. Removed patterns like '[RE ...]'. |
|
5. Removed patterns like '@ ... ='. |
|
6. Removed non-ASCII characters (including emojis). |
|
7. Removed punctuation (excluding '#'). |
|
8. Removed excessive whitespace. |
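
The steps above can be sketched as a single cleaning function; the exact regular expressions (especially for the `[RE ...]` and `@ ... =` patterns) are assumptions about the original pipeline:

```python
import re
import string

def clean_tweet(text: str) -> str:
    text = text.lower()                                # 1. lowercase
    text = re.sub(r"\brt\b", "", text)                 # 2. remove 'RT'
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # 3. remove links
    text = re.sub(r"\[re[^\]]*\]", "", text)           # 4. remove '[RE ...]' patterns
    text = re.sub(r"@\S+\s*=?", "", text)              # 5. remove '@ ... =' patterns
    text = text.encode("ascii", "ignore").decode()     # 6. drop non-ASCII (emojis)
    text = "".join(c for c in text                     # 7. punctuation, keep '#'
                   if c not in string.punctuation or c == "#")
    return re.sub(r"\s+", " ", text).strip()           # 8. collapse whitespace
```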
|
|
|
### Sample Code |
|
|
|
Here's a sample code snippet to load and use the model: |
|
|
|
```python

import re

import string

import pandas as pd

from tensorflow.keras.models import load_model

from tensorflow.keras.preprocessing.sequence import pad_sequences



# Load the trained model

model = load_model('path_to_your_model.h5')



# Preprocess new data (must mirror the cleaning applied during training)

def preprocess_text(text):

    text = text.lower()

    text = re.sub(r"https?://\S+", "", text)  # remove links

    text = "".join(c for c in text if c not in string.punctuation or c == "#")

    return re.sub(r"\s+", " ", text).strip()



# Example usage

new_tweets = pd.Series(["Your new tweet text here"])

preprocessed_tweets = new_tweets.apply(preprocess_text)



# Tokenize and pad exactly as during training: `tokenizer` and `MAX_LEN`

# must be the fitted tokenizer and sequence length saved from training

sequences = tokenizer.texts_to_sequences(preprocessed_tweets)

padded = pad_sequences(sequences, maxlen=MAX_LEN)



# Predict the class

predictions = model.predict(padded)

predicted_classes = predictions.argmax(axis=-1)

```
|
|
|
## Evaluation |
|
|
|
The model was evaluated using the following metrics: |
|
- **Precision**: Fraction of positive predictions that are correct.

- **Recall**: Fraction of actual positives that the model retrieves.
|
- **F1 Score**: Harmonic mean of precision and recall. |
|
- **Accuracy**: Overall accuracy of the model. |
|
- **Balanced Accuracy**: Accuracy adjusted for class imbalance. |
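
All of these metrics are available in `sklearn.metrics`; a toy illustration (the labels below are made up for demonstration, not model outputs):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

# Toy 3-class labels (the real model has 8 classes)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]

print("accuracy:", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
# Macro averaging treats every class equally, which matters
# under class imbalance
print("macro precision:", precision_score(y_true, y_pred, average="macro"))
print("macro recall:", recall_score(y_true, y_pred, average="macro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```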
|
|
|
## Conclusion |
|
|
|
This fine-tuned model provides a robust tool for classifying election-related tweets into distinct categories. It can be used to analyze public sentiment and trends during election periods, aiding in better understanding and decision-making. |
|
|
|
## License |
|
|
|
This project is licensed under the MIT License. |
|
|
|
## Contact |
|
|
|
For any questions or feedback, please contact me at [rendikarendi96@gmail.com](mailto:rendikarendi96@gmail.com).