---
license: mit
language:
- en
- id
metrics:
- accuracy
pipeline_tag: text-classification
---
# Election Tweets Classification Model
This repository contains a fine-tuned version of the ***indolem/indobertweet-base-uncased*** model for classifying tweets related to election topics. The model was trained to categorize tweets into eight classes, which makes it useful for analyzing public opinion and discourse during election periods.
## Classes
The model classifies tweets into the following categories:
1. **Politik** (2972 samples)
2. **Sosial Budaya** (425 samples)
3. **Ideologi** (343 samples)
4. **Pertahanan dan Keamanan** (331 samples)
5. **Ekonomi** (310 samples)
6. **Sumber Daya Alam** (157 samples)
7. **Demografi** (61 samples)
8. **Geografi** (20 samples)
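The sample counts above are heavily imbalanced: Politik alone accounts for roughly 64% of the 4,619 tweets. The sketch below computes each class's share and an inverse-frequency class weight from those counts. The weighting formula (`total / (n_classes * count)`) is a common heuristic and is an assumption here; the source does not state whether class weights were used during training.

```python
# Sample counts per class, copied from the list above.
class_counts = {
    "Politik": 2972,
    "Sosial Budaya": 425,
    "Ideologi": 343,
    "Pertahanan dan Keamanan": 331,
    "Ekonomi": 310,
    "Sumber Daya Alam": 157,
    "Demografi": 61,
    "Geografi": 20,
}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse-frequency weights (assumed heuristic, not confirmed by the source):
# rare classes receive larger weights than frequent ones.
class_weights = {
    name: total / (n_classes * count) for name, count in class_counts.items()
}

for name, count in class_counts.items():
    share = count / total
    print(f"{name:<25} share={share:6.2%}  weight={class_weights[name]:.3f}")
```

Weights of this kind could be passed to a loss function to give rare classes such as Geografi more influence, if that is desired.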
## Libraries Used
The following libraries were used for data processing, model training, and evaluation:
- Data processing: `numpy`, `pandas`, `re`, `string`, `random`
- Visualization: `matplotlib.pyplot`, `seaborn`, `tqdm`, `plotly.graph_objs`, `plotly.express`, `plotly.figure_factory`
- Word cloud generation: `PIL`, `wordcloud`
- NLP: `nltk`, `nlp_id`, `Sastrawi`, `tweet-preprocessor`
- Machine Learning: `tensorflow`, `keras`, `sklearn`, `transformers`, `torch`
## Data Preparation
### Data Split
The dataset was split into training, validation, and test sets with the following proportions:
- **Training Set**: 85% (3925 samples)
- **Validation Set**: 10% (463 samples)
- **Test Set**: 5% (231 samples)
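The source does not say how the split was drawn. Because the classes are imbalanced, one reasonable option is a stratified split, which keeps each class's proportion roughly equal across the three sets. The following is a minimal standard-library sketch under that assumption; the function name `stratified_split`, the seed, and the rounding rule are illustrative and will not necessarily reproduce the exact set sizes reported above.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.10, test_frac=0.05, seed=42):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, val, test

# Toy labels using the counts of the largest and smallest classes.
labels = ["Politik"] * 2972 + ["Geografi"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With this approach even the smallest class contributes at least one sample to the test set, which a purely random 5% draw does not guarantee.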
### Training Details
- **Epochs**: 3
- **Batch Size**: 32
### Training Results
| Epoch | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy |
|-------|------------|----------------|-----------------|---------------------|
| 1 | 0.9382 | 0.7167 | 0.7518 | 0.7671 |
| 2 | 0.5741 | 0.8229 | 0.7081 | 0.7931 |
| 3 | 0.3541 | 0.8958 | 0.7473 | 0.7953 |
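The table can be read programmatically to see where training and validation diverge. The sketch below copies the reported numbers into a list, then finds the epoch with the lowest validation loss and the one with the highest validation accuracy. The variable names are illustrative; only the numbers come from the table.

```python
# Per-epoch results copied from the table above.
history = [
    {"epoch": 1, "train_loss": 0.9382, "train_acc": 0.7167, "val_loss": 0.7518, "val_acc": 0.7671},
    {"epoch": 2, "train_loss": 0.5741, "train_acc": 0.8229, "val_loss": 0.7081, "val_acc": 0.7931},
    {"epoch": 3, "train_loss": 0.3541, "train_acc": 0.8958, "val_loss": 0.7473, "val_acc": 0.7953},
]

best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_acc = max(history, key=lambda row: row["val_acc"])

for row in history:
    # A growing gap between training and validation accuracy suggests overfitting.
    gap = row["train_acc"] - row["val_acc"]
    print(f"epoch {row['epoch']}: train/val accuracy gap = {gap:+.4f}")

print("lowest validation loss at epoch", best_by_loss["epoch"])
print("highest validation accuracy at epoch", best_by_acc["epoch"])
```

Validation loss is lowest at epoch 2 while validation accuracy is marginally higher at epoch 3, so which checkpoint to keep depends on the selection criterion.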
## Model Architecture
The model is built using the TensorFlow and Keras libraries and employs the following architecture:
- **Embedding Layer**: Converts input tokens into dense vectors of fixed size.
- **LSTM Layers**: Bidirectional LSTM layers capture dependencies in the text data.
- **Dense Layers**: Fully connected layers for classification.
- **Dropout Layers**: Prevent overfitting by randomly dropping units during training.
- **Batch Normalization**: Normalizes activations of the previous layer.
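As an illustration only, the layer types listed above could be combined in Keras roughly as follows. The layer order, every size (`vocab_size`, `embed_dim`, LSTM units, dense units), the dropout rate, the optimizer, and the loss are all assumptions made for this sketch; the source does not specify them, and `build_model` is a hypothetical helper name.

```python
import numpy as np
from tensorflow.keras import layers, models

def build_model(vocab_size=30000, embed_dim=128, num_classes=8):
    """Hypothetical BiLSTM classifier; all hyperparameters are assumed."""
    model = models.Sequential([
        # Maps token ids to dense vectors of size embed_dim.
        layers.Embedding(vocab_size, embed_dim),
        # Stacked bidirectional LSTMs; the first returns the full sequence.
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(32)),
        layers.BatchNormalization(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        # One output probability per class.
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
# A dummy batch of 2 sequences with 20 token ids each.
dummy_batch = np.zeros((2, 20), dtype="int32")
print(model(dummy_batch).shape)
```

The final layer has eight units to match the eight classes listed under Classes.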
## Usage
### Installation
To use the model, ensure you have the required libraries installed. You can install them using pip:
```bash
pip install numpy pandas matplotlib seaborn plotly tqdm pillow wordcloud nltk nlp-id Sastrawi tweet-preprocessor tensorflow keras scikit-learn transformers torch
```
### Data Cleaning
The data was cleaned using the following steps:
1. Converted text to lowercase.
2. Removed 'RT'.
3. Removed links.
4. Removed patterns like '[RE ...]'.
5. Removed patterns like '@ ... ='.
6. Removed non-ASCII characters (including emojis).
7. Removed punctuation (excluding '#').
8. Removed excessive whitespace.
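The steps above can be sketched with the standard library as follows. The regular expressions are one possible reading of each step, not the original cleaning code: in particular, the patterns for `[RE ...]` and `@ ... =` and the choice to replace punctuation with a space rather than delete it are assumptions, and `clean_tweet` is a hypothetical name.

```python
import re
import string

# All punctuation except '#', so hashtags survive cleaning.
PUNCT_TABLE = str.maketrans(
    {ch: " " for ch in string.punctuation.replace("#", "")}
)

def clean_tweet(text):
    """Apply the eight cleaning steps in order (assumed interpretation)."""
    text = text.lower()                                # 1. lowercase
    text = re.sub(r"\brt\b", " ", text)                # 2. remove 'RT'
    text = re.sub(r"https?://\S+|www\.\S+", " ", text) # 3. remove links
    text = re.sub(r"\[re[^\]]*\]", " ", text)          # 4. remove '[RE ...]'
    text = re.sub(r"@.*?=", " ", text)                 # 5. remove '@ ... ='
    text = text.encode("ascii", "ignore").decode("ascii")  # 6. drop non-ASCII
    text = text.translate(PUNCT_TABLE)                 # 7. punctuation except '#'
    return re.sub(r"\s+", " ", text).strip()           # 8. collapse whitespace

print(clean_tweet("RT @user: Pemilu 2024!! https://t.co/abc #Pemilu"))
# -> user pemilu 2024 #pemilu
```

Because the text is lowercased first, the later patterns match the lowercase forms `rt` and `[re ...]`.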
### Sample Code
Here's a sample code snippet to load and use the model:
```python
import pandas as pd
from tensorflow.keras.models import load_model

# Load the trained model (replace with the path to your saved model file)
model = load_model('path_to_your_model.h5')

def preprocess_text(text):
    """Apply the same cleaning steps that were used during training."""
    # Include your text preprocessing steps here (see "Data Cleaning")
    return text.lower().strip()

# Example usage
new_tweets = pd.Series(["Your new tweet text here"])
preprocessed_tweets = new_tweets.apply(preprocess_text)

# Tokenize and pad sequences as done during training
# ...
model_inputs = preprocessed_tweets  # replace with the tokenized, padded inputs

# Predict the class: one probability per class, then take the most likely one
predictions = model.predict(model_inputs)
predicted_classes = predictions.argmax(axis=-1)
```
## Evaluation
The model was evaluated using the following metrics:
- **Precision**: Fraction of tweets predicted as a class that actually belong to it.
- **Recall**: Fraction of tweets belonging to a class that the model correctly identifies.
- **F1 Score**: Harmonic mean of precision and recall.
- **Accuracy**: Fraction of all tweets classified correctly.
- **Balanced Accuracy**: Mean of per-class recall, which compensates for class imbalance.
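To make these definitions concrete, the sketch below computes the metrics from lists of true and predicted labels using only the standard library. The function name `classification_metrics` and the toy labels are illustrative and are not results from this model; how the per-class scores were averaged in the original evaluation is not stated in the source.

```python
from collections import Counter

def classification_metrics(y_true, y_pred):
    """Return (per-class scores, accuracy, balanced accuracy)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    per_class = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        predicted = tp[cls] + fp[cls]
        actual = tp[cls] + fn[cls]
        precision = tp[cls] / predicted if predicted else 0.0
        recall = tp[cls] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[cls] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(tp.values()) / len(y_true)
    # Balanced accuracy: mean recall over the classes present in y_true.
    true_classes = sorted(set(y_true))
    balanced = sum(per_class[c]["recall"] for c in true_classes) / len(true_classes)
    return per_class, accuracy, balanced

# Toy example (not model output).
y_true = ["Politik", "Politik", "Politik", "Ekonomi", "Ekonomi", "Geografi"]
y_pred = ["Politik", "Politik", "Ekonomi", "Ekonomi", "Politik", "Geografi"]
per_class, accuracy, balanced = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.4f} balanced_accuracy={balanced:.4f}")
```

In the toy example, balanced accuracy is higher than plain accuracy because the single Geografi sample is classified correctly and counts as much as the larger classes.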
## Conclusion
This fine-tuned model classifies election-related tweets into eight categories and reached a validation accuracy of about 79.5% after three epochs. It can be used to analyze the topics of public discussion during election periods.
## License
This project is licensed under the MIT License.
## Contact
For any questions or feedback, please contact me at rendikarendi96@gmail.com.