---
license: mit
language:
- en
- id
metrics:
- accuracy
pipeline_tag: text-classification
---
# Election Tweets Classification Model
This repository contains a fine-tuned version of the ***indolem/indobertweet-base-uncased*** model for classifying election-related tweets. The model assigns each tweet to one of eight topic classes, which supports analysis of public opinion and discourse during election periods.
## Classes
The model classifies tweets into the following categories, listed with their sample counts in the dataset (4619 samples in total):
1. **Politik** (2972 samples)
2. **Sosial Budaya** (425 samples)
3. **Ideologi** (343 samples)
4. **Pertahanan dan Keamanan** (331 samples)
5. **Ekonomi** (310 samples)
6. **Sumber Daya Alam** (157 samples)
7. **Demografi** (61 samples)
8. **Geografi** (20 samples)

The classes are label-encoded as integers as follows:

| Encoded | Label |
|:---------:|:---------------------------:|
| 0 | Demografi |
| 1 | Ekonomi |
| 2 | Geografi |
| 3 | Ideologi |
| 4 | Pertahanan dan Keamanan |
| 5 | Politik |
| 6 | Sosial Budaya |
| 7 | Sumber Daya Alam |
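
As an illustration, the encoding table above can be written as a lookup dictionary that turns a vector of class scores back into a label name. This is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, and `decode` are hypothetical and not part of this repository, and the score vector is made up for the example.

```python
# Mapping copied from the encoding table above.
ID2LABEL = {
    0: "Demografi",
    1: "Ekonomi",
    2: "Geografi",
    3: "Ideologi",
    4: "Pertahanan dan Keamanan",
    5: "Politik",
    6: "Sosial Budaya",
    7: "Sumber Daya Alam",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(scores):
    """Return the label name for the highest of the eight class scores."""
    if len(scores) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} scores, got {len(scores)}")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return ID2LABEL[best]


# Made-up scores in which index 5 is the largest.
print(decode([0.1, 0.2, 0.0, 0.3, 0.1, 2.5, 0.4, 0.2]))  # Politik
```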
## Libraries Used
The following libraries were used for data processing, model training, and evaluation:
- Data processing: `numpy`, `pandas`, `re`, `string`, `random`
- Visualization: `matplotlib.pyplot`, `seaborn`, `tqdm`, `plotly.graph_objs`, `plotly.express`, `plotly.figure_factory`
- Word cloud generation: `PIL`, `wordcloud`
- NLP: `nltk`, `nlp_id`, `Sastrawi`, `tweet-preprocessor`
- Machine Learning: `tensorflow`, `keras`, `sklearn`, `transformers`, `torch`
## Data Preparation
### Data Split
The dataset was split into training, validation, and test sets with the following proportions:
- **Training Set**: 85% (3925 samples)
- **Validation Set**: 10% (463 samples)
- **Test Set**: 5% (231 samples)
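
The exact splitting procedure and random seed are not stated, so the following is only a sketch of one way to produce an 85/10/5 split over 4619 samples. The function `split_indices` and the seed value are assumptions made for illustration. Plain rounding gives 3926/462/231, which is close to but not identical to the counts reported above.

```python
import random


def split_indices(n_samples, val_frac=0.10, test_frac=0.05, seed=42):
    """Shuffle sample indices and cut them into train/validation/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_test = round(n_samples * test_frac)
    n_val = round(n_samples * val_frac)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]  # everything left over (about 85%)
    return train, val, test


train, val, test = split_indices(4619)
print(len(train), len(val), len(test))  # 3926 462 231
```

With a class as small as Geografi (20 samples), a stratified split may be preferable so that every class appears in each subset.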
### Training Details
- **Epochs**: 3
- **Batch Size**: 32
### Training Results
| Epoch | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy |
|-------|------------|----------------|-----------------|---------------------|
| 1 | 0.9382 | 0.7167 | 0.7518 | 0.7671 |
| 2 | 0.5741 | 0.8229 | 0.7081 | 0.7931 |
| 3 | 0.3541 | 0.8958 | 0.7473 | 0.7953 |
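
One way to read the table is to compare epochs by validation loss and by validation accuracy, and to track the gap between training and validation accuracy. The sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# Values copied from the training results table above.
history = [
    {"epoch": 1, "train_loss": 0.9382, "train_acc": 0.7167, "val_loss": 0.7518, "val_acc": 0.7671},
    {"epoch": 2, "train_loss": 0.5741, "train_acc": 0.8229, "val_loss": 0.7081, "val_acc": 0.7931},
    {"epoch": 3, "train_loss": 0.3541, "train_acc": 0.8958, "val_loss": 0.7473, "val_acc": 0.7953},
]

best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_acc = max(history, key=lambda row: row["val_acc"])
gaps = [row["train_acc"] - row["val_acc"] for row in history]

print("lowest validation loss at epoch", best_by_loss["epoch"])
print("highest validation accuracy at epoch", best_by_acc["epoch"])
for row, gap in zip(history, gaps):
    print(f"epoch {row['epoch']}: train/validation accuracy gap = {gap:+.4f}")
```

Validation loss is lowest at epoch 2, while validation accuracy is marginally higher at epoch 3 and the accuracy gap grows to about 0.10, which may indicate the onset of overfitting.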
## Model Architecture
The model is built using the TensorFlow and Keras libraries and employs the following architecture:
- **Embedding Layer**: Converts input tokens into dense vectors of fixed size.
- **LSTM Layers**: Bidirectional LSTM layers capture dependencies in the text data.
- **Dense Layers**: Fully connected layers for classification.
- **Dropout Layers**: Prevent overfitting by randomly dropping units during training.
- **Batch Normalization**: Normalizes activations of the previous layer.
## Usage
### Installation
To use the model, ensure you have the required libraries installed. You can install them using pip:
```bash
pip install transformers
```
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("Rendika/tweets-election-classification")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/tweets-election-classification")
```
```python
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="Rendika/tweets-election-classification")
```
### Data Cleaning
The data was cleaned using the following steps:
1. Converted text to lowercase.
2. Removed 'RT'.
3. Removed links.
4. Removed patterns like '[RE ...]'.
5. Removed patterns like '@ ... ='.
6. Removed non-ASCII characters (including emojis).
7. Removed punctuation (excluding '#').
8. Removed excessive whitespace.
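
The steps above can be sketched as a single function using only `re` and `string`. The original cleaning code is not included here, so every regular expression below is an assumption about what each step means, in particular the patterns for '[RE ...]' and '@ ... ='. The function name `clean_tweet` is hypothetical.

```python
import re
import string

# All ASCII punctuation except '#', so hashtags survive step 7.
PUNCT_EXCEPT_HASH = string.punctuation.replace("#", "")
PUNCT_TABLE = str.maketrans("", "", PUNCT_EXCEPT_HASH)


def clean_tweet(text):
    """Apply the eight cleaning steps in the order listed above."""
    text = text.lower()                                     # 1. lowercase
    text = re.sub(r"\brt\b", " ", text)                     # 2. 'RT' (lowercased by step 1)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # 3. links
    text = re.sub(r"\[re [^\]]*\]", " ", text)              # 4. '[RE ...]' (assumed pattern)
    text = re.sub(r"@[^=]*=", " ", text)                    # 5. '@ ... =' (assumed pattern)
    text = text.encode("ascii", "ignore").decode("ascii")   # 6. non-ASCII, including emojis
    text = text.translate(PUNCT_TABLE)                      # 7. punctuation except '#'
    text = re.sub(r"\s+", " ", text).strip()                # 8. excess whitespace
    return text


raw = "RT [RE someone] @user1 = Pemilu 2024 \U0001F600 seru!!! #pemilu https://t.co/abc"
print(clean_tweet(raw))  # pemilu 2024 seru #pemilu
```

Links are removed before punctuation so that URLs are dropped whole instead of being left behind as fragments.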
### Sample Code
Here's a sample code snippet to load and use the model:
```python
import pandas as pd
from tensorflow.keras.models import load_model

# Load the trained model
model = load_model('path_to_your_model.h5')


# Preprocess new data
def preprocess_text(text):
    # Include your text preprocessing steps here;
    # lowercasing and trimming stand in for the full cleaning steps.
    return text.lower().strip()


# Example usage
new_tweets = pd.Series(["Your new tweet text here"])
preprocessed_tweets = new_tweets.apply(preprocess_text)

# Tokenize and pad sequences as done during training
# ...

# Predict the class
predictions = model.predict(preprocessed_tweets)
predicted_classes = predictions.argmax(axis=-1)
```
## Evaluation
The model was evaluated using the following metrics:
- **Precision**: Fraction of tweets predicted as a class that actually belong to it.
- **Recall**: Fraction of tweets belonging to a class that the model correctly identifies.
- **F1 Score**: Harmonic mean of precision and recall.
- **Accuracy**: Fraction of all tweets classified correctly.
- **Balanced Accuracy**: Mean of per-class recall, which compensates for class imbalance.
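
To make these definitions concrete, here is a minimal pure-Python sketch that computes them from true and predicted class indices. The averaging scheme is not stated above, so macro averaging (an unweighted mean over classes) is an assumption, as are the function name `classification_metrics` and the toy labels. In practice `sklearn.metrics` offers equivalents such as `precision_recall_fscore_support` and `balanced_accuracy_score`.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, and macro-averaged precision/recall/F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    labels = sorted(set(y_true) | set(y_pred))
    precision, recall, f1 = {}, {}, {}
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision[c] = tp[c] / predicted if predicted else 0.0
        recall[c] = tp[c] / actual if actual else 0.0
        denom = precision[c] + recall[c]
        f1[c] = 2 * precision[c] * recall[c] / denom if denom else 0.0

    true_classes = sorted(set(y_true))
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        # Balanced accuracy: mean recall over the classes present in y_true.
        "balanced_accuracy": sum(recall[c] for c in true_classes) / len(true_classes),
        "macro_precision": sum(precision.values()) / len(labels),
        "macro_recall": sum(recall.values()) / len(labels),
        "macro_f1": sum(f1.values()) / len(labels),
    }


# Toy example using the encoded labels (5 = Politik, 1 = Ekonomi, ...).
y_true = [5, 5, 5, 5, 1, 1, 6, 2]
y_pred = [5, 5, 5, 1, 1, 5, 6, 5]
print(classification_metrics(y_true, y_pred))
```

In the toy example the single Geografi tweet (class 2) is misclassified, so balanced accuracy (0.5625) falls below plain accuracy (0.625), which is the kind of effect that matters for a dataset dominated by Politik.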
## Conclusion
This fine-tuned model classifies election-related tweets into eight topic categories, reaching a validation accuracy of about 0.80 after three epochs. It can be used to analyze which topics dominate public discourse during election periods.
## License
This project is licensed under the MIT License.
## Contact
For any questions or feedback, please contact me at rendikarendi96@gmail.com.