---
license: mit
language:
- en
- id
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- legal
---

# Election Tweets Classification Model

This repository contains a fine-tuned version of the ***indolem/indobertweet-base-uncased*** model for classifying tweets related to election topics. The model has been trained to categorize tweets into eight distinct classes, which helps in analyzing public opinion and discourse during election periods.

## Classes

The model classifies tweets into the following categories:
1. **Politik** (2972 samples)
2. **Sosial Budaya** (425 samples)
3. **Ideologi** (343 samples)
4. **Pertahanan dan Keamanan** (331 samples)
5. **Ekonomi** (310 samples)
6. **Sumber Daya Alam** (157 samples)
7. **Demografi** (61 samples)
8. **Geografi** (20 samples)
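
The counts above are heavily skewed toward **Politik**, which matters when reading the accuracy figures later in this card. As a rough sketch (plain arithmetic on the numbers listed, not part of the original training code), the class shares and the accuracy of a trivial "always predict the largest class" baseline can be computed like this:

```python
# Sample counts per class, copied from the list above.
class_counts = {
    "Politik": 2972,
    "Sosial Budaya": 425,
    "Ideologi": 343,
    "Pertahanan dan Keamanan": 331,
    "Ekonomi": 310,
    "Sumber Daya Alam": 157,
    "Demografi": 61,
    "Geografi": 20,
}

total = sum(class_counts.values())
class_shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

for label, share in class_shares.items():
    print(f"{label:<26}{share:6.1%}")
print(f"Total samples: {total}")
print(f"Majority baseline ({majority_label}): {majority_baseline:.1%}")
```

Any reported accuracy is best compared against that baseline rather than against zero.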

| Encoded | Label                     |
|:---------:|:---------------------------:|
| 0       | Demografi                 |
| 1       | Ekonomi                   |
| 2       | Geografi                  |
| 3       | Ideologi                  |
| 4       | Pertahanan dan Keamanan   |
| 5       | Politik                   |
| 6       | Sosial Budaya             |
| 7       | Sumber Daya Alam          |
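
To turn encoded predictions back into label names, the table above can be written as a small lookup. This is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, and `decode_predictions` are hypothetical helpers, and whether the hosted checkpoint's config already carries these label names is not stated here. The ids happen to follow the alphabetical order of the labels, which is what a label encoder would typically produce.

```python
# Encoded id -> label, copied from the table above.
ID2LABEL = {
    0: "Demografi",
    1: "Ekonomi",
    2: "Geografi",
    3: "Ideologi",
    4: "Pertahanan dan Keamanan",
    5: "Politik",
    6: "Sosial Budaya",
    7: "Sumber Daya Alam",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_predictions(class_ids):
    """Map an iterable of encoded class ids to their label names."""
    return [ID2LABEL[int(i)] for i in class_ids]


print(decode_predictions([5, 1, 7]))  # ['Politik', 'Ekonomi', 'Sumber Daya Alam']
```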

## Libraries Used

The following libraries were used for data processing, model training, and evaluation:

- Data processing: `numpy`, `pandas`, `re`, `string`, `random`
- Visualization: `matplotlib.pyplot`, `seaborn`, `tqdm`, `plotly.graph_objs`, `plotly.express`, `plotly.figure_factory`
- Word cloud generation: `PIL`, `wordcloud`
- NLP: `nltk`, `nlp_id`, `Sastrawi`, `tweet-preprocessor`
- Machine Learning: `tensorflow`, `keras`, `sklearn`, `transformers`, `torch`

## Data Preparation

### Data Split
The dataset was split into training, validation, and test sets with the following proportions:

- **Training Set**: 85% (3925 samples)
- **Validation Set**: 10% (463 samples)
- **Test Set**: 5% (231 samples)
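
The exact splitting procedure is not described in this card. The following is only a sketch of one way to reproduce split sizes like those above, assuming a plain shuffled (non-stratified) split; the function name `split_indices` and the seed are illustrative choices.

```python
import random


def split_indices(n_samples, n_val, n_test, seed=42):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# 4619 is the sum of the per-class sample counts reported in this card.
train_idx, val_idx, test_idx = split_indices(4619, n_val=463, n_test=231)
print(len(train_idx), len(val_idx), len(test_idx))  # 3925 463 231
```

With only 20 **Geografi** samples, a stratified split (for example via the `stratify` argument of scikit-learn's `train_test_split`) may be preferable so that every class appears in each subset.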

### Training Details
- **Epochs**: 3
- **Batch Size**: 32

### Training Results

| Epoch | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy |
|-------|------------|----------------|-----------------|---------------------|
| 1     | 0.9382     | 0.7167         | 0.7518          | 0.7671              |
| 2     | 0.5741     | 0.8229         | 0.7081          | 0.7931              |
| 3     | 0.3541     | 0.8958         | 0.7473          | 0.7953              |
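
In the table above, validation loss is lowest at epoch 2 and rises again at epoch 3 while training loss keeps falling, which is often an early sign of overfitting. The short sketch below only re-reads the reported numbers to make that visible; it is not taken from the training code.

```python
# (epoch, train_loss, train_acc, val_loss, val_acc), copied from the table above.
history = [
    (1, 0.9382, 0.7167, 0.7518, 0.7671),
    (2, 0.5741, 0.8229, 0.7081, 0.7931),
    (3, 0.3541, 0.8958, 0.7473, 0.7953),
]

best_by_loss = min(history, key=lambda row: row[3])
best_by_acc = max(history, key=lambda row: row[4])

for epoch, train_loss, train_acc, val_loss, val_acc in history:
    gap = train_acc - val_acc  # a growing gap suggests overfitting
    print(f"epoch {epoch}: val_loss={val_loss:.4f}  train-val accuracy gap={gap:+.4f}")

print("lowest validation loss at epoch", best_by_loss[0])
print("highest validation accuracy at epoch", best_by_acc[0])
```

Which checkpoint to keep depends on whether validation loss or validation accuracy is the selection criterion; the two disagree here by a small margin.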

## Model Architecture

The model is built using the TensorFlow and Keras libraries and employs the following architecture:

- **Embedding Layer**: Converts input tokens into dense vectors of fixed size.
- **LSTM Layers**: Bidirectional LSTM layers capture dependencies in the text data.
- **Dense Layers**: Fully connected layers for classification.
- **Dropout Layers**: Prevent overfitting by randomly dropping units during training.
- **Batch Normalization**: Normalizes activations of the previous layer.

## Usage

### Installation

To use the model, ensure you have the required libraries installed. You can install them using pip:

```bash
pip install transformers torch
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/tweets-election-classification")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/tweets-election-classification")
```

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/tweets-election-classification")
```

### Data Cleaning

The data was cleaned using the following steps:
1. Converted text to lowercase.
2. Removed 'RT'.
3. Removed links.
4. Removed patterns like '[RE ...]'.
5. Removed patterns like '@ ... ='.
6. Removed non-ASCII characters (including emojis).
7. Removed punctuation (excluding '#').
8. Removed excessive whitespace.
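
The steps above can be sketched with the standard `re` and `string` modules. This is an approximation, not the original cleaning script: the function name `clean_tweet` is hypothetical, and the regular expressions, especially those for the '[RE ...]' and '@ ... =' patterns, are assumptions about what those steps matched.

```python
import re
import string

# Translation table mapping every ASCII punctuation mark except '#' to a space.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation if ch != "#"})


def clean_tweet(text):
    """Apply the eight cleaning steps in order (patterns are assumptions)."""
    text = text.lower()                                    # 1. lowercase
    text = re.sub(r"\brt\b", " ", text)                    # 2. 'RT' (now 'rt')
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)    # 3. links
    text = re.sub(r"\[re [^\]]*\]", " ", text)             # 4. '[RE ...]'
    text = re.sub(r"@[^=]*=", " ", text)                   # 5. '@ ... ='
    text = text.encode("ascii", "ignore").decode("ascii")  # 6. non-ASCII, emojis
    text = text.translate(PUNCT_TO_SPACE)                  # 7. punctuation, keep '#'
    return re.sub(r"\s+", " ", text).strip()               # 8. extra whitespace


raw = "RT @user1 = Pemilu 2024 🗳 https://t.co/abc #Pilpres!!  [RE user2]"
print(clean_tweet(raw))  # pemilu 2024 #pilpres
```

Note that removing every standalone "rt" also drops legitimate uses of the token (for example the Indonesian neighbourhood unit "RT"), so a stricter pattern may be needed for some data.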

### Sample Code

Here's a sample code snippet to load and use the model:

```python
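import re

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rendika/tweets-election-classification"

# Load the fine-tuned tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Preprocess new data
def preprocess_text(text):
    # Minimal placeholder: lowercase and collapse whitespace.
    # Include the full cleaning steps listed above for best results.
    return re.sub(r"\s+", " ", text.lower()).strip()

# Example usage
new_tweets = ["Your new tweet text here"]
preprocessed_tweets = [preprocess_text(tweet) for tweet in new_tweets]

# Tokenize and pad the batch so it can be fed to the model
inputs = tokenizer(preprocessed_tweets, padding=True, truncation=True, return_tensors="pt")

# Predict the class (encoded ids follow the table in the Classes section)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_classes = logits.argmax(dim=-1).tolist()
print(predicted_classes)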
```

## Evaluation

The model was evaluated using the following metrics:
- **Precision**: Fraction of tweets predicted as a class that actually belong to it.
- **Recall**: Fraction of tweets belonging to a class that the model correctly identifies.
- **F1 Score**: Harmonic mean of precision and recall.
- **Accuracy**: Fraction of all tweets classified correctly.
- **Balanced Accuracy**: Mean of per-class recall, which compensates for class imbalance.
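
As an illustration of how these metrics relate, here is a dependency-free sketch that computes them from true and predicted class ids. Macro averaging is an assumption, since the averaging mode used for this model is not stated, and the function name `classification_metrics` and the toy labels are made up for the example. In practice scikit-learn's `precision_recall_fscore_support` and `balanced_accuracy_score` cover the same ground.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy, balanced accuracy, and macro precision/recall/F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def safe_div(a, b):
        return a / b if b else 0.0

    precision = {c: safe_div(tp[c], tp[c] + fp[c]) for c in labels}
    recall = {c: safe_div(tp[c], tp[c] + fn[c]) for c in labels}
    f1 = {c: safe_div(2 * precision[c] * recall[c], precision[c] + recall[c])
          for c in labels}

    true_labels = sorted(set(y_true))  # balanced accuracy averages over true classes
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "balanced_accuracy": safe_div(sum(recall[c] for c in true_labels), len(true_labels)),
        "macro_precision": safe_div(sum(precision.values()), len(labels)),
        "macro_recall": safe_div(sum(recall.values()), len(labels)),
        "macro_f1": safe_div(sum(f1.values()), len(labels)),
    }


# Toy example with an over-predicted majority class (5 = Politik).
y_true = [5, 5, 5, 5, 1, 1, 2]
y_pred = [5, 5, 5, 5, 5, 1, 5]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In the toy example accuracy is about 0.71 while balanced accuracy is 0.50, which shows why balanced accuracy is the more informative figure for a dataset dominated by one class.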

## Conclusion

This fine-tuned model classifies election-related tweets into eight topic categories and reached a validation accuracy of 0.7953 after three epochs. It can be used to track which topics dominate public discourse during election periods, aiding in better understanding and decision-making.

## License

This project is licensed under the MIT License.

## Contact

For any questions or feedback, please contact me at rendikarendi96@gmail.com.