---
license: mit
language:
- en
- id
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- legal
---

# Election Tweets Classification Model

This repository contains a fine-tuned version of the ***indolem/indobertweet-base-uncased*** model for classifying election-related tweets. The model assigns each tweet to one of eight classes, which supports analysis of public opinion and discourse during election periods.

## Classes

The model classifies tweets into the following categories (sample counts cover the full dataset of 4,619 tweets):
1. **Politik** (2972 samples)
2. **Sosial Budaya** (425 samples)
3. **Ideologi** (343 samples)
4. **Pertahanan dan Keamanan** (331 samples)
5. **Ekonomi** (310 samples)
6. **Sumber Daya Alam** (157 samples)
7. **Demografi** (61 samples)
8. **Geografi** (20 samples)

| Encoded | Label                     |
|:---------:|:---------------------------:|
| 0       | Demografi                 |
| 1       | Ekonomi                   |
| 2       | Geografi                  |
| 3       | Ideologi                  |
| 4       | Pertahanan dan Keamanan   |
| 5       | Politik                   |
| 6       | Sosial Budaya             |
| 7       | Sumber Daya Alam          |
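
The encoded IDs above can be turned back into label names with a simple lookup. The sketch below assumes the model's predictions are the integer indices from the table; the names `ID2LABEL`, `LABEL2ID`, and `decode_label` are illustrative and not part of the released model.

```python
# Mapping copied from the table above (encoded ID -> label).
ID2LABEL = {
    0: "Demografi",
    1: "Ekonomi",
    2: "Geografi",
    3: "Ideologi",
    4: "Pertahanan dan Keamanan",
    5: "Politik",
    6: "Sosial Budaya",
    7: "Sumber Daya Alam",
}

# Reverse lookup (label -> encoded ID), useful when encoding training targets.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_label(class_id: int) -> str:
    """Return the label name for an encoded class ID (hypothetical helper)."""
    return ID2LABEL[class_id]


print(decode_label(5))
```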

## Libraries Used

The following libraries were used for data processing, model training, and evaluation:

- Data processing: `numpy`, `pandas`, `re`, `string`, `random`
- Visualization: `matplotlib.pyplot`, `seaborn`, `tqdm`, `plotly.graph_objs`, `plotly.express`, `plotly.figure_factory`
- Word cloud generation: `PIL`, `wordcloud`
- NLP: `nltk`, `nlp_id`, `Sastrawi`, `tweet-preprocessor`
- Machine Learning: `tensorflow`, `keras`, `sklearn`, `transformers`, `torch`

## Data Preparation

### Data Split
The dataset was split into training, validation, and test sets with the following proportions:

- **Training Set**: 85% (3925 samples)
- **Validation Set**: 10% (463 samples)
- **Test Set**: 5% (231 samples)
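
As a quick sanity check, the reported sample counts can be compared against the stated proportions. This is only arithmetic on the numbers above; the variable names are illustrative.

```python
# Sample counts reported for each split.
split_sizes = {"train": 3925, "validation": 463, "test": 231}

# Total number of tweets across all three splits.
total = sum(split_sizes.values())

# Share of the full dataset held by each split.
split_shares = {name: size / total for name, size in split_sizes.items()}

for name, share in split_shares.items():
    print(f"{name}: {share:.2%}")
```

The shares come out within a fraction of a percentage point of 85%, 10%, and 5%, and the total matches the sum of the per-class sample counts.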

### Training Details
- **Epochs**: 3
- **Batch Size**: 32

### Training Results

| Epoch | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy |
|-------|------------|----------------|-----------------|---------------------|
| 1     | 0.9382     | 0.7167         | 0.7518          | 0.7671              |
| 2     | 0.5741     | 0.8229         | 0.7081          | 0.7931              |
| 3     | 0.3541     | 0.8958         | 0.7473          | 0.7953              |
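
Training loss falls steadily across the three epochs, while validation loss is lowest at epoch 2 and rises again at epoch 3, which is a common early sign of overfitting. The sketch below uses only the numbers from the table to find the epoch with the lowest validation loss and the one with the highest validation accuracy; `history` is an illustrative name, not an object from the original training code.

```python
# Per-epoch metrics copied from the table above.
history = [
    {"epoch": 1, "train_loss": 0.9382, "train_acc": 0.7167, "val_loss": 0.7518, "val_acc": 0.7671},
    {"epoch": 2, "train_loss": 0.5741, "train_acc": 0.8229, "val_loss": 0.7081, "val_acc": 0.7931},
    {"epoch": 3, "train_loss": 0.3541, "train_acc": 0.8958, "val_loss": 0.7473, "val_acc": 0.7953},
]

# Epoch with the lowest validation loss and epoch with the highest validation accuracy.
best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_acc = max(history, key=lambda row: row["val_acc"])

# Gap between training and validation accuracy at the final epoch.
final_gap = history[-1]["train_acc"] - history[-1]["val_acc"]

print(best_by_loss["epoch"], best_by_acc["epoch"], round(final_gap, 4))
```

The two criteria disagree here, so which checkpoint counts as "best" depends on whether validation loss or validation accuracy is used for selection.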

## Model Architecture

The model is built using the TensorFlow and Keras libraries and employs the following architecture:

- **Embedding Layer**: Converts input tokens into dense vectors of fixed size.
- **LSTM Layers**: Bidirectional LSTM layers capture dependencies in the text data.
- **Dense Layers**: Fully connected layers for classification.
- **Dropout Layers**: Prevent overfitting by randomly dropping units during training.
- **Batch Normalization**: Normalizes activations of the previous layer.

## Usage

### Installation

To use the model, install the `transformers` library with pip. Loading the model with `AutoModelForSequenceClassification` also requires PyTorch (`torch`) to be installed:

```bash
pip install transformers
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/tweets-election-classification")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/tweets-election-classification")
```

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/tweets-election-classification")
```

### Data Cleaning

The data was cleaned using the following steps:
1. Converted text to lowercase.
2. Removed 'RT'.
3. Removed links.
4. Removed patterns like '[RE ...]'.
5. Removed patterns like '@ ... ='.
6. Removed non-ASCII characters (including emojis).
7. Removed punctuation (excluding '#').
8. Removed excessive whitespace.
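
The cleaning steps above can be sketched with the standard `re` and `string` modules. This is a minimal illustration, not the original preprocessing code: the function name `clean_tweet` and all regular expressions are assumptions, and the `'@ ... ='` pattern in particular is interpreted here as "everything from `@` up to the next `=`".

```python
import re
import string

# Punctuation to strip, keeping '#' so hashtags survive.
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("#", ""))


def clean_tweet(text: str) -> str:
    """Apply the eight cleaning steps in order (hypothetical helper)."""
    text = text.lower()                                     # 1. lowercase
    text = re.sub(r"\brt\b", " ", text)                     # 2. remove 'RT' (already lowercased)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # 3. remove links
    text = re.sub(r"\[re [^\]]*\]", " ", text)              # 4. remove '[RE ...]' patterns
    text = re.sub(r"@[^=\n]*=", " ", text)                  # 5. remove '@ ... =' (assumed pattern)
    text = text.encode("ascii", "ignore").decode("ascii")   # 6. drop non-ASCII characters
    text = text.translate(PUNCT_TABLE)                      # 7. remove punctuation except '#'
    text = re.sub(r"\s+", " ", text).strip()                # 8. collapse whitespace
    return text


print(clean_tweet("RT @user = Pemilu 2024!! https://t.co/abc #Pilpres"))
```

Links are removed before punctuation so that URLs are still intact when the link pattern runs; stripping punctuation first would leave URL fragments behind.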

### Sample Code

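The snippet below loads a Keras model saved as an `.h5` file and runs prediction on new tweets. The model path is a placeholder, and the tokenization step must match the one used during training:

```python
import pandas as pd
from tensorflow.keras.models import load_model

# Load the trained model (replace the placeholder path with your own file)
model = load_model('path_to_your_model.h5')

# Preprocess new data
def preprocess_text(text):
    # Minimal placeholder: lowercase and collapse whitespace.
    # Add the remaining steps listed under "Data Cleaning" here.
    return " ".join(text.lower().split())

# Example usage
new_tweets = pd.Series(["Your new tweet text here"])
preprocessed_tweets = new_tweets.apply(preprocess_text)
# Tokenize and pad sequences as done during training
# ...

# Predict the class (the model expects the tokenized, padded
# sequences from the step above, not raw strings)
predictions = model.predict(preprocessed_tweets)
# Index of the highest-scoring class for each tweet
predicted_classes = predictions.argmax(axis=-1)
```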

## Evaluation

The model was evaluated using the following metrics:
- **Precision**: Fraction of tweets predicted as a class that actually belong to that class.
- **Recall**: Fraction of tweets belonging to a class that the model correctly identifies.
- **F1 Score**: Harmonic mean of precision and recall.
- **Accuracy**: Fraction of all tweets that are classified correctly.
- **Balanced Accuracy**: Mean of the per-class recall, which reduces the influence of class imbalance.
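
To make these definitions concrete, the sketch below computes them in plain Python for a multi-class problem. It is an illustration only, not the evaluation code used for this model; the function names and the toy labels are assumptions made for the example.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes that occur in y_true."""
    metrics = per_class_metrics(y_true, y_pred)
    recalls = [metrics[label][1] for label in set(y_true)]
    return sum(recalls) / len(recalls)


# Toy example with encoded labels (5 = Politik, 1 = Ekonomi); not real model output.
y_true = [5, 5, 5, 5, 1, 1]
y_pred = [5, 5, 5, 5, 5, 1]
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

In the toy example the majority class is predicted perfectly while half of the minority class is missed, so balanced accuracy is lower than plain accuracy. The same effect matters for this dataset, where Politik makes up most of the samples.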

## Conclusion

This fine-tuned model classifies election-related tweets into eight topic categories, reaching a validation accuracy of 0.7953 after three epochs. It can be used to analyze public discourse and topic trends during election periods, supporting better understanding and decision-making.

## License

This project is licensed under the MIT License.

## Contact

For any questions or feedback, please contact me at rendikarendi96@gmail.com.