Rendika committed on
Commit f834f49
1 Parent(s): 54a7cd6

Add Readme.md for description

Files changed (1)
  1. README.md +137 -3
README.md CHANGED
@@ -1,3 +1,137 @@
- ---
- license: unlicense
- ---
+ ---
+ license: mit
+ language:
+ - en
+ - id
+ metrics:
+ - accuracy
+ pipeline_tag: text-classification
+ ---
+
+ # Election Tweets Classification Model
+
+ This repository contains a fine-tuned version of ***indolem/indobertweet-base-uncased*** for classifying tweets related to election topics. The model has been trained to categorize tweets into eight distinct classes, providing valuable insight into public opinion and discourse during election periods.
+
+ ## Classes
+
+ The model classifies tweets into the following categories:
+ 1. **Politik** (2972 samples)
+ 2. **Sosial Budaya** (425 samples)
+ 3. **Ideologi** (343 samples)
+ 4. **Pertahanan dan Keamanan** (331 samples)
+ 5. **Ekonomi** (310 samples)
+ 6. **Sumber Daya Alam** (157 samples)
+ 7. **Demografi** (61 samples)
+ 8. **Geografi** (20 samples)
+
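The distribution above is heavily skewed toward **Politik**. As a quick sanity check built only from the counts reported here (not from the raw dataset), the total and per-class shares work out as follows:

```python
# Sample counts per class, exactly as listed in this README
class_counts = {
    "Politik": 2972,
    "Sosial Budaya": 425,
    "Ideologi": 343,
    "Pertahanan dan Keamanan": 331,
    "Ekonomi": 310,
    "Sumber Daya Alam": 157,
    "Demografi": 61,
    "Geografi": 20,
}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

print(total)                       # 4619 tweets overall
print(f"{shares['Politik']:.1%}")  # 64.3% of all samples
```

With the largest class holding roughly 64% of the data, metrics such as balanced accuracy (used in the Evaluation section below) matter more than plain accuracy.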
+ ## Libraries Used
+
+ The following libraries were used for data processing, model training, and evaluation:
+
+ - Data processing: `numpy`, `pandas`, `re`, `string`, `random`
+ - Visualization: `matplotlib.pyplot`, `seaborn`, `tqdm`, `plotly.graph_objs`, `plotly.express`, `plotly.figure_factory`
+ - Word cloud generation: `PIL`, `wordcloud`
+ - NLP: `nltk`, `nlp_id`, `Sastrawi`, `tweet-preprocessor`
+ - Machine Learning: `tensorflow`, `keras`, `sklearn`, `transformers`, `torch`
+
+ ## Data Preparation
+
+ ### Data Split
+ The dataset was split into training, validation, and test sets with the following proportions:
+
+ - **Training Set**: 85% (3925 samples)
+ - **Validation Set**: 10% (463 samples)
+ - **Test Set**: 5% (231 samples)
+
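An 85/10/5 split like the one above can be produced with a two-step `train_test_split`: first carve off 15%, then split that remainder one-third/two-thirds. This is a sketch only; the actual random seed, stratification, and data used for this model are not documented here, so the `random_state` and the dummy DataFrame below are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset (columns assumed)
df = pd.DataFrame({"text": [f"tweet {i}" for i in range(100)],
                   "label": [i % 8 for i in range(100)]})

# Step 1: hold out 15% of the data
train_df, rest_df = train_test_split(df, test_size=0.15, random_state=42)
# Step 2: split the 15% into 10% validation and 5% test
val_df, test_df = train_test_split(rest_df, test_size=1/3, random_state=42)

print(len(train_df), len(val_df), len(test_df))  # 85 10 5
```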
+ ### Training Details
+ - **Epochs**: 3
+ - **Batch Size**: 32
+
+ ### Training Results
+
+ | Epoch | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy |
+ |-------|------------|----------------|-----------------|---------------------|
+ | 1     | 0.9382     | 0.7167         | 0.7518          | 0.7671              |
+ | 2     | 0.5741     | 0.8229         | 0.7081          | 0.7931              |
+ | 3     | 0.3541     | 0.8958         | 0.7473          | 0.7953              |
+
+ ## Model Architecture
+
+ The model is built using the TensorFlow and Keras libraries and employs the following architecture:
+
+ - **Embedding Layer**: Converts input tokens into dense vectors of fixed size.
+ - **LSTM Layers**: Bidirectional LSTM layers capture dependencies in the text data.
+ - **Dense Layers**: Fully connected layers for classification.
+ - **Dropout Layers**: Prevent overfitting by randomly dropping units during training.
+ - **Batch Normalization**: Normalizes activations of the previous layer.
+
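A minimal Keras sketch of the layer stack described above. The vocabulary size, embedding dimension, and unit counts are placeholders, not the trained values; only the number of output classes (8) comes from this README:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # placeholder, not the trained vocabulary size
NUM_CLASSES = 8      # the eight categories listed above

def build_model():
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, 128),                # token ids -> dense vectors
        layers.Bidirectional(layers.LSTM(64)),            # context in both directions
        layers.BatchNormalization(),                      # normalize activations
        layers.Dense(64, activation="relu"),              # fully connected layer
        layers.Dropout(0.5),                              # regularization
        layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities
    ])

model = build_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Training would then follow the settings listed earlier, e.g. `model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=3, batch_size=32)`.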
+ ## Usage
+
+ ### Installation
+
+ To use the model, ensure you have the required libraries installed. You can install them using pip:
+
+ ```bash
+ pip install numpy pandas matplotlib seaborn plotly pillow wordcloud nltk tensorflow keras scikit-learn transformers torch Sastrawi nlp-id tweet-preprocessor
+ ```
+
+ ### Data Cleaning
+
+ The data was cleaned using the following steps:
+ 1. Converted text to lowercase.
+ 2. Removed 'RT'.
+ 3. Removed links.
+ 4. Removed patterns like '[RE ...]'.
+ 5. Removed patterns like '@ ... ='.
+ 6. Removed non-ASCII characters (including emojis).
+ 7. Removed punctuation (excluding '#').
+ 8. Removed excessive whitespace.
+
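The steps above can be sketched as a single cleaning function. This is an illustration, not the exact code used for training; in particular, the regexes for steps 4 and 5 are assumptions about what the '[RE ...]' and '@ ... =' patterns look like:

```python
import re
import string

# Punctuation to strip: everything except '#' (step 7)
_PUNCT = "".join(c for c in string.punctuation if c != "#")

def clean_tweet(text: str) -> str:
    """Apply the cleaning steps listed above, in order."""
    text = text.lower()                                # 1. lowercase
    text = re.sub(r"\brt\b", "", text)                 # 2. remove 'RT'
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # 3. remove links
    text = re.sub(r"\[re[^\]]*\]", "", text)           # 4. remove '[RE ...]' (assumed form)
    text = re.sub(r"@\S+\s*=?", "", text)              # 5. remove '@ ... =' (assumed form)
    text = text.encode("ascii", "ignore").decode()     # 6. drop non-ASCII, incl. emojis
    text = text.translate(str.maketrans("", "", _PUNCT))  # 7. punctuation except '#'
    return re.sub(r"\s+", " ", text).strip()           # 8. collapse whitespace

print(clean_tweet("RT @user Cek https://t.co/x #Pemilu2024!!"))  # cek #pemilu2024
```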
+ ### Sample Code
+
+ Here's a sample code snippet to load and use the model:
+
+ ```python
+ import tensorflow as tf
+ from tensorflow.keras.models import load_model
+ import pandas as pd
+
+ # Load the trained model
+ model = load_model('path_to_your_model.h5')
+
+ # Preprocess new data
+ def preprocess_text(text):
+     # Include your text preprocessing steps here (see "Data Cleaning" above)
+     return text
+
+ # Example usage
+ new_tweets = pd.Series(["Your new tweet text here"])
+ preprocessed_tweets = new_tweets.apply(preprocess_text)
+ # Tokenize and pad sequences as done during training
+ # ...
+
+ # Predict the class
+ predictions = model.predict(preprocessed_tweets)
+ predicted_classes = predictions.argmax(axis=-1)
+ ```
+
+ ## Evaluation
+
+ The model was evaluated using the following metrics:
+ - **Precision**: Accuracy of the positive predictions.
+ - **Recall**: Ability to find all relevant instances.
+ - **F1 Score**: Harmonic mean of precision and recall.
+ - **Accuracy**: Overall proportion of correct predictions.
+ - **Balanced Accuracy**: Accuracy adjusted for class imbalance.
+
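All five metrics are available in scikit-learn, which the project already depends on. A sketch with made-up labels (not the model's actual predictions), using macro averaging so every class counts equally despite the imbalance:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_recall_fscore_support)

# Illustrative labels only, not real model output
y_true = [0, 1, 2, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 2]

# Macro average: each class contributes equally regardless of size
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy={acc:.3f} balanced={bal_acc:.3f}")
```

Note that balanced accuracy is the mean of per-class recall, so a model that only ever predicts the majority class scores poorly on it even when plain accuracy looks high.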
+ ## Conclusion
+
+ This fine-tuned model provides a robust tool for classifying election-related tweets into distinct categories. It can be used to analyze public sentiment and trends during election periods, aiding in better understanding and decision-making.
+
+ ## License
+
+ This project is licensed under the MIT License.
+
+ ## Contact
+
+ For any questions or feedback, please contact me at rendikarendi96@gmail.com.