	---
	license: mit
	language:
	- en
	- id
	metrics:
	- accuracy
	pipeline_tag: text-classification
	---

	# Election Tweets Classification Model

	This repository contains a fine-tuned version of the *indolem/indobertweet-base-uncased* model for classifying election-related tweets. The model assigns each tweet to one of eight classes, which supports analysis of public opinion and discourse during election periods.

	## Classes

	The model classifies tweets into the following categories:
	1. Politik (2972 samples)
	2. Sosial Budaya (425 samples)
	3. Ideologi (343 samples)
	4. Pertahanan dan Keamanan (331 samples)
	5. Ekonomi (310 samples)
	6. Sumber Daya Alam (157 samples)
	7. Demografi (61 samples)
	8. Geografi (20 samples)
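
The counts above are strongly skewed toward Politik. A minimal sketch of that arithmetic, using only the numbers listed above (the variable names are illustrative and not part of this repository):

```python
# Sample counts per class, copied from the list above.
class_counts = {
    "Politik": 2972,
    "Sosial Budaya": 425,
    "Ideologi": 343,
    "Pertahanan dan Keamanan": 331,
    "Ekonomi": 310,
    "Sumber Daya Alam": 157,
    "Demografi": 61,
    "Geografi": 20,
}

total = sum(class_counts.values())

# Share of the dataset held by each class.
class_share = {name: count / total for name, count in class_counts.items()}

# Accuracy of a trivial baseline that always predicts the largest class.
majority_class = max(class_counts, key=class_counts.get)
majority_baseline = class_share[majority_class]

for name, share in class_share.items():
    print(f"{name:<25} {share:6.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Since roughly 64% of the samples are Politik, plain accuracy is best read against that majority-class baseline.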

	| Encoded | Label |
	|:-------:|:-----------------------:|
	| 0 | Demografi |
	| 1 | Ekonomi |
	| 2 | Geografi |
	| 3 | Ideologi |
	| 4 | Pertahanan dan Keamanan |
	| 5 | Politik |
	| 6 | Sosial Budaya |
	| 7 | Sumber Daya Alam |
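
The encoded values follow the alphabetical order of the class names. That is the order `sklearn.preprocessing.LabelEncoder` would produce, although whether that encoder was actually used here is an assumption. A minimal sketch that rebuilds the table in plain Python (`ID2LABEL` and `LABEL2ID` are illustrative names):

```python
CLASS_NAMES = [
    "Politik", "Sosial Budaya", "Ideologi", "Pertahanan dan Keamanan",
    "Ekonomi", "Sumber Daya Alam", "Demografi", "Geografi",
]

# Sorting alphabetically reproduces the encoded ids in the table above.
ID2LABEL = dict(enumerate(sorted(CLASS_NAMES)))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(ID2LABEL[5])           # Politik
print(LABEL2ID["Geografi"])  # 2
```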

	## Libraries Used

	The following libraries were used for data processing, model training, and evaluation:

	- Data processing: `numpy`, `pandas`, `re`, `string`, `random`
	- Visualization and progress reporting: `matplotlib.pyplot`, `seaborn`, `plotly.graph_objs`, `plotly.express`, `plotly.figure_factory`, `tqdm`
	- Word cloud generation: `PIL`, `wordcloud`
	- NLP: `nltk`, `nlp_id`, `Sastrawi`, `tweet-preprocessor`
	- Machine Learning: `tensorflow`, `keras`, `sklearn`, `transformers`, `torch`

	## Data Preparation

	### Data Split
	The dataset was split into training, validation, and test sets with the following proportions:

	- Training Set: 85% (3925 samples)
	- Validation Set: 10% (463 samples)
	- Test Set: 5% (231 samples)
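
One way to obtain a split with these sizes is to shuffle the sample indices once and slice them. This is only a sketch: the seed, the helper name `split_indices`, and the lack of stratification are assumptions, since this card does not say how the split was made.

```python
import random

def split_indices(n_samples, n_train, n_val, seed=42):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test lists."""
    if n_train + n_val > n_samples:
        raise ValueError("train and validation sizes exceed the dataset size")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains after train and val
    return train, val, test

# Sizes reported above: 3925 + 463 + 231 = 4619 samples.
train_idx, val_idx, test_idx = split_indices(4619, n_train=3925, n_val=463)
for name, part in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(f"{name}: {len(part)} samples ({len(part) / 4619:.1%})")
```

With classes as small as Geografi (20 samples), a stratified split, for example scikit-learn's `train_test_split` with its `stratify` argument, would keep the class proportions similar across the three sets.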

	### Training Details
	- Epochs: 3
	- Batch Size: 32

	### Training Results

	| Epoch | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy |
	|-------|------------|----------------|-----------------|---------------------|
	| 1 | 0.9382 | 0.7167 | 0.7518 | 0.7671 |
	| 2 | 0.5741 | 0.8229 | 0.7081 | 0.7931 |
	| 3 | 0.3541 | 0.8958 | 0.7473 | 0.7953 |
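
Reading the table: training loss keeps falling, while validation loss is lowest at epoch 2 and rises again at epoch 3, so the gap between training and validation accuracy widens. A short sketch that derives this from the reported numbers (the tuple layout is illustrative):

```python
# (epoch, train_loss, train_acc, val_loss, val_acc), copied from the table above.
history = [
    (1, 0.9382, 0.7167, 0.7518, 0.7671),
    (2, 0.5741, 0.8229, 0.7081, 0.7931),
    (3, 0.3541, 0.8958, 0.7473, 0.7953),
]

# Epoch with the lowest validation loss and epoch with the highest validation accuracy.
best_by_val_loss = min(history, key=lambda row: row[3])[0]
best_by_val_acc = max(history, key=lambda row: row[4])[0]

for epoch, _, train_acc, _, val_acc in history:
    print(f"epoch {epoch}: train/val accuracy gap = {train_acc - val_acc:+.4f}")

print("lowest validation loss at epoch", best_by_val_loss)
print("highest validation accuracy at epoch", best_by_val_acc)
```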

	## Model Architecture

	The model is built using the TensorFlow and Keras libraries and employs the following architecture:

	- Embedding Layer: Converts input tokens into dense vectors of fixed size.
	- LSTM Layers: Bidirectional LSTM layers capture dependencies in the text data.
	- Dense Layers: Fully connected layers for classification.
	- Dropout Layers: Prevent overfitting by randomly dropping units during training.
	- Batch Normalization: Normalizes activations of the previous layer.

	## Usage

	### Installation

	To use the model, ensure you have the required libraries installed. You can install them using pip:

	```bash
	pip install transformers torch
	```

	```python
	# Load model directly
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	tokenizer = AutoTokenizer.from_pretrained("Rendika/tweets-election-classification")
	model = AutoModelForSequenceClassification.from_pretrained("Rendika/tweets-election-classification")
	```

	```python
	# Use a pipeline as a high-level helper
	from transformers import pipeline

	pipe = pipeline("text-classification", model="Rendika/tweets-election-classification")
	```
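
The pipeline returns a label string and a score for each input. The sketch below maps that label back to a class name using the encoding table from the Classes section. It assumes the model config exposes the default `LABEL_<id>` names, which this card does not confirm; `to_class_name` and `classify` are hypothetical helpers.

```python
from typing import Dict, List

# Encoding table from the "Classes" section.
ID2LABEL = {
    0: "Demografi",
    1: "Ekonomi",
    2: "Geografi",
    3: "Ideologi",
    4: "Pertahanan dan Keamanan",
    5: "Politik",
    6: "Sosial Budaya",
    7: "Sumber Daya Alam",
}

def to_class_name(raw_label: str) -> str:
    """Turn 'LABEL_5' into 'Politik'; return any other string unchanged."""
    prefix = "LABEL_"  # assumed default naming, not confirmed by the card
    suffix = raw_label[len(prefix):]
    if raw_label.startswith(prefix) and suffix.isdigit():
        return ID2LABEL.get(int(suffix), raw_label)
    return raw_label

def classify(texts: List[str]) -> List[Dict[str, object]]:
    """Run the pipeline and attach readable class names (downloads the model)."""
    from transformers import pipeline  # imported lazily; needs transformers and torch

    pipe = pipeline("text-classification", model="Rendika/tweets-election-classification")
    results = pipe(texts)  # one {"label": ..., "score": ...} dict per input text
    return [
        {"text": text, "label": to_class_name(res["label"]), "score": res["score"]}
        for text, res in zip(texts, results)
    ]

if __name__ == "__main__":
    print(to_class_name("LABEL_5"))  # Politik
```

Calling `classify(["your cleaned tweet text"])` downloads the model on first use. If the config already stores the class names, `to_class_name` passes them through unchanged.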

	### Data Cleaning

	The data was cleaned using the following steps:
	1. Converted text to lowercase.
	2. Removed 'RT'.
	3. Removed links.
	4. Removed patterns like '[RE ...]'.
	5. Removed patterns like '@ ... ='.
	6. Removed non-ASCII characters (including emojis).
	7. Removed punctuation (excluding '#').
	8. Removed excessive whitespace.
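
The steps above can be sketched with the standard `re` and `string` modules. The exact regular expressions used for training are not given in this card, so every pattern below, and the function name `clean_tweet`, is an assumption.

```python
import re
import string

# Punctuation to strip: everything in string.punctuation except '#'.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("#", ""))

def clean_tweet(text: str) -> str:
    """Apply the eight cleaning steps in order (patterns are assumed)."""
    text = text.lower()                                    # 1. lowercase
    text = re.sub(r"\brt\b", " ", text)                    # 2. retweet marker 'RT'
    text = re.sub(r"(https?://|www\.)\S+", " ", text)      # 3. links
    text = re.sub(r"\[re[^\]]*\]", " ", text)              # 4. '[RE ...]' patterns
    text = re.sub(r"@[^=\n]*=", " ", text)                 # 5. '@ ... =' patterns
    text = text.encode("ascii", "ignore").decode("ascii")  # 6. non-ASCII, incl. emojis
    text = text.translate(_PUNCT_TABLE)                    # 7. punctuation except '#'
    text = re.sub(r"\s+", " ", text).strip()               # 8. excessive whitespace
    return text

print(clean_tweet("RT [RE someone] Pemilu 2024!!! https://t.co/abc #Pilpres"))
# pemilu 2024 #pilpres
```

Applying the same function at inference time keeps new tweets in the form the model saw during training.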

	### Sample Code

	Here's a sample code snippet to load and use the model:

	```python
	import tensorflow as tf
	from tensorflow.keras.models import load_model
	import pandas as pd

	# Load the trained model
	model = load_model('path_to_your_model.h5')

	# Preprocess new data
	def preprocess_text(text):
	    # Include your text preprocessing steps here;
	    # this placeholder returns the text unchanged.
	    return text

	# Example usage
	new_tweets = pd.Series(["Your new tweet text here"])
	preprocessed_tweets = new_tweets.apply(preprocess_text)
	# Tokenize and pad sequences as done during training
	# ...

	# Predict the class
	predictions = model.predict(preprocessed_tweets)
	predicted_classes = predictions.argmax(axis=-1)
	```

	## Evaluation

	The model was evaluated using the following metrics:
	- Precision: Fraction of the predictions for a class that are correct.
	- Recall: Fraction of a class's true samples that the model finds.
	- F1 Score: Harmonic mean of precision and recall.
	- Accuracy: Fraction of all predictions that are correct.
	- Balanced Accuracy: Mean of per-class recall, which compensates for class imbalance.
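
A minimal pure-Python sketch of these metrics, assuming macro averaging over classes (this card does not state which averaging was used). `evaluate_predictions` is a hypothetical helper, and the labels are toy values rather than real model output; `sklearn.metrics` offers equivalent functions such as `precision_recall_fscore_support` and `balanced_accuracy_score`.

```python
from collections import Counter

def evaluate_predictions(y_true, y_pred):
    """Return accuracy, balanced accuracy, and macro precision/recall/F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def safe_div(num, den):
        return num / den if den else 0.0

    precision = {c: safe_div(tp[c], tp[c] + fp[c]) for c in labels}
    recall = {c: safe_div(tp[c], tp[c] + fn[c]) for c in labels}
    f1 = {c: safe_div(2 * precision[c] * recall[c], precision[c] + recall[c])
          for c in labels}

    # Balanced accuracy averages recall over the classes present in y_true.
    true_labels = sorted(set(y_true))
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "balanced_accuracy": safe_div(sum(recall[c] for c in true_labels), len(true_labels)),
        "macro_precision": safe_div(sum(precision.values()), len(labels)),
        "macro_recall": safe_div(sum(recall.values()), len(labels)),
        "macro_f1": safe_div(sum(f1.values()), len(labels)),
    }

# Toy labels using the encoded ids (5 = Politik, 1 = Ekonomi, 2 = Geografi).
y_true = [5, 5, 5, 5, 1, 1, 2]
y_pred = [5, 5, 5, 5, 1, 5, 5]
report = evaluate_predictions(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In this toy case accuracy is 5/7 while balanced accuracy is only 0.5, which shows how a dominant class can mask weak results on rare classes.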

	## Conclusion

	This fine-tuned model classifies election-related tweets into eight topic categories and reached a validation accuracy of about 0.80 after three epochs. It can be used to analyze which topics dominate public discourse during election periods.

	## License

	This project is licensed under the MIT License.

	## Contact

	For any questions or feedback, please contact me at rendikarendi96@gmail.com.