DistilBERT SST-2 Sentiment Analysis Model
This repository contains a fine-tuned DistilBERT model for sentiment analysis, trained on a subset of the SST-2 dataset. The model, tokenizer, and datasets are provided for educational purposes.
Model Details
- Model Name: DistilBERT SST-2 Sentiment Analysis
- Architecture: DistilBERT (distilbert-base-uncased)
- Task: Binary Sentiment Classification
- Dataset: SST-2 (Subset: 600 training samples, 150 test samples)
- Accuracy: 89% on the validation subset
Model Components
- Model: The model is a DistilBERT model fine-tuned for binary sentiment analysis (positive/negative).
- Tokenizer: `distilbert-base-uncased`, which matches the base DistilBERT model.
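For quick smoke-testing, the snippet below is a minimal inference sketch. It assumes the repository files have been downloaded to a local directory (`./model_dir` is an illustrative path, not part of this repo), and that label 0 maps to negative and 1 to positive, which is the conventional SST-2 mapping; check `config.json`'s `id2label` to confirm.

```python
# Minimal inference sketch; "./model_dir" is an illustrative local path
# containing the files listed under "Files Included" below.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./model_dir")
model = AutoModelForSequenceClassification.from_pretrained("./model_dir")
model.eval()

inputs = tokenizer(
    "A moving and thoughtful film.",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumed label mapping: 0 = negative, 1 = positive (verify via config.json).
prediction = logits.argmax(dim=-1).item()
print("positive" if prediction == 1 else "negative")
```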
Datasets
This repository also includes the datasets used to train and evaluate the model:
- Training Dataset: 600 samples from the SST-2 training set, saved in Parquet format.
- Test Dataset: 150 samples from the SST-2 validation set, saved in Parquet format.
The datasets were tokenized using the DistilBERT tokenizer with the following preprocessing steps:
- Padding: Sentences are padded to the longest sentence in the batch.
- Truncation: Sentences longer than 512 tokens are truncated.
- Max Length: 512 tokens.
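The sketch below shows how this preprocessing could be reproduced with the `datasets` library and the DistilBERT tokenizer. The contiguous split slices are an assumption (the actual subset may have been sampled differently); the output file names match those listed in the next section.

```python
# Preprocessing sketch; assumes the GLUE "sst2" config from the datasets hub.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad to the longest sentence in the batch, truncate at 512 tokens.
    return tokenizer(
        batch["sentence"], padding="longest", truncation=True, max_length=512
    )

train = load_dataset("glue", "sst2", split="train[:600]").map(tokenize, batched=True)
test = load_dataset("glue", "sst2", split="validation[:150]").map(tokenize, batched=True)

train.to_parquet("train_dataset.parquet")
test.to_parquet("test_dataset.parquet")
```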
Files Included
- `pytorch_model.bin`: The model weights.
- `config.json`: The model configuration.
- `tokenizer_config.json`: The tokenizer configuration.
- `vocab.txt`: The tokenizer vocabulary file.
- `train_dataset.parquet`: Tokenized training dataset (600 samples) in Parquet format.
- `test_dataset.parquet`: Tokenized test dataset (150 samples) in Parquet format.
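To inspect the saved splits, the Parquet files can be read back directly with `datasets`; a brief sketch:

```python
# Loading sketch: read the tokenized splits back into Dataset objects.
from datasets import Dataset

train = Dataset.from_parquet("train_dataset.parquet")
test = Dataset.from_parquet("test_dataset.parquet")
print(train)  # shows columns such as input_ids, attention_mask, label
```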
Training Details
Training Configuration
The model was fine-tuned using the following hyperparameters:
- Learning Rate: 2e-5
- Batch Size: 16 (training), 64 (evaluation)
- Number of Epochs: 4
- Gradient Accumulation Steps: 3
- Weight Decay: 0.01
- Evaluation Strategy: Evaluated at the end of each epoch
- Logging: Logs were generated every 100 steps
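Expressed as a `TrainingArguments` sketch, these hyperparameters look roughly as follows. The output path is illustrative, and the evaluation argument was named `evaluation_strategy` in older `transformers` releases and `eval_strategy` in newer ones.

```python
# TrainingArguments sketch mirroring the hyperparameters listed above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # illustrative output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=4,
    gradient_accumulation_steps=3,     # effective train batch size: 16 * 3 = 48
    weight_decay=0.01,
    eval_strategy="epoch",             # evaluate at the end of each epoch
    save_strategy="epoch",             # required for load_best_model_at_end
    logging_steps=100,
    load_best_model_at_end=True,       # see "Training Process" below
    metric_for_best_model="accuracy",
)
```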
Training Process
The model was trained with the Hugging Face `Trainer` API, which provides a high-level interface for training and evaluation. The model was evaluated at the end of each epoch to monitor accuracy, and the checkpoint with the best validation accuracy was loaded at the end of training.
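A wiring sketch of that setup, assuming `training_args`, `tokenizer`, `train`, and `test` from the earlier snippets are in scope; `compute_metrics` is a hypothetical helper, not a file in this repo.

```python
# Trainer wiring sketch; assumes training_args, tokenizer, train, and test
# from the earlier snippets. compute_metrics is a hypothetical helper.
import numpy as np
from transformers import AutoModelForSequenceClassification, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    tokenizer=tokenizer,  # enables dynamic padding; `processing_class` in newer versions
    compute_metrics=compute_metrics,
)
trainer.train()
```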
Model Performance
- Validation Accuracy: 89%
The validation accuracy was calculated on the 150 samples from the SST-2 validation set.
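With the trainer from the previous sketch, the figure can be reproduced along these lines:

```python
# Evaluation sketch: computes accuracy on the 150-sample test split.
metrics = trainer.evaluate()
print(metrics["eval_accuracy"])
```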
Usage Notes
This model is provided for educational purposes. It may not be suitable for production use without further testing and validation on larger datasets.
Acknowledgements
- Hugging Face: For providing the `transformers` library and dataset access.
- GLUE Benchmark: For the SST-2 dataset used in this project.