---
license: apache-2.0
base_model: distilbert/distilbert-base-uncased
tags:
- generated_from_trainer
metrics:
- accuracy
- f1
model-index:
- name: distilbert-base-uncased-finetuned-imdb
  results: []
library_name: adapter-transformers
pipeline_tag: text-classification
datasets:
- abhi227070/imdb-dataset
---

# distilbert-base-uncased-finetuned-imdb

This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on the [IMDB dataset](https://huggingface.co/datasets/abhi227070/imdb-dataset). It achieves the following results on the evaluation set:
- Loss: 0.2069
- Accuracy: 0.9257
- F1: 0.9257

## Model description

This model is based on DistilBERT, a smaller, faster, cheaper, and lighter version of BERT developed by Hugging Face. It has been fine-tuned for sentiment analysis on the IMDB movie reviews dataset. DistilBERT retains about 97% of BERT's language-understanding performance while being 60% faster and 40% smaller.

The model classifies text as expressing positive or negative sentiment, making it suitable for applications that need to understand user opinions or reviews.

## Intended uses & limitations

### Intended Uses

- **Sentiment Analysis:** Classify the sentiment of movie reviews as positive or negative.
- **Customer Feedback Analysis:** Adapt the model to analyze sentiment in customer feedback on products or services.
- **Social Media Monitoring:** Track sentiment in social media posts or comments.

A minimal inference sketch is included at the end of this card.

### Limitations

- **Domain Specificity:** The model is fine-tuned on movie reviews and may not perform as well on other types of text.
- **Binary Classification:** The model only distinguishes between positive and negative sentiment and does not account for neutral sentiment.
- **Language:** The model is trained on English text and may not perform well on text in other languages.

## Training and evaluation data

### Training Data

The model was trained on the IMDB dataset, which consists of 50,000 highly polar movie reviews labeled as either positive or negative. The dataset is balanced, with an equal number of positive and negative reviews.

### Evaluation Data

Evaluation was performed on a separate validation set derived from the IMDB dataset, so the reported metrics reflect data the model did not see during training.

## Training procedure

The IMDB dataset was loaded and preprocessed to map the sentiment strings (positive and negative) to numerical labels. It was then split into training (60%), validation (20%), and test (20%) sets, the indices were reset, and the splits were converted into a `DatasetDict` for use with the Hugging Face Transformers library.

The `AutoTokenizer` for DistilBERT tokenized the text, truncating where necessary, through a preprocessing function applied to the entire dataset. A `DataCollatorWithPadding` handled the variable sequence lengths, padding each batch dynamically rather than padding the whole dataset to one fixed length.

`AutoModelForSequenceClassification` with DistilBERT as the base model was configured for binary classification, mapping the label ids to the sentiment classes (positive and negative). The training arguments specified the learning rate, batch size, number of epochs, evaluation strategy, and logging steps.
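The snippet below is a minimal, self-contained sketch of this procedure, not the exact original training script. It assumes the dataset exposes a single `train` split with `review` and `sentiment` columns, and it uses scikit-learn for the accuracy and F1 computation; the metric function and `Trainer` wiring at the end of the sketch are described in the paragraph that follows.

```python
# Sketch of the fine-tuning procedure described above. Column names, split
# names, and the output directory are assumptions; adjust them to your setup.
import numpy as np
from datasets import DatasetDict, load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert/distilbert-base-uncased"

# 1. Load the data and map the sentiment strings to integer labels.
raw = load_dataset("abhi227070/imdb-dataset", split="train")
raw = raw.map(lambda x: {"label": 1 if x["sentiment"] == "positive" else 0})

# 2. Split into 60% train / 20% validation / 20% test.
first = raw.train_test_split(test_size=0.4, seed=42)
second = first["test"].train_test_split(test_size=0.5, seed=42)
dataset = DatasetDict(
    {"train": first["train"], "validation": second["train"], "test": second["test"]}
)

# 3. Tokenize the reviews, truncating to the model's maximum length.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def preprocess(batch):
    return tokenizer(batch["review"], truncation=True)

tokenized = dataset.map(preprocess, batched=True)

# 4. Pad dynamically per batch instead of to one fixed global length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 5. Binary classification head with readable label names.
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

# 6. Accuracy and F1 computed from the argmax of the logits.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds)}

# 7. Hyperparameters matching those listed under "Training hyperparameters".
training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-imdb",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```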
Evaluation metrics, accuracy and F1, were computed by a function that compares the model's predictions with the true labels. The `Trainer` class from the Transformers library then ran training for three epochs, bringing together the model, the training arguments, the tokenized datasets, the data collator, and the metric function. This setup produced the accuracy and F1 scores reported above and in the results table below.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
| 0.241         | 1.0   | 1875 | 0.2029          | 0.9268   | 0.9268 |
| 0.1411        | 2.0   | 3750 | 0.2442          | 0.9320   | 0.9320 |
| 0.079         | 3.0   | 5625 | 0.2882          | 0.9347   | 0.9347 |

### Framework versions

- Transformers 4.42.4
- Pytorch 2.3.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1
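For completeness, here is a minimal inference sketch for the sentiment-analysis use case described under Intended Uses. The model identifier passed to `pipeline` is a placeholder assumption; substitute the Hub repository id or local output directory where this fine-tuned checkpoint is stored.

```python
from transformers import pipeline

# Placeholder checkpoint identifier: replace it with the Hub repository id or
# the local directory that holds this fine-tuned model.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-imdb",
)

reviews = [
    "This movie was an absolute delight from start to finish.",
    "A dull, predictable plot with wooden acting.",
]
# Each prediction is a dict with a label and a confidence score; the label
# names depend on the id2label mapping saved with the checkpoint.
print(classifier(reviews))
```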