---
license: mit
datasets:
- yelp_review_full
language:
- en
metrics:
- accuracy
- f1
library_name: transformers
---
# Model Card

## Sentiment Analysis of Restaurant Reviews from the Yelp Dataset

### Overview

- **Task**: Sentiment classification of restaurant reviews from the Yelp dataset.
- **Model**: Fine-tuned BERT (Bidirectional Encoder Representations from Transformers) for sequence classification.
- **Training Dataset**: Yelp dataset, restricted to restaurant reviews.
- **Training Framework**: PyTorch and the Hugging Face Transformers library.

### Model Details

- **Pre-trained Model**: `bert-base-uncased`.
- **Input**: Cleaned and preprocessed restaurant reviews.
- **Output**: Binary classification (positive or negative sentiment).
- **Tokenization**: BERT tokenizer with a maximum sequence length of 240 tokens.
- **Optimizer**: AdamW with a learning rate of 3e-5.
- **Learning Rate Scheduler**: Linear scheduler with no warmup steps.
- **Loss Function**: CrossEntropyLoss.
- **Batch Size**: 16.
- **Number of Epochs**: 2.

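The optimization setup above can be sketched in PyTorch. This is a minimal illustration, not the original training script: the linear stand-in model and the `total_steps` value are placeholders, and the `LambdaLR` formulation of "linear decay with no warmup" is one reasonable reading of the card.

```python
import torch
from torch import nn

# Stand-in for the fine-tuned classifier: a linear head over 768-dim
# pooled embeddings. The real model is bert-base-uncased with a 2-way
# classification head; this sketch only mirrors the optimization setup.
model = nn.Linear(768, 2)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss_fn = nn.CrossEntropyLoss()

# Linear decay to zero with no warmup steps, as stated above.
# total_steps is illustrative; in training it would be
# len(train_dataloader) * num_epochs.
total_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
)
```
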
### Data Preprocessing

1. Loaded the Yelp review and business datasets.
2. Merged the two datasets on the `business_id` column.
3. Removed unnecessary columns and duplicate rows.
4. Mapped star ratings to binary sentiment labels (positive or negative).
5. Upsampled the minority class (negative sentiment) to address the class imbalance.
6. Cleaned the review text by removing non-letter characters, converting to lowercase, and tokenizing.

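Steps 4 and 6 can be sketched in plain Python. The star threshold (4–5 stars positive, 1–2 negative, ambiguous 3-star reviews dropped) is an assumption for illustration; the card does not state the exact mapping.

```python
import re

def label_from_stars(stars: int):
    """Map a star rating to a binary sentiment label.

    Assumed threshold: >=4 positive, <=2 negative, 3-star reviews
    excluded as ambiguous (the card does not specify the cutoff).
    """
    if stars >= 4:
        return 1  # positive
    if stars <= 2:
        return 0  # negative
    return None   # neutral 3-star reviews dropped

def clean_text(review: str) -> str:
    # Keep letters only, lowercase, and collapse whitespace,
    # mirroring the cleaning described in step 6.
    letters_only = re.sub(r"[^a-zA-Z]", " ", review)
    return " ".join(letters_only.lower().split())
```
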
### Model Training

1. Split the dataset into training (70%), validation (15%), and test (15%) sets.
2. Tokenized, padded, and truncated input sequences.
3. Created attention masks to differentiate real tokens from padding.
4. Fine-tuned BERT using the specified hyperparameters.
5. Tracked training and validation accuracy and loss for each epoch.

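Steps 2 and 3 (padding, truncation, and attention masks) can be illustrated without the full tokenizer. `PAD_ID = 0` matches the `[PAD]` token id in the `bert-base-uncased` vocabulary; the token ids in the test are placeholders.

```python
# Pad or truncate token-id sequences to a fixed length and build
# matching attention masks: 1 marks a real token, 0 marks padding.
MAX_LEN = 240  # maximum sequence length used above
PAD_ID = 0     # [PAD] id in the bert-base-uncased vocabulary

def pad_and_mask(token_ids):
    ids = token_ids[:MAX_LEN]          # truncate
    mask = [1] * len(ids)              # real tokens
    padding = MAX_LEN - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding
```
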
### Model Evaluation

1. Achieved high accuracy and F1 scores on both the validation and test sets.
2. Test-set accuracy closely matched validation accuracy, indicating good generalization.
3. Validation loss continued to improve across epochs, with no signs of overfitting.

### Model Deployment

1. Saved the trained model and tokenizer.
2. Published the model and tokenizer to the Hugging Face Model Hub.
3. Demonstrated how to load and use the model for making predictions.

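A minimal sketch of loading the published model for inference. The repo id `your-username/yelp-sentiment-bert` is a placeholder (the card does not give the actual Hub id), and the 0 = negative / 1 = positive label order is an assumption.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed label order; check the model's config.id2label for the real one.
LABELS = {0: "negative", 1: "positive"}

def predict(texts, model_id="your-username/yelp-sentiment-bert"):
    # model_id is a placeholder: substitute the actual Hub repo id.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=240, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return [LABELS[i] for i in logits.argmax(dim=1).tolist()]
```
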
### Model Performance

- **Validation Accuracy**: ≈ 97.5%–97.8%
- **Test Accuracy**: ≈ 97.8%
- **F1 Score**: ≈ 97.8%–97.9%

### Limitations

- Stopwords were excluded to stay within the token length limit, which may remove some contextual information.
- Performance may vary on reviews written in languages other than English.

### Conclusion

The fine-tuned BERT model delivers robust sentiment analysis of Yelp restaurant reviews. Its high accuracy and F1 scores show that it effectively captures sentiment in user-generated content, making it suitable for deployment in applications that require sentiment classification of restaurant reviews.