---
license: mit
language:
- sk
pipeline_tag: text-classification
library_name: transformers
metrics:
- f1
base_model: daviddrzik/SK_BPE_BLM
tags:
- sentiment
---

# Fine-Tuned Sentiment Classification Model - SK_BPE_BLM (Movie reviews)

## Model Overview

This model is a fine-tuned version of the [SK_BPE_BLM model](https://huggingface.co/daviddrzik/SK_BPE_BLM) for sentiment classification. It was trained on Czech-language movie reviews from the ČSFD dataset, machine-translated into Slovak using Google Cloud Translation.

## Sentiment Labels

Each review in the dataset is labeled with one of the following sentiments:
- **Negative (0)**
- **Positive (1)**
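
This mapping can also be read from the model configuration. A minimal sketch (the exact label names are assumed from the example output further below):

```python
from transformers import AutoConfig

# Load the configuration of the fine-tuned model and print its label mapping
config = AutoConfig.from_pretrained("daviddrzik/SK_BPE_BLM-sentiment-csfd")
print(config.id2label)  # expected to show a 0 -> negative, 1 -> positive mapping
```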

## Dataset Details

The dataset used for fine-tuning comprises a total of 53,402 text records, labeled with sentiment as follows:
- **Negative records (0):** 25,618
- **Positive records (1):** 27,784

For more information about the dataset, please visit [this link](https://www.kaggle.com/datasets/lowoncuties/czech-movie-review-csfd/).
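
These counts imply a nearly balanced label distribution, which a quick check confirms:

```python
# Label shares in the 53,402 records listed above
total = 25_618 + 27_784                   # 53,402
print(f"negative: {25_618 / total:.1%}")  # ~48.0%
print(f"positive: {27_784 / total:.1%}")  # ~52.0%
```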

## Fine-Tuning Hyperparameters

The following hyperparameters were used during the fine-tuning process:

- **Learning Rate:** 5e-06
- **Training Batch Size:** 64
- **Evaluation Batch Size:** 64
- **Seed:** 42
- **Optimizer:** Adam (default)
- **Number of Epochs:** 5
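
Expressed as a `transformers.TrainingArguments` configuration, these settings would look roughly as follows. This is a sketch only; the actual training script is not published with this card, and the `output_dir` name is made up:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the fine-tuning configuration listed above
training_args = TrainingArguments(
    output_dir="sk_bpe_blm_sentiment_csfd",  # hypothetical output directory
    learning_rate=5e-6,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    num_train_epochs=5,
    # Adam-style optimizer with default settings, as stated above
)
```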

## Model Performance

The model was evaluated using stratified 10-fold cross-validation, achieving a median weighted F1-score of <span style="font-size: 24px;">**0.928**</span> across the folds.
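
The evaluation script itself is not part of this card; the following is a rough sketch of the protocol (stratified 10-fold cross-validation, median weighted F1), with `train_and_predict` standing in as a hypothetical helper for the fold-level fine-tuning, and the fold seed assumed from the hyperparameters above:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def median_weighted_f1(texts, labels, train_and_predict, n_splits=10, seed=42):
    """Stratified k-fold evaluation as described above.

    `train_and_predict(train_idx, test_idx)` is a user-supplied (hypothetical)
    helper that fine-tunes the model on the training fold and returns
    predicted labels for the test fold.
    """
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, test_idx in skf.split(texts, labels):
        y_pred = train_and_predict(train_idx, test_idx)
        fold_scores.append(f1_score(labels[test_idx], y_pred, average="weighted"))
    return float(np.median(fold_scores))
```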

## Model Usage

This model is suited to sentiment classification of Slovak text, particularly user reviews of movies. Because it was designed specifically for review-style text, it may not generalize well to other types of text.

### Example Usage

Below is an example of how to use the fine-tuned `SK_BPE_BLM-sentiment-csfd` model in a Python script:

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

class SentimentClassifier:
    def __init__(self, tokenizer, model):
        # Load the fine-tuned classification model and the matching tokenizer
        self.model = RobertaForSequenceClassification.from_pretrained(model, num_labels=2)
        self.tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer, max_length=256)

    def tokenize_text(self, text):
        # Lowercase the text and encode it, padding/truncating to 256 tokens
        encoded_text = self.tokenizer.encode_plus(
            text.lower(),
            max_length=256,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return encoded_text

    def classify_text(self, encoded_text):
        with torch.no_grad():
            output = self.model(**encoded_text)
            logits = output.logits
            # Index of the highest-scoring class and per-class probabilities
            predicted_class = torch.argmax(logits, dim=1).item()
            probabilities = torch.softmax(logits, dim=1)
            class_probabilities = probabilities[0].tolist()
        predicted_class_text = self.model.config.id2label[predicted_class]
        return predicted_class, predicted_class_text, class_probabilities

# Instantiate the sentiment classifier with the specified tokenizer and model
classifier = SentimentClassifier(tokenizer="daviddrzik/SK_BPE_BLM", model="daviddrzik/SK_BPE_BLM-sentiment-csfd")

# Example text to classify sentiment
# (English: "Although this film wasn't the best I've ever seen, I would watch it again.")
text_to_classify = "Tento film síce nebol najlepší aký som kedy videl, ale pozrel by som si ho opäť."
print("Text to classify: " + text_to_classify + "\n")

# Tokenize the input text
encoded_text = classifier.tokenize_text(text_to_classify)

# Classify the sentiment of the tokenized text
predicted_class, predicted_class_text, class_probabilities = classifier.classify_text(encoded_text)

# Print the predicted class label and index
print(f"Predicted class: {predicted_class_text} ({predicted_class})")
# Print the probabilities for each class
print(f"Class probabilities: {class_probabilities}")
```

Here is the output when running the above example:
```yaml
Text to classify: Tento film síce nebol najlepší aký som kedy videl, ale pozrel by som si ho opäť.

Predicted class: POSITIVE (1)
Class probabilities: [0.015124241821467876, 0.9848757386207581]
```
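
Alternatively, the high-level `pipeline` API offers a shorter route to equivalent predictions. A sketch, assuming the tokenizer files are bundled in the model repository (otherwise pass `tokenizer="daviddrzik/SK_BPE_BLM"` as well); note that `pipeline` will not lowercase the input the way the class above does:

```python
from transformers import pipeline

# Load the fine-tuned model through the text-classification pipeline
classifier = pipeline("text-classification", model="daviddrzik/SK_BPE_BLM-sentiment-csfd")
print(classifier("Tento film síce nebol najlepší aký som kedy videl, ale pozrel by som si ho opäť."))
```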