julien-c HF staff committed on
Commit
02a429c
1 Parent(s): 3458d8f

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/savasy/bert-base-turkish-sentiment-cased/README.md

Files changed (1)
  1. README.md +122 -11
README.md CHANGED
@@ -1,15 +1,19 @@

- # Details
- This model is used for Sentiment Analysis based on BERTurk for Turkish Language https://huggingface.co/dbmdz/bert-base-turkish-cased


- # Dataset



- We used product and movie dataset provided by a study [2] . This dataset includes
- movie and product reviews. The products are book, DVD, electronics, and kitchen.
- The movie dataset is taken from a cinema Web page (www.beyazperde.com) with
  5331 positive and 5331 negative sentences. Reviews in the Web page are marked in
  scale from 0 to 5 by the users who made the reviews. The study considered a review
  sentiment positive if the rating is equal to or bigger than 4, and negative if it is less
@@ -18,13 +22,120 @@ Web page. They constructed benchmark dataset consisting of reviews regarding som
  products (book, DVD, etc.). Likewise, reviews are marked in the range from 1 to 5,
  and majority class of reviews are 5. Each category has 700 positive and 700 negative
  reviews in which average rating of negative reviews is 2.27 and of positive reviews
- is 4.5.


- The dataset is used by following papers


-
- 1 Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
- 2 Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
  Discovery and Opinion Mining (WISDOM ’13)
@@ -1,15 +1,19 @@
+ ---
+ language: tr
+ ---
+ # Bert-base Turkish Sentiment Model

+ https://huggingface.co/savasy/bert-base-turkish-sentiment-cased

+ This model performs sentiment analysis for Turkish. It is based on BERTurk: https://huggingface.co/dbmdz/bert-base-turkish-cased


+ ## Dataset

+ The dataset merges the data from the studies [[2]](#paper-2) and [[3]](#paper-3).

+ * The study [[2]](#paper-2) gathered movie and product reviews. The product categories are book, DVD, electronics, and kitchen.
+ The movie dataset is taken from a cinema Web page ([Beyazperde](www.beyazperde.com)) with
  5331 positive and 5331 negative sentences. Reviews in the Web page are marked in
  scale from 0 to 5 by the users who made the reviews. The study considered a review
  sentiment positive if the rating is equal to or bigger than 4, and negative if it is less
@@ -18,13 +22,120 @@ Web page. They constructed benchmark dataset consisting of reviews regarding som
  products (book, DVD, etc.). Likewise, reviews are marked in the range from 1 to 5,
  and majority class of reviews are 5. Each category has 700 positive and 700 negative
  reviews in which average rating of negative reviews is 2.27 and of positive reviews
+ is 4.5. This dataset is also used by the study [[1]](#paper-1).

+ * The study [[3]](#paper-3) collected a tweet dataset. Its authors proposed a new approach for automatically classifying the sentiment of microblog messages, based on robust feature representation and fusion.

+ *Merged Dataset*

+ | *size* | *data* |
+ |--------|----|
+ | 8000 |dev.tsv|
+ | 8262 |test.tsv|
+ | 32000 |train.tsv|
+ | *48290* |*total*|

+ ### The dataset is used by the following papers
+
+ <a id="paper-1">[1]</a> Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
+
+ <a id="paper-2">[2]</a> Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
  Discovery and Opinion Mining (WISDOM ’13)
+
+ <a id="paper-3">[3]</a> Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey
+
+
+ ## Training
+
+ ```shell
+ export GLUE_DIR="./sst-2-newall"
+ export TASK_NAME=SST-2
+
+ python3 run_glue.py \
+   --model_type bert \
+   --model_name_or_path dbmdz/bert-base-turkish-uncased \
+   --task_name "$TASK_NAME" \
+   --do_train \
+   --do_eval \
+   --data_dir "$GLUE_DIR" \
+   --max_seq_length 128 \
+   --per_gpu_train_batch_size 32 \
+   --learning_rate 2e-5 \
+   --num_train_epochs 3.0 \
+   --output_dir "./model"
+ ```
+
+
+ ## Results
+
+ > 05/10/2020 17:00:43 - INFO - transformers.trainer - \*\*\*\*\* Running Evaluation \*\*\*\*\*
+ > 05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
+ > 05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
+ > Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
+ > 05/10/2020 17:01:17 - INFO - \_\_main__ - \*\*\*\*\* Eval results sst-2 \*\*\*\*\*
+ > 05/10/2020 17:01:17 - INFO - \_\_main__ - acc = 0.9539942492811602
+ > 05/10/2020 17:01:17 - INFO - \_\_main__ - loss = 0.16348013816401363
+
+ Accuracy is about **95.4%**
+
+
+ ## Code Usage
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
+ # Load the fine-tuned model and its tokenizer, then wrap them in a pipeline
+ model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
+ tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
+ sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
+
+ # "these phone models are very high quality, every part is very special in my opinion"
+ p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
+ print(p)
+ # [{'label': 'LABEL_1', 'score': 0.9871089}]
+ print(p[0]['label'] == 'LABEL_1')
+ # True
+
+ # "The film was very bad and very fake"
+ p = sa("Film çok kötü ve çok sahteydi")
+ print(p)
+ # [{'label': 'LABEL_0', 'score': 0.9975505}]
+ print(p[0]['label'] == 'LABEL_1')
+ # False
+ ```
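
The pipeline output above uses generic label names. A minimal sketch of turning them into readable sentiment names follows. It assumes, based only on the two example outputs above and the 1/0 labels used in the test data, that `LABEL_1` means positive and `LABEL_0` means negative. The names `LABEL_NAMES` and `to_sentiment` are hypothetical helpers, not part of `transformers`.

```python
# Assumed mapping, inferred from the two example outputs above:
# LABEL_1 -> positive, LABEL_0 -> negative.
LABEL_NAMES = {"LABEL_1": "positive", "LABEL_0": "negative"}


def to_sentiment(prediction):
    """Convert one pipeline result dict into a (sentiment, score) pair."""
    label = prediction["label"]
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected label: {label!r}")
    return LABEL_NAMES[label], float(prediction["score"])


# Results copied from the usage example above, so no model download is needed.
examples = [
    {"label": "LABEL_1", "score": 0.9871089},
    {"label": "LABEL_0", "score": 0.9975505},
]
for p in examples:
    print(to_sentiment(p))
```

With a loaded pipeline `sa`, the same helper could be applied as `to_sentiment(sa(text)[0])`.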
+
+
+ ## Test
+ ### Data
+
+ Suppose your file has many lines, each holding a comment followed by its label (1 or 0), separated by a tab:
+
+ > comment1 ... \t label
+ > comment2 ... \t label
+ > ...
+
+ ### Code
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
+ model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
+ tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
+ sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
+
+ input_file = "/path/to/your/file/yourfile.tsv"
+
+ i, crr = 0, 0  # examples seen, correct predictions
+ with open(input_file, encoding="utf-8") as f:
+     for line in f:
+         lines = line.strip().split("\t")
+         if len(lines) == 2:  # skip lines that are not "comment<TAB>label"
+             i = i + 1
+             if i % 100 == 0:
+                 print(i)  # progress report
+
+             pred = sa(lines[0])
+             pred = pred[0]["label"].split("_")[1]  # "LABEL_1" -> "1"
+
+             if pred == lines[1]:
+                 crr = crr + 1
+
+ # correct count, total count, accuracy (0.0 if no valid lines were read)
+ print(crr, i, crr / i if i else 0.0)
+ ```
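
The evaluation loop above can also be packaged as a function that takes the predictor as an argument, which makes it possible to check the counting logic without downloading the model. This is a sketch under stated assumptions: `evaluate_tsv` and `keyword_predict` are hypothetical names, the sample file contents are invented for illustration, and the toy predictor only stands in for the real pipeline.

```python
import os
import tempfile


def evaluate_tsv(path, predict):
    """Return (correct, total, accuracy) for a file of `text<TAB>label` lines.

    `predict` is any callable mapping a text to the label string "0" or "1".
    """
    correct, total = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                continue  # skip malformed lines, as the loop above does
            text, gold = fields
            total += 1
            if predict(text) == gold.strip():
                correct += 1
    accuracy = correct / total if total else 0.0
    return correct, total, accuracy


def keyword_predict(text):
    """Toy stand-in for the model: treats any text containing 'kötü' (bad) as negative."""
    return "0" if "kötü" in text else "1"


# Build a tiny, invented test file and score it with the toy predictor.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("çok kaliteli bir ürün\t1\n")
    tmp.write("film çok kötü\t0\n")
    tmp.write("fena değil\t0\n")
    tmp.write("line without a label\n")
    sample_path = tmp.name

result = evaluate_tsv(sample_path, keyword_predict)
print(result)
os.remove(sample_path)
```

To score the real model, the predictor could be `lambda text: sa(text)[0]["label"].split("_")[1]`, where `sa` is the pipeline built in the code above.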