AlejandraVento2 commited on
Commit
70ead2f
1 Parent(s): 8103ee5

sentiment analysis

Files changed (4)
  1. README.md +55 -2
  2. app.py +20 -0
  3. model.py +361 -0
  4. requirements.txt +6 -0
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  title: Analisis De Sentimientos
- emoji: 🐨
+ emoji: 🥺😡
  colorFrom: yellow
  colorTo: indigo
  sdk: streamlit
@@ -9,4 +9,57 @@ app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # Sentiment Classification Model
+
+ My project is a multi-label model that classifies sentiments into sadness, anger, joy, fear, and surprise (plus a neutral class). It was trained and validated on a tabular text dataset and built with BERT and PyTorch Lightning.
+
+ ## Decisions
+
+ `1` Classify sentiments (anger, sadness, joy, fear, surprise, neutral).
+
+ `2` Use these models as references:
+
+ - https://github.com/curiousily/Getting-Things-Done-with-Pytorch/blob/master/11.multi-label-text-classification-with-bert.ipynb
+ - https://www.youtube.com/watch?v=UJGxCsZgalA&ab_channel=VenelinValkov
+ - https://github.com/theartificialguy/NLP-with-Deep-Learning/blob/master/BERT/Multi%20Label%20Text%20Classification%20using%20BERT%20PyTorch/bert_multilabel_pytorch_standard.ipynb
+
+ `3` Add dropout between the layers so the model does not overfit.
+
+ `4` Add checkpoints during training to keep RAM from filling up completely.
+
+ `5` Use the sigmoid activation function in the last layer for a better final result.
+
+ `6` Use the AdamW optimizer and binary cross-entropy (BCELoss) as the loss criterion, which are among the most recommended and widely used for these models (see the sketch after this list).
+
+ `7` Use 1625 training essays and 86 validation essays.
+
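+ A minimal sketch of decisions `5` and `6`, mirroring the classification head in model.py (the tensors below are stand-ins, not real data):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ hidden_size, n_classes = 768, 6             # bert-base hidden size, six emotion labels
+ classifier = nn.Linear(hidden_size, n_classes)
+ criterion = nn.BCELoss()                    # binary cross-entropy over sigmoid outputs
+
+ pooled = torch.randn(4, hidden_size)        # stand-in for BERT's pooler_output (batch of 4)
+ labels = torch.randint(0, 2, (4, n_classes)).float()
+
+ probs = torch.sigmoid(classifier(pooled))   # one independent probability per label
+ loss = criterion(probs, labels)
+ ```
+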
+ ## Data Sources
+
+ I used the sentiment training and validation tables from this source: [messages_train_ready_for_WS.tsv](https://github.com/caisa-lab/wassa-empathy-adapters/tree/main/data).
+
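+ A loading sketch, assuming the TSV from the link above has been downloaded locally (model.py relies on the columns `essay` and `emotion`):
+
+ ```python
+ import pandas as pd
+
+ data = pd.read_csv("messages_train_ready_for_WS.tsv", sep="\t")
+ print(data[["essay", "emotion"]].head())
+ ```
+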
+ ## Features
+
+ The input is text entered through a text field; the model predicts and/or classifies the sentiments that best match that text.
+
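+ For example, a hypothetical call to `run_sentiment_analysis` from model.py (the scores shown are illustrative only and depend on the trained weights):
+
+ ```python
+ from model import run_sentiment_analysis
+
+ print(run_sentiment_analysis("I can't believe they cancelled the show."))
+ # e.g. ['surprise: 0.7134', 'sadness: 0.6021']
+ ```
+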
+ ## Data Collection
+
+ I decided to use this data collection because it offers a wide variety of texts in different contexts, covering a broad range of sentiments measured across several metric columns.
+
+ ## Value Proposition
+
+ It can be used in sentiment-recognition applications: for example, as a filter for business texts, or in psychology programs that help people who have trouble understanding the feelings of others, or people with disabilities in general. For instance, for a person with a speech disability who communicates by typing text into a system that generates speech, the system could adjust its tone of voice based on this prediction to reflect the person's feelings.
+
+ # Environment requirements to run the model
+
+ Transformers 4.5.1
+
+ PyTorch Lightning 1.2.8
+
+ NumPy
+
+ Pandas
+
+ Torch
+
+ scikit-learn
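+
+ These match the requirements.txt added in this commit, so one way to set up is:
+
+ ```
+ pip install -r requirements.txt
+ ```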
app.py ADDED
@@ -0,0 +1,20 @@
+ # Run locally with: streamlit run app.py
+ import streamlit as st
+
+ from model import run_sentiment_analysis
+
+ st.title("Analisis de Sentimientos")
+
+ txt = st.text_area(label="Please write what you want to analyze...")
+
+ # Only run the model once the user has entered some text.
+ if txt:
+     predictions = run_sentiment_analysis(txt)
+     for prediction in predictions:
+         st.write(prediction)
model.py ADDED
@@ -0,0 +1,361 @@
+ # -*- coding: utf-8 -*-
+ """
+ Automatically generated by Colaboratory.
+
+ Original file is located at
+ https://colab.research.google.com/drive/193Qwk9yyPHgI0H84JJOchTovg_CELJuw
+ """
+
+ import pandas as pd
+ import numpy as np
+ import torch
+ import torch.nn as nn
+ from torch.utils.data import Dataset, DataLoader
+ from transformers import BertTokenizer, BertModel, AdamW, get_linear_schedule_with_warmup
+ import pytorch_lightning as pl
+ from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
+ from sklearn.model_selection import train_test_split
+
+ RANDOM_SEED = 42
+ np.random.seed(RANDOM_SEED)
+ torch.manual_seed(RANDOM_SEED)
+
+ # Preparing training data
+ train_file_path = '/content/sample_data/train_data.csv'
+ train_data = pd.read_csv(train_file_path)
+
+ # Keep only the six emotions of interest; .copy() avoids pandas'
+ # SettingWithCopyWarning when the label columns are added below.
+ filtro = train_data['emotion'].isin(['anger', 'fear', 'joy', 'sadness', 'neutral', 'surprise'])
+ df = train_data[filtro].copy()
+
+ # One-hot encode the emotion into one binary column per label.
+ for emotion in ['anger', 'fear', 'surprise', 'joy', 'sadness', 'neutral']:
+     df[emotion] = (df['emotion'] == emotion).astype(int)
+
+ # Drop everything except the essay text and the one-hot label columns.
+ df.drop(['emotion', 'message_id', 'response_id', 'article_id', 'empathy', 'distress',
+          'empathy_bin', 'distress_bin', 'gender', 'education', 'race', 'age', 'income', 'personality_conscientiousness',
+          'personality_openess', 'personality_extraversion', 'personality_agreeableness', 'personality_stability',
+          'iri_perspective_taking', 'iri_personal_distress', 'iri_fantasy', 'iri_empathatic_concern', 'raw_input_emotions'],
+         axis=1, inplace=True)
+
+ print(df.head())
+
+ train_df, val_df = train_test_split(df, test_size=0.05, random_state=RANDOM_SEED)
+ print(train_df.shape, val_df.shape)
+
+ LABEL_COLUMNS = ['anger', 'joy', 'fear', 'surprise', 'sadness', 'neutral']
+
+ # Inspect one training example.
+ sample_row = train_df.iloc[16]
+ sample_comment = sample_row.essay
+ sample_labels = sample_row[LABEL_COLUMNS]
+ print(sample_comment)
+ print(sample_labels.to_dict())
+
+ BERT_MODEL_NAME = 'bert-base-cased'
+ tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)
+
+ # Tokenize one sample essay to sanity-check the encoding.
+ encoding = tokenizer.encode_plus(
+     sample_comment,
+     add_special_tokens=True,
+     max_length=512,
+     return_token_type_ids=False,
+     padding="max_length",
+     truncation=True,
+     return_attention_mask=True,
+     return_tensors='pt',
+ )
+
+ print(encoding.keys())
+ print(encoding["input_ids"].shape, encoding["attention_mask"].shape)
+ print(encoding["input_ids"].squeeze()[:20])
+ print(encoding["attention_mask"].squeeze()[:20])
+ print(tokenizer.convert_ids_to_tokens(encoding["input_ids"].squeeze())[:20])
+
+ class EmotionDataset(Dataset):
+     def __init__(
+         self,
+         data: pd.DataFrame,
+         tokenizer: BertTokenizer,
+         max_token_len: int = 128
+     ):
+         self.tokenizer = tokenizer
+         self.data = data
+         self.max_token_len = max_token_len
+
+     def __len__(self):
+         return len(self.data)
+
+     def __getitem__(self, index: int):
+         data_row = self.data.iloc[index]
+
+         comment_text = data_row.essay
+         labels = data_row[LABEL_COLUMNS]
+
+         encoding = self.tokenizer.encode_plus(
+             comment_text,
+             add_special_tokens=True,
+             max_length=self.max_token_len,
+             return_token_type_ids=False,
+             padding="max_length",
+             truncation=True,
+             return_attention_mask=True,
+             return_tensors='pt',
+         )
+
+         return dict(
+             comment_text=comment_text,
+             input_ids=encoding["input_ids"].flatten(),
+             attention_mask=encoding["attention_mask"].flatten(),
+             labels=torch.FloatTensor(labels)
+         )
+
+ bert_model = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
+
+ # Sanity-check the dataset and a single BERT forward pass.
+ train_dataset = EmotionDataset(train_df, tokenizer)
+ sample_item = train_dataset[0]
+ print(sample_item.keys())
+
+ sample_batch = next(iter(DataLoader(train_dataset, batch_size=8, num_workers=2)))
+ print(sample_batch["input_ids"].shape, sample_batch["attention_mask"].shape)
+
+ output = bert_model(sample_batch["input_ids"], sample_batch["attention_mask"])
+ print(output.last_hidden_state.shape, output.pooler_output.shape)
+
+ class EmotionDataModule(pl.LightningDataModule):
+
+     def __init__(self, train_df, test_df, tokenizer, batch_size=8, max_token_len=128):
+         super().__init__()
+         self.batch_size = batch_size
+         self.train_df = train_df
+         self.test_df = test_df
+         self.tokenizer = tokenizer
+         self.max_token_len = max_token_len
+
+     def setup(self, stage=None):
+         self.train_dataset = EmotionDataset(
+             self.train_df,
+             self.tokenizer,
+             self.max_token_len
+         )
+
+         self.test_dataset = EmotionDataset(
+             self.test_df,
+             self.tokenizer,
+             self.max_token_len
+         )
+
+     def train_dataloader(self):
+         return DataLoader(
+             self.train_dataset,
+             batch_size=self.batch_size,
+             shuffle=True,
+             num_workers=2
+         )
+
+     def val_dataloader(self):
+         return DataLoader(
+             self.test_dataset,
+             batch_size=self.batch_size,
+             num_workers=2
+         )
+
+     def test_dataloader(self):
+         return DataLoader(
+             self.test_dataset,
+             batch_size=self.batch_size,
+             num_workers=2
+         )
+
+ N_EPOCHS = 10
+ BATCH_SIZE = 12
+ MAX_TOKEN_COUNT = 512
+
+ data_module = EmotionDataModule(
+     train_df,
+     val_df,
+     tokenizer,
+     batch_size=BATCH_SIZE,
+     max_token_len=MAX_TOKEN_COUNT
+ )
+
+ class EmotionTagger(pl.LightningModule):
+     def __init__(self, n_classes: int, n_training_steps=None, n_warmup_steps=None):
+         super().__init__()
+         self.bert = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
+         self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)
+         self.n_training_steps = n_training_steps
+         self.n_warmup_steps = n_warmup_steps
+         self.criterion = nn.BCELoss()
+
+     def forward(self, input_ids, attention_mask, labels=None):
+         output = self.bert(input_ids, attention_mask=attention_mask)
+         output = self.classifier(output.pooler_output)
+         output = torch.sigmoid(output)
+         loss = 0
+         if labels is not None:
+             loss = self.criterion(output, labels)
+         return loss, output
+
+     def training_step(self, batch, batch_idx):
+         input_ids = batch["input_ids"]
+         attention_mask = batch["attention_mask"]
+         labels = batch["labels"]
+         loss, outputs = self(input_ids, attention_mask, labels)
+         self.log("train_loss", loss, prog_bar=True, logger=True)
+         return {"loss": loss, "predictions": outputs, "labels": labels}
+
+     def validation_step(self, batch, batch_idx):
+         input_ids = batch["input_ids"]
+         attention_mask = batch["attention_mask"]
+         labels = batch["labels"]
+         loss, outputs = self(input_ids, attention_mask, labels)
+         self.log("val_loss", loss, prog_bar=True, logger=True)
+         return loss
+
+     def test_step(self, batch, batch_idx):
+         input_ids = batch["input_ids"]
+         attention_mask = batch["attention_mask"]
+         labels = batch["labels"]
+         loss, outputs = self(input_ids, attention_mask, labels)
+         self.log("test_loss", loss, prog_bar=True, logger=True)
+         return loss
+
+     def training_epoch_end(self, outputs):
+         # Log a per-label ROC AUC over all predictions gathered during the epoch.
+         labels = torch.cat([out["labels"] for out in outputs]).detach().cpu().int()
+         predictions = torch.cat([out["predictions"] for out in outputs]).detach().cpu()
+         for i, name in enumerate(LABEL_COLUMNS):
+             class_roc_auc = pl.metrics.functional.auroc(predictions[:, i], labels[:, i])
+             self.logger.experiment.add_scalar(f"{name}_roc_auc/Train", class_roc_auc, self.current_epoch)
+
+     def configure_optimizers(self):
+         optimizer = AdamW(self.parameters(), lr=2e-5)
+
+         scheduler = get_linear_schedule_with_warmup(
+             optimizer,
+             num_warmup_steps=self.n_warmup_steps,
+             num_training_steps=self.n_training_steps
+         )
+
+         return dict(
+             optimizer=optimizer,
+             lr_scheduler=dict(
+                 scheduler=scheduler,
+                 interval='step'
+             )
+         )
+
+ steps_per_epoch = len(train_df) // BATCH_SIZE
+ total_training_steps = steps_per_epoch * N_EPOCHS
+ warmup_steps = total_training_steps // 5  # warm up over the first fifth of training
+
+ model = EmotionTagger(
+     n_classes=len(LABEL_COLUMNS),
+     n_warmup_steps=warmup_steps,
+     n_training_steps=total_training_steps
+ )
+
+ # Clear stale logs and checkpoints from previous runs.
+ import shutil
+ shutil.rmtree("lightning_logs", ignore_errors=True)
+ shutil.rmtree("checkpoints", ignore_errors=True)
+
+ # Keep only the best checkpoint (lowest validation loss).
+ checkpoint_callback = ModelCheckpoint(
+     dirpath="checkpoints",
+     filename="best-checkpoint",
+     save_top_k=1,
+     verbose=True,
+     monitor="val_loss",
+     mode="min"
+ )
+
+ early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)
+
+ trainer = pl.Trainer(
+     max_epochs=N_EPOCHS,
+     callbacks=[early_stopping_callback, checkpoint_callback],
+ )
+
+ trainer.fit(model, data_module)
+
+ # Reload the best checkpoint and freeze it for inference.
+ trained_model = EmotionTagger.load_from_checkpoint(
+     trainer.checkpoint_callback.best_model_path,
+     n_classes=len(LABEL_COLUMNS)
+ )
+ trained_model.eval()
+ trained_model.freeze()
+
+
+ def run_sentiment_analysis(txt):
+     """Return the labels whose predicted probability exceeds THRESHOLD."""
+     THRESHOLD = 0.5
+
+     encoding = tokenizer.encode_plus(
+         txt,
+         add_special_tokens=True,
+         max_length=512,
+         return_token_type_ids=False,
+         padding="max_length",
+         truncation=True,
+         return_attention_mask=True,
+         return_tensors='pt',
+     )
+
+     _, test_prediction = trained_model(encoding["input_ids"], encoding["attention_mask"])
+     test_prediction = test_prediction.flatten().numpy()
+
+     predictions = []
+
+     for label, prediction in zip(LABEL_COLUMNS, test_prediction):
+         if prediction < THRESHOLD:
+             continue
+         predictions.append(f"{label}: {prediction}")
+     return predictions
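+
+ if __name__ == "__main__":
+     # Quick smoke test with a made-up sentence; the printed scores depend
+     # entirely on the trained weights, so treat the output as illustrative.
+     print(run_sentiment_analysis("I was terrified when the lights went out."))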
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ transformers==4.5.1
+ pytorch-lightning==1.2.8
+ torch
+ pandas
+ numpy
+ scikit-learn