{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eUS7fBrdXj9A"
      },
      "source": [
        "Ensures that Transformers and pytorch_lightning libraries are installed in the runtime."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "LF6cf_ijDNhp"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "%pip install transformers\n",
        "%pip install pytorch_lightning\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pf4Ch2xaXszi"
      },
      "source": [
        "Imports everything needed for creating the model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "pmd3-0J4_rnp"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "from torch.utils.data import Dataset\n",
        "import torch\n",
        "from transformers import AutoTokenizer\n",
        "import pytorch_lightning as pl\n",
        "from torch.utils.data import DataLoader\n",
        "from transformers import AutoModel, AdamW, get_cosine_schedule_with_warmup\n",
        "import torch.nn as nn\n",
        "import math\n",
        "from torchmetrics.functional.classification import auroc\n",
        "import torch.nn.functional as F\n",
        "from transformers import BertForMaskedLM, TFBertForMaskedLM"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EyhmDzknX2k8"
      },
      "source": [
        "Defines class toxicity_dataset, which extends Dataset. __init__ takes a path to the csv file, a tokenizer, a list of attributes, a maximum length for the token which defaults to 128, and a size for sampling the data which defaults to 5000.\n",
        "\n",
        "The toxicity_dataset class prepares the data by sampling the csv file in _prepare_data, returns the length of the dataset with __len__, and can acess an item with an index and return its input ids, its attention mask, and its labels.\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "vS6DHv5k_xZ5"
      },
      "outputs": [],
      "source": [
        "class toxicity_dataset(Dataset):\n",
        "    def __init__(self,data_path,tokenizer,attributes,max_token_len= 128,sample = 5000):\n",
        "        self.data_path=data_path\n",
        "        self.tokenizer=tokenizer\n",
        "        self.attributes=attributes\n",
        "        self.max_token_len=max_token_len\n",
        "        self.sample=sample\n",
        "        self._prepare_data()\n",
        "    def _prepare_data(self):\n",
        "        data=pd.read_csv(self.data_path)\n",
        "        if self.sample is not None:\n",
        "            self.data=data.sample(self.sample,random_state=7)\n",
        "        else:\n",
        "            self.data=data\n",
        "    def __len__(self):\n",
        "        return(len(self.data))\n",
        "    def __getitem__(self,index):\n",
        "        item = self.data.iloc[index]\n",
        "        comment = str(item.comment_text)\n",
        "        attributes = torch.FloatTensor(item[self.attributes])\n",
        "        tokens = self.tokenizer.encode_plus(comment,add_special_tokens=True,return_tensors=\"pt\",truncation=True,max_length=self.max_token_len,padding=\"max_length\",return_attention_mask=True)\n",
        "        return{'input_ids':tokens.input_ids.flatten(),\"attention_mask\":tokens.attention_mask.flatten(),\"labels\":attributes}\n"
      ]
    },
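    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the __getitem__ contract concrete, the sketch below builds a small dataset and inspects one item. This is an illustrative addition: it assumes the roberta-base tokenizer and the train.csv path used later in this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Illustrative sketch: assumes train.csv with the six Jigsaw label columns exists at this path\n",
        "demo_tokenizer = AutoTokenizer.from_pretrained(\"roberta-base\")\n",
        "demo_attributes = [\"toxic\", \"severe_toxic\", \"obscene\", \"threat\", \"insult\", \"identity_hate\"]\n",
        "demo_ds = toxicity_dataset(\"/content/sample_data/train.csv\", demo_tokenizer, demo_attributes, sample=100)\n",
        "item = demo_ds[0]\n",
        "print(item[\"input_ids\"].shape)       # torch.Size([128])\n",
        "print(item[\"attention_mask\"].shape)  # torch.Size([128])\n",
        "print(item[\"labels\"])                # FloatTensor with one value per attribute\n"
      ]
    },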
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w-Kmud4YZXH2"
      },
      "source": [
        "Defines Toxicity_Data_Module, which extends pytorch_lightnign.LightningDataModule.__init__ takes a path to the training data set and a testing set, a list of attributes, batch size, tax token length, and the name of the model. It has methods to create and return a dataloader for training, validation, and prediciton."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "Txrjrl1d_6UN"
      },
      "outputs": [],
      "source": [
        "class Toxcity_Data_Module(pl.LightningDataModule):\n",
        "    def __init__(self,train_path,test_path,attributes,batch_size = 16, max_token_len = 128, model_name=\"roberta-base\"):\n",
        "        super().__init__()\n",
        "        self.train_path=train_path\n",
        "        self.test_path=test_path\n",
        "        self.attributes=attributes\n",
        "        self.batch_size=batch_size\n",
        "        self.max_token_len=max_token_len\n",
        "        self.model_name=model_name\n",
        "        self.tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
        "    def setup(self, stage = None):\n",
        "        if stage in (None, \"fit\"):\n",
        "            self.train_dataset=toxicity_dataset(self.train_path,self.tokenizer,self.attributes)\n",
        "            self.test_dataset=toxicity_dataset(self.test_path,self.tokenizer,self.attributes, sample=None)\n",
        "        if stage == \"predict\":\n",
        "            self.val_dataset=toxicity_dataset(self.test_path,self.tokenizer,self.attributes)\n",
        "    def train_dataloader(self):\n",
        "        return DataLoader(self.train_dataset,batch_size=self.batch_size,shuffle=True, num_workers=12)\n",
        "    def val_dataloader(self):\n",
        "        return DataLoader(self.train_dataset,batch_size=self.batch_size,shuffle=False, num_workers=12)\n",
        "    def predict_dataloader(self):\n",
        "        return DataLoader(self.test_dataset,batch_size=self.batch_size,shuffle=False, num_workers=12)\n",
        "    "
      ]
    },
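    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A minimal sketch of how the data module is consumed: set it up for the \"fit\" stage and pull one batch from the training dataloader to check tensor shapes. It reuses the demo paths and attribute list from the dataset sketch above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "demo_dm = Toxicity_Data_Module(\"/content/sample_data/train.csv\", \"/content/sample_data/train.csv\", demo_attributes)\n",
        "demo_dm.setup(stage=\"fit\")\n",
        "batch = next(iter(demo_dm.train_dataloader()))\n",
        "print(batch[\"input_ids\"].shape)  # torch.Size([16, 128]) with the default batch size\n",
        "print(batch[\"labels\"].shape)     # torch.Size([16, 6])\n"
      ]
    },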
    {
      "cell_type": "markdown",
      "source": [
        "Defines Toxic_Comment_Classifier, which is the actual model we are fine tuning, and extends pl.LightningModule. __init__ takes a dictionary storing model name, number of labels, batch size, learning rate, warmup, the size of the training set, weighted decay, and number of epochs. It's methods are never called explicitly in the code, and the class is instead used by a pytorch_lightning Trainer object."
      ],
      "metadata": {
        "id": "2XqQMdp1YahZ"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "7AhJvpKyADHJ"
      },
      "outputs": [],
      "source": [
        "class Toxic_Comment_Classifier(pl.LightningModule):\n",
        "  def __init__(self, config: dict):\n",
        "    super().__init__()\n",
        "    self.config = config\n",
        "    self.pretrained_model = AutoModel.from_pretrained(config['model_name'], return_dict = True)\n",
        "    self.hidden = torch.nn.Linear(self.pretrained_model.config.hidden_size, self.pretrained_model.config.hidden_size)\n",
        "    self.classifier = torch.nn.Linear(self.pretrained_model.config.hidden_size, self.config['n_labels'])\n",
        "    torch.nn.init.xavier_uniform_(self.classifier.weight)\n",
        "    self.loss_func = nn.BCEWithLogitsLoss(reduction='mean')\n",
        "    self.dropout = nn.Dropout()\n",
        "    \n",
        "  def forward(self, input_ids, attention_mask=None, labels=None):\n",
        "    # roberta layer\n",
        "    output = self.pretrained_model(input_ids=input_ids, attention_mask=attention_mask)\n",
        "    pooled_output = torch.mean(output.last_hidden_state, 1)\n",
        "    # final logits\n",
        "    pooled_output = self.dropout(pooled_output)\n",
        "    pooled_output = self.hidden(pooled_output)\n",
        "    pooled_output = F.relu(pooled_output)\n",
        "    pooled_output = self.dropout(pooled_output)\n",
        "    logits = self.classifier(pooled_output)\n",
        "    # calculate loss\n",
        "    loss = 0\n",
        "    if labels is not None:\n",
        "      loss = self.loss_func(logits.view(-1, self.config['n_labels']), labels.view(-1, self.config['n_labels']))\n",
        "    return loss, logits\n",
        "\n",
        "  def training_step(self, batch, batch_index):\n",
        "    loss, outputs = self(**batch)\n",
        "    self.log(\"train loss \", loss, prog_bar = True, logger=True)\n",
        "    return {\"loss\":loss, \"predictions\":outputs, \"labels\": batch[\"labels\"]}\n",
        "\n",
        "  def validation_step(self, batch, batch_index):\n",
        "    loss, outputs = self(**batch)\n",
        "    self.log(\"validation loss \", loss, prog_bar = True, logger=True)\n",
        "    return {\"val_loss\": loss, \"predictions\":outputs, \"labels\": batch[\"labels\"]}\n",
        "\n",
        "  def predict_step(self, batch, batch_index):\n",
        "    loss, outputs = self(**batch)\n",
        "    return outputs\n",
        "\n",
        "  def configure_optimizers(self):\n",
        "    optimizer = AdamW(self.parameters(), lr=self.config['lr'], weight_decay=self.config['w_decay'])\n",
        "    total_steps = self.config['train_size']/self.config['bs']\n",
        "    warmup_steps = math.floor(total_steps * self.config['warmup'])\n",
        "    warmup_steps = math.floor(total_steps * self.config['warmup'])\n",
        "    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)\n",
        "    return [optimizer],[scheduler]"
      ]
    },
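    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the forward signature concrete, the sketch below runs a dummy batch through the model and checks the output shapes. The config values here are placeholders for illustration, not the training configuration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "demo_config = {\"model_name\": \"distilroberta-base\", \"n_labels\": 6, \"lr\": 1.5e-6,\n",
        "               \"w_decay\": 0.001, \"train_size\": 1, \"bs\": 2, \"warmup\": 0.2, \"n_epochs\": 1}\n",
        "demo_model = Toxic_Comment_Classifier(demo_config)\n",
        "dummy_ids = torch.randint(0, demo_model.pretrained_model.config.vocab_size, (2, 128))\n",
        "dummy_mask = torch.ones_like(dummy_ids)\n",
        "dummy_labels = torch.zeros(2, 6)\n",
        "loss, logits = demo_model(dummy_ids, dummy_mask, dummy_labels)\n",
        "print(loss)          # scalar BCE-with-logits loss on the dummy labels\n",
        "print(logits.shape)  # torch.Size([2, 6])\n"
      ]
    },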
    {
      "cell_type": "markdown",
      "source": [
        "The variable, attributes, stores the list of labels used to classify the comments. toxicity_data_module is created and setup for later use."
      ],
      "metadata": {
        "id": "CRZVlU89boV5"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VQwXmUnn_-co"
      },
      "outputs": [],
      "source": [
        "attributes=[\"toxic\",\"severe_toxic\",\"obscene\",\"threat\",\"insult\",\"identity_hate\"]\n",
        "\n",
        "toxicity_data_module=Toxcity_Data_Module(\"/content/sample_data/train.csv\",\"/content/sample_data/train.csv\",attributes)\n",
        "toxicity_data_module.setup()"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here, we define the config dictionary. the model we are fine tuning is the distilroberta-base model. We use attributes to get the number of labels. the batch size is 128, the learning rate is 0.0000015, and we're using 80 epochs for training."
      ],
      "metadata": {
        "id": "OTumKddHb6wB"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Q77QTZpUAFAS"
      },
      "outputs": [],
      "source": [
        "config = {\n",
        "    'model_name':\"distilroberta-base\",\n",
        "    'n_labels':len(attributes),\n",
        "    'bs':128,\n",
        "    'lr':1.5e-6,\n",
        "    'warmup':0.2,\n",
        "    \"train_size\":len(toxicity_data_module.train_dataloader()),\n",
        "    'w_decay':0.001,\n",
        "    'n_epochs':80\n",
        "}\n"
      ]
    },
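    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A quick sanity check on the scheduler arithmetic used in configure_optimizers. Note that train_size was measured on the first data module, which used the default batch size of 16 rather than config['bs']."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# batches per epoch, as recorded in the config\n",
        "print(config['train_size'])\n",
        "# total optimizer steps across all epochs, and the warmup portion\n",
        "total_steps = config['train_size'] * config['n_epochs']\n",
        "warmup_steps = math.floor(total_steps * config['warmup'])\n",
        "print(total_steps, warmup_steps)\n"
      ]
    },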
    {
      "cell_type": "markdown",
      "source": [
        "We recreated toxicity_data_module using the batch size from the config file, and set it up again. The variable trainer is created as a pytorch_lightning.Trainer object, taking the number of epochs and a number of sanity validation steps. The trainer object then fits the model using the model we created and the data module. torch.save() is called to save the model into a file called model.pt, which can be loaded for later use. This is how the model was loaded into the streamlit app without having to retrain it by scratch."
      ],
      "metadata": {
        "id": "raqYuS_Vepj1"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Mhk8fhgFAJNJ"
      },
      "outputs": [],
      "source": [
        "toxicity_data_module=Toxcity_Data_Module(\"/content/sample_data/train.csv\",\"/content/sample_data/test.csv\",attributes,batch_size=config['bs'])\n",
        "toxicity_data_module.setup()\n",
        "model = Toxic_Comment_Classifier(config)\n",
        "\n",
        "trainer = pl.Trainer(max_epochs=config['n_epochs'],num_sanity_val_steps=50)\n",
        "trainer.fit(model,toxicity_data_module)\n",
        "torch.save(model, \"/content/sample_data/model.pt\")"
      ]
    },
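    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "torch.save(model, ...) pickles the whole module, which ties the saved file to this exact class definition. A more portable alternative (a sketch, not what this notebook's streamlit app used) is to save only the weights and restore them into a freshly constructed model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# save just the parameters...\n",
        "torch.save(model.state_dict(), \"/content/sample_data/model_state.pt\")\n",
        "# ...and restore them into a new instance of the same architecture\n",
        "restored = Toxic_Comment_Classifier(config)\n",
        "restored.load_state_dict(torch.load(\"/content/sample_data/model_state.pt\"))\n",
        "restored.eval()\n"
      ]
    },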
    {
      "cell_type": "markdown",
      "source": [
        "After all 80 epochs, the training and validatin loss are all very low, under 0.05. This shows that the model is pretty effective at determining the labels, and is not overfitting."
      ],
      "metadata": {
        "id": "pt-FyfFVqRdS"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here we load the model file back into the model variable."
      ],
      "metadata": {
        "id": "3h4j9aqTgL1Q"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "dG_Z30QlQfSP"
      },
      "outputs": [],
      "source": [
        "model = torch.load(\"/content/sample_data/model.pt\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The function predict_raw_comments() takes the model, and the data module, and gets predictions from the trainer in the form of logits."
      ],
      "metadata": {
        "id": "WxJWKhnViQjE"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "TiXWsKFZQmQs"
      },
      "outputs": [],
      "source": [
        "def predict_raw_comments(model, dm):\n",
        "  predictions = trainer.predict(model,dm)\n",
        "  logits = np.stack([torch.sigmoid(torch.Tensor(p)) for batch in predictions for p in batch])\n",
        "  return logits"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Below, we use the predict_raw_comments() function to get the predicted logits."
      ],
      "metadata": {
        "id": "ihy78y__iyQL"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8ZkNo8u4Qmd7"
      },
      "outputs": [],
      "source": [
        "logits = predict_raw_comments(model,toxicity_data_module)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "below, we convert the numpy array of logits retrieved into torch logits using torch.from_numpy(logits). We then use a softmax function to find the softmax probabilities for each label, and convert it back into a numpy array. "
      ],
      "metadata": {
        "id": "vk7FQY7ajCUJ"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Znq6JcNS8P7Z"
      },
      "outputs": [],
      "source": [
        "torch_logits = torch.from_numpy(logits)\n",
        "probabilities = F.softmax(torch_logits, dim = -1).numpy()\n"
      ]
    },
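    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since the model was trained with BCEWithLogitsLoss, each label is an independent binary decision, so the per-label sigmoid scores returned by predict_raw_comments can also be thresholded directly instead of renormalized with softmax. A sketch of that alternative; the 0.5 threshold is an assumption, not taken from the original notebook:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# 'logits' already holds sigmoid scores in [0, 1], one column per attribute\n",
        "flags = logits >= 0.5  # assumed threshold; boolean matrix of shape (n_comments, n_labels)\n",
        "for j, name in enumerate(attributes):\n",
        "    print(name, int(flags[:, j].sum()), \"comments flagged\")\n"
      ]
    },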
    {
      "cell_type": "markdown",
      "source": [
        "We can then load our testing data into a dataframe, and use a nested loop to loop through both the input dataframe and the probabilities dataframe to find the label with the maximum probability. We can store the comment, the most likely labe, and the probability of that label into a new dataframe called results_df."
      ],
      "metadata": {
        "id": "S7Ixxgo5jj00"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KabNY9yrxsGn"
      },
      "outputs": [],
      "source": [
        "inputs=pd.read_csv(\"/content/sample_data/test.csv\")\n",
        "data=[]\n",
        "for i in range(20):\n",
        "  max_prob = 0\n",
        "  max_cat = 6\n",
        "  \n",
        "  prob=0\n",
        "  for j in range(6):\n",
        "      prob=probabilities[i][j]\n",
        "      if(prob >= max_prob): \n",
        "        max_prob = prob \n",
        "        max_cat = j\n",
        "  data.append([inputs[\"comment_text\"][i],attributes[max_cat],max_prob])\n",
        "results_df=pd.DataFrame(data,columns=[\"Comment Text\",\"Most Likely Classification\",\"Classification Probability\"])"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "machine_shape": "hm",
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}