---
title: Sentiment Analysis App
emoji: 🚀
colorFrom: green
colorTo: purple
sdk: streamlit
sdk_version: 1.17.0
app_file: app.py
pinned: false
---

# AI Project: Finetuning Language Models - Toxic Tweets

Hello! This is a project for CS-UY 4613: Artificial Intelligence. I'm providing step-by-step instructions on finetuning language models for detecting toxic tweets.

# Milestone 3

This milestone covers finetuning a language model with HuggingFace for sentiment analysis.

Link to app: https://huggingface.co/spaces/andyqin18/sentiment-analysis-app

Here's the setup block that imports all required modules and selects the device:
```
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
```

## 1. Prepare Data
First we extract the comment strings and labels from `train.csv`, randomly sample 20,000 examples, and split them into training and validation sets with an 80/20 ratio. We also create two dictionaries that map labels to integers and back.
```
df = pd.read_csv("milestone3/comp/train.csv")

train_texts = df["comment_text"].values
labels = df.columns[2:]  # label columns: everything after id and comment_text
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}
train_labels = df[labels].values


# Randomly select 20000 samples within the data
np.random.seed(18)
sample_idx = np.random.choice(train_labels.shape[0], size=20000, replace=False)
small_train_texts = train_texts[sample_idx]
small_train_labels = train_labels[sample_idx, :]


# Separate training data and validation data (80% / 20%)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    small_train_texts, small_train_labels, test_size=0.2
)
```
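
As a quick sanity check (this snippet is illustrative, not part of the original notebook), the shapes should line up: each label row is a multi-hot vector with one entry per label column, six of them assuming the standard Jigsaw `train.csv` layout.
```
# Illustrative check -- 20000 samples split 80/20, 6 label columns assumed
print(train_texts.shape, train_labels.shape)  # expected: (16000,) (16000, 6)
print(val_texts.shape, val_labels.shape)      # expected: (4000,) (4000, 6)
```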

## 2. Data Preprocessing
Models like BERT don't take raw text as direct input; they expect `input_ids`, `attention_mask`, etc., so we tokenize the text first. `AutoTokenizer` automatically loads the appropriate tokenizer for the checkpoint name on the Hub. We then wrap the texts and labels into datasets using a class we define.
```
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __getitem__(self, idx):
        # Tokenize on the fly; pad every example to the model's max length (512)
        encodings = tokenizer(self.texts[idx], truncation=True, padding="max_length")
        item = {key: torch.tensor(val) for key, val in encodings.items()}
        # Multi-label targets must be floats for the BCE-with-logits loss
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = TextDataset(train_texts, train_labels)
val_dataset = TextDataset(val_texts, val_labels)
```
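
To confirm the dataset yields what `Trainer` expects, a quick look at one item helps (illustrative snippet, not from the original code):
```
# Each item should hold input_ids, token_type_ids, attention_mask, and labels
sample = train_dataset[0]
print(sample.keys())              # the four tensor fields above
print(sample["input_ids"].shape)  # torch.Size([512]) -- padded to max length
print(sample["labels"])           # a float vector with one entry per label
```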

## 3. Train the model using Trainer
We define a model from a pre-trained base and set the problem type to `multi_label_classification`, so the model uses a binary cross-entropy loss with one independent decision per label. We then train it with `Trainer`, which first requires `TrainingArguments` specifying the training hyperparameters, where we set the learning rate, batch sizes, and `push_to_hub=True`.

After logging in with a HuggingFace access token, the model is pushed to the Hub.

```
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)
model.to(device)

training_args = TrainingArguments(
    output_dir="finetuned-bert-uncased",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    load_best_model_at_end=True,
    push_to_hub=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

trainer.train()
trainer.push_to_hub()
```
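
Once training finishes, the model can be loaded back from the Hub for inference. Since this is multi-label classification, the logits go through a sigmoid rather than a softmax, so each label gets an independent probability. The snippet below is a sketch; the repo id `andyqin18/finetuned-bert-uncased` is an assumption inferred from `output_dir`, so substitute the actual model name.
```
# Sketch of multi-label inference; the repo id below is an assumed name
model_name = "andyqin18/finetuned-bert-uncased"  # assumption: inferred from output_dir
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("You are a wonderful person!", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Sigmoid, not softmax: each label is an independent yes/no decision
probs = torch.sigmoid(logits)[0]
for idx, p in enumerate(probs):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```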

## 4. Update the app

Modify [app.py](app.py) so that it takes in a text input and generates an analysis using one of the provided models. Details are explained in the comment lines. The app should look like this:

![](milestone3/appUI.png)
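
For reference, here is a minimal sketch of what such an `app.py` could look like. This is not the actual app code: the model id is an assumption (inferred from `output_dir`), and the layout is just one way to wire a text box to the model. Note that `st.experimental_singleton` matches the pinned Streamlit 1.17; later versions renamed it `st.cache_resource`.
```
import streamlit as st
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "andyqin18/finetuned-bert-uncased"  # assumed repo id

@st.experimental_singleton  # load the model only once per session
def load_model():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    return tokenizer, model

st.title("Sentiment Analysis App")
text = st.text_area("Enter a tweet to analyze:")

if st.button("Analyze") and text:
    tokenizer, model = load_model()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    for idx, p in enumerate(probs):
        st.write(f"{model.config.id2label[idx]}: {p:.3f}")
```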