Updated README
- README.md +191 -5
- app.py +1 -1
- test_model.py +6 -6
README.md
CHANGED
@@ -11,13 +11,192 @@ pinned: false
|
|
11 |
|
12 |
# AI Project: Finetuning Language Models - Toxic Tweets
|
13 |
|
14 |
-
Hello! This is a project for CS-UY 4613: Artificial Intelligence. It provides step-by-step instructions for finetuning language models to detect toxic tweets.
|
15 |
|
16 |
-
#
|
|
|
17 |
|
18 |
-
|
19 |
|
20 |
-
Link to
|
21 |
|
22 |
Here's the setup block that includes all modules:
|
23 |
```
|
@@ -121,4 +300,11 @@ trainer.push_to_hub()
|
|
121 |
|
122 |
Modify [app.py](app.py) so that it takes a text input and generates an analysis using one of the provided models. Details are explained in the code comments. The app should look like this:
|
123 |
|
124 |
-
![](milestone3/appUI.png)
|
11 |
|
12 |
# AI Project: Finetuning Language Models - Toxic Tweets
|
13 |
|
14 |
+
Hello! This is a project for CS-UY 4613: Artificial Intelligence. It provides step-by-step instructions for finetuning language models to detect toxic tweets. All code is well commented.
|
15 |
|
16 |
+
# Everything you need to know
|
17 |
+
Link to HuggingFace space: https://huggingface.co/spaces/andyqin18/sentiment-analysis-app
|
18 |
|
19 |
+
----Code behind app: [app.py](app.py)
|
20 |
|
21 |
+
Link to finetuned model: https://huggingface.co/andyqin18/finetuned-bert-uncased
|
22 |
+
|
23 |
+
----Code for how to finetune a language model: [finetune.ipynb](milestone3/finetune.ipynb)
|
24 |
+
|
25 |
+
The model's performance, measured with [test_model.py](test_model.py), is shown below. The results were generated on 2000 randomly selected samples from [train.csv](milestone3/comp/train.csv).
|
26 |
+
|
27 |
+
```
|
28 |
+
{'label_accuracy': 0.9821666666666666,
|
29 |
+
'prediction_accuracy': 0.9195,
|
30 |
+
'precision': 0.8263888888888888,
|
31 |
+
'recall': 0.719758064516129}
|
32 |
+
```
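
The four numbers above can be reproduced in spirit with a short, self-contained sketch. It assumes that predictions and ground truth are already thresholded into 0/1 lists with one entry per label; the function name `multilabel_performance` and the toy data are hypothetical and are not taken from [test_model.py](test_model.py).

```python
from typing import Dict, Sequence


def multilabel_performance(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
) -> Dict[str, float]:
    """Compute the four metrics reported above for 0/1 multi-label data."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    total_true = 0      # individual labels predicted correctly
    total_success = 0   # samples where every label is correct
    tp = fp = fn = 0    # counts pooled over all labels

    for truth, pred in zip(y_true, y_pred):
        matches = [t == p for t, p in zip(truth, pred)]
        total_true += sum(matches)
        total_success += int(all(matches))
        for t, p in zip(truth, pred):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1

    n_samples = len(y_true)
    n_labels = len(y_true[0])
    return {
        "label_accuracy": total_true / (n_labels * n_samples),
        "prediction_accuracy": total_success / n_samples,
        # Guard against division by zero when no positives occur.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Toy data: 3 samples, 3 labels each (the real task uses 6 labels).
    truth = [[1, 0, 0], [0, 0, 0], [1, 1, 0]]
    preds = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
    print(multilabel_performance(truth, preds))
```

On this toy data, 7 of 9 labels match and 1 of 3 samples is fully correct, which is why `label_accuracy` is typically much higher than `prediction_accuracy`.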
|
33 |
+
|
34 |
+
Now let's walk through the details :)
|
35 |
+
|
36 |
+
# Milestone 1 - Setup
|
37 |
+
|
38 |
+
This milestone covers setting up Docker and creating a development environment on Windows 11.
|
39 |
+
|
40 |
+
## 1. Enable WSL2 feature
|
41 |
+
|
42 |
+
The Windows Subsystem for Linux (WSL) lets developers install a Linux distribution on Windows.
|
43 |
+
|
44 |
+
```
|
45 |
+
wsl --install
|
46 |
+
```
|
47 |
+
|
48 |
+
Ubuntu is the default distribution installed, and WSL2 is the default version.
|
49 |
+
After you create a Linux username and password, Ubuntu appears in Windows Terminal.
|
50 |
+
Details can be found [here](https://learn.microsoft.com/en-us/windows/wsl/install).
|
51 |
+
|
52 |
+
![](milestone1/wsl2.png)
|
53 |
+
|
54 |
+
## 2. Download and install the Linux kernel update package
|
55 |
+
|
56 |
+
The package needs to be downloaded before installing Docker Desktop.
|
57 |
+
However, this error might occur:
|
58 |
+
|
59 |
+
`Error: wsl_update_x64.msi unable to run because "This update only applies to machines with the Windows Subsystem for Linux"`
|
60 |
+
|
61 |
+
Solution: open Windows Features and enable "Windows Subsystem for Linux".
|
62 |
+
The update [package](https://docs.microsoft.com/windows/wsl/wsl2-kernel) then runs successfully.
|
63 |
+
|
64 |
+
![](milestone1/kernal_update_sol.png)
|
65 |
+
|
66 |
+
## 3. Download Docker Desktop
|
67 |
+
|
68 |
+
After installing the [Docker App](https://www.docker.com/products/docker-desktop/), the WSL2-based engine is enabled automatically.
|
69 |
+
If not, follow [this link](https://docs.docker.com/desktop/windows/wsl/) for the steps to turn on the WSL2 backend.
|
70 |
+
Open the app and run `docker version` in Terminal to check that the server is running.
|
71 |
+
|
72 |
+
![](milestone1/docker_version.png)
|
73 |
+
Docker is ready to go.
|
74 |
+
|
75 |
+
## 4. Create project container and image
|
76 |
+
|
77 |
+
First we download the Ubuntu image from Docker’s library with:
|
78 |
+
```
|
79 |
+
docker pull ubuntu
|
80 |
+
```
|
81 |
+
We can check the available images with:
|
82 |
+
```
|
83 |
+
docker image ls
|
84 |
+
```
|
85 |
+
We can create a container named *AI_project* based on the Ubuntu image with:
|
86 |
+
```
|
87 |
+
docker run -it --name=AI_project ubuntu
|
88 |
+
```
|
89 |
+
The `-it` options launch the container in interactive mode with a terminal attached.
|
90 |
+
After this, a shell starts and we land in a Linux terminal inside the container.
|
91 |
+
In the prompt, `root` is the currently logged-in user, who has the highest privileges, and `249cf37645b4` is the container ID.
|
92 |
+
|
93 |
+
![](milestone1/docker_create_container.png)
|
94 |
+
|
95 |
+
## 5. Hello World!
|
96 |
+
|
97 |
+
Now we can work inside the container and install the Python and pip needed for the project.
|
98 |
+
First we update and upgrade the packages (`apt` is the Advanced Packaging Tool):
|
99 |
+
```
|
100 |
+
apt update && apt upgrade
|
101 |
+
```
|
102 |
+
Then we install Python and pip with:
|
103 |
+
```
|
104 |
+
apt install python3 pip
|
105 |
+
```
|
106 |
+
We can confirm a successful installation by checking the current versions of Python and pip.
|
107 |
+
Then create a script file named *hello_world.py* under the `root` directory and run it.
|
108 |
+
You will see the following in VSCode and Terminal.
|
109 |
+
|
110 |
+
![](milestone1/vscode.png)
|
111 |
+
![](milestone1/hello_world.png)
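
The contents of *hello_world.py* are only visible in the screenshots. A minimal sketch, assuming the script simply prints a greeting (the `greeting` helper and the extra version line are illustrative additions), might look like:

```python
import platform


def greeting() -> str:
    """Return the text the script prints."""
    return "Hello World!"


if __name__ == "__main__":
    print(greeting())
    # Also report the interpreter version to confirm the install inside the container.
    print("Python", platform.python_version())
```

Inside the container it can be run with `python3 hello_world.py`.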
|
112 |
+
|
113 |
+
## 6. Commit changes to a new image specifically for the project
|
114 |
+
|
115 |
+
After setting up the container, we can commit the changes to a project-specific image tagged *milestone1* with:
|
116 |
+
```
|
117 |
+
docker commit [CONTAINER] [NEW_IMAGE]:[TAG]
|
118 |
+
```
|
119 |
+
Now if we check the available images there should be a new image for the project. If we list all containers we should be able to identify the one we were working on through container ID.
|
120 |
+
|
121 |
+
![](milestone1/commit_to_new_image.png)
|
122 |
+
|
123 |
+
The Docker Desktop app should match the image list we see on Terminal.
|
124 |
+
|
125 |
+
![](milestone1/app_image_list.png)
|
126 |
+
|
127 |
+
# Milestone 2 - Sentiment Analysis App w/ Pretrained Model
|
128 |
+
|
129 |
+
This milestone includes creating a Streamlit app in HuggingFace for sentiment analysis.
|
130 |
+
|
131 |
+
## 1. Space setup
|
132 |
+
|
133 |
+
After creating a HuggingFace account, we can create our app as a space and choose Streamlit as the space SDK.
|
134 |
+
|
135 |
+
![](milestone2/new_HF_space.png)
|
136 |
+
|
137 |
+
Then we can go back to our GitHub repo and create the following files.
|
138 |
+
In order for the space to run properly, there must be at least three files in the root directory:
|
139 |
+
[README.md](README.md), [app.py](app.py), and [requirements.txt](requirements.txt)
|
140 |
+
|
141 |
+
Make sure the following metadata is at the top of **README.md** so that HuggingFace can identify the space.
|
142 |
+
```
|
143 |
+
---
|
144 |
+
title: Sentiment Analysis App
|
145 |
+
emoji: 🚀
|
146 |
+
colorFrom: green
|
147 |
+
colorTo: purple
|
148 |
+
sdk: streamlit
|
149 |
+
sdk_version: 1.17.0
|
150 |
+
app_file: app.py
|
151 |
+
pinned: false
|
152 |
+
---
|
153 |
+
```
|
154 |
+
|
155 |
+
The **app.py** file is the main code of the app, and **requirements.txt** should list all the libraries the code uses. HuggingFace installs the listed libraries before running the virtual environment.
|
156 |
+
|
157 |
+
|
158 |
+
## 2. Connect and sync to HuggingFace
|
159 |
+
|
160 |
+
Then we go to the settings of the GitHub repo and create a secret token to access the new HuggingFace space.
|
161 |
+
|
162 |
+
![](milestone2/HF_token.png)
|
163 |
+
![](milestone2/github_token.png)
|
164 |
+
|
165 |
+
Next, we need to set up a workflow in GitHub Actions. Click "set up a workflow yourself" and replace all the code in `main.yaml` with the following (replace `HF_USERNAME` and `SPACE_NAME` with your own):
|
166 |
+
|
167 |
+
```
|
168 |
+
name: Sync to Hugging Face hub
on:
  push:
    branches: [main]

  # to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
          lfs: true
      - name: Push to hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: git push --force https://HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main
|
188 |
+
```
|
189 |
+
The repo is now connected and synced with the HuggingFace space!
|
190 |
+
|
191 |
+
## 3. Create the app
|
192 |
+
|
193 |
+
Modify [app.py](app.py) so that it takes a text input and generates an analysis using one of the provided models. Details are explained in the code comments. The app should look like this:
|
194 |
+
|
195 |
+
![](milestone2/app_UI.png)
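
As a rough illustration of such an app, here is a minimal sketch, not the actual [app.py](app.py). It assumes the `text-classification` pipeline from `transformers`; the second entry in `MODEL_CHOICES` is an assumed example of a pre-trained model, and the `analyze` signature here is simplified so that it can be exercised without downloading a model.

```python
from typing import Callable, Dict, List

# First entry is the finetuned model linked in this README;
# the second is an assumed example of a pre-trained sentiment model.
MODEL_CHOICES = [
    "andyqin18/finetuned-bert-uncased",
    "distilbert-base-uncased-finetuned-sst-2-english",
]


def analyze(text: str, classifier: Callable[[str], List[Dict]]) -> Dict:
    """Run a classifier on one text and return its top result."""
    results = classifier(text)
    return results[0]


def main() -> None:
    # Imported lazily so that analyze() can be tested without these packages.
    import streamlit as st

    st.title("Toxic Tweet Detection and Sentiment Analysis App")
    model_name = st.selectbox("Model", MODEL_CHOICES)
    text = st.text_area("Text to analyze", "I love this project!")

    if st.button("Analyze"):
        from transformers import pipeline

        classifier = pipeline("text-classification", model=model_name)
        result = analyze(text, classifier)
        st.write(f"Label: {result['label']}, score: {result['score']:.3f}")
```

To try it, add a call to `main()` as the last line of the file and launch it with `streamlit run app.py`. Passing the classifier in as an argument keeps `analyze` easy to test with a stand-in function.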
|
196 |
+
|
197 |
+
# Milestone 3 - Finetuning Language Models
|
198 |
+
|
199 |
+
In this milestone, we finetune our own language model in HuggingFace for sentiment analysis.
|
200 |
|
201 |
Here's the setup block that includes all modules:
|
202 |
```
|
|
|
300 |
|
301 |
Modify [app.py](app.py) so that it takes a text input and generates an analysis using one of the provided models. Details are explained in the code comments. The app should look like this:
|
302 |
|
303 |
+
![](milestone3/appUI.png)
|
304 |
+
|
305 |
+
## Reference:
|
306 |
+
For connecting GitHub with HuggingFace, check this [video](https://www.youtube.com/watch?v=8hOzsFETm4I).
|
307 |
+
|
308 |
+
For creating the app, check this [video](https://www.youtube.com/watch?v=GSt00_-0ncQ).
|
309 |
+
|
310 |
+
The HuggingFace documentation is [here](https://huggingface.co/docs), and the Streamlit API reference is [here](https://docs.streamlit.io/library/api-reference).
|
app.py
CHANGED
@@ -18,7 +18,7 @@ def analyze(model_name: str, text: str, top_k=1) -> dict:
|
|
18 |
    return classifier(text)
|
19 |
|
20 |
# App title
|
21 |
-
st.title("Sentiment Analysis App
|
22 |
st.write("This app is to analyze the sentiments behind a text.")
|
23 |
st.write("You can choose to use my fine-tuned model or pre-trained models.")
|
24 |
|
|
|
18 |
    return classifier(text)
|
19 |
|
20 |
# App title
|
21 |
+
st.title("Toxic Tweet Detection and Sentiment Analysis App")
|
22 |
st.write("This app is to analyze the sentiments behind a text.")
|
23 |
st.write("You can choose to use my fine-tuned model or pre-trained models.")
|
24 |
|
test_model.py
CHANGED
@@ -6,8 +6,8 @@ from tqdm import tqdm
|
|
6 |
|
7 |
|
8 |
# Global var
|
9 |
-
TEST_SIZE =
|
10 |
-
FINE_TUNED_MODEL = "andyqin18/
|
11 |
|
12 |
|
13 |
# Define analyze function
|
@@ -77,8 +77,8 @@ for comment_idx in tqdm(range(TEST_SIZE), desc="Analyzing..."):
|
|
77 |
|
78 |
# Calculate performance
|
79 |
performance = {}
|
80 |
-
performance["label_accuracy"] = total_true/(len(labels) * TEST_SIZE)
|
81 |
-
performance["prediction_accuracy"] = total_success/TEST_SIZE
|
82 |
-
performance["precision"] = TP / (TP + FP)
|
83 |
-
performance["recall"] = TP / (TP + FN)
|
84 |
print(performance)
|
|
|
6 |
|
7 |
|
8 |
# Global var
|
9 |
+
TEST_SIZE = 2000
|
10 |
+
FINE_TUNED_MODEL = "andyqin18/finetuned-bert-uncased"
|
11 |
|
12 |
|
13 |
# Define analyze function
|
|
|
77 |
|
78 |
# Calculate performance
|
79 |
performance = {}
|
80 |
+
performance["label_accuracy"] = total_true/(len(labels) * TEST_SIZE)  # Fraction of individual labels predicted correctly
|
81 |
+
performance["prediction_accuracy"] = total_success/TEST_SIZE  # Fraction of samples with all 6 labels predicted correctly
|
82 |
+
performance["precision"] = TP / (TP + FP)  # Precision over all labels
|
83 |
+
performance["recall"] = TP / (TP + FN)  # Recall over all labels
|
84 |
print(performance)
|