---
title: Sentiment Analysis
emoji: 😻
colorFrom: red
colorTo: purple
sdk: streamlit
sdk_version: 1.17.0
app_file: app.py
pinned: false
license: unknown
---

# csuy-4613-Project

## Milestone 1

The operating system being used is Windows 10 Home. To run Docker on this operating system, the Windows Subsystem for Linux (WSL) must be used.

WSL was installed by running `wsl --install` in the Windows Command Prompt, which installed the Ubuntu distribution of Linux. To set the WSL version to WSL 2, the command `wsl --set-version Ubuntu 2` was used.

The Docker app was installed from the Docker website, and the WSL 2 option was enabled in the Docker settings. Running `wsl.exe -l -v` lists the installed distros and their WSL versions, which let us verify that the Ubuntu distro was running under version 2.

We set Ubuntu as the default distro with the command `wsl --set-default ubuntu`.
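The WSL setup steps above can be summarized as the following command sequence (run in an administrator Command Prompt or PowerShell on Windows; not runnable on other platforms):

```shell
wsl --install                  # installs WSL with the Ubuntu distro by default
wsl --set-version Ubuntu 2     # switch the Ubuntu distro to WSL 2
wsl --set-default ubuntu       # make Ubuntu the default distro
wsl.exe -l -v                  # verify: Ubuntu should show VERSION 2
```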

Using VS Code as the coding environment, we can enter WSL by running `wsl` and `code` in the terminal. From there a Linux command prompt can be seen, with `~` indicating it is ready to accept new commands. Running `docker run hello-world` verifies that Docker is working.
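The verification steps above look roughly like this from a VS Code terminal on Windows (requires WSL and Docker Desktop, so not runnable elsewhere):

```shell
wsl                      # drop into the Ubuntu shell
code .                   # reopen the current folder in a WSL-backed VS Code window
docker run hello-world   # prints a greeting if the Docker daemon is reachable
```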

(Screenshot: output of `docker run hello-world`)

## Milestone 2

Hugging Face URL: https://huggingface.co/spaces/dahongj/sentiment-analysis

Models Used:

- https://huggingface.co/siebert/sentiment-roberta-large-english?text=I+like+you.+I+love+you
- https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis
- https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
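As a rough sketch, any of the three models above can be queried through the `transformers` pipeline API. The helper names (`summarize`, `analyze`) are our own illustration, not part of the app:

```python
# The three sentiment models listed above, queried via the pipeline API.
MODELS = [
    "siebert/sentiment-roberta-large-english",
    "finiteautomata/bertweet-base-sentiment-analysis",
    "cardiffnlp/twitter-roberta-base-sentiment-latest",
]

def summarize(results):
    """Pick the top label from pipeline output like [{'label': ..., 'score': ...}]."""
    best = max(results, key=lambda r: r["score"])
    return f"{best['label']} ({best['score']:.2f})"

def analyze(text, model_name=MODELS[2]):
    # Imported here so summarize() stays usable without transformers installed.
    from transformers import pipeline
    classifier = pipeline("sentiment-analysis", model=model_name)
    return summarize(classifier(text))
```

Calling `analyze("I love you")` downloads the chosen model on first use and returns a string like `"positive (0.99)"` (label names vary per model).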

## Milestone 3

Finetuned Model URL: https://huggingface.co/dahongj/finetuned_toxictweets

Hugging Face URL: https://huggingface.co/spaces/dahongj/sentiment-analysis

The finetuning Python file was written on Google Colab, following Hugging Face's finetuning documentation. Initially the model `distilbert-base-uncased` was selected. The tweets and their labels are read into variables and wrapped in a `Dataset` class, and a tokenizer for DistilBERT was created. The multi-label version of the `distilbert-base-uncased` model was used because the dataset includes 6 forms of toxicity that we want to finetune for. Using the native PyTorch training method demonstrated in the Hugging Face documentation, the model was trained and evaluated. Both the finetuned model and its tokenizer were saved and uploaded to Hugging Face.
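The setup described above can be sketched as follows. This is a hedged reconstruction based on the Hugging Face docs, not the exact Colab code: the class name `ToxicDataset`, the label list, and `build_model_and_tokenizer` are illustrative assumptions.

```python
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The toxicity dataset carries 6 labels per tweet (assumed Jigsaw-style names).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

class ToxicDataset(Dataset):
    """Wraps tokenizer encodings and per-tweet multi-hot label vectors."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # Multi-label targets must be floats for BCEWithLogitsLoss.
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

    def __len__(self):
        return len(self.labels)

def build_model_and_tokenizer():
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=len(LABELS),
        # Multi-label head: applies BCEWithLogitsLoss instead of cross-entropy.
        problem_type="multi_label_classification",
    )
    return model, tokenizer
```

From here the native-PyTorch loop in the Hugging Face tutorial (DataLoader, AdamW, `model(**batch)`, `loss.backward()`) applies unchanged, and `save_pretrained` / `push_to_hub` uploads the result.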