{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SCC0633/SCC5908 - Natural Language Processing\n", "> **Instructor:** Thiago Alexandre Salgueiro Pardo \\\\\n", "> **PAE teaching intern:** Germano Antonio Zani Jorge\n", "\n", "\n", "# Group Members: GPTrouxas\n", "> André Guarnier De Mitri - 11395579 \\\\\n", "> Daniel Carvalho - 10685702 \\\\\n", "> Fernando - 11795342 \\\\\n", "> Lucas Henrique Sant'Anna - 10748521 \\\\\n", "> Magaly L Fujimoto - 4890582 \\\\\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Neural Approach Using BERT\n", "![alt text](../imagens/BERT_TDIDF.png)" ] }, { "cell_type": "markdown", "metadata": { "id": "6yecpJR0feeQ" }, "source": [ "## Importing libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "FAIvyZwodEtm" }, "outputs": [], "source": [ "import torch\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import math\n", "from tqdm.notebook import tqdm\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#!pip install transformers seaborn nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "LYgXl3RIfgfo", "outputId": "eb496faf-7826-44f7-fa88-3b21fb6e7cbf" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review sentiment\n", "0 One of the other reviewers has mentioned that ... positive\n", "1 A wonderful little production. <br /><br />The... positive\n", "2 I thought this was a wonderful way to spend ti... positive\n", "3 Basically there's a family where a little boy ... negative\n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_reviews = pd.read_csv('../data/imdb_reviews.csv')\n", "df_reviews.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mapping the classes\n", "- Positive sentiment receives label 1\n", "- Negative sentiment receives label 0" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "D-5n8XzJbWOO", "outputId": "cef630cc-b0cc-4598-c53f-d32636bfcd86" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review sentiment\n", "0 One of the other reviewers has mentioned that ... 1\n", "1 A wonderful little production. <br /><br />The... 1\n", "2 I thought this was a wonderful way to spend ti... 1\n", "3 Basically there's a family where a little boy ... 0\n", "4 Petter Mattei's \"Love in the Time of Money\" is... 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def map_sentiments(sentiment):\n", "    if sentiment == 'positive':\n", "        return 1\n", "    return 0\n", "\n", "df_reviews['sentiment'] = df_reviews['sentiment'].apply(map_sentiments)\n", "df_reviews.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text cleaning functions\n", "**lowercase_text(text)**: Converts the text to lowercase so that casing is uniform.\n", "\n", "**remove_html(text)**: Removes any HTML tags, cleaning up text that came from HTML sources.\n", "\n", "**remove_url(text)**: Removes URLs, since links are unlikely to be relevant to the text analysis.\n", "\n", "**remove_punctuations(text)**: Replaces punctuation with spaces to simplify the text, keeping only words.\n", "\n", "**remove_emojis(text)**: Removes emojis to avoid non-verbal characters that could interfere with the analysis.\n", "\n", "**remove_stop_words(text)**: Removes stop words (common words such as \"and\", \"of\", \"the\") that usually add little value to the analysis.\n", "\n", "**stem_words(text)**: Applies stemming, reducing each word to its stem (for example, \"running\" becomes \"run\") to normalize word variations.\n", "\n", "**preprocess_text(text)**: Applies all of the functions above in sequence to fully preprocess the text, making it better suited to text analysis or modeling.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 241 }, "id": "PnFHO62rnWn-", "outputId": "17fb6619-fab9-4395-de5d-4c5199e7e45e" }, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] C:\\Users\\andre\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " review sentiment\n", "0 one review mention watch 1 oz episod hook righ... 1\n", "1 wonder littl product film techniqu unassum old... 1\n", "2 thought wonder way spend time hot summer weeke... 1\n", "3 basic famili littl boy jake think zombi closet... 0\n", "4 petter mattei love time money visual stun film... 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "import nltk\n", "from nltk.corpus import stopwords\n", "from nltk.stem import PorterStemmer\n", "\n", "\n", "def lowercase_text(text):\n", "    return text.lower()\n", "\n", "def remove_html(text):\n", "    return re.sub(r'<[^<]+?>', '', text)\n", "\n", "def remove_url(text):\n", "    return re.sub(r'http[s]?://\\S+|www\\.\\S+', '', text)\n", "\n", "def remove_punctuations(text):\n", "    # Replace every punctuation character with a space\n", "    tokens_list = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n", "    for char in tokens_list:\n", "        text = text.replace(char, ' ')\n", "\n", "    return text\n", "\n", "def remove_emojis(text):\n", "    emojis = re.compile(\"[\"\n", "        u\"\\U0001F600-\\U0001F64F\"\n", "        u\"\\U0001F300-\\U0001F5FF\"\n", "        u\"\\U0001F680-\\U0001F6FF\"\n", "        u\"\\U0001F1E0-\\U0001F1FF\"\n", "        u\"\\U00002500-\\U00002BEF\"\n", "        u\"\\U00002702-\\U000027B0\"\n", "        u\"\\U000024C2-\\U0001F251\"\n", "        u\"\\U0001f926-\\U0001f937\"\n", "        u\"\\U00010000-\\U0010ffff\"\n", "        u\"\\u2640-\\u2642\"\n", "        u\"\\u2600-\\u2B55\"\n", "        u\"\\u200d\"\n", "        u\"\\u23cf\"\n", "        u\"\\u23e9\"\n", "        u\"\\u231a\"\n", "        u\"\\ufe0f\"\n", "        u\"\\u3030\"\n", "        \"]+\", re.UNICODE)\n", "\n", "    text = re.sub(emojis, '', text)\n", "    return text\n", "\n", "def remove_stop_words(text):\n", "    # A set makes the membership test fast\n", "    stop_words = set(stopwords.words('english'))\n", "    return ' '.join(word for word in text.split() if word not in stop_words)\n", "\n", "def stem_words(text):\n", "    stemmer = PorterStemmer()\n", "    new_text = 
''\n", "    for word in text.split():\n", "        new_text += f'{stemmer.stem(word)} '\n", "\n", "    return new_text.strip()\n", "\n", "def preprocess_text(text):\n", "    text = lowercase_text(text)\n", "    text = remove_html(text)\n", "    text = remove_url(text)\n", "    text = remove_punctuations(text)\n", "    text = remove_emojis(text)\n", "    text = remove_stop_words(text)\n", "    text = stem_words(text)\n", "\n", "    return text\n", "\n", "nltk.download('stopwords')\n", "df_reviews['review'] = df_reviews['review'].apply(preprocess_text)\n", "df_reviews.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing the class balance" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 452 }, "id": "Gdi_L0HWfntv", "outputId": "bce77594-f662-4b3f-c8eb-27d8a188b4f2" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.title('Target value distribution')\n", "plt.hist(df_reviews['sentiment'])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# BERT Model" ] }, { "cell_type": "markdown", "metadata": { "id": "EDkjlPDakskM" }, "source": [ "## Installing libraries" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lk7m_1xvmWvz", "outputId": "ce842053-b261-4768-d9d7-fe9c65c9f6aa" }, "outputs": [], "source": [ "#pip install transformers\n", "#pip install accelerate -U\n", "#pip install transformers[torch]\n", "#pip install datasets evaluate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the pre-trained model and tokenizer" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GlyrkK52zMcc", "outputId": "a938653b-92c3-4b4e-802c-eacc3f1b6ecf" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] } ], "source": [ "from transformers import AutoTokenizer\n", "from transformers import BertForSequenceClassification\n", "\n", "pre_trained_base = \"bert-base-uncased\"\n", "tokenizer = AutoTokenizer.from_pretrained(pre_trained_base)\n", "model = BertForSequenceClassification.from_pretrained(pre_trained_base, num_labels = 2, output_attentions=False, output_hidden_states=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenizing the sentences and computing token lengths" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "LKEjDZCHpk4e" }, "outputs": [], "source": [ 
"token_lens = []\n", "\n", "for sentence in df_reviews['review']:\n", "    tokens = tokenizer.encode(sentence, max_length=200, truncation=True)\n", "    token_lens.append(len(tokens))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting the data into training and validation sets" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "H7PfXaVVp2uQ" }, "outputs": [], "source": [ "SEED=42\n", "MAX_LEN = 200\n", "from sklearn.model_selection import train_test_split\n", "df_train, df_val = train_test_split(df_reviews, test_size=0.2, random_state=SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Processing the data\n", "The `process_data` function receives a dataframe row containing a review text and its sentiment label. It first extracts the review text and normalizes its whitespace, collapsing any extra spaces. It then tokenizes the text with the BERT tokenizer, applying padding and truncation so that every sequence has the fixed length set by `MAX_LEN`. Finally, it adds the sentiment label and the cleaned text to the generated encodings and returns them, so the result holds the token ids, the sentiment label and the cleaned text." 
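, "\n", "\n", "
The padding and truncation step described above can be illustrated without downloading BERT. The sketch below is a simplified stand-in, not the Hugging Face tokenizer: `toy_encode`, its whitespace tokenization, the tiny vocabulary and the ids 0 to 3 used for the padding, unknown, start and end markers are all assumptions made for illustration. It only shows how every review ends up with exactly `max_len` ids plus an attention mask that marks the real tokens.

```python
# Hypothetical special ids for this toy example (real BERT uses its own vocabulary ids).
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 1, 2, 3


def toy_encode(text, vocab, max_len):
    # Simplified stand-in for tokenizer(text, padding='max_length', truncation=True, max_length=max_len).
    # Whitespace tokenization; words missing from the vocabulary map to UNK_ID.
    ids = [vocab.get(word, UNK_ID) for word in text.split()]
    # Truncation: leave room for the start and end markers.
    ids = ids[:max_len - 2]
    ids = [CLS_ID] + ids + [SEP_ID]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(ids)
    # Padding: fill with PAD_ID up to the fixed length.
    pad = max_len - len(ids)
    return {
        'input_ids': ids + [PAD_ID] * pad,
        'attention_mask': attention_mask + [0] * pad,
    }


# Toy vocabulary, assumed for illustration only.
vocab = {'wonder': 4, 'littl': 5, 'product': 6, 'film': 7}
short_enc = toy_encode('wonder littl product', vocab, max_len=8)
long_enc = toy_encode('wonder littl product film wonder littl product film', vocab, max_len=8)
print(short_enc)
print(long_enc)
```

The short review is padded with zeros and the long one is cut, so both encodings have length 8. This fixed length is what allows the reviews to be stacked into batches of equal shape.
"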
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "v7EZ6wd-qDfd" }, "outputs": [], "source": [ "def process_data(row):\n", "\n", "    text = row['review']\n", "    text = str(text)\n", "    text = ' '.join(text.split())\n", "\n", "    encodings = tokenizer(text, padding=\"max_length\", truncation=True, max_length=MAX_LEN)\n", "\n", "    encodings['label'] = row['sentiment']\n", "    encodings['text'] = text\n", "\n", "    return encodings" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "d9VgrXNSqIYL" }, "outputs": [], "source": [ "# Training\n", "processed_data_tr = []\n", "for i in range(df_train.shape[0]):\n", "    processed_data_tr.append(process_data(df_train.iloc[i]))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "p0NLQxoKqJ_k" }, "outputs": [], "source": [ "# Validation\n", "processed_data_val = []\n", "for i in range(df_val.shape[0]):\n", "    processed_data_val.append(process_data(df_val.iloc[i]))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "ac76Rb6fqP_G" }, "outputs": [], "source": [ "# Training and validation dataframes\n", "df_train = pd.DataFrame(processed_data_tr)\n", "df_val = pd.DataFrame(processed_data_val)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "RdbHaVy_fd64", "outputId": "a9aed834-81b7-4223-da42-6289799c2e1e" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " attention_mask \\\n", "0 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "2 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "3 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "4 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "\n", " input_ids label \\\n", "0 [101, 2921, 3198, 23624, 2954, 6978, 2674, 841... 0 \n", "1 [101, 3422, 4372, 3775, 2099, 9587, 5737, 2071... 0 \n", "2 [101, 3543, 2293, 2358, 10050, 2128, 25300, 11... 1 \n", "3 [101, 3732, 2154, 11865, 15472, 2072, 8040, 73... 0 \n", "4 [101, 2034, 3813, 3669, 19337, 2666, 2615, 504... 0 \n", "\n", " text \\\n", "0 kept ask mani fight scream match swear gener m... \n", "1 watch entir movi could watch entir movi stop d... \n", "2 touch love stori reminisc ‘in mood love draw h... \n", "3 latter day fulci schlocker total abysm concoct... \n", "4 first firmli believ norwegian movi continu get... \n", "\n", " token_type_ids \n", "0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "0lTWT8JwkRic" }, "source": [ "## Fine-Tuning the Model\n", "Fine-tuning BERT for the specific task of sentiment classification on the IMDb dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import pyarrow as pa\n", "from datasets import Dataset\n", "import evaluate\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kW53p7VQqUDD", "outputId": "8231f3ba-37d5-4546-c4d0-6b4ff317ecf3" }, "outputs": [ { "data": { "text/plain": [ "device(type='cuda', index=0)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", "device" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "id": "68OdbTv5rLrm" }, "outputs": [], "source": [ "train_hg = Dataset(pa.Table.from_pandas(df_train))\n", "valid_hg = Dataset(pa.Table.from_pandas(df_val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation metrics: F1 score and accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`compute_metrics` computes both accuracy and F1 score to evaluate the classification model. First, the accuracy and F1 metrics are loaded with `evaluate.load`. The function then receives `eval_pred`, a pair of arrays holding the model outputs and the true labels, and takes the argmax of the outputs along axis 1 to obtain the predicted class of each example. Accuracy is computed by comparing these predictions with the labels, and the F1 score is computed with `average=\"weighted\"`, so that each class's F1 is weighted by its number of true examples. Both results are combined into a single dictionary, which is returned." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "id": "lUNhDPs0ry4m" }, "outputs": [], "source": [ "# Load both accuracy and f1 metrics\n", "accuracy_metric = evaluate.load(\"accuracy\")\n", "f1_metric = evaluate.load(\"f1\")\n", "\n", "# Metric helper method\n", "def compute_metrics(eval_pred):\n", "    predictions, labels = eval_pred\n", "    predictions = np.argmax(predictions, axis=1)\n", "\n", "    # Compute accuracy\n", "    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)\n", "\n", "    # Compute F1 score\n", "    f1 = f1_metric.compute(predictions=predictions, references=labels, average=\"weighted\")\n", "\n", "    # Combine the metrics into a single dictionary\n", "    combined_metrics = {\n", "        'accuracy': accuracy['accuracy'],\n", "        'f1': f1['f1']\n", "    }\n", "\n", "    return combined_metrics" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9jJYTWsHjnEc", "outputId": "fe45691a-4476-4978-89b8-15f36465c37c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: accelerateNote: you may need to restart the kernel to use updated packages.\n", "\n", "Version: 0.31.0\n", "Summary: Accelerate\n", "Home-page: https://github.com/huggingface/accelerate\n", "Author: The HuggingFace team\n", "Author-email: zach.mueller@huggingface.co\n", "License: Apache\n", "Location: c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\n", "Requires: huggingface-hub, numpy, packaging, psutil, 
pyyaml, safetensors, torch\n", "Required-by: \n", "---\n", "Name: transformers\n", "Version: 4.41.2\n", "Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow\n", "Home-page: https://github.com/huggingface/transformers\n", "Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)\n", "Author-email: transformers@huggingface.co\n", "License: Apache 2.0 License\n", "Location: c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\n", "Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm\n", "Required-by: \n" ] } ], "source": [ "pip show accelerate transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the model" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QlaLCwf7rLtp", "outputId": "7e10e82a-8bc7-478b-851e-c7b628b46c41" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\transformers\\training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. 
Use `eval_strategy` instead\n", " warnings.warn(\n" ] } ], "source": [ "from transformers import TrainingArguments, Trainer\n", "\n", "EPOCHS = 1\n", "\n", "# `eval_strategy` replaces the deprecated `evaluation_strategy` (see the warning from transformers 4.41.2)\n", "training_args = TrainingArguments(output_dir=\"./result\",\n", "                                  eval_strategy=\"epoch\",\n", "                                  num_train_epochs=EPOCHS,\n", "                                  per_device_train_batch_size=16,\n", "                                  per_device_eval_batch_size=8\n", "                                  )\n", "\n", "trainer = Trainer(\n", "    model=model,\n", "    args=training_args,\n", "    train_dataset=train_hg,\n", "    eval_dataset=valid_hg,\n", "    tokenizer=tokenizer,\n", "    compute_metrics=compute_metrics\n", ")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CUDA available: True\n", "CUDA version: 12.1\n" ] } ], "source": [ "print(\"CUDA available: \", torch.cuda.is_available())\n", "print(\"CUDA version: \", torch.version.cuda)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 141 }, "id": "3s6lVFz_rLwO", "outputId": "ee64e8e9-9c8c-42a8-c355-f51410cc33df" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0%| | 0/2500 [00:00\n", "
\n", "" ], "text/plain": [ " review sentiment bert_results\n", "651 The film is content as it is to run clever one... negative Positive\n", "2205 [Has] a surprising and somewhat disapp... negative Positive\n", "362 Absurdly over-rated... negative Negative\n", "2784 A rare bird, not because of what it's like but... negative Positive\n", "1914 Lord of Illusions is also quite repulsive, as ... negative Positive\n", "... ... ... ...\n", "2230 The movie is completely innocuous, passably en... negative Positive\n", "2354 A mud-simple horror trudge set in a swamp colo... negative Negative\n", "2404 Just plain generic. negative Negative\n", "720 Ulmer brings an enormous amount of impressioni... positive Negative\n", "527 In their directorial debut, Britt Poulton and ... negative Negative\n", "\n", "[3000 rows x 3 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from preprocess_data import preprocess_text, get_stopwords\n", "from transformers import AutoTokenizer, pipeline\n", "\n", "df = pd.read_csv('../data/rotten_tomatos.csv')\n", "\n", "MODEL_PATH = 'danielcd99/BERT_imdb'\n", "\n", "def load_pipeline():\n", "    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)\n", "    tokenizer.model_max_length = 200\n", "\n", "    # Pass the configured tokenizer so that the 200-token limit is actually used\n", "    pipe = pipeline(\n", "        \"text-classification\",\n", "        model=MODEL_PATH,\n", "        tokenizer=tokenizer\n", "    )\n", "    return pipe\n", "\n", "pipe = load_pipeline()\n", "get_stopwords()\n", "df['preprocessed_review'] = df['review'].apply(preprocess_text)\n", "\n", "predictions = []\n", "for review in df['preprocessed_review']:\n", "    try:\n", "        label = pipe(review, truncation=True)[0]['label']\n", "    except Exception:\n", "        # Re-raise so that a failed call cannot silently reuse the previous label\n", "        print(\"A loading error occurred, please try again!\")\n", "        raise\n", "\n", "    # LABEL_0 is the negative class and LABEL_1 the positive class\n", "    if label == 'LABEL_0':\n", "        predictions.append('Negative')\n", "    else:\n", "        predictions.append('Positive')\n", "\n", "df['bert_results'] = predictions\n", "\n", "cols = ['review','sentiment', 
'bert_results']\n", "df = df[cols]\n", "df" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.8066\n", "Recall: 0.8449\n", "F1 Score: 0.8253\n" ] }, { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import confusion_matrix, precision_recall_fscore_support\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Map the labels to 1 (positive) and 0 (negative)\n", "df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})\n", "# replace() converts 'Positive'/'Negative' strings and leaves 0/1 integers unchanged\n", "df['bert_results'] = df['bert_results'].replace({'Positive': 1, 'Negative': 0}).astype(int)\n", "\n", "# Compute the evaluation metrics: precision, recall, f1-score\n", "precision, recall, f1_score, _ = precision_recall_fscore_support(df['sentiment'], df['bert_results'], average='binary')\n", "\n", "print(f\"Precision: {precision:.4f}\")\n", "print(f\"Recall: {recall:.4f}\")\n", "print(f\"F1 Score: {f1_score:.4f}\")\n", "\n", "# Compute and plot the confusion matrix\n", "cm = confusion_matrix(df['sentiment'], df['bert_results'])\n", "plt.figure(figsize=(8, 6))\n", "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])\n", "plt.xlabel('Predicted')\n", "plt.ylabel('True')\n", "plt.title('Confusion Matrix')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After fine-tuning the BERT model on the IMDb dataset, we evaluated it on the Rotten Tomatoes reviews and obtained the following performance metrics:\n", "\n", "Precision: 0.8066 --- Recall: 0.8449 --- F1 Score: 0.8253\n", "\n", "These metrics indicate that the model fine-tuned on IMDb transfers reasonably well to a different review source: it classifies the sentiment of the Rotten Tomatoes texts with an F1 score above 0.82, even though it was fine-tuned only on IMDb reviews." 
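, "\n", "\n", "
For reference, precision, recall and F1 all derive from the counts in the confusion matrix. The sketch below recomputes them by hand for the positive class, which is what `average='binary'` reports. The label lists are made-up toy values, not the Rotten Tomatoes predictions, and `binary_prf` is a hypothetical helper name.

```python
def binary_prf(y_true, y_pred, positive=1):
    # Count true positives, false positives and false negatives for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = binary_prf(y_true, y_pred)
print(f'Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}')
```

In this toy case there are 3 true positives, 1 false positive and 1 false negative, so all three metrics equal 0.75.
"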
] } ], "metadata": { "accelerator": "GPU", "colab": { "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 0 }