{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SCC0633/SCC5908 - Natural Language Processing (Processamento de Linguagem Natural)\n", "> **Instructor:** Thiago Alexandre Salgueiro Pardo \\\\\n", "> **PAE Teaching Intern:** Germano Antonio Zani Jorge\n", "\n", "\n", "# Group Members: GPTrouxas\n", "> André Guarnier De Mitri - 11395579 \\\\\n", "> Daniel Carvalho - 10685702 \\\\\n", "> Fernando - 11795342 \\\\\n", "> Lucas Henrique Sant'Anna - 10748521 \\\\\n", "> Magaly L Fujimoto - 4890582 \\\\\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Neural Approach Using BERT\n", "![alt text](../imagens/BERT_TDIDF.png)" ] },
{ "cell_type": "markdown", "metadata": { "id": "6yecpJR0feeQ" }, "source": [ "## Importing libraries" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "id": "FAIvyZwodEtm" }, "outputs": [], "source": [ "import torch\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import math\n", "from tqdm.notebook import tqdm\n", "import pandas as pd" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#!pip install transformers seaborn nltk" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "LYgXl3RIfgfo", "outputId": "eb496faf-7826-44f7-fa88-3b21fb6e7cbf" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>review</th><th>sentiment</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>One of the other reviewers has mentioned that ...</td><td>positive</td></tr>\n",
" <tr><th>1</th><td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td><td>positive</td></tr>\n",
" <tr><th>2</th><td>I thought this was a wonderful way to spend ti...</td><td>positive</td></tr>\n",
" <tr><th>3</th><td>Basically there's a family where a little boy ...</td><td>negative</td></tr>\n",
" <tr><th>4</th><td>Petter Mattei's \"Love in the Time of Money\" is...</td><td>positive</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " review sentiment\n", "0 One of the other reviewers has mentioned that ... positive\n", "1 A wonderful little production. <br /><br />The... positive\n", "2 I thought this was a wonderful way to spend ti... positive\n", "3 Basically there's a family where a little boy ... negative\n", "4 Petter Mattei's \"Love in the Time of Money\" is... positive" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_reviews = pd.read_csv('imdb_reviews.csv')\n", "df_reviews.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Mapping the classes\n", "- Positive sentiment is assigned label 1\n", "- Negative sentiment is assigned label 0" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "D-5n8XzJbWOO", "outputId": "cef630cc-b0cc-4598-c53f-d32636bfcd86" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>review</th><th>sentiment</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>One of the other reviewers has mentioned that ...</td><td>1</td></tr>\n",
" <tr><th>1</th><td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td><td>1</td></tr>\n",
" <tr><th>2</th><td>I thought this was a wonderful way to spend ti...</td><td>1</td></tr>\n",
" <tr><th>3</th><td>Basically there's a family where a little boy ...</td><td>0</td></tr>\n",
" <tr><th>4</th><td>Petter Mattei's \"Love in the Time of Money\" is...</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " review sentiment\n", "0 One of the other reviewers has mentioned that ... 1\n", "1 A wonderful little production. <br /><br />The... 1\n", "2 I thought this was a wonderful way to spend ti... 1\n", "3 Basically there's a family where a little boy ... 0\n", "4 Petter Mattei's \"Love in the Time of Money\" is... 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [
"def map_sentiments(sentiment):\n",
"    # 'positive' -> 1, anything else ('negative') -> 0\n",
"    if sentiment == 'positive':\n",
"        return 1\n",
"    return 0\n",
"\n",
"df_reviews['sentiment'] = df_reviews['sentiment'].apply(map_sentiments)\n",
"df_reviews.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"# Text-cleaning functions\n",
"**lowercase_text(text)** Converts the text to lowercase so that casing is uniform.\n",
"\n",
"**remove_html(text)** Removes HTML tags, cleaning up data that came from HTML sources.\n",
"\n",
"**remove_url(text)** Removes URLs, since links are unlikely to be relevant to the text analysis.\n",
"\n",
"**remove_punctuations(text)** Replaces punctuation characters with spaces, leaving only words.\n",
"\n",
"**remove_emojis(text)** Removes emojis, which are non-verbal characters that can interfere with the text analysis.\n",
"\n",
"**remove_stop_words(text)** Removes English stop words (common words such as \"the\", \"of\", \"and\") that usually add little value to the analysis.\n",
"\n",
"**stem_words(text)** Applies stemming, reducing each word to its stem (for example, \"running\" becomes \"run\") to normalize word variants.\n",
"\n",
"**preprocess_text(text)** Applies all of the functions above in sequence, producing text that is better suited to analysis or modeling.\n" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 241 }, "id": "PnFHO62rnWn-", "outputId": "17fb6619-fab9-4395-de5d-4c5199e7e45e" }, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] C:\\Users\\andre\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>review</th><th>sentiment</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>one review mention watch 1 oz episod hook righ...</td><td>1</td></tr>\n",
" <tr><th>1</th><td>wonder littl product film techniqu unassum old...</td><td>1</td></tr>\n",
" <tr><th>2</th><td>thought wonder way spend time hot summer weeke...</td><td>1</td></tr>\n",
" <tr><th>3</th><td>basic famili littl boy jake think zombi closet...</td><td>0</td></tr>\n",
" <tr><th>4</th><td>petter mattei love time money visual stun film...</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " review sentiment\n", "0 one review mention watch 1 oz episod hook righ... 1\n", "1 wonder littl product film techniqu unassum old... 1\n", "2 thought wonder way spend time hot summer weeke... 1\n", "3 basic famili littl boy jake think zombi closet... 0\n", "4 petter mattei love time money visual stun film... 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [
"import re\n",
"import string\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem import PorterStemmer\n",
"\n",
"# Download the stop-word list before it is first read\n",
"nltk.download('stopwords')\n",
"\n",
"# Built once so that each review does not rebuild them\n",
"STOP_WORDS = set(stopwords.words('english'))\n",
"STEMMER = PorterStemmer()\n",
"# Maps every ASCII punctuation character to a space\n",
"PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))\n",
"\n",
"EMOJI_PATTERN = re.compile(\"[\"\n",
"    u\"\\U0001F600-\\U0001F64F\"\n",
"    u\"\\U0001F300-\\U0001F5FF\"\n",
"    u\"\\U0001F680-\\U0001F6FF\"\n",
"    u\"\\U0001F1E0-\\U0001F1FF\"\n",
"    u\"\\U00002500-\\U00002BEF\"\n",
"    u\"\\U00002702-\\U000027B0\"\n",
"    u\"\\U000024C2-\\U0001F251\"\n",
"    u\"\\U0001f926-\\U0001f937\"\n",
"    u\"\\U00010000-\\U0010ffff\"\n",
"    u\"\\u2640-\\u2642\"\n",
"    u\"\\u2600-\\u2B55\"\n",
"    u\"\\u200d\"\n",
"    u\"\\u23cf\"\n",
"    u\"\\u23e9\"\n",
"    u\"\\u231a\"\n",
"    u\"\\ufe0f\"\n",
"    u\"\\u3030\"\n",
"    \"]+\", re.UNICODE)\n",
"\n",
"def lowercase_text(text):\n",
"    return text.lower()\n",
"\n",
"def remove_html(text):\n",
"    return re.sub(r'<[^<]+?>', '', text)\n",
"\n",
"def remove_url(text):\n",
"    return re.sub(r'http[s]?://\\S+|www\\.\\S+', '', text)\n",
"\n",
"def remove_punctuations(text):\n",
"    return text.translate(PUNCT_TABLE)\n",
"\n",
"def remove_emojis(text):\n",
"    return EMOJI_PATTERN.sub('', text)\n",
"\n",
"def remove_stop_words(text):\n",
"    return ' '.join(word for word in text.split() if word not in STOP_WORDS)\n",
"\n",
"def stem_words(text):\n",
"    return ' '.join(STEMMER.stem(word) for word in text.split())\n",
"\n",
"def preprocess_text(text):\n",
"    text = lowercase_text(text)\n",
"    text = remove_html(text)\n",
"    text = remove_url(text)\n",
"    text = remove_punctuations(text)\n",
"    text = remove_emojis(text)\n",
"    text = remove_stop_words(text)\n",
"    text = stem_words(text)\n",
"\n",
"    return text\n",
"\n",
"df_reviews['review'] = df_reviews['review'].apply(preprocess_text)\n",
"df_reviews.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing the class balance" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 452 }, "id": "Gdi_L0HWfntv", "outputId": "bce77594-f662-4b3f-c8eb-27d8a188b4f2" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.title('Target value distribution')\n", "plt.hist(df_reviews['sentiment'])\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# BERT Model" ] },
{ "cell_type": "markdown", "metadata": { "id": "EDkjlPDakskM" }, "source": [ "## Installing Libraries" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lk7m_1xvmWvz", "outputId": "ce842053-b261-4768-d9d7-fe9c65c9f6aa" }, "outputs": [], "source": [ "#pip install transformers\n", "#pip install accelerate -U\n", "#pip install transformers[torch]\n", "#pip install datasets evaluate" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the pre-trained model and tokenizer" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GlyrkK52zMcc", "outputId": "a938653b-92c3-4b4e-802c-eacc3f1b6ecf" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [
"from transformers import AutoTokenizer\n",
"from transformers import BertForSequenceClassification\n",
"\n",
"pre_trained_base = \"bert-base-uncased\"\n",
"tokenizer = AutoTokenizer.from_pretrained(pre_trained_base)\n",
"model = BertForSequenceClassification.from_pretrained(\n",
"    pre_trained_base,\n",
"    num_labels=2,\n",
"    output_attentions=False,\n",
"    output_hidden_states=False,\n",
")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenizing the Sentences and Computing Token Lengths" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "id": "LKEjDZCHpk4e" }, "outputs": [], "source": [
"token_lens = []\n",
"\n",
"# Token count per review, capped at 200 tokens by truncation\n",
"for sentence in df_reviews['review']:\n",
"    tokens = tokenizer.encode(sentence, max_length=200, truncation=True)\n",
"    token_lens.append(len(tokens))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting the Data into Training and Validation Sets" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "id": "H7PfXaVVp2uQ" }, "outputs": [], "source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"SEED = 42\n",
"MAX_LEN = 200\n",
"df_train, df_val = train_test_split(df_reviews, test_size=0.2, random_state=SEED)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Processing the data\n", "The `process_data` function receives a dataframe row containing a review text and its sentiment label. It first extracts the review text and collapses any extra whitespace. It then tokenizes the text with the BERT tokenizer, applying padding and truncation so that every sequence has the fixed length set by `MAX_LEN`. Finally, it adds the sentiment label and the cleaned text to the resulting encodings and returns them together: the token encodings, the sentiment label, and the cleaned text." ] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "id": "v7EZ6wd-qDfd" }, "outputs": [], "source": [
"def process_data(row):\n",
"    # Normalize whitespace in the review text\n",
"    text = row['review']\n",
"    text = str(text)\n",
"    text = ' '.join(text.split())\n",
"\n",
"    # Pad or truncate every sequence to exactly MAX_LEN tokens\n",
"    encodings = tokenizer(text, padding=\"max_length\", truncation=True, max_length=MAX_LEN)\n",
"\n",
"    encodings['label'] = row['sentiment']\n",
"    encodings['text'] = text\n",
"\n",
"    return encodings" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": { "id": "d9VgrXNSqIYL" }, "outputs": [], "source": [
"# Training set\n",
"processed_data_tr = []\n",
"for i in range(df_train.shape[0]):\n",
"    processed_data_tr.append(process_data(df_train.iloc[i]))" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": { "id": "p0NLQxoKqJ_k" }, "outputs": [], "source": [
"# Validation set\n",
"processed_data_val = []\n",
"for i in range(df_val.shape[0]):\n",
"    processed_data_val.append(process_data(df_val.iloc[i]))" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "id": "ac76Rb6fqP_G" }, "outputs": [], "source": [
"# Training and validation dataframes\n",
"df_train = pd.DataFrame(processed_data_tr)\n",
"df_val = pd.DataFrame(processed_data_val)" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "RdbHaVy_fd64", "outputId": "a9aed834-81b7-4223-da42-6289799c2e1e" }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>attention_mask</th><th>input_ids</th><th>label</th><th>text</th><th>token_type_ids</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td><td>[101, 2921, 3198, 23624, 2954, 6978, 2674, 841...</td><td>0</td><td>kept ask mani fight scream match swear gener m...</td><td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td></tr>\n",
" <tr><th>1</th><td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td><td>[101, 3422, 4372, 3775, 2099, 9587, 5737, 2071...</td><td>0</td><td>watch entir movi could watch entir movi stop d...</td><td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td></tr>\n",
" <tr><th>2</th><td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td><td>[101, 3543, 2293, 2358, 10050, 2128, 25300, 11...</td><td>1</td><td>touch love stori reminisc ‘in mood love draw h...</td><td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td></tr>\n",
" <tr><th>3</th><td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td><td>[101, 3732, 2154, 11865, 15472, 2072, 8040, 73...</td><td>0</td><td>latter day fulci schlocker total abysm concoct...</td><td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td></tr>\n",
" <tr><th>4</th><td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td><td>[101, 2034, 3813, 3669, 19337, 2666, 2615, 504...</td><td>0</td><td>first firmli believ norwegian movi continu get...</td><td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " attention_mask \\\n", "0 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "2 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "3 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "4 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n", "\n", " input_ids label \\\n", "0 [101, 2921, 3198, 23624, 2954, 6978, 2674, 841... 0 \n", "1 [101, 3422, 4372, 3775, 2099, 9587, 5737, 2071... 0 \n", "2 [101, 3543, 2293, 2358, 10050, 2128, 25300, 11... 1 \n", "3 [101, 3732, 2154, 11865, 15472, 2072, 8040, 73... 0 \n", "4 [101, 2034, 3813, 3669, 19337, 2666, 2615, 504... 0 \n", "\n", " text \\\n", "0 kept ask mani fight scream match swear gener m... \n", "1 watch entir movi could watch entir movi stop d... \n", "2 touch love stori reminisc ‘in mood love draw h... \n", "3 latter day fulci schlocker total abysm concoct... \n", "4 first firmli believ norwegian movi continu get... \n", "\n", " token_type_ids \n", "0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n", "4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head()" ] },
{ "cell_type": "markdown", "metadata": { "id": "0lTWT8JwkRic" }, "source": [ "## Fine-Tuning the Model\n", "Fine-tuning BERT for the specific task of sentiment classification on the IMDB dataset." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import pyarrow as pa\n", "from datasets import Dataset\n", "import evaluate\n", "import numpy as np" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kW53p7VQqUDD", "outputId": "8231f3ba-37d5-4546-c4d0-6b4ff317ecf3" }, "outputs": [ { "data": { "text/plain": [ "device(type='cuda', index=0)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", "device" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": { "id": "68OdbTv5rLrm" }, "outputs": [], "source": [ "train_hg = Dataset(pa.Table.from_pandas(df_train))\n", "valid_hg = Dataset(pa.Table.from_pandas(df_val))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation Metrics: F1 Score and Accuracy" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "`compute_metrics` computes both accuracy and F1-score to evaluate a classification model. First, the accuracy and F1 metrics are loaded with `evaluate.load`. The `compute_metrics` function then receives `eval_pred`, a pair of arrays holding the model's predictions and the true labels. It takes the argmax of the predictions to obtain the predicted classes, compares them with the labels to compute accuracy, and computes the F1-score with `average=\"weighted\"`. Both results are combined into a single dictionary, which is returned." ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "id": "lUNhDPs0ry4m" }, "outputs": [], "source": [
"# Load both accuracy and f1 metrics\n",
"accuracy_metric = evaluate.load(\"accuracy\")\n",
"f1_metric = evaluate.load(\"f1\")\n",
"\n",
"# Metric helper method\n",
"def compute_metrics(eval_pred):\n",
"    predictions, labels = eval_pred\n",
"    # Convert per-class scores into predicted class indices\n",
"    predictions = np.argmax(predictions, axis=1)\n",
"\n",
"    # Compute accuracy\n",
"    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)\n",
"\n",
"    # Compute F1 score\n",
"    f1 = f1_metric.compute(predictions=predictions, references=labels, average=\"weighted\")\n",
"\n",
"    # Combine the metrics into a single dictionary\n",
"    combined_metrics = {\n",
"        'accuracy': accuracy['accuracy'],\n",
"        'f1': f1['f1']\n",
"    }\n",
"\n",
"    return combined_metrics" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9jJYTWsHjnEc", "outputId": "fe45691a-4476-4978-89b8-15f36465c37c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: accelerateNote: you may need to restart the kernel to use updated packages.\n", "\n", "Version: 0.31.0\n", "Summary: Accelerate\n", "Home-page: https://github.com/huggingface/accelerate\n", "Author: The HuggingFace team\n", "Author-email: zach.mueller@huggingface.co\n", "License: Apache\n", "Location: c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\n", "Requires: huggingface-hub, numpy, packaging, 
psutil, pyyaml, safetensors, torch\n", "Required-by: \n", "---\n", "Name: transformers\n", "Version: 4.41.2\n", "Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow\n", "Home-page: https://github.com/huggingface/transformers\n", "Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)\n", "Author-email: transformers@huggingface.co\n", "License: Apache 2.0 License\n", "Location: c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\n", "Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm\n", "Required-by: \n" ] } ], "source": [ "pip show accelerate transformers" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Training the model" ] },
{ "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QlaLCwf7rLtp", "outputId": "7e10e82a-8bc7-478b-851e-c7b628b46c41" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\transformers\\training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. 
Use `eval_strategy` instead\n", " warnings.warn(\n" ] } ], "source": [
"from transformers import TrainingArguments, Trainer\n",
"\n",
"EPOCHS = 1\n",
"\n",
"# Note: `evaluation_strategy` is deprecated in favor of `eval_strategy`\n",
"# (see the FutureWarning printed by this cell).\n",
"training_args = TrainingArguments(\n",
"    output_dir=\"./result\",\n",
"    evaluation_strategy=\"epoch\",\n",
"    num_train_epochs=EPOCHS,\n",
"    per_device_train_batch_size=16,\n",
"    per_device_eval_batch_size=8,\n",
")\n",
"\n",
"trainer = Trainer(\n",
"    model=model,\n",
"    args=training_args,\n",
"    train_dataset=train_hg,\n",
"    eval_dataset=valid_hg,\n",
"    tokenizer=tokenizer,\n",
"    compute_metrics=compute_metrics\n",
")" ] },
{ "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CUDA available: True\n", "CUDA version: 12.1\n" ] } ], "source": [ "print(\"CUDA available: \", torch.cuda.is_available())\n", "print(\"CUDA version: \", torch.version.cuda)" ] },
{ "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 141 }, "id": "3s6lVFz_rLwO", "outputId": "ee64e8e9-9c8c-42a8-c355-f51410cc33df" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0%| | 0/2500 [00:00