{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "GZiMfnKVCniS" }, "source": [ "# PROJECT III, MLDS ADVANCED TRAINING PROGRAM\n", "## Daniel F. Benavides R. \n", "## Module VI - Training a neural network model and deploying it locally. \n", "\n", "### OBJECTIVE\n", "\n", "The objective of this project is to deploy a model locally. The work is carried out in two parts. In the first, the model is trained and then saved locally for later use. \n", "\n", "What follows is the fine-tuning of the pretrained transformers model [_'distilbert-base-uncased'_](https://huggingface.co/distilbert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France.). This model was originally trained for _fill mask_ tasks and will be adapted as a classifier of unwanted (spam) **SMS** messages. \n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gTYjO-PJIUT-", "outputId": "27f1d993-d34f-4f4b-bbe1-ab6c96bc2825" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" ] } ], "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ] }, { "cell_type": "markdown", "metadata": { "id": "GRFpQTqzDCLT" }, "source": [ "### Loading and preparing the data \n", "\n", "Next we import pandas and use it to load the dataset, splitting each row on the tab character to separate the label from the message.\n", "\n", "Then, using the _list_ function, we convert the messages and the labels into a pair of lists. Then we convert the labels into a dummy variable, since the output is binary _(the message is either spam or it is not)_\n", "\n", "## Importing the training dataset" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3Dt9fFKs74zR", "outputId": "271ff620-a449-4e87-a163-14621651b48e" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" ] } ], "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')\n", "path = \"/content/drive/MyDrive/MLDS-2/MODULO II/Talleres/SMSSpamCollection\"" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "id": "awPXefiYqQsF" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# The file is tab-separated with no header row: label first, then message\n", "df = pd.read_csv(path, sep='\\t', names=[\"label\", \"message\"])\n", "X = list(df['message'])\n", "# Encode the labels as a single 0/1 column (1 = spam)\n", "y = list(pd.get_dummies(df['label'], drop_first=True)['spam'].astype(int))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "1T8CpN3YDar6" }, "source": [ "### Preprocessing \n", "\n", "Now we import the *train_test_split* function from the *model_selection* module of the *scikit-learn* library and use it to split the data into training and test sets. We set the size of the test set to 20% of the sample. We also set the *random_state* parameter so that the split into the two sets is reproducible from one run to the next. \n", "\n", "Then we install the transformers library, although in my case it was already installed. 
\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "dLFDWda0rIKw" }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AqOBGiGErZgj", "outputId": "1a461d33-55ae-4a22-9746-8c01e99d49bd" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Requirement already satisfied: transformers in /usr/local/lib/python3.8/dist-packages (4.25.1)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from transformers) (21.3)\n", "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (2022.6.2)\n", "Requirement already satisfied: huggingface-hub<1.0,>=0.10.0 in /usr/local/lib/python3.8/dist-packages (from transformers) (0.11.1)\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from transformers) (3.8.2)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from transformers) (2.23.0)\n", "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.8/dist-packages (from transformers) (4.64.1)\n", "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (1.21.6)\n", "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from transformers) (6.0)\n", "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.8/dist-packages (from transformers) (0.13.2)\n", "Requirement already satisfied: typing-extensions>= in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0,>=0.10.0->transformers) 
(4.4.0)\n", "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>=20.0->transformers) (3.0.9)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2022.12.7)\n", "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (1.24.3)\n", "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2.10)\n", "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (3.0.4)\n" ] } ], "source": [ "!pip install transformers" ] }, { "cell_type": "markdown", "metadata": { "id": "31aeQ6u-Dq-K" }, "source": [ "Next we load the components we are going to use from the transformers library, in the following steps: \n", "\n", "* Load the pretrained model\n", "* Load the tokenizer \n", "\n", "We need to apply the tokenizer to our dataset. \n", "\n", "So next we import the _\"DistilBertTokenizerFast\"_ tokenizer from the transformers library and define it as our **tokenizer**, indicating that it comes from the pretrained model [_'distilbert-base-uncased'_](https://huggingface.co/distilbert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France.)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "id": "bcNEJ6perOSs" }, "outputs": [], "source": [ "from transformers import DistilBertTokenizerFast\n", "tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')" ] }, { "cell_type": "markdown", "metadata": { "id": "3RdR0eZaDyyi" }, "source": [ "Then we apply the tokenizer we just defined to our training and test messages. 
Since the SMS messages do not all have the same length (number of tokens), we set the truncation and padding parameters to True in order to obtain sequences of the same size: padding fills the shorter sequences with zeros and truncation cuts the longest ones. This yields a rectangular set of encodings and, later, rectangular tensors. " ] },
{ "cell_type": "code", "execution_count": 39, "metadata": { "id": "-OL3fgLvrXvH" }, "outputs": [], "source": [ "train_encodings = tokenizer(X_train, truncation=True, padding=True)\n", "test_encodings = tokenizer(X_test, truncation=True, padding=True)\n" ] },
{ "cell_type": "markdown", "metadata": { "id": "7JTdRQNVD4AK" }, "source": [ "Next we import TensorFlow in order to convert the encodings generated in the previous step into tensors. Here each encoding is paired with its corresponding label. " ] },
{ "cell_type": "code", "execution_count": 40, "metadata": { "id": "9B42CTCnrrEx" }, "outputs": [], "source": [ "import tensorflow as tf\n", "\n", "train_dataset = tf.data.Dataset.from_tensor_slices((\n", "    dict(train_encodings),\n", "    y_train\n", "))\n", "\n", "test_dataset = tf.data.Dataset.from_tensor_slices((\n", "    dict(test_encodings),\n", "    y_test\n", "))" ] },
{ "cell_type": "markdown", "metadata": { "id": "G3Wj2cqXD5hx" }, "source": [ "### Training\n", "\n", "Next we import TFDistilBertForSequenceClassification, which is used for sequence classification tasks (here, spam detection), together with TFTrainer and TFTrainingArguments, and we define the training arguments. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments\n", "\n", "training_args = TFTrainingArguments(\n", "    eval_steps=10,                   # number of update steps between evaluations\n", "    output_dir='./results',          # output directory\n", "    num_train_epochs=2,              # total number of training epochs\n", "    per_device_train_batch_size=8,   # batch size per device during training\n", "    per_device_eval_batch_size=8,    # batch size for evaluation\n", "    warmup_steps=500,                # number of warmup steps for learning rate scheduler\n", "    weight_decay=0.01,               # strength of weight decay\n", "    logging_dir='./logs',            # directory for storing logs\n", "    logging_steps=10,                # number of update steps between log entries\n", ")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "We recommend using native Keras instead, by calling methods like `fit()` and `predict()` directly on the model object. Detailed examples of the Keras style can be found in our examples at https://github.com/huggingface/transformers/tree/main/examples/tensorflow\n", " warnings.warn(\n" ] } ], "source": [ "with training_args.strategy.scope():\n", "    model = TFDistilBertForSequenceClassification.from_pretrained(\"distilbert-base-uncased\")\n", "\n", "trainer = TFTrainer(\n", "    model=model,                  # the instantiated 🤗 Transformers model to be trained\n", "    args=training_args,           # training arguments, defined above\n", "    train_dataset=train_dataset,  # training dataset\n", "    eval_dataset=test_dataset     # evaluation dataset\n", ")\n" ] },
{ "cell_type": "markdown", "metadata": { "id": "tnAE3agZ21dq" }, "source": [ "Once the model to be fine-tuned has been instantiated and the training arguments have been set, the trainer takes the data and fine-tunes the model. " ] },
{ "cell_type": "code", "execution_count": 43, "metadata": { "id": "bIba4vQg7Ecp" }, "outputs": [], "source": [ "trainer.train()" ] },
{ "cell_type": "markdown", "metadata": { "id": "Zerz-bv8EENp" }, "source": [ "All that remains is to apply the model we fine-tuned on the **training** dataset, make predictions, and evaluate those predictions. This procedure is described in the [fine-tuning guide](https://huggingface.co/transformers/v3.5.1/training.html) that Hugging Face makes available. " ] },
{ "cell_type": "code", "execution_count": 44, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "R534aDi3xD0s", "outputId": "65c5ac93-eb67-4413-e048-f7b4d9fd8931" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'eval_loss': 0.02398163080215454}" ] }, "metadata": {}, "execution_count": 44 } ], "source": [ "trainer.evaluate(test_dataset)" ] },
{ "cell_type": "markdown", "metadata": { "id": "4rLF3nApUndt" }, "source": [ "Next we apply the fine-tuned model to the test set to classify each of the samples. " ] },
{ "cell_type": "markdown", "metadata": { "id": "jpGNNvWEWU9u" }, "source": [ "### Model prediction\n", "\n", "The fine-tuned model is applied to the test dataset *test_dataset* and its performance is measured with accuracy; that is, the model receives the messages and is asked to predict whether or not each one is spam. `trainer.predict` returns the logits in `predictions` and the true labels in `label_ids`, so the predicted class of each message is the argmax over its two logits, and it is these predicted classes that are compared against `y_test`. " ] },
{ "cell_type": "code", "execution_count": 45, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UyBmI1WcxKjG", "outputId": "53067a82-55bf-4500-a38e-d890be6f7bf5" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "PredictionOutput(predictions=array([[ 3.4155877, -3.1767924],\n", " [-3.2374823, 3.135958 ],\n", " [ 3.348417 , -3.1216612],\n", " ...,\n", " [ 3.04905 , -2.8354154],\n", " [-3.1865208, 3.0687277],\n", " [ 3.212608 , -3.0316095]], dtype=float32), label_ids=array([0, 1, 0, ..., 0, 1, 0], dtype=int32), metrics={'eval_loss': 0.023984665530068533})" ] }, "metadata": {}, "execution_count": 45 } ], "source": [ "trainer.predict(test_dataset)" ] },
{ "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9Qc5FtM8xn9A", "outputId": "0d517424-b5d1-4324-be3c-6f90335aa4fd" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(1115,)" ] }, "metadata": {}, "execution_count": 46 } ], "source": [ "trainer.predict(test_dataset)[1].shape" ] },
{ "cell_type": "markdown", "metadata": { "id": "LUHX_tCTWFuu" }, "source": [ "#### Outputs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "fUVX_IhWxkxg", "outputId": "a2e94ee6-54a2-414f-c2e7-98950deb7732" }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# predictions holds one row of logits per message; index [1] (label_ids) would return the true labels instead\n", "output = np.argmax(trainer.predict(test_dataset).predictions, axis=1)\n", "output" ] },
{ "cell_type": "markdown", "metadata": { "id": "lUxvb6JcYB7_" }, "source": [ "#### Confusion matrix, Accuracy" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cfCE06jQu5cI", "outputId": "a1d10897-a36f-47a8-e038-0f68ec5e7ded" }, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix, accuracy_score\n", "\n", "# Store the result under its own name so the imported function is not shadowed\n", "cm = confusion_matrix(y_test, output)\n", "cm\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mv83DD8sl8JO", "outputId": "97612c62-b15f-453f-d51e-5cd12e554421" }, "outputs": [], "source": [ "acc = accuracy_score(y_test, output)\n", "acc" ] },
{ "cell_type": "markdown", "metadata": { "id": "Zm3mF58zYYze" }, "source": [ "#### Saving the model" ] },
{ "cell_type": "code", "execution_count": 59, "metadata": { "id": "okD5we1NwhQW" }, "outputs": [], "source": [ "trainer.save_model('ft_model')\n", "trainer.save_model('/content/drive/MyDrive/MLDS-2/MODULO III/Talleres/Modelo Entrenado')\n" ] }, { model2 = TFDistilBertForSequenceClassification.from_pretrained('/content/drive/MyDrive/MLDS-2/MODULO III/Talleres/Modelo Entrenado/') 