{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eBpjBBZc6IvA"
   },
   "source": [
    "# Fatima Fellowship Coding Challenge: Finetune a Generative AI Model\n",
    "\n",
    "Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge.\n",
    "\n",
    "**How to submit**: Please make a copy of this colab notebook, add your code and results, and submit your colab notebook along with your application. If you have never used a colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lQNUZjvuRt-m"
   },
   "source": [
    "\n",
    "\n",
    "---\n",
    "\n",
    "\n",
    "### **Important**: Beore you get started, please make sure to make a **copy of this notebook** and set sharing permissions so that **anyone with the link can view**. Otherwise, we will NOT be able to assess your application.\n",
    "\n",
    "\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qpFKMaPeggyX"
   },
   "source": [
    "# 0. Description\n",
    "\n",
    "The purpose of this coding challenge is to finetune a generative AI model on a dataset that *you* build.\n",
    "\n",
    "The dataset can be of any kind! For example, you could collect a dataset of football jerserys and train a machine learning model to be able to generate jerseys different teams apart. Or, you could finetune a generation model to be able to generate accurate recipes about a particular dish specific to your cuisine.\n",
    "\n",
    "We are interested in learning more about you and your coding abilities through this short exercise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "braBzmRpMe7_"
   },
   "source": [
    "# 1. Build a Dataset Based on Your Interests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1IWw-NZf5WfF"
   },
   "source": [
    "In the first step, you'll be building your OWN dataset of any kind. We expect that many students might build this dataset by scraping the web e.g. Google Images, or extracting samples from existing datasets (e.g. [from Hugging Face](https://huggingface.co/datasets)). Some suggestions:\n",
    "\n",
    "* Dataset size: although this can very, we generally recommend that the dataset should have at least 100 (training and validation) samples.\n",
    "* Dataset diversity: make sure your dataset is sufficiently varied. For example, if your dataset consists of celebrity images, you probably want celebrities of different ages, ethnicities, genders, etc.\n",
    "\n",
    "You may find Python libraries that download images such as `google_images_download` useful.\n",
    "\n",
    "Once you have built your dataset, please upload it to Hugging Face Hub using the `datasets` library and include the link below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:01.063480Z",
     "iopub.status.busy": "2024-07-14T23:54:01.063126Z",
     "iopub.status.idle": "2024-07-14T23:54:14.512034Z",
     "shell.execute_reply": "2024-07-14T23:54:14.510890Z",
     "shell.execute_reply.started": "2024-07-14T23:54:01.063450Z"
    },
    "id": "K2GJaYBpw91T",
    "outputId": "3c4ec6d3-2a83-4a81-8243-29d1cd77612a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: datasets in /opt/conda/lib/python3.10/site-packages (2.20.0)\n",
      "Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from datasets) (3.13.1)\n",
      "Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.10/site-packages (from datasets) (1.26.4)\n",
      "Requirement already satisfied: pyarrow>=15.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (16.1.0)\n",
      "Requirement already satisfied: pyarrow-hotfix in /opt/conda/lib/python3.10/site-packages (from datasets) (0.6)\n",
      "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from datasets) (0.3.8)\n",
      "Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from datasets) (2.2.2)\n",
      "Requirement already satisfied: requests>=2.32.2 in /opt/conda/lib/python3.10/site-packages (from datasets) (2.32.3)\n",
      "Requirement already satisfied: tqdm>=4.66.3 in /opt/conda/lib/python3.10/site-packages (from datasets) (4.66.4)\n",
      "Requirement already satisfied: xxhash in /opt/conda/lib/python3.10/site-packages (from datasets) (3.4.1)\n",
      "Requirement already satisfied: multiprocess in /opt/conda/lib/python3.10/site-packages (from datasets) (0.70.16)\n",
      "Requirement already satisfied: fsspec<=2024.5.0,>=2023.1.0 in /opt/conda/lib/python3.10/site-packages (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets) (2024.5.0)\n",
      "Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets) (3.9.1)\n",
      "Requirement already satisfied: huggingface-hub>=0.21.2 in /opt/conda/lib/python3.10/site-packages (from datasets) (0.23.4)\n",
      "Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from datasets) (21.3)\n",
      "Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from datasets) (6.0.1)\n",
      "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (23.2.0)\n",
      "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (6.0.4)\n",
      "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.9.3)\n",
      "Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.4.1)\n",
      "Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)\n",
      "Requirement already satisfied: async-timeout<5.0,>=4.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets) (4.0.3)\n",
      "Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.10/site-packages (from huggingface-hub>=0.21.2->datasets) (4.9.0)\n",
      "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.10/site-packages (from packaging->datasets) (3.1.1)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (3.3.2)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (3.6)\n",
      "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (1.26.18)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (2024.7.4)\n",
      "Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2.9.0.post0)\n",
      "Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2023.3.post1)\n",
      "Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets) (2023.4)\n",
      "Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n"
     ]
    }
   ],
   "source": [
    "!pip install datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:14.514248Z",
     "iopub.status.busy": "2024-07-14T23:54:14.513956Z",
     "iopub.status.idle": "2024-07-14T23:54:31.621227Z",
     "shell.execute_reply": "2024-07-14T23:54:31.620455Z",
     "shell.execute_reply.started": "2024-07-14T23:54:14.514223Z"
    },
    "id": "Z0x_qADbxA5x",
    "outputId": "344c2357-410f-49b0-8761-3f118615a95d"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-07-14 23:54:22.010940: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "2024-07-14 23:54:22.011038: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "2024-07-14 23:54:22.120615: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset, DatasetDict, Dataset\n",
    "\n",
    "from transformers import (\n",
    "    AutoTokenizer,\n",
    "    AutoConfig,\n",
    "    AutoModelForSequenceClassification,\n",
    "    DataCollatorWithPadding,\n",
    "    TrainingArguments,\n",
    "    Trainer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:31.622913Z",
     "iopub.status.busy": "2024-07-14T23:54:31.622369Z",
     "iopub.status.idle": "2024-07-14T23:54:45.202850Z",
     "shell.execute_reply": "2024-07-14T23:54:45.201709Z",
     "shell.execute_reply.started": "2024-07-14T23:54:31.622886Z"
    },
    "id": "yFUGXvOCxKzX",
    "outputId": "d7d91cc7-d663-4f0d-9f79-6acd8850d92e",
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.10/pty.py:89: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.\n",
      "  pid, fd = os.forkpty()\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting peft\n",
      "  Downloading peft-0.11.1-py3-none-any.whl.metadata (13 kB)\n",
      "Requirement already satisfied: torch in /opt/conda/lib/python3.10/site-packages (2.1.2)\n",
      "Collecting evaluate\n",
      "  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)\n",
      "Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.10/site-packages (from peft) (1.26.4)\n",
      "Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from peft) (21.3)\n",
      "Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from peft) (5.9.3)\n",
      "Requirement already satisfied: pyyaml in /opt/conda/lib/python3.10/site-packages (from peft) (6.0.1)\n",
      "Requirement already satisfied: transformers in /opt/conda/lib/python3.10/site-packages (from peft) (4.42.3)\n",
      "Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from peft) (4.66.4)\n",
      "Requirement already satisfied: accelerate>=0.21.0 in /opt/conda/lib/python3.10/site-packages (from peft) (0.32.1)\n",
      "Requirement already satisfied: safetensors in /opt/conda/lib/python3.10/site-packages (from peft) (0.4.3)\n",
      "Requirement already satisfied: huggingface-hub>=0.17.0 in /opt/conda/lib/python3.10/site-packages (from peft) (0.23.4)\n",
      "Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from torch) (3.13.1)\n",
      "Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.10/site-packages (from torch) (4.9.0)\n",
      "Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch) (1.13.0)\n",
      "Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch) (3.2.1)\n",
      "Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch) (3.1.2)\n",
      "Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch) (2024.5.0)\n",
      "Requirement already satisfied: datasets>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from evaluate) (2.20.0)\n",
      "Requirement already satisfied: dill in /opt/conda/lib/python3.10/site-packages (from evaluate) (0.3.8)\n",
      "Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from evaluate) (2.2.2)\n",
      "Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.10/site-packages (from evaluate) (2.32.3)\n",
      "Requirement already satisfied: xxhash in /opt/conda/lib/python3.10/site-packages (from evaluate) (3.4.1)\n",
      "Requirement already satisfied: multiprocess in /opt/conda/lib/python3.10/site-packages (from evaluate) (0.70.16)\n",
      "Requirement already satisfied: pyarrow>=15.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets>=2.0.0->evaluate) (16.1.0)\n",
      "Requirement already satisfied: pyarrow-hotfix in /opt/conda/lib/python3.10/site-packages (from datasets>=2.0.0->evaluate) (0.6)\n",
      "Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets>=2.0.0->evaluate) (3.9.1)\n",
      "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.10/site-packages (from packaging>=20.0->peft) (3.1.1)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->evaluate) (3.3.2)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->evaluate) (3.6)\n",
      "Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->evaluate) (1.26.18)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.19.0->evaluate) (2024.7.4)\n",
      "Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2->torch) (2.1.3)\n",
      "Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas->evaluate) (2.9.0.post0)\n",
      "Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->evaluate) (2023.3.post1)\n",
      "Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.10/site-packages (from pandas->evaluate) (2023.4)\n",
      "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/lib/python3.10/site-packages (from sympy->torch) (1.3.0)\n",
      "Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.10/site-packages (from transformers->peft) (2023.12.25)\n",
      "Requirement already satisfied: tokenizers<0.20,>=0.19 in /opt/conda/lib/python3.10/site-packages (from transformers->peft) (0.19.1)\n",
      "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (23.2.0)\n",
      "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (6.0.4)\n",
      "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.9.3)\n",
      "Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.4.1)\n",
      "Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (1.3.1)\n",
      "Requirement already satisfied: async-timeout<5.0,>=4.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets>=2.0.0->evaluate) (4.0.3)\n",
      "Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->evaluate) (1.16.0)\n",
      "Downloading peft-0.11.1-py3-none-any.whl (251 kB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m251.6/251.6 kB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0mm\n",
      "\u001b[?25hDownloading evaluate-0.4.2-py3-none-any.whl (84 kB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25hInstalling collected packages: peft, evaluate\n",
      "Successfully installed evaluate-0.4.2 peft-0.11.1\n"
     ]
    }
   ],
   "source": [
    "!pip install peft  evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:45.205611Z",
     "iopub.status.busy": "2024-07-14T23:54:45.205289Z",
     "iopub.status.idle": "2024-07-14T23:54:48.418952Z",
     "shell.execute_reply": "2024-07-14T23:54:48.418056Z",
     "shell.execute_reply.started": "2024-07-14T23:54:45.205583Z"
    },
    "id": "CLX2Z20zxpc9",
    "outputId": "56192048-fdd7-4291-8c3f-bc201b54a214"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6b02c7d5863a45c89f774634dfec4e02",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading readme:   0%|          | 0.00/592 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "226df3fb69bc4d99b0067ba86a7369f9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading data:   0%|          | 0.00/836k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "182957d058f84874803975438d05e7d3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading data:   0%|          | 0.00/853k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d4cfec7d671c49e09ba848234b9f6d1e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "93690d0b8eb542aba20c9ab3e661b3e5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating validation split:   0%|          | 0/1000 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['label', 'text'],\n",
       "        num_rows: 1000\n",
       "    })\n",
       "    validation: Dataset({\n",
       "        features: ['label', 'text'],\n",
       "        num_rows: 1000\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load dataset\n",
    "dataset = load_dataset('osabobo/IMD-dataset')\n",
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:48.420302Z",
     "iopub.status.busy": "2024-07-14T23:54:48.420016Z",
     "iopub.status.idle": "2024-07-14T23:54:50.640531Z",
     "shell.execute_reply": "2024-07-14T23:54:50.639459Z",
     "shell.execute_reply.started": "2024-07-14T23:54:48.420277Z"
    },
    "id": "04upPZHdx8XW"
   },
   "outputs": [],
   "source": [
    "from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig\n",
    "import evaluate\n",
    "import torch\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:50.642234Z",
     "iopub.status.busy": "2024-07-14T23:54:50.641928Z",
     "iopub.status.idle": "2024-07-14T23:54:50.652666Z",
     "shell.execute_reply": "2024-07-14T23:54:50.651652Z",
     "shell.execute_reply.started": "2024-07-14T23:54:50.642209Z"
    },
    "id": "fODqNWzJxzIj",
    "outputId": "ae9417a4-9ded-4d16-db6e-9badcb470043"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# display % of training data with label=1\n",
    "np.array(dataset['train']['label']).sum()/len(dataset['train']['label'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qSeLed2JxvGI"
   },
   "source": [
    "**Link to the dataset on Hugging Face Hub:**[Dataset](https://huggingface.co/datasets/osabobo/IMD-dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sFU9LTOyMiMj"
   },
   "source": [
    "# 2. Finetune a Foundation Model\n",
    "\n",
    "Now that you have collected a dataset, its time to pick a base model to finetune.\n",
    "\n",
    "\n",
    "* Go to the [Hugging Face Hub](https://huggingface.co/models) and pick a foundation model to fine-tune. (For example, if you are interested in generating images, you could pick [Stable Diffusion 1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) or [Stable Diffusion 3](https://huggingface.co/stabilityai/stable-diffusion-3-medium) as your base model.) Make sure to pick a model that can be loaded in the free tier of the Colab Notebook.\n",
    "* Then finetine the your model on the dataset that you collected in Step 1. There are different ways to finetune a model: [from LoRA to a full finetune](https://huggingface.co/docs/diffusers/v0.13.0/en/training/lora). Pick one of these methods, and explain your reasoning below. We suggest that you use use the `transformers` or `diffusers` library to finetune a foundation model.\n",
    "* Generate some samples from the base model and from the final finetuned model. How do they compare?  \n",
    "* [Upload the the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sTxd8qJgyOUW"
   },
   "source": [
    "### model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:50.654582Z",
     "iopub.status.busy": "2024-07-14T23:54:50.654171Z",
     "iopub.status.idle": "2024-07-14T23:54:52.626447Z",
     "shell.execute_reply": "2024-07-14T23:54:52.625623Z",
     "shell.execute_reply.started": "2024-07-14T23:54:50.654547Z"
    },
    "id": "tj42qvjByQ2B",
    "outputId": "8da19bd1-e713-41dd-919f-5255462a348d"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "98f8947966f248619e057f11d35cdc0d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "41dcb8dc1fb24c1fa68fdb7f1ddbb171",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    }
   ],
   "source": [
    "model_checkpoint = 'distilbert-base-uncased'\n",
    "# model_checkpoint = 'roberta-base' # you can alternatively use roberta-base but this model is bigger thus training will take longer\n",
    "\n",
    "# define label maps\n",
    "id2label = {0: \"Negative\", 1: \"Positive\"}\n",
    "label2id = {\"Negative\":0, \"Positive\":1}\n",
    "\n",
    "# generate classification model from model_checkpoint\n",
    "model = AutoModelForSequenceClassification.from_pretrained(\n",
    "    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:52.628409Z",
     "iopub.status.busy": "2024-07-14T23:54:52.627831Z",
     "iopub.status.idle": "2024-07-14T23:54:52.635147Z",
     "shell.execute_reply": "2024-07-14T23:54:52.634101Z",
     "shell.execute_reply.started": "2024-07-14T23:54:52.628372Z"
    },
    "id": "E90i018KyJH3",
    "outputId": "5ba6ea14-f569-4e0f-d2e9-aaf5409803b7"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DistilBertForSequenceClassification(\n",
       "  (distilbert): DistilBertModel(\n",
       "    (embeddings): Embeddings(\n",
       "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
       "      (position_embeddings): Embedding(512, 768)\n",
       "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
       "      (dropout): Dropout(p=0.1, inplace=False)\n",
       "    )\n",
       "    (transformer): Transformer(\n",
       "      (layer): ModuleList(\n",
       "        (0-5): 6 x TransformerBlock(\n",
       "          (attention): MultiHeadSelfAttention(\n",
       "            (dropout): Dropout(p=0.1, inplace=False)\n",
       "            (q_lin): Linear(in_features=768, out_features=768, bias=True)\n",
       "            (k_lin): Linear(in_features=768, out_features=768, bias=True)\n",
       "            (v_lin): Linear(in_features=768, out_features=768, bias=True)\n",
       "            (out_lin): Linear(in_features=768, out_features=768, bias=True)\n",
       "          )\n",
       "          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
       "          (ffn): FFN(\n",
       "            (dropout): Dropout(p=0.1, inplace=False)\n",
       "            (lin1): Linear(in_features=768, out_features=3072, bias=True)\n",
       "            (lin2): Linear(in_features=3072, out_features=768, bias=True)\n",
       "            (activation): GELUActivation()\n",
       "          )\n",
       "          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
       "        )\n",
       "      )\n",
       "    )\n",
       "  )\n",
       "  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)\n",
       "  (classifier): Linear(in_features=768, out_features=2, bias=True)\n",
       "  (dropout): Dropout(p=0.2, inplace=False)\n",
       ")"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# display architecture\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vsVJW7TQyyHQ"
   },
   "source": [
    "### preprocess data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:52.636870Z",
     "iopub.status.busy": "2024-07-14T23:54:52.636509Z",
     "iopub.status.idle": "2024-07-14T23:54:53.419088Z",
     "shell.execute_reply": "2024-07-14T23:54:53.418290Z",
     "shell.execute_reply.started": "2024-07-14T23:54:52.636827Z"
    },
    "id": "NDdHasyrysy0"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bf102ba85e644539929c669af61a243c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5c88f68deb51424b95f30acd8e8336a4",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1e5a5f022f1a45119d96c6fa35aa50a1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# create tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)\n",
    "\n",
    "# add pad token if none exists\n",
    "if tokenizer.pad_token is None:\n",
    "    tokenizer.add_special_tokens({'pad_token': '[PAD]'})\n",
    "    model.resize_token_embeddings(len(tokenizer))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:53.422895Z",
     "iopub.status.busy": "2024-07-14T23:54:53.422594Z",
     "iopub.status.idle": "2024-07-14T23:54:53.428556Z",
     "shell.execute_reply": "2024-07-14T23:54:53.427473Z",
     "shell.execute_reply.started": "2024-07-14T23:54:53.422873Z"
    },
    "id": "wNeERWOTys11"
   },
   "outputs": [],
   "source": [
    "# create tokenize function\n",
    "def tokenize_function(examples):\n",
    "    # extract text\n",
    "    text = examples[\"text\"]\n",
    "\n",
    "    #tokenize and truncate text\n",
    "    tokenizer.truncation_side = \"left\"\n",
    "    tokenized_inputs = tokenizer(\n",
    "        text,\n",
    "        return_tensors=\"np\",\n",
    "        truncation=True,\n",
    "        max_length=512\n",
    "    )\n",
    "\n",
    "    return tokenized_inputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:53.430306Z",
     "iopub.status.busy": "2024-07-14T23:54:53.429938Z",
     "iopub.status.idle": "2024-07-14T23:54:54.447690Z",
     "shell.execute_reply": "2024-07-14T23:54:54.446671Z",
     "shell.execute_reply.started": "2024-07-14T23:54:53.430275Z"
    },
    "id": "ZBVlRE6ry6x6",
    "outputId": "e48e34e9-264d-429a-9efd-b06192e30202"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3da8aa6443244cb69fec95d521bc0ac2",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Map:   0%|          | 0/1000 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1ddd7909cf474a3bb4df5bf7d72966ae",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Map:   0%|          | 0/1000 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['label', 'text', 'input_ids', 'attention_mask'],\n",
       "        num_rows: 1000\n",
       "    })\n",
       "    validation: Dataset({\n",
       "        features: ['label', 'text', 'input_ids', 'attention_mask'],\n",
       "        num_rows: 1000\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# tokenize training and validation datasets\n",
    "tokenized_dataset = dataset.map(tokenize_function, batched=True)\n",
    "tokenized_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:54.449258Z",
     "iopub.status.busy": "2024-07-14T23:54:54.448965Z",
     "iopub.status.idle": "2024-07-14T23:54:54.453874Z",
     "shell.execute_reply": "2024-07-14T23:54:54.452576Z",
     "shell.execute_reply.started": "2024-07-14T23:54:54.449233Z"
    },
    "id": "bgvnCM9Py603"
   },
   "outputs": [],
   "source": [
    "# create data collator\n",
    "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fRckxXulzRPV"
   },
   "source": [
    "### evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:54.455722Z",
     "iopub.status.busy": "2024-07-14T23:54:54.455323Z",
     "iopub.status.idle": "2024-07-14T23:54:54.942049Z",
     "shell.execute_reply": "2024-07-14T23:54:54.941158Z",
     "shell.execute_reply.started": "2024-07-14T23:54:54.455688Z"
    },
    "id": "ZvW09r7vzJwu"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "eca166b3f5fc4f0f9bfbbb1ded64397d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# import accuracy evaluation metric\n",
    "accuracy = evaluate.load(\"accuracy\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:54.943903Z",
     "iopub.status.busy": "2024-07-14T23:54:54.943443Z",
     "iopub.status.idle": "2024-07-14T23:54:54.949279Z",
     "shell.execute_reply": "2024-07-14T23:54:54.948317Z",
     "shell.execute_reply.started": "2024-07-14T23:54:54.943869Z"
    },
    "id": "6R-shGndzJy7"
   },
   "outputs": [],
   "source": [
    "# define an evaluation function to pass into trainer later\n",
    "def compute_metrics(p):\n",
    "    predictions, labels = p\n",
    "    predictions = np.argmax(predictions, axis=1)\n",
    "\n",
    "    return {\"accuracy\": accuracy.compute(predictions=predictions, references=labels)}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JHbcLhWvzmN8"
   },
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MAyi20Npzmnz"
   },
   "source": [
    "### Apply untrained model to text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:54.950805Z",
     "iopub.status.busy": "2024-07-14T23:54:54.950495Z",
     "iopub.status.idle": "2024-07-14T23:54:55.231971Z",
     "shell.execute_reply": "2024-07-14T23:54:55.230948Z",
     "shell.execute_reply.started": "2024-07-14T23:54:54.950779Z"
    },
    "id": "tvEMxMu7zJ2s",
    "outputId": "92066969-c9b5-4caf-a04d-da813b3cc895"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Untrained model predictions:\n",
      "----------------------------\n",
      "It was good. - Positive\n",
      "Not a fan, don't recommed. - Positive\n",
      "Better than the first one. - Positive\n",
      "This is not worth watching even once. - Positive\n",
      "This one is a pass. - Positive\n"
     ]
    }
   ],
   "source": [
    "# define list of examples\n",
    "text_list = [\"It was good.\", \"Not a fan, don't recommed.\", \"Better than the first one.\", \"This is not worth watching even once.\", \"This one is a pass.\"]\n",
    "\n",
    "print(\"Untrained model predictions:\")\n",
    "print(\"----------------------------\")\n",
    "for text in text_list:\n",
    "    # tokenize text\n",
    "    inputs = tokenizer.encode(text, return_tensors=\"pt\")\n",
    "    # compute logits\n",
    "    logits = model(inputs).logits\n",
    "    # convert logits to label\n",
    "    predictions = torch.argmax(logits)\n",
    "\n",
    "    print(text + \" - \" + id2label[predictions.tolist()])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CKHNn6ZNz-y3"
   },
   "source": [
    "### Train model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:55.233869Z",
     "iopub.status.busy": "2024-07-14T23:54:55.233390Z",
     "iopub.status.idle": "2024-07-14T23:54:55.239067Z",
     "shell.execute_reply": "2024-07-14T23:54:55.237919Z",
     "shell.execute_reply.started": "2024-07-14T23:54:55.233833Z"
    },
    "id": "-xuIRAPr0CWG"
   },
   "outputs": [],
   "source": [
    "peft_config = LoraConfig(task_type=\"SEQ_CLS\",\n",
    "                        r=4,\n",
    "                        lora_alpha=32,\n",
    "                        lora_dropout=0.01,\n",
    "                        target_modules = ['q_lin'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:55.240708Z",
     "iopub.status.busy": "2024-07-14T23:54:55.240406Z",
     "iopub.status.idle": "2024-07-14T23:54:55.249582Z",
     "shell.execute_reply": "2024-07-14T23:54:55.248639Z",
     "shell.execute_reply.started": "2024-07-14T23:54:55.240685Z"
    },
    "id": "P2h1nrWU0IqN",
    "outputId": "1af146f2-8fdf-4559-d996-b1b48e4f25d6"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='SEQ_CLS', inference_mode=False, r=4, target_modules={'q_lin'}, lora_alpha=32, lora_dropout=0.01, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "peft_config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:55.251220Z",
     "iopub.status.busy": "2024-07-14T23:54:55.250714Z",
     "iopub.status.idle": "2024-07-14T23:54:55.274109Z",
     "shell.execute_reply": "2024-07-14T23:54:55.273307Z",
     "shell.execute_reply.started": "2024-07-14T23:54:55.251193Z"
    },
    "id": "lGdxTfFv0I-9",
    "outputId": "987f0000-697c-458c-942c-f19af4e74181"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9307\n"
     ]
    }
   ],
   "source": [
    "model = get_peft_model(model, peft_config)\n",
    "model.print_trainable_parameters()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:55.275504Z",
     "iopub.status.busy": "2024-07-14T23:54:55.275234Z",
     "iopub.status.idle": "2024-07-14T23:54:55.279915Z",
     "shell.execute_reply": "2024-07-14T23:54:55.279014Z",
     "shell.execute_reply.started": "2024-07-14T23:54:55.275481Z"
    },
    "id": "XrMxqk_90S6Q"
   },
   "outputs": [],
   "source": [
    "# hyperparameters\n",
    "lr = 1e-3\n",
    "batch_size = 4\n",
    "num_epochs = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:55.281084Z",
     "iopub.status.busy": "2024-07-14T23:54:55.280845Z",
     "iopub.status.idle": "2024-07-14T23:54:55.350814Z",
     "shell.execute_reply": "2024-07-14T23:54:55.349830Z",
     "shell.execute_reply.started": "2024-07-14T23:54:55.281063Z"
    },
    "id": "kYJDZJt50S9Q",
    "outputId": "c7e9d13f-d31d-4724-9638-7d7351bd3d7b"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "# define training arguments\n",
    "training_args = TrainingArguments(\n",
    "    output_dir= model_checkpoint + \"-lora-text-classification\",\n",
    "    learning_rate=lr,\n",
    "    per_device_train_batch_size=batch_size,\n",
    "    per_device_eval_batch_size=batch_size,\n",
    "    num_train_epochs=num_epochs,\n",
    "    weight_decay=0.01,\n",
    "    evaluation_strategy=\"epoch\",\n",
    "    save_strategy=\"epoch\",\n",
    "    load_best_model_at_end=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 75
    },
    "execution": {
     "iopub.execute_input": "2024-07-14T23:54:55.352902Z",
     "iopub.status.busy": "2024-07-14T23:54:55.352113Z",
     "iopub.status.idle": "2024-07-15T00:01:58.470190Z",
     "shell.execute_reply": "2024-07-15T00:01:58.469378Z",
     "shell.execute_reply.started": "2024-07-14T23:54:55.352863Z"
    },
    "id": "0V25wLEN0fTk",
    "outputId": "830beb8b-cf8d-4b0e-f759-ffd32e3836be"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.\n",
      "\u001b[34m\u001b[1mwandb\u001b[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)\n",
      "\u001b[34m\u001b[1mwandb\u001b[0m: You can find your API key in your browser here: https://wandb.ai/authorize\n",
      "\u001b[34m\u001b[1mwandb\u001b[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "  ········································\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[34m\u001b[1mwandb\u001b[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "Tracking run with wandb version 0.17.4"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Run data is saved locally in <code>/kaggle/working/wandb/run-20240714_235539-ypukib7l</code>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "Syncing run <strong><a href='https://wandb.ai/johanneseboigbe55-octave-analytics/huggingface/runs/ypukib7l' target=\"_blank\">distilbert-base-uncased-lora-text-classification</a></strong> to <a href='https://wandb.ai/johanneseboigbe55-octave-analytics/huggingface' target=\"_blank\">Weights & Biases</a> (<a href='https://wandb.me/run' target=\"_blank\">docs</a>)<br/>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       " View project at <a href='https://wandb.ai/johanneseboigbe55-octave-analytics/huggingface' target=\"_blank\">https://wandb.ai/johanneseboigbe55-octave-analytics/huggingface</a>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       " View run at <a href='https://wandb.ai/johanneseboigbe55-octave-analytics/huggingface/runs/ypukib7l' target=\"_blank\">https://wandb.ai/johanneseboigbe55-octave-analytics/huggingface/runs/ypukib7l</a>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='1250' max='1250' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [1250/1250 06:01, Epoch 10/10]\n",
       "    </div>\n",
       "    <table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       " <tr style=\"text-align: left;\">\n",
       "      <th>Epoch</th>\n",
       "      <th>Training Loss</th>\n",
       "      <th>Validation Loss</th>\n",
       "      <th>Accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>No log</td>\n",
       "      <td>0.259819</td>\n",
       "      <td>{'accuracy': 0.896}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>No log</td>\n",
       "      <td>0.357995</td>\n",
       "      <td>{'accuracy': 0.888}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>No log</td>\n",
       "      <td>0.403524</td>\n",
       "      <td>{'accuracy': 0.885}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>0.262200</td>\n",
       "      <td>0.513255</td>\n",
       "      <td>{'accuracy': 0.881}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>0.262200</td>\n",
       "      <td>0.614568</td>\n",
       "      <td>{'accuracy': 0.886}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>0.262200</td>\n",
       "      <td>0.757634</td>\n",
       "      <td>{'accuracy': 0.885}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>0.262200</td>\n",
       "      <td>0.749906</td>\n",
       "      <td>{'accuracy': 0.885}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>0.045000</td>\n",
       "      <td>0.808175</td>\n",
       "      <td>{'accuracy': 0.891}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>0.045000</td>\n",
       "      <td>0.804488</td>\n",
       "      <td>{'accuracy': 0.89}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>10</td>\n",
       "      <td>0.045000</td>\n",
       "      <td>0.800608</td>\n",
       "      <td>{'accuracy': 0.893}</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Trainer is attempting to log a value of \"{'accuracy': 0.896}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.888}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.885}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.881}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.886}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.885}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.885}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.891}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.89}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "Trainer is attempting to log a value of \"{'accuracy': 0.893}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "TrainOutput(global_step=1250, training_loss=0.12382552947998048, metrics={'train_runtime': 421.8436, 'train_samples_per_second': 23.705, 'train_steps_per_second': 2.963, 'total_flos': 1253694805157184.0, 'train_loss': 0.12382552947998048, 'epoch': 10.0})"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# creater trainer object\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=tokenized_dataset[\"train\"],\n",
    "    eval_dataset=tokenized_dataset[\"validation\"],\n",
    "    tokenizer=tokenizer,\n",
    "    data_collator=data_collator, # this will dynamically pad examples in each batch to be equal length\n",
    "    compute_metrics=compute_metrics,\n",
    ")\n",
    "\n",
    "# train model\n",
    "trainer.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5lbssHsP0prm"
   },
   "source": [
    "### Generate prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-15T00:01:58.471810Z",
     "iopub.status.busy": "2024-07-15T00:01:58.471432Z",
     "iopub.status.idle": "2024-07-15T00:02:14.131264Z",
     "shell.execute_reply": "2024-07-15T00:02:14.130128Z",
     "shell.execute_reply.started": "2024-07-15T00:01:58.471777Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.10/pty.py:89: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.\n",
      "  pid, fd = os.forkpty()\n",
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found existing installation: torch 2.1.2\n",
      "Uninstalling torch-2.1.2:\n",
      "  Successfully uninstalled torch-2.1.2\n"
     ]
    }
   ],
   "source": [
    "!pip uninstall -y torch\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-15T00:02:14.135181Z",
     "iopub.status.busy": "2024-07-15T00:02:14.134858Z",
     "iopub.status.idle": "2024-07-15T00:06:34.130741Z",
     "shell.execute_reply": "2024-07-15T00:06:34.129542Z",
     "shell.execute_reply.started": "2024-07-15T00:02:14.135154Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/nightly/gpu\n",
      "Collecting torch\n",
      "  Downloading torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)\n",
      "Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from torch) (3.13.1)\n",
      "Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/lib/python3.10/site-packages (from torch) (4.9.0)\n",
      "Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch) (1.13.0)\n",
      "Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch) (3.2.1)\n",
      "Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch) (3.1.2)\n",
      "Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch) (2024.5.0)\n",
      "Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)\n",
      "  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
      "Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)\n",
      "  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
      "Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)\n",
      "  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n",
      "Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)\n",
      "  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n",
      "Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)\n",
      "  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
      "Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)\n",
      "  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
      "Collecting nvidia-curand-cu12==10.3.2.106 (from torch)\n",
      "  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n",
      "Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch)\n",
      "  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n",
      "Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch)\n",
      "  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n",
      "Collecting nvidia-nccl-cu12==2.20.5 (from torch)\n",
      "  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)\n",
      "Collecting nvidia-nvtx-cu12==12.1.105 (from torch)\n",
      "  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)\n",
      "Collecting triton==2.3.1 (from torch)\n",
      "  Downloading triton-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)\n",
      "Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch)\n",
      "  Downloading nvidia_nvjitlink_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
      "Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2->torch) (2.1.3)\n",
      "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/lib/python3.10/site-packages (from sympy->torch) (1.3.0)\n",
      "Downloading torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl (779.1 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m779.1/779.1 MB\u001b[0m \u001b[31m2.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m410.6/410.6 MB\u001b[0m \u001b[31m1.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.1/14.1 MB\u001b[0m \u001b[31m51.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m23.7/23.7 MB\u001b[0m \u001b[31m46.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m823.6/823.6 kB\u001b[0m \u001b[31m39.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m731.7/731.7 MB\u001b[0m \u001b[31m1.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:02\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m121.6/121.6 MB\u001b[0m \u001b[31m2.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:02\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.5/56.5 MB\u001b[0m \u001b[31m4.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m124.2/124.2 MB\u001b[0m \u001b[31m7.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m196.0/196.0 MB\u001b[0m \u001b[31m5.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m176.2/176.2 MB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0mm\n",
      "\u001b[?25hDownloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m99.1/99.1 kB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25hDownloading triton-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m168.1/168.1 MB\u001b[0m \u001b[31m3.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hDownloading nvidia_nvjitlink_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl (21.3 MB)\n",
      "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.3/21.3 MB\u001b[0m \u001b[31m6.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
      "\u001b[?25hInstalling collected packages: triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, nvidia-cusparse-cu12, nvidia-cudnn-cu12, nvidia-cusolver-cu12, torch\n",
      "Successfully installed nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.5.82 nvidia-nvtx-cu12-12.1.105 torch-2.3.1 triton-2.3.1\n"
     ]
    }
   ],
   "source": [
    "!pip install torch --pre --extra-index-url https://download.pytorch.org/whl/nightly/gpu\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-15T00:06:34.133020Z",
     "iopub.status.busy": "2024-07-15T00:06:34.132587Z",
     "iopub.status.idle": "2024-07-15T00:06:34.142491Z",
     "shell.execute_reply": "2024-07-15T00:06:34.140853Z",
     "shell.execute_reply.started": "2024-07-15T00:06:34.132982Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MPS backend is not available\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "if torch.backends.mps.is_available():\n",
    "    print(\"MPS backend is available\")\n",
    "else:\n",
    "    print(\"MPS backend is not available\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-15T00:09:13.790787Z",
     "iopub.status.busy": "2024-07-15T00:09:13.790051Z",
     "iopub.status.idle": "2024-07-15T00:09:13.797350Z",
     "shell.execute_reply": "2024-07-15T00:09:13.796206Z",
     "shell.execute_reply.started": "2024-07-15T00:09:13.790754Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using device: cuda\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "print(f\"Using device: {device}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-15T00:09:37.751116Z",
     "iopub.status.busy": "2024-07-15T00:09:37.750202Z",
     "iopub.status.idle": "2024-07-15T00:09:37.880663Z",
     "shell.execute_reply": "2024-07-15T00:09:37.879602Z",
     "shell.execute_reply.started": "2024-07-15T00:09:37.751070Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Trained model predictions:\n",
      "--------------------------\n",
      "Text: It was good.\n",
      "Prediction: tensor([1], device='cuda:0')\n",
      "\n",
      "Text: Not a fan, don't recommed.\n",
      "Prediction: tensor([0], device='cuda:0')\n",
      "\n",
      "Text: Better than the first one.\n",
      "Prediction: tensor([1], device='cuda:0')\n",
      "\n",
      "Text: This is not worth watching even once.\n",
      "Prediction: tensor([1], device='cuda:0')\n",
      "\n",
      "Text: This one is a pass.\n",
      "Prediction: tensor([0], device='cuda:0')\n",
      "\n"
     ]
    }
   ],
   "source": [
    "model.to(device)  # move the model to the appropriate device\n",
    "\n",
    "print(\"Trained model predictions:\")\n",
    "print(\"--------------------------\")\n",
    "\n",
    "for text in text_list:\n",
    "    inputs = tokenizer.encode(text, return_tensors=\"pt\").to(device)  # move the inputs to the appropriate device\n",
    "    logits = model(inputs).logits\n",
    "    predictions = torch.max(logits, 1).indices\n",
    "    print(f\"Text: {text}\\nPrediction: {predictions}\\n\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Y0iH6E6-0z-r"
   },
   "source": [
    "### Optional: push model to hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-15T00:10:50.355689Z",
     "iopub.status.busy": "2024-07-15T00:10:50.354890Z",
     "iopub.status.idle": "2024-07-15T00:10:50.386851Z",
     "shell.execute_reply": "2024-07-15T00:10:50.385997Z",
     "shell.execute_reply.started": "2024-07-15T00:10:50.355623Z"
    },
    "id": "2I3YvXHo06c6"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fca9611543134ed49ba804e1472e7b45",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# option 1: notebook login\n",
    "from huggingface_hub import notebook_login\n",
    "notebook_login() # ensure token gives write access\n",
    "\n",
    "# # option 2: key login\n",
    "# from huggingface_hub import login\n",
    "# write_key = 'hf_' # paste token here\n",
    "# login(write_key)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-15T00:35:51.569492Z",
     "iopub.status.busy": "2024-07-15T00:35:51.568682Z",
     "iopub.status.idle": "2024-07-15T00:36:04.438463Z",
     "shell.execute_reply": "2024-07-15T00:36:04.437328Z",
     "shell.execute_reply.started": "2024-07-15T00:35:51.569455Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.\n",
      "Token is valid (permission: fineGrained).\n",
      "Your token has been saved to /root/.cache/huggingface/token\n",
      "Login successful\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fe6b291faf654789b80c53c84e17f1e8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "db0d84eb42a940f993dc7f40af1f2e1e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "CommitInfo(commit_url='https://huggingface.co/osabobo/distilbert-base-uncased-lora-text-classification/commit/60d2fab6a0d1c0f71458aa6a3bb06c577c8d489f', commit_message='Upload tokenizer', commit_description='', oid='60d2fab6a0d1c0f71458aa6a3bb06c577c8d489f', pr_url=None, pr_revision=None, pr_num=None)"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from huggingface_hub import HfApi, login\n",
    "from transformers import AutoModel, AutoTokenizer\n",
    "import torch\n",
    "\n",
    "# Replace 'YOUR_TOKEN_HERE' with your actual token\n",
    "login(token=\"    \")\n",
    "\n",
    "# Define model and tokenizer\n",
    "model = AutoModel.from_pretrained('distilbert-base-uncased')\n",
    "tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')\n",
    "\n",
    "# Define the model ID with your username\n",
    "model_checkpoint = \"distilbert-base-uncased\"\n",
    "model_id = \"osabobo/\" + model_checkpoint + \"-lora-text-classification\"\n",
    "\n",
    "# Move model to the appropriate device\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "model.to(device)\n",
    "\n",
    "# Push the model to Hugging Face Hub\n",
    "model.push_to_hub(model_id)\n",
    "\n",
    "# Push the tokenizer to Hugging Face Hub\n",
    "tokenizer.push_to_hub(model_id)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-15T00:36:48.763372Z",
     "iopub.status.busy": "2024-07-15T00:36:48.762719Z",
     "iopub.status.idle": "2024-07-15T00:36:59.101381Z",
     "shell.execute_reply": "2024-07-15T00:36:59.100264Z",
     "shell.execute_reply.started": "2024-07-15T00:36:48.763342Z"
    },
    "id": "LLHAhdj51Pz_"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0e374e74b92d413f90f5c29a8cb8e5c9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/270M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5f345fe5bf9248619f928f254ad5ed3c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "training_args.bin:   0%|          | 0.00/5.18k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "82af5130bf9d433dabcc44dd545aaa00",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "events.out.tfevents.1721001296.819ddbe43a1c.34.0:   0%|          | 0.00/8.43k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a19a2835e09e40af94992b08eb12e162",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "CommitInfo(commit_url='https://huggingface.co/osabobo/distilbert-base-uncased-lora-text-classification/commit/dec6e947bef4f3d614979efbe354d5d041765de4', commit_message='osabobo/distilbert-base-uncased-lora-text-classification', commit_description='', oid='dec6e947bef4f3d614979efbe354d5d041765de4', pr_url=None, pr_revision=None, pr_num=None)"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.push_to_hub(model_id) # save trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kpInVUMLyJ24"
   },
   "source": [
    "**Write up**:\n",
    "* Explain what finetuning strategy you used and why\n",
    "\n",
    "The fine-tuning strategy used in this context involves taking a pre-trained language model, specifically the distilbert-base-uncased model, and adapting it to a specific task or domain, which appears to be text classification based on your model checkpoint name (distilbert-base-uncased-lora-text-classification).\n",
    "\n",
    "Here’s a breakdown of the strategy and its rationale:\n",
    "\n",
    "Strategy:\n",
    "Choice of Pre-trained Model:\n",
    "\n",
    "DistilBERT is chosen as the base model. It is a smaller, faster, and cheaper version of BERT (Bidirectional Encoder Representations from Transformers), trained to retain much of BERT's performance while being more resource-efficient. This choice is suitable for many NLP tasks due to its balance between performance and computational cost.\n",
    "Task-Specific Adaptation (Fine-tuning):\n",
    "\n",
    "Fine-tuning involves training the existing distilbert-base-uncased model on a specific dataset related to text classification. During fine-tuning:\n",
    "The model's weights are adjusted based on the task-specific data (in this case, the text classification data).\n",
    "The last few layers of the model are often retrained while keeping the initial layers frozen (or training them with a very low learning rate) to retain the learned representations from the original pre-training.\n",
    "This process allows the model to learn task-specific features and patterns from the labeled data, thereby improving its performance on the specific classification task.\n",
    "Training Parameters:\n",
    "\n",
    "Learning Rate (lr): Typically set to a small value initially, then adjusted based on validation performance to balance between rapid convergence and fine-grained learning.\n",
    "Batch Size: Determined based on available memory and computational resources, optimizing for training efficiency.\n",
    "Number of Epochs: The number of times the entire dataset is passed through the model during training, optimizing for convergence without overfitting.\n",
    "Rationale:\n",
    "Transfer Learning Benefit: By starting with a pre-trained model like DistilBERT, which has been trained on a large corpus of text, the fine-tuning process leverages this rich representation of language. This approach often leads to better performance compared to training a model from scratch, especially when data for the specific task is limited.\n",
    "\n",
    "Resource Efficiency: DistilBERT is chosen for its efficiency in terms of memory and computation, making it easier to train and deploy in various environments including cloud platforms and edge devices.\n",
    "\n",
    "Domain Adaptation: Fine-tuning allows the model to adapt to the specific nuances of the text classification task at hand (presumably \"lora\" classification in your case). This ensures that the model learns relevant features and can make accurate predictions in the target domain.\n",
    "\n",
    "In summary, the chosen fine-tuning strategy with DistilBERT is aimed at leveraging pre-trained language representations to improve performance on a specific text classification task efficiently. This approach balances between leveraging existing knowledge and adapting to new data, making it suitable for a wide range of NLP applications.\n",
    "\n",
    "* Share some samples from the base model and from the final finetuned model. How do they compare?\n",
    "\n",
    "Samples from the Base Model\n",
    "The base model, typically a pre-trained transformer like BERT or GPT, starts with a generic understanding of language derived from extensive training on diverse text corpora. This base model can perform various language tasks but is not specialized in any specific domain or task. For instance, in sentiment analysis, the base model might accurately predict general sentiments but struggle with domain-specific nuances.\n",
    "\n",
    "Samples from the Fine-Tuned Model\n",
    "The fine-tuned model, in this case, is specialized for a specific task, such as text classification or sentiment analysis. By training the model on a targeted dataset (e.g., IMDB movie reviews), the model learns the specific patterns and vocabulary related to the task. The notebook outlines the process of fine-tuning using a smaller, domain-specific dataset, leading to improved performance in recognizing sentiment in movie reviews.\n",
    "\n",
    "Comparison\n",
    "Accuracy and Precision: The fine-tuned model shows improved accuracy and precision in its predictions. For example, in sentiment analysis of movie reviews, the fine-tuned model is better at distinguishing between positive and negative sentiments, even in the presence of domain-specific language and slang.\n",
    "\n",
    "Generalization: While the base model can handle a wide range of tasks, the fine-tuned model excels in its specific domain. However, this specialization might reduce its performance in other unrelated tasks.\n",
    "\n",
    "Efficiency: The fine-tuned model, optimized for a specific task, can perform more efficiently and with lower computational requirements compared to using the base model directly for the same task.\n",
    "\n",
    "In summary, the fine-tuned model demonstrates a significant improvement in performance for the specific task of text classification or sentiment analysis, showcasing the value of task-specific training on top of a pre-trained base model\n",
    "\n",
    "**Link to the model on Hugging Face Hub:**[Model](https://huggingface.co/asavochkin/distilbert-base-uncased-lora-text-classification)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kaggle": {
   "accelerator": "nvidiaTeslaT4",
   "dataSources": [],
   "dockerImageVersionId": 30747,
   "isGpuEnabled": true,
   "isInternetEnabled": true,
   "language": "python",
   "sourceType": "notebook"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}