{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "eBpjBBZc6IvA" }, "source": [ "# Fatima Fellowship Coding Challenge: Finetune a Generative AI Model\n", "\n", "Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge.\n", "\n", "**How to submit**: Please make a copy of this Colab notebook, add your code and results, and submit your Colab notebook along with your application. If you have never used a Colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw)." ] }, { "cell_type": "markdown", "metadata": { "id": "lQNUZjvuRt-m" }, "source": [ "\n", "\n", "---\n", "\n", "\n", "### **Important**: Before you get started, please make sure to make a **copy of this notebook** and set sharing permissions so that **anyone with the link can view**. Otherwise, we will NOT be able to assess your application.\n", "\n", "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qpFKMaPeggyX" }, "source": [ "# 0. Description\n", "\n", "The purpose of this coding challenge is to finetune a generative AI model on a dataset that *you* build.\n", "\n", "The dataset can be of any kind! For example, you could collect a dataset of football jerseys and train a machine learning model to generate jerseys for different teams. Or, you could finetune a generative model to produce accurate recipes for a particular dish specific to your cuisine.\n", "\n", "We are interested in learning more about you and your coding abilities through this short exercise." ] }, { "cell_type": "markdown", "metadata": { "id": "braBzmRpMe7_" }, "source": [ "# 1. Build a Dataset Based on Your Interests" ] }, { "cell_type": "markdown", "metadata": { "id": "1IWw-NZf5WfF" }, "source": [ "In the first step, you'll be building your OWN dataset of any kind. 
We expect that many students might build this dataset by scraping the web (e.g. Google Images) or by extracting samples from existing datasets (e.g. [from Hugging Face](https://huggingface.co/datasets)). Some suggestions:\n", "\n", "* Dataset size: although this can vary, we generally recommend that the dataset have at least 100 (training and validation) samples.\n", "* Dataset diversity: make sure your dataset is sufficiently varied. For example, if your dataset consists of celebrity images, you probably want celebrities of different ages, ethnicities, genders, etc.\n", "\n", "You may find Python libraries that download images, such as `google_images_download`, useful.\n", "\n", "Once you have built your dataset, please upload it to the Hugging Face Hub using the `datasets` library and include the link below:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-07-14T23:54:01.063480Z", "iopub.status.busy": "2024-07-14T23:54:01.063126Z", "iopub.status.idle": "2024-07-14T23:54:14.512034Z", "shell.execute_reply": "2024-07-14T23:54:14.510890Z", "shell.execute_reply.started": "2024-07-14T23:54:01.063450Z" }, "id": "K2GJaYBpw91T", "outputId": "3c4ec6d3-2a83-4a81-8243-29d1cd77612a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: datasets in "2024-07-14T23:54:45.205611Z", "iopub.status.busy": "2024-07-14T23:54:45.205289Z", "iopub.status.idle": "2024-07-14T23:54:48.418952Z", "shell.execute_reply": "2024-07-14T23:54:48.418056Z", "shell.execute_reply.started": "2024-07-14T23:54:45.205583Z" }, "id": "CLX2Z20zxpc9", "outputId": "56192048-fdd7-4291-8c3f-bc201b54a214" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6b02c7d5863a45c89f774634dfec4e02", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading readme: 0%| | 0.00/592 [00:00, auto_mapping=None, 
base_model_name_or_path=None, revision=None, task_type='SEQ_CLS', inference_mode=False, r=4, target_modules={'q_lin'}, lora_alpha=32, lora_dropout=0.01, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "peft_config" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-07-14T23:54:55.251220Z", "iopub.status.busy": "2024-07-14T23:54:55.250714Z", "iopub.status.idle": "2024-07-14T23:54:55.274109Z", "shell.execute_reply": "2024-07-14T23:54:55.273307Z", "shell.execute_reply.started": "2024-07-14T23:54:55.251193Z" }, "id": "lGdxTfFv0I-9", "outputId": "987f0000-697c-458c-942c-f19af4e74181" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9307\n" ] } ], "source": [ "model = get_peft_model(model, peft_config)\n", "model.print_trainable_parameters()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2024-07-14T23:54:55.275504Z", "iopub.status.busy": "2024-07-14T23:54:55.275234Z", "iopub.status.idle": "2024-07-14T23:54:55.279915Z", "shell.execute_reply": "2024-07-14T23:54:55.279014Z", "shell.execute_reply.started": "2024-07-14T23:54:55.275481Z" }, "id": "XrMxqk_90S6Q" }, "outputs": [], "source": [ "# hyperparameters\n", "lr = 1e-3\n", "batch_size = 4\n", "num_epochs = 10" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-07-14T23:54:55.281084Z", "iopub.status.busy": 
"2024-07-14T23:54:55.280845Z", "iopub.status.idle": "2024-07-14T23:54:55.350814Z", "shell.execute_reply": "2024-07-14T23:54:55.349830Z", "shell.execute_reply.started": "2024-07-14T23:54:55.281063Z" }, "id": "kYJDZJt50S9Q", "outputId": "c7e9d13f-d31d-4724-9638-7d7351bd3d7b" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1494: FutureWarning: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead Use `eval_strategy` instead\n", " warnings.warn(\n" ] } ], "source": [ "# define training arguments\n", "training_args = TrainingArguments(\n", " output_dir= model_checkpoint + \"-lora-text-classification\",\n", " learning_rate=lr,\n", " per_device_train_batch_size=batch_size,\n", " per_device_eval_batch_size=batch_size,\n", " num_train_epochs=num_epochs,\n", " weight_decay=0.01,\n", " evaluation_strategy=\"epoch\",\n", " save_strategy=\"epoch\",\n", " load_best_model_at_end=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "execution": { "iopub.execute_input": "2024-07-14T23:54:55.352902Z", "iopub.status.busy": "2024-07-14T23:54:55.352113Z", "iopub.status.idle": "2024-07-15T00:01:58.470190Z", "shell.execute_reply": "2024-07-15T00:01:58.469378Z", "shell.execute_reply.started": "2024-07-14T23:54:55.352863Z" }, "id": "0V25wLEN0fTk", "outputId": "830beb8b-cf8d-4b0e-f759-ffd32e3836be" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. 
Tracking run with wandb version 0.17.4 (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: You can find your API key in your browser here: https://wandb.ai/authorize\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:" ] }, { "name": "stdin", "output_type": "stream", "text": [ " ········································\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc\n" ] }, { "data": { "text/html": [ "Tracking run with wandb version 0.17.4" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Run data is saved locally in /kaggle/working/wandb/run-20240714_235539-ypukib7l" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Syncing run distilbert-base-uncased-lora-text-classification to Weights & Biases (docs)
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ " View project at https://wandb.ai/johanneseboigbe55-octave-analytics/huggingface" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ " View run at https://wandb.ai/johanneseboigbe55-octave-analytics/huggingface/runs/ypukib7l" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n" ] }, { "data": { "text/html": [ "\n", "
\n", " [1250/1250 06:01, Epoch 10/10]\n", "
Epoch | Training Loss | Validation Loss | Accuracy
1 | No log | 0.259819 | {'accuracy': 0.896}
2 | No log | 0.357995 | {'accuracy': 0.888}
3 | No log | 0.403524 | {'accuracy': 0.885}
4 | 0.262200 | 0.513255 | {'accuracy': 0.881}
5 | 0.262200 | 0.614568 | {'accuracy': 0.886}
6 | 0.262200 | 0.757634 | {'accuracy': 0.885}
7 | 0.262200 | 0.749906 | {'accuracy': 0.885}
8 | 0.045000 | 0.808175 | {'accuracy': 0.891}
9 | 0.045000 | 0.804488 | {'accuracy': 0.89}
10 | 0.045000 | 0.800608 | {'accuracy': 0.893}

Trainer is attempting to log a value of \"{'accuracy': 0.896}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "Trainer is attempting to log a value of \"{'accuracy': 0.885}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "Trainer is attempting to log a value of \"{'accuracy': 0.881}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "Trainer is attempting to log a value of \"{'accuracy': 0.886}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. 
This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "Trainer is attempting to log a value of \"{'accuracy': 0.885}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "Trainer is attempting to log a value of \"{'accuracy': 0.885}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "Trainer is attempting to log a value of \"{'accuracy': 0.891}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "Trainer is attempting to log a value of \"{'accuracy': 0.89}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. 
This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n", " warnings.warn('Was asked to gather along dimension 0, but all '\n", "Trainer is attempting to log a value of \"{'accuracy': 0.893}\" of type <class 'dict'> for key \"eval/accuracy\" as a scalar. "2024-07-15T00:06:34.142491Z", "shell.execute_reply": "2024-07-15T00:06:34.140853Z", "shell.execute_reply.started": "2024-07-15T00:06:34.132982Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MPS backend is not available\n" ] } ], "source": [ "import torch\n", "\n", "if torch.backends.mps.is_available():\n", " print(\"MPS backend is available\")\n", "else:\n", " print(\"MPS backend is not available\")\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2024-07-15T00:09:13.790787Z", "iopub.status.busy": "2024-07-15T00:09:13.790051Z", "iopub.status.idle": "2024-07-15T00:09:13.797350Z", "shell.execute_reply": "2024-07-15T00:09:13.796206Z", "shell.execute_reply.started": "2024-07-15T00:09:13.790754Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using device: cuda\n" ] } ], "source": [ "import torch\n", "\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "print(f\"Using device: {device}\")\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2024-07-15T00:09:37.751116Z", "iopub.status.busy": 
"2024-07-15T00:09:37.750202Z", "iopub.status.idle": "2024-07-15T00:09:37.880663Z", "shell.execute_reply": "2024-07-15T00:09:37.879602Z", "shell.execute_reply.started": "2024-07-15T00:09:37.751070Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trained model predictions:\n", "--------------------------\n", "Text: It was good.\n", "Prediction: tensor([1], device='cuda:0')\n", "\n", "Text: Not a fan, don't recommed.\n", "Prediction: tensor([0], device='cuda:0')\n", "\n", "Text: Better than the first one.\n", "Prediction: tensor([1], device='cuda:0')\n", "\n", "Text: This is not worth watching even once.\n", "Prediction: tensor([1], device='cuda:0')\n", "\n", "Text: This one is a pass.\n", "Prediction: tensor([0], device='cuda:0')\n", "\n" ] } ], "source": [ "model.to(device) # move the model to the appropriate device\n", "\n", "print(\"Trained model predictions:\")\n", "print(\"--------------------------\")\n", "\n", "for text in text_list:\n", " inputs = tokenizer.encode(text, return_tensors=\"pt\").to(device) # move the inputs to the appropriate device\n", " logits = model(inputs).logits\n", " predictions = torch.max(logits, 1).indices\n", " print(f\"Text: {text}\\nPrediction: {predictions}\\n\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Y0iH6E6-0z-r" }, "source": [ "### Optional: push model to hub" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "execution": { "iopub.execute_input": "2024-07-15T00:10:50.355689Z", "iopub.status.busy": "2024-07-15T00:10:50.354890Z", "iopub.status.idle": "2024-07-15T00:10:50.386851Z", "shell.execute_reply": "2024-07-15T00:10:50.385997Z", "shell.execute_reply.started": "2024-07-15T00:10:50.355623Z" }, "id": "2I3YvXHo06c6" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "fca9611543134ed49ba804e1472e7b45", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HTML(value='