{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "PGnlRWvkY-2c" }, "source": [ "# Sentiment Analysis with BERT\n", "\n", "> TL;DR In this tutorial, you'll learn how to fine-tune BERT for sentiment analysis. You'll do the required text preprocessing (special tokens, padding, and attention masks) and build a Sentiment Classifier using the amazing Transformers library by Hugging Face!\n", "\n", "- [Read the tutorial](https://www.curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/)\n", "- [Run the notebook in your browser (Google Colab)](https://colab.research.google.com/drive/1PHv-IRLPCtv7oTcIGbsgZHqrB5LPvB7S)\n", "- [Read the `Getting Things Done with Pytorch` book](https://github.com/curiousily/Getting-Things-Done-with-Pytorch)\n", "\n", "You'll learn how to:\n", "\n", "- Intuitively understand what BERT is\n", "- Preprocess text data for BERT and build a PyTorch Dataset (tokenization, attention masks, and padding)\n", "- Use Transfer Learning to build a Sentiment Classifier using the Transformers library by Hugging Face\n", "- Evaluate the model on test data\n", "- Predict sentiment on raw text\n", "\n", "Let's get started!" 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "id": "fH8xHMfdX974", "outputId": "a417ae99-a1de-4683-f1bf-b86bbda8ca4e" }, "outputs": [ { "data": { "image/jpeg": "", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#@title Watch the video tutorial\n", "\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('8N-nM3QW7O0', width=720, height=420)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NJ6MhJYYBCwu", "outputId": "f07a7d16-bec0-4cdc-bb9a-630c3be37075" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wed Jun 28 17:10:17 2023 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "| | | MIG M. 
|\n", "|===============================+======================+======================|\n", "| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |\n", "| N/A 70C P8 11W / 70W | 0MiB / 15360MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: |\n", "| GPU GI CI PID Type Process name GPU Memory |\n", "| ID ID Usage |\n", "|=============================================================================|\n", "| No running processes found |\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "!nvidia-smi" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Tbodro8Fpmwr" }, "source": [ "## What is BERT?\n", "\n", "BERT (introduced in [this paper](https://arxiv.org/abs/1810.04805)) stands for Bidirectional Encoder Representations from Transformers. If you don't know what most of that means - you've come to the right place! Let's unpack the main ideas:\n", "\n", "- Bidirectional - to understand the text you're looking at, you'll have to look back (at the previous words) and forward (at the next words)\n", "- Transformers - The [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper presented the Transformer model. The Transformer reads entire sequences of tokens at once. In a sense, the model is non-directional, while LSTMs read sequentially (left-to-right or right-to-left). The attention mechanism allows for learning contextual relations between words (e.g. `his` in a sentence refers to Jim).\n", "- (Pre-trained) contextualized word embeddings - [The ELMo paper](https://arxiv.org/abs/1802.05365v2) introduced a way to encode words based on their meaning/context. For example, the word `nails` has multiple meanings - fingernails and metal nails.\n", "\n", "BERT was trained by masking 15% of the tokens with the goal of guessing them. 
An additional objective was to predict the next sentence. Let's look at examples of these tasks:\n", "\n", "### Masked Language Modeling (Masked LM)\n", "\n", "The objective of this task is to guess the masked tokens. Let's look at an example, and try to not make it harder than it has to be:\n", "\n", "That's `[mask]` she `[mask]` -> That's what she said\n", "\n", "### Next Sentence Prediction (NSP)\n", "\n", "Given a pair of sentences, the task is to say whether or not the second follows the first (binary classification). Let's continue with the example:\n", "\n", "*Input* = `[CLS]` That's `[mask]` she `[mask]`. `[SEP]` Hahaha, nice! `[SEP]`\n", "\n", "*Label* = *IsNext*\n", "\n", "*Input* = `[CLS]` That's `[mask]` she `[mask]`. `[SEP]` Dwight, you ignorant `[mask]`! `[SEP]`\n", "\n", "*Label* = *NotNext*\n", "\n", "The training corpus consisted of two sources: [Toronto Book Corpus](https://arxiv.org/abs/1506.06724) (800M words) and English Wikipedia (2,500M words). While the original Transformer has an encoder (for reading the input) and a decoder (that makes the prediction), BERT uses only the encoder.\n", "\n", "BERT is simply a pre-trained stack of Transformer Encoders. How many Encoders? We have two versions - with 12 (BERT Base) and 24 (BERT Large).\n", "\n", "### Is This Thing Useful in Practice?\n", "\n", "The BERT paper was released along with [the source code](https://github.com/google-research/bert) and pre-trained models.\n", "\n", "The best part is that you can do Transfer Learning (thanks to the ideas from OpenAI Transformer) with BERT for many NLP tasks - Classification, Question Answering, Entity Recognition, etc. You can train with small amounts of data and achieve great performance!" 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "wmj22-TcZMef" }, "source": [ "## Setup\n", "\n", "We'll need [the Transformers library](https://huggingface.co/transformers/) by Hugging Face:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Kj_7Tz0-pK69", "outputId": "5b72004f-c34c-4fc2-a5fe-51b52efc990b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/55.5 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.5/55.5 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.6 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m61.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h" ] } ], "source": [ "!pip install -q -U watermark" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Jjsbi1u3QFEM", "outputId": "17c7203e-853a-4b34-ad04-280ea83da05b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.2/7.2 MB\u001b[0m \u001b[31m43.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m236.8/236.8 kB\u001b[0m \u001b[31m24.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m76.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m81.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h" ] } ], "source": [ "!pip install -qq transformers" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AJqoaFpVpoM8", "outputId": "22e292c3-8154-483b-ce04-b16e2f52027f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.10.12\n", "IPython version : 7.34.0\n", "\n", "numpy : 1.22.4\n", "pandas : 1.5.3\n", "torch : 2.0.1+cu118\n", "transformers: 4.30.2\n", "\n" ] } ], "source": [ "%reload_ext watermark\n", "%watermark -v -p numpy,pandas,torch,transformers" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "cellView": "form", "colab": { "base_uri": "https://localhost:8080/" }, "id": "w68CZpOwFoly", "outputId": "6d9115dd-96e7-4244-b6a9-f7238b1da4c3" }, "outputs": [ { "data": { "text/plain": [ "device(type='cuda', index=0)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#@title Setup & Config\n", "import transformers\n", "from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup\n", "import torch\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from pylab import rcParams\n", "import matplotlib.pyplot as plt\n", "from matplotlib import rc\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, classification_report\n", "from collections import defaultdict\n", "from textwrap import wrap\n", "\n", "from torch import nn, optim\n", "from torch.utils.data import Dataset, DataLoader\n", "import torch.nn.functional as F\n", "\n", "%matplotlib inline\n", "%config InlineBackend.figure_format='retina'\n", "\n", "sns.set(style='whitegrid', palette='muted', font_scale=1.2)\n", "\n", 
"HAPPY_COLORS_PALETTE = [\"#01BEFE\", \"#FFDD00\", \"#FF7D00\", \"#FF006D\", \"#ADFF02\", \"#8F00FF\"]\n", "\n", "sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))\n", "\n", "rcParams['figure.figsize'] = 12, 8\n", "\n", "RANDOM_SEED = 42\n", "np.random.seed(RANDOM_SEED)\n", "torch.manual_seed(RANDOM_SEED)\n", "\n", "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", "device" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ufzPdoTtNikq" }, "source": [ "## Data Exploration\n", "\n", "We'll load the Google Play app reviews dataset, that we've put together in the previous part:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "SgPRhuMzi9ot", "outputId": "68b8fe7b-942a-48d7-f45e-67f108f15872" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/gdown/cli.py:121: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.\n", " warnings.warn(\n", "Downloading...\n", "From: https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV\n", "To: /content/apps.csv\n", "100% 134k/134k [00:00<00:00, 91.6MB/s]\n", "/usr/local/lib/python3.10/dist-packages/gdown/cli.py:121: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. 
You don't need to pass it anymore to use a file ID.\n", " warnings.warn(\n", "Downloading...\n", "From: https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv\n", "To: /content/reviews.csv\n", "100% 7.17M/7.17M [00:00<00:00, 138MB/s]\n" ] } ], "source": [ "!gdown --id 1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV\n", "!gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "mUKLyKc7I6Qp", "outputId": "9810e8b3-7979-484c-eff1-e67f70460a3b" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " userName userImage \\\n", "0 Andrew Thomas https://lh3.googleusercontent.com/a-/AOh14GiHd... \n", "1 Craig Haines https://lh3.googleusercontent.com/-hoe0kwSJgPQ... \n", "2 steven adkins https://lh3.googleusercontent.com/a-/AOh14GiXw... \n", "3 Lars Panzerbjørn https://lh3.googleusercontent.com/a-/AOh14Gg-h... \n", "4 Scott Prewitt https://lh3.googleusercontent.com/-K-X1-YsVd6U... \n", "\n", " content score thumbsUpCount \\\n", "0 Update: After getting a response from the deve... 1 21 \n", "1 Used it for a fair amount of time without any ... 1 11 \n", "2 Your app sucks now!!!!! Used to be good but no... 1 17 \n", "3 It seems OK, but very basic. Recurring tasks n... 1 192 \n", "4 Absolutely worthless. This app runs a prohibit... 1 42 \n", "\n", " reviewCreatedVersion at \\\n", "0 4.17.0.3 2020-04-05 22:25:57 \n", "1 4.17.0.3 2020-04-04 13:40:01 \n", "2 4.17.0.3 2020-04-01 16:18:13 \n", "3 4.17.0.2 2020-03-12 08:17:34 \n", "4 4.17.0.2 2020-03-14 17:41:01 \n", "\n", " replyContent repliedAt \\\n", "0 According to our TOS, and the term you have ag... 2020-04-05 15:10:24 \n", "1 It sounds like you logged in with a different ... 2020-04-05 15:11:35 \n", "2 This sounds odd! We are not aware of any issue... 2020-04-02 16:05:56 \n", "3 We do offer this option as part of the Advance... 2020-03-15 06:20:13 \n", "4 We're sorry you feel this way! 90% of the app ... 
2020-03-15 23:45:51 \n", "\n", " sortOrder appId \n", "0 most_relevant com.anydo \n", "1 most_relevant com.anydo \n", "2 most_relevant com.anydo \n", "3 most_relevant com.anydo \n", "4 most_relevant com.anydo " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"reviews.csv\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dB2jE6am7Dpo", "outputId": "88d5bdf9-a95d-4f2f-b712-7385a3532aae" }, "outputs": [ { "data": { "text/plain": [ "(15746, 11)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "TWqVNHJbn10l" }, "source": [ "We have about 16k examples. Let's check for missing values:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VA_wGSLQLKCh", "outputId": "244a9422-e781-4d45-d9f3-c0bb4c5ce1b9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 15746 entries, 0 to 15745\n", "Data columns (total 11 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 userName 15746 non-null object\n", " 1 userImage 15746 non-null object\n", " 2 content 15746 non-null object\n", " 3 score 15746 non-null int64 \n", " 4 thumbsUpCount 15746 non-null int64 \n", " 5 reviewCreatedVersion 13533 non-null object\n", " 6 at 15746 non-null object\n", " 7 replyContent 7367 non-null object\n", " 8 repliedAt 7367 non-null object\n", " 9 sortOrder 15746 non-null object\n", " 10 appId 15746 non-null object\n", "dtypes: int64(2), object(9)\n", "memory usage: 1.3+ MB\n" ] } ], "source": [ "df.info()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "H3cL_1qVn_6h" }, "source": [ "Great, no missing values in the score and review texts! 
Do we have class imbalance?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QzqfEbgjo5p8", "outputId": "0a0c490f-f216-44a6-a6a6-3f5685b995c3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", " ..\n", "15741 5\n", "15742 5\n", "15743 5\n", "15744 5\n", "15745 5\n", "Name: score, Length: 15746, dtype: int64\n" ] } ], "source": [ "print(df.score)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 710 }, "id": "Wwh_rW4Efhs3", "outputId": "099f692b-9151-42ac-c30d-a5fba9f7abf4" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 693, "width": 1035 } }, "output_type": "display_data" } ], "source": [ "sns.countplot(x='score', data = df)\n", "plt.xlabel('review score');" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "nZM0GKviobjM" }, "source": [ "That's hugely imbalanced, but it's okay. We're going to convert the dataset into negative, neutral and positive sentiment:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "ei0xmdi1Chp0" }, "outputs": [], "source": [ "def to_sentiment(rating):\n", "  rating = int(rating)\n", "  if rating <= 2:\n", "    return 0\n", "  elif rating == 3:\n", "    return 1\n", "  else:\n", "    return 2\n", "\n", "df['sentiment'] = df.score.apply(to_sentiment)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "V-155O-SFSqE" }, "outputs": [], "source": [ "class_names = ['negative', 'neutral', 'positive']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KM69l8w6oM-s", "outputId": "49cc4503-ca75-4bcc-eb14-c214319265ff" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "15741 2\n", "15742 2\n", "15743 2\n", "15744 2\n", "15745 2\n", "Name: sentiment, Length: 15746, dtype: int64\n" ] } ], "source": [ "print(df.sentiment)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 710 }, "id": "y3tY3ECJDPaz", "outputId": "2a4003f3-c32e-43ac-9ea1-17af9fb4b707" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 693, "width": 1035 } }, "output_type": "display_data" } ], "source": [ "ax = sns.countplot(x='sentiment', data = df)\n", "plt.xlabel('review sentiment')\n", "ax.set_xticklabels(class_names);" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "tOssB4CKnAX2" }, "source": [ "The balance was (mostly) restored." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "9aHyGuTFgyPO" }, "source": [ "## Data Preprocessing\n", "\n", "You might already know that Machine Learning models don't work with raw text. You need to convert text to numbers (of some sort). BERT requires even more attention (good one, right?). Here are the requirements:\n", "\n", "- Add special tokens to separate sentences and do classification\n", "- Pass sequences of constant length (introduce padding)\n", "- Create an array of 0s (pad token) and 1s (real token) called *attention mask*\n", "\n", "The Transformers library provides (you've guessed it) a wide variety of Transformer models (including BERT). It works with TensorFlow and PyTorch! It also includes prebuilt tokenizers that do the heavy lifting for us!\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "E7Mj-0ne--5t" }, "outputs": [], "source": [ "PRE_TRAINED_MODEL_NAME = 'bert-base-cased'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "fMSr7C-F_sey" }, "source": [ "> You can use a cased and uncased version of BERT and tokenizer. I've experimented with both. The cased version works better. Intuitively, that makes sense, since \"BAD\" might convey more sentiment than \"bad\"." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "NiLb-ltM-ZRz" }, "source": [ "Let's load a pre-trained [BertTokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 113, "referenced_widgets": [ "5968e9e39f1b466dbacf6cece45d7ee2", "5b828d7f8bd747619cddc4b28d1d5383", "c8135ded16e24ec1bb3ab6b26aa1e01d", "cdbbeea71cf1478a93df34da79abb1e1", "acd7b4228d984976b0aa16eda3a74e2a", "a5aaec5b93e04200bf59024dd6497283", "300407d9c525447cbc2223a30c17deac", "f59117f9fb414783b339226908a3d32f", "6f54d4cafe384c67a459065119006b4e", "fabf9140688d49d99dd799a2b364ec27", "3dac470fee3c4f9eb8bebb1992c19688", "c5c2e50239a44566b376e233b36e1220", "841966fa79134f75a1fed6db97f34dc9", "5d54820c46e44b54aaeb002e1054d495", "30cc769f98f9447eae28e7060c943d6f", "a7733a2a0059411db9f7edb5e4dd1a92", "62b3f7375b0243239eed24bd9b342939", "c23dcfab453442b4b41a90c2f096302d", "34bbd61a5c844010bbd47539fcf2abc6", "b389a069f9aa433b97f81043b5dc573a", "ce8cb11f151b45a69431a11d0789bb93", "6652821599284ab3bcb86f8dfee60da5", "7fb2d01f05914c7b97e58dc877c67c62", "3ed9609b1b804e079e548c6e87410911", "7022e627b3934baababd16449a13ab16", "b376b41872cd490e8bcd7d64db66941a", "0c85e11e27d6484ea31b766aecf0a06e", "5f9e170e18d8413ea9811e3cd501fcb7", "c65deba3b9c1414eb489317f2fb18b12", "b7336a3c03804a1889bd4539e8574100", "44b33027973c41d69730f66aeef675ef", "9ff5213e4251453b92de1eeeb9c7acb6", "0a3631a4ad9c49909c63b9cabdd926d8" ] }, "id": "H3AfJSZ8NNLF", "outputId": "c9d366bd-dc31-452e-bcb4-c9cff5c3898e" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5968e9e39f1b466dbacf6cece45d7ee2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading (…)solve/main/vocab.txt: 0%| | 0.00/213k [00:00:1: UserWarning: \n", "\n", "`distplot` is a deprecated function and will be removed in seaborn v0.14.0.\n", "\n", "Please 
adapt your code to use either `displot` (a figure-level function with\n", "similar flexibility) or `histplot` (an axes-level function for histograms).\n", "\n", "For a guide to updating your code to use the new functions, please see\n", "https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751\n", "\n", " sns.distplot(token_lens)\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 693, "width": 1052 } }, "output_type": "display_data" } ], "source": [ "sns.distplot(token_lens)\n", "plt.xlim([0, 256]);\n", "plt.xlabel('Token count');" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "oW6ajl30t6du" }, "source": [ "Most of the reviews seem to contain fewer than 128 tokens, but we'll be on the safe side and choose a maximum length of 160." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "id": "t7xSmJtLuoxW" }, "outputs": [], "source": [ "MAX_LEN = 160" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "XvvcoU6nurHy" }, "source": [ "We have all the building blocks required to create a PyTorch dataset. Let's do it:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "id": "E2BPgRJ7YBK0" }, "outputs": [], "source": [ "class GPReviewDataset(Dataset):\n", "\n", "  def __init__(self, reviews, targets, tokenizer, max_len):\n", "    self.reviews = reviews\n", "    self.targets = targets\n", "    self.tokenizer = tokenizer\n", "    self.max_len = max_len\n", "\n", "  def __len__(self):\n", "    return len(self.reviews)\n", "\n", "  def __getitem__(self, item):\n", "    review = str(self.reviews[item])\n", "    target = self.targets[item]\n", "\n", "    encoding = self.tokenizer.encode_plus(\n", "      review,\n", "      add_special_tokens=True,\n", "      max_length=self.max_len,\n", "      return_token_type_ids=False,\n", "      padding='max_length',\n", "      truncation=True,\n", "      return_attention_mask=True,\n", "      return_tensors='pt',\n", "    )\n", "\n", "    return {\n", "      'review_text': review,\n", "      'input_ids': encoding['input_ids'].flatten(),\n", "      'attention_mask': encoding['attention_mask'].flatten(),\n", "      'targets': torch.tensor(target, dtype=torch.long)\n", "    }" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "x2uwsvCYqDJK" }, "source": [ "The tokenizer is doing most of the heavy lifting for us. 
We also return the review texts, so it'll be easier to evaluate the predictions from our model. Let's split the data:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "id": "B-vWzoo81dvO" }, "outputs": [], "source": [ "df_train, df_test = train_test_split(df, test_size=0.1, random_state=RANDOM_SEED)\n", "df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=RANDOM_SEED)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xz3ZOQXVPCwh", "outputId": "fd7343ac-5f64-4fab-8694-4684bfdd1ff5" }, "outputs": [ { "data": { "text/plain": [ "((14171, 12), (787, 12), (788, 12))" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.shape, df_val.shape, df_test.shape" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "J4tQ1x-vqNab" }, "source": [ "We also need to create a couple of data loaders. Here's a helper function to do it:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "id": "KEGqcvkuOuTX" }, "outputs": [], "source": [ "def create_data_loader(df, tokenizer, max_len, batch_size):\n", "  ds = GPReviewDataset(\n", "    reviews=df.content.to_numpy(),\n", "    targets=df.sentiment.to_numpy(),\n", "    tokenizer=tokenizer,\n", "    max_len=max_len\n", "  )\n", "\n", "  return DataLoader(\n", "    ds,\n", "    batch_size=batch_size,\n", "    num_workers=4\n", "  )" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vODDxMKsPHqI", "outputId": "b177c19e-503c-4af3-e707-afa365afe0a0" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. 
Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.\n", " warnings.warn(_create_warning_msg(\n" ] } ], "source": [ "BATCH_SIZE = 16\n", "\n", "train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)\n", "val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)\n", "test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "A6dlOptwqlhF" }, "source": [ "Let's have a look at an example batch from our training data loader:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Y93ldSN47FeT", "outputId": "d21fc9dd-c4b7-4ed3-c312-82275db23d1c" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2377: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).\n", " warnings.warn(\n", "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2377: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 
512 for Bert).\n", " warnings.warn(\n", "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2377: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).\n", " warnings.warn(\n", "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2377: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 
512 for Bert).\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "dict_keys(['review_text', 'input_ids', 'attention_mask', 'targets'])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = next(iter(train_data_loader))\n", "data.keys()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IdU4YVqb7N8M", "outputId": "d8b1d7db-e725-4f87-9702-42c59f8e9c0f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([16, 160])\n", "torch.Size([16, 160])\n", "torch.Size([16])\n" ] } ], "source": [ "print(data['input_ids'].shape)\n", "print(data['attention_mask'].shape)\n", "print(data['targets'].shape)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "H63Y-TjyRC7S" }, "source": [ "## Sentiment Classification with BERT and Hugging Face" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 121, "referenced_widgets": [ "8432242e44044eb48b132453b519947e", "924c0e3b25054a68b37090823bd07a96", "fa802e4f8a3c4d20b0fabeaeaab7ee18", "d48ccf851db0492d805f2ac9342839b4", "7532b67ceaf74f6c8263ae7185a7b04d", "fdd1cbca5dd74a7e9c91d15cb6abd7ce", "b3940458f5134c4a8a16432113ef2561", "648c9e36f406464b8a5be4d19a69a4e7", "a364d729ab9342dfa53bb39e8f068688", "e8592109366842438c9295552aac501a", "90ececf282a94f6ca593d361ad6d9212" ] }, "id": "0P41FayISNRI", "outputId": "f189569f-338e-4a1f-dbfc-fd14e876918a" }, "outputs": [], "source": [ "from transformers import AutoModelForSequenceClassification\n", "\n", "model = AutoModelForSequenceClassification.from_pretrained(\n", "  'bert-base-cased',\n", "  num_labels=len(class_names),\n", "  output_attentions=False,\n", "  output_hidden_states=False\n", ")\n", "\n", "last_hidden_state = model.bert(\n", "  input_ids=encoding['input_ids'],\n", "  attention_mask=encoding['attention_mask']\n", ").last_hidden_state\n", "\n", 
"last_hidden_state.shape\n", "\n", "pooled_output = model.bert(\n", "  input_ids=encoding['input_ids'],\n", "  attention_mask=encoding['attention_mask']\n", ").pooler_output\n", "\n", "pooled_output.shape" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "0o_NiS3WgOFf" }, "source": [ "We can use all of this knowledge to create a classifier that uses the BERT model:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "id": "m_mRflxPl32F" }, "outputs": [], "source": [ "class SentimentClassifier(nn.Module):\n", "\n", "  def __init__(self, n_classes):\n", "    super(SentimentClassifier, self).__init__()\n", "    self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)\n", "    self.drop = nn.Dropout(p=0.3)\n", "    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)\n", "\n", "  def forward(self, input_ids, attention_mask):\n", "    returned = self.bert(\n", "      input_ids=input_ids,\n", "      attention_mask=attention_mask\n", "    )\n", "    pooled_output = returned[\"pooler_output\"]\n", "    output = self.drop(pooled_output)\n", "    return self.out(output)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "UJg8m3NQJahc" }, "source": [ "Our classifier delegates most of the heavy lifting to the BertModel. We use a dropout layer for some regularization and a fully-connected layer for our output. Note that we're returning the raw output (logits) of the last layer, since the cross-entropy loss function in PyTorch applies the softmax internally and expects unnormalized scores.\n", "\n", "This should work like any other PyTorch model. 
Let's create an instance and move it to the GPU:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "i0yQnuSFsjDp", "outputId": "d404ee38-b875-4e30-b85a-30ea0fef33be" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']\n", "- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" ] } ], "source": [ "model = SentimentClassifier(len(class_names))\n", "model = model.to(device)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "VCPCFDLlKIQd" }, "source": [ "We'll move the example batch of our training data to the GPU:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mz7p__CqdaMO", "outputId": "bedb8633-6ab1-47f1-ec51-eb2b76d2ad5e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([16, 160])\n", "torch.Size([16, 160])\n" ] } ], "source": [ "input_ids = data['input_ids'].to(device)\n", "attention_mask = data['attention_mask'].to(device)\n", "\n", "print(input_ids.shape) # batch size x seq length\n", "print(attention_mask.shape) # batch size x seq length" ] }, { "attachments": {}, "cell_type": "markdown", 
"metadata": { "id": "Hr1EgkEtKOIB" }, "source": [ "To get the predicted probabilities from our model, we'll apply the softmax function to the outputs. The classifier hasn't been fine-tuned yet, so these probabilities aren't meaningful:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2rTCj46Zamry", "outputId": "dc98bf73-9918-46c0-dd2a-9fe6a5e3b996" }, "outputs": [ { "data": { "text/plain": [ "tensor([[0.2332, 0.4717, 0.2951],\n", " [0.2219, 0.3612, 0.4168],\n", " [0.3589, 0.2427, 0.3984],\n", " [0.2221, 0.3218, 0.4561],\n", " [0.5762, 0.2056, 0.2181],\n", " [0.2249, 0.4422, 0.3329],\n", " [0.2750, 0.3181, 0.4069],\n", " [0.3732, 0.2454, 0.3813],\n", " [0.3927, 0.1892, 0.4181],\n", " [0.3349, 0.1948, 0.4703],\n", " [0.3043, 0.2263, 0.4695],\n", " [0.3710, 0.1894, 0.4396],\n", " [0.2090, 0.4449, 0.3462],\n", " [0.3396, 0.2594, 0.4010],\n", " [0.3456, 0.2260, 0.4284],\n", " [0.1500, 0.3302, 0.5198]], device='cuda:0', grad_fn=)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "F.softmax(model(input_ids, attention_mask), dim=1)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "g9xikRdtRN1N" }, "source": [ "### Training" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "76g7FV85H-T8" }, "source": [ "To reproduce the training procedure from the BERT paper, we'll use the [AdamW](https://huggingface.co/transformers/main_classes/optimizer_schedules.html#adamw) optimizer provided by Hugging Face. It applies weight decay in its corrected (decoupled) form, which keeps the setup close to the original paper. 
We'll also use a linear learning rate scheduler with no warmup steps:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5v-ArJ2fCCcU", "outputId": "5ac88e1a-670a-4d50-d21c-0d0874d1b649" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n", " warnings.warn(\n" ] } ], "source": [ "EPOCHS = 6\n", "\n", "optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)\n", "# The scheduler steps once per batch, so count the batches across all epochs\n", "total_steps = len(train_data_loader) * EPOCHS\n", "\n", "scheduler = get_linear_schedule_with_warmup(\n", "  optimizer,\n", "  num_warmup_steps=0,\n", "  num_training_steps=total_steps\n", ")\n", "\n", "loss_fn = nn.CrossEntropyLoss().to(device)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "A8522g7JIu5J" }, "source": [ "How do we come up with all the hyperparameters? The BERT authors have some recommendations for fine-tuning:\n", "\n", "- Batch size: 16, 32\n", "- Learning rate (Adam): 5e-5, 3e-5, 2e-5\n", "- Number of epochs: 2, 3, 4\n", "\n", "We're going to ignore the recommended number of epochs but stick with the rest. 
Note that increasing the batch size reduces the training time significantly, but it may give you lower accuracy.\n", "\n", "Let's continue with writing a helper function for training our model for one epoch:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "id": "bzl9UhuNx1_Q" }, "outputs": [], "source": [ "def train_epoch(\n", "  model,\n", "  data_loader,\n", "  loss_fn,\n", "  optimizer,\n", "  device,\n", "  scheduler,\n", "  n_examples\n", "):\n", "  model = model.train()\n", "\n", "  losses = []\n", "  correct_predictions = 0\n", "\n", "  for d in data_loader:\n", "    input_ids = d[\"input_ids\"].to(device)\n", "    attention_mask = d[\"attention_mask\"].to(device)\n", "    targets = d[\"targets\"].to(device)\n", "\n", "    outputs = model(\n", "      input_ids=input_ids,\n", "      attention_mask=attention_mask\n", "    )\n", "\n", "    # The predicted class is the index of the largest logit\n", "    _, preds = torch.max(outputs, dim=1)\n", "    loss = loss_fn(outputs, targets)\n", "\n", "    correct_predictions += torch.sum(preds == targets)\n", "    losses.append(loss.item())\n", "\n", "    loss.backward()\n", "    # Clip the gradient norm to avoid exploding gradients\n", "    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n", "    optimizer.step()\n", "    scheduler.step()\n", "    optimizer.zero_grad()\n", "\n", "  return correct_predictions.double() / n_examples, np.mean(losses)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "E4PniYIte0fr" }, "source": [ "Training the model should look familiar, except for two things. The scheduler gets called every time a batch is fed to the model. 
We're avoiding exploding gradients by clipping the gradient norm of the model using [clip_grad_norm_](https://pytorch.org/docs/stable/nn.html#clip-grad-norm).\n", "\n", "Let's write another helper function that evaluates the model on a given data loader:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "CXeRorVGIKre" }, "outputs": [], "source": [ "def eval_model(model, data_loader, loss_fn, device, n_examples):\n", "  model = model.eval()\n", "\n", "  losses = []\n", "  correct_predictions = 0\n", "\n", "  # No gradients are needed for evaluation\n", "  with torch.no_grad():\n", "    for d in data_loader:\n", "      input_ids = d[\"input_ids\"].to(device)\n", "      attention_mask = d[\"attention_mask\"].to(device)\n", "      targets = d[\"targets\"].to(device)\n", "\n", "      outputs = model(\n", "        input_ids=input_ids,\n", "        attention_mask=attention_mask\n", "      )\n", "      _, preds = torch.max(outputs, dim=1)\n", "\n", "      loss = loss_fn(outputs, targets)\n", "\n", "      correct_predictions += torch.sum(preds == targets)\n", "      losses.append(loss.item())\n", "\n", "  return correct_predictions.double() / n_examples, np.mean(losses)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "a_rdSDBHhhCh" }, "source": [ "Using those two, we can write our training loop. We'll also store the training history:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1zhHoFNsxufs", "outputId": "fa239d8b-385a-40d1-cd45-d05a6494a388" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/6\n", "----------\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2377: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. 
In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Train loss 0.7357264192392272 accuracy 0.6690424105567709\n", "Val loss 0.559217081964016 accuracy 0.7789072426937739\n", "\n", "Epoch 2/6\n", "----------\n", "Train loss 0.42908171780878346 accuracy 0.8358619716322067\n", "Val loss 0.5046901363134384 accuracy 0.8373570520965693\n", "\n", "Epoch 3/6\n", "----------\n", "Train loss 0.2530873617710247 accuracy 0.9162373862112766\n", "Val loss 0.6362524032965302 accuracy 0.8398983481575604\n", "\n", "Epoch 4/6\n", "----------\n", "Train loss 0.16919020543878258 accuracy 0.9513090113612307\n", "Val loss 0.6386195559287444 accuracy 0.8640406607369759\n", "\n", "Epoch 5/6\n", "----------\n", "Train loss 0.11795092619438897 accuracy 0.9676099075576883\n", "Val loss 0.7068283748999238 accuracy 0.866581956797967\n", "\n", "Epoch 6/6\n", "----------\n", "Train loss 0.08801285981362382 accuracy 0.9757956389810176\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2377: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 
512 for Bert).\n", " warnings.warn(\n", "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2377: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).\n", " warnings.warn(\n", "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2377: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 
512 for Bert).\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Val loss 0.7396833367331419 accuracy 0.866581956797967\n", "\n", "CPU times: user 37min 33s, sys: 22.2 s, total: 37min 55s\n", "Wall time: 38min 45s\n" ] } ], "source": [ "%%time\n", "\n", "history = defaultdict(list)\n", "best_accuracy = 0\n", "\n", "for epoch in range(EPOCHS):\n", "\n", " print(f'Epoch {epoch + 1}/{EPOCHS}')\n", " print('-' * 10)\n", "\n", " train_acc, train_loss = train_epoch(\n", " model,\n", " train_data_loader,\n", " loss_fn,\n", " optimizer,\n", " device,\n", " scheduler,\n", " len(df_train)\n", " )\n", "\n", " print(f'Train loss {train_loss} accuracy {train_acc}')\n", "\n", " val_acc, val_loss = eval_model(\n", " model,\n", " val_data_loader,\n", " loss_fn,\n", " device,\n", " len(df_val)\n", " )\n", "\n", " print(f'Val loss {val_loss} accuracy {val_acc}')\n", " print()\n", "\n", " history['train_acc'].append(train_acc)\n", " history['train_loss'].append(train_loss)\n", " history['val_acc'].append(val_acc)\n", " history['val_loss'].append(val_loss)\n", "\n", " if val_acc > best_accuracy:\n", " torch.save(model.state_dict(), 'best_model_state.bin')\n", " best_accuracy = val_acc" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "4r8-5zWsiVur" }, "source": [ "Note that we're storing the state of the best model, indicated by the highest validation accuracy." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "wLQf52c7fbzr" }, "source": [ "Whoo, this took some time! 
We can look at the training vs validation accuracy:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "DX8VC8xiKaX-", "outputId": "6a3e7fe5-0865-42b7-cc89-5d93d0cf1e15" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[tensor(0.6690, device='cuda:0', dtype=torch.float64), tensor(0.8359, device='cuda:0', dtype=torch.float64), tensor(0.9162, device='cuda:0', dtype=torch.float64), tensor(0.9513, device='cuda:0', dtype=torch.float64), tensor(0.9676, device='cuda:0', dtype=torch.float64), tensor(0.9758, device='cuda:0', dtype=torch.float64)]\n" ] } ], "source": [ "print(history['train_acc'])" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "t9hOVE-XO74-", "outputId": "557e727d-9ab5-4280-a08a-2c877a3dff67" }, "outputs": [ { "data": { "text/plain": [ "[array(0.66904241),\n", " array(0.83586197),\n", " array(0.91623739),\n", " array(0.95130901),\n", " array(0.96760991),\n", " array(0.97579564)]" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_of_train_accuracy = [t.cpu().numpy() for t in history['train_acc']]\n", "list_of_train_accuracy" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dZPHZf3gPccd", "outputId": "807f278a-aa9e-4ecb-d9f7-d4a611990345" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[tensor(0.7789, device='cuda:0', dtype=torch.float64), tensor(0.8374, device='cuda:0', dtype=torch.float64), tensor(0.8399, device='cuda:0', dtype=torch.float64), tensor(0.8640, device='cuda:0', dtype=torch.float64), tensor(0.8666, device='cuda:0', dtype=torch.float64), tensor(0.8666, device='cuda:0', dtype=torch.float64)]\n" ] } ], "source": [ "print(history['val_acc'])" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "id": "G8HIB1IaPP3x", "outputId": "fcf6ac92-a94f-40e9-d77c-6551058a48e6" }, "outputs": [ { "data": { "text/plain": [ "[array(0.77890724),\n", " array(0.83735705),\n", " array(0.83989835),\n", " array(0.86404066),\n", " array(0.86658196),\n", " array(0.86658196)]" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_of_val_accuracy = [t.cpu().numpy() for t in history['val_acc']]\n", "list_of_val_accuracy" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 734 }, "id": "-FWG7kBm372V", "outputId": "e05f378b-3b4e-4747-9b3b-d9c55bafbe1e" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 717, "width": 1017 } }, "output_type": "display_data" } ], "source": [ "plt.plot(list_of_train_accuracy, label='train accuracy')\n", "plt.plot(list_of_val_accuracy, label='validation accuracy')\n", "\n", "plt.title('Training history')\n", "plt.ylabel('Accuracy')\n", "plt.xlabel('Epoch')\n", "plt.legend()\n", "plt.ylim([0, 1]);" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ZsHqkLAuf8pv" }, "source": [ "The training accuracy climbs to about 98% after 6 epochs, while the validation accuracy levels off at around 87%, which suggests the model is starting to overfit. You might try to fine-tune the parameters a bit more, but this will be good enough for us.\n", "\n", "Don't want to wait? Uncomment the next cell to download my pre-trained model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zoGUH8VZ-pPQ" }, "outputs": [], "source": [ "# !gdown --id 1V8itWtowCYnb2Bc9KlK9SxGff9WwmogA\n", "\n", "# model = SentimentClassifier(len(class_names))\n", "# model.load_state_dict(torch.load('best_model_state.bin'))\n", "# model = model.to(device)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "U3HZb3NWFtFf" }, "source": [ "## Evaluation\n", "\n", "So how good is our model at predicting sentiment? Let's start by calculating the accuracy on the test data:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "jS3gJ_qBEljD", "outputId": "67e6e276-fd88-4467-f3cb-270bdfa253fa" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. 
Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.\n", " warnings.warn(_create_warning_msg(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Test Accuracy : 0.8781725888324873\n" ] } ], "source": [ "test_acc, _ = eval_model(\n", "  model,\n", "  test_data_loader,\n", "  loss_fn,\n", "  device,\n", "  len(df_test)\n", ")\n", "\n", "print(('\\n'))\n", "print('Test Accuracy : ', test_acc.item())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "mdQ7-ylCj8Gd" }, "source": [ "The accuracy is about 1% higher on the test set than on the validation set. 
Our model seems to generalize well.\n", "\n", "We'll define a helper function to get the predictions from our model:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "id": "EgR6MuNS8jr_" }, "outputs": [], "source": [ "def get_predictions(model, data_loader):\n", "  model = model.eval()\n", "\n", "  review_texts = []\n", "  predictions = []\n", "  prediction_probs = []\n", "  real_values = []\n", "\n", "  with torch.no_grad():\n", "    for d in data_loader:\n", "\n", "      texts = d[\"review_text\"]\n", "      input_ids = d[\"input_ids\"].to(device)\n", "      attention_mask = d[\"attention_mask\"].to(device)\n", "      targets = d[\"targets\"].to(device)\n", "\n", "      outputs = model(\n", "        input_ids=input_ids,\n", "        attention_mask=attention_mask\n", "      )\n", "      _, preds = torch.max(outputs, dim=1)\n", "\n", "      probs = F.softmax(outputs, dim=1)\n", "\n", "      review_texts.extend(texts)\n", "      predictions.extend(preds)\n", "      prediction_probs.extend(probs)\n", "      real_values.extend(targets)\n", "\n", "  predictions = torch.stack(predictions).cpu()\n", "  prediction_probs = torch.stack(prediction_probs).cpu()\n", "  real_values = torch.stack(real_values).cpu()\n", "  return review_texts, predictions, prediction_probs, real_values" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "dkbnBTI7kd_y" }, "source": [ "This is similar to the evaluation function, except that we're storing the text of the reviews and the predicted probabilities (by applying the softmax on the model outputs):" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zHdPZr60-0c_", "outputId": "1724a178-afeb-4b24-ce1a-91d3a866d777" }, "outputs": [], "source": [ "y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(\n", "  model,\n", "  test_data_loader\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "gVwoVij2lC7F" }, "source": [ "Let's have a look at the classification report:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "L8a9_8-ND3Is", "outputId": "bcafa022-4a17-4dd8-b570-f3dd0fafd16f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "    negative       0.92      0.84      0.88       245\n", "     neutral       0.82      0.87      0.84       254\n", "    positive       0.91      0.91      0.91       289\n", "\n", "    accuracy                           0.88       788\n", "   macro avg       0.88      0.88      0.88       788\n", "weighted avg       0.88      0.88      0.88       788\n", "\n" ] } ], "source": [ "print(classification_report(y_test, y_pred, target_names=class_names))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "rFAekw3mmWUi" }, "source": [ "It looks like neutral (3 stars) reviews are the hardest to classify: they have the lowest precision and F1-score of the three classes. And I can tell you from experience, having looked at many reviews, that those are hard to classify.\n", "\n", "We'll continue with the confusion matrix:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 747 }, "id": "6d1qxsc__DTh", "outputId": "6b579751-1f15-44d5-eb27-10925823974c" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 730, "width": 1008 } }, "output_type": "display_data" } ], "source": [ "def show_confusion_matrix(confusion_matrix):\n", " hmap = sns.heatmap(confusion_matrix, annot=True, fmt=\"d\", cmap=\"Blues\")\n", " hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')\n", " hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')\n", " plt.ylabel('True sentiment')\n", " plt.xlabel('Predicted sentiment');\n", "\n", "cm = confusion_matrix(y_test, y_pred)\n", "df_cm = pd.DataFrame(cm, index=class_names, columns=class_names)\n", "show_confusion_matrix(df_cm)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "wx0U7oNsnZ3A" }, "source": [ "This confirms that our model is having difficulty classifying neutral reviews. It mistakes those for negative and positive at a roughly equal frequency.\n", "\n", "That's a good overview of the performance of our model. But let's have a look at an example from our test data:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "id": "iANBiY3sLo-K" }, "outputs": [], "source": [ "idx = 2\n", "\n", "review_text = y_review_texts[idx]\n", "true_sentiment = y_test[idx]\n", "pred_df = pd.DataFrame({\n", " 'class_names': class_names,\n", " 'values': y_pred_probs[idx]\n", "})" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-8D0rb1yfnv4", "outputId": "ceb8ccc8-9b77-46e1-8865-3ebb692422cb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I used to use Habitica, and I must say this is a great step up. I'd\n", "like to see more social features, such as sharing tasks - only one\n", "person has to perform said task for it to be checked off, but only\n", "giving that person the experience and gold. Otherwise, the price for\n", "subscription is too steep, thus resulting in a sub-perfect score. 
I\n", "could easily justify $0.99/month or eternal subscription for $15. If\n", "that price could be met, as well as fine tuning, this would be easily\n", "worth 5 stars.\n", "\n", "True sentiment: neutral\n" ] } ], "source": [ "print(\"\\n\".join(wrap(review_text)))\n", "print()\n", "print(f'True sentiment: {class_names[true_sentiment]}')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "f7hj_IZFnn2X" }, "source": [ "Now we can look at how confident our model is in each sentiment:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 710 }, "id": "qj4d8lZyMkhf", "outputId": "64d33548-ae81-476d-e30f-664f8bf29726" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 693, "width": 1083 } }, "output_type": "display_data" } ], "source": [ "sns.barplot(x='values', y='class_names', data=pred_df, orient='h')\n", "plt.ylabel('sentiment')\n", "plt.xlabel('probability')\n", "plt.xlim([0, 1]);" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "7WL5pDmvFyaU" }, "source": [ "### Predicting on Raw Text\n", "\n", "Let's use our model to predict the sentiment of some raw text:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "id": "QEPi7zQRsDhH" }, "outputs": [], "source": [ "review_text = \"I love completing my todos! Best app ever!!!\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "GaN4RnqMnxYw" }, "source": [ "We have to use the tokenizer to encode the text:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zA5Or4D2sLc9", "outputId": "10b0fd44-4818-4567-b98a-74bed7aa44ed" }, "outputs": [], "source": [ "encoded_review = tokenizer.encode_plus(\n", "  review_text,\n", "  max_length=MAX_LEN,\n", "  add_special_tokens=True,\n", "  return_token_type_ids=False,\n", "  padding='max_length',\n", "  truncation=True,\n", "  return_attention_mask=True,\n", "  return_tensors='pt',\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "et8xlDrKpH60" }, "source": [ "Let's get the predictions from our model:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Qr_t3rUksumr", "outputId": "a7419338-d63e-40ff-ba9d-a7446d1362bb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Review text: I love completing my todos! Best app ever!!!\n", "Sentiment : positive\n" ] } ], "source": [ "input_ids = encoded_review['input_ids'].to(device)\n", "attention_mask = encoded_review['attention_mask'].to(device)\n", "\n", "output = model(input_ids, attention_mask)\n", "_, prediction = torch.max(output, dim=1)\n", "\n", "print(f'Review text: {review_text}')\n", "print(f'Sentiment : {class_names[prediction]}')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "PVhwzq7bpPRl" }, "source": [ "## Summary\n", "\n", "Nice job! You learned how to use BERT for sentiment analysis. 
You built a custom classifier using the Hugging Face library and trained it on our app reviews dataset!\n", "\n", "- [Read the tutorial](https://www.curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/)\n", "- [Run the notebook in your browser (Google Colab)](https://colab.research.google.com/drive/1PHv-IRLPCtv7oTcIGbsgZHqrB5LPvB7S)\n", "- [Read the `Getting Things Done with Pytorch` book](https://github.com/curiousily/Getting-Things-Done-with-Pytorch)\n", "\n", "You learned how to:\n", "\n", "- Intuitively understand what BERT is\n", "- Preprocess text data for BERT and build a PyTorch Dataset (tokenization, attention masks, and padding)\n", "- Use Transfer Learning to build a Sentiment Classifier using the Transformers library by Hugging Face\n", "- Evaluate the model on test data\n", "- Predict sentiment on raw text\n", "\n", "Next, we'll learn how to deploy our trained model behind a REST API and build a simple web app to access it." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Wf39tauBa2V2" }, "source": [ "## References\n", "\n", "- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)\n", "- [L11 Language Models - Alec Radford (OpenAI)](https://www.youtube.com/watch?v=BnpB3GrpsfM)\n", "- [The Illustrated BERT, ELMo, and co.](https://jalammar.github.io/illustrated-bert/)\n", "- [BERT Fine-Tuning Tutorial with PyTorch](https://mccormickml.com/2019/07/22/BERT-fine-tuning/)\n", "- [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf)\n", "- [Hugging Face Transformers](https://huggingface.co/transformers/)\n", "- [BERT Explained: State of the art language model for NLP](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": 
"python3" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "0a3631a4ad9c49909c63b9cabdd926d8": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "0c85e11e27d6484ea31b766aecf0a06e": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "300407d9c525447cbc2223a30c17deac": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", 
"_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "30cc769f98f9447eae28e7060c943d6f": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_ce8cb11f151b45a69431a11d0789bb93", "placeholder": "​", "style": "IPY_MODEL_6652821599284ab3bcb86f8dfee60da5", "value": " 29.0/29.0 [00:00<00:00, 576B/s]" } }, "34bbd61a5c844010bbd47539fcf2abc6": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "3dac470fee3c4f9eb8bebb1992c19688": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": 
"DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "3ed9609b1b804e079e548c6e87410911": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_5f9e170e18d8413ea9811e3cd501fcb7", "placeholder": "​", "style": "IPY_MODEL_c65deba3b9c1414eb489317f2fb18b12", "value": "Downloading (…)lve/main/config.json: 100%" } }, "44b33027973c41d69730f66aeef675ef": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "5968e9e39f1b466dbacf6cece45d7ee2": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_5b828d7f8bd747619cddc4b28d1d5383", "IPY_MODEL_c8135ded16e24ec1bb3ab6b26aa1e01d", "IPY_MODEL_cdbbeea71cf1478a93df34da79abb1e1" ], "layout": 
"IPY_MODEL_acd7b4228d984976b0aa16eda3a74e2a" } }, "5b828d7f8bd747619cddc4b28d1d5383": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_a5aaec5b93e04200bf59024dd6497283", "placeholder": "​", "style": "IPY_MODEL_300407d9c525447cbc2223a30c17deac", "value": "Downloading (…)solve/main/vocab.txt: 100%" } }, "5d54820c46e44b54aaeb002e1054d495": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_34bbd61a5c844010bbd47539fcf2abc6", "max": 29, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_b389a069f9aa433b97f81043b5dc573a", "value": 29 } }, "5f9e170e18d8413ea9811e3cd501fcb7": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, 
"grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "62b3f7375b0243239eed24bd9b342939": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "648c9e36f406464b8a5be4d19a69a4e7": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", 
"_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "6652821599284ab3bcb86f8dfee60da5": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "6f54d4cafe384c67a459065119006b4e": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "7022e627b3934baababd16449a13ab16": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": 
"FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_b7336a3c03804a1889bd4539e8574100", "max": 570, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_44b33027973c41d69730f66aeef675ef", "value": 570 } }, "7532b67ceaf74f6c8263ae7185a7b04d": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "7fb2d01f05914c7b97e58dc877c67c62": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ 
"IPY_MODEL_3ed9609b1b804e079e548c6e87410911", "IPY_MODEL_7022e627b3934baababd16449a13ab16", "IPY_MODEL_b376b41872cd490e8bcd7d64db66941a" ], "layout": "IPY_MODEL_0c85e11e27d6484ea31b766aecf0a06e" } }, "841966fa79134f75a1fed6db97f34dc9": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_62b3f7375b0243239eed24bd9b342939", "placeholder": "​", "style": "IPY_MODEL_c23dcfab453442b4b41a90c2f096302d", "value": "Downloading (…)okenizer_config.json: 100%" } }, "8432242e44044eb48b132453b519947e": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_924c0e3b25054a68b37090823bd07a96", "IPY_MODEL_fa802e4f8a3c4d20b0fabeaeaab7ee18", "IPY_MODEL_d48ccf851db0492d805f2ac9342839b4" ], "layout": "IPY_MODEL_7532b67ceaf74f6c8263ae7185a7b04d" } }, "90ececf282a94f6ca593d361ad6d9212": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "924c0e3b25054a68b37090823bd07a96": { "model_module": 
"@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_fdd1cbca5dd74a7e9c91d15cb6abd7ce", "placeholder": "​", "style": "IPY_MODEL_b3940458f5134c4a8a16432113ef2561", "value": "Downloading model.safetensors: 100%" } }, "9ff5213e4251453b92de1eeeb9c7acb6": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "a364d729ab9342dfa53bb39e8f068688": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": 
"ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "a5aaec5b93e04200bf59024dd6497283": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "a7733a2a0059411db9f7edb5e4dd1a92": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, 
"grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "acd7b4228d984976b0aa16eda3a74e2a": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "b376b41872cd490e8bcd7d64db66941a": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": 
"1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_9ff5213e4251453b92de1eeeb9c7acb6", "placeholder": "​", "style": "IPY_MODEL_0a3631a4ad9c49909c63b9cabdd926d8", "value": " 570/570 [00:00<00:00, 13.0kB/s]" } }, "b389a069f9aa433b97f81043b5dc573a": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "b3940458f5134c4a8a16432113ef2561": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "b7336a3c03804a1889bd4539e8574100": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, 
"margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "c23dcfab453442b4b41a90c2f096302d": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "c5c2e50239a44566b376e233b36e1220": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_841966fa79134f75a1fed6db97f34dc9", "IPY_MODEL_5d54820c46e44b54aaeb002e1054d495", "IPY_MODEL_30cc769f98f9447eae28e7060c943d6f" ], "layout": "IPY_MODEL_a7733a2a0059411db9f7edb5e4dd1a92" } }, "c65deba3b9c1414eb489317f2fb18b12": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "c8135ded16e24ec1bb3ab6b26aa1e01d": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], 
"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_f59117f9fb414783b339226908a3d32f", "max": 213450, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_6f54d4cafe384c67a459065119006b4e", "value": 213450 } }, "cdbbeea71cf1478a93df34da79abb1e1": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_fabf9140688d49d99dd799a2b364ec27", "placeholder": "​", "style": "IPY_MODEL_3dac470fee3c4f9eb8bebb1992c19688", "value": " 213k/213k [00:00<00:00, 1.30MB/s]" } }, "ce8cb11f151b45a69431a11d0789bb93": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": 
null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "d48ccf851db0492d805f2ac9342839b4": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_e8592109366842438c9295552aac501a", "placeholder": "​", "style": "IPY_MODEL_90ececf282a94f6ca593d361ad6d9212", "value": " 436M/436M [00:01<00:00, 204MB/s]" } }, "e8592109366842438c9295552aac501a": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, 
"right": null, "top": null, "visibility": null, "width": null } }, "f59117f9fb414783b339226908a3d32f": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "fa802e4f8a3c4d20b0fabeaeaab7ee18": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_648c9e36f406464b8a5be4d19a69a4e7", "max": 435755784, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_a364d729ab9342dfa53bb39e8f068688", "value": 435755784 } }, "fabf9140688d49d99dd799a2b364ec27": { "model_module": "@jupyter-widgets/base", 
"model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "fdd1cbca5dd74a7e9c91d15cb6abd7ce": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, 
"object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } } } } }, "nbformat": 4, "nbformat_minor": 0 }