{ "cells": [ { "cell_type": "markdown", "id": "ece9e8d9", "metadata": { "papermill": { "duration": 0.102825, "end_time": "2022-05-16T23:18:18.905047", "exception": false, "start_time": "2022-05-16T23:18:18.802222", "status": "completed" }, "tags": [] }, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "0407b5a9", "metadata": { "papermill": { "duration": 0.098538, "end_time": "2022-05-16T23:18:19.107274", "exception": false, "start_time": "2022-05-16T23:18:19.008736", "status": "completed" }, "tags": [] }, "source": [ "One area where deep learning has dramatically improved in the last couple of years is natural language processing (NLP). Computers can now generate text, translate automatically from one language to another, analyze comments, label words in sentences, and much more.\n", "\n", "Perhaps the most widely useful practical application of NLP is *classification* -- that is, classifying a document automatically into some category. This can be used, for instance, for:\n", "\n", "- Sentiment analysis (e.g., are people saying *positive* or *negative* things about your product?)\n", "- Author identification (what author most likely wrote some document)\n", "- Legal discovery (which documents are in scope for a trial)\n", "- Organizing documents by topic\n", "- Triaging inbound emails\n", "- ...and much more!\n", "\n", "Classification models can also be used to solve problems that are not, at first, obviously appropriate. For instance, consider the Kaggle [U.S. Patent Phrase to Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/) competition. In this, we are tasked with comparing two words or short phrases, and scoring how similar they are, given the patent class they were used in. A score of `1` means the two inputs have identical meaning, and `0` means they have totally different meanings. 
For instance, *abatement* and *eliminating process* have a score of `0.5`, meaning they're somewhat similar, but not identical.\n", "\n", "It turns out that this can be represented as a classification problem. How? By representing the question like this:\n", "\n", "> For the following text...: \"TEXT1: abatement; TEXT2: eliminating process\" ...choose a category of meaning similarity: \"Different; Similar; Identical\".\n", "\n", "In this notebook we'll see how to solve the Patent Phrase Matching problem by treating it as a classification task, by representing it in a very similar way to that shown above." ] }, { "cell_type": "markdown", "id": "72aef3e5", "metadata": { "papermill": { "duration": 0.096761, "end_time": "2022-05-16T23:18:19.299885", "exception": false, "start_time": "2022-05-16T23:18:19.203124", "status": "completed" }, "tags": [] }, "source": [ "### On Kaggle" ] }, { "cell_type": "markdown", "id": "d7ba1b00", "metadata": { "papermill": { "duration": 0.095382, "end_time": "2022-05-16T23:18:19.492836", "exception": false, "start_time": "2022-05-16T23:18:19.397454", "status": "completed" }, "tags": [] }, "source": [ "Kaggle is an awesome resource for aspiring data scientists or anyone looking to improve their machine learning skills. There is nothing like being able to get hands-on practice and receiving real-time feedback to help you improve your skills. It provides:\n", "\n", "1. Interesting data sets\n", "1. Feedback on how you're doing\n", "1. A leaderboard to see what's good, what's possible, and what's state of the art\n", "1. Notebooks and blog posts by winning contestants that share useful tips and techniques.\n", "\n", "The dataset we will be using here is only available from Kaggle. Therefore, you will need to register on the site, then go to the [page for the competition](https://www.kaggle.com/c/us-patent-phrase-to-phrase-matching). 
On that page click \"Rules,\" then \"I Understand and Accept.\" (Although the competition has finished, and you will not be entering it, you still have to agree to the rules to be allowed to download the data.)\n", "\n", "There are then two ways to use this data:\n", "\n", "- Easiest: run this notebook directly on Kaggle, or\n", "- Most flexible: download the data locally and run it on your PC or GPU server\n", "\n", "If you are running this on Kaggle.com, you can skip the next section. Just make sure you've selected a GPU for your session, by clicking on the menu (the three dots in the top right) and clicking \"Accelerator\" -- it should look like this:" ] }, { "attachments": { "9af4e875-1f2a-468c-b233-8c91531e4c40.png": { "image/png": "" }, "image.png": { "image/png": "" } }, "cell_type": "markdown", "id": "f7de75fb", "metadata": { "papermill": { "duration": 0.095452, "end_time": "2022-05-16T23:18:19.683195", "exception": false, "start_time": "2022-05-16T23:18:19.587743", "status": "completed" }, "tags": [] }, "source": [ "![image.png](attachment:9af4e875-1f2a-468c-b233-8c91531e4c40.png)!" 
] }, { "cell_type": "markdown", "id": "bfd7f80a", "metadata": { "papermill": { "duration": 0.094882, "end_time": "2022-05-16T23:18:19.873521", "exception": false, "start_time": "2022-05-16T23:18:19.778639", "status": "completed" }, "tags": [] }, "source": [ "We'll need slightly different code depending on whether we're running on Kaggle or not, so we'll use this variable to track where we are:" ] }, { "cell_type": "code", "execution_count": 1, "id": "a2fd6427", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:20.069718Z", "iopub.status.busy": "2022-05-16T23:18:20.068198Z", "iopub.status.idle": "2022-05-16T23:18:20.077629Z", "shell.execute_reply": "2022-05-16T23:18:20.078082Z", "shell.execute_reply.started": "2022-04-19T22:50:15.58802Z" }, "papermill": { "duration": 0.110699, "end_time": "2022-05-16T23:18:20.078308", "exception": false, "start_time": "2022-05-16T23:18:19.967609", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import os\n", "iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')" ] }, { "cell_type": "markdown", "id": "48759fa0", "metadata": { "papermill": { "duration": 0.088837, "end_time": "2022-05-16T23:18:20.263924", "exception": false, "start_time": "2022-05-16T23:18:20.175087", "status": "completed" }, "tags": [] }, "source": [ "### Using Kaggle data on your own machine" ] }, { "cell_type": "markdown", "id": "8b1ca433", "metadata": { "papermill": { "duration": 0.15651, "end_time": "2022-05-16T23:18:20.564625", "exception": false, "start_time": "2022-05-16T23:18:20.408115", "status": "completed" }, "tags": [] }, "source": [ "Kaggle limits your weekly time using a GPU machine. The limits are very generous, but you may well still find it's not enough! In that case, you'll want to use your own GPU server, or a cloud server such as Colab, Paperspace Gradient, or SageMaker Studio Lab (all of which have free options). 
To do so, you'll need to be able to download Kaggle datasets.\n", "\n", "The easiest way to download Kaggle datasets is to use the Kaggle API. You can install this using `pip` by running this in a notebook cell:\n", "\n", "    !pip install kaggle\n", "\n", "You need an API key to use the Kaggle API; to get one, click on your profile picture on the Kaggle website, and choose My Account, then click Create New API Token. This will save a file called *kaggle.json* to your PC. You need to copy this key to your GPU server. To do so, open the file you downloaded, copy the contents, and paste them in the following cell (e.g., `creds = '{\"username\":\"xxx\",\"key\":\"xxx\"}'`):" ] }, { "cell_type": "code", "execution_count": 2, "id": "6c52e24b", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:20.878123Z", "iopub.status.busy": "2022-05-16T23:18:20.877241Z", "iopub.status.idle": "2022-05-16T23:18:20.879522Z", "shell.execute_reply": "2022-05-16T23:18:20.878859Z", "shell.execute_reply.started": "2022-04-19T22:50:15.619534Z" }, "papermill": { "duration": 0.161714, "end_time": "2022-05-16T23:18:20.879673", "exception": false, "start_time": "2022-05-16T23:18:20.717959", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "creds = ''" ] }, { "cell_type": "markdown", "id": "f73d2a77", "metadata": { "papermill": { "duration": 0.091829, "end_time": "2022-05-16T23:18:21.075422", "exception": false, "start_time": "2022-05-16T23:18:20.983593", "status": "completed" }, "tags": [] }, "source": [ "Then execute this cell (this only needs to be run once):" ] }, { "cell_type": "code", "execution_count": 3, "id": "41a7fe66", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:21.273100Z", "iopub.status.busy": "2022-05-16T23:18:21.272324Z", "iopub.status.idle": "2022-05-16T23:18:21.275167Z", "shell.execute_reply": "2022-05-16T23:18:21.274735Z", "shell.execute_reply.started": "2022-04-19T22:50:15.625454Z" }, "papermill": { "duration": 0.105467, 
"end_time": "2022-05-16T23:18:21.275293", "exception": false, "start_time": "2022-05-16T23:18:21.169826", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# for working with paths in Python, I recommend using `pathlib.Path`\n", "from pathlib import Path\n", "\n", "cred_path = Path('~/.kaggle/kaggle.json').expanduser()\n", "if not cred_path.exists():\n", " cred_path.parent.mkdir(exist_ok=True)\n", " cred_path.write_text(creds)\n", " cred_path.chmod(0o600)" ] }, { "cell_type": "markdown", "id": "8cc398d8", "metadata": { "papermill": { "duration": 0.096152, "end_time": "2022-05-16T23:18:21.470375", "exception": false, "start_time": "2022-05-16T23:18:21.374223", "status": "completed" }, "tags": [] }, "source": [ "Now you can download datasets from Kaggle." ] }, { "cell_type": "code", "execution_count": 4, "id": "f9240dd6", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:21.669472Z", "iopub.status.busy": "2022-05-16T23:18:21.668423Z", "iopub.status.idle": "2022-05-16T23:18:21.671180Z", "shell.execute_reply": "2022-05-16T23:18:21.670701Z", "shell.execute_reply.started": "2022-04-19T22:50:15.636168Z" }, "papermill": { "duration": 0.104227, "end_time": "2022-05-16T23:18:21.671310", "exception": false, "start_time": "2022-05-16T23:18:21.567083", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "path = Path('us-patent-phrase-to-phrase-matching')" ] }, { "cell_type": "markdown", "id": "d3a9b31d", "metadata": { "papermill": { "duration": 0.097029, "end_time": "2022-05-16T23:18:21.865372", "exception": false, "start_time": "2022-05-16T23:18:21.768343", "status": "completed" }, "tags": [] }, "source": [ "And use the Kaggle API to download the dataset to that path, and extract it:" ] }, { "cell_type": "code", "execution_count": 5, "id": "76cd493d", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:22.063705Z", "iopub.status.busy": "2022-05-16T23:18:22.062698Z", "iopub.status.idle": 
"2022-05-16T23:18:22.065286Z", "shell.execute_reply": "2022-05-16T23:18:22.064712Z", "shell.execute_reply.started": "2022-04-19T22:50:15.645742Z" }, "papermill": { "duration": 0.105373, "end_time": "2022-05-16T23:18:22.065424", "exception": false, "start_time": "2022-05-16T23:18:21.960051", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "if not iskaggle and not path.exists():\n", " import zipfile,kaggle\n", " kaggle.api.competition_download_cli(str(path))\n", " zipfile.ZipFile(f'{path}.zip').extractall(path)" ] }, { "cell_type": "markdown", "id": "1d234710", "metadata": { "papermill": { "duration": 0.096966, "end_time": "2022-05-16T23:18:22.259134", "exception": false, "start_time": "2022-05-16T23:18:22.162168", "status": "completed" }, "tags": [] }, "source": [ "Note that you can easily download notebooks from Kaggle and upload them to other cloud services. So if you're low on Kaggle GPU credits, give this a try!" ] }, { "cell_type": "markdown", "id": "04d27700", "metadata": { "papermill": { "duration": 0.096057, "end_time": "2022-05-16T23:18:22.452686", "exception": false, "start_time": "2022-05-16T23:18:22.356629", "status": "completed" }, "tags": [] }, "source": [ "## Import and EDA" ] }, { "cell_type": "code", "execution_count": 6, "id": "e7d21aca", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:22.653940Z", "iopub.status.busy": "2022-05-16T23:18:22.653136Z", "iopub.status.idle": "2022-05-16T23:18:31.317348Z", "shell.execute_reply": "2022-05-16T23:18:31.316520Z", "shell.execute_reply.started": "2022-04-19T22:50:15.653623Z" }, "papermill": { "duration": 8.767461, "end_time": "2022-05-16T23:18:31.317539", "exception": false, "start_time": "2022-05-16T23:18:22.550078", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\r\n" ] } ], "source": [ "if iskaggle:\n", " path = Path('../input/us-patent-phrase-to-phrase-matching')\n", " ! pip install -q datasets" ] }, { "cell_type": "markdown", "id": "e33c3324", "metadata": { "papermill": { "duration": 0.148707, "end_time": "2022-05-16T23:18:31.616056", "exception": false, "start_time": "2022-05-16T23:18:31.467349", "status": "completed" }, "tags": [] }, "source": [ "Documents in NLP datasets are generally in one of two main forms:\n", "\n", "- **Larger documents**: One text file per document, often organised into one folder per category\n", "- **Smaller documents**: One document (or document pair, optionally with metadata) per row in a [CSV file](https://realpython.com/python-csv/).\n", "\n", "Let's look at our data and see what we've got. In Jupyter you can run any bash/shell command by starting a line with a `!`, and use `{}` to include Python variables, like so:" ] }, { "cell_type": "code", "execution_count": 7, "id": "abd6e692", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:31.900273Z", "iopub.status.busy": "2022-05-16T23:18:31.899522Z", "iopub.status.idle": "2022-05-16T23:18:32.550557Z", "shell.execute_reply": "2022-05-16T23:18:32.550032Z", "shell.execute_reply.started": "2022-04-19T22:50:24.320172Z" }, "papermill": { "duration": 0.789889, "end_time": "2022-05-16T23:18:32.550692", "exception": false, "start_time": "2022-05-16T23:18:31.760803", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sample_submission.csv test.csv train.csv\r\n" ] } ], "source": [ "!ls {path}" ] }, { "cell_type": "markdown", "id": "925425fe", "metadata": { "papermill": { "duration": 0.086033, "end_time": "2022-05-16T23:18:32.724591", "exception": false, "start_time": "2022-05-16T23:18:32.638558", "status": "completed" }, "tags": [] }, "source": [ "It looks like this competition 
uses CSV files. For opening, manipulating, and viewing CSV files, it's generally best to use the pandas library, which is explained brilliantly in [this book](https://wesmckinney.com/book/) by the lead developer (it's also an excellent introduction to matplotlib and numpy, both of which I use in this notebook). Generally it's imported as the abbreviation `pd`." ] }, { "cell_type": "code", "execution_count": 8, "id": "1ccad14f", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:32.903526Z", "iopub.status.busy": "2022-05-16T23:18:32.902787Z", "iopub.status.idle": "2022-05-16T23:18:32.904904Z", "shell.execute_reply": "2022-05-16T23:18:32.905273Z", "shell.execute_reply.started": "2022-04-19T22:50:25.029375Z" }, "papermill": { "duration": 0.094094, "end_time": "2022-05-16T23:18:32.905420", "exception": false, "start_time": "2022-05-16T23:18:32.811326", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "b8f0f88b", "metadata": { "papermill": { "duration": 0.088323, "end_time": "2022-05-16T23:18:33.079444", "exception": false, "start_time": "2022-05-16T23:18:32.991121", "status": "completed" }, "tags": [] }, "source": [ "Now let's read the training data from our path:" ] }, { "cell_type": "code", "execution_count": 9, "id": "410bf8a8", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:33.257861Z", "iopub.status.busy": "2022-05-16T23:18:33.257213Z", "iopub.status.idle": "2022-05-16T23:18:33.338741Z", "shell.execute_reply": "2022-05-16T23:18:33.338223Z", "shell.execute_reply.started": "2022-04-19T22:50:25.036197Z" }, "papermill": { "duration": 0.173142, "end_time": "2022-05-16T23:18:33.338883", "exception": false, "start_time": "2022-05-16T23:18:33.165741", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "df = pd.read_csv(path/'train.csv')" ] }, { "cell_type": "markdown", "id": "121e2210", "metadata": { "papermill": { "duration": 0.087101, "end_time": 
"2022-05-16T23:18:33.513985", "exception": false, "start_time": "2022-05-16T23:18:33.426884", "status": "completed" }, "tags": [] }, "source": [ "This creates a [DataFrame](https://pandas.pydata.org/docs/user_guide/10min.html), which is a table of named columns, a bit like a database table. To view the first and last rows, and row count of a DataFrame, just type its name:" ] }, { "cell_type": "code", "execution_count": 10, "id": "4b298ff1", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:33.700493Z", "iopub.status.busy": "2022-05-16T23:18:33.699888Z", "iopub.status.idle": "2022-05-16T23:18:33.715083Z", "shell.execute_reply": "2022-05-16T23:18:33.715531Z", "shell.execute_reply.started": "2022-04-19T22:50:25.122204Z" }, "papermill": { "duration": 0.114426, "end_time": "2022-05-16T23:18:33.715685", "exception": false, "start_time": "2022-05-16T23:18:33.601259", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>id</th><th>anchor</th><th>target</th><th>context</th><th>score</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>37d61fd2272659b1</td><td>abatement</td><td>abatement of pollution</td><td>A47</td><td>0.50</td></tr>\n",
" <tr><th>1</th><td>7b9652b17b68b7a4</td><td>abatement</td><td>act of abating</td><td>A47</td><td>0.75</td></tr>\n",
" <tr><th>2</th><td>36d72442aefd8232</td><td>abatement</td><td>active catalyst</td><td>A47</td><td>0.25</td></tr>\n",
" <tr><th>3</th><td>5296b0c19e1ce60e</td><td>abatement</td><td>eliminating process</td><td>A47</td><td>0.50</td></tr>\n",
" <tr><th>4</th><td>54c1e3b9184cb5b6</td><td>abatement</td><td>forest region</td><td>A47</td><td>0.00</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>36468</th><td>8e1386cbefd7f245</td><td>wood article</td><td>wooden article</td><td>B44</td><td>1.00</td></tr>\n",
" <tr><th>36469</th><td>42d9e032d1cd3242</td><td>wood article</td><td>wooden box</td><td>B44</td><td>0.50</td></tr>\n",
" <tr><th>36470</th><td>208654ccb9e14fa3</td><td>wood article</td><td>wooden handle</td><td>B44</td><td>0.50</td></tr>\n",
" <tr><th>36471</th><td>756ec035e694722b</td><td>wood article</td><td>wooden material</td><td>B44</td><td>0.75</td></tr>\n",
" <tr><th>36472</th><td>8d135da0b55b8c88</td><td>wood article</td><td>wooden substrate</td><td>B44</td><td>0.50</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>36473 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " id anchor target context score\n", "0 37d61fd2272659b1 abatement abatement of pollution A47 0.50\n", "1 7b9652b17b68b7a4 abatement act of abating A47 0.75\n", "2 36d72442aefd8232 abatement active catalyst A47 0.25\n", "3 5296b0c19e1ce60e abatement eliminating process A47 0.50\n", "4 54c1e3b9184cb5b6 abatement forest region A47 0.00\n", "... ... ... ... ... ...\n", "36468 8e1386cbefd7f245 wood article wooden article B44 1.00\n", "36469 42d9e032d1cd3242 wood article wooden box B44 0.50\n", "36470 208654ccb9e14fa3 wood article wooden handle B44 0.50\n", "36471 756ec035e694722b wood article wooden material B44 0.75\n", "36472 8d135da0b55b8c88 wood article wooden substrate B44 0.50\n", "\n", "[36473 rows x 5 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "14efc19d", "metadata": { "papermill": { "duration": 0.087104, "end_time": "2022-05-16T23:18:33.890429", "exception": false, "start_time": "2022-05-16T23:18:33.803325", "status": "completed" }, "tags": [] }, "source": [ "It's important to carefully read the [dataset description](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data) to understand how each of these columns is used.\n", "\n", "One of the most useful features of `DataFrame` is the `describe()` method:" ] }, { "cell_type": "code", "execution_count": 11, "id": "21982274", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:34.098355Z", "iopub.status.busy": "2022-05-16T23:18:34.097454Z", "iopub.status.idle": "2022-05-16T23:18:34.145806Z", "shell.execute_reply": "2022-05-16T23:18:34.146201Z", "shell.execute_reply.started": "2022-04-19T22:50:25.149735Z" }, "papermill": { "duration": 0.16831, "end_time": "2022-05-16T23:18:34.146345", "exception": false, "start_time": "2022-05-16T23:18:33.978035", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>id</th><th>anchor</th><th>target</th><th>context</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>count</th><td>36473</td><td>36473</td><td>36473</td><td>36473</td></tr>\n",
" <tr><th>unique</th><td>36473</td><td>733</td><td>29340</td><td>106</td></tr>\n",
" <tr><th>top</th><td>37d61fd2272659b1</td><td>component composite coating</td><td>composition</td><td>H01</td></tr>\n",
" <tr><th>freq</th><td>1</td><td>152</td><td>24</td><td>2186</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id anchor target context\n", "count 36473 36473 36473 36473\n", "unique 36473 733 29340 106\n", "top 37d61fd2272659b1 component composite coating composition H01\n", "freq 1 152 24 2186" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe(include='object')" ] }, { "cell_type": "markdown", "id": "625b4e8b", "metadata": { "papermill": { "duration": 0.08891, "end_time": "2022-05-16T23:18:34.323267", "exception": false, "start_time": "2022-05-16T23:18:34.234357", "status": "completed" }, "tags": [] }, "source": [ "We can see that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with \"component composite coating\" for instance appearing 152 times.\n", "\n", "Earlier, I suggested we could represent the input to the model as something like \"*TEXT1: abatement; TEXT2: eliminating process*\". We'll need to add the context to this too. In Pandas, we just use `+` to concatenate, like so:" ] }, { "cell_type": "code", "execution_count": 12, "id": "d950dfad", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:34.525187Z", "iopub.status.busy": "2022-05-16T23:18:34.509818Z", "iopub.status.idle": "2022-05-16T23:18:34.535058Z", "shell.execute_reply": "2022-05-16T23:18:34.534144Z", "shell.execute_reply.started": "2022-04-19T22:50:25.226549Z" }, "papermill": { "duration": 0.123242, "end_time": "2022-05-16T23:18:34.535176", "exception": false, "start_time": "2022-05-16T23:18:34.411934", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor" ] }, { "cell_type": "markdown", "id": "7e74bcd3", "metadata": { "papermill": { "duration": 0.088484, "end_time": "2022-05-16T23:18:34.712258", "exception": false, "start_time": "2022-05-16T23:18:34.623774", "status": "completed" }, "tags": [] }, "source": [ "We can refer to a column 
(also known as a *series*) either with regular Python \"dotted\" notation, or by indexing it like a dictionary. To get the first few rows, use `head()`:" ] }, { "cell_type": "code", "execution_count": 13, "id": "c9bbe94c", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:34.896887Z", "iopub.status.busy": "2022-05-16T23:18:34.896259Z", "iopub.status.idle": "2022-05-16T23:18:34.899014Z", "shell.execute_reply": "2022-05-16T23:18:34.899432Z", "shell.execute_reply.started": "2022-04-19T22:50:25.25829Z" }, "papermill": { "duration": 0.09874, "end_time": "2022-05-16T23:18:34.899569", "exception": false, "start_time": "2022-05-16T23:18:34.800829", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 TEXT1: A47; TEXT2: abatement of pollution; ANC...\n", "1 TEXT1: A47; TEXT2: act of abating; ANC1: abate...\n", "2 TEXT1: A47; TEXT2: active catalyst; ANC1: abat...\n", "3 TEXT1: A47; TEXT2: eliminating process; ANC1: ...\n", "4 TEXT1: A47; TEXT2: forest region; ANC1: abatement\n", "Name: input, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.input.head()" ] }, { "cell_type": "markdown", "id": "528ae2cf", "metadata": { "papermill": { "duration": 0.090197, "end_time": "2022-05-16T23:18:35.078246", "exception": false, "start_time": "2022-05-16T23:18:34.988049", "status": "completed" }, "tags": [] }, "source": [ "## Tokenization" ] }, { "cell_type": "markdown", "id": "ff7d7a2a", "metadata": { "papermill": { "duration": 0.08786, "end_time": "2022-05-16T23:18:35.254344", "exception": false, "start_time": "2022-05-16T23:18:35.166484", "status": "completed" }, "tags": [] }, "source": [ "Transformers works with a `Dataset` object (from the Hugging Face `datasets` library) for storing a... well a dataset, of course! 
We can create one like so:" ] }, { "cell_type": "code", "execution_count": 14, "id": "46fe2b83", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:35.438352Z", "iopub.status.busy": "2022-05-16T23:18:35.437675Z", "iopub.status.idle": "2022-05-16T23:18:37.535196Z", "shell.execute_reply": "2022-05-16T23:18:37.534706Z", "shell.execute_reply.started": "2022-04-19T22:50:25.267906Z" }, "papermill": { "duration": 2.190631, "end_time": "2022-05-16T23:18:37.535330", "exception": false, "start_time": "2022-05-16T23:18:35.344699", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "from datasets import Dataset,DatasetDict\n", "\n", "ds = Dataset.from_pandas(df)" ] }, { "cell_type": "markdown", "id": "2a9712b2", "metadata": { "papermill": { "duration": 0.094487, "end_time": "2022-05-16T23:18:37.724881", "exception": false, "start_time": "2022-05-16T23:18:37.630394", "status": "completed" }, "tags": [] }, "source": [ "Here's how it's displayed in a notebook:" ] }, { "cell_type": "code", "execution_count": 15, "id": "26089735", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:37.917875Z", "iopub.status.busy": "2022-05-16T23:18:37.916717Z", "iopub.status.idle": "2022-05-16T23:18:37.920232Z", "shell.execute_reply": "2022-05-16T23:18:37.919832Z", "shell.execute_reply.started": "2022-04-19T22:50:27.331754Z" }, "papermill": { "duration": 0.097799, "end_time": "2022-05-16T23:18:37.920352", "exception": false, "start_time": "2022-05-16T23:18:37.822553", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Dataset({\n", " features: ['id', 'anchor', 'target', 'context', 'score', 'input'],\n", " num_rows: 36473\n", "})" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds" ] }, { "cell_type": "markdown", "id": "73bff7b1", "metadata": { "papermill": { "duration": 0.090915, "end_time": "2022-05-16T23:18:38.100934", "exception": false, "start_time": 
"2022-05-16T23:18:38.010019", "status": "completed" }, "tags": [] }, "source": [ "But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:\n", "\n", "- *Tokenization*: Split each text up into words (or actually, as we'll see, into *tokens*)\n", "- *Numericalization*: Convert each word (or token) into a number.\n", "\n", "The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this (replace \"small\" with \"large\" for a slower but more accurate model, once you've finished exploring):" ] }, { "cell_type": "code", "execution_count": 16, "id": "94f04956", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:38.291309Z", "iopub.status.busy": "2022-05-16T23:18:38.290399Z", "iopub.status.idle": "2022-05-16T23:18:38.292739Z", "shell.execute_reply": "2022-05-16T23:18:38.292199Z", "shell.execute_reply.started": "2022-04-19T22:50:27.345436Z" }, "papermill": { "duration": 0.103204, "end_time": "2022-05-16T23:18:38.292884", "exception": false, "start_time": "2022-05-16T23:18:38.189680", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "model_nm = 'microsoft/deberta-v3-small'" ] }, { "cell_type": "markdown", "id": "f8a437e9", "metadata": { "papermill": { "duration": 0.08842, "end_time": "2022-05-16T23:18:38.471586", "exception": false, "start_time": "2022-05-16T23:18:38.383166", "status": "completed" }, "tags": [] }, "source": [ "`AutoTokenizer` will create a tokenizer appropriate for a given model:" ] }, { "cell_type": "code", "execution_count": 17, "id": "9a7191a0", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:38.655835Z", "iopub.status.busy": "2022-05-16T23:18:38.655114Z", "iopub.status.idle": "2022-05-16T23:18:44.932165Z", "shell.execute_reply": 
"2022-05-16T23:18:44.931194Z", "shell.execute_reply.started": "2022-04-19T22:50:27.353989Z" }, "papermill": { "duration": 6.371289, "end_time": "2022-05-16T23:18:44.932308", "exception": false, "start_time": "2022-05-16T23:18:38.561019", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ac8ba7b571384940bb57dadacf3cdd90", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/52.0 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
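"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>id</th><th>anchor</th><th>target</th><th>context</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>count</th><td>36</td><td>36</td><td>36</td><td>36</td></tr>\n",
" <tr><th>unique</th><td>36</td><td>34</td><td>36</td><td>29</td></tr>\n",
" <tr><th>top</th><td>4112d61851461f60</td><td>el display</td><td>inorganic photoconductor drum</td><td>G02</td></tr>\n",
" <tr><th>freq</th><td>1</td><td>2</td><td>1</td><td>3</td></tr>\n",
" </tbody>\n",
"</table>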
\n", "" ], "text/plain": [ " id anchor target context\n", "count 36 36 36 36\n", "unique 36 34 36 29\n", "top 4112d61851461f60 el display inorganic photoconductor drum G02\n", "freq 1 2 1 3" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eval_df = pd.read_csv(path/'test.csv')\n", "eval_df.describe()" ] }, { "cell_type": "markdown", "id": "b72de683", "metadata": { "papermill": { "duration": 0.093872, "end_time": "2022-05-16T23:18:56.237423", "exception": false, "start_time": "2022-05-16T23:18:56.143551", "status": "completed" }, "tags": [] }, "source": [ "This is the *test set*. Possibly the most important idea in machine learning is that of having separate training, validation, and test data sets." ] }, { "cell_type": "markdown", "id": "c80d0906", "metadata": { "heading_collapsed": true, "papermill": { "duration": 0.091766, "end_time": "2022-05-16T23:18:56.421195", "exception": false, "start_time": "2022-05-16T23:18:56.329429", "status": "completed" }, "tags": [] }, "source": [ "### Validation set" ] }, { "cell_type": "markdown", "id": "ce5af6dc", "metadata": { "hidden": true, "papermill": { "duration": 0.091787, "end_time": "2022-05-16T23:18:56.605465", "exception": false, "start_time": "2022-05-16T23:18:56.513678", "status": "completed" }, "tags": [] }, "source": [ "To explain the motivation, let's start simple, and imagine we're trying to fit a model where the true relationship is this quadratic:" ] }, { "cell_type": "code", "execution_count": 26, "id": "59f936fe", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:56.807733Z", "iopub.status.busy": "2022-05-16T23:18:56.806649Z", "iopub.status.idle": "2022-05-16T23:18:56.808617Z", "shell.execute_reply": "2022-05-16T23:18:56.809106Z", "shell.execute_reply.started": "2022-04-19T22:50:40.082981Z" }, "hidden": true, "papermill": { "duration": 0.110994, "end_time": "2022-05-16T23:18:56.809253", "exception": false, "start_time": 
"2022-05-16T23:18:56.698259", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def f(x): return -3*x**2 + 2*x + 20" ] }, { "cell_type": "markdown", "id": "9b056d51", "metadata": { "hidden": true, "papermill": { "duration": 0.102842, "end_time": "2022-05-16T23:18:57.018924", "exception": false, "start_time": "2022-05-16T23:18:56.916082", "status": "completed" }, "tags": [] }, "source": [ "Unfortunately matplotlib (the most common library for plotting in Python) doesn't come with a way to visualize a function, so we'll write something to do this ourselves:" ] }, { "cell_type": "code", "execution_count": 27, "id": "f79d3475", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:57.213875Z", "iopub.status.busy": "2022-05-16T23:18:57.213034Z", "iopub.status.idle": "2022-05-16T23:18:57.214858Z", "shell.execute_reply": "2022-05-16T23:18:57.215264Z", "shell.execute_reply.started": "2022-04-19T22:50:40.088386Z" }, "hidden": true, "papermill": { "duration": 0.100055, "end_time": "2022-05-16T23:18:57.215397", "exception": false, "start_time": "2022-05-16T23:18:57.115342", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import numpy as np, matplotlib.pyplot as plt\n", "\n", "def plot_function(f, min=-2.1, max=2.1, color='r'):\n", " x = np.linspace(min,max, 100)[:,None]\n", " plt.plot(x, f(x), color)" ] }, { "cell_type": "markdown", "id": "cdd515e0", "metadata": { "hidden": true, "papermill": { "duration": 0.093371, "end_time": "2022-05-16T23:18:57.404588", "exception": false, "start_time": "2022-05-16T23:18:57.311217", "status": "completed" }, "tags": [] }, "source": [ "Here's what our function looks like:" ] }, { "cell_type": "code", "execution_count": 28, "id": "f4718265", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:57.598056Z", "iopub.status.busy": "2022-05-16T23:18:57.597509Z", "iopub.status.idle": "2022-05-16T23:18:57.789107Z", "shell.execute_reply": "2022-05-16T23:18:57.789776Z", 
"shell.execute_reply.started": "2022-04-19T22:50:40.097166Z" }, "hidden": true, "papermill": { "duration": 0.293081, "end_time": "2022-05-16T23:18:57.789928", "exception": false, "start_time": "2022-05-16T23:18:57.496847", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_function(f)" ] }, { "cell_type": "markdown", "id": "b722c832", "metadata": { "hidden": true, "papermill": { "duration": 0.093043, "end_time": "2022-05-16T23:18:57.976558", "exception": false, "start_time": "2022-05-16T23:18:57.883515", "status": "completed" }, "tags": [] }, "source": [ "For instance, perhaps we've measured the height above ground of an object before and after some event. The measurements will have some random error. We can use numpy's random number generator to simulate that. I like to use `seed` when writing about simulations like this so that I know you'll see the same thing I do:" ] }, { "cell_type": "code", "execution_count": 29, "id": "db44230d", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:58.167367Z", "iopub.status.busy": "2022-05-16T23:18:58.166496Z", "iopub.status.idle": "2022-05-16T23:18:58.169108Z", "shell.execute_reply": "2022-05-16T23:18:58.168668Z", "shell.execute_reply.started": "2022-04-19T22:50:40.304166Z" }, "hidden": true, "papermill": { "duration": 0.099419, "end_time": "2022-05-16T23:18:58.169220", "exception": false, "start_time": "2022-05-16T23:18:58.069801", "status": "completed" }, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "from numpy.random import normal,seed,uniform\n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "id": "889069b7", "metadata": { "hidden": true, "papermill": { "duration": 0.104043, "end_time": "2022-05-16T23:18:58.379851", "exception": false, "start_time": "2022-05-16T23:18:58.275808", "status": "completed" }, "tags": [] }, "source": [ "Here's a function `add_noise` that adds some random variation to an array:" ] }, { "cell_type": "code", "execution_count": 30, "id": "34c80ce2", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:58.582003Z", "iopub.status.busy": "2022-05-16T23:18:58.580339Z", "iopub.status.idle": 
"2022-05-16T23:18:58.582612Z", "shell.execute_reply": "2022-05-16T23:18:58.583066Z", "shell.execute_reply.started": "2022-04-19T22:50:40.310049Z" }, "hidden": true, "papermill": { "duration": 0.104517, "end_time": "2022-05-16T23:18:58.583206", "exception": false, "start_time": "2022-05-16T23:18:58.478689", "status": "completed" }, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "def noise(x, scale): return normal(scale=scale, size=x.shape)\n", "def add_noise(x, mult, add): return x * (1+noise(x,mult)) + noise(x,add)" ] }, { "cell_type": "markdown", "id": "9de58b05", "metadata": { "hidden": true, "papermill": { "duration": 0.09537, "end_time": "2022-05-16T23:18:58.776554", "exception": false, "start_time": "2022-05-16T23:18:58.681184", "status": "completed" }, "tags": [] }, "source": [ "Let's use it to simulate some measurements evenly distributed over time:" ] }, { "cell_type": "code", "execution_count": 31, "id": "cd6dc8f3", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:58.981118Z", "iopub.status.busy": "2022-05-16T23:18:58.980278Z", "iopub.status.idle": "2022-05-16T23:18:59.178183Z", "shell.execute_reply": "2022-05-16T23:18:59.178770Z", "shell.execute_reply.started": "2022-04-19T22:50:40.318907Z" }, "hidden": true, "papermill": { "duration": 0.309467, "end_time": "2022-05-16T23:18:59.178959", "exception": false, "start_time": "2022-05-16T23:18:58.869492", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXAAAAD5CAYAAAA+0W6bAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAUOElEQVR4nO3df5BddXnH8ffTEHVV6oJZMVmIQEdjrVSjq4Ni/YU1jGMlpT9GRi0obaptrTpOLMhUpv0naBw7OLbjZISCMwxVMQbaagOClnHGxC4ECBgiakWzCWYtE9G65ZdP/9izsrnsvXv33nvuPefu+zWTybnnnLv32cPyyXef8z3nRGYiSaqfXxt0AZKkzhjgklRTBrgk1ZQBLkk1ZYBLUk0Z4JJUU8cstkNEnAR8FjgBSGBbZl4WEVuB3wMeBr4HvDMzj7T6WqtWrcqTTz6525olaVm59dZbf5KZY43rY7F54BGxGlidmbdFxLHArcBG4ETg5sx8NCI+CpCZf9Pqa01MTOTk5GSH34IkLU8RcWtmTjSuX7SFkpmHMvO2YvlnwD5gPDNvyMxHi912MRvokqQ+WVIPPCJOBtYDuxs2vQv4So9qkiS1oe0Aj4inA18E3p+ZD85bfzHwKHB1k/dtiojJiJicnp7utl5JUqGtAI+IlcyG99WZuX3e+vOBNwNvyybN9MzclpkTmTkxNvaEHrwkqUPtzEIJ4HJgX2Z+Yt76s4APAa/JzF+UV6IkaSGLBjhwBvAOYG9E3F6s+zDwSeDJwI2zGc+uzHx3GUVKndqxZ4qtO/dz8MgMa0ZH2LxhHRvXjw+6LKknFg3wzPwGEAts+nLvy5F6Z8eeKS7avpeZRx4DYOrIDBdt3wtgiGsotDMClzo2yBHw1p37fxXec2YeeYytO/cb4BoKBrhKM+gR8MEjM0taL9WN90JRaVqNgPthzejIktZLdWOAqzSDHgFv3rCOkZUrjlo3snIFmzes68vn98KOPVOccenNnHLhv3PGpTezY8/UoEtShdhCUWnWjI4wtUBY92sEPNemqesslEG3oFR9BrhKs3nDuqMCCPo/At64fry2YedJWC3GAFdp6j4CHrRBt6BUfQa4SlXnEXAvdDONctAtKFWfJzHVkifROjfXw546MkPyeA+73WM4DCdhVS4DXE11G0DLXbfTKDeuH2fLOacxPjpCAOOjI2w557Rl/RuNjmYLRU15Eq07vehhL/cWlFpzBK6mPInWHS8kUtkMcDVlAHXHHrbKZoCrKQOoO/awVTZ74GrKedzds4etMhngaskAkqrLFook1dSiAR4RJ0XE1yLi2xFxd0S8r1h/fETcGBH3Fn8fV365kqQ57YzAHwU+mJkvAE4H/jIiXgBcCNyUmc8FbipeS5L6ZNEAz8xDmXlbsfwzYB8wDpwNXFXsdhWwsaQaJUkLWFIPPCJOBtYDu4ETMvNQsel+4ITeliZJaqXtAI+IpwNfBN6fmQ/O35aZCWST922KiMmImJyenu6qWEnS49oK8IhYyWx4X52Z24vVP46I1cX21cDhhd6bmdsycyIzJ8bGxnpRsySJ9mahBHA5sC8zPzFv0/XAecXyecB1vS9PktRMOxfynAG8A9gbEbcX6z4MXAp8PiIuAO4D/riUCiVJC1o0wDPzG0A02Xxmb8uRJLXLKzElqaYMcEmqKQNckmrKAJekmjLAJammDHBJqikDXJJqygCXpJoywCWppgxwSaopA1ySasoAl6SaMsAlqaYMcEmqKQNckmrKAJekmjLAJamm2nkm5hURcTgi7pq37sURsSsibi+eOP/ycsuUJDVqZwR+JXBWw7qPAX+XmS8GPlK8liT10aIBnpm3AA80rgZ+vVh+BnCwx3VJkhbRzlPpF/J+YGdEfJzZfwRe2bOKJElt6fQk5nuAD2TmScAHgMub7RgRm4o++eT09HSHHydJatRpgJ8HbC+WvwA0PYmZmdsycyIzJ8bGxjr8OElSo04D/CDwmmL59cC
9vSlHktSuRXvgEXEN8FpgVUQcAC4B/gy4LCKOAf4P2FRmkdKg7Ngzxdad+zl4ZIY1oyNs3rCOjevHB12WBLQR4Jl5bpNNL+1xLVKl7NgzxUXb9zLzyGMATB2Z4aLtewEMcVWCV2JKTWzduf9X4T1n5pHH2Lpz/4Aqko5mgEtNHDwys6T1Ur8Z4FITa0ZHlrRe6jcDXGpi84Z1jKxccdS6kZUr2Lxh3YAqko7W6ZWY0tCbO1HpLBRVlQEutbBx/biBrcqyhSJJNWWAS1JNGeCSVFMGuCTVlAEuSTXlLBRVmjeTkpozwFVZ3kxKas0WiirLm0lJrRngqixvJiW1ZoCrsryZlNSaAa7K8mZSUmuLBnhEXBERhyPirob1742IeyLi7oj4WHklarnauH6cLeecxvjoCAGMj46w5ZzTPIEpFdqZhXIl8Cngs3MrIuJ1wNnAizLzoYh4VjnlabnzZlJSc4uOwDPzFuCBhtXvAS7NzIeKfQ6XUJskqYVOe+DPA34nInZHxH9GxMt6WZQkaXGdXshzDHA8cDrwMuDzEXFqZmbjjhGxCdgEsHbt2k7rlCQ16HQEfgDYnrO+BfwSWLXQjpm5LTMnMnNibGys0zolSQ06DfAdwOsAIuJ5wJOAn/SoJklSGxZtoUTENcBrgVURcQC4BLgCuKKYWvgwcN5C7RNJUnkWDfDMPLfJprf3uBZJ0hJ4JaYk1ZQBLkk1ZYBLUk0Z4JJUUwa4JNWUAS5JNWWAS1JN+VDjIedT3aXhZYAPMZ/qLg03WyhDzKe6S8PNEfgQ86nuGjRbeOVyBD7EfKq7BmmuhTd1ZIbk8Rbejj1Tgy5taBjgQ8ynumuQbOGVzxbKEJv7VdVfYTUItvDKZ4APOZ/qrkFZMzrC1AJhbQuvd2yhSCqFLbzyOQKXVIoqtPCGfRZMO49UuwJ4M3A4M1/YsO2DwMeBscz0mZiSjjLIFt5yuJCtnRbKlcBZjSsj4iTgjcAPe1yTJHVtOcyCaeeZmLdExMkLbPoH4EPAdb0uSpKguxbIcpgF09FJzIg4G5jKzDt6XI8kAd1fCLQcLmRbcoBHxFOBDwMfaXP/TRExGRGT09PTS/04SctUty2Q5TALppMR+G8ApwB3RMQPgBOB2yLi2QvtnJnbMnMiMyfGxsY6r1TSstJtC2Tj+nG2nHMa46MjBDA+OsKWc04bmhOY0ME0wszcCzxr7nUR4hPOQpGGzyCn4fXiQqBhv5Bt0RF4RFwDfBNYFxEHIuKC8suSNGiDvhnVcmiBdKudWSjnLrL95J5VI6kyWvWg+zGqrcKFQFXnlZiSFlSFaXjD3gLplgFesmG/lFfDy5tRVZ83syrRoHuIUjfsQVefAV6i5XApr4bXcpiGV3e2UEpUhR6i1A170NXmCLxEy+FSXkmDY4CXyB6ipDLZQimR81gllckAL5k9REllsYUiSTVlgEtSTRngklRTBrgk1ZQBLkk15SyUivNmWJKaMcArbO5mWHP3U5m7GRZgiEuyhVJl3gxLUivtPFLtiog4HBF3zVu3NSLuiYg7I+JLETFaapXLlDfDktRKOyPwK4GzGtbdCLwwM38b+A5wUY/rEt4MS1JriwZ4Zt4CPNCw7obMfLR4uQs4sYTalj1vhiWplV6cxHwX8LkefB018GZYklrpKsAj4mLgUeDqFvtsAjYBrF27tpuPW5a8GZakZjqehRIR5wNvBt6Wmdlsv8zclpkTmTkxNjbW6cdJkhp0NAKPiLOADwGvycxf9LYkSVI72plGeA3wTWBdRByIiAuATwHHAjdGxO0R8emS65QkNVh0BJ6Z5y6w+vISapEkLYFXYkpSTRngklRTBrgk1ZQBLkk1ZYBLUk0Z4JJUUwa4JNWUT+SRhpiP5BtuBrg0pHwk3/AzwBfhCEZ11eqRfP4MDwcDvAVHMKozH8k3/DyJ2YIPFVad+Ui+4WeAt+AIRnXmI/mGnwHegiMY1dnG9eNsOec0xkdHCGB8dIQt55xm+2+
I2ANvYfOGdUf1wMERjOrFR/INNwO8BR8qLKnKDPBFOIKRVFXtPFLtiog4HBF3zVt3fETcGBH3Fn8fV26ZkqRG7ZzEvBI4q2HdhcBNmflc4KbitSSpjxYN8My8BXigYfXZwFXF8lXAxt6WJUlaTKfTCE/IzEPF8v3ACT2qR5LUpq7ngWdmAtlse0RsiojJiJicnp7u9uMkSYVOZ6H8OCJWZ+ahiFgNHG62Y2ZuA7YBTExMNA16SaqiKt/QrtMR+PXAecXyecB1vSlHkqpj7oZ2U0dmSB6/od2OPVODLg1obxrhNcA3gXURcSAiLgAuBX43Iu4F3lC8lqShUvUb2i3aQsnMc5tsOrPHtUhSpVT9hnbezEqSmqj6De0McElqouq35PVeKJLURNVvaGeAS1ILVb6hnS0USaopA1ySasoAl6SasgcuSSUq81J8A1ySSjJ3Kf7c1Zxzl+IDPQlxWyiSVJKyL8U3wCWpJGVfim+AS1JJyr4U3wCXpJKUfSm+JzElqSRlX4pvgEtSicq8FN8WiiTVlAEuSTXVVYBHxAci4u6IuCsiromIp/SqMElSax0HeESMA38NTGTmC4EVwFt7VZgkqbVuWyjHACMRcQzwVOBg9yVJktrRcYBn5hTwceCHwCHgp5l5Q68KkyS11k0L5TjgbOAUYA3wtIh4+wL7bYqIyYiYnJ6e7rxSSdJRummhvAH478yczsxHgO3AKxt3ysxtmTmRmRNjY2NdfJwkab5uAvyHwOkR8dSICOBMYF9vypIkLaabHvhu4FrgNmBv8bW29aguSdIiurqUPjMvAS7pUS2SpCXwSkxJqikDXJJqygCXpJoywCWppgxwSaopA1ySasoAl6SaMsAlqaYMcEmqKQNckmrKAJekmjLAJammDHBJqikDXJJqqqvbyfbDjj1TbN25n4NHZlgzOsLmDevYuH68b++XpKqqdIDv2DPFRdv3MvPIYwBMHZnhou17AdoK4W7fL0lVVukWytad+38VvnNmHnmMrTv39+X9klRlXQV4RIxGxLURcU9E7IuIV/SqMICDR2aWtL7X75ekKut2BH4Z8B+Z+XzgRfT4ocZrRkeWtL7X75ekKus4wCPiGcCrgcsBMvPhzDzSo7oA2LxhHSMrVxy1bmTlCjZvWNeX90tSlXVzEvMUYBr454h4EXAr8L7M/N+eVMbjJxo7nUXS7fslqcoiMzt7Y8QEsAs4IzN3R8RlwIOZ+bcN+20CNgGsXbv2pffdd1+XJUvS8hIRt2bmROP6bnrgB4ADmbm7eH0t8JLGnTJzW2ZOZObE2NhYFx8nSZqv4wDPzPuBH0XEXEP5TODbPalKkrSobi/keS9wdUQ8Cfg+8M7uS5IktaOrAM/M24En9GUkSeWr9JWYkqTmOp6F0tGHRUwDnU5DWQX8pIfl9Ip1LY11LY11LU1V64LuantOZj5hFkhfA7wbETG50DSaQbOupbGupbGupalqXVBObbZQJKmmDHBJqqk6Bfi2QRfQhHUtjXUtjXUtTVXrghJqq00PXJJ0tDqNwCVJ81Q2wCNia/GgiDsj4ksRMdpkv7MiYn9EfDciLuxDXX8UEXdHxC+LG3o12+8HEbE3Im6PiMkK1dXv43V8RNwYEfcWfx/XZL/HimN1e0RcX2I9Lb//iHhyRHyu2L47Ik4uq5Yl1nV+REzPO0Z/2qe6roiIwxFxV5PtERGfLOq+MyKecD+kAdX12oj46bzj9ZE+1HRSRHwtIr5d/L/4vgX26e3xysxK/gHeCBxTLH8U+OgC+6wAvgecCjwJuAN4Qcl1/SawDvg6MNFivx8Aq/p4vBata0DH62PAhcXyhQv9dyy2/bwPx2jR7x/4C+DTxfJbgc9VpK7zgU/16+dp3ue+mtmb1N3VZPubgK8AAZwO7K5IXa8F/q3Px2o18JJi+VjgOwv8d+zp8arsCDwzb8jMR4uXu4ATF9jt5cB3M/P7mfkw8C/A2SXXtS8zK/dQzTbr6vvxKr7+VcX
yVcDGkj+vlXa+//n1XgucGRFRgboGIjNvAR5oscvZwGdz1i5gNCJWV6CuvsvMQ5l5W7H8M2afUNb48IGeHq/KBniDdzH7r1ajceBH814f4IkHbFASuCEibi3uiV4FgzheJ2TmoWL5fuCEJvs9JSImI2JXRGwsqZZ2vv9f7VMMIH4KPLOkepZSF8AfFL92XxsRJ5VcU7uq/P/gKyLijoj4SkT8Vj8/uGi9rQd2N2zq6fHq9m6EXYmIrwLPXmDTxZl5XbHPxcCjwNVVqqsNr8rMqYh4FnBjRNxTjBoGXVfPtapr/ovMzIhoNu3pOcXxOhW4OSL2Zub3el1rjf0rcE1mPhQRf87sbwmvH3BNVXYbsz9TP4+INwE7gOf244Mj4unAF4H3Z+aDZX7WQAM8M9/QantEnA+8GTgziwZSgylg/kjkxGJdqXW1+TWmir8PR8SXmP01uasA70FdfT9eEfHjiFidmYeKXxUPN/kac8fr+xHxdWZHL70O8Ha+/7l9DkTEMcAzgP/pcR1Lrisz59fwGWbPLVRBKT9T3ZofnJn55Yj4p4hYlZml3iclIlYyG95XZ+b2BXbp6fGqbAslIs4CPgS8JTN/0WS3/wKeGxGnxOw9yd8KlDaDoV0R8bSIOHZumdkTsgueLe+zQRyv64HziuXzgCf8phARx0XEk4vlVcAZlPNwkHa+//n1/iFwc5PBQ1/rauiTvoXZ/moVXA/8STG74nTgp/NaZgMTEc+eO3cRES9nNutK/Ye4+LzLgX2Z+Ykmu/X2ePXzLO0Sz+h+l9le0e3Fn7mZAWuALzec1f0Os6O1i/tQ1+8z27d6CPgxsLOxLmZnE9xR/Lm7KnUN6Hg9E7gJuBf4KnB8sX4C+Eyx/Epgb3G89gIXlFjPE75/4O+ZHSgAPAX4QvHz9y3g1LKPUZt1bSl+lu4AvgY8v091XQMcAh4pfr4uAN4NvLvYHsA/FnXvpcXMrD7X9Vfzjtcu4JV9qOlVzJ77unNebr2pzOPllZiSVFOVbaFIklozwCWppgxwSaopA1ySasoAl6SaMsAlqaYMcEmqKQNckmrq/wHblvg7QR4VHwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = np.linspace(-2, 2, num=20)[:,None]\n", "y = add_noise(f(x), 0.2, 1.3)\n", "plt.scatter(x,y);" ] }, { "cell_type": "markdown", "id": "1ee44447", "metadata": { "hidden": true, "papermill": { "duration": 0.095978, "end_time": "2022-05-16T23:18:59.370378", "exception": false, "start_time": "2022-05-16T23:18:59.274400", "status": "completed" }, "tags": [] }, "source": [ "Now let's see what happens if we *underfit* or *overfit* these predictions. To do that, we'll create a function that fits a polynomial of some degree (e.g. a line is degree 1, quadratic is degree 2, cubic is degree 3, etc). The details of how this function works don't matter too much so feel free to skip over it if you like! (PS: if you're not sure about the jargon around polynomials, here's a [great video](https://www.youtube.com/watch?v=ffLLmV4mZwU) which teaches you what you'll need to know.)" ] }, { "cell_type": "code", "execution_count": 32, "id": "59371ec4", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:18:59.563588Z", "iopub.status.busy": "2022-05-16T23:18:59.562917Z", "iopub.status.idle": "2022-05-16T23:19:00.524919Z", "shell.execute_reply": "2022-05-16T23:19:00.523968Z", "shell.execute_reply.started": "2022-04-19T22:50:40.539771Z" }, "hidden": true, "papermill": { "duration": 1.060551, "end_time": "2022-05-16T23:19:00.525064", "exception": false, "start_time": "2022-05-16T23:18:59.464513", "status": "completed" }, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.pipeline import make_pipeline\n", "\n", "def plot_poly(degree):\n", " model = make_pipeline(PolynomialFeatures(degree), LinearRegression())\n", " model.fit(x, y)\n", " plt.scatter(x,y)\n", " plot_function(model.predict)" ] }, { "cell_type": "markdown", "id": 
"143c4476", "metadata": { "hidden": true, "papermill": { "duration": 0.095759, "end_time": "2022-05-16T23:19:00.717839", "exception": false, "start_time": "2022-05-16T23:19:00.622080", "status": "completed" }, "tags": [] }, "source": [ "So, what happens if we fit a line (a \"degree 1 polynomial\") to our measurements?" ] }, { "cell_type": "code", "execution_count": 33, "id": "38869436", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:01.002283Z", "iopub.status.busy": "2022-05-16T23:19:00.992285Z", "iopub.status.idle": "2022-05-16T23:19:01.176825Z", "shell.execute_reply": "2022-05-16T23:19:01.177542Z", "shell.execute_reply.started": "2022-04-19T22:50:41.552578Z" }, "hidden": true, "papermill": { "duration": 0.365113, "end_time": "2022-05-16T23:19:01.177706", "exception": false, "start_time": "2022-05-16T23:19:00.812593", "status": "completed" }, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD5CAYAAAA+0W6bAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAU8ElEQVR4nO3dfYxcZ3XH8e/BccjmjXW8S4g3cW1QWEjjBMM2SolaXkLrtKKNa6lVI4pCQbWKKAWETAmojSpVSooRFRWVkKWkASlKS8E1SC01KaBGlUhaJ07rhGCglBevA97EcZKSTWI7p3/sDJ4dz+zMztudu/P9SKuZuXNn5uwo+8vjc5/73MhMJEnl86KiC5AkdcYAl6SSMsAlqaQMcEkqKQNckkrKAJekkjqj1Q4RcQnwWeBCIIFdmfnJiNgJ/AbwPPA/wO9n5rGl3mtiYiI3bNjQbc2SNFLuv//+xzJzsn57tJoHHhEXARdl5gMRcR5wP7AVuBj4WmaeiIi/BMjMP1nqvWZmZnLfvn0d/gqSNJoi4v7MnKnf3rKFkpmPZuYDlftPA48AU5n5lcw8UdntXhYCXZI0IMvqgUfEBmAzcF/dU+8EvtyjmiRJbWg7wCPiXOALwPsz86ma7R8FTgB3Nnnd9ojYFxH75ubmuq1XklTRVoBHxGoWwvvOzNxds/0dwFuBt2WTZnpm7srMmcycmZw8rQcvSepQO7NQArgNeCQzP1Gz/TrgQ8AbMvOZ/pUoSWqkZYAD1wBvBw5ExIOVbR8B/hp4MXD3QsZzb2b+YT+KlLqxZ/8sO/ce5PCxedaNj7FjyzRbN08VXZbUtZYBnpn/DkSDp/659+VIvbVn/yw37T7A/PGTAMwem+em3QcADHGVXjsjcKlrRY2Cd+49+LPwrpo/fpKdew8a4Co9A1x9V+Qo+PCx+WVtl8rEtVDUd0uNgvtt3fjYsrZLZWKAq++KHAXv2DLN2OpVi7aNrV7Fji3Tff/sbuzZP8s1t36NjR/+J6659Wvs2T9bdEkaQ
rZQ1HfrxseYbRDWgxgFV1s0ZZqF4oFXtcsAV9/t2DK9KJBgsKPgrZunShV8HnhVuwxw9V0ZR8FF8sCr2mWAayDKNgruhU6nThbZclK5eBBTbfPAWvuqfezZY/Mkp/rY7XxnZT3wqsEzwNWWbgJpFHUzdXLr5ilu2baJqfExApgaH+OWbZtG7l8was0WitrigbXl6baPPYotJy2fI3C1xQNry+MJRBoEA1xtMZCWxz62BsEAV1sMpOWxj61BsAeutjiXe/nsY6vfDHC1zUCShostFEkqqZYBHhGXRMTXI+KbEfFwRLyvsv2CiLg7Ir5TuV3T/3IlSVXtjMBPAB/MzMuAq4H3RMRlwIeBr2bmpcBXK48lSQPSMsAz89HMfKBy/2ngEWAKuB74TGW3zwBb+1SjJKmBZfXAI2IDsBm4D7gwMx+tPPVj4MLeliZJWkrbAR4R5wJfAN6fmU/VPpeZCWST122PiH0RsW9ubq6rYiVJp7QV4BGxmoXwvjMzd1c2/yQiLqo8fxFwpNFrM3NXZs5k5szk5GQvapYk0d4slABuAx7JzE/UPPUl4MbK/RuBL/a+PElSM+2cyHMN8HbgQEQ8WNn2EeBW4HMR8S7gB8Dv9KVCSVJDLQM8M/8diCZPX9vbciRJ7fJMTEkqKQNckkrKAJekkjLAJamkDHBJKikDXJJKygCXpJIywCWppAxwSSopA1ySSsoAl6SSMsAlqaQMcEkqKQNckkrKAJekkjLAJamkDHBJKql2rol5e0QciYiHara9JiLujYgHK1ecv6q/ZUqS6rUzAr8DuK5u28eAP8/M1wB/VnksSRqglgGemfcAR+s3A+dX7r8EONzjuiRJLbRzVfpG3g/sjYiPs/A/gdf3rCJJUls6PYj5buADmXkJ8AHgtmY7RsT2Sp9839zcXIcfJ0mq12mA3wjsrtz/B6DpQczM3JWZM5k5Mzk52eHHSZLqdRrgh4E3VO6/GfhOb8qRJLWrZQ88Iu4C3ghMRMQh4GbgD4BPRsQZwLPA9n4WKRVlz/5Zdu49yOFj86wbH2PHlmm2bp4quiwJaCPAM/OGJk+9rse1SENlz/5Zbtp9gPnjJwGYPTbPTbsPABjiGgqeiSk1sXPvwZ+Fd9X88ZPs3HuwoIqkxQxwqYnDx+aXtV0aNANcamLd+NiytkuDZoBLTezYMs3Y6lWLto2tXsWOLdMFVSQt1umZmNKKVz1Q6SwUDSsDXFrC1s1TBraGli0USSopA1ySSsoAl6SSMsAlqaQMcEkqKWehaOi5oJTUmAGuoeaCUlJztlA01FxQSmrOANdQc0EpqTkDXEPNBaWk5gxwDTUXlJKaaxngEXF7RByJiIfqtr83Ir4VEQ9HxMf6V6JG2dbNU9yybRNT42MEMDU+xi3bNnkAU6K9WSh3AJ8CPlvdEBFvAq4HrszM5yLipf0pT3JBKamZliPwzLwHOFq3+d3ArZn5XGWfI32oTZK0hE574K8Efiki7ouIf4uIX+hlUZKk1jo9kecM4ALgauAXgM9FxMszM+t3jIjtwHaA9evXd1qnJKlOpyPwQ8DuXPAfwAvARKMdM3NXZs5k5szk5GSndUqS6nQa4HuANwFExCuBM4HHelSTJKkNLVsoEXEX8EZgIiIOATcDtwO3V6YWPg/c2Kh9Iknqn5YBnpk3NHnq93pciyRpGTwTU5JKygCXpJJyPXBJ6pXjx+Hxx+GxxxZ+qvcffxxuuAE2buzpxxngktTIc88tBG+zQG50+9RTzd/vyisNcElatmefbR66zcL5//6v+fudey5MTCz8rF0Lr3zl4sf1t2vXwlln9fzXMsAllcv8/OkB3GqU/NOfNn+/888/Fb4vfSm8+tWLw7j6Mzl56v6LXzy433cJBrikYmTCM8+015qovX3mmebvOT5+auT7spfBz//8qeCtDeRqQF9wAZx55sB+5V4zwEeMV3hXX2QutBzaDeLq/Wefbf6ea9acCt2pKbjiilMj4UYtirVr4
YzRirTR+m1HnFd4V1sy4emnm4dus/vPP9/4/SJOhfHEBKxfD6997eIQrh8dr1kzcmHcCb+hEbLUFd4N8BUqc2FmRDsH7Wpvjx9v/H4vetFC26EauC9/OVx11emtidrR8Zo1sGpV4/dTVwzwEeIV3kvuhRfgySfbO2hXvX38cThxovH7rVq1OGgvvRSuvrpxr7h6f3x8IcSbONWiO8a68efYseVctk4Y3v1igI+QdeNjzDYIa6/wXoAXXoBjx9qfX/zYY3D0KJw82fj9zjhjcRhPTzceDdfOpBgfX2hv9IgtusEzwEfIji3Ti/7AwCu898TJk/DEE8ubTXH06EKIN7J69eJR8GWXLQ7fRr3j88/vaRh3whbd4BngI6T6R+QslCWcOLEQrksdwKu/feKJhV5zI2eeuXjmxKZNjWdQ1IbzuecWHsadsEU3eAb4iBmpK7yfOHH6qdCtRsdPPNH8/c46a3H4XnJJ49Fw7e0555QyjDthi27wDHCVw/PPLw7jdkbITz7Z/P3OPntx0G7Y0Lw9Ub1/9tkD+3XLyBbd4BngGrzqIkHL6RkvtUjQOecsbke84hVLH7ybmIAxR4W9VnSLbhRPUmvnkmq3A28FjmTm5XXPfRD4ODCZmV4TcxTVLhLUamRcvb/UIkHnnXf61LbaM+8anfjRh0WC1JmiWnSjOgOmnRH4HcCngM/WboyIS4BfBX7Y+7JUiF4vEnTeeafCt36RoGbrUgzJIkEql1GdAdPONTHviYgNDZ76K+BDwBd7XZS61GiRoHbaFPNLzBaoXSTooovg8suXXjpz7dpSLxKkYnTaBhnVGTAd9cAj4npgNjP/K0bkCHthMhdGua1CuH5bq0WCqoF78cXwmtc0DuLakbHrUqjPummDjOoMmGX/VUbE2cBHWGiftLP/dmA7wPr165f7cStLO4sENbp97rnG71e7SNDatacWCaoNXxcJUkl00wYZ1RkwnfwlvwLYCFRH3xcDD0TEVZn54/qdM3MXsAtgZmamydkOJdRokaBmvePadSlaLRJUnTGxcSPMzJw+e6J+XQoXCdIK0U0bpOgZMEVZdoBn5gHgpdXHEfF9YKbUs1CqiwQt55JLy10k6PWvX3rFthaLBEmDUtR0vG7bICN1klpFO9MI7wLeCExExCHg5sy8rd+Fdax2kaB2z8BrtUhQbdi+6lVLr9a2di285CWGsUqpyOl4o9oG6UY7s1BuaPH8hp5V08zsLHz/++0dvGt3kaCJiYVFghqtR1HbthiCRYKkQSlyOt6otkG6UY6jWX/xF/DpTy/eduaZi0e/1cstNVuTosSLBEmDUvR0vFFsg3SjHAH+7nfD1q2LR8clXiRoFE/5VTmM6nS8sipHgF9xxcLPCjCqp/yqHOxDl4tH2gZsqR6jVLStm6e4ZdsmpsbHCGBqfIxbtm1ycDGkyjECX0GK7jFKrdiHLg9H4APWrJdoj1HSchngA7ZjyzRjqxefPWmPUVInbKEMmHNdJfWKAV4Ae4ySesEWiiSVlAEuSSVlgEtSSRngklRSBrgklZSzUErGhbAkVRngJeJCWJJq2UIpERfCklSrZYBHxO0RcSQiHqrZtjMivhUR/x0R/xgR432tUoALYUlarJ0R+B3AdXXb7gYuz8wrgG8DN/W4LjXgQliSarUM8My8Bzhat+0rmVm9JPu9wMV9qE11XAhLUq1eHMR8J/D3PXgfteBCWJJqdRXgEfFR4ARw5xL7bAe2A6xfv76bjxMuhCXplI5noUTEO4C3Am/LzGy2X2buysyZzJyZnJzs9OMkSXU6GoFHxHXAh4A3ZOYzvS1JktSOdqYR3gV8A5iOiEMR8S7gU8B5wN0R8WBEfLrPdUqS6rQcgWfmDQ0239aHWiRJy+CZmJJUUga4JJWUAS5JJWWAS1JJGeCSVFIGuCSVlAEuSSXlFXmkFchL740GA1xaYbz03ugwwDvkCEfDaqlL7/nf6MpigHfAEY6GmZfeGx0exOyAFxfWM
PPSe6PDAO+AIxwNMy+9NzoM8A44wtEw27p5ilu2bWJqfIwApsbHuGXbJtt7K5A98A7s2DK9qAcOjnA0XLz03mgwwDvgxYUlDQMDvEOOcCQVrZ1Lqt0eEUci4qGabRdExN0R8Z3K7Zr+lilJqtfOQcw7gOvqtn0Y+GpmXgp8tfJYkjRALQM8M+8BjtZtvh74TOX+Z4CtvS1LktRKp9MIL8zMRyv3fwxc2KN6JElt6noeeGYmkM2ej4jtEbEvIvbNzc11+3GSpIpOZ6H8JCIuysxHI+Ii4EizHTNzF7ALYGZmpmnQS1JRyro4Xacj8C8BN1bu3wh8sTflSNJgVRenmz02T3Jqcbo9+2eLLq2ldqYR3gV8A5iOiEMR8S7gVuBXIuI7wFsqjyWpdMq8OF3LFkpm3tDkqWt7XIskDVyZF6dzMStJI63Mi9MZ4JJGWpmX33UtFEkjrcyL0xngkkZeWRens4UiSSVlgEtSSRngklRS9sAlqQtFnoZvgEtSh6qn4VfP5Kyehg8MJMRtoUhSh4o+Dd8Al6QOFX0avgEuSR0q+jR8A1ySOlT0afgexJSkDhV9Gr4BLkldKPI0fFsoklRSBrgklVRXAR4RH4iIhyPioYi4KyLO6lVhkqSldRzgETEF/DEwk5mXA6uA3+1VYZKkpXXbQjkDGIuIM4CzgcPdlyRJakfHAZ6Zs8DHgR8CjwJPZuZXelWYJGlp3bRQ1gDXAxuBdcA5EfF7DfbbHhH7ImLf3Nxc55VKkhbppoXyFuB/M3MuM48Du4HX1++UmbsycyYzZyYnJ7v4OElSrW4C/IfA1RFxdkQEcC3wSG/KkiS10k0P/D7g88ADwIHKe+3qUV2SpBa6OpU+M28Gbu5RLZKkZfBMTEkqKQNckkrKAJekkjLAJamkDHBJKikDXJJKygCXpJIywCWppAxwSSopA1ySSsoAl6SSMsAlqaQMcEkqKQNckkqqq+Vki7Zn/yw79x7k8LF51o2PsWPLNFs3Tw3s9ZJUpNIG+J79s9y0+wDzx08CMHtsnpt2HwBoK4S7fb0kFa20LZSdew/+LHyr5o+fZOfegwN5vSQVrasAj4jxiPh8RHwrIh6JiF/sVWGtHD42v6ztvX69JBWt2xH4J4F/ycxXAVcywIsarxsfW9b2Xr9ekorWcYBHxEuAXwZuA8jM5zPzWI/qamnHlmnGVq9atG1s9Sp2bJkeyOslqWjdHMTcCMwBfxsRVwL3A+/LzJ/2pLIWqgcaO51F0u3rJalokZmdvTBiBrgXuCYz74uITwJPZeaf1u23HdgOsH79+tf94Ac/6LJkSRotEXF/Zs7Ub++mB34IOJSZ91Uefx54bf1OmbkrM2cyc2ZycrKLj5Mk1eo4wDPzx8CPIqLaNL4W+GZPqpIktdTtiTzvBe6MiDOB7wG/331JkqR2dBXgmfkgcFpfRpLUf6U9E1OSRl3Hs1A6+rCIOaAf01AmgMf68L4rid/R0vx+WvM7Wlo/v5+fy8zTZoEMNMD7JSL2NZpio1P8jpbm99Oa39HSivh+bKFIUkkZ4JJUUislwHcVXUAJ+B0tze+nNb+jpQ38+1kRPXBJGkUrZQQuSSNnxQR4ROysXFjivyPiHyNivOiahk1E/HZEPBwRL1QWIxMQEddFxMGI+G5EfLjoeoZNRNweEUci4qGiaxlGEXFJRHw9Ir5Z+ft636A+e8UEOHA3cHlmXgF8G7ip4HqG0UPANuCeogsZFhGxCvgb4NeAy4AbIuKyYqsaOncA1xVdxBA7AXwwMy8DrgbeM6j/hlZMgGfmVzLzROXhvcDFRdYzjDLzkcz0op+LXQV8NzO/l5nPA38HXF9wTUMlM+8BjhZdx7DKzEcz84HK/adZuDLZQC4ssGICvM47gS8XXYRKYQr4Uc3jQwzoj08rT0RsADYD97XYtSe6XY1woCLiX4GXNXjqo5n5xco+H2XhnzR3DrK2YdHOdySp9yLiXOALwPsz86lBfGapAjwz37LU8
xHxDuCtwLU5ovMjW31HOs0scEnN44sr26S2RcRqFsL7zszcPajPXTEtlIi4DvgQ8JuZ+UzR9ag0/hO4NCI2Vta1/13gSwXXpBKJiGDh4u6PZOYnBvnZKybAgU8B5wF3R8SDEfHpogsaNhHxWxFxCPhF4J8iYm/RNRWtcuD7j4C9LBx8+lxmPlxsVcMlIu4CvgFMR8ShiHhX0TUNmWuAtwNvrmTPgxHx64P4YM/ElKSSWkkjcEkaKQa4JJWUAS5JJWWAS1JJGeCSVFIGuCSVlAEuSSVlgEtSSf0/+4frtEBKL+UAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_poly(1)" ] }, { "cell_type": "markdown", "id": "5a792455", "metadata": { "hidden": true, "papermill": { "duration": 0.093894, "end_time": "2022-05-16T23:19:01.366741", "exception": false, "start_time": "2022-05-16T23:19:01.272847", "status": "completed" }, "tags": [] }, "source": [ "As you see, the points on the red line (the line we fitted) aren't very close at all. This is *under-fit* -- there's not enough detail in our function to match our data.\n", "\n", "And what happens if we fit a degree 10 polynomial to our measurements?" ] }, { "cell_type": "code", "execution_count": 34, "id": "0946043f", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:01.577483Z", "iopub.status.busy": "2022-05-16T23:19:01.576630Z", "iopub.status.idle": "2022-05-16T23:19:01.740385Z", "shell.execute_reply": "2022-05-16T23:19:01.741143Z", "shell.execute_reply.started": "2022-04-19T22:50:41.806965Z" }, "hidden": true, "papermill": { "duration": 0.280325, "end_time": "2022-05-16T23:19:01.741320", "exception": false, "start_time": "2022-05-16T23:19:01.460995", "status": "completed" }, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_poly(10)" ] }, { "cell_type": "markdown", "id": "97381d3d", "metadata": { "hidden": true, "papermill": { "duration": 0.109375, "end_time": "2022-05-16T23:19:01.958176", "exception": false, "start_time": "2022-05-16T23:19:01.848801", "status": "completed" }, "tags": [] }, "source": [ "Well now it fits our data better, but it doesn't look like it'll do a great job predicting points other than those we measured -- especially those in earlier or later time periods. This is *over-fit* -- there's too much detail such that the model fits our points, but not the underlying process we really care about.\n", "\n", "Let's try a degree 2 polynomial (a quadratic), and compare it to our \"true\" function (in blue):" ] }, { "cell_type": "code", "execution_count": 35, "id": "96e74956", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:02.168390Z", "iopub.status.busy": "2022-05-16T23:19:02.167555Z", "iopub.status.idle": "2022-05-16T23:19:02.344144Z", "shell.execute_reply": "2022-05-16T23:19:02.343715Z", "shell.execute_reply.started": "2022-04-19T22:50:41.991988Z" }, "hidden": true, "papermill": { "duration": 0.290263, "end_time": "2022-05-16T23:19:02.344261", "exception": false, "start_time": "2022-05-16T23:19:02.053998", "status": "completed" }, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_poly(2)\n", "plot_function(f, color='b')" ] }, { "cell_type": "markdown", "id": "4789f92a", "metadata": { "hidden": true, "papermill": { "duration": 0.096446, "end_time": "2022-05-16T23:19:02.537038", "exception": false, "start_time": "2022-05-16T23:19:02.440592", "status": "completed" }, "tags": [] }, "source": [ "That's not bad at all!\n", "\n", "So, how do we recognise whether our models are under-fit, over-fit, or \"just right\"? We use a *validation set*. This is a set of data that we \"hold out\" from training -- we don't let our model see it at all. If you use the fastai library, it automatically creates a validation set for you if you don't have one, and will always report metrics (measurements of the accuracy of a model) using the validation set.\n", "\n", "The validation set is *only* ever used to see how we're doing. It's *never* used as inputs to training the model.\n", "\n", "Transformers uses a `DatasetDict` for holding your training and validation sets. 
To create one that contains 25% of our data for the validation set, and 75% for the training set, use `train_test_split`:" ] }, { "cell_type": "code", "execution_count": 36, "id": "b8b1e366", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:02.734895Z", "iopub.status.busy": "2022-05-16T23:19:02.734114Z", "iopub.status.idle": "2022-05-16T23:19:02.754237Z", "shell.execute_reply": "2022-05-16T23:19:02.754678Z", "shell.execute_reply.started": "2022-04-19T22:50:42.182674Z" }, "hidden": true, "papermill": { "duration": 0.122148, "end_time": "2022-05-16T23:19:02.754828", "exception": false, "start_time": "2022-05-16T23:19:02.632680", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],\n", " num_rows: 27354\n", " })\n", " test: Dataset({\n", " features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],\n", " num_rows: 9119\n", " })\n", "})" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dds = tok_ds.train_test_split(0.25, seed=42)\n", "dds" ] }, { "cell_type": "markdown", "id": "776f4394", "metadata": { "hidden": true, "papermill": { "duration": 0.096073, "end_time": "2022-05-16T23:19:02.948686", "exception": false, "start_time": "2022-05-16T23:19:02.852613", "status": "completed" }, "tags": [] }, "source": [ "As you see above, the validation set here is called `test` and not `validate`, so be careful!\n", "\n", "In practice, a random split like we've used here might not be a good idea -- here's what Dr Rachel Thomas has to say about it:\n", "\n", "> \"*One of the most likely culprits for this disconnect between results in development vs results in production is a poorly chosen validation set (or even worse, no validation set at all). 
Depending on the nature of your data, choosing a validation set can be the most important step. Although sklearn offers a `train_test_split` method, this method takes a random subset of the data, which is a poor choice for many real-world problems.*\"\n", "\n", "I strongly recommend reading her article [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/) to more fully understand this critical topic." ] }, { "cell_type": "markdown", "id": "dc1f53c9", "metadata": { "heading_collapsed": true, "papermill": { "duration": 0.096584, "end_time": "2022-05-16T23:19:03.142638", "exception": false, "start_time": "2022-05-16T23:19:03.046054", "status": "completed" }, "tags": [] }, "source": [ "### Test set" ] }, { "cell_type": "markdown", "id": "e665eb75", "metadata": { "hidden": true, "papermill": { "duration": 0.10414, "end_time": "2022-05-16T23:19:03.343573", "exception": false, "start_time": "2022-05-16T23:19:03.239433", "status": "completed" }, "tags": [] }, "source": [ "So that's the validation set explained, and created. What about the \"test set\" then -- what's that for?\n", "\n", "The *test set* is yet another dataset that's held out from training. But it's held out from reporting metrics too! The accuracy of your model on the test set is only ever checked after you've completed your entire training process, including trying different models, training methods, data processing, etc.\n", "\n", "You see, as you try all these different things, to see their impact on the metrics on the validation set, you might just accidentally find a few things that entirely coincidentally improve your validation set metrics, but aren't really better in practice. Given enough time and experiments, you'll find lots of these coincidental improvements. That means you're actually over-fitting to your validation set!\n", "\n", "That's why we keep a test set held back. 
Kaggle's public leaderboard is like a test set that you can check from time to time. But don't check too often, or you'll even be over-fitting to the test set!\n", "\n", "Kaggle has a *second* test set, which is yet another held-out dataset that's only used at the *end* of the competition to assess your predictions. That's called the \"private leaderboard\". Here's a [great post](https://gregpark.io/blog/Kaggle-Psychopathy-Postmortem/) about what can happen if you overfit to the public leaderboard.\n", "\n", "We'll use `eval` as our name for the test set, to avoid confusion with the `test` dataset that was created above." ] }, { "cell_type": "code", "execution_count": 37, "id": "3a064b7f", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:03.544188Z", "iopub.status.busy": "2022-05-16T23:19:03.543651Z", "iopub.status.idle": "2022-05-16T23:19:04.825203Z", "shell.execute_reply": "2022-05-16T23:19:04.823967Z", "shell.execute_reply.started": "2022-04-19T22:50:42.209113Z" }, "hidden": true, "papermill": { "duration": 1.38502, "end_time": "2022-05-16T23:19:04.825395", "exception": false, "start_time": "2022-05-16T23:19:03.440375", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ca8cd9e5c48c4b17bad4862d247c0565", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00 At their heart, what most current AI approaches do is to optimize metrics. The practice of optimizing metrics is not new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so. This is important to understand, because any risks of optimizing metrics are heightened by AI. While metrics can be useful in their proper place, there are harms when they are unthinkingly applied. Some of the scariest instances of algorithms run amok all result from over-emphasizing metrics. 
We have to understand this dynamic in order to understand the urgent risks we are facing due to misuse of AI.\n", "\n", "On Kaggle, however, it's very straightforward to know what metric to use: Kaggle will tell you! According to this competition's [evaluation page](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/overview/evaluation), \"*submissions are evaluated on the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the predicted and actual similarity scores*.\" This coefficient is usually abbreviated using the single letter *r*. It is the most widely used measure of the degree of linear relationship between two variables.\n", "\n", "r can vary between `-1`, which means perfect inverse correlation, and `+1`, which means perfect positive correlation. The mathematical formula for it is much less important than getting a good intuition for what the different values look like. To start to get that intuition, let's look at some examples using the [California Housing](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) dataset, whose target variable \"*is the median house value for California districts, expressed in hundreds of thousands of dollars*\". This dataset is provided by the excellent [scikit-learn](https://scikit-learn.org/stable/) library, which is the most widely used library for machine learning outside of deep learning." 
] }, { "cell_type": "code", "execution_count": 38, "id": "0d4d7f9a", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:05.445767Z", "iopub.status.busy": "2022-05-16T23:19:05.445083Z", "iopub.status.idle": "2022-05-16T23:19:07.853081Z", "shell.execute_reply": "2022-05-16T23:19:07.852530Z", "shell.execute_reply.started": "2022-04-19T22:50:43.287033Z" }, "hidden": true, "papermill": { "duration": 2.510087, "end_time": "2022-05-16T23:19:07.853218", "exception": false, "start_time": "2022-05-16T23:19:05.343131", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
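<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>MedInc</th>\n", " <th>HouseAge</th>\n", " <th>AveRooms</th>\n", " <th>AveBedrms</th>\n", " <th>Population</th>\n", " <th>AveOccup</th>\n", " <th>Latitude</th>\n", " <th>Longitude</th>\n", " <th>MedHouseVal</th>\n", " </tr>\n",
" </thead>\n", " <tbody>\n",
" <tr>\n", " <th>7506</th>\n", " <td>3.0550</td>\n", " <td>37.0</td>\n", " <td>5.152778</td>\n", " <td>1.048611</td>\n", " <td>729.0</td>\n", " <td>5.062500</td>\n", " <td>33.92</td>\n", " <td>-118.28</td>\n", " <td>1.054</td>\n", " </tr>\n",
" <tr>\n", " <th>4720</th>\n", " <td>3.0862</td>\n", " <td>35.0</td>\n", " <td>4.697897</td>\n", " <td>1.055449</td>\n", " <td>1159.0</td>\n", " <td>2.216061</td>\n", " <td>34.05</td>\n", " <td>-118.37</td>\n", " <td>3.453</td>\n", " </tr>\n",
" <tr>\n", " <th>12888</th>\n", " <td>2.5556</td>\n", " <td>24.0</td>\n", " <td>4.864905</td>\n", " <td>1.129222</td>\n", " <td>1631.0</td>\n", " <td>2.395007</td>\n", " <td>38.66</td>\n", " <td>-121.35</td>\n", " <td>1.057</td>\n", " </tr>\n",
" <tr>\n", " <th>13344</th>\n", " <td>3.0057</td>\n", " <td>32.0</td>\n", " <td>4.212687</td>\n", " <td>0.936567</td>\n", " <td>1378.0</td>\n", " <td>5.141791</td>\n", " <td>34.05</td>\n", " <td>-117.64</td>\n", " <td>0.969</td>\n", " </tr>\n",
" <tr>\n", " <th>7173</th>\n", " <td>1.9083</td>\n", " <td>42.0</td>\n", " <td>3.888554</td>\n", " <td>1.039157</td>\n", " <td>1535.0</td>\n", " <td>4.623494</td>\n", " <td>34.05</td>\n", " <td>-118.19</td>\n", " <td>1.192</td>\n", " </tr>\n",
" </tbody>\n", "</table>\n", "</div>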
" ], "text/plain": [ " MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n", "7506 3.0550 37.0 5.152778 1.048611 729.0 5.062500 33.92 \n", "4720 3.0862 35.0 4.697897 1.055449 1159.0 2.216061 34.05 \n", "12888 2.5556 24.0 4.864905 1.129222 1631.0 2.395007 38.66 \n", "13344 3.0057 32.0 4.212687 0.936567 1378.0 5.141791 34.05 \n", "7173 1.9083 42.0 3.888554 1.039157 1535.0 4.623494 34.05 \n", "\n", " Longitude MedHouseVal \n", "7506 -118.28 1.054 \n", "4720 -118.37 3.453 \n", "12888 -121.35 1.057 \n", "13344 -117.64 0.969 \n", "7173 -118.19 1.192 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import fetch_california_housing\n", "housing = fetch_california_housing(as_frame=True)\n", "housing = housing['data'].join(housing['target']).sample(1000, random_state=52)\n", "housing.head()" ] }, { "cell_type": "markdown", "id": "2019b08b", "metadata": { "hidden": true, "papermill": { "duration": 0.107065, "end_time": "2022-05-16T23:19:08.070958", "exception": false, "start_time": "2022-05-16T23:19:07.963893", "status": "completed" }, "tags": [] }, "source": [ "We can see all the correlation coefficients for every combination of columns in this dataset by calling `np.corrcoef`:" ] }, { "cell_type": "code", "execution_count": 39, "id": "3a95a2b7", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:08.288467Z", "iopub.status.busy": "2022-05-16T23:19:08.283123Z", "iopub.status.idle": "2022-05-16T23:19:08.296176Z", "shell.execute_reply": "2022-05-16T23:19:08.296946Z", "shell.execute_reply.started": "2022-04-19T22:50:45.696764Z" }, "hidden": true, "papermill": { "duration": 0.128649, "end_time": "2022-05-16T23:19:08.297169", "exception": false, "start_time": "2022-05-16T23:19:08.168520", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[ 1. , -0.12, 0.43, -0.08, 0.01, -0.07, -0.12, 0.04, 0.68],\n", " [-0.12, 1. , -0.17, -0.06, -0.31, 0. 
, 0.03, -0.13, 0.12],\n", " [ 0.43, -0.17, 1. , 0.76, -0.09, -0.07, 0.12, -0.03, 0.21],\n", " [-0.08, -0.06, 0.76, 1. , -0.08, -0.07, 0.09, 0. , -0.04],\n", " [ 0.01, -0.31, -0.09, -0.08, 1. , 0.16, -0.15, 0.13, 0. ],\n", " [-0.07, 0. , -0.07, -0.07, 0.16, 1. , -0.16, 0.17, -0.27],\n", " [-0.12, 0.03, 0.12, 0.09, -0.15, -0.16, 1. , -0.93, -0.16],\n", " [ 0.04, -0.13, -0.03, 0. , 0.13, 0.17, -0.93, 1. , -0.03],\n", " [ 0.68, 0.12, 0.21, -0.04, 0. , -0.27, -0.16, -0.03, 1. ]])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.set_printoptions(precision=2, suppress=True)\n", "\n", "np.corrcoef(housing, rowvar=False)" ] }, { "cell_type": "markdown", "id": "04b9ef46", "metadata": { "hidden": true, "papermill": { "duration": 0.099935, "end_time": "2022-05-16T23:19:08.505179", "exception": false, "start_time": "2022-05-16T23:19:08.405244", "status": "completed" }, "tags": [] }, "source": [ "This works well when we're getting a bunch of values at once, but it's overkill when we want a single coefficient:" ] }, { "cell_type": "code", "execution_count": 40, "id": "911d424d", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:08.710479Z", "iopub.status.busy": "2022-05-16T23:19:08.709293Z", "iopub.status.idle": "2022-05-16T23:19:08.715181Z", "shell.execute_reply": "2022-05-16T23:19:08.714739Z", "shell.execute_reply.started": "2022-04-19T22:50:45.70789Z" }, "hidden": true, "papermill": { "duration": 0.112399, "end_time": "2022-05-16T23:19:08.715315", "exception": false, "start_time": "2022-05-16T23:19:08.602916", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[1. , 0.68],\n", " [0.68, 1. 
]])" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.corrcoef(housing.MedInc, housing.MedHouseVal)" ] }, { "cell_type": "markdown", "id": "a7f0b3ea", "metadata": { "hidden": true, "papermill": { "duration": 0.098461, "end_time": "2022-05-16T23:19:08.933883", "exception": false, "start_time": "2022-05-16T23:19:08.835422", "status": "completed" }, "tags": [] }, "source": [ "Therefore, we'll create this little function to just return the single number we need given a pair of variables:" ] }, { "cell_type": "code", "execution_count": 41, "id": "8565840d", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:09.137395Z", "iopub.status.busy": "2022-05-16T23:19:09.136447Z", "iopub.status.idle": "2022-05-16T23:19:09.139903Z", "shell.execute_reply": "2022-05-16T23:19:09.140273Z", "shell.execute_reply.started": "2022-04-19T22:50:45.716828Z" }, "hidden": true, "papermill": { "duration": 0.108235, "end_time": "2022-05-16T23:19:09.140419", "exception": false, "start_time": "2022-05-16T23:19:09.032184", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0.6760250732906" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def corr(x,y): return np.corrcoef(x,y)[0][1]\n", "\n", "corr(housing.MedInc, housing.MedHouseVal)" ] }, { "cell_type": "markdown", "id": "dfbb3155", "metadata": { "hidden": true, "papermill": { "duration": 0.09923, "end_time": "2022-05-16T23:19:09.337993", "exception": false, "start_time": "2022-05-16T23:19:09.238763", "status": "completed" }, "tags": [] }, "source": [ "Now we'll look at a few examples of correlations, using this function (the details of the function don't matter too much):" ] }, { "cell_type": "code", "execution_count": 42, "id": "fea78fe6", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:09.544032Z", "iopub.status.busy": "2022-05-16T23:19:09.542325Z", "iopub.status.idle": 
"2022-05-16T23:19:09.544661Z", "shell.execute_reply": "2022-05-16T23:19:09.545084Z", "shell.execute_reply.started": "2022-04-19T22:50:45.727485Z" }, "hidden": true, "papermill": { "duration": 0.106227, "end_time": "2022-05-16T23:19:09.545232", "exception": false, "start_time": "2022-05-16T23:19:09.439005", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def show_corr(df, a, b):\n", " x,y = df[a],df[b]\n", " plt.scatter(x,y, alpha=0.5, s=4)\n", " plt.title(f'{a} vs {b}; r: {corr(x, y):.2f}')" ] }, { "cell_type": "markdown", "id": "6e252e37", "metadata": { "hidden": true, "papermill": { "duration": 0.106432, "end_time": "2022-05-16T23:19:09.749801", "exception": false, "start_time": "2022-05-16T23:19:09.643369", "status": "completed" }, "tags": [] }, "source": [ "OK, let's check out the correlation between income and house value:" ] }, { "cell_type": "code", "execution_count": 43, "id": "c3621405", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:09.962327Z", "iopub.status.busy": "2022-05-16T23:19:09.961217Z", "iopub.status.idle": "2022-05-16T23:19:10.156235Z", "shell.execute_reply": "2022-05-16T23:19:10.155794Z", "shell.execute_reply.started": "2022-04-19T22:50:45.735339Z" }, "hidden": true, "papermill": { "duration": 0.307424, "end_time": "2022-05-16T23:19:10.156356", "exception": false, "start_time": "2022-05-16T23:19:09.848932", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWoAAAEICAYAAAB25L6yAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAA/EElEQVR4nO29e3hdd3Xn/Vm6HMm6RbIkHye240uwiUyUmuAXUtNmAtiE2kyB4YEh0zD12+bNzEtvtKTjFvdtO7RmyjuG0nnpZWhgVBpKCbiZ8uBAIgNuoGqgdjAWWL4QK8ZWoqOLpUhHsnR0+b1/7LO399na5yadyz5H6/M8fnzO2fvs39pb0nevvX5rrZ8YY1AURVGCS0WxDVAURVFSo0KtKIoScFSoFUVRAo4KtaIoSsBRoVYURQk4KtSKoigBR4W6jBARIyKvKrYdxUJEXhSRvcW2I1+s9p/vakaFugjEBSUmIm2ez78f/2PckoMxukTkj1d6nHwgIifj5/lTns+fjH9+fw7G+EMRedzn86KJnYh8XUQ+4vP5O0RkUESqimGXy45dInJaRKbj/+9Ks//7RKRPRKZE5AUR+VnXtvfGt02KyDkReWe+7S9nVKiLRz/woP1GRDqBuuKZU3AuAv/RfiMircBPA8NFsyj//A3wkIiI5/P3A583xsznY1CxSPm3LiIh4B+Bx4GWuK3/GP/cb/99wMeA/xNoBO4DLse3bYgf57eAJuC3gb8TkXU5OaFViAp18fhbXEIF/CLwOfcOIlIjIkdF5CciEhGRvxKRNa7tvy0iL4vISyLyS8kGEpEtcU/yF+PHGhGRw67tlSLy4bhXNBn3pjb5HOdrIvKrns9+ICL/Li4GfyoiQyIyISK9InJXivP/PPDvRaQy/v5B4Ekg5jp2hYj8TtyuURF5QkTWura/X0SuxLcdJkvi1/eT8ev3Uvx1TXzbQRH5jmd/xxsXkf1xT3FSRAZE5FHXfm8XkTMiMi4iPSJyd3zT/wZaAbfn2QK8HficiLxeRP4l/r2XReRTKYTyP4jI2RTndlJEjojIPwPTwLY0l+N+oAr4pDFm1hjzPwAB3pxk//8KfMQY85wxZtEYM2CMGYhv2wiMG2O+ZiyOA1PAHWlsUJKgQl08ngOaRKQjLlbvw/JC3PwJsAPYBbwK2AD8PoCIvA14FNgHbAcyic3+DPBq4C3A74tIR/zz38ISyv1YHtAvYf1xe/kCiU8BO4HNwHHgrVhe1Q7gFuC9wGgKW14CzsW/B9ZN63OefX4NeCfwb4DbgDHgz11j/yWWN3oblgBuTDGeH4eBe7Gu708Brwd+L8Pvfgb4T8aYRuAu4Jtxu14LfBb4T3Gb/ifwFRGpMcbcAJ4g8Qb9XuC8MeYHwALwm0Ab1tPFW4AP+A1ujPk7Y8zdfttcvB94BMvjvSIiXxWR30my72uAsyaxp8TZ+OcJxH9fdwPtIvJjEbkWv6nYTsQpoE9Efj7uBLwTmI0fT1kOxhj9V+B/wItYwvp7wH8D3gZ0Y3k0BtiC5c1MAXe4vvfTQH/89WeBP3Ft2xH/7qvi77uAP46/3hLfttG1//eA98VfXwDekYHdjXGbNsffHwE+G3/9Zqxwxr1ARZrjnAQeBh7CEv87gYvxbdeA++Ov+4C3uL53KzAXv06/D/y9a1s9lje+N/7+D+Pvxz3/3NfoBWC/6xgPAC/GXx8EvuOx2/3dn2CJcZNnn78E/sjz2QXg38Rf/0zcjtr4+38GfjPJdfog8KTf+Bn8rE5iebyZ/k7+P+7rGf/s88Af+ux7W9yWU/GfSVv8PI649vllIArMY930DxT7766U/6lHXVz+FvgPWKLg9SbbsWLWp+OPwuPA1+Ofg/XHctW1/5UMxht0vZ4GGuKvN2GJVkqMMZNY3vP74h89iPXHjDHmm8CnsDzeIRH5tIg0pTnkP2AJ/K9iXQsvm4EnXeffh+V1hvGcvzFmiqUe/BPGmGb3P8/220i8blfin2XCu7GeQK6
IyD+JyE+7bP6QbXPc7k32cY0x3wFGgHeKyB1YXvzfAYjIjrjXOygiE8BHsURwuVxNv4tDFOtpyk0TMOmz7434//+fMeZlY8wI8Ams64FYmTf/L1Y4JYT1RPSYpJmcVJKjQl1EjDFXsCYV92OJlpsRrD+I17iE5hZjjC2uL2MJgM3tKzDlKpnHD78APBgXplrgW/YGY8z/MMa8DtiJ5eH/dqoDGWOmga8B/zf+Qn0V+DmP2NYaKxaacP4iUocVasiGl7CE1eb2+GdgPTk4k7sist5j+78aY94BrMOKPT/hsvmIx+Y6Y8wXXF//HFb44yHgaWNMJP75XwLnge3GmCbgw1hPVsslm9aYPwLuFkmY6Lw7/nniQY0Zw3rycR/f/XoX8Kwx5pSx4tf/CnyXzMJzig8q1MXnl4E3xz1CB2PMIvDXwJ/as+UiskFEHojv8gRwUER2xkXqD1Zgw2PAH4nI9vik4N1iZWH48RSWuH0E+GLcTkTk/xCRN4hINZbIzQCLGYz9YaywwIs+2/4KOCIim+NjtIvIO+Lbvgy8XUR+Jj7h9hGy/33+AvB78eO2YYVT7HmCHwCvEStlrRYrlELcjpCI/IKI3GKMmQMmXOf618B/jl8LEZF6ETkgIo2ucT+HJVr/F1Z2hU1j/FhREbkT6wbmS3yy88UszzcVJ7GeVn49PslqTxp/M8n+/wv4NRFZF58Q/U3gq/Ft/wr8rO1Bx+P2P4vGqJeNCnWRMca8YIw5lWTzIeDHwHPxR+ETWJOBGGO+BnwS6w/pxyT/g8qET2AJ/zNYQvEZYI3fjsaYWSzvfy/xR/Y4TVgiNYYVQhgF/nu6gY0xL8XDAX78GfAV4BkRmcSagH1D/Hs/An4lbsPL8XGvpRvPwx9jxVnPAr3A8/HPMMZcxBL/E8AlwGvj+4EX4z+X/wz8Qvx7p7AE+FNxm36MFdpyn/OLQA9WXP0rrk2PYoXCJrGu5RdT2L4JKy6cMWJl7XzYb5sxJoY1cfsfsWLovwS8M/45YmUFfc31lT/CEuSLWCGp72PNWWCM+SesG9uX4z+3Y8BHjTHPZGOvchMxRhcOUJRSQ0SeAX7DGNNXbFuU/KNCrSiKEnA09KEoihJwVKgVRVECjgq1oihKwMlLt662tjazZcuWfBxaURSlLDl9+vSIMabdb1tehHrLli2cOpUs40xRFEXxIiJJq4s19KEoihJwVKgVRVECjgq1oihKwFGhVhRFCTgq1IqiKAEno6yPeJeuSazuWvPGmN35NEpRFEW5STbpeW+KNwhXFEVRCkhRl6dX4OjT5zneO8iBzvU8+sCdzucnzkXo7ovQ3hBiOBpjX0eYvTvDCdvsz7zv/fZZLqnGAnzHOHEuQldPPwAH92xNaqP3+H7Hs481Go3R2hDi4J6tAEuOn805eG3ctak54Rrn6nqmOje/a7QcUv2erIRszjff++b63PJFsr/lXJBpjNpg9QQ+LSKP+O0gIo+IyCkROTU8PJw7C8uc472D1FZVcLx3MOHz7r4IddWVHO8dpK66ku6+yJJt9mfe98k+Ww6pxko2RndfhMjELJGJ2ZQ2ZnI8+1gD4zec4/kdP5tz8Nrovca5up6pzi3bc0g3ht/vyUrI5nzzvW+uzy1fJPtbzgWZCvXPGGPuAX4O+BURuc+7gzHm08aY3caY3e3tvlWQig8HOtczM7/Igc6ElZ7Y1xFmem6BA53rmZ5bcLwy9zb7M+9792ftDSEOHTvLiXPL+yVPNZbfuPY+4aYawk01KW3M5Hj2sTY0r3GO53f8bM7Ba6P3Gqe6npmMl8m5ZXsO6cbw+z1ZCdmcb773zfW55Ytkf8u5IOt+1CLyh0DUGHM02T67d+82WkIeDA4dO0tddSXTcwt87N13F9scRVGSICKnkyVqpPWo42u+NdqvgbcCP8ytiUq+WI4nqChKsMhkMjEMPBlfnLgK+DtjzNfzapWSM/buDPYEjKIo6Ukr1MaYy8BPFcAWRVEUxQetTFz
FnDgXWdFEY1DGUJRyR4V6FZOrFL58jpELodebhVLqqFCvYgox0bjSMXJxMynEDUlR8olWJq5iCjHRuNIx9nWEE6r7inUMRSkmWedRZ4LmUSuKomTHivKoldWJxnUVJTioUCu+aFzXQm9YShBQoVZ8CVJFYzHFUm9YShBQoVZ82bszzMfefXcgqhqLKZZBumEpqxfN+lACT7qsjVz13vZDS/CVIKBCXabkU7wKTTqxdHvcpX6uiuKHhj5KmFSx29UUW9XwhFLuqFCXMKnEOFvxKuXshiDF0xUlH6hQlzC59CRXkweuKKWGxqiTEMQYr99in8lWbck2bqtl1ooSXNSjTkIQPcxsFvvM1tvOVfiglEMoihJUVKiTEMQJqmwW+yxW3DaINzg3eiNRShENfSQhiPmzQbTJS9BDKLlI5QtiWEwpb9SjVnxZrucZ9AyMXDwpBf2pQSk/VKgVX/IlRsUOPeTiRhLEsJhS3qhQlzD5FL18iVE5eKNBf2pQyg8V6gKSa2HNpeh5bcuXGKk3qijZo0JdQHLtTZZiwUu6G0CxQyOKEkRUqAtIrr3JXHq9QfF0yyE0oii5RtPzCkiQ0+uCYNuJcxEGxqYBOLhna1FtWSmawqfkEvWolcDQ3Rdh+7pGNrTUlby46ZOBkktUqJXAEJTwSy4op3NRio8YY3J+0N27d5tTp07l/LirBX1sVpTVh4icNsbs9tumHnUA6e6LMDIxw5Gn+jT7QVEUFeogsq8jzOXRaba11a2qGGepp+aVuv1KcFl1WR+lEFaw7cqkuVEpnE+meCfgSu28dO1GJV+sOo+6VGbjM82RDuL5eD3LTD1N9wRcpudVDC822Zg6gajki1Un1OX2xxTE8/HzjDMRXffNKdPzKsaNKtmY2gNEyRcZhz5EpBI4BQwYY96eP5PySxAKO3JJqvMpVljE25N6OT2qM/05Far/tftaBr3ntlJ+ZJyeJyK/BewGmtIJtabnLZ9ciuuhY2epq65kem4h6dqKpcKJcxG6evoBq2qx0DfbcrqWSjBZcXqeiGwEDgCP5dIwZSm5fJQPYlhkuXT3RYhMzBKZmC1KPL6crqVSemQa+vgk8F+AxvyZokBuH+XLKcyzryPs9AEphliW07VU8kM+Q41phVpE3g4MGWNOi8j9KfZ7BHgE4Pbbb8+VfasOFQR/9LooQSef6ZmZhD7eCPy8iLwI/D3wZhF53LuTMebTxpjdxpjd7e3tOTVSCQ75TIc7+vR53nT0JEefPp/zYytKvslneCytUBtjftcYs9EYswV4H/BNY8xDObdEKQnymQ53vHeQ2qoKjvcO5vzYQUerGkuffKZnrro8amVl5NNrONC5npn5RQ50rs/6u6UudEEsXFKCg3bPUwpOPiZdSj19rpxaASjLQ7vnlSil7iUmIx/eY6mnz2lVo5IKFeoAU66Pw/kQVRU6pZxZdd3zSolyLVVeaaqdhgmU1YYKdYApdO5wqQigthNVVhsa+lhlpIp7l0qopdTj0YqSLSrUq4xUYlwqAqjxaGW1oaGPVUaquLdfqKVUwiGKUs6oUBeQIIieW4wzsUfjwYpSfDT0UUCCFgP2s8cbwy6VcIiilDMq1AUkaKLnZ49XvDUerCjFR0MfBSSbdLtchUlSHcfPnqDmbgchbKQoxUI96gKRbTl4rsIk2R4nqB500MJGilJIVKgLRLZCk6swSTbHyfZmks3+K+1bErSwkaIUEg19FIhsQworqUr0hgkyPU62GR7Z7L/S7JFMzqPYC+AqSr5Qj7pAFDKkkM57T+bdZuu1ZrN/ITziYi+Aqyj5QoU6DbloNbrSY2S7RJWfKLptSJaW5/b4M7F3786w86SQyb6pblS5uM77OsKEm2oIN9XQ3hAqyxaxyupEhToNuZjE8h4jnSh5t9tLVH3p1DUeeuw5HnrsuZQC5CeKbhvSpeVl45Fne32SnXsurvPenWEef/heHn/4XoajMZ18VMoGFeo05OKRvb0hxMmLw7Q
3hID0otTV08/zV8aceKu9RNXa+tCyH+3d5+En5O7t6c45neinItm55zo0opOPSjmhS3EVAO8yUelygh967DkiE7OEm2p4/OF7nc/TTZYVKtd4JeMEIR86CDYoipdUS3GpUBcAWxjaG0IMR2NpBcIbL85UVJa7bmA64Uq1PVPRC5I45nJ9xSCdl1La6JqJRcYONWQaN3WHJrKJ3S73cT/dGKm2u7elir17wznFJJdhES3EUQqBCnWOSSVWyxGIbL6TbWaF/b69IeSM4Wd/Khvc2wolWivNEMllqqTGwpVCoAUvOSZVYcdyiljSfSebR2+vbfb74WjMiZ0feaqPbW11CfanssG7LVlRz8E9W3PWQ0RbryqrDfWoc0yhPayVhEa877v7ImxrrePyyHRS+93erNezTeWplqsXq6EPpRDoZGIOKMaEkneCMtOJykyOmeoY7ok4wHdSrtDXo5hZKDqZqOQKnUzMM4X0qmwvtqunPyFskYsCj0y83kzyrXNVBJMpK7n+K/3ZBbXboFJeqFDngFw+iqcTLVtYgJRhjHyN7xamZCKVqyKYTFnJuQcpjKIoydDQR8BIl+Ob70ftXOYYZ4qGDxQldehDsz4CRrp2qLnMAlnO+NmQqS0raemqKKsBFeoCkI14rlS0Mkldy3Z5rnzZstybinrgympDY9QFoJCTjZnEXAtVJZhNc6ds0JQ4ZbWhHnUBKOSCsblcGSYftrjHWO51CeoCvIqSL3QyMWBk28BpJWN4mz7ZnupyJxKzzcMu1GSlopQCmkddQthiebx3MKNmR6lI9r1kTZ9WmqqWSUhC0+EUJXvSCrWI1IrI90TkByLyIxH5r4UwrNRZrrjaQnagc33aZkeZ5lxnKpwrLd7IRISXM0YululSlFImkxj1LPBmY0xURKqB74jI14wxz+XZtpJmuY2DksWY3aEKO8QwMDbN9nWNScfIJJabyyyPfKXZaRMmZbWTVqiNFcSOxt9Wx//lPrBdBuRioswPvw51ftWJ6b63XIqdDqeTh8pqJ6PJRBGpBE4DrwL+3BhzyGefR4BHAG6//fbXXblyJcemBp9CTZQVWjhXcl7FFnlFKRVythSXiDQDTwK/Zoz5YbL9VmvWRzaitJLlrwpp50r3t73/S0OTbGipy5tg6w1BKXVylvVhjBkHvgW8LQd2lR3ZTJStZPmrlWAvDjAyOZPxsbOdALRt7+rpZ2BsmktDkwB5LVLRIhilnMkk66M97kkjImuAfcD5PNtV8qTLVEiXIbGcNLZ0Y9oi3RiqTLk4QCbHSoVtO8D2dY1saKnj4J6teU3L07Q/pZzJJOvjVuBv4nHqCuAJY8xX82tW6eOXqeB9PE+3one2j/DpsiO6evqZnVtganaej76rM+Xxs8m08LN7785w1uebjnz0KNGQiVIKpPWojTFnjTGvNcbcbYy5yxjzkUIYFmQy8Tb9PLx8hzsy8Srra6rYEW5IK0qpjnXiXISHHnuOhx57zhE6P7vThUyyPd9U+y/3CUBDJkopoJWJyyCTP24/kfKKn1dcVvr4nk4YD+7Zyj2bWzi4Z+uKjtXdFyEyMcvl4SmOPNWXsIp5NmR7vqn2X67gashEKQW010ca/B6Nc/W47Jf2lqwPh3uco0+f53jvIAc61/PoA3eu4OyS4z1Hr11dPf1cjETpvK2Jtqbaovft0BCGUupor48V4Oep5WqdvHThkWRe4vHeQWqrKjjeO5jy+CuZEPSO7Y1ZH9yzlR3hBl6emGFgbDon5d1ee7OxX9cuVMoZbXOaBL8udvk4vtcD9Fbh+VXkHehc73jUqUg2IXjiXMTpRb1rU7Nvlz6vHfb79oYQh46ddcrXT14cZvu6Rrp6+tM+CaTDa6/fzSJfud+KEmQ09JGEXFcZeoWjEFWMycTq0LGzPH9lDID5RcP9O9oztsO22y5gsW9ktnDbaXnLObdU4ZZsW7BqO1Wl1NA1E5dBrvtLeL3F5RzfT8hsz/jgnq0Zp6zt6wg
zMDYNJHrUmWDb7R0vWWx9uefmZ382x9T+IEo5oR51gcjFggAPPfYckYlZwk01PP7wvQme8T2bW1bkOWZj30rCCqm8fPWAldWMTiYWgHQTX/Zk13A0lrO83X0dYcJNNYSbalbsOXb3RRiZnKGr5wrPvTDCB794hqNP+xegdvX08+yFIQ4/2ZuzvOVs0uS0P7Wy2lChzhGZ5vGuJG/XzoPetamZQ8fOAvD4w/fy+MP3AqxIvPZ1hLk8Ms22tjquXr/BojE8cepa0v2nYgtUVJBwvsstBILc9klRlHJDhToLUglRJgKcq0yEM1fHE4TKabQ0kXmjJS97d4Y5vL+Djttu4bbmWipFaGsI+e57cM9W7t54C3e0N7CvI+xcl66e/qwKgVa6Co6dgaKetVLulOVk4koFMdn30/XvSBZbda/IUlNZwZGn+gCW1ctjZHKGSxFrHQe7wrC7L8K2tjouj0zz4Bs2Z32+Nt4eHbYQeq+Dd5LPjp1XVQgbWjJ/Wsimn4ifnXZcO9vvK0qpUZYe9UofjbOJo2Yylr0PwOVRK7yQzjY/b9MOT9x1WxMbWuoccdrXEaatsZbD+zt8Gx9l63UuN57e2hBKGb7w9ghZafm2ln8rq4Wy9KhXmpqV7Pt+6W7pxjpxLuKkwrk94GTNjrx5w25v0f7f+/1UneOW67Vmcm42B/dszWg/u0eI/XqllYT5WqNRUYKGpuflmWzSztzFJDZ++dHJSNWXZCVpgX7jpMrfTmZXe0OIM1fHM/6eoqwmND0vCcnCArlM/8rm8XxfR5hLQ5NcikSZuDHHwPhMVmOl6kuSy7RA2zOOTMxmdDzbruFozMlSyebmoxOGymqnLEMfmZIsLLCScIGXbB7P7f4WNVUVfLd/jDdsaUlqg59XmypU0d4QStsfJFl/E6+X7q5szPQGtNxQVC5/FopSqqxqjzqZt1uoSapkE4ZtjbUc3LOZtqbapDY4PaGHrJ7QJ85FfHOR7TG+8L2fcPX6FP/7+wNJ7bFF8XjvYNoOfvbyWqnE0x4bWHY8WicMl48+jZQPq9qjTubt5mqSKl2aoNdbTLa/3+e2V9s7MYPB8PFnLiRNKRyZnGF0ao4qgeFoLKm9tud7oHO9E0vetanZeW3fDLp6+olMzDIwNp3S7lx4wzphuHz0aaR8WNUedT5wezHpUve8hRvJCkaSxZ4ff/heOjfeQkNtNdenYs7K395VYy6PTNPeUA0ivOXO9oRjH336PG86epIPPH7KEdtHH7iTDS11bF/XyHA05rx2jz81O8/FSJQT5yJ8/JkLfK33ZT7+zAXf81NvuDjo9S8fVrVHnQ/corqvI+zEkW1v1NtlDuCb54eYWzBUV0rSUEyyGO+uTc0c7x3kdZubE1qM2l6UN6XP61nZixA8e2mU97xuY9Lufu7XB/ds5chTfXS0Wvng16diCHB9KtFbV2+4uOj1Lx9UqHOMu8G+7YHa3qg9WTgyMcORp/rY0FxLTWUFLwxP0VoXYtu6et8UvlR/cMPRWEI/ab8bQbLvnzgXobaqgutTMe7b3prg3XsrLb2ViXBTvFNNVGYTzgkKQbZNWZ1o6GOZJJuo8abDAQle8r6OsFOdCFal4qvDDdSEKjNadNaL+/E2W4Hp7otw77ZW7r9zHX/x0G72dYQ53jvIyGTqniHecR594E6+9ej9Ces32lWIh5/sTehBkk1fkGKhTZ+UoKFCvUy6evp5/sqYE9rwYgvowT1bHc/00LGznLk6xobmWl5+xcqRPtC5no7bbuHw/g5nn2zyut2ZHl6BSTfr741hdvdF2NZq9QxJFdd0j5NsDDsrpUKsm5F7DL8bWJDQ2K4SNDT04UM+mjrZAvXEqWvcsqaaV27M8YatrQxHYwlCni6v236frPrQnaGRbtbfGxKxwzYPvmFzwrE/0X2BkWiM9+7eyK5NLQkl8fYY3jUTB8amqaoQ7ljXkJDGZ49hrywTRDS2qwQN9ah9yOTR1+4N7Reu8Pu+7aX
ZrUPbGkJLvDZvGMP2VN2f2zHuDz/Z6zQ3clf+uTM0sm3G746t2x5yd1+Eq9dvMDu/wPHeQbr7Imxf1+g0hbLHgJuTmB9/5gJnr70CsKQKMR+VkopS7pSdR53KG860R0WyLAvv95P17rAn1zrWNyS0CU2VcwyJnpzbu3YXi5y5Osbxsy9TXSm8MBS1si/WN/C9/igHOteza1NLwvHdfZ9TPSG4i13u39GekP3Re22ckWjMmSx0Txz6ndfJ80O+WSDu6+htVKUoSnLKzqNO5Q1n2qPCFqiunn7Ha/X7frL4rJ2J0TcYTdp7I92jdTJveDga4w1bWqiurGA6tsBsbIFTV8a5f0c7w9GYc3wgIZ97ZHLGqWBMNd6BzvUJ2R8AT/3GfXzv8F4efeBO59zcYQvvzec9uzeytqGG9+ze6DuW7ZXbr7VyTlFSU3Yedaqc42x6VNiibL+2xXtgbJrRaIyBsWm6evrZvq4xIT7rzkE+0Lk+4xW+vWKXLE5qH/ujb9jMJ7ovcPX6DW5ZU7VE1O3qwbNXx2ltCHExEqXztibHVm83Pe94yeLl3utrry5j99i2s0DcGSDJzgFYMoamxinKUspOqFNNBGUzSeQn6t6VRS4NTTI9t8D1qVhCSbV3HDtVDZKHXDIt93Uf++PPXEAEGmurk4Zhrk/FuHdbKwBtTbUMjE37hjj8zt/vhuc9N3emyIOvz2x1Gb9wSbbXQVFWE2Un1OlI57G5t9tZDd5ttjdqi669FFWy43f3RXhhOMrU7AJdPf1ZCWMqWhtCzC8aWn3WNrRt99pq25fO28/0puaXKZIpfmOspNOeopQrZSnUqZrlp0t18253e3fu7Aq7CvDQsbNOqpk7K8O9LqI9wVYfqlxiY7pQh9++NqlWVvF6rd7Pg0rQ7VOUYlCWQp0sgwESPTa/x+xUPS78trmF28Ybs927M2xla/QOsmtTc8J3k8WMvcty2ZOBcLOE2y1qycTc3enOfT5+GTGp8rOTrQ6joQpFyT8lL9TJWoAme7z3i7Gm2u59nWpizb2/9/PhaIxtrXVxsW5ZMqHmvam4xbljfQPf7R9LEH8v6QRzNBpbcgNxX7+BsWlqqhJXSE91w0t1DXKBTioqyk3SrpkoIpuAzwFhwACfNsb8WarvFHLNxGzWJMwXmTQeAsvTbqypZHJ2wVkx/AOPn+LZS6O8OlzPq8JNCR71kaf6rIm60Wnnf7+Vxv1s8HrDA2PT1FRWLDmGe2J0YHyGba11tDXVJjR4yuV6i5leuyD8XBWlkKRaMzETj3oe+JAx5nkRaQROi0i3MeZcTq1cJkGYfErmzbo/t8XGFmvbc+0bjLKpZQ3XxmZ4VbjJ+a59nK6eftZUVzC7sLhEpDOJxbvj6e5JP/v95I2YUyxzcE9LwrUsRLw42bULws9VUYJCWqE2xrwMvBx/PSkifcAGIBBCnW8xSRW79eZNe0XF7/MNzbVcikS5K57TXFtVwUuv3GBufpFvXxhesmoKwBu2Wi1I/cIV29c1po3F+10nWyC/1x9NKJYpdJgh0zRARVnNZBWjFpEtwGuB7/psewR4BOD222/PhW2BwM/j836WTFT8xNGuyLNzmu/d1srJi8PMzi0wOTO/ZFxY2mXOniCsqhA2tCxkFIv3spyiHMi8DD9TVJAVJT0ZC7WINADHgA8aYya8240xnwY+DVaMOmcWFhk/j8/+zC6zzjSGa3+vqbaK7/Vfp2N9g1O2bXe9O7hn65JeGMmO2doQWnb8NplAphNiv4pNRVHyS0ZCLSLVWCL9eWPMP+TXpNyQq6wBtxdtv/dWKKbKivCLIx95qo/aqgr6BqN869H7l4x56NhZtq9rTAh3uLGX37JT/bzjrcTjTSfE7Q0hJm7MsbY+pPFjRSkQaZsyiYgAnwH6jDGfyL9JucFbuJLrY9le76WhSTrWN3Dy4jDtPhWC7hQ3+xgHOtczOhVjfmEhoemTTar2pCfORTjeO8i2trol/ZztTJEXhqJ
cHpricLwV6tGnz6dcQMA7driphnBTje/4Z66O07SmmtaGkHrTilIgMume90bg/cCbReRM/N/+PNu1YrLpxezF7s1hi+i+jrCVvjY2ndBJr6aygoHxGSZm5pd0lPPaYXel29dhNS16053rqK2uStvJz0tXTz+zsQV6ByaWnJvdd2PRwLwxxOYXOX1ljMe+3Z+wHFYq9u60Vjf39pFWFKV4ZJL18R1ACmBLTlnJJJX38f9j7757yQSiHcLY1lbH7Pxi0nUL/aoHIXUnv3Rl7vW1VYSbanyLT7p6+tkebmDXpma+dOoaM/MLrKmu5PLoNA++IbOmSX7YtrvL5RVFKQwlX5mYD/xE1D2p6NfYyBZNu0GTnWbnxl1teHh/B48/fG/S8ZOVj6fr7+HOnz7yrk4nXr1rUzNdPf109fQvO3btVy6vKEr+UaH2wdvQ6MzVsQRBPnTsLCOTM3yv/3rSSsHRaMxZVQVwJhS/13+dba3JS8Hd49vYFYq2d+/20L0Th26Rdx/n0LGzvpOEmVRVpsoVzwVaLq4oqSm7FV7S4Y0/p8JvIhAsj/fyyLQjnvZx7U5692xuobUh5HzH7Y0e3t9BW1NtQuP9VBN9e3eGOdC5nsuj0wmTlSfORTj8ZC+nr4zxwnA0ISvFbwUZ9yShnVboXQDXbYs3/GIfF/xXSl8JuZz4VZRyZNUJtR1/dk/iecXy6NPnedPRk/w4MsHJi8NOvrN7eSq34NrZFiMTMwnx20tDk+zrCCdMRsLN0IbtET97cYjDT/b6it+JcxEr1jw37+Ra2+dREZ85WFyEyRsx3nT0JEefPp/03De01HFwz1aGozEnnNLuWmTXLZjJJmPzIarLnfhNd5NTlHJh1Qm1nQdcVSFL2pXa4nO8d5DaqgouRKa4f0c7E/GKwW/0DfH8lTEn3GDT3RdhW5vVNMkWPPeagGCJZE2l1Z2uq6c/Ybzx6XlemZlLOO4HHj/FXX/wNIeO/YCJmTnGp+eccMqJc1YYZdHAltZ6jryrk77BKLVVFRzvHQSWipg71t3eEHKeCOyY896dYdobQk6aYSrPfLnZNMlINlY61BNXVgurTqiHozH2d97K3ZuaE/piuNPvDnSuZ2Z+kfu2tzpLbT1/ZYyXXrnhHMfrfbY11jrxalvM4ObCA+0NIf75hVHGp2a5PhVzxtu1qZnG2irWrknMwX720ijVFcL49BzNa6ppqq1OCKfY+cxgpey9Mh2jf2SKjvUNS+yzz9Etzt4QjH1t7DTDUvBW/dImFaUcWXVC7ecR7t0ZZkNLHdvXNdLdF+HRB+7kW4/ez188tJuPvftu1tZbInrbLWu4Z3MLB/dsdUSi99o4XT39CRNhtod4cM9WZ6zhaIyWumoWDM7xXhiO8qVT19i9uZmaUKVTaXjiXIRQpTB+Yw4BxqbneN3mZnZtaubkxWEmb8S4FIkSnZlz1mu8MbdAqKrC8f6957l3Z9gR5/aGUMLEoC3I7u8k81aD5MV6f26KUq6suqyPZPnVqbIa7JLtB14TTlhd2xvv9h7XPdaZq2MsGrhljXXJR6MxpmYXqA9V0jcYTSiY6erp58bcAhUCCwbEGE5dGadxTYj7d7Rz8uIwd93WxOXRaQ50rueb54eYX1iktqrSd2zvZ+4VxuGm1+8NP/hdj/aGEMd7BznQuT79xS4A2g5VWQ2sOqFeDu6QgJt9HWH++dIww9EY29fVpz3G/s5bOXlxOB6/nnQWpd21qdmZKHQ/wlcIGAPzC4ap2XmnX0jH+gZOXxlnbX2IXZtaGI7GuLWplsuj0xzcszVh3FQr4Lhj9G6hS5Uul+xa5JtkNmn3PWU1EHihzkWObSY9pb2P9O78ZLviz/6eO8RhVSfW0zcY9W3ABCR8ZvcFOdC5nkcfuDOhWtF+hHeL7Wg0xoujUywaw5mr4zz+8L0cOnaWpjXVzC+ahMIY90rgqZbY8oqbX2FOsmW9iuXBprJJUcqdwAt1Lv5A/Y7hVxJuC1BXTz9
nr71CfehmSCBZWfeBzvU8ceoabQ0hq3zb08gfYGRihq//8GVqqipprQ8leKTdfdaq5ZciUeBm4YpbcA8/2euk4tnNoKorxelg5+dV2va+ODLF4MQsG5prM76GqcS4WB6shjiU1UzghToXf6B+HnGqFVC6evqpr6lkcdG/hPzjz1zgxdFpvt77Mh9/7y6GozFn3cHpuYUEr3nXphaOPNVHbH4RgJfGb/BU78usrQ85aXb/+P0B1sQXCfATwe1hK5PDLh+3W6DaNnkrJ93nPBKN0VRbxejUXNJrmKo/SVAIok2KUigCL9S5+APdu9MSrReGohx+sjftcd39NPz2eWn8BjNzC8wvSELowQ5ZfPjJXqoqhDNXx53Jx090X+DFkSnmFg0VIrwyPef0pRaBmfkFX1vcwux+GnBnZthtT73hje6+CBM35rg8Ms3BPYlrJdrn5l5EV8MKihJMAi/UuWQqtkB9Tfowil/HO3t9wu6+CBta1nBjaBIDTnGIuynT5Mw8BsM26p3jAXzwi2dorBYmZ+a4MbfAjsYGIhOzNK+pdvKu3TFw8H+iGBibpqun3+lkd6BzPcd7BxMWzbWfGgB+/S07koZ8nGKdkZV111MUJX+sGqE+uGerE/5Il+Hg/swWNri5dqHT4rR1afN+gOa6KhYXrWyOhx57DrAmBasrhOnYAtvDjdx6Sy2XR6Z5z+6NDEdjCTeCVFkNdkogkNDJzg6xNIYql3jWdvGKbbtb+J2JyNdvzrk3rc2WFCU3iDG5X95w9+7d5tSpUzk/bj6w25KGm2qctqN2nrE7DuyXyubN8PCGFrr7Inz7wjCTs/NUVMC6xlqqKwVj4PpUjPfs3uiERtKJmjujxL2+orfjnd0ju62xNqGJkn0+hWxRWqxxFaUUEZHTxpjdfttWXWWimxPnIlyMRBmZnOViJOrkMNsVepM3Yk5vDG8Knx1aON47yMjkzdVT3BOV7Q0h5o2hsbbKqWpcWx9iftHQtKY6wRt3d6fz6+7n7sDntwKLu0d2W2NiaXg++nNkQrHGVZRyY9WEPvzo7ovQuaGJ7/aPcc+GJkeA7X9vOnrSaXT06AN3+nrX21rj8d3Xb14Szx6Oxvioq3l/e0OIgbFpqiqE1obExWGPPn2e472DzC8s8MqNeSc10JvFcX0qxkOPPZfgTduedGOokm9NzrIj3MCZq2NLPHT3zaQQ+E3YajhEUbJnVXvUdjOlg3s2L/FCT5yLUFtVwfWpmFMu7e3ytq8jzOzCIhuaa4GbXu9oNJbQhc7uR/GlU9eITMzS2hByYua252x37BuejFmpgSYxlm6PefX6DS4PTyX0tnB376uqECITs0t6aHf19Pt2/is0QeoVoiilwqr0qN1eXbLYaXdfhHu3Wd3z3P093HjDIbbH3doQcjxquDlhZ4c97OO/MBRlKrZAV0+/k7nx5jvbaVwTchonuXOkARprqpianXc6xrlDLQf3bHbi15msbVgM71YLVxQle1alUCerMkyXFufFrhIcjcYQsdLm7Fxqv2IatzACnDw/RH2NlVHy6AN3JtwQ7EnOiRtz7O+81cnT9qYKZlqg4rfWol/FZr7FWwtXFCV7VmXWh1/6nZ2Z4LcOYbrvnrw4TFW8xvuezS0pMxy8Yp3s9Yef7KVKhFvqrD7Uo9GYEzI5c3XM6WC3a1PLsoXVT5Q1U0NRikOqrI+SFepceX7e4xw6dpbnr4wB/qLrFjK75WfH+ganD3SyFb69E41eIXQfF2BkcobLI9Mc3t9Bd1+E56+MEZ2do7a6ig3Ntc4xgJwKq072KUpxSCXUJRv6yFU3Ne+j+L6OcMLahl7cqXd26XbjmhB/8ZDv9V3Sxc7dfMnvuPs6wpy5OsbJ80POAgP7OsL0XhtnbGqOHeEagIS0N7/OfstFQxOKEjxKNusjXzm6e3eGl+Qpu5elsjM/hqMxtrXW8cOBiZRLQbkrGy+PTHPXbU1saKlbcmzAySgZjsZoWlPNeDyPG6B
zYzNvvKOVydkFZyUY215d5URRypuSFWpvqpyX5az5l+w7fill7Q0h/uXyKEOTs1wdnU4qkvYNZdemZjY01zK7sOjcXNyrl9upeg899hztDSHCTTUsLsK2tjrH025rstZltLv12WNme9MqhfUQFUW5ScmGPtKxnNBIV08/kYlZBsamE7I03GXbR58+7/TmmIun2v1k7Ab/tiHke0z38lc1lRVcHp1OsNFuiFRbVcHA+A0qBQbGZzi8v8OZNOy4tcm354dfi9Z8XRtFUYpH2Qq1X3pdugZMANGZOSZuzDmf22XbG1rqnJai9+9oByDcWMPAKzPUVVXwzfNDS3KYbYEfjcaYnJljOBrj1eGGhBJ0uyFSV08/4zdijE3NcU/ciwZ8l71KJsx2daO9ekw210ZRlOBSslkfy8Ev9czbgMnuitfWVOtbMj55I0bfYNRJjfvQl84Qm18kVFnBuiarQnF+0TiL0FqVgjMYYxWr1IQqOby/I2nan98yXn4ZGH43HbvkfWZ+kW89en8+L6WiKDmmLNPzlkMmayemS09zC/vA2DSn46l8W1qt3tPXp2I01FQSmYzx6nA9a0JVTkHM2vrQkoKY5YYe/Lr+HX36PF86dY219SE+9NZXpzy2puEpSrAoy/S85ZCsOdHA2DQff+YCXT39HNyzNWk+sl2JCDf7W7fUWWXhH3rrq51QyZdOX6N5TRUXIlN88t/vWiKEtth39fRnLZa2wI769MF+9IE7l0w0Jju+xqkVpXQo2ayPZKTLaOjuizAyOcORp/qcKsSz117hxdEpIhOzjsB5j2NnaNRUVjAatdLmbsTmWVg0tLrynafnFrhveyvjN+ad5a28x9rXEebS0KTVYnXiZotU9zn4tTq17a+rrqS1IcQ9m1t887HdfbSTNUDSFqSKUjqUnUedzlN0x6FtAasPVTIdg3BTTUIRyQtDUU6eH3KOa2doTM/Os7Y+xI9emiRUVcH4dIzuvkhCuqBfqbm7N0d3X4SaqgpnCSzv/vYqLl7P2L1Qr5+n7J5oPHN1zAmFeIthtLBFUUqHtB61iHxWRIZE5IeFMChb/LzVdJ7ihuZaXnplhoGxaZpqq6gNVfH+n97sFLnYCwqM35ijosIS7YGxaWbnFzm8v4P37N7I9akYFQKVFbKkJak3k2RgbJpLQ5Ps67i5LFZ7Q4i2xlpnYrG7L0LfS6/wwS+e4ceRCV65MUd1pfgKfaYFLnbhzPyi0WIYRSlhMvGou4BPAZ/LrynLI10nPO+kWXeftar3wPgwNVUVPHtplG1tdRzvHWTXphZnn84NTfxwYII72hsAlqwEfrx3kHBjDZOxhSVZHO58bMCZ9LPzqe2UP3csfF9HmONnX3Zi2+953cYlK4+3N4QckU/XwtQ+ZqpyeEVRSoO0HrUx5lngegFsWRbpYrLdfRFGJm7GpO39D3Su5/LItLNCizsUsq8jzOz8ItvDDRzcs5Vdm5p5qvdlzl4dd4R/W1udr0h7uT4V49rYNL0DrySM7xXOvTvDHNyzmeqqSu7b3pqwj51z/cSpa3z74jBnro6nrMp0H9Nv2S5FUUqLjNLzRGQL8FVjzF0p9nkEeATg9ttvf92VK1dyZWPG+BV8JFvw1d7mzl12///EqWtUVQjb2uvZ0FLHsxeHmJpd4O6NtyT0dvbz3N1tSM9cHefstVeor6nkvh3rnPGzTY87dOyss1DuprVr6NzYrKl1ilJGFCQ9zxjzaeDTYOVR5+q42QjacDSWUMnnXvDVFuCHHnvO6e28a1MzA2PTTsN/2yM/3jtIlQiT8dal+zrCVpP/kNVcyb0QwKFjZx1v3vbcNzTXcv+OdqdScWPLGtbWh5wYtV/zf0hdWegOYwAJMWvNiVaU8ibwWR+psji8AuVXSeiOBz/02HOcvfYKcwuLzC8aS5ArhOjsHEee6nME3faEIbG/tLfs2o5F914bZ219iEuRKOGmGqeV6Wg0xvyiSShKsWPUAJeGJp3z2Lsz7Ky
baMfL3efmztLwW3zA7xqpgCtKeRAoofYTllR9KfzS3tyCZPeMthenBaivqWR61krFa6qt4vSVcWbmF9h5a9OSCT43ydLZpmbnGZqcYV1DLWvrQ0zOLnDXbU20xcvJ7TQ7b5jF7cHb9tvrJh7oXJ/yBuXXQ9vvGnmbTCmKUpqkFWoR+QJwP9AmIteAPzDGfCYfxviJU6p832QC5Q4vuEMh3tjyoWNn2d95K5eGJpesQu7lA4+f4tlLo9y3vdVZJODgnq0ceaqPxtpqJmfn2baufsnahN5cau/NwN30f9emFs5cHefM1fGExWnTecaaE60o5U1aoTbGPFgIQyD7rm7JBMrdrN/OnkjlrSdbPsvNs5dGqa4Qus9FeNPRkwlxZO8ai+4QxcDYNF09/eza1OyEU+xQh50K6M5UsT1wt6AfOnbWqaa0zzsT/Ba09UNDJIoSbAIV+siVZ+gnwHZs2Jtv7c0C6erp58WRKeYWDO/ZvdER4/u2t/LspVFCVZVOHPnRB+5MabO7wtDdKtU9CejuHQL45j17qykzvUaZXk/t+6EowSZQQp0r/ATK7a17hckdKnlhKMrg5CxrXGIMOOEOd2ZGOrwFJ+60PcApvnEXtiQLbdj756NwRftTK0qwKUuhTod79fBDx846qXIAiwbqqitYWGSJGJ84F2E4Gktb5GLjvWF09fRTVSFOCMRvsjPTY+USjXErSrBZNULtjQXbjf3t1Vqm5xbS9orOdYjAm/etKIriR9m1OU1Ge0OIkxeHmbwRc5okHehc7wi0/fgPN0MBR58+79vwye65ke3isAf3bE1oTaqtRhVFyYSyXuHFnbv8t/9yhdiCtWTWv7tnI5eGJtnQUpeQqmev3AJW5d9TvS9TIVY45Mi7Op149pGn+misqWRwYpYd8X4gxQgdaLaGopQPqUrIy9qjdpeExxYWmV9YBCFBjN1etO3d2q+rK4WRqRixhYWEbBG7kZO1HuLNxQb8sMvMvd55Ls9PW5gqSnkTeKFOt2JLqv3dnfJa60NUVlTwxjta+di77+bgnq2+Heps0fvYu++msbaaqgphcfFmuty+jjBtTbUc3LOZbe31CYsN+OG+WWQiqtmcr4ZOUpPt746iBJXATyZmM4HndMqL5xu7W4EOR2OMTMzQNxhNKDixv+fXKKm1wVoPsapC6Orpd9ZUdOdp2zaCf2qdva/dRySdqGZzvpqtkRrND1fKhcALdTY5vu7lsrzLWzlFI22JRSNHnz7PY9/uZ02okvVNtcDNRkl2Zd/A2HTC0lj2+4Gx6SVFLG6WE0PWnObcoddSKRcCL9TZeI32H+aDr9+cMEFoe9d2wUnHrU2AJdJ/9U+XwRgmZwyvuS3EaDTGwPgNPv7MBb72wfucCUS7TLy9IcTJ80NUVECYmqyaRuX6fJXU6LVUyoXAx6izYe/OcEK4wxvD/UbfEKPRWb7RZy1Ya7UVFeYN3HqL1U3vQmSSmbkFrk/FEo5rr5QyHI1x14YmFhdvjptstRWNISuKkgsC71GvBHfp9ZmrY/SPTFFRgSPCBzrX89ff7qdlTTWbW+vpG4xSV13BVGyRjS21ziIDkzNzxBYM79290fGgt4cXlzT+9xtfPTpFUVZKWXnUXuzJxZGJGY73DrJjXQMVUsF7dm8EYNemFprXVFMbqmA0GmNNdQVzi9C5oYnr0/NWHHr8BoMTM8zOL3C8d9Dx2r1ZI4qiKPmirD1q9+Ti2jprde/7treya1OL0+Pjrg1NXB6ZRgTmFgx3tNfTcdstTN6IcfrKOE21VTTVVhFbMAk9OdRbVhSlUJS1R72vI0xbYy2H93dwfXqe5jVVfPP8MB9+speRiRkAZ/va+hAAa+tDVg71mhD7O29lS1s9b9zezkff1blkHUNFUZRCUNYetdvrPXN1jK6eK6wJVVJVIVwenV7SBc+dveEuftFcXEVRiklZC7WbRx+4k12bWpasxmLjnni033tX+FYURSkGq0aoIX1cOds1GxV
FUQpBSQp1rrvGeVcIV+9ZUZQgUXKTie6Uu1x1jfOuEG6HPLShj6IoQaDkhNpJuRudzpnn61dBqC1EFUUJCiUX+vD288gmDJJs33SL4SqKohSTkhNqr6hm0/hIW4gqilKKlFzow0s2jY+0SZKiKKVIWa+ZqCiKUiqs2jUTFUVRygEVakVRlICjQq0oihJwVKgVRVECjgq1oihKwFGhVhRFCTgq1IqiKAEnL3nUIjIMXHF91AaM5Hyg3KN25ha1M7eonbklaHZuNsa0+23Ii1AvGUTkVLJE7iChduYWtTO3qJ25pVTsBA19KIqiBB4VakVRlIBTKKH+dIHGWSlqZ25RO3OL2plbSsXOwsSoFUVRlOWjoQ9FUZSAo0KtKIoScPIu1CLyNhG5ICI/FpHfyfd4y0FENonIt0TknIj8SER+o9g2pUJEKkXk+yLy1WLbkgwRaRaRL4vIeRHpE5GfLrZNfojIb8Z/5j8UkS+ISG2xbQIQkc+KyJCI/ND12VoR6RaRS/H/W4ppY9wmPzv/e/znflZEnhSR5iKaaNu0xE7Xtg+JiBGRtmLYlgl5FWoRqQT+HPg5YCfwoIjszOeYy2Qe+JAxZidwL/ArAbXT5jeAvmIbkYY/A75ujLkT+CkCaK+IbAB+HdhtjLkLqATeV1yrHLqAt3k++x3gG8aY7cA34u+LTRdL7ewG7jLG3A1cBH630Eb50MVSOxGRTcBbgZ8U2qBsyLdH/Xrgx8aYy8aYGPD3wDvyPGbWGGNeNsY8H389iSUqG4prlT8ishE4ADxWbFuSISK3APcBnwEwxsSMMeNFNSo5VcAaEakC6oCXimwPAMaYZ4Hrno/fAfxN/PXfAO8spE1++NlpjHnGGDMff/scsLHghnlIcj0B/hT4L0CgsyryLdQbgKuu99cIqADaiMgW4LXAd4tsSjI+ifWLtVhkO1KxFRgG/lc8RPOYiNQX2ygvxpgB4CiWN/Uy8Iox5pniWpWSsDHm5fjrQaAUFv/8JeBrxTbCDxF5BzBgjPlBsW1Jh04muhCRBuAY8EFjzESx7fEiIm8Hhowxp4ttSxqqgHuAvzTGvBaYIhiP6QnEY7zvwLqx3AbUi8hDxbUqM4yVVxtoL1BEDmOFFT9fbFu8iEgd8GHg94ttSybkW6gHgE2u9xvjnwUOEanGEunPG2P+odj2JOGNwM+LyItYYaQ3i8jjxTXJl2vANWOM/VTyZSzhDhp7gX5jzLAxZg74B2BPkW1KRUREbgWI/z9UZHuSIiIHgbcDv2CCWaxxB9YN+gfxv6eNwPMisr6oViUh30L9r8B2EdkqIiGsiZqv5HnMrBERwYqn9hljPlFse5JhjPldY8xGY8wWrGv5TWNM4DxAY8wgcFVEXh3/6C3AuSKalIyfAPeKSF38d+AtBHDS08VXgF+Mv/5F4B+LaEtSRORtWOG5nzfGTBfbHj+MMb3GmHXGmC3xv6drwD3x393AkVehjk8o/CrwNNYfwBPGmB/lc8xl8kbg/Vge6pn4v/3FNqrE+TXg8yJyFtgFfLS45iwl7vF/GXge6MX6ewhEWbGIfAH4F+DVInJNRH4Z+BNgn4hcwnoa+JNi2ghJ7fwU0Ah0x/+W/qqoRpLUzpJBS8gVRVECjk4mKoqiBBwVakVRlICjQq0oihJwVKgVRVECjgq1oihKwFGhVhRFCTgq1IqiKAHn/wdBB5lyTURA/gAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "show_corr(housing, 'MedInc', 'MedHouseVal')" ] }, { "cell_type": "markdown", "id": "d967187d", "metadata": { "hidden": true, "papermill": { "duration": 0.099654, "end_time": "2022-05-16T23:19:10.355784", "exception": false, "start_time": "2022-05-16T23:19:10.256130", "status": "completed" }, "tags": [] }, "source": [ "So that's what a correlation of 0.68 looks like. It's quite a close relationship, but there's still a lot of variation. (Incidentally, this also shows why looking at your data is so important -- we can see clearly in this plot that house prices above $500,000 seem to have been truncated to that maximum value).\n", "\n", "Let's take a look at another pair:" ] }, { "cell_type": "code", "execution_count": 44, "id": "3217ef94", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:10.573146Z", "iopub.status.busy": "2022-05-16T23:19:10.572263Z", "iopub.status.idle": "2022-05-16T23:19:10.754482Z", "shell.execute_reply": "2022-05-16T23:19:10.754042Z", "shell.execute_reply.started": "2022-04-19T22:50:45.936734Z" }, "hidden": true, "papermill": { "duration": 0.299698, "end_time": "2022-05-16T23:19:10.754608", "exception": false, "start_time": "2022-05-16T23:19:10.454910", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "show_corr(housing, 'MedInc', 'AveRooms')" ] }, { "cell_type": "markdown", "id": "e283f2c4", "metadata": { "hidden": true, "papermill": { "duration": 0.118359, "end_time": "2022-05-16T23:19:10.974599", "exception": false, "start_time": "2022-05-16T23:19:10.856240", "status": "completed" }, "tags": [] }, "source": [ "The relationship looks like it is similarly close to the previous example, but r is much lower than in the income vs valuation case. Why is that? The reason is that there are a lot of *outliers* -- values of `AveRooms` far from the mean.\n", "\n", "r is very sensitive to outliers. If there are outliers in your data, then the relationship between them will dominate the metric. In this case, the houses with a very high number of rooms don't tend to be that valuable, so it's decreasing r from where it would otherwise be.\n", "\n", "Let's remove the outliers and try again:" ] }, { "cell_type": "code", "execution_count": 45, "id": "1d5455b4", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:11.210864Z", "iopub.status.busy": "2022-05-16T23:19:11.209971Z", "iopub.status.idle": "2022-05-16T23:19:11.388989Z", "shell.execute_reply": "2022-05-16T23:19:11.388528Z", "shell.execute_reply.started": "2022-04-19T22:50:46.141649Z" }, "hidden": true, "papermill": { "duration": 0.29918, "end_time": "2022-05-16T23:19:11.389107", "exception": false, "start_time": "2022-05-16T23:19:11.089927", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "subset = housing[housing.AveRooms<15]\n", "show_corr(subset, 'MedInc', 'AveRooms')" ] }, { "cell_type": "markdown", "id": "ddee8bc8", "metadata": { "hidden": true, "papermill": { "duration": 0.161694, "end_time": "2022-05-16T23:19:11.723774", "exception": false, "start_time": "2022-05-16T23:19:11.562080", "status": "completed" }, "tags": [] }, "source": [ "As we expected, now the correlation is very similar to our first comparison.\n", "\n", "Here's another relationship using `AveRooms` on the subset:" ] }, { "cell_type": "code", "execution_count": 46, "id": "a40703b7", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:12.065306Z", "iopub.status.busy": "2022-05-16T23:19:12.063521Z", "iopub.status.idle": "2022-05-16T23:19:12.314930Z", "shell.execute_reply": "2022-05-16T23:19:12.315888Z", "shell.execute_reply.started": "2022-04-19T22:50:46.33992Z" }, "hidden": true, "papermill": { "duration": 0.429853, "end_time": "2022-05-16T23:19:12.316088", "exception": false, "start_time": "2022-05-16T23:19:11.886235", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "show_corr(subset, 'MedHouseVal', 'AveRooms')" ] }, { "cell_type": "markdown", "id": "cba1d702", "metadata": { "hidden": true, "papermill": { "duration": 0.199087, "end_time": "2022-05-16T23:19:12.712802", "exception": false, "start_time": "2022-05-16T23:19:12.513715", "status": "completed" }, "tags": [] }, "source": [ "At this level, with r of 0.34, the relationship is becoming quite weak.\n", "\n", "Let's look at one more:" ] }, { "cell_type": "code", "execution_count": 47, "id": "2d6937f6", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:13.004823Z", "iopub.status.busy": "2022-05-16T23:19:12.996936Z", "iopub.status.idle": "2022-05-16T23:19:13.192851Z", "shell.execute_reply": "2022-05-16T23:19:13.192201Z", "shell.execute_reply.started": "2022-04-19T22:50:46.526726Z" }, "hidden": true, "papermill": { "duration": 0.329812, "end_time": "2022-05-16T23:19:13.192993", "exception": false, "start_time": "2022-05-16T23:19:12.863181", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "show_corr(subset, 'HouseAge', 'AveRooms')" ] }, { "cell_type": "markdown", "id": "ce158a24", "metadata": { "hidden": true, "papermill": { "duration": 0.103658, "end_time": "2022-05-16T23:19:13.401756", "exception": false, "start_time": "2022-05-16T23:19:13.298098", "status": "completed" }, "tags": [] }, "source": [ "As you see here, a correlation of -0.2 shows a very weak negative trend.\n", "\n", "We've now seen examples of a variety of correlation coefficients, so hopefully you're getting a good sense of what this metric means.\n", "\n", "Transformers expects metrics to be returned as a `dict`, since that way the trainer knows what label to use, so let's create a function to do that:" ] }, { "cell_type": "code", "execution_count": 48, "id": "a5ff917b", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:13.612237Z", "iopub.status.busy": "2022-05-16T23:19:13.611377Z", "iopub.status.idle": "2022-05-16T23:19:13.614160Z", "shell.execute_reply": "2022-05-16T23:19:13.613658Z", "shell.execute_reply.started": "2022-04-19T22:50:46.715707Z" }, "hidden": true, "papermill": { "duration": 0.109782, "end_time": "2022-05-16T23:19:13.614283", "exception": false, "start_time": "2022-05-16T23:19:13.504501", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}" ] }, { "cell_type": "markdown", "id": "ed983c51", "metadata": { "papermill": { "duration": 0.102797, "end_time": "2022-05-16T23:19:13.819824", "exception": false, "start_time": "2022-05-16T23:19:13.717027", "status": "completed" }, "tags": [] }, "source": [ "## Training" ] }, { "cell_type": "markdown", "id": "352007a6", "metadata": { "papermill": { "duration": 0.118052, "end_time": "2022-05-16T23:19:14.047292", "exception": false, "start_time": "2022-05-16T23:19:13.929240", "status": "completed" }, "tags": [] }, "source": [ "## 
Training our model" ] }, { "cell_type": "markdown", "id": "7cc0a49a", "metadata": { "papermill": { "duration": 0.102335, "end_time": "2022-05-16T23:19:14.253540", "exception": false, "start_time": "2022-05-16T23:19:14.151205", "status": "completed" }, "tags": [] }, "source": [ "To train a model in Transformers we'll need these two classes:" ] }, { "cell_type": "code", "execution_count": 49, "id": "811f584f", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:14.466565Z", "iopub.status.busy": "2022-05-16T23:19:14.465852Z", "iopub.status.idle": "2022-05-16T23:19:18.918169Z", "shell.execute_reply": "2022-05-16T23:19:18.917243Z", "shell.execute_reply.started": "2022-04-19T22:50:46.722181Z" }, "papermill": { "duration": 4.55931, "end_time": "2022-05-16T23:19:18.918306", "exception": false, "start_time": "2022-05-16T23:19:14.358996", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "from transformers import TrainingArguments, Trainer" ] }, { "cell_type": "markdown", "id": "bef40f65", "metadata": { "papermill": { "duration": 0.103363, "end_time": "2022-05-16T23:19:19.125535", "exception": false, "start_time": "2022-05-16T23:19:19.022172", "status": "completed" }, "tags": [] }, "source": [ "We pick a batch size that fits our GPU, and a small number of epochs so we can run experiments quickly:" ] }, { "cell_type": "code", "execution_count": 50, "id": "9a7f73b8", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:19.336295Z", "iopub.status.busy": "2022-05-16T23:19:19.334700Z", "iopub.status.idle": "2022-05-16T23:19:19.336896Z", "shell.execute_reply": "2022-05-16T23:19:19.337294Z", "shell.execute_reply.started": "2022-04-19T22:50:50.493351Z" }, "papermill": { "duration": 0.109762, "end_time": "2022-05-16T23:19:19.337450", "exception": false, "start_time": "2022-05-16T23:19:19.227688", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "bs = 128\n", "epochs = 4" ] }, { "cell_type": "markdown", "id": "1b9defba", "metadata": 
{ "papermill": { "duration": 0.104127, "end_time": "2022-05-16T23:19:19.544960", "exception": false, "start_time": "2022-05-16T23:19:19.440833", "status": "completed" }, "tags": [] }, "source": [ "The most important hyperparameter is the learning rate. fastai provides a learning rate finder to help you figure this out, but Transformers doesn't, so you'll just have to use trial and error. The idea is to find the largest value that doesn't cause training to fail." ] }, { "cell_type": "code", "execution_count": 51, "id": "95d56aa8", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:19.756018Z", "iopub.status.busy": "2022-05-16T23:19:19.755174Z", "iopub.status.idle": "2022-05-16T23:19:19.757522Z", "shell.execute_reply": "2022-05-16T23:19:19.756988Z", "shell.execute_reply.started": "2022-04-19T22:50:50.499493Z" }, "papermill": { "duration": 0.109843, "end_time": "2022-05-16T23:19:19.757641", "exception": false, "start_time": "2022-05-16T23:19:19.647798", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "lr = 8e-5" ] }, { "cell_type": "markdown", "id": "456eba4a", "metadata": { "papermill": { "duration": 0.104473, "end_time": "2022-05-16T23:19:19.964659", "exception": false, "start_time": "2022-05-16T23:19:19.860186", "status": "completed" }, "tags": [] }, "source": [ "Transformers uses the `TrainingArguments` class to set up arguments. Don't worry too much about the values we're using here -- they should generally work fine in most cases. It's just the three parameters above (batch size, number of epochs, and learning rate) that you may need to change for different models." 
] }, { "cell_type": "code", "execution_count": 52, "id": "e009173f", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:20.241601Z", "iopub.status.busy": "2022-05-16T23:19:20.178470Z", "iopub.status.idle": "2022-05-16T23:19:20.247751Z", "shell.execute_reply": "2022-05-16T23:19:20.247289Z", "shell.execute_reply.started": "2022-04-19T22:50:50.511653Z" }, "papermill": { "duration": 0.178531, "end_time": "2022-05-16T23:19:20.247877", "exception": false, "start_time": "2022-05-16T23:19:20.069346", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,\n", " evaluation_strategy=\"epoch\", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,\n", " num_train_epochs=epochs, weight_decay=0.01, report_to='none')" ] }, { "cell_type": "markdown", "id": "c6f09673", "metadata": { "papermill": { "duration": 0.10325, "end_time": "2022-05-16T23:19:20.455326", "exception": false, "start_time": "2022-05-16T23:19:20.352076", "status": "completed" }, "tags": [] }, "source": [ "We can now create our model and a `Trainer`, a class that combines the data and model together (just like `Learner` in fastai):" ] }, { "cell_type": "code", "execution_count": 53, "id": "c242e16d", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:19:20.669564Z", "iopub.status.busy": "2022-05-16T23:19:20.668967Z", "iopub.status.idle": "2022-05-16T23:19:41.632888Z", "shell.execute_reply": "2022-05-16T23:19:41.632215Z", "shell.execute_reply.started": "2022-04-19T22:50:50.57276Z" }, "papermill": { "duration": 21.075395, "end_time": "2022-05-16T23:19:41.633043", "exception": false, "start_time": "2022-05-16T23:19:20.557648", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "06f16c9f0e074c379a708f0fefb12ada", "version_major": 2, "version_minor": 0 }, "text/plain": [ 
"Downloading: 0%| | 0.00/273M [00:00\n", " \n", " \n", " [856/856 04:58, Epoch 4/4]\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Epoch | Training Loss | Validation Loss | Pearson
1 | No log | 0.024492 | 0.800443
2 | No log | 0.022003 | 0.826113
3 | 0.041600 | 0.021423 | 0.834453
4 | 0.041600 | 0.022275 | 0.834767

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, id, target, anchor, input.\n", "***** Running Evaluation *****\n", " Num examples = 9119\n", " Batch size = 256\n", "The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, id, target, anchor, input.\n", "***** Running Evaluation *****\n", " Num examples = 9119\n", " Batch size = 256\n", "Saving model checkpoint to outputs/checkpoint-500\n", "Configuration saved in outputs/checkpoint-500/config.json\n", "Model weights saved in outputs/checkpoint-500/pytorch_model.bin\n", "tokenizer config file saved in outputs/checkpoint-500/tokenizer_config.json\n", "Special tokens file saved in outputs/checkpoint-500/special_tokens_map.json\n", "added tokens file saved in outputs/checkpoint-500/added_tokens.json\n", "The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, id, target, anchor, input.\n", "***** Running Evaluation *****\n", " Num examples = 9119\n", " Batch size = 256\n", "The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, id, target, anchor, input.\n", "***** Running Evaluation *****\n", " Num examples = 9119\n", " Batch size = 256\n", "\n", "\n", "Training completed. 
Do not forget to share your model on huggingface.co/models =)\n", "\n", "\n" ] } ], "source": [ "trainer.train();" ] }, { "cell_type": "markdown", "id": "f71e4a43", "metadata": { "papermill": { "duration": 0.128839, "end_time": "2022-05-16T23:24:42.089146", "exception": false, "start_time": "2022-05-16T23:24:41.960307", "status": "completed" }, "tags": [] }, "source": [ "Lots more warnings from Transformers again -- you can ignore these as before.\n", "\n", "The key thing to look at is the \"Pearson\" value in the table above. As you can see, it rises with every epoch and is already above 0.8. That's great news! We can now submit our predictions to Kaggle if we want them to be scored on the official leaderboard. Let's get some predictions on the test set:" ] }, { "cell_type": "code", "execution_count": 55, "id": "cc9ff8a9", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:24:42.337547Z", "iopub.status.busy": "2022-05-16T23:24:42.336666Z", "iopub.status.idle": "2022-05-16T23:24:42.387175Z", "shell.execute_reply": "2022-05-16T23:24:42.387621Z", "shell.execute_reply.started": "2022-04-19T22:56:03.886175Z" }, "papermill": { "duration": 0.176198, "end_time": "2022-05-16T23:24:42.387780", "exception": false, "start_time": "2022-05-16T23:24:42.211582", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, id, target, anchor, input.\n", "***** Running Prediction *****\n", " Num examples = 36\n", " Batch size = 256\n" ] }, { "data": { "text/html": [ "\n", "

\n", " \n", " \n", " [1/1 : < :]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "array([[ 0.51],\n", " [ 0.65],\n", " [ 0.5 ],\n", " [ 0.32],\n", " [-0.04],\n", " [ 0.52],\n", " [ 0.52],\n", " [ 0.07],\n", " [ 0.28],\n", " [ 1.11],\n", " [ 0.25],\n", " [ 0.22],\n", " [ 0.71],\n", " [ 0.88],\n", " [ 0.73],\n", " [ 0.41],\n", " [ 0.33],\n", " [ 0. ],\n", " [ 0.69],\n", " [ 0.35],\n", " [ 0.4 ],\n", " [ 0.25],\n", " [ 0.12],\n", " [ 0.27],\n", " [ 0.56],\n", " [-0. ],\n", " [-0.03],\n", " [-0.01],\n", " [-0.03],\n", " [ 0.59],\n", " [ 0.29],\n", " [ 0.03],\n", " [ 0.74],\n", " [ 0.57],\n", " [ 0.46],\n", " [ 0.21]])" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds = trainer.predict(eval_ds).predictions.astype(float)\n", "preds" ] }, { "cell_type": "markdown", "id": "e7f99fe2", "metadata": { "papermill": { "duration": 0.118083, "end_time": "2022-05-16T23:24:42.627541", "exception": false, "start_time": "2022-05-16T23:24:42.509458", "status": "completed" }, "tags": [] }, "source": [ "Look out - some of our predictions are <0, or >1! This once again shows the value of remembering to actually *look* at your data. 
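
Rather than scanning the array by eye, you could count the offending entries. The following is only a sketch: `sample_preds` holds a few illustrative stand-in values (in the notebook you would pass `preds`), and `count_out_of_bounds` is a hypothetical helper, not something the notebook or NumPy provides:

```python
import numpy as np

def count_out_of_bounds(p, lo=0.0, hi=1.0):
    # Hypothetical helper: returns (number below lo, number above hi).
    p = np.asarray(p, dtype=float)
    return int((p < lo).sum()), int((p > hi).sum())

# Illustrative stand-in values; replace with preds in the notebook.
sample_preds = np.array([[0.51], [-0.04], [1.11], [0.0]])
n_low, n_high = count_out_of_bounds(sample_preds)
print(n_low, n_high)
```

Any non-zero count means some scores fall outside the range the competition allows.
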
Let's fix those out-of-bounds predictions:" ] }, { "cell_type": "code", "execution_count": 56, "id": "87e31c26", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:24:42.887528Z", "iopub.status.busy": "2022-05-16T23:24:42.886610Z", "iopub.status.idle": "2022-05-16T23:24:42.888262Z", "shell.execute_reply": "2022-05-16T23:24:42.888836Z", "shell.execute_reply.started": "2022-04-19T22:56:03.940986Z" }, "papermill": { "duration": 0.130653, "end_time": "2022-05-16T23:24:42.888988", "exception": false, "start_time": "2022-05-16T23:24:42.758335", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "preds = np.clip(preds, 0, 1)" ] }, { "cell_type": "code", "execution_count": 57, "id": "73ce77ef", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:24:43.130852Z", "iopub.status.busy": "2022-05-16T23:24:43.130226Z", "iopub.status.idle": "2022-05-16T23:24:43.133011Z", "shell.execute_reply": "2022-05-16T23:24:43.133433Z", "shell.execute_reply.started": "2022-04-19T22:56:03.946586Z" }, "papermill": { "duration": 0.126485, "end_time": "2022-05-16T23:24:43.133567", "exception": false, "start_time": "2022-05-16T23:24:43.007082", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[0.51],\n", " [0.65],\n", " [0.5 ],\n", " [0.32],\n", " [0. ],\n", " [0.52],\n", " [0.52],\n", " [0.07],\n", " [0.28],\n", " [1. ],\n", " [0.25],\n", " [0.22],\n", " [0.71],\n", " [0.88],\n", " [0.73],\n", " [0.41],\n", " [0.33],\n", " [0. ],\n", " [0.69],\n", " [0.35],\n", " [0.4 ],\n", " [0.25],\n", " [0.12],\n", " [0.27],\n", " [0.56],\n", " [0. ],\n", " [0. ],\n", " [0. ],\n", " [0. 
],\n", " [0.59],\n", " [0.29],\n", " [0.03],\n", " [0.74],\n", " [0.57],\n", " [0.46],\n", " [0.21]])" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds" ] }, { "cell_type": "markdown", "id": "613fd9ce", "metadata": { "papermill": { "duration": 0.1167, "end_time": "2022-05-16T23:24:43.367254", "exception": false, "start_time": "2022-05-16T23:24:43.250554", "status": "completed" }, "tags": [] }, "source": [ "OK, now we're ready to create our submission file. If you save a CSV in your notebook, you will get the option to submit it later." ] }, { "cell_type": "code", "execution_count": 58, "id": "7b89fb1f", "metadata": { "execution": { "iopub.execute_input": "2022-05-16T23:24:43.662645Z", "iopub.status.busy": "2022-05-16T23:24:43.661829Z", "iopub.status.idle": "2022-05-16T23:24:43.714350Z", "shell.execute_reply": "2022-05-16T23:24:43.712706Z", "shell.execute_reply.started": "2022-04-19T22:56:03.959351Z" }, "papermill": { "duration": 0.173814, "end_time": "2022-05-16T23:24:43.714480", "exception": false, "start_time": "2022-05-16T23:24:43.540666", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "69e68b83bfe144bb95d01c779834cf18", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Creating CSV from Arrow format: 0%| | 0/1 [00:00