{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load\n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n    for filename in filenames:\n        print(os.path.join(dirname, filename))\n\n# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current 
session","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","execution":{"iopub.status.busy":"2022-08-12T16:07:49.750037Z","iopub.execute_input":"2022-08-12T16:07:49.750515Z","iopub.status.idle":"2022-08-12T16:07:49.761989Z","shell.execute_reply.started":"2022-08-12T16:07:49.750473Z","shell.execute_reply":"2022-08-12T16:07:49.760803Z"},"trusted":true},"execution_count":34,"outputs":[{"name":"stdout","text":"/kaggle/input/tabular-playground-series-aug-2022/sample_submission.csv\n/kaggle/input/tabular-playground-series-aug-2022/train.csv\n/kaggle/input/tabular-playground-series-aug-2022/test.csv\n","output_type":"stream"}]},{"cell_type":"markdown","source":"## Using skops to host your models on Hugging Face Hub\nThis notebook shows how to use [skops](https://skops.readthedocs.io/) to improve your data science workflows with scikit-learn. We will walk through an end-to-end example using the Kaggle Tabular Playground Series of August 2022.","metadata":{}},{"cell_type":"markdown","source":"## Install skops","metadata":{}},{"cell_type":"code","source":"#!pip install skops","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:42:20.000537Z","iopub.execute_input":"2022-08-12T16:42:20.000960Z","iopub.status.idle":"2022-08-12T16:42:20.005212Z","shell.execute_reply.started":"2022-08-12T16:42:20.000926Z","shell.execute_reply":"2022-08-12T16:42:20.004298Z"},"trusted":true},"execution_count":58,"outputs":[]},{"cell_type":"markdown","source":"## Import libraries","metadata":{}},{"cell_type":"code","source":"import skops\nimport sklearn\nimport matplotlib.pyplot as 
plt","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.273144Z","iopub.execute_input":"2022-08-12T16:08:01.273524Z","iopub.status.idle":"2022-08-12T16:08:01.279217Z","shell.execute_reply.started":"2022-08-12T16:08:01.273487Z","shell.execute_reply":"2022-08-12T16:08:01.277670Z"},"trusted":true},"execution_count":36,"outputs":[]},{"cell_type":"markdown","source":"## Let's take a look at the dataset\nThe target variable is binary. We have a number of numerical and categorical variables.","metadata":{}},{"cell_type":"code","source":"df = pd.read_csv(\"../input/tabular-playground-series-aug-2022/train.csv\")\ndf.head()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.280555Z","iopub.execute_input":"2022-08-12T16:08:01.280918Z","iopub.status.idle":"2022-08-12T16:08:01.433127Z","shell.execute_reply.started":"2022-08-12T16:08:01.280882Z","shell.execute_reply":"2022-08-12T16:08:01.431902Z"},"trusted":true},"execution_count":37,"outputs":[{"execution_count":37,"output_type":"execute_result","data":{"text/plain":"   id product_code  loading attribute_0 attribute_1  attribute_2  attribute_3  \\\n0   0            A    80.10  material_7  material_8            9            5   \n1   1            A    84.89  material_7  material_8            9            5   \n2   2            A    82.43  material_7  material_8            9            5   \n3   3            A   101.07  material_7  material_8            9            5   \n4   4            A   188.06  material_7  material_8            9            5   \n\n   measurement_0  measurement_1  measurement_2  ...  measurement_9  \\\n0              7              8              4  ...         10.672   \n1             14              3              3  ...         12.448   \n2             12              1              5  ...         12.715   \n3             13              2              6  ...         12.471   \n4              9              2              8  ...         
10.337   \n\n   measurement_10  measurement_11  measurement_12  measurement_13  \\\n0          15.859          17.594          15.193          15.029   \n1          17.947          17.915          11.755          14.732   \n2          15.607             NaN          13.798          16.711   \n3          16.346          18.377          10.020          15.250   \n4          17.082          19.932          12.428          16.182   \n\n   measurement_14  measurement_15  measurement_16  measurement_17  failure  \n0             NaN          13.034          14.684         764.100        0  \n1          15.425          14.395          15.631         682.057        0  \n2          18.631          14.094          17.946         663.376        0  \n3          15.562          16.154          17.172         826.282        0  \n4          12.760          13.153          16.412         579.885        0  \n\n[5 rows x 26 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>id</th>\n      <th>product_code</th>\n      <th>loading</th>\n      <th>attribute_0</th>\n      <th>attribute_1</th>\n      <th>attribute_2</th>\n      <th>attribute_3</th>\n      <th>measurement_0</th>\n      <th>measurement_1</th>\n      <th>measurement_2</th>\n      <th>...</th>\n      <th>measurement_9</th>\n      <th>measurement_10</th>\n      <th>measurement_11</th>\n      <th>measurement_12</th>\n      <th>measurement_13</th>\n      <th>measurement_14</th>\n      <th>measurement_15</th>\n      <th>measurement_16</th>\n      <th>measurement_17</th>\n      <th>failure</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>A</td>\n      
<td>80.10</td>\n      <td>material_7</td>\n      <td>material_8</td>\n      <td>9</td>\n      <td>5</td>\n      <td>7</td>\n      <td>8</td>\n      <td>4</td>\n      <td>...</td>\n      <td>10.672</td>\n      <td>15.859</td>\n      <td>17.594</td>\n      <td>15.193</td>\n      <td>15.029</td>\n      <td>NaN</td>\n      <td>13.034</td>\n      <td>14.684</td>\n      <td>764.100</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>A</td>\n      <td>84.89</td>\n      <td>material_7</td>\n      <td>material_8</td>\n      <td>9</td>\n      <td>5</td>\n      <td>14</td>\n      <td>3</td>\n      <td>3</td>\n      <td>...</td>\n      <td>12.448</td>\n      <td>17.947</td>\n      <td>17.915</td>\n      <td>11.755</td>\n      <td>14.732</td>\n      <td>15.425</td>\n      <td>14.395</td>\n      <td>15.631</td>\n      <td>682.057</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2</td>\n      <td>A</td>\n      <td>82.43</td>\n      <td>material_7</td>\n      <td>material_8</td>\n      <td>9</td>\n      <td>5</td>\n      <td>12</td>\n      <td>1</td>\n      <td>5</td>\n      <td>...</td>\n      <td>12.715</td>\n      <td>15.607</td>\n      <td>NaN</td>\n      <td>13.798</td>\n      <td>16.711</td>\n      <td>18.631</td>\n      <td>14.094</td>\n      <td>17.946</td>\n      <td>663.376</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>3</td>\n      <td>A</td>\n      <td>101.07</td>\n      <td>material_7</td>\n      <td>material_8</td>\n      <td>9</td>\n      <td>5</td>\n      <td>13</td>\n      <td>2</td>\n      <td>6</td>\n      <td>...</td>\n      <td>12.471</td>\n      <td>16.346</td>\n      <td>18.377</td>\n      <td>10.020</td>\n      <td>15.250</td>\n      <td>15.562</td>\n      <td>16.154</td>\n      <td>17.172</td>\n      <td>826.282</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>4</td>\n      <td>A</td>\n      <td>188.06</td>\n      <td>material_7</td>\n      
<td>material_8</td>\n      <td>9</td>\n      <td>5</td>\n      <td>9</td>\n      <td>2</td>\n      <td>8</td>\n      <td>...</td>\n      <td>10.337</td>\n      <td>17.082</td>\n      <td>19.932</td>\n      <td>12.428</td>\n      <td>16.182</td>\n      <td>12.760</td>\n      <td>13.153</td>\n      <td>16.412</td>\n      <td>579.885</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n<p>5 rows × 26 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"code","source":"df[\"failure\"].unique()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.436722Z","iopub.execute_input":"2022-08-12T16:08:01.437150Z","iopub.status.idle":"2022-08-12T16:08:01.445258Z","shell.execute_reply.started":"2022-08-12T16:08:01.437117Z","shell.execute_reply":"2022-08-12T16:08:01.444066Z"},"trusted":true},"execution_count":38,"outputs":[{"execution_count":38,"output_type":"execute_result","data":{"text/plain":"array([0, 1])"},"metadata":{}}]},{"cell_type":"markdown","source":"# Encode categorical variables, impute missing values\nWe will impute the mean for the numerical attributes and measurements. 
","metadata":{}},{"cell_type":"code","source":"df.describe()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.447099Z","iopub.execute_input":"2022-08-12T16:08:01.447438Z","iopub.status.idle":"2022-08-12T16:08:01.558557Z","shell.execute_reply.started":"2022-08-12T16:08:01.447409Z","shell.execute_reply":"2022-08-12T16:08:01.557437Z"},"trusted":true},"execution_count":39,"outputs":[{"execution_count":39,"output_type":"execute_result","data":{"text/plain":"                 id       loading   attribute_2   attribute_3  measurement_0  \\\ncount  26570.000000  26320.000000  26570.000000  26570.000000   26570.000000   \nmean   13284.500000    127.826233      6.754046      7.240459       7.415883   \nstd     7670.242662     39.030020      1.471852      1.456493       4.116690   \nmin        0.000000     33.160000      5.000000      5.000000       0.000000   \n25%     6642.250000     99.987500      6.000000      6.000000       4.000000   \n50%    13284.500000    122.390000      6.000000      8.000000       7.000000   \n75%    19926.750000    149.152500      8.000000      8.000000      10.000000   \nmax    26569.000000    385.860000      9.000000      9.000000      29.000000   \n\n       measurement_1  measurement_2  measurement_3  measurement_4  \\\ncount   26570.000000   26570.000000   26189.000000   26032.000000   \nmean        8.232518       6.256568      17.791528      11.731988   \nstd         4.199401       3.309109       1.001200       0.996085   \nmin         0.000000       0.000000      13.968000       8.008000   \n25%         5.000000       4.000000      17.117000      11.051000   \n50%         8.000000       6.000000      17.787000      11.733000   \n75%        11.000000       8.000000      18.469000      12.410000   \nmax        29.000000      24.000000      21.499000      16.484000   \n\n       measurement_5  ...  measurement_9  measurement_10  measurement_11  \\\ncount   25894.000000  ...   
25343.000000    25270.000000    25102.000000   \nmean       17.127804  ...      11.430725       16.117711       19.172085   \nstd         0.996414  ...       0.999137        1.405978        1.520785   \nmin        12.073000  ...       7.537000        9.323000       12.461000   \n25%        16.443000  ...      10.757000       15.209000       18.170000   \n50%        17.132000  ...      11.430000       16.127000       19.211500   \n75%        17.805000  ...      12.102000       17.025000       20.207000   \nmax        21.425000  ...      15.412000       22.479000       25.640000   \n\n       measurement_12  measurement_13  measurement_14  measurement_15  \\\ncount    24969.000000    24796.000000    24696.000000    24561.000000   \nmean        11.702464       15.652904       16.048444       14.995554   \nstd          1.488838        1.155247        1.491923        1.549226   \nmin          5.167000       10.890000        9.140000        9.104000   \n25%         10.703000       14.890000       15.057000       13.957000   \n50%         11.717000       15.628500       16.040000       14.969000   \n75%         12.709000       16.374000       17.082000       16.018000   \nmax         17.663000       22.713000       22.303000       21.626000   \n\n       measurement_16  measurement_17       failure  \ncount    24460.000000    24286.000000  26570.000000  \nmean        16.460727      701.269059      0.212608  \nstd          1.708935      123.304161      0.409160  \nmin          9.701000      196.787000      0.000000  \n25%         15.268000      618.961500      0.000000  \n50%         16.436000      701.024500      0.000000  \n75%         17.628000      784.090250      0.000000  \nmax         24.094000     1312.794000      1.000000  \n\n[8 rows x 23 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n    
    text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>id</th>\n      <th>loading</th>\n      <th>attribute_2</th>\n      <th>attribute_3</th>\n      <th>measurement_0</th>\n      <th>measurement_1</th>\n      <th>measurement_2</th>\n      <th>measurement_3</th>\n      <th>measurement_4</th>\n      <th>measurement_5</th>\n      <th>...</th>\n      <th>measurement_9</th>\n      <th>measurement_10</th>\n      <th>measurement_11</th>\n      <th>measurement_12</th>\n      <th>measurement_13</th>\n      <th>measurement_14</th>\n      <th>measurement_15</th>\n      <th>measurement_16</th>\n      <th>measurement_17</th>\n      <th>failure</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>count</th>\n      <td>26570.000000</td>\n      <td>26320.000000</td>\n      <td>26570.000000</td>\n      <td>26570.000000</td>\n      <td>26570.000000</td>\n      <td>26570.000000</td>\n      <td>26570.000000</td>\n      <td>26189.000000</td>\n      <td>26032.000000</td>\n      <td>25894.000000</td>\n      <td>...</td>\n      <td>25343.000000</td>\n      <td>25270.000000</td>\n      <td>25102.000000</td>\n      <td>24969.000000</td>\n      <td>24796.000000</td>\n      <td>24696.000000</td>\n      <td>24561.000000</td>\n      <td>24460.000000</td>\n      <td>24286.000000</td>\n      <td>26570.000000</td>\n    </tr>\n    <tr>\n      <th>mean</th>\n      <td>13284.500000</td>\n      <td>127.826233</td>\n      <td>6.754046</td>\n      <td>7.240459</td>\n      <td>7.415883</td>\n      <td>8.232518</td>\n      <td>6.256568</td>\n      <td>17.791528</td>\n      <td>11.731988</td>\n      <td>17.127804</td>\n      <td>...</td>\n      <td>11.430725</td>\n      <td>16.117711</td>\n      <td>19.172085</td>\n      <td>11.702464</td>\n      <td>15.652904</td>\n      <td>16.048444</td>\n      <td>14.995554</td>\n      <td>16.460727</td>\n      <td>701.269059</td>\n      <td>0.212608</td>\n  
  </tr>\n    <tr>\n      <th>std</th>\n      <td>7670.242662</td>\n      <td>39.030020</td>\n      <td>1.471852</td>\n      <td>1.456493</td>\n      <td>4.116690</td>\n      <td>4.199401</td>\n      <td>3.309109</td>\n      <td>1.001200</td>\n      <td>0.996085</td>\n      <td>0.996414</td>\n      <td>...</td>\n      <td>0.999137</td>\n      <td>1.405978</td>\n      <td>1.520785</td>\n      <td>1.488838</td>\n      <td>1.155247</td>\n      <td>1.491923</td>\n      <td>1.549226</td>\n      <td>1.708935</td>\n      <td>123.304161</td>\n      <td>0.409160</td>\n    </tr>\n    <tr>\n      <th>min</th>\n      <td>0.000000</td>\n      <td>33.160000</td>\n      <td>5.000000</td>\n      <td>5.000000</td>\n      <td>0.000000</td>\n      <td>0.000000</td>\n      <td>0.000000</td>\n      <td>13.968000</td>\n      <td>8.008000</td>\n      <td>12.073000</td>\n      <td>...</td>\n      <td>7.537000</td>\n      <td>9.323000</td>\n      <td>12.461000</td>\n      <td>5.167000</td>\n      <td>10.890000</td>\n      <td>9.140000</td>\n      <td>9.104000</td>\n      <td>9.701000</td>\n      <td>196.787000</td>\n      <td>0.000000</td>\n    </tr>\n    <tr>\n      <th>25%</th>\n      <td>6642.250000</td>\n      <td>99.987500</td>\n      <td>6.000000</td>\n      <td>6.000000</td>\n      <td>4.000000</td>\n      <td>5.000000</td>\n      <td>4.000000</td>\n      <td>17.117000</td>\n      <td>11.051000</td>\n      <td>16.443000</td>\n      <td>...</td>\n      <td>10.757000</td>\n      <td>15.209000</td>\n      <td>18.170000</td>\n      <td>10.703000</td>\n      <td>14.890000</td>\n      <td>15.057000</td>\n      <td>13.957000</td>\n      <td>15.268000</td>\n      <td>618.961500</td>\n      <td>0.000000</td>\n    </tr>\n    <tr>\n      <th>50%</th>\n      <td>13284.500000</td>\n      <td>122.390000</td>\n      <td>6.000000</td>\n      <td>8.000000</td>\n      <td>7.000000</td>\n      <td>8.000000</td>\n      <td>6.000000</td>\n      <td>17.787000</td>\n      <td>11.733000</td>\n      
<td>17.132000</td>\n      <td>...</td>\n      <td>11.430000</td>\n      <td>16.127000</td>\n      <td>19.211500</td>\n      <td>11.717000</td>\n      <td>15.628500</td>\n      <td>16.040000</td>\n      <td>14.969000</td>\n      <td>16.436000</td>\n      <td>701.024500</td>\n      <td>0.000000</td>\n    </tr>\n    <tr>\n      <th>75%</th>\n      <td>19926.750000</td>\n      <td>149.152500</td>\n      <td>8.000000</td>\n      <td>8.000000</td>\n      <td>10.000000</td>\n      <td>11.000000</td>\n      <td>8.000000</td>\n      <td>18.469000</td>\n      <td>12.410000</td>\n      <td>17.805000</td>\n      <td>...</td>\n      <td>12.102000</td>\n      <td>17.025000</td>\n      <td>20.207000</td>\n      <td>12.709000</td>\n      <td>16.374000</td>\n      <td>17.082000</td>\n      <td>16.018000</td>\n      <td>17.628000</td>\n      <td>784.090250</td>\n      <td>0.000000</td>\n    </tr>\n    <tr>\n      <th>max</th>\n      <td>26569.000000</td>\n      <td>385.860000</td>\n      <td>9.000000</td>\n      <td>9.000000</td>\n      <td>29.000000</td>\n      <td>29.000000</td>\n      <td>24.000000</td>\n      <td>21.499000</td>\n      <td>16.484000</td>\n      <td>21.425000</td>\n      <td>...</td>\n      <td>15.412000</td>\n      <td>22.479000</td>\n      <td>25.640000</td>\n      <td>17.663000</td>\n      <td>22.713000</td>\n      <td>22.303000</td>\n      <td>21.626000</td>\n      <td>24.094000</td>\n      <td>1312.794000</td>\n      <td>1.000000</td>\n    </tr>\n  </tbody>\n</table>\n<p>8 rows × 23 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"markdown","source":"Take a look at the missing values and data 
types.","metadata":{}},{"cell_type":"code","source":"df.isna().any()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.560497Z","iopub.execute_input":"2022-08-12T16:08:01.560849Z","iopub.status.idle":"2022-08-12T16:08:01.573857Z","shell.execute_reply.started":"2022-08-12T16:08:01.560796Z","shell.execute_reply":"2022-08-12T16:08:01.572810Z"},"trusted":true},"execution_count":40,"outputs":[{"execution_count":40,"output_type":"execute_result","data":{"text/plain":"id                False\nproduct_code      False\nloading            True\nattribute_0       False\nattribute_1       False\nattribute_2       False\nattribute_3       False\nmeasurement_0     False\nmeasurement_1     False\nmeasurement_2     False\nmeasurement_3      True\nmeasurement_4      True\nmeasurement_5      True\nmeasurement_6      True\nmeasurement_7      True\nmeasurement_8      True\nmeasurement_9      True\nmeasurement_10     True\nmeasurement_11     True\nmeasurement_12     True\nmeasurement_13     True\nmeasurement_14     True\nmeasurement_15     True\nmeasurement_16     True\nmeasurement_17     True\nfailure           False\ndtype: bool"},"metadata":{}}]},{"cell_type":"code","source":"df.dtypes","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.575878Z","iopub.execute_input":"2022-08-12T16:08:01.576226Z","iopub.status.idle":"2022-08-12T16:08:01.585351Z","shell.execute_reply.started":"2022-08-12T16:08:01.576194Z","shell.execute_reply":"2022-08-12T16:08:01.584190Z"},"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"id                  int64\nproduct_code       object\nloading           float64\nattribute_0        object\nattribute_1        object\nattribute_2         int64\nattribute_3         int64\nmeasurement_0       int64\nmeasurement_1       int64\nmeasurement_2       int64\nmeasurement_3     float64\nmeasurement_4     float64\nmeasurement_5     float64\nmeasurement_6     
float64\nmeasurement_7     float64\nmeasurement_8     float64\nmeasurement_9     float64\nmeasurement_10    float64\nmeasurement_11    float64\nmeasurement_12    float64\nmeasurement_13    float64\nmeasurement_14    float64\nmeasurement_15    float64\nmeasurement_16    float64\nmeasurement_17    float64\nfailure             int64\ndtype: object"},"metadata":{}}]},{"cell_type":"markdown","source":"Let's see the cardinality of categorical variables.","metadata":{}},{"cell_type":"code","source":"print(df.product_code.nunique())\nprint(df.attribute_0.nunique())\nprint(df.attribute_0.nunique())\n","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.586538Z","iopub.execute_input":"2022-08-12T16:08:01.587124Z","iopub.status.idle":"2022-08-12T16:08:01.602366Z","shell.execute_reply.started":"2022-08-12T16:08:01.587085Z","shell.execute_reply":"2022-08-12T16:08:01.601468Z"},"trusted":true},"execution_count":42,"outputs":[{"name":"stdout","text":"5\n2\n2\n","output_type":"stream"}]},{"cell_type":"markdown","source":"# Preprocessing \nWe will use OneHotEncoder to encode our categorical variables, SimpleImputer to impute missing values and put them all in a ColumnTransformer. 
We will then use the transformer in our machine learning pipeline to have an end-to-end object for better reproducibility.","metadata":{}},{"cell_type":"code","source":"from sklearn.preprocessing import OneHotEncoder\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.compose import ColumnTransformer\n\ncolumn_transformer_pipeline = ColumnTransformer([\n                (\"loading_missing_value_imputer\", SimpleImputer(strategy=\"mean\"), [\"loading\"]),\n                (\"numerical_missing_value_imputer\", SimpleImputer(strategy=\"mean\"), list(df.columns[df.dtypes == 'float64'])),\n                (\"attribute_0_encoder\", OneHotEncoder(categories = \"auto\"), [\"attribute_0\"]),\n                (\"attribute_1_encoder\", OneHotEncoder(categories = \"auto\"), [\"attribute_1\"]),\n                (\"product_code_encoder\", OneHotEncoder(categories = \"auto\"), [\"product_code\"])])","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.603692Z","iopub.execute_input":"2022-08-12T16:08:01.604304Z","iopub.status.idle":"2022-08-12T16:08:01.612756Z","shell.execute_reply.started":"2022-08-12T16:08:01.604268Z","shell.execute_reply":"2022-08-12T16:08:01.611678Z"},"trusted":true},"execution_count":43,"outputs":[]},{"cell_type":"code","source":"df = df.drop([\"id\"], axis=1)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.616244Z","iopub.execute_input":"2022-08-12T16:08:01.616897Z","iopub.status.idle":"2022-08-12T16:08:01.628662Z","shell.execute_reply.started":"2022-08-12T16:08:01.616855Z","shell.execute_reply":"2022-08-12T16:08:01.627386Z"},"trusted":true},"execution_count":44,"outputs":[]},{"cell_type":"code","source":"from sklearn.tree import DecisionTreeClassifier\nfrom sklearn.pipeline import Pipeline\npipeline = Pipeline([\n    ('transformation', column_transformer_pipeline),\n    ('model', 
DecisionTreeClassifier(max_depth=4))\n])","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.630096Z","iopub.execute_input":"2022-08-12T16:08:01.631034Z","iopub.status.idle":"2022-08-12T16:08:01.640519Z","shell.execute_reply.started":"2022-08-12T16:08:01.630995Z","shell.execute_reply":"2022-08-12T16:08:01.639257Z"},"trusted":true},"execution_count":45,"outputs":[]},{"cell_type":"code","source":"X = df.drop([\"failure\"], axis = 1)\ny = df.failure","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.642270Z","iopub.execute_input":"2022-08-12T16:08:01.643448Z","iopub.status.idle":"2022-08-12T16:08:01.656699Z","shell.execute_reply.started":"2022-08-12T16:08:01.643404Z","shell.execute_reply":"2022-08-12T16:08:01.655346Z"},"trusted":true},"execution_count":46,"outputs":[]},{"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X, y)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.658597Z","iopub.execute_input":"2022-08-12T16:08:01.659907Z","iopub.status.idle":"2022-08-12T16:08:01.680523Z","shell.execute_reply.started":"2022-08-12T16:08:01.659853Z","shell.execute_reply":"2022-08-12T16:08:01.679150Z"},"trusted":true},"execution_count":47,"outputs":[]},{"cell_type":"code","source":"pipeline.fit(X_train, y_train)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.682078Z","iopub.execute_input":"2022-08-12T16:08:01.682574Z","iopub.status.idle":"2022-08-12T16:08:01.927531Z","shell.execute_reply.started":"2022-08-12T16:08:01.682526Z","shell.execute_reply":"2022-08-12T16:08:01.926319Z"},"trusted":true},"execution_count":48,"outputs":[{"execution_count":48,"output_type":"execute_result","data":{"text/plain":"Pipeline(steps=[('transformation',\n                 ColumnTransformer(transformers=[('loading_missing_value_imputer',\n                                                  SimpleImputer(),\n                                             
     ['loading']),\n                                                 ('numerical_missing_value_imputer',\n                                                  SimpleImputer(),\n                                                  ['loading', 'measurement_3',\n                                                   'measurement_4',\n                                                   'measurement_5',\n                                                   'measurement_6',\n                                                   'measurement_7',\n                                                   'measurement_8',\n                                                   'measurement_9',\n                                                   'measurement_10',\n                                                   'measurement_11',\n                                                   'measurement_12',\n                                                   'measurement_13',\n                                                   'measurement_14',\n                                                   'measurement_15',\n                                                   'measurement_16',\n                                                   'measurement_17']),\n                                                 ('attribute_0_encoder',\n                                                  OneHotEncoder(),\n                                                  ['attribute_0']),\n                                                 ('attribute_1_encoder',\n                                                  OneHotEncoder(),\n                                                  ['attribute_1']),\n                                                 ('product_code_encoder',\n                                                  OneHotEncoder(),\n                                                  ['product_code'])])),\n                ('model', DecisionTreeClassifier(max_depth=4))])"},"metadata":{}}]},{"cell_type":"code","source":"y_pred = 
pipeline.predict(X_test)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.929273Z","iopub.execute_input":"2022-08-12T16:08:01.929842Z","iopub.status.idle":"2022-08-12T16:08:01.956267Z","shell.execute_reply.started":"2022-08-12T16:08:01.929778Z","shell.execute_reply":"2022-08-12T16:08:01.955125Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"markdown","source":"# We will now save the model and create a model card with metrics about our model!","metadata":{}},{"cell_type":"markdown","source":"We will use `hub_utils` for model hosting and `card` to create a model card. First, we will initialize a local repository to contain our model, the model configuration, the model card and anything else we want to include (e.g. plots).","metadata":{}},{"cell_type":"code","source":"from skops import card, hub_utils\nimport pickle\n\nmodel_path = \"model.pkl\"\nlocal_repo = \"decision-tree-playground-kaggle\"\n\n# serialize the fitted pipeline to disk\nwith open(model_path, mode=\"bw\") as f:\n    pickle.dump(pipeline, file=f)\n\n# initialize the local repository with the model and its configuration\nhub_utils.init(\n    model=model_path,\n    requirements=[f\"scikit-learn={sklearn.__version__}\"],\n    dst=local_repo,\n    task=\"tabular-classification\",\n    data=X_test,\n)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.957800Z","iopub.execute_input":"2022-08-12T16:08:01.958544Z","iopub.status.idle":"2022-08-12T16:08:01.971908Z","shell.execute_reply.started":"2022-08-12T16:08:01.958496Z","shell.execute_reply":"2022-08-12T16:08:01.970902Z"},"trusted":true},"execution_count":50,"outputs":[]},{"cell_type":"markdown","source":"## We will now create our card 🃏 ","metadata":{}},{"cell_type":"markdown","source":"Creating the model card is as simple as instantiating the `Card` class of `skops`. Calling the `metadata_from_config` function creates the metadata section of the model card from the configuration file. 
We will use the `add` method to pass information to our model card.","metadata":{}},{"cell_type":"code","source":"from pathlib import Path\nmodel_card = card.Card(pipeline, metadata=card.metadata_from_config(Path(local_repo)))\n\n## let's fill some information about the model\nlimitations = \"This model is not ready to be used in production.\"\nmodel_description = \"This is a DecisionTreeClassifier model built for Kaggle Tabular Playground Series August 2022, trained on supersoaker production failures dataset.\"\nmodel_card_authors = \"huggingface\"\n# the path is quoted so that the generated snippet is valid Python\nget_started_code = f\"import pickle \\nwith open('{local_repo}/{model_path}', 'rb') as file: \\n    clf = pickle.load(file)\"\n\n# pass this information to the card\nmodel_card.add(\n    get_started_code=get_started_code,\n    model_card_authors=model_card_authors,\n    limitations=limitations,\n    model_description=model_description,\n)\n# the add methods return the model card itself, which allows method chaining","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.973796Z","iopub.execute_input":"2022-08-12T16:08:01.974696Z","iopub.status.idle":"2022-08-12T16:08:02.071532Z","shell.execute_reply.started":"2022-08-12T16:08:01.974655Z","shell.execute_reply":"2022-08-12T16:08:02.070310Z"},"trusted":true},"execution_count":51,"outputs":[{"execution_count":51,"output_type":"execute_result","data":{"text/plain":"Card(\n  model=Pipeline(steps=[('transformat...cisionTreeClassifier(max_depth=4))]),\n  metadata.library_name=sklearn,\n  metadata.tags=['sklearn', 'skops', 'tabular-classification'],\n  metadata.widget={...},\n  get_started_code=\"import pickle \\\\n...s file: \\\\n clf = pickle.load(file)\",\n  model_card_authors='huggingface',\n  limitations='This model is not ready to be used in production.',\n  model_description='This is a Decisi...soaker production failures dataset.',\n)"},"metadata":{}}]},{"cell_type":"markdown","source":"We will now create plots and insights about our model and write them to the model card. 
\nThe decision tree is the last step of the pipeline. Each step is stored as a `(name, estimator)` tuple, and the second element of the tuple is the estimator itself (here, the tree model), so to plot the tree we first have to retrieve it from the pipeline, as shown below.","metadata":{}},{"cell_type":"code","source":"pipeline.steps[-1][1]","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:02.073020Z","iopub.execute_input":"2022-08-12T16:08:02.073400Z","iopub.status.idle":"2022-08-12T16:08:02.082681Z","shell.execute_reply.started":"2022-08-12T16:08:02.073367Z","shell.execute_reply":"2022-08-12T16:08:02.080924Z"},"trusted":true},"execution_count":52,"outputs":[{"execution_count":52,"output_type":"execute_result","data":{"text/plain":"DecisionTreeClassifier(max_depth=4)"},"metadata":{}}]},{"cell_type":"markdown","source":"We can use `add_metrics` to pass metrics to our model card, which skops will render as a table for us. We will use `add_plot` to add our plots. ","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import accuracy_score, f1_score, ConfusionMatrixDisplay, confusion_matrix\nfrom sklearn.tree import plot_tree\n\n# let's make a prediction first so that we can evaluate the model\ny_pred = pipeline.predict(X_test)\n\nmodel_card.add(eval_method=\"The model is evaluated using test split, on accuracy and F1 score with micro average.\")\nmodel_card.add_metrics(accuracy=accuracy_score(y_test, y_pred))\nmodel_card.add_metrics(**{\"f1 score\": f1_score(y_test, y_pred, average=\"micro\")})\n\n# the estimator is the second element of the last (name, estimator) step\nmodel = pipeline.steps[-1][1]\n# we will plot the tree and add the plot to our card\nplt.figure()\nplot_tree(model, filled=True)\nplt.savefig(f'{local_repo}/tree.png', format='png', bbox_inches=\"tight\")\n\n# plot the confusion matrix of the predictions\ncm = confusion_matrix(y_test, y_pred, labels=model.classes_)\ndisp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)\ndisp.plot()\n# save the plot\nplt.savefig(Path(local_repo) / \"confusion_matrix.png\")\n# add figures to model card 
with their new sections as keys to the dictionary\nmodel_card.add_plot(**{\"Tree Plot\": f'{local_repo}/tree.png', \"Confusion Matrix\": f\"{local_repo}/confusion_matrix.png\"})","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:02.084852Z","iopub.execute_input":"2022-08-12T16:08:02.085287Z","iopub.status.idle":"2022-08-12T16:08:05.482006Z","shell.execute_reply.started":"2022-08-12T16:08:02.085232Z","shell.execute_reply":"2022-08-12T16:08:05.480747Z"},"trusted":true},"execution_count":53,"outputs":[{"execution_count":53,"output_type":"execute_result","data":{"text/plain":"Card(\n  model=Pipeline(steps=[('transformat...cisionTreeClassifier(max_depth=4))]),\n  metadata.library_name=sklearn,\n  metadata.tags=['sklearn', 'skops', 'tabular-classification'],\n  metadata.widget={...},\n  get_started_code=\"import pickle \\\\n...s file: \\\\n clf = pickle.load(file)\",\n  model_card_authors='huggingface',\n  limitations='This model is not ready to be used in production.',\n  model_description='This is a Decisi...soaker production failures dataset.',\n  eval_method='The model is evaluated...cy and F1 score with micro average.',\n  Tree Plot='decision-tree-playground-kaggle/tree.png',\n  Confusion Matrix='decision-tree-playground-kaggle/confusion_matrix.png',\n)"},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"<Figure size 432x288 with 1 Axes>","image/png":"\n"},"metadata":{"needs_background":"light"}},{"output_type":"display_data","data":{"text/plain":"<Figure size 432x288 with 2 Axes>","image/png":"\n"},"metadata":{"needs_background":"light"}}]},{"cell_type":"markdown","source":"We will now save our model 
card.","metadata":{}},{"cell_type":"code","source":"model_card.save(f\"{local_repo}/README.md\")","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:05.483575Z","iopub.execute_input":"2022-08-12T16:08:05.484066Z","iopub.status.idle":"2022-08-12T16:08:05.518708Z","shell.execute_reply.started":"2022-08-12T16:08:05.484021Z","shell.execute_reply":"2022-08-12T16:08:05.517522Z"},"trusted":true},"execution_count":54,"outputs":[]},{"cell_type":"markdown","source":"Let's push our model repository to the Hub! \nThe Hugging Face Hub requires us to authenticate, which we can do using `notebook_login`.\n","metadata":{}},{"cell_type":"code","source":"from huggingface_hub import notebook_login\nnotebook_login()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:04:27.699235Z","iopub.execute_input":"2022-08-12T16:04:27.699722Z","iopub.status.idle":"2022-08-12T16:04:27.744734Z","shell.execute_reply.started":"2022-08-12T16:04:27.699676Z","shell.execute_reply":"2022-08-12T16:04:27.743310Z"},"trusted":true},"execution_count":27,"outputs":[{"output_type":"display_data","data":{"text/plain":"VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"c262065390b9467180ad8645dedb582f"}},"metadata":{}}]},{"cell_type":"markdown","source":"We can push our model using `hub_utils.push`.","metadata":{}},{"cell_type":"code","source":"# if the repository doesn't exist on the Hugging Face Hub yet, it will be created because we set create_remote to True\nrepo_id = \"scikit-learn/tabular-playground\"\nhub_utils.push(\n    repo_id=repo_id,\n    source=local_repo,\n    token=token,\n    commit_message=\"pushing files to the repo from the example!\",\n    
create_remote=True,\n)\n","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:15.078202Z","iopub.execute_input":"2022-08-12T16:08:15.078653Z","iopub.status.idle":"2022-08-12T16:08:18.508240Z","shell.execute_reply.started":"2022-08-12T16:08:15.078614Z","shell.execute_reply":"2022-08-12T16:08:18.506828Z"},"trusted":true},"execution_count":55,"outputs":[]},{"cell_type":"markdown","source":"## After we push the repository, the widget is enabled, as shown below:","metadata":{}},{"cell_type":"markdown","source":"![Widget](https://huggingface.co/scikit-learn/tabular-playground/resolve/main/widget_screenshot.png)","metadata":{}},{"cell_type":"markdown","source":"# See what the repository and our model card look like [here](https://huggingface.co/scikit-learn/tabular-playground)  ✨","metadata":{}}]}