thanthamky committed on
Commit
914e01f
1 Parent(s): 99e0926

Upload 3 files

app/1-eda.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
app/2-data_preprocessing.ipynb ADDED
@@ -0,0 +1,326 @@
+ {
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Data Preprocessing\n",
+ "\n",
+ "This notebook shows how I performed data cleaning and feature engineering."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Setup"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Import libraries."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.preprocessing import MinMaxScaler"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#!pip install scikit-learn"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Load datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_train_full = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/raw/train.csv\")\n",
+ "df_test = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/raw/test.csv\")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since the test set provided does not include the target variable, we have to create an internal validation set to evaluate model performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_train, df_val = train_test_split(df_train_full, test_size=0.2, random_state=99)"
+ ]
+ },
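
Since `fraud` is imbalanced, a stratified split keeps the fraud rate consistent between the two sets; a minimal sketch of that variant, using the same names as above:

```python
# Sketch: stratified variant of the split above. The -1 placeholder labels
# are still present at this point and are simply treated as their own class.
df_train, df_val = train_test_split(
    df_train_full,
    test_size=0.2,
    random_state=99,
    stratify=df_train_full["fraud"],
)
```
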
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data Cleaning"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Remove the observations whose target variable `fraud` is equal to -1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_train = df_train[df_train[\"fraud\"] != -1]\n",
+ "df_val = df_val[df_val[\"fraud\"] != -1]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Values that match the following conditions are treated as missing values, to be imputed later.\n",
+ "\n",
+ "- `age_of_driver > 100`\n",
+ "- `annual_income = -1`\n",
+ "- `zip_code = 0`\n",
+ "\n",
+ "According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_the_verified_oldest_people), the oldest verified living person is 115 as of 2018, so I think it is reasonable to assume that any `age_of_driver > 100` in this dataset is a clerical error."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for df in [df_train, df_val, df_test]:\n",
+ "    df.loc[df[\"age_of_driver\"] > 100, \"age_of_driver\"] = np.nan\n",
+ "    df.loc[df[\"annual_income\"] == -1, \"annual_income\"] = np.nan\n",
+ "    df.loc[df[\"zip_code\"] == 0, \"zip_code\"] = np.nan"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we impute the missing values. Since only a very small percentage of values are missing, we simply use mean imputation for the continuous variables and mode imputation for the categorical variables. Note that the mean/mode is computed on the training set only, to prevent data leakage."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for df in [df_train, df_val, df_test]:\n",
+ "    # mean imputation for continuous variables\n",
+ "    for feature in [\"age_of_driver\", \"annual_income\", \"claim_est_payout\", \"age_of_vehicle\"]:\n",
+ "        feature_mean = df_train.loc[:, feature].mean(skipna=True)\n",
+ "        df[feature] = df[feature].fillna(int(feature_mean))\n",
+ "\n",
+ "    # mode imputation for categorical variables\n",
+ "    for feature in [\"marital_status\", \"witness_present_ind\", \"zip_code\"]:\n",
+ "        feature_mode = df_train.loc[:, feature].mode(dropna=True)\n",
+ "        df[feature] = df[feature].fillna(feature_mode.values[0])"
+ ]
+ },
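
The same train-set-only statistics can also be computed with scikit-learn's `SimpleImputer`; a minimal equivalent sketch, assuming the same column lists as above:

```python
from sklearn.impute import SimpleImputer

continuous = ["age_of_driver", "annual_income", "claim_est_payout", "age_of_vehicle"]
categorical = ["marital_status", "witness_present_ind", "zip_code"]

# Fit the imputation statistics on the training set only, then apply them to
# all three sets, so no information leaks in from the validation/test data.
mean_imputer = SimpleImputer(strategy="mean").fit(df_train[continuous])
mode_imputer = SimpleImputer(strategy="most_frequent").fit(df_train[categorical])

for df in [df_train, df_val, df_test]:
    df[continuous] = mean_imputer.transform(df[continuous])
    df[categorical] = mode_imputer.transform(df[categorical])
```
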
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Feature Engineering"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Remove features that do not seem to be related to the target variable (based on common sense)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for df in [df_train, df_val, df_test]:\n",
+ "    df.drop(columns=[\"claim_date\", \"claim_day_of_week\", \"vehicle_color\"], inplace=True)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are many unique `zip_code` values. Creating dummy variables for `zip_code` would increase the dimensionality of the data too much. One idea is to transform it into `latitude` and `longitude` using the data from [UnitedStatesZipCodes.org](https://www.unitedstateszipcodes.org/zip-code-database/)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "zip_code_database = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/external/zip_code_database.csv\")\n",
+ "latitude_and_longitude_lookup = {\n",
+ "    row.zip: (row.latitude, row.longitude) for row in zip_code_database.itertuples()\n",
+ "}\n",
+ "\n",
+ "for df in [df_train, df_val, df_test]:\n",
+ "    df[\"latitude\"] = df[\"zip_code\"].apply(lambda x: latitude_and_longitude_lookup[x][0])\n",
+ "    df[\"longitude\"] = df[\"zip_code\"].apply(lambda x: latitude_and_longitude_lookup[x][1])"
+ ]
+ },
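
A `zip_code` that is absent from the external database would raise a `KeyError` in the lookup above; a defensive sketch that falls back to `NaN` via `dict.get`:

```python
# Sketch: fall back to NaN for zip codes missing from the external database.
for df in [df_train, df_val, df_test]:
    coords = df["zip_code"].apply(
        lambda x: latitude_and_longitude_lookup.get(x, (np.nan, np.nan))
    )
    df["latitude"] = coords.apply(lambda c: c[0])
    df["longitude"] = coords.apply(lambda c: c[1])
```
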
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Another idea is to use [target encoding](https://maxhalford.github.io/blog/target-encoding/), but after a few experiments it seems to perform worse than just transforming it to `latitude` and `longitude`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#from category_encoders.target_encoder import TargetEncoder\n",
+ "#\n",
+ "#target_encoder = TargetEncoder(cols=[\"zip_code\"], smoothing=10)\n",
+ "#target_encoder.fit(df_train[\"zip_code\"], df_train[\"fraud\"])\n",
+ "#\n",
+ "#for df in [df_train, df_val, df_test]:\n",
+ "#    df[\"zip_code_target_encoded\"] = target_encoder.transform(df[\"zip_code\"])"
+ ]
+ },
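
For intuition, smoothed target encoding blends each zip code's observed fraud rate with the global fraud rate, weighted by how many training claims that zip code has; a minimal hand-rolled sketch (the `smoothed_target_encode` helper is illustrative, not the notebook's code):

```python
# Sketch of smoothed target encoding: blend each zip code's mean fraud rate
# with the global rate, weighting by the zip code's count vs `smoothing`.
def smoothed_target_encode(train, column, target, smoothing=10):
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["count", "mean"])
    weight = stats["count"] / (stats["count"] + smoothing)
    return weight * stats["mean"] + (1 - weight) * global_mean

encoding = smoothed_target_encode(df_train, "zip_code", "fraud")
for df in [df_train, df_val, df_test]:
    # zip codes unseen in training fall back to the global fraud rate
    df["zip_code_target_encoded"] = (
        df["zip_code"].map(encoding).fillna(df_train["fraud"].mean())
    )
```
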
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we can drop `zip_code`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for df in [df_train, df_val, df_test]:\n",
+ "    df.drop(columns=[\"zip_code\"], inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Export processed data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#df_train.to_csv(\"../data/processed/train.csv\", index=False)\n",
+ "#df_val.to_csv(\"../data/processed/val.csv\", index=False)\n",
+ "#df_test.to_csv(\"../data/processed/test.csv\", index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "03e93f2959c516196957ae17ec0aa5d1e9fc5dd82cbe13968d4cfc2a60558992"
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.1"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
app/3-modeling.ipynb ADDED
@@ -0,0 +1,830 @@
+ {
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Modeling\n",
+ "\n",
+ "In this notebook, the performance of different models is examined."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Setup"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Import libraries."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from imblearn.pipeline import make_pipeline\n",
+ "from imblearn.over_sampling import SMOTE\n",
+ "from sklearn.compose import make_column_transformer\n",
+ "from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.metrics import roc_auc_score\n",
+ "from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
+ "from xgboost import XGBClassifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: imblearn in /home/user/miniconda/lib/python3.12/site-packages (0.0)\n",
+ "Collecting xgboost\n",
+ " Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl.metadata (2.0 kB)\n",
+ "Requirement already satisfied: imbalanced-learn in /home/user/miniconda/lib/python3.12/site-packages (from imblearn) (0.12.3)\n",
+ "Requirement already satisfied: numpy in /home/user/miniconda/lib/python3.12/site-packages (from xgboost) (1.26.4)\n",
+ "Requirement already satisfied: scipy in /home/user/miniconda/lib/python3.12/site-packages (from xgboost) (1.13.1)\n",
+ "Requirement already satisfied: scikit-learn>=1.0.2 in /home/user/miniconda/lib/python3.12/site-packages (from imbalanced-learn->imblearn) (1.5.0)\n",
+ "Requirement already satisfied: joblib>=1.1.1 in /home/user/miniconda/lib/python3.12/site-packages (from imbalanced-learn->imblearn) (1.4.2)\n",
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /home/user/miniconda/lib/python3.12/site-packages (from imbalanced-learn->imblearn) (3.5.0)\n",
+ "Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl (297.1 MB)\n",
+ "Installing collected packages: xgboost\n",
+ "Successfully installed xgboost-2.0.3\n"
+ ]
+ }
+ ],
+ "source": [
+ "!pip install imblearn xgboost"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Load datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_train = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/processed/train.csv\")\n",
+ "df_val = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/processed/val.csv\")\n",
+ "df_test = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/processed/test.csv\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_train = df_train.drop(columns=[\"claim_number\", \"fraud\"])\n",
+ "y_train = df_train[\"fraud\"]\n",
+ "X_val = df_val.drop(columns=[\"claim_number\", \"fraud\"])\n",
+ "y_val = df_val[\"fraud\"]\n",
+ "X_test = df_test.drop(columns=[\"claim_number\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Model Selection"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`OneHotEncoder` dummifies the categorical features, and the numerical features are rescaled with `MinMaxScaler`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "categorical_features = X_train.columns[X_train.dtypes == object].tolist()\n",
+ "column_transformer = make_column_transformer(\n",
+ "    (OneHotEncoder(drop=\"first\"), categorical_features),\n",
+ "    remainder=\"passthrough\",\n",
+ ")\n",
+ "scaler = MinMaxScaler()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A simple function that defines the training pipeline: fit the model, predict on the validation set, and print the evaluation metric."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def modeling(X_train, y_train, X_val, y_val, steps):\n",
+ "    pipeline = make_pipeline(*steps)\n",
+ "    pipeline.fit(X_train, y_train)\n",
+ "    y_val_pred = pipeline.predict_proba(X_val)[:, 1]\n",
+ "    metric = roc_auc_score(y_val, y_val_pred)\n",
+ "    if isinstance(pipeline._final_estimator, (RandomizedSearchCV, GridSearchCV)):\n",
+ "        print(f\"Best params: {pipeline._final_estimator.best_params_}\")\n",
+ "    print(f\"AUC score: {metric}\")\n",
+ "    return pipeline"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### K-Nearest Neighbor\n",
+ "\n",
+ "KNN has two hyperparameters: the number of neighbors, and whether all points in each neighborhood are weighted equally or by the inverse of their distance. Since the number of hyperparameters is small, a grid search is used to find the optimal hyperparameter values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Best params: {'n_neighbors': 50, 'weights': 'distance'}\n",
+ "AUC score: 0.6507841602442943\n"
+ ]
+ }
+ ],
+ "source": [
+ "param_grid = {\n",
+ "    \"n_neighbors\": [5, 10, 25, 50],\n",
+ "    \"weights\": [\"uniform\", \"distance\"],\n",
+ "}\n",
+ "\n",
+ "knn_clf = GridSearchCV(\n",
+ "    KNeighborsClassifier(),\n",
+ "    param_grid=param_grid,\n",
+ "    n_jobs=-1,\n",
+ "    cv=5,\n",
+ "    scoring=\"roc_auc\",\n",
+ ")\n",
+ "\n",
+ "knn_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, knn_clf])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Logistic Regression\n",
+ "\n",
+ "For logistic regression, the default settings are used (the regularization strength `C` could be tuned, but it is left at its default here)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "AUC score: 0.7157014847720347\n"
+ ]
+ }
+ ],
+ "source": [
+ "lr_clf = LogisticRegression()\n",
+ "lr_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, lr_clf])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Look at the model coefficients."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " feature_name coefficient\n",
+ "0 past_num_of_claims 1.750160\n",
+ "1 annual_income 1.570769\n",
+ "2 age_of_vehicle 0.982407\n",
+ "3 address_change_ind 0.398596\n",
+ "4 longitude 0.362837\n",
+ "5 living_status_Rent 0.128913\n",
+ "6 policy_report_filed_ind 0.083922\n",
+ "7 channel_Phone 0.039526\n",
+ "8 liab_prct 0.031912\n",
+ "9 vehicle_weight 0.031770\n",
+ "10 vehicle_price 0.030162\n",
+ "11 vehicle_category_Medium 0.027484\n",
+ "12 vehicle_category_Large -0.063941\n",
+ "13 latitude -0.166059\n",
+ "14 accident_site_Local -0.234709\n",
+ "15 gender_M -0.277402\n",
+ "16 channel_Online -0.306284\n",
+ "17 claim_est_payout -0.344002\n",
+ "18 marital_status -0.459327\n",
+ "19 high_education_ind -0.647302\n",
+ "20 witness_present_ind -0.709166\n",
+ "21 accident_site_Parking Lot -1.012493\n",
+ "22 safty_rating -1.031068\n",
+ "23 age_of_driver -2.510087"
+ ]
+ },
+ "execution_count": 42,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def add_dummies(df, categorical_features):\n",
+ "    # returns the column names after one-hot encoding, in the same order\n",
+ "    # as the ColumnTransformer output (dummy columns first, then the rest)\n",
+ "    dummies = pd.get_dummies(df[categorical_features], drop_first=True)\n",
+ "    df = pd.concat([dummies, df], axis=1)\n",
+ "    df = df.drop(categorical_features, axis=1)\n",
+ "    return df.columns\n",
+ "\n",
+ "feature_names = add_dummies(X_train, categorical_features)\n",
+ "\n",
+ "pd.DataFrame({\n",
+ "    \"feature_name\": feature_names,\n",
+ "    \"coefficient\": lr_pipeline._final_estimator.coef_[0]\n",
+ "}).sort_values(by=\"coefficient\", ascending=False).reset_index(drop=True)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### XGBoost\n",
+ "\n",
+ "Since there are many hyperparameters in XGBoost, I decided to use a randomized search for hyperparameter tuning."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Best params: {'subsample': 0.7, 'n_estimators': 100, 'min_child_weight': 7.0, 'max_depth': 1, 'learning_rate': 0.3, 'gamma': 0.25, 'colsample_bytree': 1.0, 'colsample_bylevel': 0.8}\n",
+ "AUC score: 0.7299474921988243\n"
+ ]
+ }
+ ],
+ "source": [
+ "param_grid = {\n",
+ "    \"max_depth\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n",
+ "    \"learning_rate\": [0.001, 0.01, 0.1, 0.2, 0.3],\n",
+ "    \"subsample\": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],\n",
+ "    \"colsample_bytree\": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],\n",
+ "    \"colsample_bylevel\": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],\n",
+ "    \"min_child_weight\": [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],\n",
+ "    \"gamma\": [0, 0.25, 0.5, 1.0],\n",
+ "    \"n_estimators\": [10, 20, 40, 60, 80, 100, 150, 200]\n",
+ "}\n",
+ "\n",
+ "xgb_clf = RandomizedSearchCV(\n",
+ "    XGBClassifier(),\n",
+ "    param_distributions=param_grid,\n",
+ "    n_iter=50,\n",
+ "    n_jobs=-1,\n",
+ "    cv=5,\n",
+ "    random_state=23,\n",
+ "    scoring=\"roc_auc\",\n",
+ ")\n",
+ "\n",
+ "xgb_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, xgb_clf])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Although the class imbalance is not very serious in this dataset, I want to see whether using SMOTE to synthesize new examples for the minority class can improve the predictive performance. However, it seems that using SMOTE only worsens the performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Best params: {'subsample': 1.0, 'n_estimators': 200, 'min_child_weight': 0.5, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 0.25, 'colsample_bytree': 0.5, 'colsample_bylevel': 0.6}\n",
+ "AUC score: 0.6962796916323821\n"
+ ]
+ }
+ ],
+ "source": [
+ "sampler = SMOTE(random_state=42)\n",
+ "xgb_pipeline_smote = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, sampler, xgb_clf])"
+ ]
+ },
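
To see what SMOTE actually does to the class balance, one can resample the transformed training set directly; a minimal sketch, assuming the `column_transformer` defined above:

```python
from collections import Counter

# Sketch: inspect class counts before and after SMOTE oversampling.
Xt = column_transformer.fit_transform(X_train)
X_res, y_res = SMOTE(random_state=42).fit_resample(Xt, y_train)
print("before:", Counter(y_train))
print("after: ", Counter(y_res))
```
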
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Save the XGBoost model (without SMOTE), since it has the best performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "best_model = xgb_pipeline._final_estimator.best_estimator_\n",
+ "steps = [column_transformer, scaler, best_model]\n",
+ "pipeline = make_pipeline(*steps)\n",
+ "y_test_pred = pipeline.predict_proba(X_test)[:, 1]\n",
+ "\n",
+ "df = pd.DataFrame({\n",
+ "    \"claim_number\": df_test[\"claim_number\"],\n",
+ "    \"fraud\": y_test_pred\n",
+ "})\n",
+ "#df.to_csv(\"../data/submission/submission.csv\", index=False)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To examine which features are important, I introduce a feature filled with random numbers. A feature can be considered important if its importance is larger than that of the random feature."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " feature_name importance\n",
+ "0 accident_site_Parking Lot 0.111572\n",
+ "1 high_education_ind 0.082720\n",
+ "2 witness_present_ind 0.072724\n",
+ "3 past_num_of_claims 0.052461\n",
+ "4 marital_status 0.052270\n",
+ "5 address_change_ind 0.044381\n",
+ "6 age_of_driver 0.039922\n",
+ "7 longitude 0.034581\n",
+ "8 safty_rating 0.033645\n",
+ "9 claim_est_payout 0.032631\n",
+ "10 random_feature 0.032600\n",
+ "11 liab_prct 0.032246\n",
+ "12 vehicle_price 0.032152\n",
+ "13 annual_income 0.031335\n",
+ "14 vehicle_weight 0.030896\n",
+ "15 latitude 0.030324\n",
+ "16 channel_Online 0.030144\n",
+ "17 accident_site_Local 0.029325\n",
+ "18 gender_M 0.028732\n",
+ "19 vehicle_category_Large 0.028661\n",
+ "20 channel_Phone 0.027671\n",
+ "21 vehicle_category_Medium 0.027547\n",
+ "22 living_status_Rent 0.027294\n",
+ "23 age_of_vehicle 0.027125\n",
+ "24 policy_report_filed_ind 0.027040"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "X_train[\"random_feature\"] = np.random.uniform(size=len(X_train))\n",
+ "xgb_clf_random_feature = XGBClassifier(**xgb_pipeline._final_estimator.best_params_)\n",
+ "steps = [column_transformer, scaler, xgb_clf_random_feature]\n",
+ "xgb_pipeline_random_feature = make_pipeline(*steps)\n",
+ "xgb_pipeline_random_feature = xgb_pipeline_random_feature.fit(X_train, y_train)\n",
+ "\n",
+ "pd.DataFrame({\n",
+ "    \"feature_name\": list(feature_names) + [\"random_feature\"],\n",
+ "    \"importance\": xgb_pipeline_random_feature._final_estimator.feature_importances_\n",
+ "}).sort_values(by=\"importance\", ascending=False).reset_index(drop=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_train"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y_train"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pickle\n",
+ "\n",
+ "with open('./best_model_3.pickle', 'wb') as handle:\n",
+ "    pickle.dump(xgb_pipeline, handle)"
+ ]
+ },
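
The pickled pipeline can later be loaded back and used to score new claims; a minimal sketch:

```python
import pickle

# Sketch: reload the saved pipeline and produce fraud probabilities.
with open("./best_model_3.pickle", "rb") as handle:
    loaded_pipeline = pickle.load(handle)

fraud_scores = loaded_pipeline.predict_proba(X_test)[:, 1]
```
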
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "03e93f2959c516196957ae17ec0aa5d1e9fc5dd82cbe13968d4cfc2a60558992"
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.1"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }