infr-car-jupyter

thanthamky committed on Jun 11

Commit 914e01f • 1 Parent(s): 99e0926

Upload 3 files

Files changed (3)

app/1-eda.ipynb +0 -0
app/2-data_preprocessing.ipynb +326 -0
app/3-modeling.ipynb +830 -0

app/1-eda.ipynb ADDED

The diff for this file is too large to render. See raw diff

app/2-data_preprocessing.ipynb ADDED

	@@ -0,0 +1,326 @@

+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data Preprocessing\n",
+    "\n",
+    "This file shows how I performed data cleaning and feature engineering. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Import libraries."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.preprocessing import MinMaxScaler"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!pip install scikit-learn"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load datasets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_train_full = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/raw/train.csv\")\n",
+    "df_test = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/raw/test.csv\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since the provided test set does not include the target variable, we have to create an internal validation set to evaluate model performance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_train, df_val = train_test_split(df_train_full, test_size=0.2, random_state=99)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data Cleaning"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Remove the observations whose target variable `fraud` is equal to -1."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_train = df_train[df_train[\"fraud\"] != -1]\n",
+    "df_val = df_val[df_val[\"fraud\"] != -1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For values that match the following conditions, treat them as missing values to be imputed later.\n",
+    "\n",
+    "- `age_of_driver > 100`\n",
+    "- `annual_income = -1`\n",
+    "- `zip_code = 0`\n",
+    "\n",
+    "According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_the_verified_oldest_people), the oldest living person is 115, as of 2018. I think it is reasonable to assume that any `age_of_driver > 100` in this dataset is a clerical error."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for df in [df_train, df_val, df_test]:\n",
+    "    df.loc[df[\"age_of_driver\"] > 100, \"age_of_driver\"] = np.nan\n",
+    "    df.loc[df[\"annual_income\"] == -1, \"annual_income\"] = np.nan\n",
+    "    df.loc[df[\"zip_code\"] == 0, \"zip_code\"] = np.nan"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, we will do an imputation for the missing values. Since there is only a very small percentage of missing values, we will simply do a mean/mode imputation for the continuous/categorical variables. Note that the mean/mode is computed based on the training set only to prevent data leakage."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for df in [df_train, df_val, df_test]:\n",
+    "    # mean imputation for continuous variables (mean computed on the training set only)\n",
+    "    for feature in [\"age_of_driver\", \"annual_income\", \"claim_est_payout\", \"age_of_vehicle\"]:\n",
+    "        feature_mean = df_train.loc[:, feature].mean(skipna=True)\n",
+    "        # assign the result back instead of calling fillna(inplace=True) on a column selection\n",
+    "        df[feature] = df[feature].fillna(int(feature_mean))\n",
+    "\n",
+    "    # mode imputation for categorical variables (mode computed on the training set only)\n",
+    "    for feature in [\"marital_status\", \"witness_present_ind\", \"zip_code\"]:\n",
+    "        feature_mode = df_train.loc[:, feature].mode(dropna=True)\n",
+    "        df[feature] = df[feature].fillna(feature_mode.values[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Feature Engineering"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Remove features that do not seem to be related to the target variable (based on common sense)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for df in [df_train, df_val, df_test]:\n",
+    "    df.drop(columns=[\"claim_date\", \"claim_day_of_week\", \"vehicle_color\"], inplace=True)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are many unique `zip_code` values, so creating dummy variables for `zip_code` would increase the dimensionality of the data too much. One idea is to transform it into `latitude` and `longitude` using the data from [UnitedStatesZipCodes.org](https://www.unitedstateszipcodes.org/zip-code-database/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zip_code_database = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/external/zip_code_database.csv\")\n",
+    "latitude_and_longitude_lookup = {\n",
+    "    row.zip: (row.latitude, row.longitude) for row in zip_code_database.itertuples()\n",
+    "}\n",
+    "\n",
+    "for df in [df_train, df_val, df_test]:\n",
+    "    df[\"latitude\"] = df[\"zip_code\"].apply(lambda x: latitude_and_longitude_lookup[x][0])\n",
+    "    df[\"longitude\"] = df[\"zip_code\"].apply(lambda x: latitude_and_longitude_lookup[x][1])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Another idea is to use [target encoding](https://maxhalford.github.io/blog/target-encoding/), but after a few experiments it seems to perform worse than just transforming it to `latitude` and `longitude`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#from category_encoders.target_encoder import TargetEncoder\n",
+    "#\n",
+    "#target_encoder = TargetEncoder(cols=[\"zip_code\"], smoothing=10)\n",
+    "#target_encoder.fit(df_train[\"zip_code\"], df_train[\"fraud\"])\n",
+    "#\n",
+    "#for df in [df_train, df_val, df_test]:\n",
+    "#    df[\"zip_code_target_encoded\"] = target_encoder.transform(df[\"zip_code\"])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we can drop `zip_code`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for df in [df_train, df_val, df_test]:\n",
+    "    df.drop(columns=[\"zip_code\"], inplace=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Export processed data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#df_train.to_csv(\"../data/processed/train.csv\", index=False)\n",
+    "#df_val.to_csv(\"../data/processed/val.csv\", index=False)\n",
+    "#df_test.to_csv(\"../data/processed/test.csv\", index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "interpreter": {
+   "hash": "03e93f2959c516196957ae17ec0aa5d1e9fc5dd82cbe13968d4cfc2a60558992"
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
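
The preprocessing notebook above imputes missing values with statistics taken from the training split only. A minimal sketch of that idea on a toy frame follows. The helper name `impute_from_train` and the toy data are assumptions made for illustration and are not part of this commit. Unlike the notebook, the sketch computes every fill value once, before any frame is modified, and does not truncate the mean to an integer.

```python
import numpy as np
import pandas as pd


def impute_from_train(train, others, mean_cols, mode_cols):
    """Fill NaNs in every frame using statistics computed on `train` only.

    Hypothetical helper: the fill values are computed once, before any
    frame is modified, so imputing `train` cannot shift them.
    """
    fill_values = {col: train[col].mean() for col in mean_cols}
    fill_values.update(
        {col: train[col].mode(dropna=True).iloc[0] for col in mode_cols}
    )
    # DataFrame.fillna with a dict fills each column with its own value
    # and returns new frames, so no chained in-place assignment is needed.
    return [frame.fillna(value=fill_values) for frame in [train, *others]]


# Toy data (made up for illustration).
toy_train = pd.DataFrame({
    "age_of_driver": [30.0, 50.0, np.nan],
    "marital_status": [1.0, 1.0, np.nan],
})
toy_val = pd.DataFrame({
    "age_of_driver": [np.nan, 20.0],
    "marital_status": [0.0, np.nan],
})

train_filled, val_filled = impute_from_train(
    toy_train, [toy_val], mean_cols=["age_of_driver"], mode_cols=["marital_status"]
)
print(val_filled)
```

Because the validation frame is filled with the training mean and mode, no information from the validation or test rows leaks into the fill values.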

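The latitude/longitude lookup in the preprocessing notebook indexes a dictionary directly, which raises `KeyError` for any ZIP code missing from the lookup table. A hedged sketch of a more forgiving variant uses `Series.map`, which yields NaN for unknown keys. The toy table and its values are invented for illustration and do not come from the real ZIP code database.

```python
import pandas as pd

# Toy stand-in for the ZIP code database (values are made up).
zip_table = pd.DataFrame({
    "zip": [10001, 60601],
    "latitude": [40.75, 41.89],
    "longitude": [-73.99, -87.62],
}).set_index("zip")

claims = pd.DataFrame({"zip_code": [10001, 60601, 99999]})

# Series.map with a Series argument looks each value up in that Series' index;
# ZIP codes absent from the table become NaN instead of raising KeyError.
claims["latitude"] = claims["zip_code"].map(zip_table["latitude"])
claims["longitude"] = claims["zip_code"].map(zip_table["longitude"])
print(claims)
```

Rows left with NaN coordinates can then be imputed or dropped explicitly, rather than stopping the notebook on the first unknown ZIP code.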
app/3-modeling.ipynb ADDED

	@@ -0,0 +1,830 @@

+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Modeling\n",
+    "\n",
+    "In this notebook, the performance of different models is examined."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Import libraries."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "from imblearn.pipeline import make_pipeline\n",
+    "from imblearn.over_sampling import SMOTE\n",
+    "from sklearn.compose import make_column_transformer\n",
+    "from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import roc_auc_score\n",
+    "from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV\n",
+    "from sklearn.neighbors import KNeighborsClassifier\n",
+    "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
+    "from xgboost import XGBClassifier"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Requirement already satisfied: imblearn in /home/user/miniconda/lib/python3.12/site-packages (0.0)\n",
+      "Collecting xgboost\n",
+      "  Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl.metadata (2.0 kB)\n",
+      "Requirement already satisfied: imbalanced-learn in /home/user/miniconda/lib/python3.12/site-packages (from imblearn) (0.12.3)\n",
+      "Requirement already satisfied: numpy in /home/user/miniconda/lib/python3.12/site-packages (from xgboost) (1.26.4)\n",
+      "Requirement already satisfied: scipy in /home/user/miniconda/lib/python3.12/site-packages (from xgboost) (1.13.1)\n",
+      "Requirement already satisfied: scikit-learn>=1.0.2 in /home/user/miniconda/lib/python3.12/site-packages (from imbalanced-learn->imblearn) (1.5.0)\n",
+      "Requirement already satisfied: joblib>=1.1.1 in /home/user/miniconda/lib/python3.12/site-packages (from imbalanced-learn->imblearn) (1.4.2)\n",
+      "Requirement already satisfied: threadpoolctl>=2.0.0 in /home/user/miniconda/lib/python3.12/site-packages (from imbalanced-learn->imblearn) (3.5.0)\n",
+      "Downloading xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl (297.1 MB)\n",
+      "   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 297.1/297.1 MB 3.2 MB/s eta 0:00:00\n",
+      "Installing collected packages: xgboost\n",
+      "Successfully installed xgboost-2.0.3\n"
+     ]
+    }
+   ],
+   "source": [
+    "!pip install imblearn xgboost"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load datasets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_train = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/processed/train.csv\")\n",
+    "df_val = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/processed/val.csv\")\n",
+    "df_test = pd.read_csv(\"https://raw.githubusercontent.com/kingyiusuen/travelers-insurance-fraud/master/data/processed/test.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X_train = df_train.drop(columns=[\"claim_number\", \"fraud\"])\n",
+    "y_train = df_train[\"fraud\"]\n",
+    "X_val = df_val.drop(columns=[\"claim_number\", \"fraud\"])\n",
+    "y_val = df_val[\"fraud\"]\n",
+    "X_test = df_test.drop(columns=[\"claim_number\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Model Selection"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`OneHotEncoder` will dummify categorical features, and numerical features will be re-scaled with `MinMaxScaler`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "categorical_features = X_train.columns[X_train.dtypes == object].tolist()\n",
+    "column_transformer = make_column_transformer(\n",
+    "    (OneHotEncoder(drop=\"first\"), categorical_features),\n",
+    "    remainder=\"passthrough\",\n",
+    ")\n",
+    "scaler = MinMaxScaler()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A simple function that defines the training pipeline: fit the model, predict on the validation set, print the evaluation metric."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def modeling(X_train, y_train, X_val, y_val, steps):\n",
+    "    pipeline = make_pipeline(*steps)\n",
+    "    pipeline.fit(X_train, y_train)\n",
+    "    y_val_pred = pipeline.predict_proba(X_val)[:, 1]\n",
+    "    metric = roc_auc_score(y_val, y_val_pred)\n",
+    "    if isinstance(pipeline._final_estimator, (RandomizedSearchCV, GridSearchCV)):\n",
+    "        print(f\"Best params: {pipeline._final_estimator.best_params_}\")\n",
+    "    print(f\"AUC score: {metric}\")\n",
+    "    return pipeline"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### K-Nearest Neighbor\n",
+    "\n",
+    "Two KNN hyperparameters are tuned here: the number of neighbors, and whether all points in each neighborhood are weighted equally or by the inverse of their distance. Since the search space is small, a grid search is used to find the best values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Best params: {'n_neighbors': 50, 'weights': 'distance'}\n",
+      "AUC score: 0.6507841602442943\n"
+     ]
+    }
+   ],
+   "source": [
+    "param_grid = {\n",
+    "    \"n_neighbors\": [5, 10, 25, 50],\n",
+    "    \"weights\": [\"uniform\", \"distance\"],\n",
+    "}\n",
+    "\n",
+    "knn_clf = GridSearchCV(\n",
+    "    KNeighborsClassifier(),\n",
+    "    param_grid=param_grid,\n",
+    "    n_jobs=-1,\n",
+    "    cv=5,\n",
+    "    scoring=\"roc_auc\",\n",
+    ")\n",
+    "\n",
+    "knn_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, knn_clf])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Logistic Regression\n",
+    "\n",
+    "For logistic regression, no hyperparameters are tuned; scikit-learn's defaults (L2 regularization with `C=1.0`) are used."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "AUC score: 0.7157014847720347\n"
+     ]
+    }
+   ],
+   "source": [
+    "lr_clf = LogisticRegression()\n",
+    "lr_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, lr_clf])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Look at the model coefficients."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>feature_name</th>\n",
+       "      <th>coefficient</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>past_num_of_claims</td>\n",
+       "      <td>1.750160</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>annual_income</td>\n",
+       "      <td>1.570769</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>age_of_vehicle</td>\n",
+       "      <td>0.982407</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>address_change_ind</td>\n",
+       "      <td>0.398596</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>longitude</td>\n",
+       "      <td>0.362837</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>living_status_Rent</td>\n",
+       "      <td>0.128913</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>policy_report_filed_ind</td>\n",
+       "      <td>0.083922</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>channel_Phone</td>\n",
+       "      <td>0.039526</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>liab_prct</td>\n",
+       "      <td>0.031912</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>vehicle_weight</td>\n",
+       "      <td>0.031770</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>10</th>\n",
+       "      <td>vehicle_price</td>\n",
+       "      <td>0.030162</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>11</th>\n",
+       "      <td>vehicle_category_Medium</td>\n",
+       "      <td>0.027484</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12</th>\n",
+       "      <td>vehicle_category_Large</td>\n",
+       "      <td>-0.063941</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>13</th>\n",
+       "      <td>latitude</td>\n",
+       "      <td>-0.166059</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>14</th>\n",
+       "      <td>accident_site_Local</td>\n",
+       "      <td>-0.234709</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>15</th>\n",
+       "      <td>gender_M</td>\n",
+       "      <td>-0.277402</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>16</th>\n",
+       "      <td>channel_Online</td>\n",
+       "      <td>-0.306284</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>17</th>\n",
+       "      <td>claim_est_payout</td>\n",
+       "      <td>-0.344002</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18</th>\n",
+       "      <td>marital_status</td>\n",
+       "      <td>-0.459327</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19</th>\n",
+       "      <td>high_education_ind</td>\n",
+       "      <td>-0.647302</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>20</th>\n",
+       "      <td>witness_present_ind</td>\n",
+       "      <td>-0.709166</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>21</th>\n",
+       "      <td>accident_site_Parking Lot</td>\n",
+       "      <td>-1.012493</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>22</th>\n",
+       "      <td>safty_rating</td>\n",
+       "      <td>-1.031068</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>23</th>\n",
+       "      <td>age_of_driver</td>\n",
+       "      <td>-2.510087</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                 feature_name  coefficient\n",
+       "0          past_num_of_claims     1.750160\n",
+       "1               annual_income     1.570769\n",
+       "2              age_of_vehicle     0.982407\n",
+       "3          address_change_ind     0.398596\n",
+       "4                   longitude     0.362837\n",
+       "5          living_status_Rent     0.128913\n",
+       "6     policy_report_filed_ind     0.083922\n",
+       "7               channel_Phone     0.039526\n",
+       "8                   liab_prct     0.031912\n",
+       "9              vehicle_weight     0.031770\n",
+       "10              vehicle_price     0.030162\n",
+       "11    vehicle_category_Medium     0.027484\n",
+       "12     vehicle_category_Large    -0.063941\n",
+       "13                   latitude    -0.166059\n",
+       "14        accident_site_Local    -0.234709\n",
+       "15                   gender_M    -0.277402\n",
+       "16             channel_Online    -0.306284\n",
+       "17           claim_est_payout    -0.344002\n",
+       "18             marital_status    -0.459327\n",
+       "19         high_education_ind    -0.647302\n",
+       "20        witness_present_ind    -0.709166\n",
+       "21  accident_site_Parking Lot    -1.012493\n",
+       "22               safty_rating    -1.031068\n",
+       "23              age_of_driver    -2.510087"
+      ]
+     },
+     "execution_count": 42,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "def add_dummies(df, categorical_features):\n",
+    "    dummies = pd.get_dummies(df[categorical_features], drop_first=True)\n",
+    "    df = pd.concat([dummies, df], axis=1)\n",
+    "    df = df.drop(categorical_features, axis=1)\n",
+    "    return df.columns\n",
+    "\n",
+    "feature_names = add_dummies(X_train, categorical_features)\n",
+    "\n",
+    "pd.DataFrame({\n",
+    "    \"feature_name\": feature_names,\n",
+    "    \"coefficient\": lr_pipeline._final_estimator.coef_[0]\n",
+    "}).sort_values(by=\"coefficient\", ascending=False).reset_index(drop=True)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### XGBoost\n",
+    "\n",
+    "Since XGBoost has many hyperparameters, I use a randomized search for hyperparameter tuning."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Best params: {'subsample': 0.7, 'n_estimators': 100, 'min_child_weight': 7.0, 'max_depth': 1, 'learning_rate': 0.3, 'gamma': 0.25, 'colsample_bytree': 1.0, 'colsample_bylevel': 0.8}\n",
+      "AUC score: 0.7299474921988243\n"
+     ]
+    }
+   ],
+   "source": [
+    "param_grid = {\n",
+    "    \"max_depth\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n",
+    "    \"learning_rate\": [0.001, 0.01, 0.1, 0.2, 0.3],\n",
+    "    \"subsample\": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],\n",
+    "    \"colsample_bytree\": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],\n",
+    "    \"colsample_bylevel\": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],\n",
+    "    \"min_child_weight\": [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],\n",
+    "    \"gamma\": [0, 0.25, 0.5, 1.0],\n",
+    "    \"n_estimators\": [10, 20, 40, 60, 80, 100, 150, 200]\n",
+    "}\n",
+    "\n",
+    "xgb_clf = RandomizedSearchCV(\n",
+    "    XGBClassifier(),\n",
+    "    param_distributions=param_grid,\n",
+    "    n_iter=50,\n",
+    "    n_jobs=-1,\n",
+    "    cv=5,\n",
+    "    random_state=23,\n",
+    "    scoring=\"roc_auc\",\n",
+    ")\n",
+    "\n",
+    "xgb_pipeline = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, xgb_clf])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Although the class imbalance is not very serious in this dataset, I want to see if using SMOTE to synthesize new examples for the minority class can improve the predictive performance. However, SMOTE worsens the performance here: the validation AUC drops from about 0.730 to about 0.696."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Best params: {'subsample': 1.0, 'n_estimators': 200, 'min_child_weight': 0.5, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 0.25, 'colsample_bytree': 0.5, 'colsample_bylevel': 0.6}\n",
+      "AUC score: 0.6962796916323821\n"
+     ]
+    }
+   ],
+   "source": [
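+    "from sklearn.base import clone\n",
+    "sampler = SMOTE(random_state=42)\n",
+    "# clone the search: make_pipeline does not copy estimators, so reusing `xgb_clf` would refit the object held by `xgb_pipeline`\n",
+    "xgb_pipeline_smote = modeling(X_train, y_train, X_val, y_val, [column_transformer, scaler, sampler, clone(xgb_clf)])"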
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Predict on the test set with the XGBoost model (without SMOTE), since it has the best validation AUC, and prepare the submission file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "best_model = xgb_pipeline._final_estimator.best_estimator_\n",
+    "steps = [column_transformer, scaler, best_model]\n",
+    "pipeline = make_pipeline(*steps)\n",
+    "y_test_pred = pipeline.predict_proba(X_test)[:, 1]\n",
+    "\n",
+    "df = pd.DataFrame({\n",
+    "    \"claim_number\": df_test[\"claim_number\"],\n",
+    "    \"fraud\": y_test_pred\n",
+    "})\n",
+    "#df.to_csv(\"../data/submission/submission.csv\", index=False)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To examine which features are important, I introduce a feature filled with random numbers. A feature can be considered important if its importance is larger than that of the random feature."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>feature_name</th>\n",
+       "      <th>importance</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>accident_site_Parking Lot</td>\n",
+       "      <td>0.111572</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>high_education_ind</td>\n",
+       "      <td>0.082720</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>witness_present_ind</td>\n",
+       "      <td>0.072724</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>past_num_of_claims</td>\n",
+       "      <td>0.052461</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>marital_status</td>\n",
+       "      <td>0.052270</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>address_change_ind</td>\n",
+       "      <td>0.044381</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>age_of_driver</td>\n",
+       "      <td>0.039922</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>longitude</td>\n",
+       "      <td>0.034581</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>safty_rating</td>\n",
+       "      <td>0.033645</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>claim_est_payout</td>\n",
+       "      <td>0.032631</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>10</th>\n",
+       "      <td>random_feature</td>\n",
+       "      <td>0.032600</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>11</th>\n",
+       "      <td>liab_prct</td>\n",
+       "      <td>0.032246</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12</th>\n",
+       "      <td>vehicle_price</td>\n",
+       "      <td>0.032152</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>13</th>\n",
+       "      <td>annual_income</td>\n",
+       "      <td>0.031335</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>14</th>\n",
+       "      <td>vehicle_weight</td>\n",
+       "      <td>0.030896</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>15</th>\n",
+       "      <td>latitude</td>\n",
+       "      <td>0.030324</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>16</th>\n",
+       "      <td>channel_Online</td>\n",
+       "      <td>0.030144</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>17</th>\n",
+       "      <td>accident_site_Local</td>\n",
+       "      <td>0.029325</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>18</th>\n",
+       "      <td>gender_M</td>\n",
+       "      <td>0.028732</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>19</th>\n",
+       "      <td>vehicle_category_Large</td>\n",
+       "      <td>0.028661</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>20</th>\n",
+       "      <td>channel_Phone</td>\n",
+       "      <td>0.027671</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>21</th>\n",
+       "      <td>vehicle_category_Medium</td>\n",
+       "      <td>0.027547</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>22</th>\n",
+       "      <td>living_status_Rent</td>\n",
+       "      <td>0.027294</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>23</th>\n",
+       "      <td>age_of_vehicle</td>\n",
+       "      <td>0.027125</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>24</th>\n",
+       "      <td>policy_report_filed_ind</td>\n",
+       "      <td>0.027040</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                 feature_name  importance\n",
+       "0   accident_site_Parking Lot    0.111572\n",
+       "1          high_education_ind    0.082720\n",
+       "2         witness_present_ind    0.072724\n",
+       "3          past_num_of_claims    0.052461\n",
+       "4              marital_status    0.052270\n",
+       "5          address_change_ind    0.044381\n",
+       "6               age_of_driver    0.039922\n",
+       "7                   longitude    0.034581\n",
+       "8                safty_rating    0.033645\n",
+       "9            claim_est_payout    0.032631\n",
+       "10             random_feature    0.032600\n",
+       "11                  liab_prct    0.032246\n",
+       "12              vehicle_price    0.032152\n",
+       "13              annual_income    0.031335\n",
+       "14             vehicle_weight    0.030896\n",
+       "15                   latitude    0.030324\n",
+       "16             channel_Online    0.030144\n",
+       "17        accident_site_Local    0.029325\n",
+       "18                   gender_M    0.028732\n",
+       "19     vehicle_category_Large    0.028661\n",
+       "20              channel_Phone    0.027671\n",
+       "21    vehicle_category_Medium    0.027547\n",
+       "22         living_status_Rent    0.027294\n",
+       "23             age_of_vehicle    0.027125\n",
+       "24    policy_report_filed_ind    0.027040"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
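+    "from sklearn.base import clone\n",
+    "\n",
+    "# Add a pure-noise column to a copy so X_train itself is left unchanged.\n",
+    "X_train_random = X_train.assign(random_feature=np.random.uniform(size=len(X_train)))\n",
+    "\n",
+    "# Reuse the tuned hyperparameters; clone the preprocessing steps so that any\n",
+    "# transformers already fitted elsewhere (e.g. inside xgb_pipeline) are not refit.\n",
+    "xgb_clf_random_feature = XGBClassifier(**xgb_pipeline[-1].best_params_)\n",
+    "steps = [clone(column_transformer), clone(scaler), xgb_clf_random_feature]\n",
+    "xgb_pipeline_random_feature = make_pipeline(*steps).fit(X_train_random, y_train)\n",
+    "\n",
+    "# Features that rank below random_feature are likely uninformative.\n",
+    "pd.DataFrame({\n",
+    "    \"feature_name\": list(feature_names) + [\"random_feature\"],\n",
+    "    \"importance\": xgb_pipeline_random_feature[-1].feature_importances_\n",
+    "}).sort_values(by=\"importance\", ascending=False).reset_index(drop=True)"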
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X_train"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y_train"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "\n",
+    "# Serialize the fitted pipeline (preprocessing + tuned model) to disk.\n",
+    "with open('./best_model_3.pickle', 'wb') as handle:\n",
+    "    pickle.dump(xgb_pipeline, handle)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle"
+   ]
+  }
+ ],
+ "metadata": {
+  "interpreter": {
+   "hash": "03e93f2959c516196957ae17ec0aa5d1e9fc5dd82cbe13968d4cfc2a60558992"
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}