{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Introduction\n", "\n", "This notebook was written to train Porto Alegre Traffic Accidents Data after the first cleaning, processing, and transforming step. This was made in a notebook in the `data` folder. In truth, we will have 3 models.\n", "\n", "1. Predict the probability of injured people.\n", "\n", "2. Predict the probability of seriously injured people.\n", "\n", "3. Predict the probability of dead people in the event or after it.\n", "\n", "The path to training the models will be the same, just make some filtering on data and analyze the results properly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Data Loading" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
latitude-30.009614-30.0403-30.069
longitude-51.185581-51.1958-51.1437
feridosTrueTrueTrue
feridos_grFalseFalseFalse
fataisFalseFalseFalse
caminhaoFalseFalseFalse
motoTrueTrueFalse
carsTrueTrueTrue
transportFalseFalseFalse
othersFalseFalseFalse
holidayFalseTrueTrue
day_1000
day_2000
day_3000
day_4000
day_5100
day_6011
hour_1000
hour_2000
hour_3000
hour_4000
hour_5000
hour_6000
hour_7000
hour_8000
hour_9000
hour_10010
hour_11000
hour_12000
hour_13000
hour_14000
hour_15000
hour_16000
hour_17000
hour_18000
hour_19101
hour_20000
hour_21000
hour_22000
hour_23000
type_ATROPELAMENTO001
type_CHOQUE000
type_COLISÃO000
type_OUTROS000
\n", "
" ], "text/plain": [ " 0 1 2\n", "latitude -30.009614 -30.0403 -30.069\n", "longitude -51.185581 -51.1958 -51.1437\n", "feridos True True True\n", "feridos_gr False False False\n", "fatais False False False\n", "caminhao False False False\n", "moto True True False\n", "cars True True True\n", "transport False False False\n", "others False False False\n", "holiday False True True\n", "day_1 0 0 0\n", "day_2 0 0 0\n", "day_3 0 0 0\n", "day_4 0 0 0\n", "day_5 1 0 0\n", "day_6 0 1 1\n", "hour_1 0 0 0\n", "hour_2 0 0 0\n", "hour_3 0 0 0\n", "hour_4 0 0 0\n", "hour_5 0 0 0\n", "hour_6 0 0 0\n", "hour_7 0 0 0\n", "hour_8 0 0 0\n", "hour_9 0 0 0\n", "hour_10 0 1 0\n", "hour_11 0 0 0\n", "hour_12 0 0 0\n", "hour_13 0 0 0\n", "hour_14 0 0 0\n", "hour_15 0 0 0\n", "hour_16 0 0 0\n", "hour_17 0 0 0\n", "hour_18 0 0 0\n", "hour_19 1 0 1\n", "hour_20 0 0 0\n", "hour_21 0 0 0\n", "hour_22 0 0 0\n", "hour_23 0 0 0\n", "type_ATROPELAMENTO 0 0 1\n", "type_CHOQUE 0 0 0\n", "type_COLISÃO 0 0 0\n", "type_OUTROS 0 0 0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os.path as path\n", "from pandas import read_csv\n", "\n", "file_csv = path.abspath(\"../\")\n", "\n", "file_csv = path.join(file_csv, \"data\" ,\"accidents_trans.csv\")\n", "\n", "accidents_trans = read_csv(file_csv)\n", "\n", "accidents_trans.head(3).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Data Preparation" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import joblib as jb # Use to save the model to deploy\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Our model to predict the probability of feridos_gr will be create with 25497 rows and 41 features.\n" ] } ], "source": [ "outputs = [\"feridos\", \"feridos_gr\", \"fatais\"]\n", "inputs = [col for col in accidents_trans.columns if col not in outputs]\n", "\n", "X = accidents_trans[inputs].copy()\n", "Y = accidents_trans[outputs].copy()\n", "\n", "# Filtering data considering the output\n", "output = \"feridos_gr\"\n", "\n", "if output == \"feridos_gr\":\n", " X = X[Y[\"feridos\"]]\n", " Y = Y.loc[Y[\"feridos\"], \"feridos_gr\"]\n", "elif output == \"fatais\":\n", " X = X[Y[\"feridos_gr\"]]\n", " Y = Y.loc[Y[\"feridos_gr\"], \"fatais\"]\n", "else:\n", " Y = Y[\"feridos\"]\n", "\n", "print(f\"Our model to predict the probability of \" \\\n", " f\"{output} will be create with {X.shape[0]} \" \\\n", " f\"rows and {X.shape[1]} features.\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import csv\n", "\n", "with open(\"model_features.csv\", 'w') as f:\n", " writer = csv.writer(f)\n", " writer.writerow(X.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Considering that we will use models scaling sensitive, we will need to scale our data first. Beside this, we will need to save our scaler for future use." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['c:\\\\Users\\\\grego\\\\OneDrive\\\\Documentos\\\\Documentos Pessoais\\\\00_DataCamp\\\\09_VSC\\\\poa_car_accidents\\\\poa_car_accidents\\\\model\\\\scaler_feridos_gr.pkl']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Setting the random state using my luck number :-)\n", "lucky_num = 7\n", "\n", "# X_train and y_train to train our model\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X,\n", " Y,\n", " test_size=0.30,\n", " random_state=lucky_num,\n", " shuffle=True, # Used because our data is sort by date\n", " stratify=Y) # Used because our data is unbalanced\n", "\n", "# Scaling\n", "scaler = StandardScaler()\n", "X_train = scaler.fit_transform(X_train)\n", "X_test = scaler.transform(X_test)\n", "\n", "# Saving scaler\n", "file_name = \"scaler_\" + output + '.pkl'\n", "jb.dump(scaler, path.join(path.abspath(\"./\"), file_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Data Modeling\n", "\n", "We will create and use cross-validation to evaluate the following models:\n", "\n", "- Logistic Regression;\n", "\n", "- Gaussian Naive Bayes;\n", "\n", "- K Neighbors;\n", "\n", "- Random Forest;\n", "\n", "- Gradient Boosting; and,\n", "\n", "- XGBoost.\n", "\n", "We will use two scores to select and evaluate our models:\n", "\n", "- F1 score: composition between the precision (how much our model correct classify every true label) and recall (how moch our model correct indicate true labels); and,\n", "\n", "- Brier score: average between the correct and the predict probability.\n", "\n", "However, we will see other metrics to support our decision:\n", "\n", "- Accurancy;\n", "\n", "- ROC_AOC; and,\n", "\n", "- Log loss (an other way to quantify the quality of probability predictions).\n", "\n", "And, before you go, we will find for each model if there is a hyperparameter to deal with the unbalanced output." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import xgboost as xgb\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import cross_validate \n", "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier\n", "from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, f1_score, brier_score_loss, log_loss\n", "\n", "scores = [\"accuracy\", \"f1\", \"precision\", \"recall\", \"roc_auc\", \"neg_brier_score\",\"neg_log_loss\"]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def eval_model(cls) -> tuple:\n", " \"\"\"This function will calculate the metrics\n", " to evaluate a classification model.\n", " \"\"\"\n", " # Predicting labels and probabilities\n", " y_pred = cls.predict(X_test)\n", " y_prob = cls.predict_proba(X_test)[:,1]\n", "\n", " # Calculating scores\n", " accurancy = accuracy_score(y_test, y_pred)\n", " f1 = f1_score(y_test, y_pred)\n", " recall = recall_score(y_test, y_pred)\n", " precision = precision_score(y_test, y_pred)\n", " roc_auc = roc_auc_score(y_test, y_prob) # https://datascience.stackexchange.com/questions/114394/does-roc-auc-different-between-crossval-and-test-set-indicate-overfitting-or-oth\n", " brier_score = brier_score_loss(y_test, y_prob)\n", " log_loss_value = log_loss(y_test, y_prob)\n", "\n", " return accurancy, f1, precision, recall, roc_auc, brier_score, log_loss_value\n", "\n", "def create_model(name: str, cls) -> list:\n", " \"\"\"This function will create some models\n", " and return scores to evaluate it.\"\"\"\n", " # Ftting model\n", " cls.fit(X_train, y_train)\n", "\n", " # Using cross-validation to evaluate the model fitted\n", " cls_cross = cross_validate(\n", " estimator=cls,\n", " X=X_train,\n", " y=y_train,\n", " cv=5,\n", " scoring=scores)\n", "\n", " df_cv = pd.DataFrame.from_dict(cls_cross, orient='index', columns=[\"CV\"+str(i) for i in range(1,6)])\n", "\n", " # Calculating score to test set\n", " accurancy, f1, precision, recall, roc_auc, brier_score, log_loss_value = eval_model(cls)\n", "\n", " # Filling a dataframe to better presentation\n", " df_cv.at[\"test_accuracy\", \"TestSet\"] = accurancy\n", " df_cv.at[\"test_f1\", \"TestSet\"] = f1\n", " df_cv.at[\"test_recall\", \"TestSet\"] = recall\n", " df_cv.at[\"test_precision\", \"TestSet\"] = precision\n", " df_cv.at[\"test_roc_auc\", \"TestSet\"] = roc_auc\n", " df_cv.at[\"test_neg_brier_score\", \"TestSet\"] = -brier_score\n", " df_cv.at[\"test_neg_log_loss\", \"TestSet\"] = -log_loss_value\n", "\n", " caption = f\"{name} Validation Scores\"\n", "\n", " display(df_cv.style.set_caption(caption))\n", "\n", " return [accurancy, f1, precision, recall, roc_auc, brier_score, log_loss_value]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LR Validation Scores
 CV1CV2CV3CV4CV5TestSet
fit_time0.0372100.0317870.0310540.0304180.031932nan
score_time0.0000000.0085810.0088670.0090670.008065nan
test_accuracy0.6767510.6504200.6517230.6623700.6416360.660523
test_f10.4376220.4222220.4134030.4242710.4014970.414431
test_precision0.3524330.3299570.3263790.3373860.3156730.331889
test_recall0.5771210.5861180.5637070.5714290.5514140.551621
test_roc_auc0.6884610.6678540.6631180.6710070.6440100.667540
test_neg_brier_score-0.221720-0.228903-0.228032-0.226401-0.231747-0.225709
test_neg_log_loss-0.635799-0.651183-0.649342-0.646086-0.657217-0.644816
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
"text/plain": [ "NB Validation Scores\n", "                           CV1       CV2       CV3       CV4       CV5   TestSet\n", "fit_time              0.014398  0.009557  0.006821  0.007424  0.012280       nan\n", "score_time            0.011809  0.011134  0.017092  0.011901  0.011843       nan\n", "test_accuracy         0.713725  0.696639  0.695433  0.695153  0.678902  0.699608\n", "test_f1               0.398115  0.385706  0.386222  0.373993  0.358343  0.381260\n", "test_precision        0.367391  0.345178  0.344064  0.338189  0.317460  0.345703\n", "test_recall           0.434447  0.437018  0.440154  0.418275  0.411311  0.424970\n", "test_roc_auc          0.658495  0.637091  0.633805  0.633238  0.609990  0.634031\n", "test_neg_brier_score -0.251232 -0.260890 -0.271004 -0.273609 -0.285054 -0.268468\n", "test_neg_log_loss    -1.412893 -1.627295 -1.745289 -1.752608 -1.950351 -1.659029" ] }, "metadata": {}, "output_type": "display_data" }, { "data": {
"text/plain": [ "KNN Validation Scores\n", "                           CV1       CV2       CV3       CV4       CV5   TestSet\n", "fit_time              0.000000  0.002192  0.004006  0.004514  0.006648       nan\n", "score_time            0.241018  0.238426  0.234827  0.250730  0.503909       nan\n", "test_accuracy         0.749860  0.757703  0.750911  0.753713  0.746708  0.754379\n", "test_f1               0.203390  0.214351  0.201258  0.208821  0.208406  0.214136\n", "test_precision        0.332362  0.365325  0.333333  0.347305  0.326923  0.353103\n", "test_recall           0.146530  0.151671  0.144144  0.149292  0.152956  0.153661\n", "test_roc_auc          0.583018  0.588834  0.575426  0.575821  0.573316  0.582555\n", "test_neg_brier_score -0.190980 -0.186599 -0.191325 -0.190608 -0.193769 -0.189077\n", "test_neg_log_loss    -2.500209 -2.097705 -2.376728 -2.528142 -2.570247 -2.428060" ] }, "metadata": {}, "output_type": "display_data" }, { "data": {
"text/plain": [ "RF Validation Scores\n", "                           CV1       CV2       CV3       CV4       CV5   TestSet\n", "fit_time              1.522204  1.489951  1.485231  1.517533  1.491071       nan\n", "score_time            0.152138  0.145954  0.156016  0.150085  0.141980       nan\n", "test_accuracy         0.738655  0.746779  0.740544  0.734099  0.737461  0.733595\n", "test_f1               0.240846  0.262643  0.227045  0.203191  0.244964  0.242379\n", "test_precision        0.328160  0.359375  0.323040  0.292271  0.328294  0.318359\n", "test_recall           0.190231  0.206941  0.175032  0.155727  0.195373  0.195678\n", "test_roc_auc          0.616397  0.614756  0.597688  0.602290  0.596575  0.601436\n", "test_neg_brier_score -0.184849 -0.182776 -0.186704 -0.187369 -0.189693 -0.188506\n", "test_neg_log_loss    -0.711816 -0.673727 -0.766073 -0.719537 -0.775028 -0.743245" ] }, "metadata": {}, "output_type": "display_data" }, { "data": {
"text/plain": [ "GBC Validation Scores\n", "                           CV1       CV2       CV3       CV4       CV5   TestSet\n", "fit_time              1.292703  1.329037  1.315473  1.299452  1.313950       nan\n", "score_time            0.014560  0.026127  0.018812  0.015999  0.020753       nan\n", "test_accuracy         0.783193  0.777871  0.782292  0.779770  0.780050  0.781830\n", "test_f1               0.112385  0.072515  0.091228  0.075294  0.081871  0.085479\n", "test_precision        0.521277  0.402597  0.500000  0.438356  0.454545  0.490566\n", "test_recall           0.062982  0.039846  0.050193  0.041184  0.044987  0.046819\n", "test_roc_auc          0.684278  0.665449  0.670804  0.671522  0.654303  0.669660\n", "test_neg_brier_score -0.157619 -0.160471 -0.159413 -0.159018 -0.161594 -0.159691\n", "test_neg_log_loss    -0.488690 -0.495751 -0.493124 -0.492051 -0.498777 -0.493648" ] }, "metadata": {}, "output_type": "display_data" }, { "data": {
"text/plain": [ "XGB Validation Scores\n", "                           CV1       CV2       CV3       CV4       CV5   TestSet\n", "fit_time              0.643858  0.640153  0.677121  0.634137  0.669338       nan\n", "score_time            0.023605  0.016412  0.020805  0.015040  0.028892       nan\n", "test_accuracy         0.620168  0.612605  0.604371  0.628748  0.620062  0.619216\n", "test_f1               0.397869  0.386696  0.377974  0.386290  0.383076  0.387639\n", "test_precision        0.303935  0.295193  0.287341  0.301737  0.296479  0.298285\n", "test_recall           0.575835  0.560411  0.552124  0.536680  0.541131  0.553421\n", "test_roc_auc          0.640317  0.626210  0.619924  0.627078  0.620747  0.630028\n", "test_neg_brier_score -0.231951 -0.238560 -0.237950 -0.233297 -0.239657 -0.236789\n", "test_neg_log_loss    -0.660186 -0.678450 -0.673157 -0.666774 -0.681109 -0.671979" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# XGB hyperparameter that deals with the class imbalance\n", "# (Y.mean()**-1 = total/positives, close to the usual negatives/positives ratio)\n", "scale_pos_weight = Y.mean()**-1\n", "\n", "# Creating the model objects\n", "cls_lr = LogisticRegression(\n", "    class_weight=\"balanced\",  # Hyperparameter to deal with the imbalanced output\n", "    random_state=lucky_num)\n", "# cls_svm = SVC(random_state=lucky_num)  # Removed due to its resource consumption and worse results\n", "cls_nb = GaussianNB()\n", "cls_knn = KNeighborsClassifier()\n", "cls_rf = RandomForestClassifier(\n", "    random_state=lucky_num,\n", "    class_weight=\"balanced_subsample\")  # Hyperparameter to deal with the imbalanced output\n", "cls_gbc = GradientBoostingClassifier(random_state=lucky_num)\n", "cls_xgb = xgb.XGBClassifier(\n", "    objective=\"binary:logistic\",\n", "    random_state=lucky_num,\n", "    scale_pos_weight=scale_pos_weight)\n", "\n", "# Lists to iterate on our modeling function\n", "cls_name = [\"LR\", \"NB\", \"KNN\", \"RF\", \"GBC\", \"XGB\"]\n", "cls_list = [cls_lr, cls_nb, cls_knn, cls_rf, cls_gbc, cls_xgb]\n", "\n", "mdl_summaries = []\n", "for name, inst in zip(cls_name, cls_list):\n", "    mdl_list = create_model(name, inst)\n", "    mdl_list = [name] + mdl_list\n", "    mdl_summaries.append(mdl_list)\n", "\n", "df_mdl = pd.DataFrame(\n", "    mdl_summaries,\n", "    columns=[\n", "        \"model\",\n", "        \"test_accuracy\",\n", "        \"test_f1\",\n", "        \"test_precision\",\n", "        \"test_recall\",\n", "        \"test_roc_auc\",\n", "        \"test_brier\",\n", "        \"test_log_loss\"])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": {
"text/plain": [ "Test set validation scores\n", "  model  test_accuracy   test_f1  test_precision  test_recall  test_roc_auc  test_brier  test_log_loss\n", "0    LR       0.660523  0.414431        0.331889     0.551621      0.667540    0.225709       0.644816\n", "1   XGB       0.619216  0.387639        0.298285     0.553421      0.630028    0.236789       0.671979\n", "2    NB       0.699608  0.381260        0.345703     0.424970      0.634031    0.268468       1.659029\n", "3    RF       0.733595  0.242379        0.318359     0.195678      0.601436    0.188506       0.743245\n", "4   KNN       0.754379  0.214136        0.353103     0.153661      0.582555    0.189077       2.428060\n", "5   GBC       0.781830  0.085479        0.490566     0.046819      0.669660    0.159691       0.493648" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_mdl.sort_values(\n", "    \"test_f1\",\n", "    ascending=False,\n", "    inplace=True,\n", "    ignore_index=True)\n", "\n", "display(df_mdl.style.set_caption(\"Test set validation scores\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "None of the models presents good results! We will try to fit a composite model with the 3 best (see the note below on how they are combined)." ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Test set validation scores for Composite Model\n", "                           CV1       CV2       CV3       CV4       CV5   TestSet\n", "fit_time              0.622165  0.732543  0.591849  0.699149  0.617794       nan\n", "score_time            0.023785  0.027777  0.030991  0.027807  0.023930       nan\n", "test_accuracy         0.714846  0.701120  0.695713  0.693752  0.677221  0.699346\n", "test_f1               0.412240  0.404243  0.389201  0.385610  0.369803  0.389597\n", "test_precision        0.374214  0.357354  0.345654  0.342315  0.321905  0.349191\n", "test_recall           0.458869  0.465296  0.445302  0.441441  0.434447  0.440576\n", "test_roc_auc          0.679117  0.662225  0.651689  0.658722  0.640881  0.658847\n", "test_neg_brier_score -0.199904 -0.208466 -0.211428 -0.209929 -0.218236 -0.208991\n", "test_neg_log_loss    -0.590700 -0.611156 -0.616876 -0.613285 -0.633314 -0.611563" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Selecting the models\n", "cls_name = [\"LR\", \"NB\", \"XGB\"]\n", "cls_list = [cls_lr, cls_nb, cls_xgb]\n", "\n", "# Training the voting classifier\n", "cls_vot = VotingClassifier([*zip(cls_name, cls_list)], voting=\"soft\")\n", "cls_vot.fit(X_train, y_train)\n", "\n", "# Using cross-validation to evaluate the fitted model\n", "cls_cross = cross_validate(\n", "    estimator=cls_vot,\n", "    X=X_train,\n", "    y=y_train,\n", "    cv=5,\n", "    scoring=scores)\n", "\n", "df_vot = pd.DataFrame.from_dict(cls_cross, orient='index', columns=[\"CV\" + str(i) for i in range(1, 6)])\n", "\n", "# Calculating scores on the test set\n", "accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value = eval_model(cls_vot)\n", "\n", "# Filling a dataframe for better presentation\n", "df_vot.at[\"test_accuracy\", \"TestSet\"] = accuracy\n", "df_vot.at[\"test_f1\", \"TestSet\"] = f1\n", "df_vot.at[\"test_recall\", \"TestSet\"] = recall\n", "df_vot.at[\"test_precision\", \"TestSet\"] = precision\n", "df_vot.at[\"test_roc_auc\", \"TestSet\"] = roc_auc\n", "df_vot.at[\"test_neg_brier_score\", \"TestSet\"] = -brier_score\n", "df_vot.at[\"test_neg_log_loss\", \"TestSet\"] = -log_loss_value\n", "\n", "display(df_vot.style.set_caption(\"Test set validation scores for Composite Model\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The composite model is not better than the individual models. Maybe some tuning could handle this, but that is left for future work." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['c:\\\\Users\\\\grego\\\\OneDrive\\\\Documentos\\\\Documentos Pessoais\\\\00_DataCamp\\\\09_VSC\\\\poa_car_accidents\\\\poa_car_accidents\\\\model\\\\model_feridos.pkl']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Saving\n", "# file_name = \"model_\" + output + '.pkl'\n", "# jb.dump(cls_vot, path.join(path.abspath(\"./\"), file_name))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "1372d04dbd71fdc5436c5d6e671c1b9287e750e86143c81b5a7ba0acaf653c5e" } } }, "nbformat": 4, "nbformat_minor": 2 }