{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Introduction\n",
"\n",
"This notebook was written to train Porto Alegre Traffic Accidents Data after the first cleaning, processing, and transforming step. This was made in a notebook in the `data` folder. In truth, we will have 3 models.\n",
"\n",
"1. Predict the probability of injured people.\n",
"\n",
"2. Predict the probability of seriously injured people.\n",
"\n",
"3. Predict the probability of dead people in the event or after it.\n",
"\n",
"The path to training the models will be the same, just make some filtering on data and analyze the results properly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Data Loading"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" latitude | \n",
" -30.009614 | \n",
" -30.0403 | \n",
" -30.069 | \n",
"
\n",
" \n",
" longitude | \n",
" -51.185581 | \n",
" -51.1958 | \n",
" -51.1437 | \n",
"
\n",
" \n",
" feridos | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" feridos_gr | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" fatais | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" caminhao | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" moto | \n",
" True | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" cars | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" transport | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" others | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" holiday | \n",
" False | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" day_1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" day_2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" day_3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" day_4 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" day_5 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" day_6 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" hour_1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_4 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_5 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_6 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_7 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_8 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_9 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_10 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_11 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_12 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_13 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_14 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_15 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_16 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_17 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_18 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_19 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" hour_20 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_21 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_22 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" hour_23 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" type_ATROPELAMENTO | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" type_CHOQUE | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" type_COLISÃO | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" type_OUTROS | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"latitude -30.009614 -30.0403 -30.069\n",
"longitude -51.185581 -51.1958 -51.1437\n",
"feridos True True True\n",
"feridos_gr False False False\n",
"fatais False False False\n",
"caminhao False False False\n",
"moto True True False\n",
"cars True True True\n",
"transport False False False\n",
"others False False False\n",
"holiday False True True\n",
"day_1 0 0 0\n",
"day_2 0 0 0\n",
"day_3 0 0 0\n",
"day_4 0 0 0\n",
"day_5 1 0 0\n",
"day_6 0 1 1\n",
"hour_1 0 0 0\n",
"hour_2 0 0 0\n",
"hour_3 0 0 0\n",
"hour_4 0 0 0\n",
"hour_5 0 0 0\n",
"hour_6 0 0 0\n",
"hour_7 0 0 0\n",
"hour_8 0 0 0\n",
"hour_9 0 0 0\n",
"hour_10 0 1 0\n",
"hour_11 0 0 0\n",
"hour_12 0 0 0\n",
"hour_13 0 0 0\n",
"hour_14 0 0 0\n",
"hour_15 0 0 0\n",
"hour_16 0 0 0\n",
"hour_17 0 0 0\n",
"hour_18 0 0 0\n",
"hour_19 1 0 1\n",
"hour_20 0 0 0\n",
"hour_21 0 0 0\n",
"hour_22 0 0 0\n",
"hour_23 0 0 0\n",
"type_ATROPELAMENTO 0 0 1\n",
"type_CHOQUE 0 0 0\n",
"type_COLISÃO 0 0 0\n",
"type_OUTROS 0 0 0"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import os.path as path\n",
"from pandas import read_csv\n",
"\n",
"file_csv = path.abspath(\"../\")\n",
"\n",
"file_csv = path.join(file_csv, \"data\" ,\"accidents_trans.csv\")\n",
"\n",
"accidents_trans = read_csv(file_csv)\n",
"\n",
"accidents_trans.head(3).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Data Preparation"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import joblib as jb # Use to save the model to deploy\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Our model to predict the probability of feridos will be create with 68218 rows and 41 features.\n"
]
}
],
"source": [
"outputs = [\"feridos\", \"feridos_gr\", \"fatais\"]\n",
"inputs = [col for col in accidents_trans.columns if col not in outputs]\n",
"\n",
"X = accidents_trans[inputs].copy()\n",
"Y = accidents_trans[outputs].copy()\n",
"\n",
"# Filtering data considering the output\n",
"output = \"feridos\"\n",
"\n",
"if output == \"feridos_gr\":\n",
" X = X[Y[\"feridos\"]]\n",
" Y = Y.loc[Y[\"feridos\"], \"feridos_gr\"]\n",
"elif output == \"fatais\":\n",
" X = X[Y[\"feridos_gr\"]]\n",
" Y = Y.loc[Y[\"feridos_gr\"], \"fatais\"]\n",
"else:\n",
" Y = Y[\"feridos\"]\n",
"\n",
"print(f\"Our model to predict the probability of \" \\\n",
" f\"{output} will be create with {X.shape[0]} \" \\\n",
" f\"rows and {X.shape[1]} features.\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"\n",
"with open(\"model_features.csv\", 'w') as f:\n",
" writer = csv.writer(f)\n",
" writer.writerow(X.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Considering that we will use models scaling sensitive, we will need to scale our data first. Beside this, we will need to save our scaler for future use."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['c:\\\\Users\\\\grego\\\\OneDrive\\\\Documentos\\\\Documentos Pessoais\\\\00_DataCamp\\\\09_VSC\\\\poa_car_accidents\\\\poa_car_accidents\\\\model\\\\scaler_feridos.pkl']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Setting the random state using my luck number :-)\n",
"lucky_num = 7\n",
"\n",
"# X_train and y_train to train our model\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X,\n",
" Y,\n",
" test_size=0.30,\n",
" random_state=lucky_num,\n",
" shuffle=True, # Used because our data is sort by date\n",
" stratify=Y) # Used because our data is unbalanced\n",
"\n",
"# Scaling\n",
"scaler = StandardScaler()\n",
"X_train = scaler.fit_transform(X_train)\n",
"X_test = scaler.transform(X_test)\n",
"\n",
"# Saving scaler\n",
"file_name = \"scaler_\" + output + '.pkl'\n",
"jb.dump(scaler, path.join(path.abspath(\"./\"), file_name))"
]
},
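{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, below is a minimal sketch of how the saved scaler could be reloaded later (for example, by the app). It assumes the `.pkl` file written above exists in this folder and that new observations arrive with the same features, in the same order, as `X`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: reload the scaler saved above and apply it to new data\n",
"# (assumes the .pkl file from the previous cell exists in this folder)\n",
"reloaded_scaler = jb.load(path.join(path.abspath(\"./\"), \"scaler_\" + output + \".pkl\"))\n",
"\n",
"# New observations must have the same features, in the same order, as X;\n",
"# here we reuse a couple of raw rows as a stand-in\n",
"sample_scaled = reloaded_scaler.transform(X.head(2))\n",
"print(sample_scaled.shape)"
]
},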
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Data Modeling\n",
"\n",
"We will create and use cross-validation to evaluate the following models:\n",
"\n",
"- Logistic Regression;\n",
"\n",
"- Gaussian Naive Bayes;\n",
"\n",
"- K Neighbors;\n",
"\n",
"- Random Forest;\n",
"\n",
"- Gradient Boosting; and,\n",
"\n",
"- XGBoost.\n",
"\n",
"We will use two scores to select and evaluate our models:\n",
"\n",
"- F1 score: composition between the precision (how much our model correct classify every true label) and recall (how moch our model correct indicate true labels); and,\n",
"\n",
"- Brier score: average between the correct and the predict probability.\n",
"\n",
"However, we will see other metrics to support our decision:\n",
"\n",
"- Accurancy;\n",
"\n",
"- ROC_AOC; and,\n",
"\n",
"- Log loss (an other way to quantify the quality of probability predictions).\n",
"\n",
"And, before you go, we will find for each model if there is a hyperparameter to deal with the unbalanced output."
]
},
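{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two main scores concrete, here is a tiny worked example on hand-made labels and probabilities (illustrative values only, not taken from our data)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import f1_score, brier_score_loss\n",
"\n",
"# Toy example: 4 true labels, with predicted labels and probabilities\n",
"y_true = [1, 1, 0, 0]\n",
"y_lab = [1, 0, 0, 1]  # predicted labels\n",
"y_prb = [0.9, 0.4, 0.2, 0.6]  # predicted probabilities of the positive class\n",
"\n",
"# F1 = 2 * (precision * recall) / (precision + recall)\n",
"# Here precision = recall = 0.5, so F1 = 0.5\n",
"print(f1_score(y_true, y_lab))\n",
"\n",
"# Brier = mean((probability - outcome)^2)\n",
"# Here ((0.9-1)^2 + (0.4-1)^2 + (0.2-0)^2 + (0.6-0)^2) / 4 = 0.1925\n",
"print(brier_score_loss(y_true, y_prb))"
]
},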
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import xgboost as xgb\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_validate \n",
"from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier\n",
"from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, f1_score, brier_score_loss, log_loss\n",
"\n",
"scores = [\"accuracy\", \"f1\", \"precision\", \"recall\", \"roc_auc\", \"neg_brier_score\",\"neg_log_loss\"]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def eval_model(cls) -> tuple:\n",
" \"\"\"This function will calculate the metrics\n",
" to evaluate a classification model.\n",
" \"\"\"\n",
" # Predicting labels and probabilities\n",
" y_pred = cls.predict(X_test)\n",
" y_prob = cls.predict_proba(X_test)[:,1]\n",
"\n",
" # Calculating scores\n",
" accurancy = accuracy_score(y_test, y_pred)\n",
" f1 = f1_score(y_test, y_pred)\n",
" recall = recall_score(y_test, y_pred)\n",
" precision = precision_score(y_test, y_pred)\n",
" roc_auc = roc_auc_score(y_test, y_prob) # https://datascience.stackexchange.com/questions/114394/does-roc-auc-different-between-crossval-and-test-set-indicate-overfitting-or-oth\n",
" brier_score = brier_score_loss(y_test, y_prob)\n",
" log_loss_value = log_loss(y_test, y_prob)\n",
"\n",
" return accurancy, f1, precision, recall, roc_auc, brier_score, log_loss_value\n",
"\n",
"def create_model(name: str, cls) -> list:\n",
" \"\"\"This function will create some models\n",
" and return scores to evaluate it.\"\"\"\n",
" # Ftting model\n",
" cls.fit(X_train, y_train)\n",
"\n",
" # Using cross-validation to evaluate the model fitted\n",
" cls_cross = cross_validate(\n",
" estimator=cls,\n",
" X=X_train,\n",
" y=y_train,\n",
" cv=5,\n",
" scoring=scores)\n",
"\n",
" df_cv = pd.DataFrame.from_dict(cls_cross, orient='index', columns=[\"CV\"+str(i) for i in range(1,6)])\n",
"\n",
" # Calculating score to test set\n",
" accurancy, f1, precision, recall, roc_auc, brier_score, log_loss_value = eval_model(cls)\n",
"\n",
" # Filling a dataframe to better presentation\n",
" df_cv.at[\"test_accuracy\", \"TestSet\"] = accurancy\n",
" df_cv.at[\"test_f1\", \"TestSet\"] = f1\n",
" df_cv.at[\"test_recall\", \"TestSet\"] = recall\n",
" df_cv.at[\"test_precision\", \"TestSet\"] = precision\n",
" df_cv.at[\"test_roc_auc\", \"TestSet\"] = roc_auc\n",
" df_cv.at[\"test_neg_brier_score\", \"TestSet\"] = -brier_score\n",
" df_cv.at[\"test_neg_log_loss\", \"TestSet\"] = -log_loss_value\n",
"\n",
" caption = f\"{name} Validation Scores\"\n",
"\n",
" display(df_cv.style.set_caption(caption))\n",
"\n",
" return [accurancy, f1, precision, recall, roc_auc, brier_score, log_loss_value]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" LR Validation Scores\n",
" \n",
" \n",
" | \n",
" CV1 | \n",
" CV2 | \n",
" CV3 | \n",
" CV4 | \n",
" CV5 | \n",
" TestSet | \n",
"
\n",
" \n",
" \n",
" \n",
" fit_time | \n",
" 0.082354 | \n",
" 0.080257 | \n",
" 0.089329 | \n",
" 0.094720 | \n",
" 0.087742 | \n",
" nan | \n",
"
\n",
" \n",
" score_time | \n",
" 0.016066 | \n",
" 0.017635 | \n",
" 0.020100 | \n",
" 0.018260 | \n",
" 0.018356 | \n",
" nan | \n",
"
\n",
" \n",
" test_accuracy | \n",
" 0.869228 | \n",
" 0.868391 | \n",
" 0.872356 | \n",
" 0.869005 | \n",
" 0.867539 | \n",
" 0.865924 | \n",
"
\n",
" \n",
" test_f1 | \n",
" 0.817584 | \n",
" 0.818116 | \n",
" 0.823920 | \n",
" 0.819611 | \n",
" 0.817011 | \n",
" 0.814469 | \n",
"
\n",
" \n",
" test_precision | \n",
" 0.854135 | \n",
" 0.846154 | \n",
" 0.850582 | \n",
" 0.844326 | \n",
" 0.844498 | \n",
" 0.843439 | \n",
"
\n",
" \n",
" test_recall | \n",
" 0.784034 | \n",
" 0.791877 | \n",
" 0.798880 | \n",
" 0.796301 | \n",
" 0.791258 | \n",
" 0.787423 | \n",
"
\n",
" \n",
" test_roc_auc | \n",
" 0.903418 | \n",
" 0.904970 | \n",
" 0.906377 | \n",
" 0.902405 | \n",
" 0.906939 | \n",
" 0.904458 | \n",
"
\n",
" \n",
" test_neg_brier_score | \n",
" -0.109808 | \n",
" -0.109221 | \n",
" -0.106382 | \n",
" -0.110939 | \n",
" -0.109709 | \n",
" -0.110435 | \n",
"
\n",
" \n",
" test_neg_log_loss | \n",
" -0.370200 | \n",
" -0.366684 | \n",
" -0.360534 | \n",
" -0.372374 | \n",
" -0.367532 | \n",
" -0.370350 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" NB Validation Scores\n",
" \n",
" \n",
" | \n",
" CV1 | \n",
" CV2 | \n",
" CV3 | \n",
" CV4 | \n",
" CV5 | \n",
" TestSet | \n",
"
\n",
" \n",
" \n",
" \n",
" fit_time | \n",
" 0.035410 | \n",
" 0.030015 | \n",
" 0.032639 | \n",
" 0.029752 | \n",
" 0.030653 | \n",
" nan | \n",
"
\n",
" \n",
" score_time | \n",
" 0.037826 | \n",
" 0.040993 | \n",
" 0.032376 | \n",
" 0.030767 | \n",
" 0.028092 | \n",
" nan | \n",
"
\n",
" \n",
" test_accuracy | \n",
" 0.768401 | \n",
" 0.763376 | \n",
" 0.765131 | \n",
" 0.771518 | \n",
" 0.772251 | \n",
" 0.766637 | \n",
"
\n",
" \n",
" test_f1 | \n",
" 0.667068 | \n",
" 0.654223 | \n",
" 0.660922 | \n",
" 0.675876 | \n",
" 0.668899 | \n",
" 0.664795 | \n",
"
\n",
" \n",
" test_precision | \n",
" 0.720885 | \n",
" 0.720836 | \n",
" 0.717898 | \n",
" 0.719254 | \n",
" 0.732333 | \n",
" 0.717684 | \n",
"
\n",
" \n",
" test_recall | \n",
" 0.620728 | \n",
" 0.598880 | \n",
" 0.612325 | \n",
" 0.637433 | \n",
" 0.615579 | \n",
" 0.619166 | \n",
"
\n",
" \n",
" test_roc_auc | \n",
" 0.852290 | \n",
" 0.847184 | \n",
" 0.843733 | \n",
" 0.851873 | \n",
" 0.856047 | \n",
" 0.848834 | \n",
"
\n",
" \n",
" test_neg_brier_score | \n",
" -0.206596 | \n",
" -0.210362 | \n",
" -0.211968 | \n",
" -0.204214 | \n",
" -0.202682 | \n",
" -0.208278 | \n",
"
\n",
" \n",
" test_neg_log_loss | \n",
" -1.668014 | \n",
" -1.788896 | \n",
" -1.917438 | \n",
" -1.662381 | \n",
" -1.670358 | \n",
" -1.761326 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" KNN Validation Scores\n",
" \n",
" \n",
" | \n",
" CV1 | \n",
" CV2 | \n",
" CV3 | \n",
" CV4 | \n",
" CV5 | \n",
" TestSet | \n",
"
\n",
" \n",
" \n",
" \n",
" fit_time | \n",
" 0.010002 | \n",
" 0.011312 | \n",
" 0.011621 | \n",
" 0.013843 | \n",
" 0.011473 | \n",
" nan | \n",
"
\n",
" \n",
" score_time | \n",
" 1.660269 | \n",
" 1.360570 | \n",
" 1.651296 | \n",
" 1.734129 | \n",
" 1.823339 | \n",
" nan | \n",
"
\n",
" \n",
" test_accuracy | \n",
" 0.842320 | \n",
" 0.848707 | \n",
" 0.847330 | \n",
" 0.842723 | \n",
" 0.847749 | \n",
" 0.843692 | \n",
"
\n",
" \n",
" test_f1 | \n",
" 0.776492 | \n",
" 0.787218 | \n",
" 0.783551 | \n",
" 0.779053 | \n",
" 0.786365 | \n",
" 0.779698 | \n",
"
\n",
" \n",
" test_precision | \n",
" 0.825758 | \n",
" 0.829867 | \n",
" 0.833544 | \n",
" 0.820068 | \n",
" 0.826691 | \n",
" 0.823778 | \n",
"
\n",
" \n",
" test_recall | \n",
" 0.732773 | \n",
" 0.748739 | \n",
" 0.739216 | \n",
" 0.741945 | \n",
" 0.749790 | \n",
" 0.740097 | \n",
"
\n",
" \n",
" test_roc_auc | \n",
" 0.867330 | \n",
" 0.869924 | \n",
" 0.872951 | \n",
" 0.866868 | \n",
" 0.872277 | \n",
" 0.872155 | \n",
"
\n",
" \n",
" test_neg_brier_score | \n",
" -0.130989 | \n",
" -0.127425 | \n",
" -0.126777 | \n",
" -0.130655 | \n",
" -0.127401 | \n",
" -0.128215 | \n",
"
\n",
" \n",
" test_neg_log_loss | \n",
" -2.083997 | \n",
" -1.959589 | \n",
" -1.815403 | \n",
" -2.007178 | \n",
" -1.929602 | \n",
" -1.877810 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" RF Validation Scores\n",
" \n",
" \n",
" | \n",
" CV1 | \n",
" CV2 | \n",
" CV3 | \n",
" CV4 | \n",
" CV5 | \n",
" TestSet | \n",
"
\n",
" \n",
" \n",
" \n",
" fit_time | \n",
" 4.099665 | \n",
" 4.061200 | \n",
" 4.090116 | \n",
" 4.055705 | \n",
" 4.050387 | \n",
" nan | \n",
"
\n",
" \n",
" score_time | \n",
" 0.390365 | \n",
" 0.389244 | \n",
" 0.392108 | \n",
" 0.387358 | \n",
" 0.400155 | \n",
" nan | \n",
"
\n",
" \n",
" test_accuracy | \n",
" 0.856141 | \n",
" 0.859282 | \n",
" 0.861571 | \n",
" 0.853508 | \n",
" 0.855393 | \n",
" 0.856152 | \n",
"
\n",
" \n",
" test_f1 | \n",
" 0.800349 | \n",
" 0.805217 | \n",
" 0.807681 | \n",
" 0.798676 | \n",
" 0.799477 | \n",
" 0.800623 | \n",
"
\n",
" \n",
" test_precision | \n",
" 0.831522 | \n",
" 0.834234 | \n",
" 0.840194 | \n",
" 0.821006 | \n",
" 0.829717 | \n",
" 0.830547 | \n",
"
\n",
" \n",
" test_recall | \n",
" 0.771429 | \n",
" 0.778151 | \n",
" 0.777591 | \n",
" 0.777529 | \n",
" 0.771365 | \n",
" 0.772781 | \n",
"
\n",
" \n",
" test_roc_auc | \n",
" 0.890122 | \n",
" 0.890561 | \n",
" 0.897321 | \n",
" 0.887396 | \n",
" 0.891078 | \n",
" 0.893466 | \n",
"
\n",
" \n",
" test_neg_brier_score | \n",
" -0.116884 | \n",
" -0.114867 | \n",
" -0.111343 | \n",
" -0.117719 | \n",
" -0.116295 | \n",
" -0.115285 | \n",
"
\n",
" \n",
" test_neg_log_loss | \n",
" -0.607395 | \n",
" -0.579640 | \n",
" -0.536542 | \n",
" -0.614554 | \n",
" -0.631888 | \n",
" -0.562042 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" GBC Validation Scores\n",
" \n",
" \n",
" | \n",
" CV1 | \n",
" CV2 | \n",
" CV3 | \n",
" CV4 | \n",
" CV5 | \n",
" TestSet | \n",
"
\n",
" \n",
" \n",
" \n",
" fit_time | \n",
" 4.591437 | \n",
" 4.437213 | \n",
" 4.121067 | \n",
" 4.142180 | \n",
" 4.113901 | \n",
" nan | \n",
"
\n",
" \n",
" score_time | \n",
" 0.055993 | \n",
" 0.048113 | \n",
" 0.049492 | \n",
" 0.050163 | \n",
" 0.055706 | \n",
" nan | \n",
"
\n",
" \n",
" test_accuracy | \n",
" 0.871113 | \n",
" 0.873207 | \n",
" 0.878639 | \n",
" 0.870052 | \n",
" 0.870157 | \n",
" 0.871054 | \n",
"
\n",
" \n",
" test_f1 | \n",
" 0.817169 | \n",
" 0.820831 | \n",
" 0.827709 | \n",
" 0.817043 | \n",
" 0.817109 | \n",
" 0.817560 | \n",
"
\n",
" \n",
" test_precision | \n",
" 0.869744 | \n",
" 0.869865 | \n",
" 0.881850 | \n",
" 0.862166 | \n",
" 0.862660 | \n",
" 0.867518 | \n",
"
\n",
" \n",
" test_recall | \n",
" 0.770588 | \n",
" 0.777031 | \n",
" 0.779832 | \n",
" 0.776408 | \n",
" 0.776128 | \n",
" 0.773042 | \n",
"
\n",
" \n",
" test_roc_auc | \n",
" 0.907041 | \n",
" 0.908041 | \n",
" 0.911930 | \n",
" 0.906283 | \n",
" 0.909348 | \n",
" 0.908648 | \n",
"
\n",
" \n",
" test_neg_brier_score | \n",
" -0.105054 | \n",
" -0.103463 | \n",
" -0.099338 | \n",
" -0.104658 | \n",
" -0.104459 | \n",
" -0.104280 | \n",
"
\n",
" \n",
" test_neg_log_loss | \n",
" -0.352792 | \n",
" -0.348499 | \n",
" -0.338605 | \n",
" -0.351285 | \n",
" -0.350193 | \n",
" -0.350152 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" XGB Validation Scores\n",
" \n",
" \n",
" | \n",
" CV1 | \n",
" CV2 | \n",
" CV3 | \n",
" CV4 | \n",
" CV5 | \n",
" TestSet | \n",
"
\n",
" \n",
" \n",
" \n",
" fit_time | \n",
" 3.802029 | \n",
" 3.036764 | \n",
" 2.979647 | \n",
" 2.177232 | \n",
" 2.287098 | \n",
" nan | \n",
"
\n",
" \n",
" score_time | \n",
" 0.069013 | \n",
" 0.071819 | \n",
" 0.049402 | \n",
" 0.057279 | \n",
" 0.050020 | \n",
" nan | \n",
"
\n",
" \n",
" test_accuracy | \n",
" 0.860224 | \n",
" 0.851848 | \n",
" 0.854136 | \n",
" 0.853298 | \n",
" 0.856021 | \n",
" 0.854344 | \n",
"
\n",
" \n",
" test_f1 | \n",
" 0.814145 | \n",
" 0.804747 | \n",
" 0.808259 | \n",
" 0.807370 | \n",
" 0.810371 | \n",
" 0.808283 | \n",
"
\n",
" \n",
" test_precision | \n",
" 0.809300 | \n",
" 0.793038 | \n",
" 0.794587 | \n",
" 0.792657 | \n",
" 0.797936 | \n",
" 0.795443 | \n",
"
\n",
" \n",
" test_recall | \n",
" 0.819048 | \n",
" 0.816807 | \n",
" 0.822409 | \n",
" 0.822639 | \n",
" 0.823200 | \n",
" 0.821545 | \n",
"
\n",
" \n",
" test_roc_auc | \n",
" 0.908407 | \n",
" 0.906379 | \n",
" 0.910833 | \n",
" 0.907507 | \n",
" 0.908959 | \n",
" 0.908681 | \n",
"
\n",
" \n",
" test_neg_brier_score | \n",
" -0.116893 | \n",
" -0.119319 | \n",
" -0.116034 | \n",
" -0.119313 | \n",
" -0.118294 | \n",
" -0.118266 | \n",
"
\n",
" \n",
" test_neg_log_loss | \n",
" -0.393473 | \n",
" -0.395306 | \n",
" -0.384403 | \n",
" -0.397352 | \n",
" -0.394224 | \n",
" -0.392001 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# XGB hyperparameter that deals with unbalanced\n",
"scale_pos_weight = Y.mean()**-1\n",
"\n",
"# Creating the model objects\n",
"cls_lr = LogisticRegression(\n",
" class_weight=\"balanced\", # Hyperparameter to deal with unbalanced output\n",
" random_state=lucky_num)\n",
"# cls_svm = SVC(random_state=lucky_num) # Remove due its resource consumption and worst results\n",
"cls_NB = GaussianNB()\n",
"cls_knn = KNeighborsClassifier()\n",
"cls_rf = RandomForestClassifier(\n",
" random_state=lucky_num,\n",
" class_weight=\"balanced_subsample\") # Hyperparameter to deal with unbalanced output\n",
"cls_gbc = GradientBoostingClassifier(random_state=lucky_num)\n",
"cls_xgb = xgb.XGBClassifier(\n",
" objective=\"binary:logistic\",\n",
" verbose=None,\n",
" random_state=lucky_num,\n",
" scale_pos_weight = scale_pos_weight)\n",
"\n",
"# Lists to iterate on our modeling function\n",
"cls_name = [\"LR\", \"NB\", \"KNN\", \"RF\", \"GBC\", \"XGB\"]\n",
"cls_list = [cls_lr, cls_NB, cls_knn, cls_rf, cls_gbc, cls_xgb]\n",
"\n",
"mdl_summaries = []\n",
"for name, inst in zip(cls_name, cls_list):\n",
" mdl_list = create_model(name, inst)\n",
" mdl_list = [name] + mdl_list\n",
" mdl_summaries.append(mdl_list)\n",
"\n",
"df_mdl = pd.DataFrame(\n",
" mdl_summaries,\n",
" columns=[\n",
" \"model\",\n",
" \"test_accuracy\",\n",
" \"test_f1\",\n",
" \"test_precision\",\n",
" \"test_recall\",\n",
" \"test_roc_auc\",\n",
" \"test_brier\",\n",
" \"test_log_loss\"])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" Test set validation scores\n",
" \n",
" \n",
" | \n",
" model | \n",
" test_accuracy | \n",
" test_f1 | \n",
" test_precision | \n",
" test_recall | \n",
" test_roc_auc | \n",
" test_brier | \n",
" test_log_loss | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" GBC | \n",
" 0.871054 | \n",
" 0.817560 | \n",
" 0.867518 | \n",
" 0.773042 | \n",
" 0.908648 | \n",
" 0.104280 | \n",
" 0.350152 | \n",
"
\n",
" \n",
" 1 | \n",
" LR | \n",
" 0.865924 | \n",
" 0.814469 | \n",
" 0.843439 | \n",
" 0.787423 | \n",
" 0.904458 | \n",
" 0.110435 | \n",
" 0.370350 | \n",
"
\n",
" \n",
" 2 | \n",
" XGB | \n",
" 0.854344 | \n",
" 0.808283 | \n",
" 0.795443 | \n",
" 0.821545 | \n",
" 0.908681 | \n",
" 0.118266 | \n",
" 0.392001 | \n",
"
\n",
" \n",
" 3 | \n",
" RF | \n",
" 0.856152 | \n",
" 0.800623 | \n",
" 0.830547 | \n",
" 0.772781 | \n",
" 0.893466 | \n",
" 0.115285 | \n",
" 0.562042 | \n",
"
\n",
" \n",
" 4 | \n",
" KNN | \n",
" 0.843692 | \n",
" 0.779698 | \n",
" 0.823778 | \n",
" 0.740097 | \n",
" 0.872155 | \n",
" 0.128215 | \n",
" 1.877810 | \n",
"
\n",
" \n",
" 5 | \n",
" NB | \n",
" 0.766637 | \n",
" 0.664795 | \n",
" 0.717684 | \n",
" 0.619166 | \n",
" 0.848834 | \n",
" 0.208278 | \n",
" 1.761326 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_mdl.sort_values(\n",
" \"test_f1\",\n",
" ascending=False,\n",
" inplace=True,\n",
" ignore_index=True)\n",
"\n",
"display(df_mdl.style.set_caption(\"Test set validation scores\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"GBC, LR, XGB and RF preset great results! We have two ways here: hyperparameters tunning or creating a composite model. Let's begin with the composite model.\n"
]
},
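{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we will pass `voting=\"soft\"`, the `VotingClassifier` averages the class probabilities predicted by its members (with equal weights here) instead of counting their hard votes. Below is a minimal sketch of that computation on one test row, using the already fitted models; it should closely match the composite model's `predict_proba` once the next cell fits it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Minimal sketch of what soft voting computes: the (equal-weight) average\n",
"# of the members' predicted probabilities for one test row\n",
"row = X_test[:1]\n",
"avg_prob = np.mean([m.predict_proba(row)[:, 1] for m in (cls_gbc, cls_xgb, cls_lr, cls_rf)], axis=0)\n",
"print(avg_prob)"
]
},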
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" Test set validation scores for Composite Model\n",
" \n",
" \n",
" | \n",
" CV1 | \n",
" CV2 | \n",
" CV3 | \n",
" CV4 | \n",
" CV5 | \n",
" TestSet | \n",
"
\n",
" \n",
" \n",
" \n",
" fit_time | \n",
" 10.109613 | \n",
" 11.766011 | \n",
" 11.450818 | \n",
" 11.737634 | \n",
" 12.702598 | \n",
" nan | \n",
"
\n",
" \n",
" score_time | \n",
" 0.490518 | \n",
" 0.532695 | \n",
" 0.529459 | \n",
" 0.549051 | \n",
" 0.586749 | \n",
" nan | \n",
"
\n",
" \n",
" test_accuracy | \n",
" 0.870799 | \n",
" 0.871532 | \n",
" 0.875497 | \n",
" 0.869948 | \n",
" 0.869215 | \n",
" 0.869002 | \n",
"
\n",
" \n",
" test_f1 | \n",
" 0.818689 | \n",
" 0.820797 | \n",
" 0.826297 | \n",
" 0.819319 | \n",
" 0.817531 | \n",
" 0.817283 | \n",
"
\n",
" \n",
" test_precision | \n",
" 0.860939 | \n",
" 0.857492 | \n",
" 0.863511 | \n",
" 0.852042 | \n",
" 0.854090 | \n",
" 0.853645 | \n",
"
\n",
" \n",
" test_recall | \n",
" 0.780392 | \n",
" 0.787115 | \n",
" 0.792157 | \n",
" 0.789017 | \n",
" 0.783973 | \n",
" 0.783893 | \n",
"
\n",
" \n",
" test_roc_auc | \n",
" 0.909022 | \n",
" 0.908890 | \n",
" 0.912418 | \n",
" 0.907315 | \n",
" 0.910340 | \n",
" 0.910500 | \n",
"
\n",
" \n",
" test_neg_brier_score | \n",
" -0.105818 | \n",
" -0.105354 | \n",
" -0.101743 | \n",
" -0.106567 | \n",
" -0.105957 | \n",
" -0.105743 | \n",
"
\n",
" \n",
" test_neg_log_loss | \n",
" -0.356051 | \n",
" -0.353269 | \n",
" -0.344184 | \n",
" -0.357062 | \n",
" -0.355010 | \n",
" -0.353621 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Selecting the models\n",
"cls_name = [\"GBC\", \"XGB\", \"LR\", \"RF\",]\n",
"cls_list = [cls_gbc, cls_xgb, cls_lr, cls_rf]\n",
"\n",
"# Training the voting classifier\n",
"cls_vot = VotingClassifier([*zip(cls_name, cls_list)], voting=\"soft\")\n",
"cls_vot.fit(X_train, y_train)\n",
"\n",
"# Using cross-validation to evaluate the model fitted\n",
"cls_cross = cross_validate(\n",
" estimator=cls_vot,\n",
" X=X_train,\n",
" y=y_train,\n",
" cv=5,\n",
" scoring=scores)\n",
"\n",
"df_vot = pd.DataFrame.from_dict(cls_cross, orient='index', columns=[\"CV\"+str(i) for i in range(1,6)])\n",
"\n",
"# Calculating score to test set\n",
"accurancy, f1, precision, recall, roc_auc, brier_score, log_loss_value = eval_model(cls_vot)\n",
"\n",
"# Filling a dataframe to better presentation\n",
"df_vot.at[\"test_accuracy\", \"TestSet\"] = accurancy\n",
"df_vot.at[\"test_f1\", \"TestSet\"] = f1\n",
"df_vot.at[\"test_recall\", \"TestSet\"] = recall\n",
"df_vot.at[\"test_precision\", \"TestSet\"] = precision\n",
"df_vot.at[\"test_roc_auc\", \"TestSet\"] = roc_auc\n",
"df_vot.at[\"test_neg_brier_score\", \"TestSet\"] = -brier_score\n",
"df_vot.at[\"test_neg_log_loss\", \"TestSet\"] = -log_loss_value\n",
"\n",
"display(df_vot.style.set_caption(\"Test set validation scores for Composite Model\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The composite model does not present any evidence of overfitting. For now, we will use it on our app."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['c:\\\\Users\\\\grego\\\\OneDrive\\\\Documentos\\\\Documentos Pessoais\\\\00_DataCamp\\\\09_VSC\\\\poa_car_accidents\\\\poa_car_accidents\\\\model\\\\model_feridos.pkl']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Saving\n",
"file_name = \"model_\" + output + '.pkl'\n",
"jb.dump(cls_vot, path.join(path.abspath(\"./\"), file_name))"
]
}
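,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, a minimal sketch of how the app could consume the saved artifacts. It assumes the files written by this notebook (`model_features.csv` plus the scaler and model `.pkl` files) sit next to the app code and that a new observation provides every feature in the saved order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of the app side (assumes the artifact files written by\n",
"# this notebook are available next to the app code)\n",
"import csv\n",
"import joblib as jb\n",
"\n",
"with open(\"model_features.csv\") as f:\n",
"    features = next(csv.reader(f))\n",
"\n",
"app_scaler = jb.load(\"scaler_\" + output + \".pkl\")\n",
"app_model = jb.load(\"model_\" + output + \".pkl\")\n",
"\n",
"# Reusing one raw row as a stand-in for a new observation,\n",
"# with the columns selected in the saved order\n",
"new_obs = X.iloc[[0]][features]\n",
"prob = app_model.predict_proba(app_scaler.transform(new_obs))[0, 1]\n",
"print(f\"Probability of {output}: {prob:.3f}\")"
]
}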
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "1372d04dbd71fdc5436c5d6e671c1b9287e750e86143c81b5a7ba0acaf653c5e"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}