# 1. Introduction

This notebook trains models on the Porto Alegre traffic accidents data after the initial cleaning, processing, and transformation step, which was done in a notebook in the `data` folder. We will build three models:

1. Predict the probability of injured people.

2. Predict the probability of seriously injured people.

3. Predict the probability of dead people in the event or after it.

The training path is the same for all three models; only the data filtering and the analysis of the results change.

# 2. Data Loading

In [1]:
import os.path as path
from pandas import read_csv

file_csv =  path.abspath("../")

file_csv = path.join(file_csv, "data" ,"accidents_trans.csv")

accidents_trans = read_csv(file_csv)

accidents_trans.head(3).T

Unnamed: 0,0,1,2
latitude,-30.009614,-30.0403,-30.069
longitude,-51.185581,-51.1958,-51.1437
feridos,True,True,True
feridos_gr,False,False,False
fatais,False,False,False
caminhao,False,False,False
moto,True,True,False
cars,True,True,True
transport,False,False,False
others,False,False,False


# 3. Data Preparation

In [2]:
import joblib as jb  # Used to save the model for deployment
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [3]:
outputs = ["feridos", "feridos_gr", "fatais"]
inputs = [col for col in accidents_trans.columns if col not in outputs]

X = accidents_trans[inputs].copy()
Y = accidents_trans[outputs].copy()

# Filtering data considering the output
output = "feridos"

if output == "feridos_gr":
    X = X[Y["feridos"]]
    Y = Y.loc[Y["feridos"], "feridos_gr"]
elif output == "fatais":
    X = X[Y["feridos_gr"]]
    Y = Y.loc[Y["feridos_gr"], "fatais"]
else:
    Y = Y["feridos"]

print(f"Our model to predict the probability of "
      f"{output} will be created with {X.shape[0]} "
      f"rows and {X.shape[1]} features.")

Our model to predict the probability of feridos will be created with 68218 rows and 41 features.


In [4]:
import csv

with open("model_features.csv", "w", newline="") as f:  # newline="" avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(X.columns)

Because some of the models we will use are sensitive to feature scale, we need to scale the data first. We also need to save the scaler for future use.

In [5]:
# Setting the random state using my lucky number :-)
lucky_num = 7

# X_train and y_train to train our model
X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    test_size=0.30,
    random_state=lucky_num,
    shuffle=True,  # Used because our data is sorted by date
    stratify=Y)  # Used because our data is imbalanced

# Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Saving scaler
file_name = "scaler_" + output + '.pkl'
jb.dump(scaler, path.join(path.abspath("./"), file_name))

['c:\\Users\\grego\\OneDrive\\Documentos\\Documentos Pessoais\\00_DataCamp\\09_VSC\\poa_car_accidents\\poa_car_accidents\\model\\scaler_feridos.pkl']
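A minimal sketch of how the saved feature list and scaler might be reused at prediction time. It assumes files written the same way as in the cells above; the toy frame `X_demo`, the temporary folder, and the values in `new_row` are hypothetical and only illustrate the round trip.

```python
import csv
import os.path as path
import tempfile

import joblib as jb
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the training features (not the real data)
X_demo = pd.DataFrame({
    "latitude": [-30.01, -30.04, -30.07],
    "longitude": [-51.19, -51.20, -51.14],
    "moto": [1, 1, 0],
})

work_dir = tempfile.mkdtemp()  # stands in for the model folder
features_csv = path.join(work_dir, "model_features.csv")
scaler_pkl = path.join(work_dir, "scaler_feridos.pkl")

# Save the column order and the fitted scaler, as in the cells above
with open(features_csv, "w", newline="") as f:
    csv.writer(f).writerow(X_demo.columns)
jb.dump(StandardScaler().fit(X_demo), scaler_pkl)

# Later (for example in the app): reload both and scale a new observation
with open(features_csv, newline="") as f:
    feature_order = next(csv.reader(f))
scaler_loaded = jb.load(scaler_pkl)

# The new row may arrive with columns in any order; reorder before scaling
new_row = pd.DataFrame([{"moto": 1, "longitude": -51.18, "latitude": -30.03}])
new_scaled = scaler_loaded.transform(new_row[feature_order])
print(new_scaled.shape)
```

Reordering with `feature_order` matters because the scaler applies its stored means and standard deviations by column position.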

# 4. Data Modeling

We will create and use cross-validation to evaluate the following models:

- Logistic Regression;

- Gaussian Naive Bayes;

- K Neighbors;

- Random Forest;

- Gradient Boosting; and,

- XGBoost.

We will use two scores to select and evaluate our models:

- F1 score: the harmonic mean of precision (the share of predicted positives that are truly positive) and recall (the share of true positives that the model identifies); and,

- Brier score: the mean squared difference between the predicted probability and the actual outcome (lower is better).
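To make the two selection scores concrete, here is a small worked sketch that computes them by hand and compares the result with scikit-learn. The labels and probabilities are made up for illustration, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, f1_score

# Hypothetical labels and predicted probabilities for six events
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # assumed 0.5 decision threshold

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# Brier score: mean squared gap between probability and outcome
brier_manual = np.mean((y_prob - y_true) ** 2)

print(f1_manual, f1_score(y_true, y_pred))
print(brier_manual, brier_score_loss(y_true, y_prob))
```

F1 depends only on the thresholded labels, while the Brier score uses the probabilities directly, which is why the two complement each other when the goal is a probability estimate.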

We will also look at other metrics to support our decision:

- Accuracy;

- ROC AUC; and,

- Log loss (another way to quantify the quality of probability predictions).

Finally, for each model we will check whether it has a hyperparameter for dealing with the imbalanced output.
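As a rough illustration of what such hyperparameters do, the sketch below reproduces the weights behind scikit-learn's `class_weight="balanced"` and computes the negative-to-positive ratio that the XGBoost documentation suggests as a typical `scale_pos_weight`. The target `y_demo` is hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 8 negatives, 2 positives
y_demo = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_demo)

# class_weight="balanced" uses n_samples / (n_classes * count_per_class)
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# Typical scale_pos_weight: sum(negative instances) / sum(positive instances)
neg, pos = np.bincount(y_demo)
ratio_neg_pos = neg / pos

print(balanced, manual, ratio_neg_pos)
```

Note that this ratio is not the same as the inverse of the positive rate (`1 / y_demo.mean()`), which gives a somewhat larger weight on the same data.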

In [6]:
import pandas as pd
import xgboost as xgb
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate 
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, f1_score, brier_score_loss, log_loss

scores = ["accuracy", "f1", "precision", "recall", "roc_auc", "neg_brier_score","neg_log_loss"]

In [7]:
def eval_model(cls) -> tuple:
    """Calculate the metrics used to evaluate
    a fitted classification model on the test set.
    """
    # Predicting labels and probabilities
    y_pred = cls.predict(X_test)
    y_prob = cls.predict_proba(X_test)[:,1]

    # Calculating scores
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_prob)  # https://datascience.stackexchange.com/questions/114394/does-roc-auc-different-between-crossval-and-test-set-indicate-overfitting-or-oth
    brier_score = brier_score_loss(y_test, y_prob)
    log_loss_value = log_loss(y_test, y_prob)

    return accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value

def create_model(name: str, cls) -> list:
    """Fit a model, cross-validate it on the training set,
    and return its test-set scores."""
    # Fitting model
    cls.fit(X_train, y_train)

    # Using cross-validation to evaluate the model
    cls_cross = cross_validate(
        estimator=cls,
        X=X_train,
        y=y_train,
        cv=5,
        scoring=scores)

    df_cv = pd.DataFrame.from_dict(cls_cross, orient='index', columns=["CV"+str(i) for i in range(1,6)])

    # Calculating scores on the test set
    accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value = eval_model(cls)

    # Filling a dataframe for better presentation
    df_cv.at["test_accuracy", "TestSet"] = accuracy
    df_cv.at["test_f1", "TestSet"] = f1
    df_cv.at["test_recall", "TestSet"] = recall
    df_cv.at["test_precision", "TestSet"] = precision
    df_cv.at["test_roc_auc", "TestSet"] = roc_auc
    df_cv.at["test_neg_brier_score", "TestSet"] = -brier_score
    df_cv.at["test_neg_log_loss", "TestSet"] = -log_loss_value

    caption = f"{name} Validation Scores"

    display(df_cv.style.set_caption(caption))

    return [accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value]

In [8]:
# XGB hyperparameter that deals with the imbalanced output
scale_pos_weight = Y.mean()**-1

# Creating the model objects
cls_lr = LogisticRegression(
            class_weight="balanced",  # Hyperparameter to deal with unbalanced output
            random_state=lucky_num)
# cls_svm = SVC(random_state=lucky_num)  # Removed due to its resource consumption and worse results
cls_NB = GaussianNB()
cls_knn = KNeighborsClassifier()
cls_rf = RandomForestClassifier(
            random_state=lucky_num,
            class_weight="balanced_subsample")  # Hyperparameter to deal with unbalanced output
cls_gbc = GradientBoostingClassifier(random_state=lucky_num)
cls_xgb = xgb.XGBClassifier(
            objective="binary:logistic",
            verbosity=None,
            random_state=lucky_num,
            scale_pos_weight = scale_pos_weight)

# Lists to iterate on our modeling function
cls_name = ["LR", "NB", "KNN", "RF", "GBC", "XGB"]
cls_list = [cls_lr, cls_NB, cls_knn, cls_rf, cls_gbc, cls_xgb]

mdl_summaries = []
for name, inst in zip(cls_name, cls_list):
    mdl_list = create_model(name, inst)
    mdl_list = [name] + mdl_list
    mdl_summaries.append(mdl_list)

df_mdl = pd.DataFrame(
            mdl_summaries,
            columns=[
                "model",
                "test_accuracy",
                "test_f1",
                "test_precision",
                "test_recall",
                "test_roc_auc",
                "test_brier",
                "test_log_loss"])

LR Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,0.082354,0.080257,0.089329,0.09472,0.087742,
score_time,0.016066,0.017635,0.0201,0.01826,0.018356,
test_accuracy,0.869228,0.868391,0.872356,0.869005,0.867539,0.865924
test_f1,0.817584,0.818116,0.82392,0.819611,0.817011,0.814469
test_precision,0.854135,0.846154,0.850582,0.844326,0.844498,0.843439
test_recall,0.784034,0.791877,0.79888,0.796301,0.791258,0.787423
test_roc_auc,0.903418,0.90497,0.906377,0.902405,0.906939,0.904458
test_neg_brier_score,-0.109808,-0.109221,-0.106382,-0.110939,-0.109709,-0.110435
test_neg_log_loss,-0.3702,-0.366684,-0.360534,-0.372374,-0.367532,-0.37035


NB Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,0.03541,0.030015,0.032639,0.029752,0.030653,
score_time,0.037826,0.040993,0.032376,0.030767,0.028092,
test_accuracy,0.768401,0.763376,0.765131,0.771518,0.772251,0.766637
test_f1,0.667068,0.654223,0.660922,0.675876,0.668899,0.664795
test_precision,0.720885,0.720836,0.717898,0.719254,0.732333,0.717684
test_recall,0.620728,0.59888,0.612325,0.637433,0.615579,0.619166
test_roc_auc,0.85229,0.847184,0.843733,0.851873,0.856047,0.848834
test_neg_brier_score,-0.206596,-0.210362,-0.211968,-0.204214,-0.202682,-0.208278
test_neg_log_loss,-1.668014,-1.788896,-1.917438,-1.662381,-1.670358,-1.761326


KNN Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,0.010002,0.011312,0.011621,0.013843,0.011473,
score_time,1.660269,1.36057,1.651296,1.734129,1.823339,
test_accuracy,0.84232,0.848707,0.84733,0.842723,0.847749,0.843692
test_f1,0.776492,0.787218,0.783551,0.779053,0.786365,0.779698
test_precision,0.825758,0.829867,0.833544,0.820068,0.826691,0.823778
test_recall,0.732773,0.748739,0.739216,0.741945,0.74979,0.740097
test_roc_auc,0.86733,0.869924,0.872951,0.866868,0.872277,0.872155
test_neg_brier_score,-0.130989,-0.127425,-0.126777,-0.130655,-0.127401,-0.128215
test_neg_log_loss,-2.083997,-1.959589,-1.815403,-2.007178,-1.929602,-1.87781


RF Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,4.099665,4.0612,4.090116,4.055705,4.050387,
score_time,0.390365,0.389244,0.392108,0.387358,0.400155,
test_accuracy,0.856141,0.859282,0.861571,0.853508,0.855393,0.856152
test_f1,0.800349,0.805217,0.807681,0.798676,0.799477,0.800623
test_precision,0.831522,0.834234,0.840194,0.821006,0.829717,0.830547
test_recall,0.771429,0.778151,0.777591,0.777529,0.771365,0.772781
test_roc_auc,0.890122,0.890561,0.897321,0.887396,0.891078,0.893466
test_neg_brier_score,-0.116884,-0.114867,-0.111343,-0.117719,-0.116295,-0.115285
test_neg_log_loss,-0.607395,-0.57964,-0.536542,-0.614554,-0.631888,-0.562042


GBC Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,4.591437,4.437213,4.121067,4.14218,4.113901,
score_time,0.055993,0.048113,0.049492,0.050163,0.055706,
test_accuracy,0.871113,0.873207,0.878639,0.870052,0.870157,0.871054
test_f1,0.817169,0.820831,0.827709,0.817043,0.817109,0.81756
test_precision,0.869744,0.869865,0.88185,0.862166,0.86266,0.867518
test_recall,0.770588,0.777031,0.779832,0.776408,0.776128,0.773042
test_roc_auc,0.907041,0.908041,0.91193,0.906283,0.909348,0.908648
test_neg_brier_score,-0.105054,-0.103463,-0.099338,-0.104658,-0.104459,-0.10428
test_neg_log_loss,-0.352792,-0.348499,-0.338605,-0.351285,-0.350193,-0.350152


XGB Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,3.802029,3.036764,2.979647,2.177232,2.287098,
score_time,0.069013,0.071819,0.049402,0.057279,0.05002,
test_accuracy,0.860224,0.851848,0.854136,0.853298,0.856021,0.854344
test_f1,0.814145,0.804747,0.808259,0.80737,0.810371,0.808283
test_precision,0.8093,0.793038,0.794587,0.792657,0.797936,0.795443
test_recall,0.819048,0.816807,0.822409,0.822639,0.8232,0.821545
test_roc_auc,0.908407,0.906379,0.910833,0.907507,0.908959,0.908681
test_neg_brier_score,-0.116893,-0.119319,-0.116034,-0.119313,-0.118294,-0.118266
test_neg_log_loss,-0.393473,-0.395306,-0.384403,-0.397352,-0.394224,-0.392001


In [9]:
df_mdl.sort_values(
        "test_f1",
        ascending=False,
        inplace=True,
        ignore_index=True)

display(df_mdl.style.set_caption("Test set validation scores"))

Test set validation scores

Unnamed: 0,model,test_accuracy,test_f1,test_precision,test_recall,test_roc_auc,test_brier,test_log_loss
0,GBC,0.871054,0.81756,0.867518,0.773042,0.908648,0.10428,0.350152
1,LR,0.865924,0.814469,0.843439,0.787423,0.904458,0.110435,0.37035
2,XGB,0.854344,0.808283,0.795443,0.821545,0.908681,0.118266,0.392001
3,RF,0.856152,0.800623,0.830547,0.772781,0.893466,0.115285,0.562042
4,KNN,0.843692,0.779698,0.823778,0.740097,0.872155,0.128215,1.87781
5,NB,0.766637,0.664795,0.717684,0.619166,0.848834,0.208278,1.761326


GBC, LR, XGB and RF present great results! We have two options here: hyperparameter tuning or creating a composite model. Let's begin with the composite model.
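Before building it, a short sketch of what a soft-voting composite does: it averages the class probabilities of its members and predicts the class with the highest average. The synthetic data from `make_classification` and the two-member ensemble are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=7)

members = [("LR", LogisticRegression()), ("NB", GaussianNB())]
vote = VotingClassifier(members, voting="soft").fit(X_demo, y_demo)

# Soft voting: average the members' class probabilities...
member_probs = [est.predict_proba(X_demo) for est in vote.estimators_]
avg_prob = np.mean(member_probs, axis=0)

# ...then pick the class with the highest averaged probability
manual_pred = vote.classes_[np.argmax(avg_prob, axis=1)]

print(np.allclose(avg_prob, vote.predict_proba(X_demo)))
print(np.array_equal(manual_pred, vote.predict(X_demo)))
```

Because the ensemble outputs an averaged probability, it fits a use case where the probability itself, not only the label, is what gets reported.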


In [10]:
# Selecting the models
cls_name = ["GBC", "XGB", "LR", "RF"]
cls_list = [cls_gbc, cls_xgb, cls_lr, cls_rf]

# Training the voting classifier
cls_vot = VotingClassifier([*zip(cls_name, cls_list)], voting="soft")
cls_vot.fit(X_train, y_train)

# Using cross-validation to evaluate the model
cls_cross = cross_validate(
    estimator=cls_vot,
    X=X_train,
    y=y_train,
    cv=5,
    scoring=scores)

df_vot = pd.DataFrame.from_dict(cls_cross, orient='index', columns=["CV"+str(i) for i in range(1,6)])

# Calculating scores on the test set
accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value = eval_model(cls_vot)

# Filling a dataframe for better presentation
df_vot.at["test_accuracy", "TestSet"] = accuracy
df_vot.at["test_f1", "TestSet"] = f1
df_vot.at["test_recall", "TestSet"] = recall
df_vot.at["test_precision", "TestSet"] = precision
df_vot.at["test_roc_auc", "TestSet"] = roc_auc
df_vot.at["test_neg_brier_score", "TestSet"] = -brier_score
df_vot.at["test_neg_log_loss", "TestSet"] = -log_loss_value

display(df_vot.style.set_caption("Test set validation scores for Composite Model"))

Test set validation scores for Composite Model

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,10.109613,11.766011,11.450818,11.737634,12.702598,
score_time,0.490518,0.532695,0.529459,0.549051,0.586749,
test_accuracy,0.870799,0.871532,0.875497,0.869948,0.869215,0.869002
test_f1,0.818689,0.820797,0.826297,0.819319,0.817531,0.817283
test_precision,0.860939,0.857492,0.863511,0.852042,0.85409,0.853645
test_recall,0.780392,0.787115,0.792157,0.789017,0.783973,0.783893
test_roc_auc,0.909022,0.90889,0.912418,0.907315,0.91034,0.9105
test_neg_brier_score,-0.105818,-0.105354,-0.101743,-0.106567,-0.105957,-0.105743
test_neg_log_loss,-0.356051,-0.353269,-0.344184,-0.357062,-0.35501,-0.353621


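The composite model shows no evidence of overfitting: its test-set scores are close to the cross-validation scores (for example, an F1 of 0.817 on the test set against 0.818 to 0.826 across the folds). For now, we will use it in our app.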
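One simple way to check this, sketched below, is to compare each test-set score with the mean of the five cross-validation folds. The numbers are copied from the composite-model table above; the `summary` layout is only an illustration.

```python
import pandas as pd

# Scores copied from the composite-model table (three of the metrics)
cv_scores = pd.DataFrame(
    {
        "CV1": [0.818689, 0.909022, -0.105818],
        "CV2": [0.820797, 0.908890, -0.105354],
        "CV3": [0.826297, 0.912418, -0.101743],
        "CV4": [0.819319, 0.907315, -0.106567],
        "CV5": [0.817531, 0.910340, -0.105957],
    },
    index=["test_f1", "test_roc_auc", "test_neg_brier_score"],
)
test_scores = pd.Series(
    [0.817283, 0.910500, -0.105743],
    index=cv_scores.index,
)

summary = pd.DataFrame({
    "cv_mean": cv_scores.mean(axis=1),
    "cv_std": cv_scores.std(axis=1),
    "test": test_scores,
})
# A large negative gap would suggest the model does worse on unseen data
summary["gap"] = summary["test"] - summary["cv_mean"]

print(summary.round(4))
```

For these three metrics every gap is below 0.01 in absolute value, which is consistent with the conclusion above.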

In [11]:
# Saving
file_name = "model_" + output + '.pkl'
jb.dump(cls_vot, path.join(path.abspath("./"), file_name))

['c:\\Users\\grego\\OneDrive\\Documentos\\Documentos Pessoais\\00_DataCamp\\09_VSC\\poa_car_accidents\\poa_car_accidents\\model\\model_feridos.pkl']