# 1. Introduction

This notebook trains models on the Porto Alegre traffic accidents data after the initial cleaning, processing, and transformation step, which was done in a notebook in the `data` folder. In fact, we will train three models:

1. Predict the probability that people are injured.

2. Predict the probability that people are seriously injured.

3. Predict the probability that people die in the event or after it.

The training path is the same for all three models; only the data filtering and the analysis of the results change.

# 2. Data Loading

In [5]:
import os.path as path
from pandas import read_csv

file_csv =  path.abspath("../")

file_csv = path.join(file_csv, "data" ,"accidents_trans.csv")

accidents_trans = read_csv(file_csv)

accidents_trans.head(3).T

column,0,1,2
latitude,-30.009614,-30.0403,-30.069
longitude,-51.185581,-51.1958,-51.1437
feridos,True,True,True
feridos_gr,False,False,False
fatais,False,False,False
caminhao,False,False,False
moto,True,True,False
cars,True,True,True
transport,False,False,False
others,False,False,False


# 3. Data Preparation

In [6]:
import joblib as jb # Use to save the model to deploy
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [7]:
outputs = ["feridos", "feridos_gr", "fatais"]
inputs = [col for col in accidents_trans.columns if col not in outputs]

X = accidents_trans[inputs].copy()
Y = accidents_trans[outputs].copy()

# Filtering data considering the output
output = "feridos_gr"

if output == "feridos_gr":
    X = X[Y["feridos"]]
    Y = Y.loc[Y["feridos"], "feridos_gr"]
elif output == "fatais":
    X = X[Y["feridos_gr"]]
    Y = Y.loc[Y["feridos_gr"], "fatais"]
else:
    Y = Y["feridos"]

print(f"Our model to predict the probability of " \
      f"{output} will be created with {X.shape[0]} " \
      f"rows and {X.shape[1]} features.")

Our model to predict the probability of feridos_gr will be created with 25497 rows and 41 features.


In [8]:
import csv

with open("model_features.csv", "w", newline="") as f:  # newline="" avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(X.columns)

Because some of the models we will use are sensitive to feature scale, we need to scale our data first. We also need to save the scaler for future use.
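As a minimal sketch of what that future use could look like, the example below saves a fitted scaler and the feature order, then restores both to transform a new row. The toy column names `a` and `b`, the temporary directory, and the file name `scaler_example.pkl` are assumptions made for illustration; they are not the notebook's real features or files.

```python
import csv
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training data; the column names are placeholders (assumption).
train = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0, 40.0]})

with tempfile.TemporaryDirectory() as tmp:
    features_path = os.path.join(tmp, "model_features.csv")
    scaler_path = os.path.join(tmp, "scaler_example.pkl")  # hypothetical file name

    # Training time: persist the column order and the fitted scaler.
    with open(features_path, "w", newline="") as f:
        csv.writer(f).writerow(train.columns)
    joblib.dump(StandardScaler().fit(train.to_numpy()), scaler_path)

    # Inference time: restore both before touching new data.
    with open(features_path, newline="") as f:
        feature_names = next(csv.reader(f))
    loaded_scaler = joblib.load(scaler_path)

# New rows may arrive with columns in a different order.
new_rows = pd.DataFrame({"b": [25.0], "a": [1.5]})
scaled = loaded_scaler.transform(new_rows[feature_names].to_numpy())
print(scaled)
```

Reordering the new data with the saved feature list before calling `transform` matters because the scaler applies its means and variances by column position.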

In [9]:
# Setting the random state using my lucky number :-)
lucky_num = 7

# X_train and y_train to train our model
X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    test_size=0.30,
    random_state=lucky_num,
    shuffle=True,  # Used because our data is sorted by date
    stratify=Y)  # Used because our data is unbalanced

# Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Saving scaler
file_name = "scaler_" + output + '.pkl'
jb.dump(scaler, path.join(path.abspath("./"), file_name))

['c:\\Users\\grego\\OneDrive\\Documentos\\Documentos Pessoais\\00_DataCamp\\09_VSC\\poa_car_accidents\\poa_car_accidents\\model\\scaler_feridos_gr.pkl']
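To illustrate what `stratify=Y` does in the split above, here is a small sketch on synthetic labels (an assumption; the 80/20 class ratio is invented and is not the real class balance). It checks that the positive rate is preserved in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic unbalanced labels: 80 negatives and 20 positives (assumption).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.30,
    random_state=7,
    shuffle=True,
    stratify=y_demo,  # keep the class proportions in both partitions
)

print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Without `stratify`, a random split of a small or strongly unbalanced data set can leave the test set with noticeably fewer positives than the training set.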

# 4. Data Modeling

We will create and use cross-validation to evaluate the following models:

- Logistic Regression;

- Gaussian Naive Bayes;

- K Neighbors;

- Random Forest;

- Gradient Boosting; and,

- XGBoost.

We will use two scores to select and evaluate our models:

- F1 score: the harmonic mean of precision (the share of predicted positive labels that are actually positive) and recall (the share of actual positive labels that the model identifies); and,

- Brier score: the mean squared difference between the predicted probability and the actual outcome.
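The two selection scores can be checked by hand on a tiny example. The labels and probabilities below are made up for illustration, and the 0.5 decision threshold is an assumption that matches the default behaviour of `predict` for most binary classifiers.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, f1_score

# Made-up labels and predicted probabilities (assumption).
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6])
y_pred = (y_prob >= 0.5).astype(int)

# Confusion counts for the positive class.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

# Brier score: mean squared gap between probability and outcome.
brier_manual = float(np.mean((y_prob - y_true) ** 2))

print(f1_manual, f1_score(y_true, y_pred))
print(brier_manual, brier_score_loss(y_true, y_prob))
```

A higher F1 score is better, while a lower Brier score is better, which is why scikit-learn exposes the latter as `neg_brier_score` for cross-validation.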

However, we will see other metrics to support our decision:

- Accuracy;

- ROC AUC; and,

- Log loss (another way to quantify the quality of probability predictions).

Finally, for each model we will check whether there is a hyperparameter to deal with the unbalanced output.
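The sketch below shows, on made-up labels, what two such hyperparameters amount to numerically. `class_weight="balanced"` in scikit-learn weights each class by `n_samples / (n_classes * class_count)`. For XGBoost's `scale_pos_weight`, the XGBoost documentation suggests the ratio of negative to positive instances as a typical value; the inverse of the positive rate, used later in this notebook as `Y.mean()**-1`, is always that ratio plus one. The 8-to-2 label vector is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up unbalanced labels: 8 negatives, 2 positives (assumption).
y_demo = np.array([0] * 8 + [1] * 2)

# Weights that class_weight="balanced" would assign to classes 0 and 1.
balanced = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# Two candidate values for XGBoost's scale_pos_weight.
neg_over_pos = (y_demo == 0).sum() / (y_demo == 1).sum()
inverse_mean = y_demo.mean() ** -1

print(balanced, neg_over_pos, inverse_mean)
```

The two `scale_pos_weight` candidates are close when positives are rare, so the choice between them is a tuning decision rather than a correctness issue.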

In [10]:
import pandas as pd
import xgboost as xgb
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate 
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, roc_auc_score, f1_score, brier_score_loss, log_loss

scores = ["accuracy", "f1", "precision", "recall", "roc_auc", "neg_brier_score","neg_log_loss"]

In [11]:
def eval_model(cls) -> tuple:
    """This function will calculate the metrics
    to evaluate a classification model.
    """
    # Predicting labels and probabilities
    y_pred = cls.predict(X_test)
    y_prob = cls.predict_proba(X_test)[:,1]

    # Calculating scores
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_prob)  # https://datascience.stackexchange.com/questions/114394/does-roc-auc-different-between-crossval-and-test-set-indicate-overfitting-or-oth
    brier_score = brier_score_loss(y_test, y_prob)
    log_loss_value = log_loss(y_test, y_prob)

    return accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value

def create_model(name: str, cls) -> list:
    """This function will create some models
    and return scores to evaluate it."""
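    # Fitting the model on the full training set (used later for the test set scores)
    cls.fit(X_train, y_train)

    # Using cross-validation on the training set (cross_validate refits clones of cls on each fold)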
    cls_cross = cross_validate(
        estimator=cls,
        X=X_train,
        y=y_train,
        cv=5,
        scoring=scores)

    df_cv = pd.DataFrame.from_dict(cls_cross, orient='index', columns=["CV"+str(i) for i in range(1,6)])

    # Calculating scores on the test set
    accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value = eval_model(cls)

    # Filling a dataframe for better presentation
    df_cv.at["test_accuracy", "TestSet"] = accuracy
    df_cv.at["test_f1", "TestSet"] = f1
    df_cv.at["test_recall", "TestSet"] = recall
    df_cv.at["test_precision", "TestSet"] = precision
    df_cv.at["test_roc_auc", "TestSet"] = roc_auc
    df_cv.at["test_neg_brier_score", "TestSet"] = -brier_score
    df_cv.at["test_neg_log_loss", "TestSet"] = -log_loss_value

    caption = f"{name} Validation Scores"

    display(df_cv.style.set_caption(caption))

    return [accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value]
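To make the shape of the cross-validation result easier to follow, here is a minimal sketch on synthetic data (an assumption; `make_classification` stands in for the accident data, and only two scorers are requested). `cross_validate` returns a dictionary with `fit_time`, `score_time`, and one `test_<scorer>` entry per scorer, each holding one value per fold, which is why `from_dict(..., orient="index")` yields one row per key and one column per fold.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the accident data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=7)

cv_result = cross_validate(
    estimator=LogisticRegression(),
    X=X_demo,
    y=y_demo,
    cv=5,
    scoring=["f1", "neg_brier_score"],
)

# One row per dictionary key, one column per fold.
df_demo = pd.DataFrame.from_dict(
    cv_result, orient="index", columns=[f"CV{i}" for i in range(1, 6)]
)
print(df_demo.round(3))
```

Scorers prefixed with `neg_` are losses with the sign flipped so that higher is always better, which explains the negative Brier and log loss rows in the tables.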

In [15]:
# XGB hyperparameter that deals with the unbalanced output
scale_pos_weight = Y.mean()**-1

# Creating the model objects
cls_lr = LogisticRegression(
            class_weight="balanced",  # Hyperparameter to deal with unbalanced output
            random_state=lucky_num)
# cls_svm = SVC(random_state=lucky_num)  # Removed due to its resource consumption and worse results
cls_nb = GaussianNB()
cls_knn = KNeighborsClassifier()
cls_rf = RandomForestClassifier(
            random_state=lucky_num,
            class_weight="balanced_subsample")  # Hyperparameter to deal with unbalanced output
cls_gbc = GradientBoostingClassifier(random_state=lucky_num)
cls_xgb = xgb.XGBClassifier(
            objective="binary:logistic",
            verbose=None,
            random_state=lucky_num,
            scale_pos_weight = scale_pos_weight)

# Lists to iterate on our modeling function
cls_name = ["LR", "NB", "KNN", "RF", "GBC", "XGB"]
cls_list = [cls_lr, cls_nb, cls_knn, cls_rf, cls_gbc, cls_xgb]

mdl_summaries = []
for name, inst in zip(cls_name, cls_list):
    mdl_list = create_model(name, inst)
    mdl_list = [name] + mdl_list
    mdl_summaries.append(mdl_list)

df_mdl = pd.DataFrame(
            mdl_summaries,
            columns=[
                "model",
                "test_accuracy",
                "test_f1",
                "test_precision",
                "test_recall",
                "test_roc_auc",
                "test_brier",
                "test_log_loss"])

LR Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,0.03721,0.031787,0.031054,0.030418,0.031932,
score_time,0.0,0.008581,0.008867,0.009067,0.008065,
test_accuracy,0.676751,0.65042,0.651723,0.66237,0.641636,0.660523
test_f1,0.437622,0.422222,0.413403,0.424271,0.401497,0.414431
test_precision,0.352433,0.329957,0.326379,0.337386,0.315673,0.331889
test_recall,0.577121,0.586118,0.563707,0.571429,0.551414,0.551621
test_roc_auc,0.688461,0.667854,0.663118,0.671007,0.64401,0.66754
test_neg_brier_score,-0.22172,-0.228903,-0.228032,-0.226401,-0.231747,-0.225709
test_neg_log_loss,-0.635799,-0.651183,-0.649342,-0.646086,-0.657217,-0.644816


NB Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,0.014398,0.009557,0.006821,0.007424,0.01228,
score_time,0.011809,0.011134,0.017092,0.011901,0.011843,
test_accuracy,0.713725,0.696639,0.695433,0.695153,0.678902,0.699608
test_f1,0.398115,0.385706,0.386222,0.373993,0.358343,0.38126
test_precision,0.367391,0.345178,0.344064,0.338189,0.31746,0.345703
test_recall,0.434447,0.437018,0.440154,0.418275,0.411311,0.42497
test_roc_auc,0.658495,0.637091,0.633805,0.633238,0.60999,0.634031
test_neg_brier_score,-0.251232,-0.26089,-0.271004,-0.273609,-0.285054,-0.268468
test_neg_log_loss,-1.412893,-1.627295,-1.745289,-1.752608,-1.950351,-1.659029


KNN Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,0.0,0.002192,0.004006,0.004514,0.006648,
score_time,0.241018,0.238426,0.234827,0.25073,0.503909,
test_accuracy,0.74986,0.757703,0.750911,0.753713,0.746708,0.754379
test_f1,0.20339,0.214351,0.201258,0.208821,0.208406,0.214136
test_precision,0.332362,0.365325,0.333333,0.347305,0.326923,0.353103
test_recall,0.14653,0.151671,0.144144,0.149292,0.152956,0.153661
test_roc_auc,0.583018,0.588834,0.575426,0.575821,0.573316,0.582555
test_neg_brier_score,-0.19098,-0.186599,-0.191325,-0.190608,-0.193769,-0.189077
test_neg_log_loss,-2.500209,-2.097705,-2.376728,-2.528142,-2.570247,-2.42806


RF Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,1.522204,1.489951,1.485231,1.517533,1.491071,
score_time,0.152138,0.145954,0.156016,0.150085,0.14198,
test_accuracy,0.738655,0.746779,0.740544,0.734099,0.737461,0.733595
test_f1,0.240846,0.262643,0.227045,0.203191,0.244964,0.242379
test_precision,0.32816,0.359375,0.32304,0.292271,0.328294,0.318359
test_recall,0.190231,0.206941,0.175032,0.155727,0.195373,0.195678
test_roc_auc,0.616397,0.614756,0.597688,0.60229,0.596575,0.601436
test_neg_brier_score,-0.184849,-0.182776,-0.186704,-0.187369,-0.189693,-0.188506
test_neg_log_loss,-0.711816,-0.673727,-0.766073,-0.719537,-0.775028,-0.743245


GBC Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,1.292703,1.329037,1.315473,1.299452,1.31395,
score_time,0.01456,0.026127,0.018812,0.015999,0.020753,
test_accuracy,0.783193,0.777871,0.782292,0.77977,0.78005,0.78183
test_f1,0.112385,0.072515,0.091228,0.075294,0.081871,0.085479
test_precision,0.521277,0.402597,0.5,0.438356,0.454545,0.490566
test_recall,0.062982,0.039846,0.050193,0.041184,0.044987,0.046819
test_roc_auc,0.684278,0.665449,0.670804,0.671522,0.654303,0.66966
test_neg_brier_score,-0.157619,-0.160471,-0.159413,-0.159018,-0.161594,-0.159691
test_neg_log_loss,-0.48869,-0.495751,-0.493124,-0.492051,-0.498777,-0.493648


XGB Validation Scores

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,0.643858,0.640153,0.677121,0.634137,0.669338,
score_time,0.023605,0.016412,0.020805,0.01504,0.028892,
test_accuracy,0.620168,0.612605,0.604371,0.628748,0.620062,0.619216
test_f1,0.397869,0.386696,0.377974,0.38629,0.383076,0.387639
test_precision,0.303935,0.295193,0.287341,0.301737,0.296479,0.298285
test_recall,0.575835,0.560411,0.552124,0.53668,0.541131,0.553421
test_roc_auc,0.640317,0.62621,0.619924,0.627078,0.620747,0.630028
test_neg_brier_score,-0.231951,-0.23856,-0.23795,-0.233297,-0.239657,-0.236789
test_neg_log_loss,-0.660186,-0.67845,-0.673157,-0.666774,-0.681109,-0.671979


In [16]:
df_mdl.sort_values(
        "test_f1",
        ascending=False,
        inplace=True,
        ignore_index=True)

display(df_mdl.style.set_caption("Test set validation scores"))

Test set validation scores

index,model,test_accuracy,test_f1,test_precision,test_recall,test_roc_auc,test_brier,test_log_loss
0,LR,0.660523,0.414431,0.331889,0.551621,0.66754,0.225709,0.644816
1,XGB,0.619216,0.387639,0.298285,0.553421,0.630028,0.236789,0.671979
2,NB,0.699608,0.38126,0.345703,0.42497,0.634031,0.268468,1.659029
3,RF,0.733595,0.242379,0.318359,0.195678,0.601436,0.188506,0.743245
4,KNN,0.754379,0.214136,0.353103,0.153661,0.582555,0.189077,2.42806
5,GBC,0.78183,0.085479,0.490566,0.046819,0.66966,0.159691,0.493648


None of the models presents good results! We will try to fit a composite model with the three best by F1 score.
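A composite model built with soft voting predicts the average of the class probabilities of its members. The sketch below checks this on synthetic data (an assumption; two default-configured members stand in for the tuned models of this notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the accident data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=7)

vote = VotingClassifier(
    [("LR", LogisticRegression()), ("NB", GaussianNB())], voting="soft"
)
vote.fit(X_demo, y_demo)

# Average the probabilities of the fitted members by hand.
manual = np.mean([est.predict_proba(X_demo) for est in vote.estimators_], axis=0)

print(np.allclose(manual, vote.predict_proba(X_demo)))
```

Because the result is a plain average, one poorly calibrated member can pull the combined probabilities away from the better members; the `weights` parameter of `VotingClassifier` is one way to counter that.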

In [21]:
# Selecting the models
cls_name = ["LR", "NB", "XGB"]
cls_list = [cls_lr, cls_nb, cls_xgb]

# Training the voting classifier
cls_vot = VotingClassifier([*zip(cls_name, cls_list)], voting="soft")
cls_vot.fit(X_train, y_train)

# Using cross-validation to evaluate the model fitted
cls_cross = cross_validate(
    estimator=cls_vot,
    X=X_train,
    y=y_train,
    cv=5,
    scoring=scores)

df_vot = pd.DataFrame.from_dict(cls_cross, orient='index', columns=["CV"+str(i) for i in range(1,6)])

# Calculating scores on the test set
accuracy, f1, precision, recall, roc_auc, brier_score, log_loss_value = eval_model(cls_vot)

# Filling a dataframe for better presentation
df_vot.at["test_accuracy", "TestSet"] = accuracy
df_vot.at["test_f1", "TestSet"] = f1
df_vot.at["test_recall", "TestSet"] = recall
df_vot.at["test_precision", "TestSet"] = precision
df_vot.at["test_roc_auc", "TestSet"] = roc_auc
df_vot.at["test_neg_brier_score", "TestSet"] = -brier_score
df_vot.at["test_neg_log_loss", "TestSet"] = -log_loss_value

display(df_vot.style.set_caption("Test set validation scores for Composite Model"))

Test set validation scores for Composite Model

metric,CV1,CV2,CV3,CV4,CV5,TestSet
fit_time,0.622165,0.732543,0.591849,0.699149,0.617794,
score_time,0.023785,0.027777,0.030991,0.027807,0.02393,
test_accuracy,0.714846,0.70112,0.695713,0.693752,0.677221,0.699346
test_f1,0.41224,0.404243,0.389201,0.38561,0.369803,0.389597
test_precision,0.374214,0.357354,0.345654,0.342315,0.321905,0.349191
test_recall,0.458869,0.465296,0.445302,0.441441,0.434447,0.440576
test_roc_auc,0.679117,0.662225,0.651689,0.658722,0.640881,0.658847
test_neg_brier_score,-0.199904,-0.208466,-0.211428,-0.209929,-0.218236,-0.208991
test_neg_log_loss,-0.5907,-0.611156,-0.616876,-0.613285,-0.633314,-0.611563


The composite model does not beat the best individual model: its test F1 score (0.390) is below that of logistic regression (0.414). Maybe some tuning could improve this, but that is left for future work.

In [11]:
# Saving
# file_name = "model_" + output + '.pkl'
# jb.dump(cls_vot, path.join(path.abspath("./"), file_name))

['c:\\Users\\grego\\OneDrive\\Documentos\\Documentos Pessoais\\00_DataCamp\\09_VSC\\poa_car_accidents\\poa_car_accidents\\model\\model_feridos.pkl']