{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The dataset contains smartphone sensor records for 4 fall-related activities and 9 normal activities.\n",
"\n",
"The fall activities are:\n",
"* FOL: Falling forward\n",
"* FKL: Falling onto the knees\n",
"* SDL: Falling sideways\n",
"* BSC: Falling from a chair\n",
"\n",
"The normal activities are:\n",
"* STD: Standing\n",
"* WAL: Walking\n",
"* JOG: Jogging\n",
"* JUM: Jumping\n",
"* STU: Going up stairs\n",
"* STN: Going down stairs\n",
"* SCH: Sitting down\n",
"* CSI: Getting into a car\n",
"* CSO: Getting out of a car\n",
"\n",
"The records were collected from 11 individuals.\n",
"\n",
"Each record covers a 6-second time window of accelerometer and gyroscope data, summarized in the following columns:\n",
"\n",
"* acc_max: maximum acceleration in the 4th second.\n",
"* acc_kurtosis: kurtosis of the acceleration over the 6 seconds.\n",
"* acc_skewness: skewness of the acceleration over the 6 seconds.\n",
"* gyro_max: maximum gyroscope reading in the 4th second.\n",
"* gyro_kurtosis: kurtosis of the gyroscope signal over the 6 seconds.\n",
"* gyro_skewness: skewness of the gyroscope signal over the 6 seconds.\n",
"* lin_max: maximum linear acceleration (excluding gravity) in the 4th second.\n",
"* post_lin_max: maximum linear acceleration in the 6th second.\n",
"* post_gyro_max: maximum gyroscope reading in the 6th second.\n",
"* fall: 1 if the record corresponds to a fall, 0 otherwise.\n",
"* label: activity code.\n",
"\n",
"The dataset contains 1784 records: 1017 normal activities and 767 falls."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.model_selection import StratifiedKFold\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.metrics import classification_report\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.ensemble import IsolationForest\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"from imblearn import FunctionSampler\n",
"\n",
"from xgboost import XGBClassifier\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"from sklearn import set_config\n",
"set_config(display=\"diagram\")\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.read_csv('../tp_final_no_anda_la_clase_outlier/Train.csv')\n",
"df2 = pd.read_csv('../tp_final_no_anda_la_clase_outlier/Test.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1428, 12)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(356, 12)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.shape"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df = pd.concat([df1, df2])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1784, 12)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Unnamed: 0 0\n",
"acc_max 0\n",
"gyro_max 0\n",
"acc_kurtosis 0\n",
"gyro_kurtosis 0\n",
"label 0\n",
"lin_max 0\n",
"acc_skewness 0\n",
"gyro_skewness 0\n",
"post_gyro_max 0\n",
"post_lin_max 0\n",
"fall 0\n",
"dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 acc_max gyro_max acc_kurtosis gyro_kurtosis \\\n",
"count 1784.000000 1784.000000 1784.000000 1784.000000 1784.000000 \n",
"mean 891.500000 21.768998 5.028728 10.031186 3.916387 \n",
"std 515.140757 5.479980 2.943876 11.836305 5.489329 \n",
"min 0.000000 9.787964 0.026257 -1.743347 -1.532044 \n",
"25% 445.750000 18.751488 3.104216 0.469997 0.186524 \n",
"50% 891.500000 22.924268 4.568088 8.423476 2.028413 \n",
"75% 1337.250000 25.865634 6.428771 15.717815 5.582912 \n",
"max 1783.000000 32.885551 17.288546 231.134385 34.163811 \n",
"\n",
" lin_max acc_skewness gyro_skewness post_gyro_max post_lin_max \\\n",
"count 1784.000000 1784.000000 1784.000000 1784.000000 1784.000000 \n",
"mean 7.976308 1.732918 1.629258 3.191397 5.228546 \n",
"std 4.258842 1.529711 0.999016 3.429678 5.004165 \n",
"min 0.043625 -14.066208 -0.460160 -4.984168 -5.382828 \n",
"25% 4.832765 0.458187 0.811557 0.286294 0.907965 \n",
"50% 8.282902 1.520431 1.542694 2.452813 3.727967 \n",
"75% 11.100896 2.912764 2.291739 5.226240 9.629489 \n",
"max 25.382307 6.782592 5.174101 16.204944 23.972115 \n",
"\n",
" fall \n",
"count 1784.000000 \n",
"mean 0.429933 \n",
"std 0.495205 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 0.000000 \n",
"75% 1.000000 \n",
"max 1.000000 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 1784 entries, 0 to 355\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Unnamed: 0 1784 non-null int64 \n",
" 1 acc_max 1784 non-null float64\n",
" 2 gyro_max 1784 non-null float64\n",
" 3 acc_kurtosis 1784 non-null float64\n",
" 4 gyro_kurtosis 1784 non-null float64\n",
" 5 label 1784 non-null object \n",
" 6 lin_max 1784 non-null float64\n",
" 7 acc_skewness 1784 non-null float64\n",
" 8 gyro_skewness 1784 non-null float64\n",
" 9 post_gyro_max 1784 non-null float64\n",
" 10 post_lin_max 1784 non-null float64\n",
" 11 fall 1784 non-null int64 \n",
"dtypes: float64(9), int64(2), object(1)\n",
"memory usage: 181.2+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 acc_max gyro_max acc_kurtosis gyro_kurtosis label \\\n",
"1044 879 22.960623 6.481883 4.701671 2.504065 CSI \n",
"\n",
" lin_max acc_skewness gyro_skewness post_gyro_max post_lin_max \\\n",
"1044 12.424865 1.209656 1.738483 4.721564 10.974288 \n",
"\n",
" fall \n",
"1044 0 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1784, 12)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Unnamed: 0'].value_counts().mean()  # every value in this column appears exactly once, so it is just an index\n",
"# it will be dropped"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"FOL 192\n",
"SDL 192\n",
"FKL 192\n",
"BSC 191\n",
"CSO 113\n",
"STD 113\n",
"SCH 113\n",
"STU 113\n",
"CSI 113\n",
"STN 113\n",
"JUM 113\n",
"WAL 113\n",
"JOG 113\n",
"Name: label, dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['label'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1017\n",
"1 767\n",
"Name: fall, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['fall'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" fall\n",
"label \n",
"BSC 1\n",
"CSI 0\n",
"CSO 0\n",
"FKL 1\n",
"FOL 1\n",
"JOG 0\n",
"JUM 0\n",
"SCH 0\n",
"SDL 1\n",
"STD 0\n",
"STN 0\n",
"STU 0\n",
"WAL 0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# here we see that the categories 'BSC', 'FKL', 'FOL', and 'SDL' map to the value 1 in the 'fall' column, so they represent falls,\n",
"# while the remaining categories map to 0, so they represent movements that are not falls\n",
"grouped = df.groupby('label').agg({'fall': 'mean'}) \n",
"grouped"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# here we confirm that the number of records in the non-fall categories matches the number of 0s in the target\n",
"#df.loc[df['label'].isin(grouped[grouped['fall'] == 0].index), 'label'].count() == (df['fall'] == 0).sum()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"corr_matrix = df.select_dtypes('number').corr()  # pairwise correlations between the numeric columns ('label' is excluded)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"#fig, ax = plt.subplots(figsize=(10, 6))\n",
"#sns.heatmap(corr_matrix, cmap=\"Blues\", annot=True)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fall 1.000000\n",
"post_lin_max 0.864964\n",
"post_gyro_max 0.765410\n",
"acc_skewness 0.713811\n",
"gyro_skewness 0.685179\n",
"acc_max 0.609653\n",
"lin_max 0.581044\n",
"gyro_kurtosis 0.550182\n",
"acc_kurtosis 0.547179\n",
"gyro_max 0.468947\n",
"Unnamed: 0 -0.857480\n",
"Name: fall, dtype: float64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corr_matrix['fall'].sort_values(ascending=False)  # sort the correlations with 'fall' from highest to lowest"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Unnamed: 0 -0.857480\n",
"gyro_max 0.468947\n",
"Name: fall, dtype: float64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corr_matrix['fall'][corr_matrix['fall'] < 0.5]  # here we see that 'gyro_max' correlates weakly with 'fall'"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Unnamed: 0', 'acc_max', 'gyro_max', 'acc_kurtosis', 'gyro_kurtosis',\n",
" 'label', 'lin_max', 'acc_skewness', 'gyro_skewness', 'post_gyro_max',\n",
" 'post_lin_max', 'fall'],\n",
" dtype='object')"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the columns 'Unnamed: 0', 'gyro_max', and 'label' are unnecessary ('Unnamed: 0' is just an index, 'gyro_max' correlates weakly with 'fall', and 'label' determines 'fall' directly), we use a preprocessing class that removes them from the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"class FeatureSelection(BaseEstimator, TransformerMixin):\n",
"    # Pipeline step that keeps only the columns listed in selected_features\n",
"\n",
"    def __init__(self, selected_features):\n",
"        self.selected_features = selected_features\n",
"\n",
"    def fit(self, X, y=None):\n",
"        # Nothing to learn: the column list is fixed at construction time\n",
"        return self\n",
"\n",
"    def transform(self, X, y=None):\n",
"        return X[self.selected_features]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"class OutlierRemover(BaseEstimator, TransformerMixin):\n",
"    # Drops the rows where any feature lies n_std or more standard deviations from its mean.\n",
"    # It returns (X, y) and changes the number of rows, so it cannot be used as a Pipeline step.\n",
"\n",
"    def __init__(self, n_std=3):\n",
"        self.n_std = n_std\n",
"\n",
"    def fit(self, X, y=None):\n",
"        self.mean_ = np.mean(X, axis=0)\n",
"        self.std_ = np.std(X, axis=0)\n",
"        return self\n",
"\n",
"    def transform(self, X, y):\n",
"        # Keep only the rows that contain no outliers\n",
"        limite_inferior = self.mean_ - self.n_std * self.std_\n",
"        limite_superior = self.mean_ + self.n_std * self.std_\n",
"        mask = np.all((X > limite_inferior) & (X < limite_superior), axis=1)\n",
"\n",
"        X_filtrado = X[mask]\n",
"        y = y[mask]\n",
"        return X_filtrado, y\n",
"\n",
"    def fit_transform(self, X, y=None, **fit_params):\n",
"        return self.fit(X, y).transform(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1427, 11)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(357, 11)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(1427,)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(357,)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Separate the independent variables from the target\n",
"X = df.drop(columns=['fall'])\n",
"y = df['fall']\n",
"\n",
"# Split the data into train and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100, stratify=y)\n",
"display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)\n",
"display(type(X_train), type(X_test), type(y_train), type(y_test))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 acc_max gyro_max acc_kurtosis gyro_kurtosis label \\\n",
"765 931 17.310921 5.78264 5.979438 -0.16566 CSO \n",
"\n",
" lin_max acc_skewness gyro_skewness post_gyro_max post_lin_max \n",
"765 4.717529 1.367272 0.811601 5.699724 4.569499 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.sample()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # set up cross-validation"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# default steps for the pipeline\n",
"pipeline = Pipeline([('FeatureSelection', FeatureSelection(['acc_max', 'acc_kurtosis', 'gyro_kurtosis',\n",
" 'lin_max', 'acc_skewness', 'gyro_skewness', 'post_gyro_max', 'post_lin_max'])), \n",
"# ('OutlierRemover', OutlierRemover()),\n",
" ('scaler', StandardScaler()), \n",
" ('model', LogisticRegression())], verbose = False) \n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# pipeline.steps[0][1].fit_transform(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# this list of dictionaries holds the options the grid search will try\n",
"# the pipeline is evaluated with 4 models and several hyperparameters each\n",
"param_grid = [ {'model': [KNeighborsClassifier()], \"model__n_neighbors\": [2, 3, 4, 5, 6, 7, 8], 'model__weights' : ['uniform', 'distance'], 'scaler' : [StandardScaler(), MinMaxScaler(), None]}, \n",
" {'model': [LogisticRegression()], 'model__C': [0.01, 0.1, 1, 10, 100, 1000], 'model__penalty': ['l2', None], 'scaler' : [StandardScaler(), MinMaxScaler(), None]} ,\n",
" {'model': [RandomForestClassifier()], 'model__criterion': ['gini', 'entropy'], 'scaler' : [StandardScaler(), MinMaxScaler(), None]},\n",
" {'model': [XGBClassifier(objective='binary:logistic', eval_metric='logloss')], 'model__learning_rate': [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2], 'scaler' : [StandardScaler(), MinMaxScaler(), None] }\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"grid = GridSearchCV(pipeline, param_grid, cv=folds)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True),\n",
" estimator=Pipeline(steps=[('FeatureSelection',\n",
" FeatureSelection(selected_features=['acc_max',\n",
" 'acc_kurtosis',\n",
" 'gyro_kurtosis',\n",
" 'lin_max',\n",
" 'acc_skewness',\n",
" 'gyro_skewness',\n",
" 'post_gyro_max',\n",
" 'post_lin_max'])),\n",
" ('scaler', StandardScaler()),\n",
" ('model', LogisticRegression())]),\n",
" param_grid=[{'model...\n",
" missing=nan,\n",
" monotone_constraints=None,\n",
" n_estimators=100, n_jobs=None,\n",
" num_parallel_tree=None,\n",
" random_state=None,\n",
" reg_alpha=None,\n",
" reg_lambda=None,\n",
" scale_pos_weight=None,\n",
" subsample=None,\n",
" tree_method=None,\n",
" validate_parameters=None,\n",
" verbosity=None)],\n",
" 'model__learning_rate': [0.2, 0.3, 0.4, 0.5, 0.6, 0.7,\n",
" 0.8, 0.9, 1, 1.1, 1.2],\n",
" 'scaler': [StandardScaler(), MinMaxScaler(), None]}])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid.fit(X_train, y_train)  # the displayed estimator shows the default steps"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('FeatureSelection',\n",
" FeatureSelection(selected_features=['acc_max', 'acc_kurtosis',\n",
" 'gyro_kurtosis', 'lin_max',\n",
" 'acc_skewness',\n",
" 'gyro_skewness',\n",
" 'post_gyro_max',\n",
" 'post_lin_max'])),\n",
" ('scaler', StandardScaler()),\n",
" ('model',\n",
" XGBClassifier(base_score=0.5, booster='gbtree',\n",
" colsample_bylevel=1, colsample_bynode=1,\n",
" colsample_bytree=1, eval_metric='logloss',\n",
" gamma=0, gpu_id=-1, importance_type='gain',\n",
" interaction_constraints='', learning_rate=0.5,\n",
" max_delta_step=0, max_depth=6,\n",
" min_child_weight=1, missing=nan,\n",
" monotone_constraints='()', n_estimators=100,\n",
" n_jobs=12, num_parallel_tree=1, random_state=0,\n",
" reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n",
" subsample=1, tree_method='exact',\n",
" validate_parameters=1, verbosity=None))])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid.best_estimator_  # the best model"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean cross-validated accuracy of the best model: 0.9824880382775121\n"
]
}
],
"source": [
"print(\"Mean cross-validated accuracy of the best model:\", grid.best_score_)  # best_score_ is the mean CV score, not a training-set score"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'model': XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,\n",
" colsample_bynode=None, colsample_bytree=None,\n",
" eval_metric='logloss', gamma=None, gpu_id=None,\n",
" importance_type='gain', interaction_constraints=None,\n",
" learning_rate=0.5, max_delta_step=None, max_depth=None,\n",
" min_child_weight=None, missing=nan, monotone_constraints=None,\n",
" n_estimators=100, n_jobs=None, num_parallel_tree=None,\n",
" random_state=None, reg_alpha=None, reg_lambda=None,\n",
" scale_pos_weight=None, subsample=None, tree_method=None,\n",
" validate_parameters=None, verbosity=None),\n",
" 'model__learning_rate': 0.5,\n",
" 'scaler': StandardScaler()}"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid.best_params_  # the best model and its hyperparameters"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test-set accuracy of the best model: 0.9803921568627451\n"
]
}
],
"source": [
"print(\"Test-set accuracy of the best model:\", accuracy_score(y_test, grid.best_estimator_.predict(X_test)))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"y_pred = grid.best_estimator_.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.98 0.99 0.98 204\n",
" 1 0.99 0.97 0.98 153\n",
"\n",
" accuracy 0.98 357\n",
" macro avg 0.98 0.98 0.98 357\n",
"weighted avg 0.98 0.98 0.98 357\n",
"\n"
]
}
],
"source": [
"# Per-class precision, recall and F1 on the test set\n",
"print(classification_report(y_test, y_pred))\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"import pickle"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"best_model = grid.best_estimator_\n",
"\n",
"\n",
"with open('mejor_modelo_tp4.pkl', 'wb') as f:\n",
" pickle.dump(best_model, f)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dhdsblend2021",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "052e7fd3051fb62256c874c1940dfbcd26c7f9302251177c1c2130ce8acd18fb"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}