{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "code", "execution_count": 334, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n=================================================\\nGC 5\\n\\nName : Ryan Trisnadi\\nBatch : HCK-17\\n\\nThis project examines which variables in the credit_card_default dataset have the largest effect on \"default_payment_next_month\". \\nThe CSV contains 24 columns of categorical and numerical data, each of which may influence how bank/insurance clients \\nrepay their debt. \\n\\n\\nQuestions:\\n-What risks do bank clients with a very high limit balance carry with respect to their debt?\\n-Does \"education\" have an effect on default_payment_next_month?\\n-Can any difference between \"pay\" and \"bill_amt\" be used to infer the ability of a client to repay debt? \\n-Which columns carry the highest risk?\\n\\nWe will use models such as Logistic Regression, KNN, and SVM to assess client profiles against their ability to repay debt \\nto the institution/bank. \\n\\n=================================================\\n'" ] }, "execution_count": 334, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''\n", "=================================================\n", "GC 5\n", "\n", "Name : Ryan Trisnadi\n", "Batch : HCK-17\n", "\n", "This project examines which variables in the credit_card_default dataset have the largest effect on \"default_payment_next_month\". \n", "The CSV contains 24 columns of categorical and numerical data, each of which may influence how bank/insurance clients \n", "repay their debt. \n", "\n", "\n", "Questions:\n", "-What risks do bank clients with a very high limit balance carry with respect to their debt?\n", "-Does \"education\" have an effect on default_payment_next_month?\n", "-Can any difference between \"pay\" and \"bill_amt\" be used to infer the ability of a client to repay debt? \n", "-Which columns carry the highest risk?\n", "\n", "We will use models such as Logistic Regression, KNN, and SVM to assess client profiles against their ability to repay debt \n", "to the institution/bank. \n", "\n", "=================================================\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Background" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are analysts working in the Accounting department of a bank. Our goal is to analyze data on clients who have and have not paid their bills, and to assess their ability to repay on time. Using their payment pattern in each billing interval, we can try to predict how likely they are to meet payments on the principal and the interest. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conceptual Problems\n", "### Answer the following questions:\n", "\n", " What is meant by a coefficient in logistic regression?\n", "\n", " What does the kernel parameter do in SVM? Explain one kernel that you understand!\n", "\n", " How do you choose the optimal K in KNN?\n", "\n", " What do the following metrics mean: Accuracy, Precision, Recall, F1 Score, and when is each appropriate to use?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Query" ] }, { "cell_type": "code", "execution_count": 335, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\nfrom google.colab import auth\\nfrom google.cloud import bigquery\\nauth.authenticate_user()\\nprint(\\'Authenticated\\')\\n\\nproject_id = \"elite-outpost-424308-h1\" # USE YOUR OWN GCP PROJECT ID\\nclient = bigquery.Client(project=project_id)\\n'" ] }, "execution_count": 335, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "from google.colab import auth\n", "from google.cloud import bigquery\n", "auth.authenticate_user()\n", "print('Authenticated')\n", "\n", "project_id = \"elite-outpost-424308-h1\" # USE YOUR OWN GCP PROJECT ID\n", "client = bigquery.Client(project=project_id)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import from BigQuery through Google Colab, using your own project_id. " ] }, { "cell_type": "code", "execution_count": 336, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"\\ndf = client.query('''\\nSELECT limit_balance, \\n CAST(sex as INT64) AS sex, \\n CAST(education_level as INT64) AS education_level, \\n CAST(marital_status as INT64) AS marital_status, \\n age, \\n pay_0, \\n pay_2, \\n pay_3, \\n pay_4, \\n CAST(pay_5 AS FLOAT64) AS pay_5, \\n CAST(pay_6 AS FLOAT64) AS pay_6, \\n bill_amt_1, \\n bill_amt_2, \\n bill_amt_3, \\n bill_amt_4, \\n bill_amt_5, \\n bill_amt_6, \\n pay_amt_1, \\n pay_amt_2, \\n pay_amt_3, \\n pay_amt_4, \\n pay_amt_5, \\n pay_amt_6, \\n CAST(default_payment_next_month as INT64) AS default_payment_next_month\\nFROM `bigquery-public-data.ml_datasets.credit_card_default`\\nLIMIT 33966\\n''').to_dataframe()\\n\"" ] }, "execution_count": 336, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "df = client.query('''\n", "SELECT limit_balance, \n", " CAST(sex as INT64) AS sex, \n", " CAST(education_level as 
INT64) AS education_level, \n", " CAST(marital_status as INT64) AS marital_status, \n", " age, \n", " pay_0, \n", " pay_2, \n", " pay_3, \n", " pay_4, \n", " CAST(pay_5 AS FLOAT64) AS pay_5, \n", " CAST(pay_6 AS FLOAT64) AS pay_6, \n", " bill_amt_1, \n", " bill_amt_2, \n", " bill_amt_3, \n", " bill_amt_4, \n", " bill_amt_5, \n", " bill_amt_6, \n", " pay_amt_1, \n", " pay_amt_2, \n", " pay_amt_3, \n", " pay_amt_4, \n", " pay_amt_5, \n", " pay_amt_6, \n", " CAST(default_payment_next_month as INT64) AS default_payment_next_month\n", "FROM `bigquery-public-data.ml_datasets.credit_card_default`\n", "LIMIT 33966\n", "''').to_dataframe()\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When importing through the query, we cast several SELECT columns so that their types are already correct. For example, sex and education_level are cast to integers. Casting up front makes it easier to treat each column as numerical or categorical data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hugging Face Link" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://huggingface.co/spaces/ryantrisnadi/Deployment/tree/main" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the link to the Hugging Face Space named \"Deployment\"." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Libraries" ] }, { "cell_type": "code", "execution_count": 337, "metadata": {}, "outputs": [], "source": [ "# Import libraries\n", "# DataFrame handling\n", "import pandas as pd\n", "# Numerical computing\n", "import numpy as np\n", "# Statistics\n", "from scipy import stats\n", "from scipy.stats import uniform, randint\n", "\n", "# Data visualization\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Preprocessing\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "# Machine learning models\n", "from sklearn.metrics.pairwise import manhattan_distances\n", "from sklearn.linear_model import LinearRegression, LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import SVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "import time\n", "\n", "# Model evaluation and hyperparameter tuning\n", "from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score\n", "from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, cross_val_score\n", "from sklearn.model_selection import KFold, StratifiedKFold\n", "\n", "# Outlier handling\n", "from feature_engine.outliers import Winsorizer\n", "# Correlation\n", "from scipy.stats import kendalltau, pearsonr, spearmanr\n", "\n", "# Saving models\n", "import pickle\n", "import joblib\n", "import json\n", "\n", "# Suppress warnings\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we import the modules that will be used to analyze the data from the provided CSV. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Loading and Cleaning" ] }, { "cell_type": "code", "execution_count": 338, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[HTML rendering of the DataFrame preview; the same rows and columns appear in the text/plain output that follows]
\n", "

2965 rows × 24 columns

\n", "
" ], "text/plain": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "0 80000.0 1 6 1 54.0 0.0 0.0 \n", "1 200000.0 1 4 1 49.0 0.0 0.0 \n", "2 20000.0 2 6 2 22.0 0.0 0.0 \n", "3 260000.0 2 4 2 33.0 0.0 0.0 \n", "4 150000.0 1 4 2 32.0 0.0 0.0 \n", "... ... ... ... ... ... ... ... \n", "2960 80000.0 2 3 2 28.0 -1.0 -1.0 \n", "2961 50000.0 2 3 1 51.0 -1.0 -1.0 \n", "2962 450000.0 2 2 1 38.0 -2.0 -2.0 \n", "2963 50000.0 2 2 1 44.0 -2.0 -2.0 \n", "2964 290000.0 2 2 1 39.0 1.0 -2.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_4 bill_amt_5 bill_amt_6 pay_amt_1 \\\n", "0 0.0 0.0 0.0 ... 29296.0 26210.0 17643.0 2545.0 \n", "1 0.0 0.0 0.0 ... 50146.0 50235.0 48984.0 1689.0 \n", "2 0.0 0.0 0.0 ... 1434.0 500.0 0.0 4641.0 \n", "3 0.0 0.0 0.0 ... 27821.0 30767.0 29890.0 5000.0 \n", "4 0.0 -1.0 0.0 ... 150464.0 143375.0 146411.0 4019.0 \n", "... ... ... ... ... ... ... ... ... \n", "2960 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 2800.0 \n", "2961 -1.0 -1.0 -2.0 ... 0.0 0.0 0.0 300.0 \n", "2962 -2.0 -2.0 -2.0 ... 390.0 390.0 390.0 390.0 \n", "2963 -2.0 -2.0 -2.0 ... 390.0 390.0 0.0 390.0 \n", "2964 -2.0 -2.0 -2.0 ... 3184.0 390.0 390.0 10000.0 \n", "\n", " pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \\\n", "0 2208.0 1336.0 2232.0 542.0 348.0 \n", "1 2164.0 2500.0 3480.0 2500.0 3000.0 \n", "2 1019.0 900.0 0.0 1500.0 0.0 \n", "3 5000.0 1137.0 5000.0 1085.0 5000.0 \n", "4 146896.0 157436.0 4600.0 4709.0 5600.0 \n", "... ... ... ... ... ... \n", "2960 0.0 0.0 0.0 0.0 0.0 \n", "2961 5880.0 0.0 0.0 0.0 0.0 \n", "2962 780.0 390.0 390.0 390.0 390.0 \n", "2963 390.0 390.0 390.0 0.0 780.0 \n", "2964 800.0 3184.0 390.0 390.0 6617.0 \n", "\n", " default_payment_next_month \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 0 \n", "4 0 \n", "... ... 
\n", "2960 0 \n", "2961 1 \n", "2962 1 \n", "2963 0 \n", "2964 0 \n", "\n", "[2965 rows x 24 columns]" ] }, "execution_count": 338, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Path to the CSV file; replace it with the path on your own machine\n", "file_path = '/Users/ryantrisnadi/Desktop/first_project1/p1-ftds017-hck-g5-ryantrisnadi/_P1G5_Set_1_Ryan_Trisnadi.csv'\n", "\n", "# Read the CSV file into a DataFrame\n", "data = pd.read_csv(file_path)\n", "\n", "# Display the DataFrame (pandas truncates the output to the first and last rows)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we load the CSV data for credit_card_default, extracted from ml_datasets in BigQuery.\n", "\n", "Legend:\n", "\n", "id = Anonymized ID of each client\n", "\n", "limit_balance = Amount of given credit in NT dollars (includes individual and family/supplementary credit)\n", "\n", "sex = Gender (1=male, 2=female)\n", "\n", "education_level = Education Level (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)\n", "\n", "marital_status = Marital status (1=married, 2=single, 3=others)\n", "\n", "age = Age in years\n", "\n", "pay_0 = Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight \n", "months, 9=payment delay for nine months and above)\n", "\n", "bill_amt_1 = Amount of bill statement in September, 2005 (NT dollar)\n", "\n", "default_payment_next_month = Default payment (1=yes, 0=no)\n" ] }, { "cell_type": "code", "execution_count": 339, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[HTML rendering of the first 10 rows of the DataFrame; the same rows and columns appear in the text/plain output that follows]
\n", "

10 rows × 24 columns

\n", "
" ], "text/plain": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "0 80000.0 1 6 1 54.0 0.0 0.0 \n", "1 200000.0 1 4 1 49.0 0.0 0.0 \n", "2 20000.0 2 6 2 22.0 0.0 0.0 \n", "3 260000.0 2 4 2 33.0 0.0 0.0 \n", "4 150000.0 1 4 2 32.0 0.0 0.0 \n", "5 300000.0 2 4 2 32.0 0.0 0.0 \n", "6 130000.0 1 1 1 45.0 0.0 0.0 \n", "7 200000.0 1 1 1 58.0 0.0 0.0 \n", "8 500000.0 1 1 1 39.0 0.0 0.0 \n", "9 230000.0 1 1 1 48.0 0.0 0.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_4 bill_amt_5 bill_amt_6 pay_amt_1 \\\n", "0 0.0 0.0 0.0 ... 29296.0 26210.0 17643.0 2545.0 \n", "1 0.0 0.0 0.0 ... 50146.0 50235.0 48984.0 1689.0 \n", "2 0.0 0.0 0.0 ... 1434.0 500.0 0.0 4641.0 \n", "3 0.0 0.0 0.0 ... 27821.0 30767.0 29890.0 5000.0 \n", "4 0.0 -1.0 0.0 ... 150464.0 143375.0 146411.0 4019.0 \n", "5 0.0 0.0 0.0 ... 65150.0 -450.0 700.0 15235.0 \n", "6 0.0 0.0 0.0 ... 62377.0 63832.0 65099.0 2886.0 \n", "7 0.0 0.0 0.0 ... 124647.0 126921.0 129167.0 7822.0 \n", "8 0.0 0.0 0.0 ... 174500.0 137406.0 204975.0 54209.0 \n", "9 0.0 0.0 0.0 ... 105508.0 108101.0 110094.0 7000.0 \n", "\n", " pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \\\n", "0 2208.0 1336.0 2232.0 542.0 348.0 \n", "1 2164.0 2500.0 3480.0 2500.0 3000.0 \n", "2 1019.0 900.0 0.0 1500.0 0.0 \n", "3 5000.0 1137.0 5000.0 1085.0 5000.0 \n", "4 146896.0 157436.0 4600.0 4709.0 5600.0 \n", "5 1491.0 1303.0 0.0 2000.0 1400.0 \n", "6 2908.0 2129.0 2354.0 2366.0 2291.0 \n", "7 4417.0 4446.0 4597.0 4677.0 4698.0 \n", "8 4607.0 4603.0 5224.0 207440.0 7509.0 \n", "9 6607.0 3773.0 4290.0 4164.0 2000.0 \n", "\n", " default_payment_next_month \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 0 \n", "4 0 \n", "5 0 \n", "6 0 \n", "7 0 \n", "8 0 \n", "9 0 \n", "\n", "[10 rows x 24 columns]" ] }, "execution_count": 339, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the first 10 rows of the data." 
] }, { "cell_type": "code", "execution_count": 340, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[HTML rendering of the last 10 rows of the DataFrame; the same rows and columns appear in the text/plain output that follows]
\n", "

10 rows × 24 columns

\n", "
" ], "text/plain": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "2955 360000.0 2 2 2 26.0 -1.0 -1.0 \n", "2956 100000.0 1 3 1 40.0 0.0 0.0 \n", "2957 30000.0 2 3 1 48.0 1.0 -1.0 \n", "2958 80000.0 2 3 1 39.0 -1.0 -1.0 \n", "2959 20000.0 1 3 2 26.0 -1.0 -1.0 \n", "2960 80000.0 2 3 2 28.0 -1.0 -1.0 \n", "2961 50000.0 2 3 1 51.0 -1.0 -1.0 \n", "2962 450000.0 2 2 1 38.0 -2.0 -2.0 \n", "2963 50000.0 2 2 1 44.0 -2.0 -2.0 \n", "2964 290000.0 2 2 1 39.0 1.0 -2.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_4 bill_amt_5 bill_amt_6 pay_amt_1 \\\n", "2955 -1.0 -1.0 -2.0 ... 0.0 0.0 0.0 463.0 \n", "2956 -1.0 -1.0 -2.0 ... 0.0 0.0 0.0 2000.0 \n", "2957 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 200.0 \n", "2958 -1.0 -1.0 -2.0 ... 0.0 0.0 5000.0 5000.0 \n", "2959 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 1560.0 \n", "2960 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 2800.0 \n", "2961 -1.0 -1.0 -2.0 ... 0.0 0.0 0.0 300.0 \n", "2962 -2.0 -2.0 -2.0 ... 390.0 390.0 390.0 390.0 \n", "2963 -2.0 -2.0 -2.0 ... 390.0 390.0 0.0 390.0 \n", "2964 -2.0 -2.0 -2.0 ... 3184.0 390.0 390.0 10000.0 \n", "\n", " pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \\\n", "2955 2500.0 0.0 0.0 0.0 0.0 \n", "2956 2377.0 40000.0 0.0 0.0 0.0 \n", "2957 0.0 0.0 0.0 0.0 0.0 \n", "2958 5000.0 0.0 5000.0 5000.0 470.0 \n", "2959 0.0 0.0 0.0 0.0 0.0 \n", "2960 0.0 0.0 0.0 0.0 0.0 \n", "2961 5880.0 0.0 0.0 0.0 0.0 \n", "2962 780.0 390.0 390.0 390.0 390.0 \n", "2963 390.0 390.0 390.0 0.0 780.0 \n", "2964 800.0 3184.0 390.0 390.0 6617.0 \n", "\n", " default_payment_next_month \n", "2955 0 \n", "2956 0 \n", "2957 0 \n", "2958 0 \n", "2959 0 \n", "2960 0 \n", "2961 1 \n", "2962 1 \n", "2963 0 \n", "2964 0 \n", "\n", "[10 rows x 24 columns]" ] }, "execution_count": 340, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.tail(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the last 10 rows of the data." 
] }, { "cell_type": "code", "execution_count": 341, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2965 entries, 0 to 2964\n", "Data columns (total 24 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 limit_balance 2965 non-null float64\n", " 1 sex 2965 non-null int64 \n", " 2 education_level 2965 non-null int64 \n", " 3 marital_status 2965 non-null int64 \n", " 4 age 2965 non-null float64\n", " 5 pay_0 2965 non-null float64\n", " 6 pay_2 2965 non-null float64\n", " 7 pay_3 2965 non-null float64\n", " 8 pay_4 2965 non-null float64\n", " 9 pay_5 2965 non-null float64\n", " 10 pay_6 2965 non-null float64\n", " 11 bill_amt_1 2965 non-null float64\n", " 12 bill_amt_2 2965 non-null float64\n", " 13 bill_amt_3 2965 non-null float64\n", " 14 bill_amt_4 2965 non-null float64\n", " 15 bill_amt_5 2965 non-null float64\n", " 16 bill_amt_6 2965 non-null float64\n", " 17 pay_amt_1 2965 non-null float64\n", " 18 pay_amt_2 2965 non-null float64\n", " 19 pay_amt_3 2965 non-null float64\n", " 20 pay_amt_4 2965 non-null float64\n", " 21 pay_amt_5 2965 non-null float64\n", " 22 pay_amt_6 2965 non-null float64\n", " 23 default_payment_next_month 2965 non-null int64 \n", "dtypes: float64(20), int64(4)\n", "memory usage: 556.1 KB\n", "None\n", " limit_balance sex education_level marital_status \\\n", "count 2965.000000 2965.000000 2965.000000 2965.000000 \n", "mean 163369.308600 1.607757 1.849578 1.559865 \n", "std 125030.415472 0.488333 0.778184 0.522317 \n", "min 10000.000000 1.000000 0.000000 0.000000 \n", "25% 50000.000000 1.000000 1.000000 1.000000 \n", "50% 140000.000000 2.000000 2.000000 2.000000 \n", "75% 230000.000000 2.000000 2.000000 2.000000 \n", "max 800000.000000 2.000000 6.000000 3.000000 \n", "\n", " age pay_0 pay_2 pay_3 pay_4 \\\n", "count 2965.000000 2965.000000 2965.000000 2965.000000 2965.000000 \n", "mean 35.193255 0.005059 -0.122428 -0.141653 -0.185160 \n", "std 
9.109439 1.114395 1.180784 1.183630 1.178322 \n", "min 21.000000 -2.000000 -2.000000 -2.000000 -2.000000 \n", "25% 28.000000 -1.000000 -1.000000 -1.000000 -1.000000 \n", "50% 34.000000 0.000000 0.000000 0.000000 0.000000 \n", "75% 41.000000 0.000000 0.000000 0.000000 0.000000 \n", "max 69.000000 8.000000 7.000000 7.000000 8.000000 \n", "\n", " pay_5 ... bill_amt_4 bill_amt_5 bill_amt_6 \\\n", "count 2965.000000 ... 2965.000000 2965.000000 2965.000000 \n", "mean -0.225295 ... 44089.683305 40956.080607 39773.072513 \n", "std 1.159003 ... 61907.454056 58271.904751 57303.488981 \n", "min -2.000000 ... -46627.000000 -46627.000000 -73895.000000 \n", "25% -1.000000 ... 2582.000000 1958.000000 1430.000000 \n", "50% 0.000000 ... 19894.000000 18814.000000 18508.000000 \n", "75% 0.000000 ... 58622.000000 53373.000000 52287.000000 \n", "max 7.000000 ... 488808.000000 441981.000000 436172.000000 \n", "\n", " pay_amt_1 pay_amt_2 pay_amt_3 pay_amt_4 \\\n", "count 2965.000000 2.965000e+03 2965.000000 2965.000000 \n", "mean 6348.902867 6.272494e+03 5150.497133 4561.376054 \n", "std 20885.735336 2.887967e+04 14287.079982 13281.499599 \n", "min 0.000000 0.000000e+00 0.000000 0.000000 \n", "25% 1013.000000 9.900000e+02 477.000000 313.000000 \n", "50% 2234.000000 2.175000e+03 1994.000000 1600.000000 \n", "75% 5087.000000 5.000000e+03 4500.000000 4000.000000 \n", "max 493358.000000 1.227082e+06 199209.000000 202076.000000 \n", "\n", " pay_amt_5 pay_amt_6 default_payment_next_month \n", "count 2965.000000 2965.000000 2965.000000 \n", "mean 4913.286678 5382.701518 0.214165 \n", "std 16734.340778 17275.953029 0.410311 \n", "min 0.000000 0.000000 0.000000 \n", "25% 323.000000 173.000000 0.000000 \n", "50% 1646.000000 1615.000000 0.000000 \n", "75% 4021.000000 4081.000000 0.000000 \n", "max 388071.000000 403500.000000 1.000000 \n", "\n", "[8 rows x 24 columns]\n" ] } ], "source": [ "# Check the structure and basic statistics of the data\n", "print(data.info())\n", "print(data.describe())" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "This checks the data type of every column in the CSV data and prints basic summary statistics. " ] }, { "cell_type": "code", "execution_count": 342, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['limit_balance', 'sex', 'education_level', 'marital_status', 'age',\n", " 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt_1',\n", " 'bill_amt_2', 'bill_amt_3', 'bill_amt_4', 'bill_amt_5', 'bill_amt_6',\n", " 'pay_amt_1', 'pay_amt_2', 'pay_amt_3', 'pay_amt_4', 'pay_amt_5',\n", " 'pay_amt_6', 'default_payment_next_month'],\n", " dtype='object')" ] }, "execution_count": 342, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List all the columns of the table." ] }, { "cell_type": "code", "execution_count": 343, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
limit_balancesexeducation_levelmarital_statusagepay_0pay_2pay_3pay_4pay_5...bill_amt_4bill_amt_5bill_amt_6pay_amt_1pay_amt_2pay_amt_3pay_amt_4pay_amt_5pay_amt_6default_payment_next_month
080000.016154.00.00.00.00.00.0...29296.026210.017643.02545.02208.01336.02232.0542.0348.01
1200000.014149.00.00.00.00.00.0...50146.050235.048984.01689.02164.02500.03480.02500.03000.00
220000.026222.00.00.00.00.00.0...1434.0500.00.04641.01019.0900.00.01500.00.01
3260000.024233.00.00.00.00.00.0...27821.030767.029890.05000.05000.01137.05000.01085.05000.00
4150000.014232.00.00.00.0-1.00.0...150464.0143375.0146411.04019.0146896.0157436.04600.04709.05600.00
..................................................................
296080000.023228.0-1.0-1.0-1.0-2.0-2.0...0.00.00.02800.00.00.00.00.00.00
296150000.023151.0-1.0-1.0-1.0-1.0-2.0...0.00.00.0300.05880.00.00.00.00.01
2962450000.022138.0-2.0-2.0-2.0-2.0-2.0...390.0390.0390.0390.0780.0390.0390.0390.0390.01
296350000.022144.0-2.0-2.0-2.0-2.0-2.0...390.0390.00.0390.0390.0390.0390.00.0780.00
2964290000.022139.01.0-2.0-2.0-2.0-2.0...3184.0390.0390.010000.0800.03184.0390.0390.06617.00
\n", "

2964 rows × 24 columns

\n", "
" ], "text/plain": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "0 80000.0 1 6 1 54.0 0.0 0.0 \n", "1 200000.0 1 4 1 49.0 0.0 0.0 \n", "2 20000.0 2 6 2 22.0 0.0 0.0 \n", "3 260000.0 2 4 2 33.0 0.0 0.0 \n", "4 150000.0 1 4 2 32.0 0.0 0.0 \n", "... ... ... ... ... ... ... ... \n", "2960 80000.0 2 3 2 28.0 -1.0 -1.0 \n", "2961 50000.0 2 3 1 51.0 -1.0 -1.0 \n", "2962 450000.0 2 2 1 38.0 -2.0 -2.0 \n", "2963 50000.0 2 2 1 44.0 -2.0 -2.0 \n", "2964 290000.0 2 2 1 39.0 1.0 -2.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_4 bill_amt_5 bill_amt_6 pay_amt_1 \\\n", "0 0.0 0.0 0.0 ... 29296.0 26210.0 17643.0 2545.0 \n", "1 0.0 0.0 0.0 ... 50146.0 50235.0 48984.0 1689.0 \n", "2 0.0 0.0 0.0 ... 1434.0 500.0 0.0 4641.0 \n", "3 0.0 0.0 0.0 ... 27821.0 30767.0 29890.0 5000.0 \n", "4 0.0 -1.0 0.0 ... 150464.0 143375.0 146411.0 4019.0 \n", "... ... ... ... ... ... ... ... ... \n", "2960 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 2800.0 \n", "2961 -1.0 -1.0 -2.0 ... 0.0 0.0 0.0 300.0 \n", "2962 -2.0 -2.0 -2.0 ... 390.0 390.0 390.0 390.0 \n", "2963 -2.0 -2.0 -2.0 ... 390.0 390.0 0.0 390.0 \n", "2964 -2.0 -2.0 -2.0 ... 3184.0 390.0 390.0 10000.0 \n", "\n", " pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \\\n", "0 2208.0 1336.0 2232.0 542.0 348.0 \n", "1 2164.0 2500.0 3480.0 2500.0 3000.0 \n", "2 1019.0 900.0 0.0 1500.0 0.0 \n", "3 5000.0 1137.0 5000.0 1085.0 5000.0 \n", "4 146896.0 157436.0 4600.0 4709.0 5600.0 \n", "... ... ... ... ... ... \n", "2960 0.0 0.0 0.0 0.0 0.0 \n", "2961 5880.0 0.0 0.0 0.0 0.0 \n", "2962 780.0 390.0 390.0 390.0 390.0 \n", "2963 390.0 390.0 390.0 0.0 780.0 \n", "2964 800.0 3184.0 390.0 390.0 6617.0 \n", "\n", " default_payment_next_month \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 0 \n", "4 0 \n", "... ... 
\n", "2960 0 \n", "2961 1 \n", "2962 1 \n", "2963 0 \n", "2964 0 \n", "\n", "[2964 rows x 24 columns]" ] }, "execution_count": 343, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note: drop_duplicates() returns a new DataFrame; the result is not assigned, so `data` itself is unchanged\n", "data.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 344, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "limit_balance 0\n", "sex 0\n", "education_level 0\n", "marital_status 0\n", "age 0\n", "pay_0 0\n", "pay_2 0\n", "pay_3 0\n", "pay_4 0\n", "pay_5 0\n", "pay_6 0\n", "bill_amt_1 0\n", "bill_amt_2 0\n", "bill_amt_3 0\n", "bill_amt_4 0\n", "bill_amt_5 0\n", "bill_amt_6 0\n", "pay_amt_1 0\n", "pay_amt_2 0\n", "pay_amt_3 0\n", "pay_amt_4 0\n", "pay_amt_5 0\n", "pay_amt_6 0\n", "default_payment_next_month 0\n", "dtype: int64\n" ] } ], "source": [ "# Check for missing values\n", "print(data.isnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We check whether any values are null or missing. Every column reports zero missing values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis" ] }, { "cell_type": "code", "execution_count": 345, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "1 200000.0 1 4 1 49.0 0.0 0.0 \n", "3 260000.0 2 4 2 33.0 0.0 0.0 \n", "4 150000.0 1 4 2 32.0 0.0 0.0 \n", "5 300000.0 2 4 2 32.0 0.0 0.0 \n", "6 130000.0 1 1 1 45.0 0.0 0.0 \n", "... ... ... ... ... ... ... ... \n", "2958 80000.0 2 3 1 39.0 -1.0 -1.0 \n", "2959 20000.0 1 3 2 26.0 -1.0 -1.0 \n", "2960 80000.0 2 3 2 28.0 -1.0 -1.0 \n", "2963 50000.0 2 2 1 44.0 -2.0 -2.0 \n", "2964 290000.0 2 2 1 39.0 1.0 -2.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_4 bill_amt_5 bill_amt_6 pay_amt_1 \\\n", "1 0.0 0.0 0.0 ... 50146.0 50235.0 48984.0 1689.0 \n", "3 0.0 0.0 0.0 ... 27821.0 30767.0 29890.0 5000.0 \n", "4 0.0 -1.0 0.0 ... 150464.0 143375.0 146411.0 4019.0 \n", "5 0.0 0.0 0.0 ... 65150.0 -450.0 700.0 15235.0 \n", "6 0.0 0.0 0.0 ... 62377.0 63832.0 65099.0 2886.0 \n", "... ... ... ... ... ... ... ... ... \n", "2958 -1.0 -1.0 -2.0 ... 0.0 0.0 5000.0 5000.0 \n", "2959 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 1560.0 \n", "2960 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 2800.0 \n", "2963 -2.0 -2.0 -2.0 ... 390.0 390.0 0.0 390.0 \n", "2964 -2.0 -2.0 -2.0 ... 3184.0 390.0 390.0 10000.0 \n", "\n", " pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \\\n", "1 2164.0 2500.0 3480.0 2500.0 3000.0 \n", "3 5000.0 1137.0 5000.0 1085.0 5000.0 \n", "4 146896.0 157436.0 4600.0 4709.0 5600.0 \n", "5 1491.0 1303.0 0.0 2000.0 1400.0 \n", "6 2908.0 2129.0 2354.0 2366.0 2291.0 \n", "... ... ... ... ... ... \n", "2958 5000.0 0.0 5000.0 5000.0 470.0 \n", "2959 0.0 0.0 0.0 0.0 0.0 \n", "2960 0.0 0.0 0.0 0.0 0.0 \n", "2963 390.0 390.0 390.0 0.0 780.0 \n", "2964 800.0 3184.0 390.0 390.0 6617.0 \n", "\n", " default_payment_next_month \n", "1 0 \n", "3 0 \n", "4 0 \n", "5 0 \n", "6 0 \n", "... ... 
\n", "2958 0 \n", "2959 0 \n", "2960 0 \n", "2963 0 \n", "2964 0 \n", "\n", "[2330 rows x 24 columns]" ] }, "execution_count": 345, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group_0 = data[data['default_payment_next_month'] == 0]\n", "group_0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we select the rows where \"default_payment_next_month\" is \"0\". These are the clients who are still current on their payments and have not defaulted. " ] }, { "cell_type": "code", "execution_count": 346, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "0 80000.0 1 6 1 54.0 0.0 0.0 \n", "2 20000.0 2 6 2 22.0 0.0 0.0 \n", "18 360000.0 1 1 1 46.0 0.0 0.0 \n", "52 20000.0 1 1 2 22.0 0.0 0.0 \n", "59 100000.0 1 1 2 30.0 0.0 0.0 \n", "... ... ... ... ... ... ... ... \n", "2942 430000.0 1 2 1 32.0 1.0 -1.0 \n", "2944 20000.0 2 2 1 38.0 1.0 -1.0 \n", "2952 10000.0 1 2 1 30.0 -1.0 -1.0 \n", "2961 50000.0 2 3 1 51.0 -1.0 -1.0 \n", "2962 450000.0 2 2 1 38.0 -2.0 -2.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_4 bill_amt_5 bill_amt_6 pay_amt_1 \\\n", "0 0.0 0.0 0.0 ... 29296.0 26210.0 17643.0 2545.0 \n", "2 0.0 0.0 0.0 ... 1434.0 500.0 0.0 4641.0 \n", "18 0.0 0.0 0.0 ... 13780.0 15077.0 14009.0 3005.0 \n", "52 0.0 0.0 0.0 ... 22231.0 22301.0 21687.0 1331.0 \n", "59 0.0 0.0 0.0 ... 97862.0 79099.0 79812.0 4511.0 \n", "... ... ... ... ... ... ... ... ... \n", "2942 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 2500.0 \n", "2944 -1.0 -2.0 -2.0 ... 0.0 0.0 0.0 2000.0 \n", "2952 -1.0 -1.0 -2.0 ... 0.0 0.0 0.0 0.0 \n", "2961 -1.0 -1.0 -2.0 ... 0.0 0.0 0.0 300.0 \n", "2962 -2.0 -2.0 -2.0 ... 390.0 390.0 390.0 390.0 \n", "\n", " pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \\\n", "0 2208.0 1336.0 2232.0 542.0 348.0 \n", "2 1019.0 900.0 0.0 1500.0 0.0 \n", "18 3024.0 4004.0 4008.0 3008.0 2024.0 \n", "52 1675.0 1327.0 800.0 783.0 777.0 \n", "59 3711.0 3685.0 2797.0 2897.0 3046.0 \n", "... ... ... ... ... ... \n", "2942 0.0 0.0 0.0 0.0 0.0 \n", "2944 0.0 0.0 0.0 0.0 0.0 \n", "2952 780.0 0.0 0.0 0.0 0.0 \n", "2961 5880.0 0.0 0.0 0.0 0.0 \n", "2962 780.0 390.0 390.0 390.0 390.0 \n", "\n", " default_payment_next_month \n", "0 1 \n", "2 1 \n", "18 1 \n", "52 1 \n", "59 1 \n", "... ... 
\n", "2942 1 \n", "2944 1 \n", "2952 1 \n", "2961 1 \n", "2962 1 \n", "\n", "[635 rows x 24 columns]" ] }, "execution_count": 346, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group_1 = data[data['default_payment_next_month'] == 1]\n", "group_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we select the rows where \"default_payment_next_month\" is \"1\". These are the clients who have defaulted and must be billed for their late payments. " ] }, { "cell_type": "code", "execution_count": 347, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2])" ] }, "execution_count": 347, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['sex'].unique()" ] }, { "cell_type": "code", "execution_count": 348, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3, 0])" ] }, "execution_count": 348, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['marital_status'].unique()" ] }, { "cell_type": "code", "execution_count": 349, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([6, 4, 1, 2, 3, 5, 0])" ] }, "execution_count": 349, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['education_level'].unique()" ] }, { "cell_type": "code", "execution_count": 350, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([54., 49., 22., 33., 32., 45., 58., 39., 48., 34., 47., 46., 30.,\n", " 35., 55., 42., 56., 31., 53., 40., 36., 51., 37., 44., 24., 38.,\n", " 26., 25., 23., 27., 28., 29., 41., 63., 50., 43., 66., 61., 52.,\n", " 62., 69., 21., 65., 57., 64., 67., 60., 59., 68.])" ] }, "execution_count": 350, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['age'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We check the unique values of several columns to see their cardinality and the range of values each one takes. 
" ] }, { "cell_type": "code", "execution_count": 351, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.countplot(x='default_payment_next_month', data=data)\n", "plt.title('Distribution of Default Payment Next Month')\n", "plt.xlabel('Default Payment Next Month')\n", "plt.ylabel('Count')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We count how many clients in the full dataset paid their debt (\"0\") and how many defaulted (\"1\"). " ] }, { "cell_type": "code", "execution_count": 352, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.histplot(data=data, x='age', hue='default_payment_next_month', bins=20, kde=True)\n", "plt.title('Distribution of Age by Default Payment Next Month')\n", "plt.xlabel('Age')\n", "plt.ylabel('Frequency')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also check the age distribution of clients who paid and clients who defaulted across the full dataset. The distribution is positively skewed: clients tend to be younger than 40 rather than older than 40. " ] }, { "cell_type": "code", "execution_count": 353, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.boxplot(x='default_payment_next_month', y='limit_balance', data=data)\n", "plt.title('Limit Balance by Default Payment Next Month')\n", "plt.xlabel('Default Payment Next Month')\n", "plt.ylabel('Limit Balance')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use a box plot to compare limit_balance between clients who paid and clients who defaulted. Clients who paid their bills have a higher limit_balance than those who defaulted. The outliers for clients who paid also spread over a wider range than for those who defaulted. This suggests that clients who have never defaulted are better able to take loans with a higher balance from the bank, and that they were most likely offered loans with a larger monthly quota. " ] }, { "cell_type": "code", "execution_count": 354, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.countplot(x='education_level', hue='default_payment_next_month', data=data)\n", "plt.title('Education Level by Default Payment Next Month')\n", "plt.xlabel('Education Level')\n", "plt.ylabel('Count')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the bank's clients are \"graduate school\" and \"university level\" graduates. This means that clients holding a bachelor's degree (\"S1\") or higher make up the bulk of the client base. The plot also suggests that clients with a higher education level are more likely to repay their debt. " ] }, { "cell_type": "code", "execution_count": 355, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.pairplot(data, vars=['limit_balance', 'age', 'marital_status', 'education_level', 'sex'], hue='default_payment_next_month')\n", "#plt.title('Pair Plot of Selected Features')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we check whether there is a linear relationship among variables such as 'age', 'marital_status', and 'education_level'. The data mixes discrete and continuous variables with very wide ranges, so it is hard to tell from this plot which pairs have a linear relationship." ] }, { "cell_type": "code", "execution_count": 356, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.scatterplot(x='limit_balance', y='bill_amt_1', hue='default_payment_next_month', data=data)\n", "plt.title('Limit Balance vs Bill Amount 1')\n", "plt.xlabel('Limit Balance')\n", "plt.ylabel('Bill Amount 1')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Between limit_balance and the first bill amount (bill_amt_1), there is a visible positive linear relationship between the amount a client owes for that month and the total limit balance on their account. Most clients have a limit balance below 500000, and their bill amount rises roughly in proportion as limit_balance increases. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split Train Test" ] }, { "cell_type": "code", "execution_count": 357, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Define features and target variable\n", "X = data.drop('default_payment_next_month', axis=1)\n", "y = data['default_payment_next_month'] # target to predict\n", "\n", "# Split data into training and testing sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of the train-test split is to evaluate the \"performance\" of the models we use. We want to see how well a freshly trained model responds to new data. With the test set we can detect over/underfitting and tune parameters on the training set. \n", "\n", "We use test_size = 0.2, which means 80% of the data is used for model training and 20% for checking \"performance\"." ] }, { "cell_type": "code", "execution_count": 358, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "21 290000.0 1 1 1 56.0 0.0 0.0 \n", "1388 260000.0 1 2 2 28.0 0.0 0.0 \n", "567 60000.0 1 2 2 53.0 0.0 0.0 \n", "1594 150000.0 2 2 2 23.0 -2.0 -1.0 \n", "1017 130000.0 1 2 2 28.0 2.0 2.0 \n", "... ... ... ... ... ... ... ... \n", "2014 50000.0 2 1 2 27.0 3.0 2.0 \n", "2157 90000.0 2 3 1 68.0 -2.0 -2.0 \n", "1931 150000.0 2 3 1 48.0 2.0 2.0 \n", "1504 160000.0 1 3 2 38.0 1.0 -2.0 \n", "1712 10000.0 2 1 2 27.0 0.0 0.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_3 bill_amt_4 bill_amt_5 \\\n", "21 0.0 0.0 0.0 ... 415700.0 232732.0 220460.0 \n", "1388 2.0 0.0 0.0 ... 163036.0 159348.0 160198.0 \n", "567 0.0 0.0 0.0 ... 57432.0 27126.0 27579.0 \n", "1594 -1.0 0.0 0.0 ... 151996.0 152753.0 153844.0 \n", "1017 0.0 0.0 0.0 ... 70184.0 8518.0 11296.0 \n", "... ... ... ... ... ... ... ... \n", "2014 2.0 7.0 7.0 ... 300.0 300.0 300.0 \n", "2157 -2.0 -2.0 -1.0 ... 1000.0 1000.0 1052.0 \n", "1931 2.0 2.0 2.0 ... 62650.0 59255.0 45983.0 \n", "1504 -2.0 -1.0 0.0 ... 0.0 700.0 700.0 \n", "1712 0.0 2.0 2.0 ... 10255.0 9389.0 8345.0 \n", "\n", " bill_amt_6 pay_amt_1 pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 \\\n", "21 224780.0 10013.0 8501.0 15018.0 10001.0 7997.0 \n", "1388 104389.0 107000.0 0.0 5000.0 5000.0 6000.0 \n", "567 29085.0 3000.0 4000.0 2000.0 2000.0 2000.0 \n", "1594 151252.0 10096.0 156292.0 4700.0 5019.0 5300.0 \n", "1017 6514.0 0.0 3000.0 2000.0 3000.0 2000.0 \n", "... ... ... ... ... ... ... \n", "2014 300.0 0.0 0.0 0.0 0.0 0.0 \n", "2157 69237.0 0.0 1000.0 1000.0 1052.0 71062.0 \n", "1931 52986.0 5950.0 0.0 10000.0 0.0 20000.0 \n", "1504 0.0 0.0 0.0 700.0 0.0 0.0 \n", "1712 8572.0 1400.0 2500.0 500.0 0.0 500.0 \n", "\n", " pay_amt_6 \n", "21 7624.0 \n", "1388 60000.0 \n", "567 2000.0 \n", "1594 5002.0 \n", "1017 0.0 \n", "... ... 
\n", "2014 0.0 \n", "2157 3000.0 \n", "1931 0.0 \n", "1504 0.0 \n", "1712 2000.0 \n", "\n", "[2372 rows x 23 columns]" ] }, "execution_count": 358, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is X_train, the feature set the machine learning models will be trained on. " ] }, { "cell_type": "code", "execution_count": 359, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21 0\n", "1388 0\n", "567 0\n", "1594 0\n", "1017 1\n", " ..\n", "2014 1\n", "2157 0\n", "1931 1\n", "1504 0\n", "1712 0\n", "Name: default_payment_next_month, Length: 2372, dtype: int64" ] }, "execution_count": 359, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the \"default_payment_next_month\" column, the target the machine learning models will be trained on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Missing Values" ] }, { "cell_type": "code", "execution_count": 360, "metadata": {}, "outputs": [], "source": [ "X_train.dropna(inplace=True)" ] }, { "cell_type": "code", "execution_count": 361, "metadata": {}, "outputs": [], "source": [ "X_test.dropna(inplace=True)" ] }, { "cell_type": "code", "execution_count": 362, "metadata": {}, "outputs": [], "source": [ "y_train.dropna(inplace=True)" ] }, { "cell_type": "code", "execution_count": 363, "metadata": {}, "outputs": [], "source": [ "y_test.dropna(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We saw earlier that the full dataset has no missing values; these calls are only a second check that drops any missing values that might remain. Because nothing is missing, no rows are removed and X and y stay aligned. 
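
Dropping rows from X and y separately only keeps them aligned when nothing is actually missing. A minimal sketch of dropping incomplete rows from features and target together follows. The toy data and the names `features`, `target`, and `mask` are hypothetical and used only for illustration; they are not part of this notebook.

```python
import pandas as pd

# Hypothetical toy data (assumption: not taken from the credit card dataset)
features = pd.DataFrame({'limit_balance': [80000.0, None, 20000.0],
                         'age': [54.0, 49.0, 22.0]})
target = pd.Series([1, 0, 1], name='default_payment_next_month')

# Keep only rows where every feature and the target are present
mask = features.notna().all(axis=1) & target.notna()
features_clean = features[mask]
target_clean = target[mask]

# Features and target share the same index after filtering
print(features_clean.index.equals(target_clean.index))
```

Filtering both objects with one mask means a row removed from the features is removed from the target as well.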
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Outliers" ] }, { "cell_type": "code", "execution_count": 364, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Categorical Columns: []\n", "Numerical Columns: ['limit_balance', 'sex', 'education_level', 'marital_status', 'age', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt_1', 'bill_amt_2', 'bill_amt_3', 'bill_amt_4', 'bill_amt_5', 'bill_amt_6', 'pay_amt_1', 'pay_amt_2', 'pay_amt_3', 'pay_amt_4', 'pay_amt_5', 'pay_amt_6']\n" ] } ], "source": [ "# Selecting numerical and categorical columns\n", "num_columns = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()\n", "cat_columns = X_train.select_dtypes(include=['object', 'category']).columns.tolist()\n", "\n", "print('Categorical Columns: ', cat_columns)\n", "print('Numerical Columns: ', num_columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Untuk handle outliers, kita mau bagi dataset menjadi dua tipe kolom numerikal dan kategorikal. Ini untuk memastikan bisa diolah dengan training model dengan technique feature selection yang benar. Data yang berisi dengan integer atau float akan dimasukan ke kolom list num_columns, dan data dengan object atau category sebagai tipe akan dimasukan sebagai cat_columns. 
" ] }, { "cell_type": "code", "execution_count": 365, "metadata": {}, "outputs": [], "source": [ "# Making data and columns for normal distribution\n", "data_normal = []\n", "column_normal = []\n", "\n", "# Making data and columns for skewed distribution\n", "data_skewed = []\n", "column_skewed = []\n", "\n", "# For loop in every numerical column to filter the data distribution into either normally distributed or skewed columns\n", "for num in num_columns:\n", " skewness = X_train[num].skew()\n", "\n", " # If the data is normally distributed\n", " if skewness <= 0.5 and skewness >= -0.5:\n", " column_normal.append(num)\n", " data_normal.append([num, skewness])\n", "\n", " # If the data has low negative skewness\n", " elif skewness < -1:\n", " column_skewed.append(num)\n", " data_skewed.append([num, skewness, 'high'])\n", "\n", " # If the data has low positive skewness\n", " elif skewness > 1:\n", " column_skewed.append(num)\n", " data_skewed.append([num, skewness, 'high'])\n", "\n", " # If the data has moderate negative skewness\n", " elif skewness <= -0.5 and skewness > -1:\n", " column_skewed.append(num)\n", " data_skewed.append([num, skewness, 'low'])\n", "\n", " # If the data has moderate positive skewness\n", " elif skewness >= 0.5 and skewness < 1:\n", " column_skewed.append(num)\n", " data_skewed.append([num, skewness, 'low'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita mau perbedakan variable yang memiliki skew normal dan yang tidak normal." ] }, { "cell_type": "code", "execution_count": 366, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " normal_distribution skewness\n", "0 sex -0.451482\n", "1 marital_status -0.041328" ] }, "execution_count": 366, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Showing normally distributed columns\n", "pd.DataFrame(data=data_normal, columns=['normal_distribution', 'skewness'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kelihatan cuman sex dan marital_status memiliki skew yang normal." ] }, { "cell_type": "code", "execution_count": 367, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " skewed_distribution skewness rate\n", "0 limit_balance 1.013432 high\n", "1 education_level 0.913215 low\n", "2 age 0.757115 low\n", "3 pay_0 0.910223 low\n", "4 pay_2 0.852870 low\n", "5 pay_3 0.970325 low\n", "6 pay_4 1.153104 high\n", "7 pay_5 1.079227 high\n", "8 pay_6 1.011001 high\n", "9 bill_amt_1 2.421285 high\n", "10 bill_amt_2 2.438696 high\n", "11 bill_amt_3 2.625471 high\n", "12 bill_amt_4 2.508942 high\n", "13 bill_amt_5 2.464289 high\n", "14 bill_amt_6 2.503029 high\n", "15 pay_amt_1 12.116671 high\n", "16 pay_amt_2 28.106058 high\n", "17 pay_amt_3 8.051309 high\n", "18 pay_amt_4 8.888920 high\n", "19 pay_amt_5 12.525019 high\n", "20 pay_amt_6 10.277112 high" ] }, "execution_count": 367, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Showing skewed columns\n", "pd.DataFrame(data=data_skewed, columns=['skewed_distribution', 'skewness', 'rate'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Banyak variable di dataset yang memiliki skew yang tinggi daripada yang rendah. Ini artinya dataset memiliki skew yang cenderung positif. " ] }, { "cell_type": "code", "execution_count": 368, "metadata": {}, "outputs": [], "source": [ "# Capping Method for Normal Distribution\n", "winsorizer_normal = Winsorizer(capping_method='gaussian',\n", " tail='both',\n", " fold=3,\n", " variables=column_normal,\n", " missing_values='ignore')\n", "\n", "# Fit & Transforming X_train\n", "X_train_capped = winsorizer_normal.fit_transform(X_train)\n", "\n", "# Transforming X_test\n", "X_test_capped = winsorizer_normal.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita mau coba \"cap\" outlier yang memiliki skew normal. Ini jadi dataset tidak akan terlalu sensitif terhadap outlier yang akan diubah. 
" ] }, { "cell_type": "code", "execution_count": 369, "metadata": {}, "outputs": [], "source": [ "# Capping Method for Skewed Distribution\n", "winsorizer_skewed = Winsorizer(capping_method='iqr',\n", " tail='both',\n", " fold=3,\n", " variables=column_skewed)\n", "\n", "# Fit & Transforming X_train\n", "X_train_capped = winsorizer_skewed.fit_transform(X_train_capped)\n", "\n", "# Transforming X_test\n", "X_test_capped = winsorizer_skewed.transform(X_test_capped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita mau coba \"cap\" outlier yang memiliki skew tidak normal. Ini jadi dataset tidak akan terlalu sensitif terhadap outlier yang akan diubah. " ] }, { "cell_type": "code", "execution_count": 370, "metadata": {}, "outputs": [], "source": [ "# Plot Distribution Comparison\n", "def outlier_handling_plot_comparison(df_before, df_after, variable):\n", "\n", " # Figure Size, and Super Title based on variable\n", " fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n", " fig.suptitle(f'{variable} - Distribution Before and After Outlier Handling')\n", "\n", " # Plot Histogram Before\n", " sns.histplot(df_before[variable], bins=30, ax=axes[0, 0], color='orange')\n", " axes[0, 0].set_title('Histogram Before')\n", "\n", " # Plot Boxplot Before\n", " sns.boxplot(y=df_before[variable], ax=axes[1, 0])\n", " axes[1, 0].set_title('Boxplot Before')\n", "\n", " # Plot Histogram After\n", " sns.histplot(df_after[variable], bins=30, ax=axes[0, 1], color='orange')\n", " axes[0, 1].set_title('Histogram After')\n", "\n", " # Plot Boxplot After\n", " sns.boxplot(y=df_after[variable], ax=axes[1, 1])\n", " axes[1, 1].set_title('Boxplot After')\n", "\n", " plt.tight_layout(rect=[0, 0.03, 1, 0.95])\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ini kita mau coba illustrasi variable apa saja yang telah diubah dengan beberapa \"capping\" metode. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering" ] }, { "cell_type": "code", "execution_count": 371, "metadata": {}, "outputs": [], "source": [ "\n", "list_num_col = ['limit_balance', 'age', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', \n", " 'bill_amt_1', 'bill_amt_2', 'bill_amt_3', 'bill_amt_4', 'bill_amt_5', 'bill_amt_6', \n", " 'pay_amt_1', 'pay_amt_2', 'pay_amt_3', 'pay_amt_4', 'pay_amt_5', 'pay_amt_6']\n", "list_cat_col = ['sex', 'education_level', 'marital_status']\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Untuk di feature engineering, kita mau define beberapa variable category secara manual karena beberapa variable kategori sudah di encode. Untuk mempermudah machine learning nanti, kita akan coba perbedaan variable numerikal dan kategorikal. " ] }, { "cell_type": "code", "execution_count": 372, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "21 290000.0 1 1 1.0 56.0 0.0 0.0 \n", "1388 260000.0 1 2 2.0 28.0 0.0 0.0 \n", "567 60000.0 1 2 2.0 53.0 0.0 0.0 \n", "1594 150000.0 2 2 2.0 23.0 -2.0 -1.0 \n", "1017 130000.0 1 2 2.0 28.0 2.0 2.0 \n", "... ... ... ... ... ... ... ... \n", "2014 50000.0 2 1 2.0 27.0 3.0 2.0 \n", "2157 90000.0 2 3 1.0 68.0 -2.0 -2.0 \n", "1931 150000.0 2 3 1.0 48.0 2.0 2.0 \n", "1504 160000.0 1 3 2.0 38.0 1.0 -2.0 \n", "1712 10000.0 2 1 2.0 27.0 0.0 0.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_3 bill_amt_4 bill_amt_5 \\\n", "21 0.0 0.0 0.0 ... 240879.75 223294.75 207163.0 \n", "1388 2.0 0.0 0.0 ... 163036.00 159348.00 160198.0 \n", "567 0.0 0.0 0.0 ... 57432.00 27126.00 27579.0 \n", "1594 -1.0 0.0 0.0 ... 151996.00 152753.00 153844.0 \n", "1017 0.0 0.0 0.0 ... 70184.00 8518.00 11296.0 \n", "... ... ... ... ... ... ... ... \n", "2014 2.0 3.0 3.0 ... 300.00 300.00 300.0 \n", "2157 -2.0 -2.0 -1.0 ... 1000.00 1000.00 1052.0 \n", "1931 2.0 2.0 2.0 ... 62650.00 59255.00 45983.0 \n", "1504 -2.0 -1.0 0.0 ... 0.00 700.00 700.0 \n", "1712 0.0 2.0 2.0 ... 10255.00 9389.00 8345.0 \n", "\n", " bill_amt_6 pay_amt_1 pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 \\\n", "21 204697.25 10013.00 8501.00 15018.0 10001.0 7997.00 \n", "1388 104389.00 17120.75 0.00 5000.0 5000.0 6000.00 \n", "567 29085.00 3000.00 4000.00 2000.0 2000.0 2000.00 \n", "1594 151252.00 10096.00 17215.25 4700.0 5019.0 5300.00 \n", "1017 6514.00 0.00 3000.00 2000.0 3000.0 2000.00 \n", "... ... ... ... ... ... ... \n", "2014 300.00 0.00 0.00 0.0 0.0 0.00 \n", "2157 69237.00 0.00 1000.00 1000.0 1052.0 14871.25 \n", "1931 52986.00 5950.00 0.00 10000.0 0.0 14871.25 \n", "1504 0.00 0.00 0.00 700.0 0.0 0.00 \n", "1712 8572.00 1400.00 2500.00 500.0 0.0 500.00 \n", "\n", " pay_amt_6 \n", "21 7624.0 \n", "1388 15427.0 \n", "567 2000.0 \n", "1594 5002.0 \n", "1017 0.0 \n", "... ... 
\n", "2014 0.0 \n", "2157 3000.0 \n", "1931 0.0 \n", "1504 0.0 \n", "1712 2000.0 \n", "\n", "[2372 rows x 23 columns]" ] }, "execution_count": 372, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train_capped" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at the X_train data after its outliers have been capped. " ] }, { "cell_type": "code", "execution_count": 373, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " limit_balance age pay_0 pay_2 pay_3 pay_4 pay_5 pay_6 \\\n", "21 290000.0 56.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1388 260000.0 28.0 0.0 0.0 2.0 0.0 0.0 0.0 \n", "567 60000.0 53.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1594 150000.0 23.0 -2.0 -1.0 -1.0 0.0 0.0 0.0 \n", "1017 130000.0 28.0 2.0 2.0 0.0 0.0 0.0 0.0 \n", "\n", " bill_amt_1 bill_amt_2 bill_amt_3 bill_amt_4 bill_amt_5 bill_amt_6 \\\n", "21 222000.0 226917.0 240879.75 223294.75 207163.0 204697.25 \n", "1388 149814.0 184419.0 163036.00 159348.00 160198.0 104389.00 \n", "567 56765.0 57849.0 57432.00 27126.00 27579.0 29085.00 \n", "1594 27414.0 10053.0 151996.00 152753.00 153844.0 151252.00 \n", "1017 70952.0 69264.0 70184.00 8518.00 11296.0 6514.00 \n", "\n", " pay_amt_1 pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \n", "21 10013.00 8501.00 15018.0 10001.0 7997.0 7624.0 \n", "1388 17120.75 0.00 5000.0 5000.0 6000.0 15427.0 \n", "567 3000.00 4000.00 2000.0 2000.0 2000.0 2000.0 \n", "1594 10096.00 17215.25 4700.0 5019.0 5300.0 5002.0 \n", "1017 0.00 3000.00 2000.0 3000.0 2000.0 0.0 " ] }, "execution_count": 373, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Splitting the train and test features into categorical and numerical columns\n", "X_train_num = X_train_capped[list_num_col]\n", "X_train_cat = X_train_capped[list_cat_col]\n", "\n", "X_test_num = X_test_capped[list_num_col]\n", "X_test_cat = X_test_capped[list_cat_col]\n", "\n", "X_train_num.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita mau assign numerikal dan kategorikal kolom kepada variabble baru untuk diuji korelasi. " ] }, { "cell_type": "code", "execution_count": 374, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Column Name Correlation Coefficient P-value Correlation\n", "0 sex -0.026333 0.199759 Not Significant\n", "1 education_level 0.063256 0.001185 Significant\n", "2 marital_status -0.032835 0.107643 Not Significant" ] }, "execution_count": 374, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Finding the correlation between categorical columns and Y Train using Kendall Tau's correlation\n", "p_values = []\n", "interpretation = []\n", "cols = []\n", "corr = []\n", "selected_cat_cols = []\n", "\n", "for col in X_train_cat.columns:\n", " corr_coef, p_value = kendalltau(X_train_cat[col], y_train)\n", "\n", " p_values.append(p_value)\n", " cols.append(col)\n", " corr.append(corr_coef)\n", "\n", " if p_value < 0.05:\n", " interpretation.append('Significant')\n", " selected_cat_cols.append(col)\n", " else :\n", " interpretation.append('Not Significant')\n", "\n", "pd.DataFrame({'Column Name':cols,\n", " 'Correlation Coefficient' : corr,\n", " 'P-value':p_values,\n", " 'Correlation': interpretation })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gunakan KendallTau test terhadap kategorikal kolom untuk melihat p-value yang signifikan." ] }, { "cell_type": "code", "execution_count": 375, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Column Name Correlation Coefficient P-value Correlation\n", "0 limit_balance -0.170428 6.420760e-17 Significant\n", "1 age 0.029120 1.562476e-01 Not Significant\n", "2 pay_0 0.358716 5.899447e-73 Significant\n", "3 pay_2 0.237626 8.379995e-32 Significant\n", "4 pay_3 0.222975 4.151099e-28 Significant\n", "5 pay_4 0.253177 5.201433e-36 Significant\n", "6 pay_5 0.251801 1.260071e-35 Significant\n", "7 pay_6 0.230625 5.259622e-30 Significant\n", "8 bill_amt_1 -0.008214 6.892859e-01 Not Significant\n", "9 bill_amt_2 0.003705 8.568704e-01 Not Significant\n", "10 bill_amt_3 0.005352 7.944745e-01 Not Significant\n", "11 bill_amt_4 0.005819 7.769890e-01 Not Significant\n", "12 bill_amt_5 0.011319 5.816419e-01 Not Significant\n", "13 bill_amt_6 0.020326 3.223980e-01 Not Significant\n", "14 pay_amt_1 -0.144292 1.655782e-12 Significant\n", "15 pay_amt_2 -0.147831 4.627461e-13 Significant\n", "16 pay_amt_3 -0.117931 8.375614e-09 Significant\n", "17 pay_amt_4 -0.128978 2.888231e-10 Significant\n", "18 pay_amt_5 -0.095580 3.112395e-06 Significant\n", "19 pay_amt_6 -0.137028 2.058593e-11 Significant" ] }, "execution_count": 375, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Finding the correlation between numerical columns and Y Train using pearsonr and spearmanr correlation\n", "p_values = []\n", "interpretation = []\n", "cols = []\n", "corr = []\n", "selected_num_cols = []\n", "\n", "for col in X_train_num.columns:\n", " if abs(X_train_num[col].skew()) < 0.5:\n", " #For Normally Distributed Columns\n", " corr_coef, p_value = pearsonr(X_train_num[col], y_train)\n", "\n", " p_values.append(p_value)\n", " cols.append(col)\n", " corr.append(corr_coef)\n", "\n", " if p_value < 0.05:\n", " interpretation.append('Significant')\n", " selected_num_cols.append(col)\n", " else :\n", " interpretation.append('Not Significant')\n", " else:\n", " #For Skewed Columns\n", " corr_coef, p_value = spearmanr(X_train_num[col], y_train)\n", "\n", " 
p_values.append(p_value)\n", " cols.append(col)\n", " corr.append(corr_coef)\n", "\n", " if p_value < 0.05:\n", " interpretation.append('Significant')\n", " selected_num_cols.append(col)\n", " else :\n", " interpretation.append('Not Significant')\n", "\n", "pd.DataFrame({'Column Name':cols,\n", " 'Correlation Coefficient' : corr,\n", " 'P-value':p_values,\n", " 'Correlation': interpretation })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gunakan Pearson dan Spearman test terhadap numerikal kolom untuk melihat p-value yang signifikan." ] }, { "cell_type": "code", "execution_count": 376, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['education_level']\n", "['limit_balance', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'pay_amt_1', 'pay_amt_2', 'pay_amt_3', 'pay_amt_4', 'pay_amt_5', 'pay_amt_6']\n" ] } ], "source": [ "# Show selected columns based on the correlation test\n", "print(selected_cat_cols)\n", "print(selected_num_cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita akan ambil data yang signifikan dari kategorikal dan numerikal kolom untuk di scaling nanti. Ini akan jadi \"base\" data yang akan digunakan Machine Learning untuk memastikan \"fitting\" terhadap model yang akan dibuat. " ] }, { "cell_type": "code", "execution_count": 377, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " limit_balance pay_0 pay_2 pay_3 pay_4 pay_5 pay_6 pay_amt_1 \\\n", "21 290000.0 0.0 0.0 0.0 0.0 0.0 0.0 10013.00 \n", "1388 260000.0 0.0 0.0 2.0 0.0 0.0 0.0 17120.75 \n", "567 60000.0 0.0 0.0 0.0 0.0 0.0 0.0 3000.00 \n", "1594 150000.0 -2.0 -1.0 -1.0 0.0 0.0 0.0 10096.00 \n", "1017 130000.0 2.0 2.0 0.0 0.0 0.0 0.0 0.00 \n", "\n", " pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \n", "21 8501.00 15018.0 10001.0 7997.0 7624.0 \n", "1388 0.00 5000.0 5000.0 6000.0 15427.0 \n", "567 4000.00 2000.0 2000.0 2000.0 2000.0 \n", "1594 17215.25 4700.0 5019.0 5300.0 5002.0 \n", "1017 3000.00 2000.0 3000.0 2000.0 0.0 " ] }, "execution_count": 377, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Updating Numerical and Categorical Columns\n", "X_train_cat = X_train_cat[selected_cat_cols]\n", "X_train_num = X_train_num[selected_num_cols]\n", "\n", "X_test_cat = X_test_cat[selected_cat_cols]\n", "X_test_num = X_test_num[selected_num_cols]\n", "\n", "#Show first five data from the updated X_train\n", "X_train_num.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Letakan kategorikal dan numerikal kolom yang terpilih tadi ke variable baru untuk siap di-training. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scaling" ] }, { "cell_type": "code", "execution_count": 378, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.36842105, 0.4 , 0.4 , ..., 0.66426448, 0.53774901,\n", " 0.49419848],\n", " [0.32894737, 0.4 , 0.4 , ..., 0.33209903, 0.40346306,\n", " 1. ],\n", " [0.06578947, 0.4 , 0.4 , ..., 0.13283961, 0.13448769,\n", " 0.12964283],\n", " ...,\n", " [0.18421053, 0.8 , 0.8 , ..., 0. , 1. ,\n", " 0. ],\n", " [0.19736842, 0.6 , 0. , ..., 0. , 0. ,\n", " 0. ],\n", " [0. , 0.4 , 0.4 , ..., 0. 
, 0.03362192,\n", " 0.12964283]])" ] }, "execution_count": 378, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Initialize the MinMaxScaler\n", "scaler = MinMaxScaler()\n", "\n", "# Fit_transform for X_train, transform for X_test\n", "X_train_scaled = scaler.fit_transform(X_train_num)\n", "X_test_scaled = scaler.transform(X_test_num)\n", "\n", "X_train_scaled" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the numerical data we use MinMaxScaler rather than StandardScaler or RobustScaler because it rescales each feature linearly to a fixed range and so keeps the shape of the original distribution and the relative differences between values. The outliers have already been \"capped\" with the Winsorizer, so the minimum and maximum of each column are no longer driven by extreme values and differences between typical values remain visible after scaling. The numerical columns are also not normally distributed, which makes StandardScaler a poor fit for such skewed data. " ] }, { "cell_type": "code", "execution_count": 379, "metadata": {}, "outputs": [], "source": [ "X_train_final = np.concatenate([X_train_cat, X_train_scaled], axis=1)\n", "X_test_final = np.concatenate([X_test_cat, X_test_scaled], axis=1)\n", "\n", "# Get the column names\n", "column_names = list(X_train_cat.columns) + list(X_train_num.columns)\n", "\n", "# Create DataFrames with column names\n", "X_train_final = pd.DataFrame(X_train_final, columns=column_names)\n", "X_test_final = pd.DataFrame(X_test_final, columns=column_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We combine the selected categorical columns with the \"scaled\" numerical columns to build the \"final\" feature sets. X_train_final and X_test_final will be used for the models to be tested. " ] }, { "cell_type": "code", "execution_count": 380, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " education_level limit_balance pay_0 pay_2 pay_3 pay_4 pay_5 \\\n", "0 1.0 0.368421 0.4 0.4 0.4 0.4 0.4 \n", "1 2.0 0.328947 0.4 0.4 0.8 0.4 0.4 \n", "2 2.0 0.065789 0.4 0.4 0.4 0.4 0.4 \n", "3 2.0 0.184211 0.0 0.2 0.2 0.4 0.4 \n", "4 2.0 0.157895 0.8 0.8 0.4 0.4 0.4 \n", "... ... ... ... ... ... ... ... \n", "2367 1.0 0.052632 1.0 0.8 0.8 1.0 1.0 \n", "2368 3.0 0.105263 0.0 0.0 0.0 0.0 0.2 \n", "2369 3.0 0.184211 0.8 0.8 0.8 0.8 0.8 \n", "2370 3.0 0.197368 0.6 0.0 0.0 0.2 0.4 \n", "2371 1.0 0.000000 0.4 0.4 0.4 0.8 0.8 \n", "\n", " pay_6 pay_amt_1 pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \n", "0 0.4 0.584846 0.493806 0.916430 0.664264 0.537749 0.494198 \n", "1 0.4 1.000000 0.000000 0.305111 0.332099 0.403463 1.000000 \n", "2 0.4 0.175226 0.232352 0.122044 0.132840 0.134488 0.129643 \n", "3 0.4 0.589694 1.000000 0.286804 0.333361 0.356392 0.324237 \n", "4 0.4 0.000000 0.174264 0.122044 0.199259 0.134488 0.000000 \n", "... ... ... ... ... ... ... ... \n", "2367 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "2368 0.2 0.000000 0.058088 0.061022 0.069874 1.000000 0.194464 \n", "2369 0.8 0.347532 0.000000 0.610221 0.000000 1.000000 0.000000 \n", "2370 0.4 0.000000 0.000000 0.042715 0.000000 0.000000 0.000000 \n", "2371 0.4 0.081772 0.145220 0.030511 0.000000 0.033622 0.129643 \n", "\n", "[2372 rows x 14 columns]" ] }, "execution_count": 380, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train_final" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cek lagi jika sudah di-concate dengan benar. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Definition" ] }, { "cell_type": "code", "execution_count": 381, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LogisticRegression()
" ], "text/plain": [ "LogisticRegression()" ] }, "execution_count": 381, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lg = LogisticRegression()\n", "lg.fit(X_train_final, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use Logistic Regression rather than Linear Regression because the target, \"default\" or not, is binary with two possible outcomes. The model therefore outputs a probability between 0 and 1, which is thresholded to pick a class. " ] }, { "cell_type": "code", "execution_count": 382, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
KNeighborsClassifier(n_neighbors=11)
" ], "text/plain": [ "KNeighborsClassifier(n_neighbors=11)" ] }, "execution_count": 382, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn = KNeighborsClassifier(n_neighbors=11)\n", "knn.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use KNN to classify each data point from the labels of its K nearest neighbours, measured by distance in feature space. Here K is set to 11, and we will look at the weighted-average metrics that this value produces. " ] }, { "cell_type": "code", "execution_count": 383, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
SVC()
" ], "text/plain": [ "SVC()" ] }, "execution_count": 383, "metadata": {}, "output_type": "execute_result" } ], "source": [ "svc = SVC()\n", "svc.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SVC separates the classes with a hyperplane defined by its support vectors. We will also vary the regularization parameter (C) and the kernel to see which setting gives the best accuracy, precision, and recall. " ] }, { "cell_type": "code", "execution_count": 384, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
SVC()
" ], "text/plain": [ "SVC()" ] }, "execution_count": 384, "metadata": {}, "output_type": "execute_result" } ], "source": [ "svm_non_scaled = SVC(kernel='rbf')\n", "svm_scaled = SVC(kernel='rbf')\n", "\n", "svm_non_scaled.fit(X_train, y_train)\n", "svm_scaled.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both the non-scaled and the scaled SVM use the RBF kernel, because we do not know the distribution of the data and we assume that the decision boundary is not linear. " ] }, { "cell_type": "code", "execution_count": 385, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
SVC()
" ], "text/plain": [ "SVC()" ] }, "execution_count": 385, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Model Training using different kernels\n", "\n", "svm_linear = SVC(kernel='linear')\n", "svm_poly = SVC(kernel='poly')\n", "svm_rbf = SVC(kernel='rbf')\n", "\n", "svm_linear.fit(X_train_scaled, y_train)\n", "svm_poly.fit(X_train_scaled, y_train)\n", "svm_rbf.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compare the \"linear\", \"polynomial\", and \"RBF\" kernels to see how their scores differ. " ] }, { "cell_type": "code", "execution_count": 386, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
SVC(C=500)
" ], "text/plain": [ "SVC(C=500)" ] }, "execution_count": 386, "metadata": {}, "output_type": "execute_result" } ], "source": [ "svm_rbf_1 = SVC(kernel='rbf', C=0.1)\n", "svm_rbf_500 = SVC(kernel='rbf', C=500)\n", "\n", "svm_rbf_1.fit(X_train_scaled, y_train)\n", "svm_rbf_500.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We vary the regularization parameter C, which controls how strongly misclassified training points are penalized. A small C (0.1) tolerates more errors, while a large C (500) fits the training data more tightly and can lead to \"overfitting\". " ] }, { "cell_type": "code", "execution_count": 387, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
SVC(C=500, gamma=100)
" ], "text/plain": [ "SVC(C=500, gamma=100)" ] }, "execution_count": 387, "metadata": {}, "output_type": "execute_result" } ], "source": [ "svm_rbf_500_1 = SVC(kernel='rbf', C=500, gamma=0.1)\n", "svm_rbf_500_100 = SVC(kernel='rbf', C=500, gamma=100)\n", "\n", "svm_rbf_500_1.fit(X_train_scaled, y_train)\n", "svm_rbf_500_100.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also vary gamma to see how far the influence of a single training point reaches: a small gamma gives each point a wide reach, while a large gamma makes its influence very local. " ] }, { "cell_type": "code", "execution_count": 388, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DecisionTreeClassifier(max_depth=6, random_state=10)
" ], "text/plain": [ "DecisionTreeClassifier(max_depth=6, random_state=10)" ] }, "execution_count": 388, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Training using Decision Tree\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "model_dt = DecisionTreeClassifier(max_depth=6, random_state=10)\n", "model_dt.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also train a Decision Tree (max_depth=6) to illustrate how the data points are separated into classes by the model. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Training" ] }, { "cell_type": "code", "execution_count": 389, "metadata": {}, "outputs": [], "source": [ "y_pred_train_lg = lg.predict(X_train_final)\n", "y_pred_test_lg = lg.predict(X_test_final)" ] }, { "cell_type": "code", "execution_count": 390, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.83 0.97 0.90 1863\n", " 1 0.72 0.28 0.40 509\n", "\n", " accuracy 0.82 2372\n", " macro avg 0.77 0.62 0.65 2372\n", "weighted avg 0.81 0.82 0.79 2372\n", "\n", " precision recall f1-score support\n", "\n", " 0 0.84 0.97 0.90 467\n", " 1 0.75 0.31 0.44 126\n", "\n", " accuracy 0.83 593\n", " macro avg 0.79 0.64 0.67 593\n", "weighted avg 0.82 0.83 0.80 593\n", "\n" ] } ], "source": [ "# Model Evaluation - Train Set & Test Set\n", "\n", "print(classification_report(y_train, y_pred_train_lg))\n", "print(classification_report(y_test, y_pred_test_lg))" ] }, { "cell_type": "code", "execution_count": 391, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Set Evaluation:\n", "accuracy: 0.8217\n", "precision (weighted avg): 0.8067\n", "recall (weighted avg): 0.8217\n", "f1-score (weighted avg): 0.7893\n", "Test Set Evaluation:\n", "accuracy: 0.8314\n", "precision (weighted avg): 0.8202\n", "recall (weighted avg): 0.8314\n", "f1-score 
(weighted avg): 0.8025\n" ] } ], "source": [ "# Model Evaluation - Train Set\n", "train_report_dict = classification_report(y_train, y_pred_train_lg, output_dict=True)\n", "train_accuracy = accuracy_score(y_train, y_pred_train_lg)\n", "\n", "print(\"Train Set Evaluation:\")\n", "#print(classification_report(y_train, y_pred_train_lg))\n", "print(f\"accuracy: {train_accuracy:.4f}\")\n", "print(f\"precision (weighted avg): {train_report_dict['weighted avg']['precision']:.4f}\")\n", "print(f\"recall (weighted avg): {train_report_dict['weighted avg']['recall']:.4f}\")\n", "print(f\"f1-score (weighted avg): {train_report_dict['weighted avg']['f1-score']:.4f}\")\n", "\n", "# Model Evaluation - Test Set\n", "test_report_dict = classification_report(y_test, y_pred_test_lg, output_dict=True)\n", "test_accuracy = accuracy_score(y_test, y_pred_test_lg)\n", "\n", "print(\"Test Set Evaluation:\")\n", "#print(classification_report(y_test, y_pred_test_lg))\n", "print(f\"accuracy: {test_accuracy:.4f}\")\n", "print(f\"precision (weighted avg): {test_report_dict['weighted avg']['precision']:.4f}\")\n", "print(f\"recall (weighted avg): {test_report_dict['weighted avg']['recall']:.4f}\")\n", "print(f\"f1-score (weighted avg): {test_report_dict['weighted avg']['f1-score']:.4f}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For Logistic Regression, the weighted-average accuracy, precision, recall, and f1-score are all around 0.8 on both the train and test sets, so the two sets behave consistently. For the \"default\" class (\"1\"), however, performance is poor: recall is only about 0.3, so most defaulting clients are missed. The default class has far fewer samples (509 versus 1863 in the train set), so the model learns the clients who pay much better than the clients who fail to pay. 
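
The per-class figures in the report can be traced back to the four confusion-matrix counts. The sketch below is a minimal illustration in plain Python. The counts (TN=454, FP=13, FN=87, TP=39) are an assumption: they are reconstructed from the rounded Logistic Regression test-set report, not printed by the notebook, and `binary_metrics` is a hypothetical helper name.

```python
# Confusion-matrix counts for the Logistic Regression test set.
# Assumption: reconstructed from the rounded report (support 467 / 126),
# not taken from an actual confusion_matrix call in the notebook.
tn, fp, fn, tp = 454, 13, 87, 39


def binary_metrics(tn, fp, fn, tp):
    '''Return accuracy, precision, recall and F1 for the positive class.'''
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of all predicted defaults, how many were real
    recall = tp / (tp + fn)      # of all real defaults, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = binary_metrics(tn, fp, fn, tp)
print(f'accuracy : {accuracy:.4f}')
print(f'precision: {precision:.2f}')
print(f'recall   : {recall:.2f}')
print(f'f1-score : {f1:.2f}')
```

Under these assumed counts the sketch reproduces the reported test accuracy of 0.8314, and it shows why overall accuracy looks healthy while recall for class 1 stays near 0.3: 87 of the 126 defaulting clients are missed.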
" ] }, { "cell_type": "code", "execution_count": 392, "metadata": {}, "outputs": [], "source": [ "# Model Prediction\n", "\n", "y_pred_train_knn = knn.predict(X_train_scaled)\n", "y_pred_test_knn = knn.predict(X_test_scaled)" ] }, { "cell_type": "code", "execution_count": 393, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.85 0.96 0.90 1863\n", " 1 0.73 0.36 0.48 509\n", "\n", " accuracy 0.83 2372\n", " macro avg 0.79 0.66 0.69 2372\n", "weighted avg 0.82 0.83 0.81 2372\n", "\n", " precision recall f1-score support\n", "\n", " 0 0.85 0.97 0.90 467\n", " 1 0.74 0.36 0.48 126\n", "\n", " accuracy 0.84 593\n", " macro avg 0.79 0.66 0.69 593\n", "weighted avg 0.82 0.84 0.81 593\n", "\n" ] } ], "source": [ "# Model Evaluation - Train Set & Test Set\n", "\n", "print(classification_report(y_train, y_pred_train_knn))\n", "print(classification_report(y_test, y_pred_test_knn))" ] }, { "cell_type": "code", "execution_count": 394, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Set Evaluation (KNN):\n", "accuracy: 0.8339\n", "precision (weighted avg): 0.8209\n", "recall (weighted avg): 0.8339\n", "f1-score (weighted avg): 0.8116\n", "Test Set Evaluation (KNN):\n", "accuracy: 0.8364\n", "precision (weighted avg): 0.8244\n", "recall (weighted avg): 0.8364\n", "f1-score (weighted avg): 0.8133\n" ] } ], "source": [ "# Model Evaluation - Train Set\n", "train_report_knn = classification_report(y_train, y_pred_train_knn, output_dict=True)\n", "train_accuracy_knn = accuracy_score(y_train, y_pred_train_knn)\n", "\n", "print(\"Train Set Evaluation (KNN):\")\n", "#print(classification_report(y_train, y_pred_train_knn))\n", "print(f\"accuracy: {train_accuracy_knn:.4f}\")\n", "print(f\"precision (weighted avg): {train_report_knn['weighted avg']['precision']:.4f}\")\n", "print(f\"recall (weighted avg): {train_report_knn['weighted avg']['recall']:.4f}\")\n", 
"print(f\"f1-score (weighted avg): {train_report_knn['weighted avg']['f1-score']:.4f}\")\n", "\n", "# Model Evaluation - Test Set\n", "test_report_knn = classification_report(y_test, y_pred_test_knn, output_dict=True)\n", "test_accuracy_knn = accuracy_score(y_test, y_pred_test_knn)\n", "\n", "print(\"Test Set Evaluation (KNN):\")\n", "#print(classification_report(y_test, y_pred_test_knn))\n", "print(f\"accuracy: {test_accuracy_knn:.4f}\")\n", "print(f\"precision (weighted avg): {test_report_knn['weighted avg']['precision']:.4f}\")\n", "print(f\"recall (weighted avg): {test_report_knn['weighted avg']['recall']:.4f}\")\n", "print(f\"f1-score (weighted avg): {test_report_knn['weighted avg']['f1-score']:.4f}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Untuk test KNN, kelihatan nilai accuracy, precision, recall dan f1-score untuk Test dan Train set itu cukup tepat disekitar 0.8. Untuk values yang \"default\" atau \"1\", test tidak terlalu memiliki performance yang baik dengan skor yang rata-rata rendah, sama situasi dengan test sebelumnya. 
" ] }, { "cell_type": "code", "execution_count": 395, "metadata": {}, "outputs": [], "source": [ "y_pred_train_svc = svc.predict(X_train_scaled)\n", "y_pred_test_svc = svc.predict(X_test_scaled)" ] }, { "cell_type": "code", "execution_count": 396, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.85 0.97 0.91 1863\n", " 1 0.77 0.38 0.51 509\n", "\n", " accuracy 0.84 2372\n", " macro avg 0.81 0.67 0.71 2372\n", "weighted avg 0.83 0.84 0.82 2372\n", "\n", " precision recall f1-score support\n", "\n", " 0 0.86 0.97 0.91 467\n", " 1 0.77 0.40 0.52 126\n", "\n", " accuracy 0.85 593\n", " macro avg 0.81 0.68 0.72 593\n", "weighted avg 0.84 0.85 0.83 593\n", "\n" ] } ], "source": [ "print(classification_report(y_train, y_pred_train_svc))\n", "print(classification_report(y_test, y_pred_test_svc))" ] }, { "cell_type": "code", "execution_count": 397, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Set Evaluation (SVC):\n", "accuracy: 0.8423\n", "precision (weighted avg): 0.8336\n", "recall (weighted avg): 0.8423\n", "f1-score (weighted avg): 0.8204\n", "Test Set Evaluation (SVC):\n", "accuracy: 0.8465\n", "precision (weighted avg): 0.8376\n", "recall (weighted avg): 0.8465\n", "f1-score (weighted avg): 0.8267\n" ] } ], "source": [ "# Model Evaluation - Train Set\n", "train_report_svc = classification_report(y_train, y_pred_train_svc, output_dict=True)\n", "train_accuracy_svc = accuracy_score(y_train, y_pred_train_svc)\n", "\n", "print(\"Train Set Evaluation (SVC):\")\n", "#print(classification_report(y_train, y_pred_train_svc))\n", "print(f\"accuracy: {train_accuracy_svc:.4f}\")\n", "print(f\"precision (weighted avg): {train_report_svc['weighted avg']['precision']:.4f}\")\n", "print(f\"recall (weighted avg): {train_report_svc['weighted avg']['recall']:.4f}\")\n", "print(f\"f1-score (weighted avg): {train_report_svc['weighted 
avg']['f1-score']:.4f}\")\n", "\n", "# Model Evaluation - Test Set\n", "test_report_svc = classification_report(y_test, y_pred_test_svc, output_dict=True)\n", "test_accuracy_svc = accuracy_score(y_test, y_pred_test_svc)\n", "\n", "print(\"Test Set Evaluation (SVC):\")\n", "#print(classification_report(y_test, y_pred_test_svc))\n", "print(f\"accuracy: {test_accuracy_svc:.4f}\")\n", "print(f\"precision (weighted avg): {test_report_svc['weighted avg']['precision']:.4f}\")\n", "print(f\"recall (weighted avg): {test_report_svc['weighted avg']['recall']:.4f}\")\n", "print(f\"f1-score (weighted avg): {test_report_svc['weighted avg']['f1-score']:.4f}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For SVC, the weighted-average accuracy, precision, recall, and f1-score are around 0.8 on both the train and test sets. For the \"default\" class (\"1\") the scores are still low (recall about 0.4), the same situation as with the previous models. Because SVC has the best accuracy, precision, recall, and f1-score of the three models, we carry it forward into the next evaluation steps. 
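
The gap between the weighted averages (around 0.8) and the class-1 scores (around 0.5) comes from how the averages are built. The sketch below is a small illustration in plain Python that uses the rounded per-class F1 and support values of the SVC test-set report; the dictionary layout and variable names are hypothetical. For reference, scikit-learn's `f1_score` uses `average='binary'` by default, so it reports only the positive class.

```python
# Per-class F1 and support, copied from the SVC test-set report
# (values are rounded to two decimals in the report).
per_class = {
    0: {'f1': 0.91, 'support': 467},
    1: {'f1': 0.52, 'support': 126},
}

total = sum(row['support'] for row in per_class.values())

# Weighted average: each class counts in proportion to its support,
# so the large class 0 dominates the result.
weighted_f1 = sum(row['f1'] * row['support'] for row in per_class.values()) / total

# Macro average: every class counts equally.
macro_f1 = sum(row['f1'] for row in per_class.values()) / len(per_class)

# F1 of the positive class only, which is what a binary F1 reports.
default_f1 = per_class[1]['f1']

print(f'weighted F1: {weighted_f1:.3f}')
print(f'macro F1   : {macro_f1:.3f}')
print(f'class-1 F1 : {default_f1:.3f}')
```

Because about 79% of the test samples belong to class 0, the weighted average stays close to the class-0 score even though the model finds fewer than half of the defaults.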
" ] }, { "cell_type": "code", "execution_count": 398, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Non Scaled SVM\n", "Train : 0.00392156862745098\n", "Test : 0.0\n", "\n", "Scaled SVM\n", "Train : 0.5065963060686016\n", "Test : 0.5235602094240838\n" ] } ], "source": [ "# Model Evaluation\n", "\n", "def performance_check(clf, X, y):\n", " y_pred = clf.predict(X)\n", " return f1_score(y, y_pred)\n", "\n", "print('Non Scaled SVM')\n", "print('Train : ', performance_check(svm_non_scaled, X_train, y_train))\n", "print('Test : ', performance_check(svm_non_scaled, X_test, y_test))\n", "print('')\n", "\n", "print('Scaled SVM')\n", "print('Train : ', performance_check(svm_scaled, X_train_scaled, y_train))\n", "print('Test : ', performance_check(svm_scaled, X_test_scaled, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bisa diperbandinkan untuk performance chek f-1 score antar train dan test values di SVM. Skor lebih rendah daripada classification report, ini artinya ada masalah fitting atau gunakan hyperparameter tuning nanti. 
" ] }, { "cell_type": "code", "execution_count": 399, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SVM - Linear\n", "Train : 0.472258064516129\n", "Test : 0.5025641025641026\n", "\n", "SVM - Polynomial\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Train : 0.5301837270341208\n", "Test : 0.5157894736842106\n", "\n", "SVM - RBF\n", "Train : 0.5065963060686016\n", "Test : 0.5235602094240838\n", "\n" ] } ], "source": [ "# Model Evaluation\n", "\n", "print('SVM - Linear')\n", "print('Train : ', performance_check(svm_linear, X_train_scaled, y_train))\n", "print('Test : ', performance_check(svm_linear, X_test_scaled, y_test))\n", "print('')\n", "\n", "print('SVM - Polynomial')\n", "print('Train : ', performance_check(svm_poly, X_train_scaled, y_train))\n", "print('Test : ', performance_check(svm_poly, X_test_scaled, y_test))\n", "print('')\n", "\n", "print('SVM - RBF')\n", "print('Train : ', performance_check(svm_rbf, X_train_scaled, y_train))\n", "print('Test : ', performance_check(svm_rbf, X_test_scaled, y_test))\n", "print('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance antara Linear, Poly, dan RBF sangat dekat, walaupun yang Polynomial paling baik nilai yang cenderung tinggi untuk Train dan Test. 
" ] }, { "cell_type": "code", "execution_count": 400, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SVM - C=0.1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Train : 0.3203732503888025\n", "Test : 0.3712574850299401\n", "\n", "SVM - C=500\n", "Train : 0.831140350877193\n", "Test : 0.4773662551440329\n", "\n" ] } ], "source": [ "# Model Evaluation\n", "\n", "print('SVM - C=0.1')\n", "print('Train : ', performance_check(svm_rbf_1, X_train_scaled, y_train))\n", "print('Test : ', performance_check(svm_rbf_1, X_test_scaled, y_test))\n", "print('')\n", "\n", "print('SVM - C=500')\n", "print('Train : ', performance_check(svm_rbf_500, X_train_scaled, y_train))\n", "print('Test : ', performance_check(svm_rbf_500, X_test_scaled, y_test))\n", "print('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Untuk C Regularization test, bisa kelihatan C value yang lebih tinggi artinya akan punya penalty terhadap data yang \"misclassified\" jadi toleransi terhadap data yang salah lebih kecil. Oleh karena itu, SVM yang punya C value yang besar akan cenderung punya akurasi yang baik. 
" ] }, { "cell_type": "code", "execution_count": 401, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SVM - gamma=0.1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Train : 0.5137614678899083\n", "Test : 0.5284974093264249\n", "\n", "SVM - gamma=100\n", "Train : 0.9891411648568608\n", "Test : 0.1566265060240964\n", "\n" ] } ], "source": [ "print('SVM - gamma=0.1')\n", "print('Train : ', performance_check(svm_rbf_500_1, X_train_scaled, y_train))\n", "print('Test : ', performance_check(svm_rbf_500_1, X_test_scaled, y_test))\n", "print('')\n", "\n", "print('SVM - gamma=100')\n", "print('Train : ', performance_check(svm_rbf_500_100, X_train_scaled, y_train))\n", "print('Test : ', performance_check(svm_rbf_500_100, X_test_scaled, y_test))\n", "print('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Untuk gamma, kelihatan juga value gamma yang lebih besar akan berdampak \"overfitting\" terhadap training dataset. " ] }, { "cell_type": "code", "execution_count": 402, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Decision Tree - Train\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.87 0.95 0.91 1863\n", " 1 0.75 0.50 0.60 509\n", "\n", " accuracy 0.86 2372\n", " macro avg 0.81 0.72 0.75 2372\n", "weighted avg 0.85 0.86 0.84 2372\n", "\n", "\n", "Decision Tree - Test\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.86 0.95 0.90 467\n", " 1 0.69 0.40 0.51 126\n", "\n", " accuracy 0.83 593\n", " macro avg 0.77 0.68 0.71 593\n", "weighted avg 0.82 0.83 0.82 593\n", "\n" ] } ], "source": [ "# Model Evaluation\n", "\n", "def performance_check(clf, X, y):\n", " y_pred = clf.predict(X)\n", " cm = confusion_matrix(y, y_pred)\n", " disp = ConfusionMatrixDisplay(confusion_matrix=cm)\n", " disp.plot()\n", " plt.show()\n", " print(classification_report(y, y_pred))\n", "\n", "print('Decision Tree - Train')\n", "performance_check(model_dt, X_train_scaled, y_train)\n", "print('')\n", "\n", "print('Decision Tree - Test')\n", "performance_check(model_dt, X_test_scaled, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita coba buat decision tree untuk melihat rangka model performance seperti TP, TN, FP, dan FN. Kelihatan bahwa data yang True Negative adalah mayoritas untuk Training dan Test Dataset. Ini artinya client yang lunas bayar utang benar diprediksi untuk bayar utang dengan tepat waktu. Di sisi lain, True positive juga cukup signifikan oleh karena banyak orang di train dan test dataset tidak bayar dengan lunas, dan model prediksi dengan benar.\n", "\n", "Untuk masalah yang parah, False Negative adalah situasi jika model tidak prediksi dengan benar jika client tidak bisa bayar utang atau \"default\", tetapi model prediksi bisa lunas bayar. Di train dataset memiliki sekitar 250-an sample yang False Negative, ini sangat tinggi terhadap resiko bank untuk kasih pinjaman atau \"loan\" terhadap client yang tidak mampu bayar.\n", "\n", "Juga, dataset memiliki False Positif yang cukup tinggi juga jika model salah prediksi client yang kelihatannya tidak bisa bayar, tetapi mampu bayar. 
Ini juga masalah dari model karena client tersebut bisa di \"accuse\" untuk tidak bisa bayar, dan reputasi dari bank bisa terdampak negatif. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Evaluation" ] }, { "cell_type": "code", "execution_count": 403, "metadata": {}, "outputs": [], "source": [ "# Calculate predictions for training and test sets\n", "y_train_pred = lg.predict(X_train_final)\n", "y_test_pred = lg.predict(X_test_final)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Untuk Model Evaluasi, Kita akan ambil dua X_train dan X_test file dan akan coba prediksi untuk di-test dengan sample file baru nanti. Kita akan taruh dua file kedalam dua variable baru. " ] }, { "cell_type": "code", "execution_count": 404, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model Evaluation Metrics:\n", "\n", "Mean Absolute Error (MAE)\n", " - Train Set: 0.17833052276559866\n", " - Test Set: 0.16863406408094436\n", "\n", "\n", "Mean Squared Error (MSE)\n", " - Train Set: 0.17833052276559866\n", " - Test Set: 0.16863406408094436\n", "\n", "\n", "Root Mean Squared Error (RMSE)\n", " - Train Set: 0.4222919875697367\n", " - Test Set: 0.41065078117659093\n", "\n", "\n", "R^2 Score\n", " - Train Set: -0.058094397464005354\n", " - Test Set: -0.007783555963427391\n" ] } ], "source": [ "# Evaluate using metrics MAE, MSE, and Rsquared\n", "mae_train = mean_absolute_error(y_train, y_train_pred)\n", "mae_test = mean_absolute_error(y_test, y_test_pred)\n", "\n", "mse_train = mean_squared_error(y_train, y_train_pred)\n", "mse_test = mean_squared_error(y_test, y_test_pred)\n", "\n", "rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)\n", "rmse_test = mean_squared_error(y_test, y_test_pred, squared=False)\n", "\n", "r2_train = r2_score(y_train, y_train_pred)\n", "r2_test = r2_score(y_test, y_test_pred)\n", "\n", "# Print the evaluation report\n", "print(\"Model Evaluation Metrics:\\n\")\n", 
"print(f\"Mean Absolute Error (MAE)\\n - Train Set: {mae_train}\\n - Test Set: {mae_test}\\n\")\n", "print()\n", "print(f\"Mean Squared Error (MSE)\\n - Train Set: {mse_train}\\n - Test Set: {mse_test}\\n\")\n", "print()\n", "print(f\"Root Mean Squared Error (RMSE)\\n - Train Set: {rmse_train}\\n - Test Set: {rmse_test}\\n\")\n", "print()\n", "print(f\"R^2 Score\\n - Train Set: {r2_train}\\n - Test Set: {r2_test}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Di test ini, kita akan nge-test beberapa metrik yang akan mendapatkan insight terhadap dataset Y_train dengan Y_train_pred (akan diprediksi):\n", "\n", "-Mean Absolute Error memiliki akurasi error sebesar 0.17 untuk Train dan Test set. Value lebih rendah lebih akurat. \n", "-Mean Squared Error memiliki akurasi error sebesar 0.17 untuk Train dan Test set. Value lebih rendah lebih akurat. \n", "-Root Mean Squared Error memiliki akurasi error sebesar 0.42 untuk Train dan Test set. Value lebih rendah lebih akurat. \n", "\n", "-R^2 adalah proporsi dari variance target yang bisa dijelaskan oleh karena model tersebut. Disini, value tersebut adalah negatif yang di-train dan test set, artinya model tidak bisa menjelaskan variasi dari target variable \"default_payment_next_month\".\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross Validation and Parameter Tuning" ] }, { "cell_type": "code", "execution_count": 405, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
RandomForestClassifier()
" ], "text/plain": [ "RandomForestClassifier()" ] }, "execution_count": 405, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf = RandomForestClassifier()\n", "rf.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a Random Forest classifier with default hyperparameters as the baseline model for cross validation and tuning." ] }, { "cell_type": "code", "execution_count": 406, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 Score - Train Set : 0.9911330049261083 \n", "\n", "Classification Report : \n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 1863\n", " 1 0.99 0.99 0.99 509\n", "\n", " accuracy 1.00 2372\n", " macro avg 1.00 0.99 0.99 2372\n", "weighted avg 1.00 1.00 1.00 2372\n", " \n", "\n", "Confusion Matrix : \n", " \n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "y_pred_train = rf.predict(X_train_scaled)\n", "\n", "print('F1 Score - Train Set : ', f1_score(y_train, y_pred_train), '\\n')\n", "print('Classification Report : \\n', classification_report(y_train, y_pred_train), '\\n')\n", "print('Confusion Matrix : \\n', ConfusionMatrixDisplay.from_estimator(rf, X_train_scaled, y_train, cmap='Reds'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dari classification report, kelihatan values accuracy, precision, dan recall sangat tinggi dan adalah overfitting. Kita bisa coba paramaeter tuning untuk memperbaik nilainya. " ] }, { "cell_type": "code", "execution_count": 407, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 Score - All - Cross Validation : [0.49618321 0.41732283 0.4911032 ]\n", "F1 Score - Mean - Cross Validation : 0.46820308119983817\n", "F1 Score - Std - Cross Validation : 0.036037491823006584\n", "F1 Score - Range of Test-Set : 0.4321655893768316 - 0.5042405730228448\n" ] } ], "source": [ "\n", "f1_train_cross_val = cross_val_score(rf,\n", " X_train_scaled,\n", " y_train,\n", " cv=3,\n", " scoring=\"f1\")\n", "\n", "print('F1 Score - All - Cross Validation : ', f1_train_cross_val)\n", "print('F1 Score - Mean - Cross Validation : ', f1_train_cross_val.mean())\n", "print('F1 Score - Std - Cross Validation : ', f1_train_cross_val.std())\n", "print('F1 Score - Range of Test-Set : ', (f1_train_cross_val.mean()-f1_train_cross_val.std()) , '-', (f1_train_cross_val.mean()+f1_train_cross_val.std()))" ] }, { "cell_type": "code", "execution_count": 408, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 Score - Test Set : 0.5471698113207547 \n", "\n", "Classification Report : \n", " precision recall f1-score support\n", "\n", " 0 0.87 0.94 0.90 467\n", " 1 0.67 0.46 0.55 126\n", "\n", " accuracy 0.84 593\n", " macro avg 0.77 0.70 0.72 593\n", "weighted avg 0.83 0.84 
0.83 593\n", " \n", "\n", "Confusion Matrix : \n", " \n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "y_pred_test = rf.predict(X_test_scaled)\n", "\n", "print('F1 Score - Test Set : ', f1_score(y_test, y_pred_test), '\\n')\n", "print('Classification Report : \\n', classification_report(y_test, y_pred_test), '\\n')\n", "print('Confusion Matrix : \\n', ConfusionMatrixDisplay.from_estimator(rf, X_test_scaled, y_test, cmap='Reds'))" ] }, { "cell_type": "code", "execution_count": 409, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Baseline (Default Hyperparameter)\n", "train - precision 0.771084\n", "train - recall 0.377210\n", "train - accuracy 0.842327\n", "train - f1_score 0.506596\n", "test - precision 0.769231\n", "test - recall 0.396825\n", "test - accuracy 0.846543\n", "test - f1_score 0.523560\n" ] } ], "source": [ "\n", "all_reports = {}\n", "\n", "def performance_report(all_reports, y_train, y_pred_train_svc, y_test, y_pred_test_svc, name):\n", " score_reports = {\n", " 'train - precision': precision_score(y_train, y_pred_train_svc),\n", " 'train - recall': recall_score(y_train, y_pred_train_svc),\n", " 'train - accuracy': accuracy_score(y_train, y_pred_train_svc),\n", " 'train - f1_score': f1_score(y_train, y_pred_train_svc),\n", " 'test - precision': precision_score(y_test, y_pred_test_svc),\n", " 'test - recall': recall_score(y_test, y_pred_test_svc),\n", " 'test - accuracy': accuracy_score(y_test, y_pred_test_svc),\n", " 'test - f1_score': f1_score(y_test, y_pred_test_svc),\n", " }\n", " all_reports[name] = score_reports\n", " return all_reports\n", "\n", "# Example usage (ensure y_pred_train_svc and y_pred_test_svc are defined)\n", "all_reports = performance_report(all_reports, y_train, y_pred_train_svc, y_test, y_pred_test_svc, 'Baseline (Default Hyperparameter)')\n", "report_df = pd.DataFrame(all_reports)\n", "print(report_df)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita akan coba save baseline model kedalam dictionary 
\"all_report\" untuk menumpang data dari random dan grid search. " ] }, { "cell_type": "code", "execution_count": 410, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/miniconda3/lib/python3.12/site-packages/threadpoolctl.py:1214: RuntimeWarning: \n", "Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at\n", "the same time. Both libraries are known to be incompatible and this\n", "can cause random crashes or deadlocks on Linux when loaded in the\n", "same Python program.\n", "Using threadpoolctl may cause crashes or deadlocks. For more\n", "information and possible workarounds, please see\n", " https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md\n", "\n", " warnings.warn(msg, RuntimeWarning)\n", "/opt/miniconda3/lib/python3.12/site-packages/threadpoolctl.py:1214: RuntimeWarning: \n", "Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at\n", "the same time. Both libraries are known to be incompatible and this\n", "can cause random crashes or deadlocks on Linux when loaded in the\n", "same Python program.\n", "Using threadpoolctl may cause crashes or deadlocks. For more\n", "information and possible workarounds, please see\n", " https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md\n", "\n", " warnings.warn(msg, RuntimeWarning)\n", "/opt/miniconda3/lib/python3.12/site-packages/threadpoolctl.py:1214: RuntimeWarning: \n", "Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at\n", "the same time. Both libraries are known to be incompatible and this\n", "can cause random crashes or deadlocks on Linux when loaded in the\n", "same Python program.\n", "Using threadpoolctl may cause crashes or deadlocks. 
For more\n", "information and possible workarounds, please see\n", " https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md\n", "\n", " warnings.warn(msg, RuntimeWarning)\n", "/opt/miniconda3/lib/python3.12/site-packages/threadpoolctl.py:1214: RuntimeWarning: \n", "Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at\n", "the same time. Both libraries are known to be incompatible and this\n", "can cause random crashes or deadlocks on Linux when loaded in the\n", "same Python program.\n", "Using threadpoolctl may cause crashes or deadlocks. For more\n", "information and possible workarounds, please see\n", " https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md\n", "\n", " warnings.warn(msg, RuntimeWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best Parameters: {'bootstrap': False, 'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 7, 'min_samples_split': 8, 'n_estimators': 159}\n", "Best Score: 0.4781662991940268\n", " precision recall f1-score support\n", "\n", " 0 0.86 0.96 0.91 467\n", " 1 0.73 0.44 0.55 126\n", "\n", " accuracy 0.85 593\n", " macro avg 0.80 0.70 0.73 593\n", "weighted avg 0.84 0.85 0.83 593\n", "\n" ] } ], "source": [ "\n", "# Define the parameter distribution\n", "random_search_params = {\n", " 'n_estimators': randint(50, 200),\n", " 'max_features': ['auto', 'sqrt', 'log2'],\n", " 'max_depth': [None] + list(range(10, 110, 10)),\n", " 'min_samples_split': randint(2, 11),\n", " 'min_samples_leaf': randint(1, 11),\n", " 'bootstrap': [True, False]\n", "}\n", "\n", "\n", "# Initialize RandomizedSearchCV with reduced iterations and fewer CV folds\n", "rf_randomcv = RandomizedSearchCV(estimator=RandomForestClassifier(),\n", " param_distributions=random_search_params,\n", " n_iter=20, # Reduced number of iterations\n", " cv=3, # Reduced number of CV folds\n", " random_state=46,\n", " n_jobs=-1,\n", " scoring='f1')\n", "\n", "# Fit the model\n", 
"rf_randomcv.fit(X_train_scaled, y_train)\n", "\n", "# Print the best parameters and the best score\n", "print(\"Best Parameters:\", rf_randomcv.best_params_)\n", "print(\"Best Score:\", rf_randomcv.best_score_)\n", "\n", "# Make predictions with the best estimator\n", "y_pred = rf_randomcv.predict(X_test_scaled)\n", "print(classification_report(y_test, y_pred))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita akan membuat parameter untuk random search dengan model RandomizedSearchCV untuk mengoptimisasi performance model. Selection dari n_iter atau CV bisa diganti untuk ganti random combination yang akan digunakan dalam search process, dan juga mengoptimisasi beberapa K-folds yang akan di validate dalam beberapa set. \n", "\n" ] }, { "cell_type": "code", "execution_count": 411, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bootstrap': False,\n", " 'max_depth': 30,\n", " 'max_features': 'sqrt',\n", " 'min_samples_leaf': 7,\n", " 'min_samples_split': 8,\n", " 'n_estimators': 159}" ] }, "execution_count": 411, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_randomcv.best_params_" ] }, { "cell_type": "code", "execution_count": 412, "metadata": {}, "outputs": [], "source": [ "rf_randomcv_best = rf_randomcv.best_estimator_" ] }, { "cell_type": "code", "execution_count": 413, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 Score - Test Set : 0.5517241379310345 \n", "\n", "Classification Report : \n", " precision recall f1-score support\n", "\n", " 0 0.86 0.96 0.91 467\n", " 1 0.73 0.44 0.55 126\n", "\n", " accuracy 0.85 593\n", " macro avg 0.80 0.70 0.73 593\n", "weighted avg 0.84 0.85 0.83 593\n", " \n", "\n", "Confusion Matrix : \n", " \n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Check Performance Model against Test-Set\n", "\n", "y_pred_test = rf_randomcv_best.predict(X_test_scaled)\n", "\n", "print('F1 Score - Test Set : ', f1_score(y_test, y_pred_test), '\\n')\n", "print('Classification Report : \\n', classification_report(y_test, y_pred_test), '\\n')\n", "print('Confusion Matrix : \\n', ConfusionMatrixDisplay.from_estimator(rf_randomcv_best, X_test_scaled, y_test, cmap='Reds'))" ] }, { "cell_type": "code", "execution_count": 414, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Baseline (Default Hyperparameter)</th><th>Random Search</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>train - precision</th><td>0.771084</td><td>0.771084</td></tr>\n",
"<tr><th>train - recall</th><td>0.377210</td><td>0.377210</td></tr>\n",
"<tr><th>train - accuracy</th><td>0.842327</td><td>0.842327</td></tr>\n",
"<tr><th>train - f1_score</th><td>0.506596</td><td>0.506596</td></tr>\n",
"<tr><th>test - precision</th><td>0.769231</td><td>0.769231</td></tr>\n",
"<tr><th>test - recall</th><td>0.396825</td><td>0.396825</td></tr>\n",
"<tr><th>test - accuracy</th><td>0.846543</td><td>0.846543</td></tr>\n",
"<tr><th>test - f1_score</th><td>0.523560</td><td>0.523560</td></tr>\n",
"</tbody>\n",
"</table>\n", "
" ], "text/plain": [ " Baseline (Default Hyperparameter) Random Search\n", "train - precision 0.771084 0.771084\n", "train - recall 0.377210 0.377210\n", "train - accuracy 0.842327 0.842327\n", "train - f1_score 0.506596 0.506596\n", "test - precision 0.769231 0.769231\n", "test - recall 0.396825 0.396825\n", "test - accuracy 0.846543 0.846543\n", "test - f1_score 0.523560 0.523560" ] }, "execution_count": 414, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Save Classification Report into a Dictionary\n", "\n", "all_reports = performance_report(all_reports, y_train, y_pred_train_svc, y_test, y_pred_test_svc, 'Random Search')\n", "pd.DataFrame(all_reports)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bisa kelihatan dari report bahwa tidak ada perbedaan dari baseline dan random search. Kita akan coba save model kedalam dictionary. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grid Search" ] }, { "cell_type": "code", "execution_count": 415, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n_estimators': [100, 200],\n", " 'max_features': ['auto', 'sqrt'],\n", " 'max_depth': [10, 20, None],\n", " 'min_samples_split': [2, 5],\n", " 'min_samples_leaf': [1, 2],\n", " 'bootstrap': [True, False]}" ] }, "execution_count": 415, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define a simplified parameter grid\n", "grid_search_params = {\n", " 'n_estimators': [100, 200],\n", " 'max_features': ['auto', 'sqrt'],\n", " 'max_depth': [10, 20, None],\n", " 'min_samples_split': [2, 5],\n", " 'min_samples_leaf': [1, 2],\n", " 'bootstrap': [True, False]\n", "}\n", "\n", "grid_search_params\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita akan set parameter untuk grid search dengan variables n_estimators, max_features, dan lain lain. 
" ] }, { "cell_type": "code", "execution_count": 416, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bootstrap': False,\n", " 'max_depth': 30,\n", " 'max_features': 'sqrt',\n", " 'min_samples_leaf': 7,\n", " 'min_samples_split': 8,\n", " 'n_estimators': 159}" ] }, "execution_count": 416, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_randomcv.best_params_" ] }, { "cell_type": "code", "execution_count": 417, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 3 folds for each of 96 candidates, totalling 288 fits\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END 
bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, 
min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.9s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.9s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 1.9s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.9s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.4s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.4s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.5s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 2.5s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.4s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.3s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 2.3s\n", "[CV] END bootstrap=True, max_depth=10, 
max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.2s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, 
min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.7s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.7s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 
1.8s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 3.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.7s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.4s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.6s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.8s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 3.1s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 3.4s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 3.3s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.1s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 2.2s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.4s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 2.7s\n", "[CV] END bootstrap=True, max_depth=20, 
max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.2s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.6s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.6s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.1s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.4s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.3s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.3s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.3s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.9s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, 
min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, 
n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.0s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 1.9s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 1.9s\n", "[CV] END bootstrap=True, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 1.9s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 
1.3s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 2.5s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 2.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 1.6s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 1.4s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.6s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 1.4s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.7s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.6s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 1.2s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.7s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 1.4s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END 
bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, 
max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 3.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 2.1s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 3.2s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 3.5s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, 
min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.9s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.7s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 3.2s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.5s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 3.3s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.4s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 2.7s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.2s\n", "[CV] END bootstrap=True, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 2.7s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.2s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.5s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 2.9s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.8s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.9s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, 
min_samples_split=2, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.5s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.4s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.1s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.7s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; 
total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END 
bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 1.1s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 1.0s\n", "[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 1.0s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.7s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.7s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.6s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.6s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.6s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.6s\n", "[CV] END bootstrap=False, max_depth=20, 
max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.6s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 1.4s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 1.7s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.1s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.2s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.2s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 2.4s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.3s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.2s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 2.4s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, 
min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, 
min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 0.0s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 2.2s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 3.4s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 3.3s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 3.4s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, 
min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time= 1.2s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.3s\n", "[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 2.3s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 2.7s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.5s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time= 1.8s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 3.2s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.4s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time= 3.5s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 3.3s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.6s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 3.2s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time= 1.6s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time= 3.4s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, 
min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 3.3s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 3.6s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time= 3.5s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.5s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.6s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time= 1.4s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 2.3s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 2.2s\n", "[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time= 2.1s\n" ] }, { "data": { "text/html": [ "
GridSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-1,\n",
       "             param_grid={'bootstrap': [True, False],\n",
       "                         'max_depth': [10, 20, None],\n",
       "                         'max_features': ['auto', 'sqrt'],\n",
       "                         'min_samples_leaf': [1, 2],\n",
       "                         'min_samples_split': [2, 5],\n",
       "                         'n_estimators': [100, 200]},\n",
       "             scoring='f1', verbose=2)
" ], "text/plain": [ "GridSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-1,\n", " param_grid={'bootstrap': [True, False],\n", " 'max_depth': [10, 20, None],\n", " 'max_features': ['auto', 'sqrt'],\n", " 'min_samples_leaf': [1, 2],\n", " 'min_samples_split': [2, 5],\n", " 'n_estimators': [100, 200]},\n", " scoring='f1', verbose=2)" ] }, "execution_count": 417, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_gridcv = GridSearchCV(estimator=RandomForestClassifier(),\n", " param_grid=grid_search_params,\n", " cv=3,\n", " n_jobs=-1,\n", " verbose=2,\n", " scoring='f1')\n", "\n", "rf_gridcv.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita akan coba grid search dengan cross validation yang ditepatkan. " ] }, { "cell_type": "code", "execution_count": 418, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bootstrap': True,\n", " 'max_depth': None,\n", " 'max_features': 'sqrt',\n", " 'min_samples_leaf': 2,\n", " 'min_samples_split': 2,\n", " 'n_estimators': 100}" ] }, "execution_count": 418, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf_gridcv.best_params_" ] }, { "cell_type": "code", "execution_count": 419, "metadata": {}, "outputs": [], "source": [ "rf_gridcv_best = rf_gridcv.best_estimator_" ] }, { "cell_type": "code", "execution_count": 420, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 Score - Test Set : 0.5507246376811594 \n", "\n", "Classification Report : \n", " precision recall f1-score support\n", "\n", " 0 0.87 0.95 0.91 467\n", " 1 0.70 0.45 0.55 126\n", "\n", " accuracy 0.84 593\n", " macro avg 0.78 0.70 0.73 593\n", "weighted avg 0.83 0.84 0.83 593\n", " \n", "\n", "Confusion Matrix : \n", " \n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "y_pred_test = rf_gridcv_best.predict(X_test_scaled)\n", "\n", "print('F1 Score - Test Set : ', f1_score(y_test, y_pred_test), '\\n')\n", "print('Classification Report : \\n', classification_report(y_test, y_pred_test), '\\n')\n", "print('Confusion Matrix : \\n', ConfusionMatrixDisplay.from_estimator(rf_gridcv_best, X_test_scaled, y_test, cmap='Reds'))" ] }, { "cell_type": "code", "execution_count": 421, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Baseline (Default Hyperparameter)Random SearchGrid Search
train - precision0.7710840.7710840.771084
train - recall0.3772100.3772100.377210
train - accuracy0.8423270.8423270.842327
train - f1_score0.5065960.5065960.506596
test - precision0.7692310.7692310.769231
test - recall0.3968250.3968250.396825
test - accuracy0.8465430.8465430.846543
test - f1_score0.5235600.5235600.523560
\n", "
" ], "text/plain": [ " Baseline (Default Hyperparameter) Random Search \\\n", "train - precision 0.771084 0.771084 \n", "train - recall 0.377210 0.377210 \n", "train - accuracy 0.842327 0.842327 \n", "train - f1_score 0.506596 0.506596 \n", "test - precision 0.769231 0.769231 \n", "test - recall 0.396825 0.396825 \n", "test - accuracy 0.846543 0.846543 \n", "test - f1_score 0.523560 0.523560 \n", "\n", " Grid Search \n", "train - precision 0.771084 \n", "train - recall 0.377210 \n", "train - accuracy 0.842327 \n", "train - f1_score 0.506596 \n", "test - precision 0.769231 \n", "test - recall 0.396825 \n", "test - accuracy 0.846543 \n", "test - f1_score 0.523560 " ] }, "execution_count": 421, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Save Classification Report into a Dictionary\n", "\n", "all_reports = performance_report(all_reports, y_train, y_pred_train_svc, y_test, y_pred_test_svc, 'Grid Search')\n", "pd.DataFrame(all_reports)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bisa kelihatan dari report bahwa tidak ada perbedaan dari baseline, random search dan grid search. Kita akan coba save model kedalam dictionary. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Saving" ] }, { "cell_type": "code", "execution_count": 422, "metadata": {}, "outputs": [], "source": [ "#Model saving\n", "\n", "with open('list_num_cols.txt', 'w') as file_1:\n", " json.dump(selected_num_cols, file_1)\n", "\n", "with open('list_cat_cols.txt', 'w') as file_2:\n", " json.dump(selected_cat_cols, file_2)\n", "\n", "with open('scaler.pkl', 'wb') as file_3:\n", " pickle.dump(scaler, file_3)\n", "\n", "#with open('encoder.pkl', 'wb') as file_4:\n", " #pickle.dump(encoder, file_4)\n", "\n", "with open('model.pkl', 'wb') as file_5:\n", " pickle.dump(lg, file_5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kita akan save data-data tersebut didalam Json files." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Inference " ] }, { "cell_type": "code", "execution_count": 423, "metadata": {}, "outputs": [], "source": [ "# Load model and other files\n", "\n", "with open('list_cat_cols.txt', 'r') as file_1:\n", " list_cat_col = json.load(file_1)\n", "\n", "with open('list_num_cols.txt', 'r') as file_2:\n", " list_num_col = json.load(file_2)\n", "\n", "with open(\"model.pkl\", \"rb\") as file_3:\n", " model = pickle.load(file_3)\n", "\n", "with open(\"scaler.pkl\", \"rb\") as file_4:\n", " scaler = pickle.load(file_4)\n", "\n", "#with open(\"encoder.pkl\", \"rb\") as file_5:\n", " #encoder = pickle.load(file_5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load the files saved above so they can be used for inference." ] }, { "cell_type": "code", "execution_count": 424, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " limit_balance sex education_level marital_status age pay_0 pay_2 \\\n", "0 80000.0 1 6 1 54.0 0.0 0.0 \n", "1 200000.0 1 4 1 49.0 0.0 0.0 \n", "2 20000.0 2 6 2 22.0 0.0 0.0 \n", "3 260000.0 2 4 2 33.0 0.0 0.0 \n", "4 150000.0 1 4 2 32.0 0.0 0.0 \n", "\n", " pay_3 pay_4 pay_5 ... bill_amt_4 bill_amt_5 bill_amt_6 pay_amt_1 \\\n", "0 0.0 0.0 0.0 ... 29296.0 26210.0 17643.0 2545.0 \n", "1 0.0 0.0 0.0 ... 50146.0 50235.0 48984.0 1689.0 \n", "2 0.0 0.0 0.0 ... 1434.0 500.0 0.0 4641.0 \n", "3 0.0 0.0 0.0 ... 27821.0 30767.0 29890.0 5000.0 \n", "4 0.0 -1.0 0.0 ... 
150464.0 143375.0 146411.0 4019.0 \n", "\n", " pay_amt_2 pay_amt_3 pay_amt_4 pay_amt_5 pay_amt_6 \\\n", "0 2208.0 1336.0 2232.0 542.0 348.0 \n", "1 2164.0 2500.0 3480.0 2500.0 3000.0 \n", "2 1019.0 900.0 0.0 1500.0 0.0 \n", "3 5000.0 1137.0 5000.0 1085.0 5000.0 \n", "4 146896.0 157436.0 4600.0 4709.0 5600.0 \n", "\n", " default_payment_next_month \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 0 \n", "4 0 \n", "\n", "[5 rows x 24 columns]\n" ] } ], "source": [ "import pandas as pd\n", "\n", "# Read the data from the CSV file\n", "file_path = \"/Users/ryantrisnadi/Desktop/first_project1/p1-ftds017-hck-g5-ryantrisnadi/_P1G5_Set_1_Ryan_Trisnadi.csv\"\n", "df_original = pd.read_csv(file_path)\n", "\n", "# Create a new DataFrame with the selected columns\n", "index_columns = ['limit_balance', 'sex', 'education_level', 'marital_status', 'age',\n", " 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt_1',\n", " 'bill_amt_2', 'bill_amt_3', 'bill_amt_4', 'bill_amt_5', 'bill_amt_6',\n", " 'pay_amt_1', 'pay_amt_2', 'pay_amt_3', 'pay_amt_4', 'pay_amt_5',\n", " 'pay_amt_6', 'default_payment_next_month']\n", "df_data_dummy = df_original[index_columns].copy()\n", "\n", "# Display the new DataFrame\n", "print(df_data_dummy.head())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the original, unprocessed data as dummy data, stored in `df_data_dummy`." ] }, { "cell_type": "code", "execution_count": 425, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2965 entries, 0 to 2964\n", "Data columns (total 24 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 limit_balance 2965 non-null float64\n", " 1 sex 2965 non-null int64 \n", " 2 education_level 2965 non-null int64 \n", " 3 marital_status 2965 non-null int64 \n", " 4 age 2965 non-null float64\n", " 5 pay_0 2965 non-null float64\n", " 6 pay_2 2965 non-null float64\n", " 7 pay_3 2965 non-null 
float64\n", " 8 pay_4 2965 non-null float64\n", " 9 pay_5 2965 non-null float64\n", " 10 pay_6 2965 non-null float64\n", " 11 bill_amt_1 2965 non-null float64\n", " 12 bill_amt_2 2965 non-null float64\n", " 13 bill_amt_3 2965 non-null float64\n", " 14 bill_amt_4 2965 non-null float64\n", " 15 bill_amt_5 2965 non-null float64\n", " 16 bill_amt_6 2965 non-null float64\n", " 17 pay_amt_1 2965 non-null float64\n", " 18 pay_amt_2 2965 non-null float64\n", " 19 pay_amt_3 2965 non-null float64\n", " 20 pay_amt_4 2965 non-null float64\n", " 21 pay_amt_5 2965 non-null float64\n", " 22 pay_amt_6 2965 non-null float64\n", " 23 default_payment_next_month 2965 non-null int64 \n", "dtypes: float64(20), int64(4)\n", "memory usage: 556.1 KB\n" ] } ], "source": [ "df_data_dummy.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the dummy data matches the original. " ] }, { "cell_type": "code", "execution_count": 426, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['limit_balance', 'sex', 'education_level', 'marital_status', 'age',\n", " 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt_1',\n", " 'bill_amt_2', 'bill_amt_3', 'bill_amt_4', 'bill_amt_5', 'bill_amt_6',\n", " 'pay_amt_1', 'pay_amt_2', 'pay_amt_3', 'pay_amt_4', 'pay_amt_5',\n", " 'pay_amt_6', 'default_payment_next_month'],\n", " dtype='object')" ] }, "execution_count": 426, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_data_dummy.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the columns of the dummy data are as expected." ] }, { "cell_type": "code", "execution_count": 427, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[8.86075949e-02 2.00000000e-01 2.22222222e-01 ... 1.39665164e-03\n", " 8.62453532e-04 6.00000000e+00]\n", " [2.40506329e-01 2.00000000e-01 2.22222222e-01 ... 
6.44212013e-03\n", " 7.43494424e-03 4.00000000e+00]\n", " [1.26582278e-02 2.00000000e-01 2.22222222e-01 ... 3.86527208e-03\n", " 0.00000000e+00 6.00000000e+00]\n", " ...\n", " [5.56962025e-01 0.00000000e+00 0.00000000e+00 ... 1.00497074e-03\n", " 9.66542751e-04 2.00000000e+00]\n", " [5.06329114e-02 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00\n", " 1.93308550e-03 2.00000000e+00]\n", " [3.54430380e-01 3.00000000e-01 0.00000000e+00 ... 1.00497074e-03\n", " 1.63990087e-02 2.00000000e+00]]\n" ] } ], "source": [ "\n", "# Assuming df_data_dummy, selected_num_cols, and selected_cat_cols are defined\n", "# .copy() gives independent frames, so the in-place fillna below does not act on a slice of df_data_dummy\n", "data_inference_num = df_data_dummy[selected_num_cols].copy()\n", "data_inference_cat = df_data_dummy[selected_cat_cols].copy()\n", "\n", "data_inference_num.fillna(0, inplace=True) # Assuming missing values are filled with 0\n", "data_inference_cat.fillna('Unknown', inplace=True) # Filling categorical missing values with 'Unknown'\n", "\n", "# NOTE: this refits the loaded MinMaxScaler on the inference data, replacing the min/max learned during training.\n", "# For real inference, skip this fit and call only scaler.transform(...) on the loaded scaler.\n", "scaler.fit(data_inference_num)\n", "\n", "# Transform the numerical features using the fitted scaler\n", "data_inference_num_scaled = scaler.transform(data_inference_num)\n", "\n", "# Concatenate the scaled numerical features and the categorical features\n", "data_inference_final = np.concatenate([data_inference_num_scaled, data_inference_cat], axis=1)\n", "\n", "# Now data_inference_final contains both scaled numerical features and categorical features\n", "print(data_inference_final)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We split the dummy data into two variables using the numerical and categorical column lists from the original data, scale the numerical part with the scaler, and join the two parts again with `np.concatenate` to build the inference input. Note that the cell refits the scaler on the inference data; in a real deployment only `transform` should be called, so that the training-time scaling is reused. We can then compare the predicted output on the dummy data with the target variable. 
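\n", "\n", "The sketch below restates this step without any libraries, under one assumption that differs from the cell above: the scaling statistics are fitted once on training rows and are only applied to the inference rows. `MinMaxStats` and all numbers are hypothetical and exist only for illustration; in the notebook this role is played by the loaded scaler and `np.concatenate`.\n", "\n", "
```python
# Library-free sketch: min-max scaling whose statistics come from training data only.

class MinMaxStats:
    '''Hypothetical helper that stores per-column min and max (not a scikit-learn class).'''

    def fit(self, rows):
        # Transpose the rows so that each tuple holds one column.
        columns = list(zip(*rows))
        self.mins = [min(col) for col in columns]
        self.maxs = [max(col) for col in columns]
        return self

    def transform(self, rows):
        scaled = []
        for row in rows:
            scaled_row = []
            for value, lo, hi in zip(row, self.mins, self.maxs):
                span = hi - lo
                # A constant training column maps to 0.0 to avoid dividing by zero.
                scaled_row.append((value - lo) / span if span else 0.0)
            scaled.append(scaled_row)
        return scaled


# Made-up numbers: two numerical columns for training and for inference.
train_num = [[10000.0, 0.0], [50000.0, 2.0], [90000.0, 4.0]]
infer_num = [[30000.0, 1.0], [90000.0, 4.0]]
infer_cat = [[2], [6]]  # one categorical column, left unscaled

# Fit once on training data, then only transform at inference time.
stats = MinMaxStats().fit(train_num)
infer_num_scaled = stats.transform(infer_num)

# Join scaled numerical and raw categorical columns row by row.
infer_final = [num + cat for num, cat in zip(infer_num_scaled, infer_cat)]
print(infer_final)
```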
" ] }, { "cell_type": "code", "execution_count": 428, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['education_level']\n", "['limit_balance', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'pay_amt_1', 'pay_amt_2', 'pay_amt_3', 'pay_amt_4', 'pay_amt_5', 'pay_amt_6']\n" ] } ], "source": [ "print(selected_cat_cols)\n", "print(selected_num_cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lihat kolom yang akan digunakan karena terpilih sebagai signifikan data." ] }, { "cell_type": "code", "execution_count": 429, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predicted Overall: 0.00\n" ] } ], "source": [ "# Predict the score\n", "predicted_score = lg.predict(data_inference_final)\n", "\n", "# Show result\n", "print(f\"Predicted Overall: {predicted_score[0]:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Insight: Dari data inferensial, Kita bisa lihat bahwa jika faktor variable sudah semua dimasukan kedalam model, maka output dari target variable akan keluarkan 0.00. Ini artinya data dummy dan original tidak sesuai untuk model tersebut. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setelah kita menggunakan test Logistic Regression, KNN, dan SVM, kita bisa lihat perbandingan model yang digunakan oleh machine learning untuk prediksi bagaimana client bisa diprediksi untuk default payment atau tidak. Karena saking banyak numerik data yang harus diolah, maka interpretasi dari setiap test harus bisa disimpulkan dalam parameter dan cross validation. Di konteks test ini, SVM telah dipakai karena skor dari akurasi, recall, precision, dan F-1 rata-rata yang paling tinggi, tetapi ini tidak selalu benar dalam setiap situasi. 
For example, a dataset whose age, sex, marital_status, or education_level values are distributed differently could substantially change the results of the regression or of the KNN and SVM models.\n", "\n", "We used SVM for cross-validation and parameter tuning here because it is more effective in higher-dimensional spaces, and this dataset has many features. We can also choose between linear, polynomial, and RBF kernels to match the complexity of the dataset with linear or non-linear decision boundaries. The model can therefore be adapted to however the dataset is split: as analysts we do not know the distribution of each sample, and SVM was the most flexible model here for describing the relationship between \"default_payment\" and the numerical and categorical variables in the dataset. \n", "\n", "In conclusion, limit_balance, education_level, pay, and pay_amt have the most significant impact on predicting whether a client will pay off their bill. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Questions:\n", "\n", " What risks do bank clients with a very high limit balance carry with respect to their debt?\n", " Clients with a higher limit balance tend to pay their bills on time rather than default. \n", "\n", " Does \"education\" affect default_payment_next_month?\n", " Clients with a higher education level have a better ratio of repaid debt. \n", "\n", " Is there a difference between \"pay\" and \"bill_amt\" from which we can infer a client's ability to repay? \n", " Clients whose pay is closest to bill_amt are more likely to repay their debt in that month and not to default in the following month. 
\n", " \n", " Which columns carry the highest risk?\n", " Limit balance, education level, and the repayment columns (pay and pay_amt) are the factors with the largest impact on a client's \"default\" payment, consistent with the features selected above. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conceptual Problems\n", "### Answer the following questions:\n", "\n", " What is meant by a coefficient in logistic regression?\n", " A coefficient in logistic regression is the weight that indicates how strongly a feature influences the model's prediction. Each coefficient is the change in the log-odds of the target class for a one-unit increase in that feature, with the other features held constant. A positive coefficient raises the predicted probability of the target class, while a negative coefficient lowers it.\n", "\n", " What does the kernel parameter do in SVM? Explain one kernel that you understand!\n", " The kernel parameter of a Support Vector Machine (SVM) selects the kernel function, which implicitly maps the data into a higher-dimensional feature space. One commonly used kernel is the RBF (Radial Basis Function) kernel. It lets SVM handle data that is not linearly separable by implicitly mapping it into a higher-dimensional space where a linear separator becomes more likely.\n", "\n", " How do you choose the optimal K in KNN?\n", " The optimal K for K-Nearest Neighbors (KNN) can be chosen with cross-validation. With this technique we split the training data into subsets for training and validation, and then measure the performance of the KNN model for a range of K values. 
The K value that gives the best average performance on the validation folds is the optimal K for the KNN model.\n", "\n", " What do the following metrics mean: Accuracy, Precision, Recall, F1 Score, and when is each one appropriate?\n", " Accuracy: Measures how often the model predicts correctly out of all predictions made. Appropriate when the target classes are balanced.\n", " Precision: Measures the proportion of positive predictions that are actually positive. Useful when it is important to minimize false positives.\n", " Recall: Measures the proportion of actual positives that the model predicts correctly. Useful when it is important to minimize false negatives.\n", " F1 Score: The harmonic mean of Precision and Recall, used to balance the two. Useful when the target classes are imbalanced.\n", " Which metric to use depends on the goal, the characteristics of the data, and the problem at hand. For example, if the target classes are balanced, Accuracy can be a good metric. If the target classes are imbalanced, Precision, Recall, and F1 Score give a more accurate picture of model performance.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recommendation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This analysis ran into some problems while building the models, mainly the run time of hyperparameter tuning: CV and n_iter were set to smaller values than intended because processing took too long. Below are some suggestions for improving the analysis next time:\n", "\n", "1. 
During cross-validation and hyperparameter tuning, try larger CV and n_iter values, with a larger n_jobs to run the search in parallel and keep the run time manageable, so that the accuracy, precision, recall, and F1 scores are more reliable and may differ from the original results.\n", "\n", "2. Build the dummy inference data from more realistic, consistent feature values and test more than one record, so that the inference step can also produce a prediction other than 0.00." ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 2 }