{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 1: Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Name: Allen\n", "\n", "Batch: FTDS_BSD_003\n", "\n", "Dataset: `Hotel Reservation Datasets`\n", "\n", "Problem Statement:\n", "The Hotel Reservation Project explores customer booking activity and reservation status. Its main output is a model that predicts whether a customer will cancel a booking. Online reservation channels have streamlined the booking process, but they have also brought challenges: typical reasons for cancellation include changes of plans and scheduling conflicts, and cancelling is often made easier by the option to do so free of charge or at a low cost. That is beneficial to hotel guests, but it is a less desirable and possibly revenue-diminishing factor for hotels. A predictive model can help hotels manage this problem and minimize the resulting business loss. The dataset includes various features such as the number of adults and children, lead time, room type, and more. 
The target feature in this dataset is `booking_status`, which has two categories: \"Not_Canceled\" and \"Canceled\".\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 2: Import Libraries" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "# Import libraries\n", "import numpy as np\n", "import pandas as pd\n", "import pickle\n", "import phik\n", "from phik import resources, report\n", "pd.set_option('display.max_columns', None)\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import preprocessing\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.compose import ColumnTransformer\n", "from scipy import stats\n", "import seaborn as sns\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "from sklearn.svm import SVC\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.ensemble import AdaBoostClassifier\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.model_selection import RandomizedSearchCV\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.metrics import classification_report, ConfusionMatrixDisplay, precision_score, recall_score, accuracy_score, f1_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 3: Data Loading" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Booking_ID no_of_adults no_of_children no_of_weekend_nights \\\n", "0 INN00001 2 0 1 \n", "1 INN00002 2 0 2 \n", "2 INN00003 1 0 2 \n", "3 INN00004 2 0 0 \n", "4 INN00005 2 0 1 \n", "\n", " no_of_week_nights type_of_meal_plan required_car_parking_space \\\n", "0 2 Meal Plan 1 0 \n", "1 3 Not Selected 0 \n", "2 1 Meal Plan 1 0 \n", "3 2 Meal Plan 1 0 \n", "4 1 Not Selected 0 \n", "\n", " room_type_reserved lead_time arrival_year arrival_month arrival_date \\\n", "0 Room_Type 1 224 2017 10 2 \n", "1 Room_Type 1 5 2018 11 6 \n", "2 Room_Type 1 1 2018 2 28 \n", "3 Room_Type 1 211 2018 5 20 \n", "4 Room_Type 1 48 2018 4 11 \n", "\n", " market_segment_type repeated_guest no_of_previous_cancellations \\\n", "0 Offline 0 0 \n", "1 Online 0 0 \n", "2 Online 0 0 \n", "3 Online 0 0 \n", "4 Online 0 0 \n", "\n", " no_of_previous_bookings_not_canceled avg_price_per_room \\\n", "0 0 65.00 \n", "1 0 106.68 \n", "2 0 60.00 \n", "3 0 100.00 \n", "4 0 94.50 \n", "\n", " no_of_special_requests booking_status \n", "0 0 Not_Canceled \n", "1 1 Not_Canceled \n", "2 0 Canceled \n", "3 0 Canceled \n", "4 0 Canceled " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the dataset\n", "df_ori = pd.read_csv('hotel_reservations.csv')\n", "\n", "\n", "# Make a copy of df_ori\n", "df = df_ori.copy()\n", "\n", "# Show the first 5 rows\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before starting the analysis, let us take a closer look at the columns in the dataset:\n", " - no_of_adults: The number of adults in the reservation (categorical: 5 values)\n", " - no_of_children: The number of children in the reservation (categorical: 6 values)\n", " - no_of_weekend_nights: The number of weekend nights included in the reservation (categorical: 8 values)\n", " - no_of_week_nights: The number of weeknights included in the reservation (categorical: 18 values)\n", " - type_of_meal_plan: The type of 
meal plan chosen (categorical: 4 values)\n", " - required_car_parking_space: Whether a car parking space is required (binary: 2 values)\n", " - room_type_reserved: The type of room reserved (categorical: 7 values)\n", " - lead_time: The number of days between booking and arrival (numerical: 352 unique values, ranging from 0 to 443 days)\n", " - arrival_year: The year of arrival (categorical: 2 values)\n", " - arrival_month: The month of arrival (categorical: 12 values)\n", " - arrival_date: The day of arrival (categorical: 31 values)\n", " - market_segment_type: The type of market segment (categorical: 5 values)\n", " - repeated_guest: Whether the guest is a repeated customer (binary: 2 values)\n", " - no_of_previous_cancellations: The number of previous cancellations by the guest (categorical: 9 values)\n", " - no_of_previous_bookings_not_canceled: The number of previous bookings not canceled by the guest (numerical: 59 unique values, ranging from 0 to 58)\n", " - avg_price_per_room: The average price per room (numerical: 3930 unique values, ranging from 0 to 540)\n", " - no_of_special_requests: The number of special requests made by the guest (categorical: 6 values)\n", " - booking_status: The target variable, indicating whether the reservation was canceled or not (binary: 2 values)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 36275 entries, 0 to 36274\n", "Data columns (total 19 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Booking_ID 36275 non-null object \n", " 1 no_of_adults 36275 non-null int64 \n", " 2 no_of_children 36275 non-null int64 \n", " 3 no_of_weekend_nights 36275 non-null int64 \n", " 4 no_of_week_nights 36275 non-null int64 \n", " 5 type_of_meal_plan 36275 non-null object \n", " 6 required_car_parking_space 36275 non-null int64 \n", " 7 room_type_reserved 36275 non-null object \n", " 8 lead_time 36275 non-null int64 \n", " 9 arrival_year 36275 non-null int64 \n", " 
10 arrival_month 36275 non-null int64 \n", " 11 arrival_date 36275 non-null int64 \n", " 12 market_segment_type 36275 non-null object \n", " 13 repeated_guest 36275 non-null int64 \n", " 14 no_of_previous_cancellations 36275 non-null int64 \n", " 15 no_of_previous_bookings_not_canceled 36275 non-null int64 \n", " 16 avg_price_per_room 36275 non-null float64\n", " 17 no_of_special_requests 36275 non-null int64 \n", " 18 booking_status 36275 non-null object \n", "dtypes: float64(1), int64(13), object(5)\n", "memory usage: 5.3+ MB\n" ] } ], "source": [ "# Checking all data\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['INN00001', 'INN00002', 'INN00003', ..., 'INN36273', 'INN36274',\n", " 'INN36275'], dtype=object)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.Booking_ID.unique() " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 1, 3, 0, 4], dtype=int64)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.no_of_adults.unique() " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 2, 1, 3, 10, 9], dtype=int64)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.no_of_children.unique() " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 0, 4, 3, 6, 5, 7], dtype=int64)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.no_of_weekend_nights.unique() " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 2, 3, 1, 4, 5, 0, 10, 6, 11, 7, 15, 9, 
13, 8, 14, 12, 17,\n", " 16], dtype=int64)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.no_of_week_nights.unique() " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Meal Plan 1', 'Not Selected', 'Meal Plan 2', 'Meal Plan 3'],\n", " dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.type_of_meal_plan.unique() " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1], dtype=int64)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.required_car_parking_space.unique()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Room_Type 1', 'Room_Type 4', 'Room_Type 2', 'Room_Type 6',\n", " 'Room_Type 5', 'Room_Type 7', 'Room_Type 3'], dtype=object)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.room_type_reserved.unique() " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([224, 5, 1, 211, 48, 346, 34, 83, 121, 44, 0, 35, 30,\n", " 95, 47, 256, 99, 12, 122, 2, 37, 130, 60, 56, 3, 107,\n", " 72, 23, 289, 247, 186, 64, 96, 41, 55, 146, 32, 57, 7,\n", " 124, 169, 6, 51, 13, 100, 139, 117, 39, 86, 19, 192, 179,\n", " 26, 74, 143, 177, 18, 267, 155, 46, 128, 20, 40, 196, 188,\n", " 17, 110, 68, 73, 92, 171, 134, 320, 118, 189, 16, 24, 8,\n", " 10, 182, 116, 123, 105, 443, 317, 286, 148, 14, 85, 25, 28,\n", " 80, 11, 162, 82, 27, 245, 266, 112, 88, 69, 273, 4, 97,\n", " 31, 62, 197, 280, 185, 160, 104, 22, 292, 109, 126, 303, 81,\n", " 54, 15, 161, 147, 87, 127, 418, 156, 58, 433, 111, 195, 119,\n", " 59, 78, 335, 
103, 70, 76, 144, 49, 77, 36, 79, 21, 33,\n", " 164, 152, 43, 102, 71, 209, 93, 53, 302, 239, 45, 167, 113,\n", " 84, 9, 166, 174, 61, 151, 52, 67, 282, 38, 175, 89, 133,\n", " 65, 66, 50, 159, 386, 115, 237, 125, 91, 29, 221, 213, 198,\n", " 75, 180, 236, 120, 230, 63, 136, 309, 157, 268, 217, 94, 305,\n", " 98, 42, 154, 330, 137, 184, 232, 304, 114, 257, 265, 191, 101,\n", " 259, 149, 170, 271, 207, 108, 210, 222, 296, 194, 145, 153, 275,\n", " 158, 301, 349, 200, 315, 181, 263, 176, 141, 270, 150, 359, 244,\n", " 219, 142, 138, 276, 178, 163, 377, 290, 216, 226, 258, 254, 193,\n", " 131, 208, 215, 190, 381, 231, 248, 106, 308, 140, 173, 168, 172,\n", " 90, 249, 205, 129, 212, 135, 220, 277, 253, 132, 183, 255, 223,\n", " 336, 288, 229, 319, 199, 203, 228, 246, 235, 294, 281, 202, 361,\n", " 287, 291, 313, 206, 269, 279, 261, 214, 274, 250, 187, 240, 241,\n", " 323, 322, 227, 225, 233, 338, 283, 327, 204, 352, 165, 251, 299,\n", " 314, 285, 238, 328, 278, 332, 243, 201, 307, 272, 252, 242, 284,\n", " 297, 324, 260, 262, 326, 295, 218, 234, 353, 300, 355, 306, 298,\n", " 331, 341, 318, 333, 372, 311, 310, 345, 264, 325, 293, 348, 350,\n", " 351], dtype=int64)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.lead_time.unique() " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2017, 2018], dtype=int64)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.arrival_year.unique() " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([10, 11, 2, 5, 4, 9, 12, 7, 6, 8, 3, 1], dtype=int64)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.arrival_month.unique() " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ "array([ 2, 6, 28, 20, 11, 13, 15, 26, 18, 30, 5, 10, 4, 25, 22, 21, 19,\n", " 17, 7, 9, 27, 1, 29, 16, 3, 24, 14, 31, 23, 8, 12],\n", " dtype=int64)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.arrival_date.unique() " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Offline', 'Online', 'Corporate', 'Aviation', 'Complementary'],\n", " dtype=object)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.market_segment_type.unique() " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1], dtype=int64)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.repeated_guest.unique() " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 3, 1, 2, 11, 4, 5, 13, 6], dtype=int64)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.no_of_previous_cancellations.unique() " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 5, 1, 3, 4, 12, 19, 2, 15, 17, 7, 20, 16, 50, 13, 6, 14,\n", " 34, 18, 8, 10, 23, 11, 49, 47, 53, 9, 33, 22, 24, 52, 21, 48, 28,\n", " 39, 25, 31, 38, 26, 51, 42, 37, 35, 56, 44, 27, 32, 55, 45, 30, 57,\n", " 46, 54, 43, 58, 41, 29, 40, 36], dtype=int64)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.no_of_previous_bookings_not_canceled.unique() " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 65. , 106.68, 60. 
, ..., 118.43, 137.25, 167.8 ])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.avg_price_per_room.unique() " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 3, 2, 4, 5], dtype=int64)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.no_of_special_requests.unique() " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Not_Canceled', 'Canceled'], dtype=object)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Value\n", "df.booking_status.unique() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the information above, we can conclude that the dataset has 3 types of columns:\n", "1. Numerical Columns: columns that hold genuinely numerical values\n", "2. Categorical Columns: columns with the `object` data type\n", "3. Categorical Numeric Columns: columns that are categorical in meaning but stored as numbers\n", "\n", "Now we will assign every column to one of these types:\n", "1. Numerical Columns: \n", "- `no_of_adults`,\n", "- `no_of_children`,\n", "- `no_of_weekend_nights`,\n", "- `no_of_week_nights`,\n", "- `lead_time`,\n", "- `arrival_year`,\n", "- `arrival_month`,\n", "- `arrival_date`,\n", "- `no_of_previous_cancellations`,\n", "- `no_of_previous_bookings_not_canceled`, and \n", "- `avg_price_per_room`\n", "\n", "2. Categorical Columns: \n", "- `Booking_ID`,\n", "- `type_of_meal_plan`, \n", "- `room_type_reserved`,\n", "- `market_segment_type`, and \n", "- `booking_status`\n", "\n", "3. 
Categorical Numeric Columns: \n", "- `required_car_parking_space`,\n", "- `repeated_guest`, and \n", "- `no_of_special_requests`" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Booking_ID', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights',\n", " 'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space',\n", " 'room_type_reserved', 'lead_time', 'arrival_year', 'arrival_month',\n", " 'arrival_date', 'market_segment_type', 'repeated_guest',\n", " 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',\n", " 'avg_price_per_room', 'no_of_special_requests', 'booking_status'],\n", " dtype='object')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Show All Columns\n", "df.columns" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(36275, 19)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the row and the columns\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Booking_ID 36275\n", "no_of_adults 5\n", "no_of_children 6\n", "no_of_weekend_nights 8\n", "no_of_week_nights 18\n", "type_of_meal_plan 4\n", "required_car_parking_space 2\n", "room_type_reserved 7\n", "lead_time 352\n", "arrival_year 2\n", "arrival_month 12\n", "arrival_date 31\n", "market_segment_type 5\n", "repeated_guest 2\n", "no_of_previous_cancellations 9\n", "no_of_previous_bookings_not_canceled 59\n", "avg_price_per_room 3930\n", "no_of_special_requests 6\n", "booking_status 2\n", "dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show unique value\n", "df.nunique()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \\\n", "count 36275.000000 36275.000000 36275.000000 36275.000000 \n", "mean 1.844962 0.105279 0.810724 2.204300 \n", "std 0.518715 0.402648 0.870644 1.410905 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 2.000000 0.000000 0.000000 1.000000 \n", "50% 2.000000 0.000000 1.000000 2.000000 \n", "75% 2.000000 0.000000 2.000000 3.000000 \n", "max 4.000000 10.000000 7.000000 17.000000 \n", "\n", " required_car_parking_space lead_time arrival_year arrival_month \\\n", "count 36275.000000 36275.000000 36275.000000 36275.000000 \n", "mean 0.030986 85.232557 2017.820427 7.423653 \n", "std 0.173281 85.930817 0.383836 3.069894 \n", "min 0.000000 0.000000 2017.000000 1.000000 \n", "25% 0.000000 17.000000 2018.000000 5.000000 \n", "50% 0.000000 57.000000 2018.000000 8.000000 \n", "75% 0.000000 126.000000 2018.000000 10.000000 \n", "max 1.000000 443.000000 2018.000000 12.000000 \n", "\n", " arrival_date repeated_guest no_of_previous_cancellations \\\n", "count 36275.000000 36275.000000 36275.000000 \n", "mean 15.596995 0.025637 0.023349 \n", "std 8.740447 0.158053 0.368331 \n", "min 1.000000 0.000000 0.000000 \n", "25% 8.000000 0.000000 0.000000 \n", "50% 16.000000 0.000000 0.000000 \n", "75% 23.000000 0.000000 0.000000 \n", "max 31.000000 1.000000 13.000000 \n", "\n", " no_of_previous_bookings_not_canceled avg_price_per_room \\\n", "count 36275.000000 36275.000000 \n", "mean 0.153411 103.423539 \n", "std 1.754171 35.089424 \n", "min 0.000000 0.000000 \n", "25% 0.000000 80.300000 \n", "50% 0.000000 99.450000 \n", "75% 0.000000 120.000000 \n", "max 58.000000 540.000000 \n", "\n", " no_of_special_requests \n", "count 36275.000000 \n", "mean 0.619655 \n", "std 0.786236 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 0.000000 \n", "75% 1.000000 \n", "max 5.000000 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# looking data mean std min 
median max\n", "df.describe()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Booking_ID 0\n", "no_of_adults 0\n", "no_of_children 0\n", "no_of_weekend_nights 0\n", "no_of_week_nights 0\n", "type_of_meal_plan 0\n", "required_car_parking_space 0\n", "room_type_reserved 0\n", "lead_time 0\n", "arrival_year 0\n", "arrival_month 0\n", "arrival_date 0\n", "market_segment_type 0\n", "repeated_guest 0\n", "no_of_previous_cancellations 0\n", "no_of_previous_bookings_not_canceled 0\n", "avg_price_per_room 0\n", "no_of_special_requests 0\n", "booking_status 0\n", "dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count missing values per column\n", "df.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count duplicated rows\n", "df.duplicated().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 4: Exploratory Data Analysis (EDA)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we will look at the relationship between the hotel visitors' `type_of_meal_plan`, `market_segment_type`, and `room_type_reserved` and the target `booking_status`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Description of the Categorical Dataset:\n", "========================================\n" ] }, { "data": { "text/html": [ "
                     count  unique           top   freq
Booking_ID           36275   36275      INN00001      1
type_of_meal_plan    36275       4   Meal Plan 1  27835
room_type_reserved   36275       7   Room_Type 1  28130
market_segment_type  36275       5        Online  23214
booking_status       36275       2  Not_Canceled  24390
\n" ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create the DataFrame description\n", "print('Description of the Categorical Dataset:')\n", "print(\"=\" * 40)\n", "categorical_description = df.describe(include=['object', 'bool']).T\n", "\n", "# Define a styling function (the same style is applied to every cell)\n", "def style_description(s):\n", "    return 'background-color: blue; font-weight: bold;'\n", "\n", "# Apply the styling to the DataFrame\n", "styled_description = categorical_description.style.applymap(style_description)\n", "styled_description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's visualize the information above." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create subplots in a 1x2 grid\n", "fig, axes = plt.subplots(1, 2, figsize=(14, 6))\n", "\n", "# Plot 1: Booking Status Distribution (Pie Chart)\n", "booking_status_counts = df['booking_status'].value_counts()\n", "labels = booking_status_counts.index\n", "sizes = booking_status_counts.values\n", "\n", "axes[0].pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)\n", "axes[0].set_title('Booking Status Distribution')\n", "axes[0].axis('equal') # Equal aspect ratio ensures that the pie chart is circular\n", "\n", "# Plot 2: Frequency of Meal Plan by Booking Status (Countplot)\n", "sns.countplot(data=df, x=\"type_of_meal_plan\", hue=\"booking_status\", ax=axes[1])\n", "axes[1].set_title('Frequency of Meal Plan by Booking Status')\n", "\n", "# Display the plots in a tight layout\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the charts above, visitors who did not cancel their booking outnumber those who canceled, `67.2%` to `32.8%`. Comparing the meal plan choices of both groups, Meal Plan 1 is the most common, followed by Not Selected, then Meal Plan 2." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(8, 6))\n", "sns.countplot(data=df, x=\"market_segment_type\", hue=\"booking_status\")\n", "plt.title('Frequency of Market Segment Types by Booking Status')\n", "plt.legend(title=\"Booking Status\", loc=\"center left\", bbox_to_anchor=(1, 0.5))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the chart above, the `Online` market segment ranks first for both canceled and not-canceled bookings, followed by `Offline` in second place, Corporate in third, then Aviation, with Complementary in last position." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10, 6))\n", "sns.countplot(data=df, x=\"room_type_reserved\", hue=\"booking_status\")\n", "plt.title('Frequency of Room Reservation Types by Booking Status')\n", "plt.legend(title=\"Booking Status\", loc=\"center left\", bbox_to_anchor=(1, 0.5))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the chart above, `Room_Type 1` is the most frequently reserved room type for both booking statuses, and `Room_Type 4` is the second most popular." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(8, 6))\n", "sns.countplot(data=df, x=\"arrival_year\", hue=\"booking_status\")\n", "plt.title('Frequency of Arrival Year by Booking Status')\n", "plt.legend(title=\"Booking Status\", loc=\"center left\", bbox_to_anchor=(1, 0.5))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the chart above, most hotel reservations in the dataset were for arrivals in `2018`." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.countplot(data=df, x='booking_status', hue='arrival_month')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the chart above, most hotel reservations were for `October`. One possible explanation for why not-canceled bookings exceed canceled ones in that month is that it is close to the end-of-year holidays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 5: Feature Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we take a look at the class distribution of our dataset." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "booking_status\n", "Not_Canceled 24390\n", "Canceled 11885\n", "Name: count, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the class balance\n", "df['booking_status'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We encode the target as a numerical column using `LabelEncoder` because we want the target values to be 0 or 1. `LabelEncoder` assigns codes in sorted order of the class labels, so `Canceled` becomes 0 and `Not_Canceled` becomes 1." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label_encoder = preprocessing.LabelEncoder()\n", "df['booking_status'] = label_encoder.fit_transform(df['booking_status'])\n", "df['booking_status'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is imbalanced, but only moderately (about 67% to 33%), so this project does not use SMOTE to rebalance it. Because of the imbalance, we will not rely on accuracy to judge the predictions; instead we can use recall, 
precision, and F1 score, with F1 score as the main metric because it combines recall and precision." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Booking_ID 36275\n", "no_of_adults 5\n", "no_of_children 6\n", "no_of_weekend_nights 8\n", "no_of_week_nights 18\n", "type_of_meal_plan 4\n", "required_car_parking_space 2\n", "room_type_reserved 7\n", "lead_time 352\n", "arrival_year 2\n", "arrival_month 12\n", "arrival_date 31\n", "market_segment_type 5\n", "repeated_guest 2\n", "no_of_previous_cancellations 9\n", "no_of_previous_bookings_not_canceled 59\n", "avg_price_per_room 3930\n", "no_of_special_requests 6\n", "booking_status 2\n", "dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the number of unique values in every column\n", "df.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we drop `Booking_ID`: it is unique for every row, so it carries no information about the target." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "# Drop column\n", "df= df.drop(['Booking_ID'], axis =1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also drop the column `arrival_year` because it has only 2 unique values and would produce a misleading correlation with the target if we kept it." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "# Drop column\n", "df= df.drop(['arrival_year'], axis =1)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(36275, 17)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the number of rows and columns after dropping Booking_ID and arrival_year\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"interval columns not set, guessing: ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_month', 'arrival_date', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'booking_status']\n" ] }, { "data": { "text/plain": [ "booking_status 1.000000\n", "lead_time 0.567900\n", "no_of_special_requests 0.358571\n", "arrival_month 0.224787\n", "avg_price_per_room 0.222650\n", "repeated_guest 0.167245\n", "no_of_week_nights 0.137998\n", "required_car_parking_space 0.134456\n", "type_of_meal_plan 0.131484\n", "market_segment_type 0.122171\n", "no_of_weekend_nights 0.101977\n", "no_of_adults 0.078562\n", "no_of_previous_bookings_not_canceled 0.073844\n", "no_of_children 0.051908\n", "no_of_previous_cancellations 0.039806\n", "room_type_reserved 0.035153\n", "arrival_date 0.026595\n", "Name: booking_status, dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set Phik Correlation\n", "phik_matrix= df.phik_matrix()\n", "corr= phik_matrix['booking_status'].sort_values(ascending= False)\n", "\n", "#Display Results\n", "corr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the information above we can drop the columns that have weak correlation like below `0,1` correlation with the target, like `no_of_adults`,`no_of_previous_bookings_not_canceled`,`no_of_children`,`no_of_previous_cancellations`,`room_type_reserved`,and `arrival_date`" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',\n", " 'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space',\n", " 'room_type_reserved', 'lead_time', 'arrival_month', 'arrival_date',\n", " 'market_segment_type', 'repeated_guest', 'no_of_previous_cancellations',\n", " 
'no_of_previous_bookings_not_canceled', 'avg_price_per_room',\n", " 'no_of_special_requests', 'booking_status'],\n", " dtype='object')" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "df= df.drop(['no_of_adults','no_of_children','no_of_previous_cancellations',\n", " 'no_of_previous_bookings_not_canceled','room_type_reserved','arrival_date'], axis =1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# Split the data into features (X) and target (y)\n", "X= df.drop(['booking_status'], axis =1)\n", "# X= df_new.drop(columns=['booking_status'])\n", "y= df.booking_status\n", "# y= df_new['booking_status']" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>no_of_weekend_nights</th><th>no_of_week_nights</th><th>type_of_meal_plan</th><th>required_car_parking_space</th><th>lead_time</th><th>arrival_month</th><th>market_segment_type</th><th>repeated_guest</th><th>avg_price_per_room</th><th>no_of_special_requests</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>2</td><td>Meal Plan 1</td><td>0</td><td>224</td><td>10</td><td>Offline</td><td>0</td><td>65.00</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>3</td><td>Not Selected</td><td>0</td><td>5</td><td>11</td><td>Online</td><td>0</td><td>106.68</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>1</td><td>Meal Plan 1</td><td>0</td><td>1</td><td>2</td><td>Online</td><td>0</td><td>60.00</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>2</td><td>Meal Plan 1</td><td>0</td><td>211</td><td>5</td><td>Online</td><td>0</td><td>100.00</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>1</td><td>Not Selected</td><td>0</td><td>48</td><td>4</td><td>Online</td><td>0</td><td>94.50</td><td>0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>36270</th><td>2</td><td>6</td><td>Meal Plan 1</td><td>0</td><td>85</td><td>8</td><td>Online</td><td>0</td><td>167.80</td><td>1</td></tr>\n",
"    <tr><th>36271</th><td>1</td><td>3</td><td>Meal Plan 1</td><td>0</td><td>228</td><td>10</td><td>Online</td><td>0</td><td>90.95</td><td>2</td></tr>\n",
"    <tr><th>36272</th><td>2</td><td>6</td><td>Meal Plan 1</td><td>0</td><td>148</td><td>7</td><td>Online</td><td>0</td><td>98.39</td><td>2</td></tr>\n",
"    <tr><th>36273</th><td>0</td><td>3</td><td>Not Selected</td><td>0</td><td>63</td><td>4</td><td>Online</td><td>0</td><td>94.50</td><td>0</td></tr>\n",
"    <tr><th>36274</th><td>1</td><td>2</td><td>Meal Plan 1</td><td>0</td><td>207</td><td>12</td><td>Offline</td><td>0</td><td>161.67</td><td>0</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>36275 rows × 10 columns</p>\n", "</div>
" ], "text/plain": [ " no_of_weekend_nights no_of_week_nights type_of_meal_plan \\\n", "0 1 2 Meal Plan 1 \n", "1 2 3 Not Selected \n", "2 2 1 Meal Plan 1 \n", "3 0 2 Meal Plan 1 \n", "4 1 1 Not Selected \n", "... ... ... ... \n", "36270 2 6 Meal Plan 1 \n", "36271 1 3 Meal Plan 1 \n", "36272 2 6 Meal Plan 1 \n", "36273 0 3 Not Selected \n", "36274 1 2 Meal Plan 1 \n", "\n", " required_car_parking_space lead_time arrival_month \\\n", "0 0 224 10 \n", "1 0 5 11 \n", "2 0 1 2 \n", "3 0 211 5 \n", "4 0 48 4 \n", "... ... ... ... \n", "36270 0 85 8 \n", "36271 0 228 10 \n", "36272 0 148 7 \n", "36273 0 63 4 \n", "36274 0 207 12 \n", "\n", " market_segment_type repeated_guest avg_price_per_room \\\n", "0 Offline 0 65.00 \n", "1 Online 0 106.68 \n", "2 Online 0 60.00 \n", "3 Online 0 100.00 \n", "4 Online 0 94.50 \n", "... ... ... ... \n", "36270 Online 0 167.80 \n", "36271 Online 0 90.95 \n", "36272 Online 0 98.39 \n", "36273 Online 0 94.50 \n", "36274 Offline 0 161.67 \n", "\n", " no_of_special_requests \n", "0 0 \n", "1 1 \n", "2 0 \n", "3 0 \n", "4 0 \n", "... ... 
\n", "36270 1 \n", "36271 2 \n", "36272 2 \n", "36273 0 \n", "36274 0 \n", "\n", "[36275 rows x 10 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "36270 1\n", "36271 0\n", "36272 1\n", "36273 0\n", "36274 1\n", "Name: booking_status, Length: 36275, dtype: int32" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train shape: (27206, 10)\n", "X_test shape: (8615, 10)\n", "X_inference shape: (454, 10)\n", "y_train shape: (27206,)\n", "y_test shape: (8615,)\n", "y_inference shape: (454,)\n" ] } ], "source": [ "# Split the data into a train set (75%) and a temporary set (25%)\n", "X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.25, random_state=10)\n", "\n", "# Split the temporary set into a test set (95%) and an inference set (5%)\n", "X_test, X_inference, y_test, y_inference = train_test_split(X_temp, y_temp, test_size=0.05, random_state=10)\n", "\n", "# Print the shape (rows, columns) of every split\n", "print(\"X_train shape:\", X_train.shape)\n", "print(\"X_test shape:\", X_test.shape)\n", "print(\"X_inference shape:\", X_inference.shape)\n", "print(\"y_train shape:\", y_train.shape)\n", "print(\"y_test shape:\", y_test.shape)\n", "print(\"y_inference shape:\", y_inference.shape)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>no_of_weekend_nights</th><th>no_of_week_nights</th><th>type_of_meal_plan</th><th>required_car_parking_space</th><th>lead_time</th><th>arrival_month</th><th>market_segment_type</th><th>repeated_guest</th><th>avg_price_per_room</th><th>no_of_special_requests</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>18478</th><td>0</td><td>3</td><td>Meal Plan 1</td><td>0</td><td>20</td><td>9</td><td>Online</td><td>0</td><td>136.67</td><td>2</td></tr>\n",
"    <tr><th>11575</th><td>1</td><td>4</td><td>Meal Plan 1</td><td>0</td><td>11</td><td>7</td><td>Offline</td><td>0</td><td>85.00</td><td>0</td></tr>\n",
"    <tr><th>36108</th><td>2</td><td>2</td><td>Meal Plan 1</td><td>0</td><td>24</td><td>12</td><td>Online</td><td>0</td><td>95.20</td><td>1</td></tr>\n",
"    <tr><th>23151</th><td>0</td><td>2</td><td>Meal Plan 1</td><td>0</td><td>188</td><td>6</td><td>Offline</td><td>0</td><td>130.00</td><td>0</td></tr>\n",
"    <tr><th>19377</th><td>1</td><td>0</td><td>Meal Plan 1</td><td>0</td><td>286</td><td>10</td><td>Offline</td><td>0</td><td>90.00</td><td>0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " no_of_weekend_nights no_of_week_nights type_of_meal_plan \\\n", "18478 0 3 Meal Plan 1 \n", "11575 1 4 Meal Plan 1 \n", "36108 2 2 Meal Plan 1 \n", "23151 0 2 Meal Plan 1 \n", "19377 1 0 Meal Plan 1 \n", "\n", " required_car_parking_space lead_time arrival_month \\\n", "18478 0 20 9 \n", "11575 0 11 7 \n", "36108 0 24 12 \n", "23151 0 188 6 \n", "19377 0 286 10 \n", "\n", " market_segment_type repeated_guest avg_price_per_room \\\n", "18478 Online 0 136.67 \n", "11575 Offline 0 85.00 \n", "36108 Online 0 95.20 \n", "23151 Offline 0 130.00 \n", "19377 Offline 0 90.00 \n", "\n", " no_of_special_requests \n", "18478 2 \n", "11575 0 \n", "36108 1 \n", "23151 0 \n", "19377 0 " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show Xtrain\n", "X_train.head()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "18478 1\n", "11575 1\n", "36108 1\n", "23151 0\n", "19377 0\n", "Name: booking_status, dtype: int32" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show ytrain\n", "y_train.head()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>no_of_weekend_nights</th><th>no_of_week_nights</th><th>type_of_meal_plan</th><th>required_car_parking_space</th><th>lead_time</th><th>arrival_month</th><th>market_segment_type</th><th>repeated_guest</th><th>avg_price_per_room</th><th>no_of_special_requests</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>3052</th><td>2</td><td>3</td><td>Meal Plan 1</td><td>0</td><td>241</td><td>10</td><td>Online</td><td>0</td><td>150.45</td><td>1</td></tr>\n",
"    <tr><th>1124</th><td>0</td><td>3</td><td>Not Selected</td><td>0</td><td>15</td><td>4</td><td>Online</td><td>0</td><td>117.67</td><td>0</td></tr>\n",
"    <tr><th>4052</th><td>1</td><td>2</td><td>Not Selected</td><td>0</td><td>13</td><td>10</td><td>Online</td><td>0</td><td>140.00</td><td>2</td></tr>\n",
"    <tr><th>29035</th><td>0</td><td>2</td><td>Meal Plan 1</td><td>0</td><td>4</td><td>12</td><td>Online</td><td>0</td><td>62.37</td><td>0</td></tr>\n",
"    <tr><th>22362</th><td>2</td><td>9</td><td>Meal Plan 2</td><td>0</td><td>5</td><td>2</td><td>Online</td><td>0</td><td>146.00</td><td>0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " no_of_weekend_nights no_of_week_nights type_of_meal_plan \\\n", "3052 2 3 Meal Plan 1 \n", "1124 0 3 Not Selected \n", "4052 1 2 Not Selected \n", "29035 0 2 Meal Plan 1 \n", "22362 2 9 Meal Plan 2 \n", "\n", " required_car_parking_space lead_time arrival_month \\\n", "3052 0 241 10 \n", "1124 0 15 4 \n", "4052 0 13 10 \n", "29035 0 4 12 \n", "22362 0 5 2 \n", "\n", " market_segment_type repeated_guest avg_price_per_room \\\n", "3052 Online 0 150.45 \n", "1124 Online 0 117.67 \n", "4052 Online 0 140.00 \n", "29035 Online 0 62.37 \n", "22362 Online 0 146.00 \n", "\n", " no_of_special_requests \n", "3052 1 \n", "1124 0 \n", "4052 2 \n", "29035 0 \n", "22362 0 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Xtest\n", "X_test.head()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3052 0\n", "1124 0\n", "4052 1\n", "29035 1\n", "22362 0\n", "Name: booking_status, dtype: int32" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show ytest\n", "y_test.head()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 2858],\n", " [ 1, 5757]], dtype=int64)" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check Class Distribution in Test-Set\n", "np.array(np.unique(y_test, return_counts=True)).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0 = Canceled\n", "1 = Not Canceled" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['no_of_weekend_nights', 'no_of_week_nights', 'type_of_meal_plan',\n", " 'required_car_parking_space', 'lead_time', 'arrival_month',\n", " 'market_segment_type', 'repeated_guest', 'avg_price_per_room',\n", " 'no_of_special_requests'],\n", " dtype='object')" ] }, "execution_count": 54, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check Cardinality for (Categorical and Categorical Numerical Columns)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of categories in the variable type_of_meal_plan : 4\n", "Number of categories in the variable required_car_parking_space : 2\n", "Number of categories in the variable market_segment_type : 5\n", "Number of categories in the variable repeated_guest : 2\n", "Number of categories in the variable no_of_special_requests : 6\n" ] } ], "source": [ "# untuk mengecek cardinality\n", "print('Number of categories in the variable type_of_meal_plan : {}'.format(len(df.type_of_meal_plan.unique())))\n", "print('Number of categories in the variable required_car_parking_space : {}'.format(len(df.required_car_parking_space.unique())))\n", "print('Number of categories in the variable market_segment_type : {}'.format(len(df.market_segment_type.unique())))\n", "print('Number of categories in the variable repeated_guest : {}'.format(len(df.repeated_guest.unique())))\n", "print('Number of categories in the variable no_of_special_requests : {}'.format(len(df.no_of_special_requests.unique())))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the information above our data categorical have low cardinality so it is `good fit`" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type_of_meal_plan : ['Meal Plan 1' 'Not Selected' 'Meal Plan 2' 'Meal Plan 3']\n", "market_segment_type : ['Online' 'Offline' 'Corporate' 'Complementary' 'Aviation']\n" ] } ], "source": [ "#Show all unique value of data categorical or object data type\n", "df= X_train.select_dtypes(include=['object','category']).columns.tolist()\n", "for columns in 
X_train[df]:\n", " \n", " print(f'{columns} : {X_train[columns].unique()}')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1], dtype=int64)" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Valeue of Categorical Numeric\n", "X_train.required_car_parking_space.unique()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1], dtype=int64)" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Valeue of Categorical Numeric\n", "X_train.repeated_guest.unique()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1, 3, 4, 5], dtype=int64)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show Unique Valeue of Categorical Numeric\n", "X_train.no_of_special_requests.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From information above we can said that 3 other columns are not data categorical but data categorical numeric `required_car_parking_space`, `repeated_guest `, and `no_of_special_requests` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking and Handling Outlier for (Numerical Columns)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "# Function to create histogram and boxplot.\n", "# This functions takes a dataframe (df) and the variable of interest as arguments.\n", "\n", "def diagnostic_plots(df, variable):\n", " # Define figure size\n", " plt.figure(figsize=(16, 4))\n", "\n", " # Histogram\n", " plt.subplot(1, 2, 1)\n", " sns.histplot(df[variable], bins=30)\n", " plt.title('Histogram')\n", "\n", " # Boxplot\n", " plt.subplot(1, 2, 2)\n", " sns.boxplot(y=df[variable])\n", " plt.title('Boxplot')\n", "\n", " 
plt.show()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# visualizing outliers for All columns\n", "ax = sns.boxplot(data=X_train, orient=\"h\", palette=\"Set2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "okay now we will handle outlier to data type float or numerical continuous columns" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 27206 entries, 18478 to 17673\n", "Data columns (total 10 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 no_of_weekend_nights 27206 non-null int64 \n", " 1 no_of_week_nights 27206 non-null int64 \n", " 2 type_of_meal_plan 27206 non-null object \n", " 3 required_car_parking_space 27206 non-null int64 \n", " 4 lead_time 27206 non-null int64 \n", " 5 arrival_month 27206 non-null int64 \n", " 6 market_segment_type 27206 non-null object \n", " 7 repeated_guest 27206 non-null int64 \n", " 8 avg_price_per_room 27206 non-null float64\n", " 9 no_of_special_requests 27206 non-null int64 \n", "dtypes: float64(1), int64(7), object(2)\n", "memory usage: 2.3+ MB\n" ] } ], "source": [ "# Show Data type\n", "X_train.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "from information above we will handle outlier with lots of unique values" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distribution of column `no_of_weekend_nights`: 0.7183013056277611\n", "Distribution of column `no_of_week_nights` : 1.4856608752054041\n", "Distribution of column `lead_time` : 1.300929223383215\n", "Distribution of column `arrival_month` : -0.34560081465019166\n", "Distribution of column `repeated_guest` : 6.081103931170396\n", "Distribution of column `avg_price_per_room` : 0.5983008821448975\n", "Distribution of column `no_of_special_requests` : 1.1441190231555718\n" ] } ], "source": 
[ "# Let's check whether a distribution is normal or not all columns\n", "\n", "print('Distribution of column `no_of_weekend_nights`: ', X_train['no_of_weekend_nights'].skew())\n", "print('Distribution of column `no_of_week_nights` : ', X_train['no_of_week_nights'].skew())\n", "print('Distribution of column `lead_time` : ', X_train['lead_time'].skew())\n", "print('Distribution of column `arrival_month` : ', X_train['arrival_month'].skew())\n", "print('Distribution of column `repeated_guest` : ', X_train['repeated_guest'].skew())\n", "print('Distribution of column `avg_price_per_room` : ', X_train['avg_price_per_room'].skew())\n", "print('Distribution of column `no_of_special_requests` : ', X_train['no_of_special_requests'].skew())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "from information above the data with score between `-0.5` and `0.5` have gausian or normal distribution but if the data have score upper and lower that standard means data skew or have outlier. Here we will do capping method to handling outlier because we dont want to drop column anymore, the data gausian will count using `Gausian` and data skew using `IQR`" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "def handling_outliers(X_train, columns):\n", " for column in columns:\n", " # calculate Q1 and Q3 of the column\n", " Q1 = X_train[column].quantile(0.25)\n", " Q3 = X_train[column].quantile(0.75)\n", " IQR = Q3 - Q1\n", "\n", " # define the fences\n", " lower_fence = Q1 - 1.5 * IQR\n", " upper_fence = Q3 + 1.5 * IQR\n", "\n", " # cap the outliers to the nearest actual data point\n", " X_train[column] = np.where(X_train[column] > upper_fence, upper_fence, \n", " np.where(X_train[column] < lower_fence, lower_fence, X_train[column]))\n", " \n", " return X_train\n", "columns = ['no_of_weekend_nights','no_of_week_nights',\n", " 'lead_time','arrival_month','repeated_guest',\n", " 'avg_price_per_room','no_of_special_requests']\n", 
"X_train_capped= handling_outliers(X_train,columns)\n", "X_test_capped= handling_outliers(X_test,columns)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# visualizing clearing outlier\n", "ax = sns.boxplot(data=X_train_capped, orient=\"h\", palette=\"Set2\")" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# visualizing clearing outlier\n", "ax = sns.boxplot(data=X_test_capped, orient=\"h\", palette=\"Set2\")" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>no_of_weekend_nights</th><th>no_of_week_nights</th><th>type_of_meal_plan</th><th>required_car_parking_space</th><th>lead_time</th><th>arrival_month</th><th>market_segment_type</th><th>repeated_guest</th><th>avg_price_per_room</th><th>no_of_special_requests</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>18478</th><td>0.0</td><td>3.0</td><td>Meal Plan 1</td><td>0</td><td>20.0</td><td>9.0</td><td>Online</td><td>0.0</td><td>136.67</td><td>2.0</td></tr>\n",
"    <tr><th>11575</th><td>1.0</td><td>4.0</td><td>Meal Plan 1</td><td>0</td><td>11.0</td><td>7.0</td><td>Offline</td><td>0.0</td><td>85.00</td><td>0.0</td></tr>\n",
"    <tr><th>36108</th><td>2.0</td><td>2.0</td><td>Meal Plan 1</td><td>0</td><td>24.0</td><td>12.0</td><td>Online</td><td>0.0</td><td>95.20</td><td>1.0</td></tr>\n",
"    <tr><th>23151</th><td>0.0</td><td>2.0</td><td>Meal Plan 1</td><td>0</td><td>188.0</td><td>6.0</td><td>Offline</td><td>0.0</td><td>130.00</td><td>0.0</td></tr>\n",
"    <tr><th>19377</th><td>1.0</td><td>0.0</td><td>Meal Plan 1</td><td>0</td><td>286.0</td><td>10.0</td><td>Offline</td><td>0.0</td><td>90.00</td><td>0.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " no_of_weekend_nights no_of_week_nights type_of_meal_plan \\\n", "18478 0.0 3.0 Meal Plan 1 \n", "11575 1.0 4.0 Meal Plan 1 \n", "36108 2.0 2.0 Meal Plan 1 \n", "23151 0.0 2.0 Meal Plan 1 \n", "19377 1.0 0.0 Meal Plan 1 \n", "\n", " required_car_parking_space lead_time arrival_month \\\n", "18478 0 20.0 9.0 \n", "11575 0 11.0 7.0 \n", "36108 0 24.0 12.0 \n", "23151 0 188.0 6.0 \n", "19377 0 286.0 10.0 \n", "\n", " market_segment_type repeated_guest avg_price_per_room \\\n", "18478 Online 0.0 136.67 \n", "11575 Offline 0.0 85.00 \n", "36108 Online 0.0 95.20 \n", "23151 Offline 0.0 130.00 \n", "19377 Offline 0.0 90.00 \n", "\n", " no_of_special_requests \n", "18478 2.0 \n", "11575 0.0 \n", "36108 1.0 \n", "23151 0.0 \n", "19377 0.0 " ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show new Xtrain clean data frame from outlier \n", "X_train_capped.head()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>no_of_weekend_nights</th><th>no_of_week_nights</th><th>type_of_meal_plan</th><th>required_car_parking_space</th><th>lead_time</th><th>arrival_month</th><th>market_segment_type</th><th>repeated_guest</th><th>avg_price_per_room</th><th>no_of_special_requests</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>3052</th><td>2.0</td><td>3.0</td><td>Meal Plan 1</td><td>0</td><td>241.0</td><td>10.0</td><td>Online</td><td>0.0</td><td>150.45</td><td>1.0</td></tr>\n",
"    <tr><th>1124</th><td>0.0</td><td>3.0</td><td>Not Selected</td><td>0</td><td>15.0</td><td>4.0</td><td>Online</td><td>0.0</td><td>117.67</td><td>0.0</td></tr>\n",
"    <tr><th>4052</th><td>1.0</td><td>2.0</td><td>Not Selected</td><td>0</td><td>13.0</td><td>10.0</td><td>Online</td><td>0.0</td><td>140.00</td><td>2.0</td></tr>\n",
"    <tr><th>29035</th><td>0.0</td><td>2.0</td><td>Meal Plan 1</td><td>0</td><td>4.0</td><td>12.0</td><td>Online</td><td>0.0</td><td>62.37</td><td>0.0</td></tr>\n",
"    <tr><th>22362</th><td>2.0</td><td>6.0</td><td>Meal Plan 2</td><td>0</td><td>5.0</td><td>2.0</td><td>Online</td><td>0.0</td><td>146.00</td><td>0.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " no_of_weekend_nights no_of_week_nights type_of_meal_plan \\\n", "3052 2.0 3.0 Meal Plan 1 \n", "1124 0.0 3.0 Not Selected \n", "4052 1.0 2.0 Not Selected \n", "29035 0.0 2.0 Meal Plan 1 \n", "22362 2.0 6.0 Meal Plan 2 \n", "\n", " required_car_parking_space lead_time arrival_month \\\n", "3052 0 241.0 10.0 \n", "1124 0 15.0 4.0 \n", "4052 0 13.0 10.0 \n", "29035 0 4.0 12.0 \n", "22362 0 5.0 2.0 \n", "\n", " market_segment_type repeated_guest avg_price_per_room \\\n", "3052 Online 0.0 150.45 \n", "1124 Online 0.0 117.67 \n", "4052 Online 0.0 140.00 \n", "29035 Online 0.0 62.37 \n", "22362 Online 0.0 146.00 \n", "\n", " no_of_special_requests \n", "3052 1.0 \n", "1124 0.0 \n", "4052 2.0 \n", "29035 0.0 \n", "22362 0.0 " ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show new Xtest clean data frame from outlier \n", "X_test_capped.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After handling outlier we will define our first pipeline that is our scaler using Standard Scaler, then we weill continue this project enter to the next step `Model Definition`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This project using standard scaler for scaling method because here we dont want to change our dataset into range value (0,1) using `Min-Max Scaler `and we will use `One Hot Encoder` as Encoding method for our dataset because here in our data columns there are nominal categorical features." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ColumnTransformer(remainder='passthrough',\n",
       "                  transformers=[('pipe_num',\n",
       "                                 Pipeline(steps=[('scaler', StandardScaler())]),\n",
       "                                 ['no_of_week_nights', 'lead_time',\n",
       "                                  'avg_price_per_room', 'arrival_month']),\n",
       "                                ('pipe_cat',\n",
       "                                 Pipeline(steps=[('encoder', OneHotEncoder())]),\n",
       "                                 ['market_segment_type', 'type_of_meal_plan'])])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "ColumnTransformer(remainder='passthrough',\n", " transformers=[('pipe_num',\n", " Pipeline(steps=[('scaler', StandardScaler())]),\n", " ['no_of_week_nights', 'lead_time',\n", " 'avg_price_per_room', 'arrival_month']),\n", " ('pipe_cat',\n", " Pipeline(steps=[('encoder', OneHotEncoder())]),\n", " ['market_segment_type', 'type_of_meal_plan'])])" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Define cat col,num col, dan numcat col\n", "num_col =['no_of_week_nights', 'lead_time','avg_price_per_room','arrival_month']\n", "cat_col= ['market_segment_type','type_of_meal_plan']\n", "num_cat_col=['required_car_parking_space','repeated_guest','no_of_special_requests']\n", "#Define num pipeline\n", "num_pipeline= Pipeline([\n", " ('scaler',StandardScaler())])\n", "#Define cat pipeline\n", "cat_pipeline= Pipeline([\n", " ('encoder',OneHotEncoder())])\n", "\n", "# Concate 2 pipeline\n", "preprocessing_pipeline = ColumnTransformer([\n", " ('pipe_num',num_pipeline,num_col),\n", " ('pipe_cat',cat_pipeline,cat_col)\n", "], remainder= 'passthrough')\n", "\n", "preprocessing_pipeline.fit(X_train_capped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bab 6: Model Definition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this stage we will choose the best model to use in the analysis from 5 models, namely support vector machine (SVM), K-nearest neighbors (KNN), Decision Tree, Random Forest, and Boosting using a pipeline using" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "#Model Define\n", "svm_model = SVC()\n", "knn_model = KNeighborsClassifier()\n", "dt_model = DecisionTreeClassifier(random_state=10)\n", "rf_model = RandomForestClassifier(random_state=10)\n", "ada_model = AdaBoostClassifier()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "# Bab 7: Model Evaluation" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "# Function untuk performance check dan juga metrics f1_score\n", "def performance_check(clf, X, y):\n", " y_pred = clf.predict(X)\n", " \n", "\n", " return f1_score(y, y_pred)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preprocessing_pipeline',\n",
       "                 ColumnTransformer(remainder='passthrough',\n",
       "                                   transformers=[('pipe_num',\n",
       "                                                  Pipeline(steps=[('scaler',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['no_of_week_nights',\n",
       "                                                   'lead_time',\n",
       "                                                   'avg_price_per_room',\n",
       "                                                   'arrival_month']),\n",
       "                                                 ('pipe_cat',\n",
       "                                                  Pipeline(steps=[('encoder',\n",
       "                                                                   OneHotEncoder())]),\n",
       "                                                  ['market_segment_type',\n",
       "                                                   'type_of_meal_plan'])])),\n",
       "                ('svm', SVC())])
" ], "text/plain": [ "Pipeline(steps=[('preprocessing_pipeline',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('pipe_num',\n", " Pipeline(steps=[('scaler',\n", " StandardScaler())]),\n", " ['no_of_week_nights',\n", " 'lead_time',\n", " 'avg_price_per_room',\n", " 'arrival_month']),\n", " ('pipe_cat',\n", " Pipeline(steps=[('encoder',\n", " OneHotEncoder())]),\n", " ['market_segment_type',\n", " 'type_of_meal_plan'])])),\n", " ('svm', SVC())])" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Method 1 : Model training SVM using `Pipeline`\n", "pipe_svm = Pipeline([\n", " ('preprocessing_pipeline',preprocessing_pipeline),\n", " ('svm', SVC())])\n", "pipe_svm.fit(X_train_capped, y_train)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preprocessing_pipeline',\n",
       "                 ColumnTransformer(remainder='passthrough',\n",
       "                                   transformers=[('pipe_num',\n",
       "                                                  Pipeline(steps=[('scaler',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['no_of_week_nights',\n",
       "                                                   'lead_time',\n",
       "                                                   'avg_price_per_room',\n",
       "                                                   'arrival_month']),\n",
       "                                                 ('pipe_cat',\n",
       "                                                  Pipeline(steps=[('encoder',\n",
       "                                                                   OneHotEncoder())]),\n",
       "                                                  ['market_segment_type',\n",
       "                                                   'type_of_meal_plan'])])),\n",
       "                ('knn', KNeighborsClassifier())])
" ], "text/plain": [ "Pipeline(steps=[('preprocessing_pipeline',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('pipe_num',\n", " Pipeline(steps=[('scaler',\n", " StandardScaler())]),\n", " ['no_of_week_nights',\n", " 'lead_time',\n", " 'avg_price_per_room',\n", " 'arrival_month']),\n", " ('pipe_cat',\n", " Pipeline(steps=[('encoder',\n", " OneHotEncoder())]),\n", " ['market_segment_type',\n", " 'type_of_meal_plan'])])),\n", " ('knn', KNeighborsClassifier())])" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Method 2 : Model training KNN using `Pipeline`\n", "pipe_knn = Pipeline([\n", " ('preprocessing_pipeline',preprocessing_pipeline),\n", " ('knn', KNeighborsClassifier())])\n", "pipe_knn.fit(X_train_capped, y_train)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preprocessing_pipeline',\n",
       "                 ColumnTransformer(remainder='passthrough',\n",
       "                                   transformers=[('pipe_num',\n",
       "                                                  Pipeline(steps=[('scaler',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['no_of_week_nights',\n",
       "                                                   'lead_time',\n",
       "                                                   'avg_price_per_room',\n",
       "                                                   'arrival_month']),\n",
       "                                                 ('pipe_cat',\n",
       "                                                  Pipeline(steps=[('encoder',\n",
       "                                                                   OneHotEncoder())]),\n",
       "                                                  ['market_segment_type',\n",
       "                                                   'type_of_meal_plan'])])),\n",
       "                ('dt', DecisionTreeClassifier())])
" ], "text/plain": [ "Pipeline(steps=[('preprocessing_pipeline',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('pipe_num',\n", " Pipeline(steps=[('scaler',\n", " StandardScaler())]),\n", " ['no_of_week_nights',\n", " 'lead_time',\n", " 'avg_price_per_room',\n", " 'arrival_month']),\n", " ('pipe_cat',\n", " Pipeline(steps=[('encoder',\n", " OneHotEncoder())]),\n", " ['market_segment_type',\n", " 'type_of_meal_plan'])])),\n", " ('dt', DecisionTreeClassifier())])" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Method 3 : Model training Decision Tree using `Pipeline`\n", "pipe_dt = Pipeline([\n", " ('preprocessing_pipeline',preprocessing_pipeline),\n", " ('dt', DecisionTreeClassifier())])\n", "pipe_dt.fit(X_train_capped, y_train)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preprocessing_pipeline',\n",
       "                 ColumnTransformer(remainder='passthrough',\n",
       "                                   transformers=[('pipe_num',\n",
       "                                                  Pipeline(steps=[('scaler',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['no_of_week_nights',\n",
       "                                                   'lead_time',\n",
       "                                                   'avg_price_per_room',\n",
       "                                                   'arrival_month']),\n",
       "                                                 ('pipe_cat',\n",
       "                                                  Pipeline(steps=[('encoder',\n",
       "                                                                   OneHotEncoder())]),\n",
       "                                                  ['market_segment_type',\n",
       "                                                   'type_of_meal_plan'])])),\n",
       "                ('rf', RandomForestClassifier())])
" ], "text/plain": [ "Pipeline(steps=[('preprocessing_pipeline',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('pipe_num',\n", " Pipeline(steps=[('scaler',\n", " StandardScaler())]),\n", " ['no_of_week_nights',\n", " 'lead_time',\n", " 'avg_price_per_room',\n", " 'arrival_month']),\n", " ('pipe_cat',\n", " Pipeline(steps=[('encoder',\n", " OneHotEncoder())]),\n", " ['market_segment_type',\n", " 'type_of_meal_plan'])])),\n", " ('rf', RandomForestClassifier())])" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Method 4 : Model training Random Forest using `Pipeline`\n", "pipe_rf = Pipeline([\n", " ('preprocessing_pipeline',preprocessing_pipeline),\n", " ('rf', RandomForestClassifier())])\n", "pipe_rf.fit(X_train_capped, y_train)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preprocessing_pipeline',\n",
       "                 ColumnTransformer(remainder='passthrough',\n",
       "                                   transformers=[('pipe_num',\n",
       "                                                  Pipeline(steps=[('scaler',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['no_of_week_nights',\n",
       "                                                   'lead_time',\n",
       "                                                   'avg_price_per_room',\n",
       "                                                   'arrival_month']),\n",
       "                                                 ('pipe_cat',\n",
       "                                                  Pipeline(steps=[('encoder',\n",
       "                                                                   OneHotEncoder())]),\n",
       "                                                  ['market_segment_type',\n",
       "                                                   'type_of_meal_plan'])])),\n",
       "                ('Ada', AdaBoostClassifier())])
" ], "text/plain": [ "Pipeline(steps=[('preprocessing_pipeline',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('pipe_num',\n", " Pipeline(steps=[('scaler',\n", " StandardScaler())]),\n", " ['no_of_week_nights',\n", " 'lead_time',\n", " 'avg_price_per_room',\n", " 'arrival_month']),\n", " ('pipe_cat',\n", " Pipeline(steps=[('encoder',\n", " OneHotEncoder())]),\n", " ['market_segment_type',\n", " 'type_of_meal_plan'])])),\n", " ('Ada', AdaBoostClassifier())])" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Method 5 : Model training Ada Boost using `Pipeline`\n", "pipe_ada = Pipeline([\n", "    ('preprocessing_pipeline',preprocessing_pipeline),\n", "    ('Ada', AdaBoostClassifier())])\n", "pipe_ada.fit(X_train_capped, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before tuning the hyperparameters of the selected pipeline with random search, we run cross-validation on all five models: SVM, KNN, Decision Tree, Random Forest, and AdaBoost. Random Forest and AdaBoost are ensemble methods. By combining many individual models they aim to reduce variance and bias, which tends to make the combined model generalize better and handle different types of data." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bab 8: Model Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross Validation" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "#Define SKfold\n", "skfold = StratifiedKFold(n_splits=5)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "#Define Cross Validation for each model\n", "cv_svm_model = cross_val_score(pipe_svm, X_train_capped, y_train, cv=skfold, scoring= 'f1')\n", "cv_knn_model = cross_val_score(pipe_knn,X_train_capped, y_train, cv=skfold, scoring= 'f1')\n", "cv_dt_model = cross_val_score(pipe_dt, X_train_capped, y_train, cv=skfold, scoring= 'f1')\n", "cv_rf_model = cross_val_score(pipe_rf,X_train_capped, y_train, cv=skfold, scoring= 'f1')\n", "cv_ada_model = cross_val_score(pipe_ada,X_train_capped, y_train, cv=skfold, scoring= 'f1')" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "svm_model\n", "f1 score - All - Cross Validation: [0.87908325 0.88690163 0.88221406 0.88103992 0.88315076]\n", "f1 score - mean - Cross Validation: 0.8824779218918589\n", "f1 score - std - Cross validation: 0.0025955044753826497\n", "f1 score - Range of Test Set: 0.8798824174164762 - 0.8850734263672416\n", "--------------------------------------------------\n", "ada_model\n", "f1 score - All - Cross Validation: [0.86263441 0.87183608 0.86650615 0.8663459 0.87150689]\n", "f1 score - mean - Cross Validation: 0.8677658846275987\n", "f1 score - std - Cross validation: 0.0034784164883290432\n", "f1 score - Range of Test Set: 0.8642874681392697 - 0.8712443011159278\n", "--------------------------------------------------\n", "knn_model\n", "f1 score - All - Cross Validation: [0.88984881 0.89618074 0.89058934 0.8917609 0.88820293]\n", "f1 score - mean - 
Cross Validation: 0.8913165451494984\n", "f1 score - std - Cross validation: 0.0026920461225627495\n", "f1 score - Range of Test Set: 0.8886244990269356 - 0.8940085912720612\n", "--------------------------------------------------\n", "dt_model\n", "f1 score - All - Cross Validation: [0.8917547 0.89511828 0.88970991 0.89386857 0.89328389]\n", "f1 score - mean - Cross Validation: 0.8927470701797031\n", "f1 score - std - Cross validation: 0.001864282857667467\n", "f1 score - Range of Test Set: 0.8908827873220356 - 0.8946113530373706\n", "--------------------------------------------------\n", "rf_model\n", "f1 score - All - Cross Validation: [0.92309753 0.92204481 0.91872318 0.92183288 0.91896229]\n", "f1 score - mean - Cross Validation: 0.9209321396895096\n", "f1 score - std - Cross validation: 0.00176057672836292\n", "f1 score - Range of Test Set: 0.9191715629611467 - 0.9226927164178725\n", "--------------------------------------------------\n", "Best Model: rf_model\n", "Cross Val Mean from Best Model: 0.9209321396895096\n" ] } ], "source": [ "# Find the best model with a for loop, based on the mean cross_val_score\n", "name_model= []\n", "cv_scores= 0\n", "# zip pairs each array of cross-validation scores with its model name\n", "for cv,name in zip([cv_svm_model,cv_ada_model,cv_knn_model,cv_dt_model,cv_rf_model],\n", "                   ['svm_model','ada_model','knn_model','dt_model','rf_model']): # model names to loop over\n", "    # Print the scores for the current model\n", "    print(name)\n", "    print('f1 score - All - Cross Validation:', cv)\n", "    print('f1 score - mean - Cross Validation:',cv.mean())\n", "    print('f1 score - std - Cross validation:', cv.std())\n", "    print('f1 score - Range of Test Set:',(cv.mean()-cv.std()), '-' ,(cv.mean()+cv.std()))\n", "    print('-'*50)\n", "\n", "    # Keep the model with the highest mean score seen so far\n", "    if cv.mean() > cv_scores:\n", "        cv_scores = cv.mean()\n", "        name_model = name\n", "\n", "    else:\n", "        pass\n", "\n", "#Create a 
conclusion\n", "print('Best Model:', name_model)\n", "print('Cross Val Mean from Best Model:', cv_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the cross-validation results above, the best model is `Random Forest`." ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 Score - Train Set : 0.9946585272796643 \n", "\n", "Classification Report : \n", " precision recall f1-score support\n", "\n", " 0 0.99 0.98 0.99 8897\n", " 1 0.99 1.00 0.99 18309\n", "\n", " accuracy 0.99 27206\n", " macro avg 0.99 0.99 0.99 27206\n", "weighted avg 0.99 0.99 0.99 27206\n", " \n", "\n", "Confusion Matrix : \n", " \n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Check Performance Model against Train-Set\n", "\n", "y_pred_train = pipe_rf.predict(X_train_capped)\n", "\n", "print('F1 Score - Train Set : ', f1_score(y_train, y_pred_train), '\\n')\n", "print('Classification Report : \\n', classification_report(y_train, y_pred_train), '\\n')\n", "print('Confusion Matrix : \\n', ConfusionMatrixDisplay.from_estimator(pipe_rf, X_train_capped, y_train, cmap='Reds'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results above show that the best model is nearly perfect on the train set, with an F1 score of about 0.99." ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 Score - Test Set : 0.9242566243503451 \n", "\n", "Classification Report : \n", " precision recall f1-score support\n", "\n", " 0 0.87 0.81 0.84 2858\n", " 1 0.91 0.94 0.92 5757\n", "\n", " accuracy 0.90 8615\n", " macro avg 0.89 0.87 0.88 8615\n", "weighted avg 0.90 0.90 0.90 8615\n", " \n", "\n", "Confusion Matrix : \n", " \n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Check Performance Model against Test-Set\n", "\n", "y_pred_test = pipe_rf.predict(X_test_capped)\n", "\n", "print('F1 Score - Test Set : ', f1_score(y_test, y_pred_test), '\\n')\n", "print('Classification Report : \\n', classification_report(y_test, y_pred_test), '\\n')\n", "print('Confusion Matrix : \\n', ConfusionMatrixDisplay.from_estimator(pipe_rf, X_test_capped, y_test, cmap='Reds'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results above show that the best model also scores high on the test set, with an F1 score of about 0.92, although that is lower than the 0.99 it reaches on the train set.\n" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Baseline (Default Hyperparameter)\n", "test - accuracy_score 0.896808\n", "test - f1_score 0.924257\n", "test - precision 0.907023\n", "test - recall 0.942157\n", "train - accuracy 0.992796\n", "train - f1_score 0.994659\n", "train - precision 0.992603\n", "train - recall 0.996723" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Save Classification Report into a Dictionary\n", "\n", "all_reports = {}\n", "def performance_report(all_reports, y_train, y_pred_train, y_test, y_pred_test, name):\n", "    score_reports = {\n", "        'train - precision' : precision_score(y_train, y_pred_train),\n", "        'train - recall' : recall_score(y_train, y_pred_train),\n", "        'train - accuracy' : accuracy_score(y_train, y_pred_train),\n", "        'train - f1_score' : f1_score(y_train, y_pred_train),\n", "        'test - precision' : precision_score(y_test, y_pred_test),\n", "        'test - recall' : recall_score(y_test, y_pred_test),\n", "        'test - accuracy_score' : accuracy_score(y_test, y_pred_test),\n", "        'test - f1_score' : f1_score(y_test, y_pred_test),\n", "    }\n", "    all_reports[name] = score_reports\n", "    return all_reports\n", "\n", "all_reports = performance_report(all_reports, y_train, y_pred_train, y_test, y_pred_test, 'Baseline (Default Hyperparameter)')\n", "pd.DataFrame(all_reports)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we tune the hyperparameters of the best model with random search and compare the results before and after tuning. Tuning is expected to increase the model's F1 score." ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
RandomizedSearchCV(cv=5,\n",
       "                   estimator=Pipeline(steps=[('preprocessing_pipeline',\n",
       "                                              ColumnTransformer(remainder='passthrough',\n",
       "                                                                transformers=[('pipe_num',\n",
       "                                                                               Pipeline(steps=[('scaler',\n",
       "                                                                                                StandardScaler())]),\n",
       "                                                                               ['no_of_week_nights',\n",
       "                                                                                'lead_time',\n",
       "                                                                                'avg_price_per_room',\n",
       "                                                                                'arrival_month']),\n",
       "                                                                              ('pipe_cat',\n",
       "                                                                               Pipeline(steps=[('encoder',\n",
       "                                                                                                OneHotEncoder())]),\n",
       "                                                                               ['market_segment_type',\n",
       "                                                                                'type_of_meal_plan'])])),\n",
       "                                             ('rf', RandomForestClassifier())]),\n",
       "                   param_distributions={'rf__max_depth': [1, 2, 3, 4, 5],\n",
       "                                        'rf__n_estimators': [1, 10, 100],\n",
       "                                        'rf__random_state': [42]},\n",
       "                   random_state=42, scoring='f1')
" ], "text/plain": [ "RandomizedSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('preprocessing_pipeline',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('pipe_num',\n", " Pipeline(steps=[('scaler',\n", " StandardScaler())]),\n", " ['no_of_week_nights',\n", " 'lead_time',\n", " 'avg_price_per_room',\n", " 'arrival_month']),\n", " ('pipe_cat',\n", " Pipeline(steps=[('encoder',\n", " OneHotEncoder())]),\n", " ['market_segment_type',\n", " 'type_of_meal_plan'])])),\n", " ('rf', RandomForestClassifier())]),\n", " param_distributions={'rf__max_depth': [1, 2, 3, 4, 5],\n", " 'rf__n_estimators': [1, 10, 100],\n", " 'rf__random_state': [42]},\n", " random_state=42, scoring='f1')" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Pipeline Random Forest\n", "random_search_params = {'rf__n_estimators':[1,10,100],\n", " 'rf__max_depth':[1,2,3,4,5],\n", " 'rf__random_state':[42]}\n", "\n", "random_rf = RandomizedSearchCV(pipe_rf,param_distributions=random_search_params, n_iter=10,cv=5,random_state=42, scoring= 'f1')\n", "random_rf.fit(X_train_capped, y_train)" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preprocessing_pipeline',\n",
       "                 ColumnTransformer(remainder='passthrough',\n",
       "                                   transformers=[('pipe_num',\n",
       "                                                  Pipeline(steps=[('scaler',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['no_of_week_nights',\n",
       "                                                   'lead_time',\n",
       "                                                   'avg_price_per_room',\n",
       "                                                   'arrival_month']),\n",
       "                                                 ('pipe_cat',\n",
       "                                                  Pipeline(steps=[('encoder',\n",
       "                                                                   OneHotEncoder())]),\n",
       "                                                  ['market_segment_type',\n",
       "                                                   'type_of_meal_plan'])])),\n",
       "                ('rf', RandomForestClassifier(max_depth=5, random_state=42))])
" ], "text/plain": [ "Pipeline(steps=[('preprocessing_pipeline',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('pipe_num',\n", " Pipeline(steps=[('scaler',\n", " StandardScaler())]),\n", " ['no_of_week_nights',\n", " 'lead_time',\n", " 'avg_price_per_room',\n", " 'arrival_month']),\n", " ('pipe_cat',\n", " Pipeline(steps=[('encoder',\n", " OneHotEncoder())]),\n", " ['market_segment_type',\n", " 'type_of_meal_plan'])])),\n", " ('rf', RandomForestClassifier(max_depth=5, random_state=42))])" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get Best Hyperparameters\n", "\n", "best_params = random_rf.best_estimator_\n", "best_params" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train\n", " precision recall f1-score support\n", "\n", " 0 0.86 0.52 0.65 8897\n", " 1 0.80 0.96 0.87 18309\n", "\n", " accuracy 0.81 27206\n", " macro avg 0.83 0.74 0.76 27206\n", "weighted avg 0.82 0.81 0.80 27206\n", "\n", "\n", "Test\n", " precision recall f1-score support\n", "\n", " 0 0.85 0.51 0.64 2858\n", " 1 0.80 0.95 0.87 5757\n", "\n", " accuracy 0.81 8615\n", " macro avg 0.82 0.73 0.75 8615\n", "weighted avg 0.81 0.81 0.79 8615\n", "\n" ] } ], "source": [ "#Classification Report\n", "y_pred_train = best_params.predict(X_train_capped)\n", "y_pred_test = best_params.predict(X_test_capped)\n", "\n", "print('Train')\n", "print(classification_report(y_train, y_pred_train))\n", "print('')\n", "\n", "print('Test')\n", "print(classification_report(y_test, y_pred_test))" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 Score - Test Set : 0.8689120809614168 \n", "\n", "Classification Report : \n", " precision recall f1-score support\n", "\n", " 0 0.85 0.51 0.64 2858\n", " 1 0.80 0.95 0.87 5757\n", "\n", " accuracy 0.81 8615\n", " macro avg 
0.82 0.73 0.75 8615\n", "weighted avg 0.81 0.81 0.79 8615\n", " \n", "\n", "Confusion Matrix : \n", " \n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Check Performance Model against Test-Set\n", "\n", "y_pred_test = best_params.predict(X_test_capped)\n", "\n", "print('F1 Score - Test Set : ', f1_score(y_test, y_pred_test), '\\n')\n", "print('Classification Report : \\n', classification_report(y_test, y_pred_test), '\\n')\n", "print('Confusion Matrix : \\n', ConfusionMatrixDisplay.from_estimator(best_params, X_test_capped, y_test, cmap='Reds'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results above show that the tuned model reaches a test F1 score of 0.87, which is good enough to apply to real hotel reservation cases. In this matrix the positive class (1) is a booking that is not canceled and the negative class (0) is a canceled booking. The confusion matrix contains the following information:\n", "1. True Negative (TN): `1462` visitors were predicted to cancel their booking and did cancel, so the prediction matches reality.\n", "2. False Negative (FN): `262` visitors were predicted to cancel their booking but did not cancel. These are misclassifications that the next round of modeling should reduce. FN is the smallest of the four values, which indicates that the model is good enough but still needs improvement.\n", "3. False Positive (FP): `1396` visitors were predicted not to cancel their booking but did cancel. Like the false negatives, these are misclassifications.\n", "4. True Positive (TP): `5496` visitors were predicted not to cancel their booking and did not cancel, so the prediction is correct. This is the largest of the four values, which again indicates that the model performs reasonably well." ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Baseline (Default Hyperparameter) Random Search\n", "train - precision 0.992603 0.803865\n", "train - recall 0.996723 0.958763\n", "train - accuracy 0.992796 0.814820\n", "train - f1_score 0.994659 0.874508\n", "test - precision 0.907023 0.797417\n", "test - recall 0.942157 0.954490\n", "test - accuracy_score 0.896808 0.807545\n", "test - f1_score 0.924257 0.868912" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Save Classification Report into a Dictionary\n", "\n", "all_reports = performance_report(all_reports, y_train, y_pred_train, y_test, y_pred_test, 'Random Search')\n", "pd.DataFrame(all_reports)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The comparison shows that hyperparameter tuning lowered the scores relative to the baseline model with default hyperparameters. The tuned model is a good fit, however: its F1 score is about 0.87 on both the train set and the test set, with almost no gap between them, whereas the baseline scores 0.99 on the train set and 0.92 on the test set. Although the baseline has the higher score, the tuned model is more consistent for use in a real case, so this project uses the tuned model (`best_params`) as the best model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bab 9: Model Saving" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "# Save the tuned model for inference\n", "with open('best_param.pkl', 'wb') as file_1:\n", "    pickle.dump(best_params, file_1)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "# Save the preprocessing pipeline\n", "with open('preprocessing_pipeline.pkl', 'wb') as file_2:\n", "    pickle.dump(preprocessing_pipeline, file_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bab 
10: Model Inference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Model inference is performed in a separate notebook, PIM2_inf_Allen." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bab 11: Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA Findings:\n", "- More visitors kept their booking than canceled it (67.2% versus 32.8%). Comparing the meal plans chosen by visitors who canceled and visitors who did not, meal plan 1 ranks first.\n", "- Across booking statuses, most bookings come from the online market segment.\n", "- Room type 1 is the most frequently chosen room type.\n", "- Reservation activity peaked in October 2018.\n", "### Cross Validation and Best Model:\n", "- After cross validation, Random Forest was selected as the best model.\n", "- The baseline model achieves a higher F1-score than the model with hyperparameter tuning.\n", "- This project uses the model tuned with random search. Although its F1-score is lower, the result is a better fit (`goodfit`): the F1-score is about 0.87 on both the train set and the test set, with almost no gap. That is high enough to make useful predictions, but there is still a lot of room for improvement in future models.\n", "### Confusion Matrix:\n", "1. True Negative (TN): `1462` visitors were predicted to cancel their booking and did cancel. The prediction matches the actual outcome, so this is a correct prediction.\n", "2. 
False Negative (FN): `262` visitors were predicted to cancel their booking but did not cancel. This is a misprediction that future modeling should try to reduce. It is the smallest of the four counts in the confusion matrix, which indicates that the model is good enough but still needs improvement.\n", "3. False Positive (FP): `1396` visitors were predicted not to cancel their booking but did cancel. Like a False Negative, this is a misprediction.\n", "4. True Positive (TP): `5496` visitors were predicted not to cancel their booking and did not cancel. This is a correct prediction and the largest of the four counts, which again indicates that the model performs reasonably well.\n", "### Business Insights\n", "- Based on the model inference, this model is ready to predict whether a visitor will cancel a hotel reservation. Hotels can use these predictions to prepare strategies that reduce and minimize business loss, for example by offering visitors who are likely to cancel another option so that they keep their booking.\n", "- Improve services on the online platform, because it is the channel most visitors use.\n", "- Offer improvements such as promos, discounts, or renovated or refreshed hotel decoration toward the end of the year, because that is a suitable time for booking a hotel.\n", "### Model Improvement:\n", "- Update the visitor activity data to get better predictions.\n", "- Improve the hyperparameter tuning.\n", "- Handle the imbalanced data to get better results." ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": 
"python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }