Multi-Dimensional Analysis of Airline Passenger Satisfaction and Flight Delay Prediction

Introduction

In this project, I will analyze airline passenger satisfaction and flight delays. The goal is to build a full machine learning pipeline, starting from data cleaning and Exploratory Data Analysis (EDA), through Feature Engineering and Clustering, to training and evaluating models.

Project Overview

I am using the "airline-satisfaction-analysis" dataset from Hugging Face. This dataset contains survey results from passengers, including their personal details, flight information, and how they rated different services.

Data Details

Size: The dataset has 103,904 rows and 25 columns.

Types: It includes Numerical data (like Age and Delay minutes), Categorical data (like Gender and Class), and Ordinal data (service ratings from 1 to 5).

Selected Features: I selected 15 features for this research:

Passenger Info: id, Gender, Customer Type, Age.

Flight Info: Type of Travel, Class, Flight Distance, Departure Delay in Minutes.

Ratings: Inflight wifi service, Ease of Online booking, Food and drink, Seat comfort, Cleanliness.

Target & Results: Arrival Delay in Minutes and satisfaction.

Research Question

"To what extent can we accurately predict the duration of the Arrival Delay based on the departure delay, flight distance, and the specific travel profile of the passenger?"

My Objectives

Regression: Build a model to predict the exact number of minutes a flight is delayed.
Classification: Predict if a flight will be "delayed" or not based on the data.

Source Link

https://huggingface.co/datasets/drukeroni/airline-satisfaction-analysis

Part 2: Exploratory Data Analysis (EDA)

In this section, I performed data cleaning and initial analysis to understand the dataset better before building the models.

1. Data Cleaning

Selected the 15 features that are most relevant to my research question.
Checked for missing values and decided to drop rows where the target variable (Arrival Delay in Minutes) was missing to keep the data accurate.
Verified that there are no duplicate rows in the dataset.
Standardized all text columns by converting them to lowercase and removing extra spaces.
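
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, assuming pandas; the `clean` helper, the `text_cols` argument, and the toy rows are hypothetical and are not taken from the project notebook.

```python
import pandas as pd

TARGET = "Arrival Delay in Minutes"


def clean(df: pd.DataFrame, text_cols, target: str = TARGET) -> pd.DataFrame:
    """Drop rows with a missing target and normalize the given text columns."""
    out = df.dropna(subset=[target]).copy()
    for col in text_cols:
        # Trim the ends, collapse repeated whitespace, then lowercase.
        out[col] = (
            out[col].str.strip().str.replace(r"\s+", " ", regex=True).str.lower()
        )
    return out.reset_index(drop=True)


# Toy rows standing in for the real dataset (values are invented).
raw = pd.DataFrame(
    {
        "Gender": [" Male", "Female ", "MALE", "Male"],
        "Class": ["Business", "Eco", "Eco  Plus", "Eco"],
        "Arrival Delay in Minutes": [12.0, 0.0, 35.0, None],
    }
)

cleaned = clean(raw, text_cols=["Gender", "Class"])
n_duplicates = int(cleaned.duplicated().sum())  # verify, rather than drop
print(cleaned)
print("duplicate rows:", n_duplicates)
```

Normalizing text before the duplicate check means rows that differ only in casing or spacing are counted as duplicates.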

2. Descriptive Statistics & Data Structure

Checked the data types and the final shape of the table after cleaning.
Calculated the percentage and count for each category, like Gender, Customer Type, and satisfaction, to see how the data is distributed.
Checked the service rating scales (like Inflight wifi service) to make sure all values are between 1 and 5.

3. Outlier Detection

Used the IQR (Interquartile Range) method to find outliers in columns like Age, Flight Distance, and delay times.
Calculated the percentage of outliers for each feature to understand how many extreme values exist in the data.
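
As an illustration of the IQR rule, the sketch below flags values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` and reports their share. The multiplier of 1.5 is the conventional choice and is an assumption here, as are the function name and the sample values.

```python
import numpy as np


def iqr_outlier_share(values, k: float = 1.5) -> float:
    """Return the percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (x < lower) | (x > upper)
    return 100.0 * mask.mean()


# Invented delay minutes: mostly small, with one extreme value.
delays = [0, 0, 0, 3, 5, 8, 12, 15, 20, 400]
print(f"{iqr_outlier_share(delays):.1f}% outliers")
```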

Data Exploration: Answering Key Research Questions through Visualization

Following the detection of outliers in flight distance, how extreme is their distribution and what impact might they have on the model's scaling?

This boxplot displays the distribution of flight distances and identifies extreme outliers that could distort the model's data scaling. It serves as visual evidence for the capping strategy needed to ensure data quality and better performance in future modeling.

Following the detection of outliers in flight delays, how are departure and arrival delays distributed, and what do these extreme values indicate about the dataset?

Both plots show a highly right-skewed distribution with extreme outliers reaching 1,600 minutes, meaning most flights are on time while a few have massive delays.
There is a strong correlation between departure and arrival delays, which requires handling outliers (like capping or log-transformation) to improve regression accuracy.
While these extreme values can skew numerical predictions in regression, they are easier to handle in classification tasks where the goal is binary status prediction.
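
The two outlier treatments mentioned above (capping and log-transformation) could look like the sketch below. Capping at the upper IQR fence is only one possible rule and is an assumption here; the helper name and the sample values are hypothetical.

```python
import numpy as np


def cap_at_upper_fence(values, k: float = 1.5):
    """Clip values above Q3 + k*IQR down to that fence (one possible capping rule)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    upper = q3 + k * (q3 - q1)
    return np.minimum(x, upper)


# Invented delay minutes with one extreme value.
delays = np.array([0, 0, 0, 3, 5, 8, 12, 15, 20, 1600], dtype=float)

capped = cap_at_upper_fence(delays)
logged = np.log1p(delays)  # log(1 + x), so a zero delay stays at zero

print("max before:", delays.max(), "after capping:", capped.max())
print("max after log1p:", round(float(logged.max()), 2))
```

Capping keeps the values in minutes, which is convenient for a regression target, while `log1p` compresses the long right tail but changes the unit of the prediction.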

How are the satisfaction ratings distributed, and what does the presence of '0' values in a 1-5 scale indicate about data quality?

The plot shows that most ratings are concentrated between 4 and 5, and it visually confirms the presence of '0' values across various service categories.

What is the correlation between departure and arrival delays, and how do extreme outliers reflect unusual flight patterns?

This scatter plot shows a strong positive correlation between departure and arrival delays, while highlighting how extreme outliers deviate from the main cluster.

What is the correlation between departure and arrival delays, and how does cleaning the extreme outliers change the observed flight patterns?

After cleaning the data, the scatter plot now displays a much clearer and more reliable linear relationship between the two types of delays. Removing the extreme anomalies allows us to visualize the core data patterns that will be used for our predictive modeling.

Part 3: Baseline Regression Modeling

In this section, we built a baseline Linear Regression model to establish a performance benchmark for predicting flight delays.

1. Data Preparation & Feature Selection

Defined Arrival Delay in Minutes as the target variable. Selected 8 key numerical features (like Age and Flight Distance) as predictors. Dropped missing values again just to be 100% sure the data is completely clean for the model.

2. Model Training

Split the data into 80% training and 20% testing sets (using random_state=42 for consistency). Trained a basic Linear Regression model to learn the relationship between the features and delays.

3. Performance Evaluation

Evaluated the model's accuracy using MAE, MSE, RMSE, and R-squared.
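
The split, fit, and evaluation steps can be sketched with scikit-learn as follows. The data is synthetic (arrival delay is generated as departure delay plus noise), so the printed scores are illustrative only and do not reproduce the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features (assumption, not project data).
rng = np.random.default_rng(42)
n = 1_000
departure_delay = rng.exponential(scale=15.0, size=n)
flight_distance = rng.uniform(100, 4000, size=n)
X = np.column_stack([departure_delay, flight_distance])
y = departure_delay + rng.normal(0.0, 5.0, size=n)

# 80/20 split with a fixed seed, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```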

Baseline Model: Actual vs Predicted Arrival Delays

The baseline model explains 90.27% of the variance in arrival delays (R-squared = 0.9027), with an average error of only 5.24 minutes. In the scatter plot, most points lie close to the red line, indicating strong agreement between predicted and actual delays.

Feature Importance

The chart shows that departure delay is by far the most important factor for predicting when a flight will arrive. Other features, such as passenger age or flight distance, show very low importance in this baseline model.

Part 4: Advanced Feature Engineering & Preprocessing

In this stage, we prepared the dataset for more complex models and created new features to improve prediction power.

1. Creating the Classification Target

Created a binary target for the classification task. A flight is marked as 'delayed' (1) if the arrival delay exceeds 15 minutes.

2. Categorical Encoding

Converted categorical text features into numerical values. This process enables the machine learning models to process non-numeric data columns.
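
The text above does not state which encoding was used. As one common option, the sketch below applies one-hot encoding with `pandas.get_dummies`; the choice of one-hot encoding, `drop_first=True`, and the toy rows are all assumptions.

```python
import pandas as pd

# Toy rows with already-lowercased categories (values are invented).
df = pd.DataFrame(
    {
        "customer type": ["loyal customer", "disloyal customer", "loyal customer"],
        "class": ["business", "eco", "eco plus"],
        "age": [34, 22, 51],
    }
)

# One 0/1 column per category; drop_first removes one redundant level per feature.
encoded = pd.get_dummies(
    df, columns=["customer type", "class"], drop_first=True, dtype=int
)
print(encoded)
```

Dropping the first level avoids perfectly collinear columns, which matters for linear models and has little effect on tree-based ones.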

3. Feature Scaling

Scaled the numerical features using the StandardScaler tool from sklearn. This ensures all features have a mean of 0 and a standard deviation of 1.
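
For reference, standardization subtracts each column's mean and divides by its standard deviation. The NumPy sketch below shows that arithmetic directly; the `standardize` helper and the sample values are hypothetical. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np


def standardize(X):
    """Column-wise z-score: (x - mean) / std, using the population std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std


# Invented rows: [Age, Flight Distance]
X = [[25, 200], [40, 1000], [55, 1800]]
Z = standardize(X)
print(Z)
```

In a train/test setting the mean and standard deviation would usually be computed on the training split only and then reused for the test split, to avoid leaking test information.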

4. Feature Engineering with Unsupervised Learning (K-Means)

Used K-Means clustering to group passengers based on their service ratings. The resulting cluster ID is added as a new feature to represent a 'passenger profile'.

5. Cluster Visualization (PCA)

To validate the clusters, I used PCA (Principal Component Analysis) to reduce the service ratings to two dimensions. The resulting plot serves as a visual check that the groups created by the K-Means algorithm are distinct and meaningful.

This shows that passengers are grouped into three distinct "service profiles" based on their ratings. This creates a meaningful "travel profile" feature, allowing us to test how different passenger experiences impact flight delays in our main research question.
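
The clustering and projection steps can be sketched as below, assuming scikit-learn. The ratings are synthetic and deliberately drawn around three centers, so the separation here is by construction; `n_init=10` and the seeds are assumptions, while the three clusters follow the text above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic 1-5 ratings for five services, drawn around three invented centers.
rng = np.random.default_rng(0)
centers = np.array([[1.5] * 5, [3.0] * 5, [4.5] * 5])
ratings = np.clip(
    np.vstack([c + rng.normal(0.0, 0.4, size=(100, 5)) for c in centers]), 1, 5
)

# Cluster ID per passenger, to be appended as a 'passenger profile' feature.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
profile = kmeans.fit_predict(ratings)

# Two-dimensional projection used only for plotting the clusters.
coords = PCA(n_components=2).fit_transform(ratings)

print("cluster sizes:", np.bincount(profile))
print("projected shape:", coords.shape)
```

A scatter plot of `coords` colored by `profile` gives the kind of picture described above.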

Part 5: Model Training & Evaluation

In this section, I compared three different machine learning algorithms to determine the most accurate model for predicting flight arrival delays using the engineered features.

1. Updated Linear Regression (Refined Baseline)

Re-trained the Linear Regression model using the full set of features, including the new K-Means passenger profiles and encoded categorical variables. This allowed me to see how much the additional feature engineering improved the initial baseline performance.

2. Decision Tree Regressor

Implemented a Decision Tree model to capture non-linear relationships between the features. To prevent overfitting, I set a max_depth=5, ensuring the model remains generalized and performs well on unseen data.

3. Random Forest Regressor (Ensemble Method)

Trained a Random Forest model consisting of 100 individual trees. By averaging the predictions of multiple trees, this ensemble approach typically reduces error and provides a more stable R^2 Score compared to a single decision tree.

4. Performance Comparison & Visualization

Created a comparison bar chart to visualize which model explains the highest percentage of variance in arrival delays, making it easy to identify the top-performing algorithm. We can see that the Random Forest model is the winner. Because it averages multiple decision trees, it reduces variance and limits overfitting, leading to more stable and accurate predictions for arrival delays.
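
A comparison loop over the three regressors might look like the sketch below, assuming scikit-learn. The hyperparameters `max_depth=5` and 100 trees follow the text; the data is synthetic and linear by construction, so the ranking it produces says nothing about the ranking on the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic features: departure delay, distance, and a cluster ID (all invented).
rng = np.random.default_rng(42)
n = 1_000
X = np.column_stack(
    [
        rng.exponential(15.0, n),
        rng.uniform(100, 4000, n),
        rng.integers(0, 3, n),
    ]
)
y = X[:, 0] + rng.normal(0.0, 5.0, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the same split and score it on the same held-out set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: R2 = {score:.3f}")
```

The `scores` dictionary can be passed straight to a bar chart to reproduce the comparison plot.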

Part 7: Regression-to-Classification

In this section, I reframed the original regression problem (predicting the exact number of delay minutes) into a classification problem. This allows for a different strategic approach to understanding flight punctuality.

7.1 Creating Classes from Numeric Target

Conversion Strategy: Applied a Business Rule Threshold to convert the continuous target into discrete categories.
Threshold Selection: Defined the cutoff point at 0 minutes. From an operational perspective, any flight arriving even one minute after its scheduled time is considered delayed. Therefore:
Class 0 (On-Time/Early): Arrival delay ≤ 0 minutes.
Class 1 (Delayed): Arrival delay > 0 minutes.
Implementation: This transformation was applied consistently to both the training and testing sets to ensure the validity of the classification models.

7.2 Class Balance Analysis

The results show the classes are reasonably balanced, with 55% delayed and 45% not delayed flights. Since neither class dominates, the models can learn from both types of flights effectively.
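
The thresholding rule from 7.1 (delayed if arrival delay is greater than 0 minutes) and the class balance check can be sketched in plain Python. The helper names and the sample delays are hypothetical.

```python
from collections import Counter


def to_delay_class(minutes, threshold: float = 0.0):
    """Map delay minutes to 1 (delayed) if above the threshold, else 0."""
    return [1 if m > threshold else 0 for m in minutes]


def class_shares(labels):
    """Return the fraction of samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in (0, 1)}


# Invented arrival delays in minutes; the same rule is applied to both splits.
y_train_minutes = [0, 12, 0, 45, 3, 0]
y_test_minutes = [7, 0, 0]

y_train_cls = to_delay_class(y_train_minutes)
y_test_cls = to_delay_class(y_test_minutes)

print(y_train_cls, class_shares(y_train_cls))
```

Keeping the threshold as a parameter makes it easy to try a different business rule without touching the rest of the pipeline.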

Part 8: Classification Model Evaluation & Results

In this final analytical stage, I evaluated the three trained classifiers using Confusion Matrices to understand their prediction patterns and error types.

Model Performance Analysis:

Decision Tree

Performance: The Decision Tree captured the highest number of actual delays (5,989 True Positives).

Drawback: It suffered from an unacceptably high rate of False Positives (2,974). This means it predicted a delay for nearly 3,000 flights that actually arrived on time, making it too "trigger-happy" and unreliable for a stress-free passenger experience.

Random Forest Classifier

Performance: As an ensemble model, it improved upon the single tree, successfully identifying 5,902 True Positives while maintaining 9,596 True Negatives.

Drawback: While it is a very balanced model, it still generated 1,470 False Positives. In a real-world application, this still represents a significant number of unnecessary false alarms sent to passengers.

Logistic Regression

Performance: This model demonstrated exceptional reliability in identifying on-time flights, achieving the highest number of True Negatives (10,380) and predicting 5,636 True Positives.

Business Value: Most importantly, it produced the lowest number of False Positives (only 686). While it missed some actual delays (3,079 False Negatives), it heavily minimizes false alarms. The cost of a False Positive (alerting a passenger that their flight is delayed when it is actually on time) is much higher than a False Negative (a regular, unpredicted delay). False alarms cause unnecessary stress, disrupt travel plans, and damage trust in the application. Therefore, despite Random Forest having a slightly better overall balance, Logistic Regression is the chosen model for this project. It ensures that when the system issues a delay warning, it is highly likely to be accurate, thereby protecting the user experience.
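
The false-alarm argument can be made concrete with precision, TP / (TP + FP), computed from the True Positive and False Positive counts reported above. The sketch below uses only those reported counts; recall is computed for Logistic Regression alone, because its False Negative count is the only one stated.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted delays that were real delays."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of real delays that the model caught."""
    return tp / (tp + fn)


# (True Positives, False Positives) as reported in the model analysis above.
reported = {
    "Decision Tree": (5989, 2974),
    "Random Forest": (5902, 1470),
    "Logistic Regression": (5636, 686),
}

precisions = {name: precision(tp, fp) for name, (tp, fp) in reported.items()}
for name, value in precisions.items():
    print(f"{name}: precision = {value:.3f}")

# Logistic Regression is the only model with a reported False Negative count.
lr_recall = recall(tp=5636, fn=3079)
print(f"Logistic Regression: recall = {lr_recall:.3f}")
```

With these counts, precision comes out at roughly 0.67 for the Decision Tree, 0.80 for Random Forest, and 0.89 for Logistic Regression, while Logistic Regression's recall is about 0.65. That is the trade-off described above: fewer false alarms at the cost of more missed delays.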

Final Conclusion

The analysis successfully uncovered the underlying structure of flight delays by integrating K-Means clustering to define unique passenger service profiles. We concluded that the predictive narrative is more effectively framed as a binary classification challenge than a direct regression of delay minutes. The modeling process revealed that predictive integrity is defined by high precision, where minimizing false alarms is prioritized over raw recall to maintain user trust. By selecting Logistic Regression, we optimized the workflow for a minimal False Positive count of only 686 cases, fewer than half of the 1,470 produced by the more complex Random Forest ensemble.
