---
license: mit
datasets:
  - Nnaodeh/Stroke_Prediction_Dataset
language:
  - en
pipeline_tag: tabular-classification
---

# Stroke Prediction Model

This project implements a machine learning pipeline for predicting stroke risk from tabular patient data. Multiple models are trained and the best-performing one is selected. Below is a detailed explanation of how each key consideration was implemented.

### Dataset

The dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.

### Attribute Information

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"\*
12. stroke: 1 if the patient had a stroke, 0 if not

\*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

## Key Considerations Implementation

### 1. Data Cleaning

#### Drop id column

The id column is dropped: it is a unique identifier for each row and does not contribute to the predictive power of the model.

#### Remove missing values

Rows with a missing bmi value are removed; because they are few in number, dropping them has negligible impact on model accuracy.

### 2. Feature Engineering

#### Binary Encoding

Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models:

- ever_married: encoded as 0 for "No" and 1 for "Yes".
- Residence_type: encoded as 0 for "Rural" and 1 for "Urban".

#### One-Hot Encoding for Multi-Class Categorical Features

- For features with more than two categories, such as gender, work_type, and smoking_status, apply one-hot encoding to create separate binary columns for each category.
- The onehot_encode function handles the transformation, creating additional columns for each category while dropping the original column.

#### Split Dataset into Features and Target

- Separate the target variable (stroke) from the features:
  - X: contains all feature columns used as input for the model.
  - y: contains the target column, which indicates whether a stroke occurred.

#### Train-Test Split

- Split the dataset into training and testing sets to evaluate model performance effectively. This ensures the model is tested on unseen data and helps prevent overfitting.
- The specific split ratio (e.g., 70% train, 30% test) can be customized as needed.

### 3. Model Selection

The following models are evaluated:

- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine (Linear Kernel)
- Support Vector Machine (RBF Kernel)
- Neural Network
- Gradient Boosting

Models are compared on the following criteria:

- Handling of both numerical and categorical features
- Resistance to overfitting
- Availability of feature importance
- Performance on imbalanced data
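The sketch below ties the preprocessing and model-selection steps together. It is a minimal illustration assuming pandas and scikit-learn; the CSV filename, the `onehot_encode` helper shown here, the 70/30 split, the scaling step, and the model hyperparameters are assumptions for demonstration, not the project's exact implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier


def onehot_encode(df, columns):
    """One-hot encode multi-class categorical columns, dropping the originals."""
    for column in columns:
        dummies = pd.get_dummies(df[column], prefix=column)
        df = pd.concat([df, dummies], axis=1).drop(column, axis=1)
    return df


# Data cleaning: drop the id column and rows with missing bmi
df = pd.read_csv("stroke_prediction_dataset.csv")  # hypothetical filename
df = df.drop("id", axis=1)
df = df.dropna(subset=["bmi"])

# Feature engineering: binary encoding and one-hot encoding
df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})
df["Residence_type"] = df["Residence_type"].map({"Rural": 0, "Urban": 1})
df = onehot_encode(df, ["gender", "work_type", "smoking_status"])

# Split into features/target and train/test sets (70/30 shown as an example)
X = df.drop("stroke", axis=1)
y = df["stroke"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y
)

# Scale features, which helps KNN, the SVMs, and the neural network
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model selection: train each candidate and compare test accuracy
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM (Linear Kernel)": SVC(kernel="linear"),
    "SVM (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

Because stroke cases are a small fraction of the dataset, accuracy alone can be misleading on imbalanced data; metrics such as recall or F1 on the positive class are worth reporting alongside it.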
### 4. Software Engineering Best Practices

#### A. Logging

A comprehensive logging system is configured once at startup:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
```

Logging features:

- Timestamp for each operation
- Different log levels (INFO, ERROR)
- Operation tracking
- Error capture and reporting

An example of these features in use is sketched at the end of this README.

#### B. Documentation

- Docstrings for all classes and methods
- Clear code structure with comments
- This README file
- Logging outputs for tracking
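To illustrate the logging features listed above (timestamps, INFO/ERROR levels, operation tracking, and error capture), here is a small sketch; the `clean_data` function and its behaviour are hypothetical examples, not necessarily the project's actual API.

```python
import logging

import pandas as pd

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the id column and rows with missing bmi, logging each operation."""
    # Operation tracking: record what is about to happen and on how many rows
    logger.info("Cleaning data: %d rows before cleaning", len(df))
    try:
        df = df.drop("id", axis=1)
        df = df.dropna(subset=["bmi"])
        logger.info("Cleaning complete: %d rows remaining", len(df))
        return df
    except KeyError:
        # Error capture and reporting: log the failure with a traceback, then re-raise
        logger.exception("Data cleaning failed: expected column is missing")
        raise
```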