Diabetes Prediction Project
Project Overview
This project develops machine learning models to predict diabetes onset using the Pima Indians Diabetes Database. The goal is to accurately classify patients based on diagnostic measurements while handling missing data, class imbalance, and outliers.
Dataset: 768 samples with 8 features + target (Outcome).
Task: Binary classification (diabetes: 1, no diabetes: 0).
Metrics: Accuracy and F1-score (as requested by the team).
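For reference, both metrics can be derived from confusion-matrix counts. The following is a minimal pure-Python sketch; the helper name `accuracy_and_f1` and the toy labels are illustrative assumptions, not taken from the notebook.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; F1 is computed for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels, made up for illustration (1 = diabetes, 0 = no diabetes)
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Because the classes are imbalanced, F1 on the positive class is usually the more informative of the two numbers.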
Best Performing Model
The Support Vector Classifier (SVC) with default parameters (after scaling and SMOTE) achieved an F1-score essentially tied for the highest, with competitive accuracy:
| Metric | Score |
|---|---|
| F1-score | 65.55% |
| Accuracy | 73.38% |
However, SVC does not currently work with the pipeline, so I chose KNN (tuned with GridSearch), whose scores are the closest to SVC's:
| Metric | Score |
|---|---|
| F1-score | 65.57% |
| Accuracy | 72.73% |
Model Performance Comparison
| Model | F1-Score | Accuracy |
|---|---|---|
| SVC (default) | 65.55% | 73.38% |
| KNN (GridSearch) | 65.57% | 72.73% |
| Random Forest (GridSearch) | 63.64% | 74.03% |
| Logistic Regression (GridSearch) | 63.16% | 72.73% |
| Decision Tree | 60.61% | 66.23% |
Note: GridSearchCV was used for hyperparameter tuning; SVC with default parameters (`C=1`, `kernel='rbf'`, `gamma='scale'`) outperformed its tuned version (which had a lower F1-score).
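The tuning step can be sketched with scikit-learn's `GridSearchCV` wrapped around a scaling-plus-KNN pipeline. The synthetic data, the step names, and the parameter grid below are assumptions for illustration; this README does not list the grid actually used in the notebook. SMOTE is left out here, since using it as a pipeline step would need the pipeline class from imbalanced-learn rather than scikit-learn's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the dataset: 8 features, roughly 65/35 class split
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is re-fitted on each CV training fold
pipe = Pipeline([
    ("scale", RobustScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical grid; keys use the "<step>__<param>" naming convention
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_f1 = search.score(X_test, y_test)  # uses the "f1" scorer set above
print(best_params, round(test_f1, 3))
```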
Data Preprocessing
Handling Missing Values
- Columns with `0` as missing: `Glucose`, `BloodPressure`, `SkinThickness`, `BMI`.
- Mean imputation for `Glucose`, `BloodPressure`, `BMI`.
- KNN imputation for `SkinThickness` (due to its high missing rate, ~42%).
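The imputation steps above might look roughly like the following sketch, using scikit-learn's `SimpleImputer` and `KNNImputer`. The toy values and `n_neighbors=2` are assumptions chosen to keep the example small; in practice the imputers would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Tiny made-up frame that reuses the dataset's column names
df = pd.DataFrame({
    "Glucose":       [148.0, 85.0, 0.0, 89.0, 137.0],
    "BloodPressure": [72.0, 66.0, 64.0, 0.0, 40.0],
    "SkinThickness": [35.0, 29.0, 0.0, 23.0, 0.0],
    "BMI":           [33.6, 26.6, 23.3, 0.0, 43.1],
})

# A zero in these columns is physiologically implausible, so mark it as missing
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Mean imputation for the columns with few missing values
mean_cols = ["Glucose", "BloodPressure", "BMI"]
df[mean_cols] = SimpleImputer(strategy="mean").fit_transform(df[mean_cols])

# KNN imputation for SkinThickness: each gap is filled from the rows
# that are closest on the other features (n_neighbors=2 is an assumption)
imputed = KNNImputer(n_neighbors=2).fit_transform(df)
df["SkinThickness"] = imputed[:, df.columns.get_loc("SkinThickness")]

print(df)
```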
Outlier Treatment
- Applied RobustScaler to reduce outlier influence.
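With its default settings, `RobustScaler` centres each feature on its median and divides by the interquartile range, so a single extreme value barely affects how the rest of the data is scaled. The pure-Python sketch below illustrates that transform on one made-up feature; the helper name `robust_scale` is hypothetical and this is not scikit-learn's implementation.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR).

    Assumes the IQR is non-zero.
    """
    # The "inclusive" method interpolates linearly between data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Made-up feature with one extreme outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(values)
print(scaled)
```

The outlier stays large after scaling, but the median and IQR (and therefore the scaled values of the ordinary points) are the same as they would be if the outlier were, say, 5.0.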
Class Imbalance
- Used SMOTE (Synthetic Minority Over-sampling Technique) to balance the training set.
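The core idea behind SMOTE is to create each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a simplified pure-Python illustration of that idea, not the imbalanced-learn implementation used in the project; the function name, `k`, and the toy points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment from base to neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up minority-class points (two features each)
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

Oversampling of this kind should only ever see training rows, so that no synthetic point is derived from test data.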
Feature Engineering
- Created an interaction feature `Glucose_BMI = Glucose × BMI`, though it did not improve performance.
Exploratory Data Analysis
- Feature Importance (Random Forest): `Glucose` and `BMI` were the most influential features.
- Correlation: Highest correlation between `Age` and `Pregnancies` (0.54).
- Missing Data: `Insulin` had ~48% missing values, `SkinThickness` ~42%.
- Class Distribution: Imbalanced, with ~65% negative and ~35% positive.
Live Demo
Try the application directly:
Diabetes Prediction Demo
Repository Structure
├── main.ipynb # Full pipeline: EDA, preprocessing, modeling, evaluation
├── diabetes.csv # Dataset
├── requirements.txt # Python dependencies
└── README.md # This file
Requirements
Key libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, imbalanced-learn, joblib.
Usage
- Clone the repository.
- Install dependencies.
- Run `jupyter notebook main.ipynb` to explore the full analysis.
- Or launch the Hugging Face Space for instant predictions.
Notes
- The notebook includes detailed visualisations (histograms, box plots, correlation heatmap, ROC curves).
- SMOTE was applied after train/test split to avoid data leakage.
- All models were evaluated using cross-validation and confusion matrices.
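As a rough sketch of the evaluation approach described in these notes, the snippet below combines stratified cross-validation with a confusion matrix using scikit-learn. The synthetic data, the choice of logistic regression, and the 5-fold setting are assumptions for illustration rather than the notebook's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the dataset: 8 features, roughly 65/35 class split
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=0
)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each sample is predicted by a model that did not see it during training
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)  # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()
f1 = f1_score(y, y_pred)
print(cm)
print(f"F1 = {f1:.3f}")
```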
Author
SirUnchained
This project was developed as a solution for a diabetes prediction task, demonstrating a complete machine learning workflow.