AnilKumarK2004 committed on
Commit 0f8455a · verified · 1 Parent(s): 302095a

Upload 4 files

README.md ADDED
@@ -0,0 +1,382 @@
# Bank Customer Churn Prediction 🏦📊

## Introduction 📖
This project implements a comprehensive machine learning pipeline to predict customer churn in banking. Customer churn, the rate at which customers stop doing business with a company, is a critical metric in banking: acquiring a new customer costs 5-25x more than retaining an existing one. Our model identifies at-risk customers and yields actionable insights for targeted retention strategies.

## Dataset Overview 📑
The analysis uses a bank customer dataset containing 10,000+ records with the following features:
- **Customer ID Information**: RowNumber, CustomerId, Surname (removed during preprocessing)
- **Demographics**: Age, Gender, Geography
- **Account Information**: CreditScore, Balance, Tenure, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary
- **Target Variable**: Exited (0 = Stayed, 1 = Churned)

## Complete ML Pipeline 🔄

### 1. Exploratory Data Analysis (EDA) 🔍

Our EDA process uncovers critical patterns in customer behavior:

#### Target Variable Distribution
![Churn Distribution](visualizations/churn_distribution.png)

This plot visualizes the class imbalance in the dataset: approximately 20% of customers churned. The imbalance necessitates special handling, such as SMOTE, during model training.

#### Categorical Features Analysis
![Categorical Analysis](visualizations/categorical_analysis.png)

These visualizations reveal:
- **Geography**: German customers churn at a significantly higher rate (32%) than customers in France (16%) or Spain (17%)
- **Gender**: Female customers churn more often (25%) than male customers (16%)
- **Card Ownership**: Minimal impact on churn decisions
- **Active Membership**: Inactive members are much more likely to churn (27%) than active members (14%)
- **Number of Products**: Customers with 1 or 4 products show the highest churn rates (27% and 100% respectively)

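The category-level churn rates above can be reproduced directly from the raw data; a minimal sketch, assuming the dataset lives at `data/Churn_Modelling.csv` as shown in the project structure below:

```python
import pandas as pd

df = pd.read_csv('data/Churn_Modelling.csv')

# The mean of the binary Exited flag per group is that group's churn rate
for col in ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'NumOfProducts']:
    print(f"\nChurn rate by {col}:")
    print(df.groupby(col)['Exited'].mean().round(3))
```
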
#### Numerical Features Distribution
![Numerical Distributions](visualizations/numerical_distributions.png)

Key insights:
- **Age**: Older customers (40+) are more likely to churn
- **Balance**: Customers with higher balances show increased churn probability
- **Credit Score**: Moderate correlation with churn
- **Tenure & Salary**: Limited direct impact on churn

#### Correlation Heatmap
![Correlation Heatmap](visualizations/correlation_heatmap.png)

The correlation matrix reveals:
- A moderate positive correlation between Age and Exited (0.29)
- Weaker correlations between the other numerical features and churn
- Limited multicollinearity among predictors, which is favorable for modeling

### 2. Feature Engineering 🛠️

Beyond standard preprocessing, we implemented advanced feature creation:

```python
# Create derived features that capture interactions between raw attributes
df_model['BalanceToSalary'] = df_model['Balance'] / (df_model['EstimatedSalary'] + 1)  # +1 avoids division by zero
df_model['SalaryPerProduct'] = df_model['EstimatedSalary'] / df_model['NumOfProducts'].replace(0, 1)
df_model['AgeToTenure'] = df_model['Age'] / (df_model['Tenure'] + 1)
```

These engineered features capture complex relationships:
- **BalanceToSalary**: Indicates financial leverage and liquidity preference
- **SalaryPerProduct**: Reflects product value relative to income level
- **AgeToTenure**: Measures customer loyalty relative to life stage

Our preprocessing pipeline handles numerical and categorical features appropriately:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# num_features / cat_features are the numerical and categorical column lists
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_features),
        ('cat', categorical_transformer, cat_features)
    ])
```

### 3. Handling Class Imbalance with SMOTE 📊

Bank churn datasets typically suffer from class imbalance. We apply the Synthetic Minority Over-sampling Technique (SMOTE) to create synthetic examples of the minority class (churned customers):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the preprocessed training data only (never the test set)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)

# Class distribution before and after SMOTE
print(f"Original training set class distribution: {np.bincount(y_train)}")
print(f"Class distribution after SMOTE: {np.bincount(y_train_resampled)}")
```

SMOTE works by:
1. Taking samples from the minority class
2. Finding their k-nearest neighbors
3. Generating synthetic samples along the line segments connecting each sample to its neighbors
4. Creating a balanced dataset without discarding any original observations

This technique improved our model's recall by approximately 35% without significantly sacrificing precision.
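
Conceptually, step 3 is a random convex combination of a minority sample and one of its neighbors; a minimal NumPy sketch of that core step (illustrative, not the library's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_point(x: np.ndarray, neighbor: np.ndarray) -> np.ndarray:
    """Generate one synthetic sample between a minority sample and a neighbor."""
    lam = rng.uniform(0, 1)          # random position along the connecting segment
    return x + lam * (neighbor - x)  # convex combination of the two points

x = np.array([0.2, 1.5])
neighbor = np.array([0.6, 1.1])
print(smote_point(x, neighbor))
```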

### 4. Model Implementation and Evaluation 🧠

We implemented and compared four classification algorithms:

#### Logistic Regression
- A probabilistic classifier that estimates the probability of churn
- Provides readily interpretable coefficients for feature importance
- Serves as a baseline for the more complex models

#### Random Forest
- An ensemble method that averages many decision trees
- Handles non-linear relationships and interactions between features
- Provides feature importance based on Gini impurity or information gain

#### Gradient Boosting
- A sequential ensemble in which each tree corrects the errors of its predecessors
- Excellent performance on classification tasks
- Handles imbalanced data well

#### XGBoost
- An optimized implementation of gradient boosting
- Includes regularization to prevent overfitting
- Often achieves state-of-the-art results on structured data

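All four models can be trained and compared in a compact loop; a sketch with illustrative default settings, assuming `X_test_processed`/`y_test` are the preprocessed hold-out split (the tuned configurations come from Section 5):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42),
}

# Fit on the SMOTE-balanced training data, evaluate on the untouched test set
for name, model in models.items():
    model.fit(X_train_resampled, y_train_resampled)
    auc = roc_auc_score(y_test, model.predict_proba(X_test_processed)[:, 1])
    print(f"{name}: AUC-ROC = {auc:.3f}")
```
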
#### ROC Curves Comparison
![ROC Curves Comparison](visualizations/all_roc_curves.png)

This plot compares the ROC curves of all models, showing:
- The Area Under the Curve (AUC) for each model
- The tradeoff between true positive rate and false positive rate
- The superior performance of the ensemble methods (XGBoost, Gradient Boosting, Random Forest)
- The baseline performance of Logistic Regression

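Generating the comparison plot follows the standard scikit-learn display pattern; a sketch reusing the fitted `models` dict from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots(figsize=(8, 6))
for name, model in models.items():
    # Each call adds one model's ROC curve (labeled with its AUC) to the shared axes
    RocCurveDisplay.from_estimator(model, X_test_processed, y_test, name=name, ax=ax)
ax.set_title('ROC Curves: Model Comparison')
fig.savefig('visualizations/all_roc_curves.png')
```
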
#### Confusion Matrix Example
![Confusion Matrix](visualizations/confusion_matrix_gradient_boosting.png)

The confusion matrix visualizes:
- True Negatives: correctly predicted staying customers
- False Positives: staying customers incorrectly flagged as churning
- False Negatives: missed churn predictions
- True Positives: correctly identified churning customers

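Such a matrix can be produced with scikit-learn's built-in display helper; a sketch for the Gradient Boosting model:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot and save the confusion matrix for the fitted Gradient Boosting model
ConfusionMatrixDisplay.from_estimator(
    models['Gradient Boosting'], X_test_processed, y_test,
    display_labels=['Stayed', 'Churned']
)
plt.title('Gradient Boosting Confusion Matrix')
plt.savefig('visualizations/confusion_matrix_gradient_boosting.png')
```
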
### 5. Hyperparameter Tuning with GridSearchCV ⚙️

To optimize model performance, we used GridSearchCV:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for Gradient Boosting
base_model = GradientBoostingClassifier(random_state=42)
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'subsample': [0.8, 0.9, 1.0]
}

grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    cv=3,               # 3-fold cross-validation
    scoring='roc_auc',  # select on AUC-ROC
    n_jobs=-1,          # use all available cores
    verbose=2
)

grid_search.fit(X_train_resampled, y_train_resampled)
```

GridSearchCV systematically:
1. Enumerates every combination of hyperparameters in the grid
2. Evaluates each combination using cross-validation
3. Selects the best-performing configuration
4. Provides insight into parameter sensitivity

This process improved our model's AUC-ROC score by 7-12% over the default configurations.
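
Once the search finishes, the winning configuration and the refitted estimator are available directly on the search object:

```python
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validated AUC-ROC: {grid_search.best_score_:.3f}")

best_model = grid_search.best_estimator_  # already refit on the full training data
```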

### 6. Feature Importance Analysis 🔑

![Feature Importance](visualizations/feature_importance.png)

This visualization ranks features by their impact on the prediction, revealing:
- **Age**: Most influential factor (relative importance: 0.28)
- **Balance**: Strong predictor (relative importance: 0.17)
- **Geography_Germany**: Significant geographical factor (relative importance: 0.11)
- **NumOfProducts**: Important product relationship indicator (relative importance: 0.08)
- **IsActiveMember**: Key behavioral predictor (relative importance: 0.07)

For tree-based models, feature importance is calculated as the average reduction in impurity (Gini or entropy) across all nodes where the feature is used.
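
A sketch of how these scores can be extracted; since one-hot encoding expands the categorical columns, the transformed feature names come from the preprocessor (assuming the `preprocessor` and `best_model` objects defined above):

```python
import pandas as pd

# Map importances back to the post-encoding feature names
feature_names = preprocessor.get_feature_names_out()
importances = pd.Series(best_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```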

### 7. Customer Risk Segmentation 🎯

![Risk Segment Analysis](visualizations/risk_segment_analysis.png)

We segmented customers into risk categories based on predicted churn probability:

```python
# Score every customer, then split the probability range into four equal-width bins
df_model['ChurnProbability'] = final_pipeline.predict_proba(X)[:, 1]

churn_min = df_model['ChurnProbability'].min()
churn_max = df_model['ChurnProbability'].max()
churn_step = (churn_max - churn_min) / 4

df_model['ChurnRiskSegment'] = pd.cut(
    df_model['ChurnProbability'],
    bins=[churn_min, churn_min + churn_step, churn_min + 2*churn_step,
          churn_min + 3*churn_step, churn_max],
    labels=['Low', 'Medium-Low', 'Medium-High', 'High'],
    include_lowest=True  # place the minimum probability in the 'Low' bin
)
```
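
The per-segment profiles shown in the plot can then be summarized straight from the scored DataFrame; a brief sketch:

```python
# Average profile of each risk segment (column names as in the dataset)
segment_profile = df_model.groupby('ChurnRiskSegment')[
    ['Age', 'Balance', 'CreditScore', 'Tenure', 'IsActiveMember']
].mean()
print(segment_profile)
```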

The visualization shows:
- **Age**: increases steadily with risk level
- **Balance**: higher among high-risk segments
- **Credit Score**: slightly lower in high-risk segments
- **Tenure**: shorter for higher-risk customers
- **Activity Status**: significantly lower in high-risk segments

This segmentation enables targeted interventions based on each customer's risk profile.

## Model Performance Metrics 📏

For our best model (Gradient Boosting after hyperparameter tuning):

| Metric | Score |
|--------|-------|
| Accuracy | 0.861 |
| AUC-ROC | 0.873 |
| Precision | 0.824 |
| Recall | 0.797 |
| F1-Score | 0.810 |
| 5-Fold CV AUC | 0.857 ± 0.014 |

These metrics indicate:
- **High accuracy**: 86.1% of predictions are correct
- **Strong AUC**: good ability to distinguish churners from non-churners
- **Balanced precision/recall**: the model identifies churners while limiting false positives
- **Robust CV score**: performance is consistent across data subsets
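
These figures can be reproduced from the tuned model's test-set predictions; a sketch using standard scikit-learn metrics:

```python
from sklearn.metrics import classification_report, roc_auc_score

y_pred = best_model.predict(X_test_processed)
y_proba = best_model.predict_proba(X_test_processed)[:, 1]

print(classification_report(y_test, y_pred, digits=3))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
```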

## Business Insights and Recommendations 💼

### Key Risk Factors

1. **Age**: Customers over 50 have a 3x higher churn probability
   - **Recommendation**: Develop age-specific loyalty programs

2. **Geography**: German customers show a 32% churn rate vs. 16-17% elsewhere
   - **Recommendation**: Implement region-specific retention strategies

3. **Product Portfolio**: Customers with a single product churn at higher rates
   - **Recommendation**: Cross-sell complementary products with bundled incentives

4. **Account Activity**: Inactive members have a 93% higher churn probability
   - **Recommendation**: Create re-engagement campaigns for dormant accounts

5. **Balance-to-Salary Ratio**: Higher ratios correlate with increased churn
   - **Recommendation**: Offer financial advisory services to high-ratio customers

### Implementation Strategy

Our model enables a tiered approach to retention; the thresholds below are illustrative:

```python
def retention_action(churn_probability: float) -> str:
    """Map a predicted churn probability to a retention tier."""
    if churn_probability > 0.75:    # High risk
        return 'urgent_retention_actions'
    elif churn_probability > 0.50:  # Medium-high risk
        return 'proactive_outreach'
    elif churn_probability > 0.25:  # Medium-low risk
        return 'satisfaction_monitoring'
    else:                           # Low risk
        return 'standard_engagement'

df_model['RetentionAction'] = df_model['ChurnProbability'].apply(retention_action)
```

## Model Deployment Preparation 🚀

The complete pipeline (preprocessing + model) is saved as a single artifact for deployment:

```python
import joblib

# Save the final model
joblib.dump(final_pipeline, 'bank_churn_prediction_model.pkl')
```

In production environments, this model can be:
1. Integrated with CRM systems for automated risk scoring
2. Deployed as an API for real-time predictions (see the sketch below)
3. Used in batch processing for periodic customer risk assessment
4. Embedded in a business intelligence dashboard
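
As an illustration of option 2, a minimal Flask sketch (Flask is an assumed extra dependency, not in requirements.txt; the request payload must carry the same feature columns used in training):

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load('bank_churn_prediction_model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # One JSON customer record in, one churn probability out
    customer = pd.DataFrame([request.get_json()])
    probability = model.predict_proba(customer)[0, 1]
    return jsonify({'churn_probability': float(probability)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```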

## Usage Instructions 📋

### Running the Complete Analysis
```bash
# Clone the repository
git clone https://github.com/AnilKumarK26/Churn_Prediction.git
cd Churn_Prediction

# Install dependencies
pip install -r requirements.txt

# Open and run the notebook
jupyter notebook churn-prediction.ipynb
```

### Using the Trained Model for Predictions
```python
import joblib
import pandas as pd

# Load the saved pipeline (preprocessing + model)
model = joblib.load('bank_churn_prediction_model.pkl')

# New data must contain the same feature columns used in training,
# including the engineered ones (BalanceToSalary, SalaryPerProduct, AgeToTenure)
new_customers = pd.read_csv('new_customers.csv')

# Probability of the positive (churn) class
churn_probabilities = model.predict_proba(new_customers)[:, 1]
print(f"Churn probabilities: {churn_probabilities}")

# Hard classifications at the default 0.5 threshold
predictions = model.predict(new_customers)
print(f"Churn predictions: {predictions}")
```

## Project Structure 📁

```
Bank-Customer-Churn-Prediction/
├── churn-prediction.ipynb            # Main implementation notebook
├── README.md                         # Project documentation
├── requirements.txt                  # Dependencies list
├── bank_churn_prediction_model.pkl   # Trained model pipeline
├── visualizations/
│   ├── churn_distribution.png        # Target distribution
│   ├── categorical_analysis.png      # Categorical features
│   ├── numerical_distributions.png   # Numerical features
│   ├── correlation_heatmap.png       # Feature correlation
│   ├── all_roc_curves.png            # Model comparison
│   ├── confusion_matrix_*.png        # Model-specific matrices
│   ├── feature_importance.png        # Feature impact
│   └── risk_segment_analysis.png     # Segment analysis
└── data/
    └── Churn_Modelling.csv           # Dataset
```

## Technical Dependencies 🔧

```
# requirements.txt
pandas==1.3.4
numpy==1.21.4
matplotlib==3.5.0
seaborn==0.11.2
scikit-learn==1.0.1
imbalanced-learn==0.8.1
xgboost==1.5.0
joblib==1.1.0
```

## Conclusion 🎯

This project demonstrates how machine learning can transform customer retention in banking:

1. **Data-driven insights** replace guesswork in identifying at-risk customers
2. **Proactive intervention** becomes possible before customers churn
3. **Resource optimization** through targeting high-risk segments
4. **Business impact quantification** via clear performance metrics
5. **Actionable strategies** derived from model insights

The approach can be extended and refined with additional data sources and more frequent model updates, creating a continuous improvement cycle in customer retention management.

## Contact 📬

For questions or collaboration opportunities, please reach out via:
- Email: anilkumarkedarsetty@gmail.com
- GitHub: https://github.com/AnilKumarK26
bank_churn_prediction_model.pkl ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2ebfe1e197a86b24e84c86af4e71bc1d82f485f5e73085ce19df0a1beda158da
size 945020
churn-predicton.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,22 @@
# Core requirements
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
imbalanced-learn>=0.8.0
xgboost>=1.5.0
joblib>=1.0.0

# Visualization
matplotlib>=3.5.0
seaborn>=0.11.0

# Optional (for additional functionality)
# (Uncomment if needed)
# tensorflow>=2.6.0
# keras>=2.6.0
# lightgbm>=3.3.0
# catboost>=1.0.0

# Jupyter notebook support (if running in notebook)
# ipykernel>=6.0.0
# notebook>=6.4.0