Bhupen committed on
Commit
ce707b9
·
1 Parent(s): 53b25e6

Add regression basics py file

Files changed (2)
  1. app.py +467 -0
  2. requirements.txt +5 -0
app.py ADDED
@@ -0,0 +1,467 @@
+ import streamlit as st
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from sklearn.linear_model import LinearRegression
+ from sklearn.model_selection import train_test_split
+ from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+ from statsmodels.stats.outliers_influence import variance_inflation_factor
+ from sklearn.datasets import fetch_california_housing
+ from sklearn.model_selection import cross_validate
+ import time
+ from sklearn.model_selection import learning_curve
+
+ def show_intro():
+     with st.expander("➡️ What is Regression?"):
+         st.markdown("""
+ **Regression** is a fundamental statistical technique used to understand and quantify the relationship between a **dependent variable (what you want to predict)** and one or more **independent variables (predictors)**.
+
+ ---
+ ###### 🔍 Everyday Examples of Regression:
+ - 📈 Predicting **house prices** based on size, location, and number of bedrooms.
+ - 🎓 Estimating a student’s **final grade** based on hours of study and attendance.
+ - 🚗 Forecasting **fuel efficiency** based on engine size and weight of the car.
+ - 🧠 Predicting **IQ scores** or **height** based on parental traits (enter Galton! 👇)
+
+ ---
+
+ ###### 👨‍👩‍👧‍👦 Galton’s Theory – *Regression to the Mean*
+ Sir Francis Galton, a 19th-century statistician and cousin of Charles Darwin, studied the heights of parents and their children.
+
+ He observed:
+ - Very tall parents tended to have children **shorter** than themselves.
+ - Very short parents tended to have children **taller** than themselves.
+
+ 🧠 He coined the term **"regression to the mean"**, which means:
+ > "Extreme traits tend to be followed by traits closer to the average in the next generation."
+
+ ---
+ ###### 👶 Real-Life Example:
+ - If both parents are exceptionally tall (say, 6'5"), their child is **likely tall**, but **closer to the average height** than the parents — maybe 6'2".
+ - Similarly, if parents are very short, the child’s height tends to “regress” toward the average population height.
+
+ This pattern **doesn't mean height is random**, just that genetics and environment **pull traits toward typical values** over time.
+
+ ---
+ Regression models in ML extend this idea — instead of modeling parent-child height, we model **any continuous outcome** based on relevant input variables.
+ """)
+
+     with st.expander("➡️ Industry Use-Cases of Regression Models"):
+         st.markdown("""
+ ###### 🏥 Healthcare
+ - 🔬 Estimating **patient recovery time** based on age, treatment type, and initial condition.
+ - 💉 Predicting **blood glucose levels** based on dietary habits and medication dosage.
+ - 🫀 Forecasting **hospital readmission rates** based on prior health records and discharge details.
+
+ ###### 🛒 Retail
+ - 📦 Predicting **sales volume** based on pricing, seasonality, and promotional campaigns.
+ - 🛍️ Estimating **inventory demand** for specific SKUs using historical sales and trends.
+ - 👗 Forecasting **customer churn** likelihood using past purchase behavior and returns.
+
+ ###### 🛍️ E-commerce
+ - 💸 Predicting **customer lifetime value (CLV)** based on purchase frequency and basket size.
+ - 🚚 Estimating **delivery time** based on warehouse location, item type, and order volume.
+ - 🧾 Forecasting **return probability** of products based on description, images, and reviews.
+
+ ###### 💰 Finance
+ - 📊 Predicting **stock prices** or **bond yields** based on historical trends and market indicators.
+ - 🏦 Estimating **credit risk** or **loan default probability** using income, credit history, etc.
+ - 💳 Forecasting **spending patterns** on credit cards based on customer behavior.
+
+ ###### 💊 Pharma & Life Sciences
+ - 🧪 Predicting **drug efficacy** based on dosage and patient demographics in clinical trials.
+ - 🦠 Estimating **disease progression** timelines based on early symptoms and test results.
+ - 💊 Forecasting **adverse drug reactions** from formulation and patient profiles.
+ """)
+
+ def simple_regression_example():
+     with st.expander("➡️ Single Variable Regression (Manual Calculation)"):
+         # Sample data
+         advertising_spend = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
+         sales_revenue = np.array([2.1, 3.9, 5.2, 6.0, 7.1, 8.1, 9.0, 10.2, 10.8, 12.0])
+
+         # Regression coefficients (manual calculation)
+         x_mean = np.mean(advertising_spend)
+         y_mean = np.mean(sales_revenue)
+         b1 = np.sum((advertising_spend - x_mean) * (sales_revenue - y_mean)) / np.sum((advertising_spend - x_mean)**2)
+         b0 = y_mean - b1 * x_mean
+         predicted_sales = b0 + b1 * advertising_spend
+
+         # Two-column layout
+         col1, col2 = st.columns(2)
+
+         with col1:
+             st.markdown("###### 📊 Sample Data")
+             df = pd.DataFrame({
+                 'Advertising Spend (in lakhs)': advertising_spend,
+                 'Sales Revenue (in lakhs)': sales_revenue
+             })
+             st.dataframe(df)
+
+         with col2:
+             st.markdown("###### 📉 Linear Regression Formula")
+             st.markdown(f"""
+ The linear regression equation is:
+ **Sales Revenue = {b0:.2f} + {b1:.2f} × Advertising Spend**
+
+ Where:
+ - **b₀ (Intercept)**: Sales revenue when advertising spend is zero.
+ - **b₁ (Slope)**: Increase in revenue for each additional lakh spent.
+
+ ###### Formula for Computing Coefficients
+ - **b₁ (Slope)** = (Σ(xᵢ - x̄)(yᵢ - ȳ)) / Σ(xᵢ - x̄)²
+ - **b₀ (Intercept)** = ȳ - b₁ × x̄
+ """)
+
+         # Plotting the regression line
+         fig, ax = plt.subplots(figsize=(9,4))
+         ax.scatter(advertising_spend, sales_revenue, color='blue', label='Actual')
+         ax.plot(advertising_spend, predicted_sales, color='red', label='Fitted Line')
+         ax.set_xlabel("Advertising Spend (in lakhs)", fontsize=10)
+         ax.set_ylabel("Sales Revenue (in lakhs)", fontsize=10)
+         ax.set_title("Linear Regression: Advertising Spend vs Sales Revenue", fontsize=10)
+         ax.tick_params(axis='both', labelsize=8)
+         ax.legend()
+         st.pyplot(fig)
+
+     with st.expander("➡️ Predict Sales Revenue from Advertising Spend"):
+         st.markdown("Use the trained regression model to forecast expected sales revenue 📈")
+
+         user_input = st.number_input(
+             "Enter Advertising Spend (in lakhs)",
+             min_value=1.0,
+             max_value=20.0,
+             value=5.0,
+             step=0.5,
+             format="%.1f"
+         )
+
+         if user_input:
+             predicted_value = b0 + b1 * user_input
+             st.success(f"🔮 Predicted Sales Revenue: **{predicted_value:.2f} lakhs**")
+
+             # Visualize prediction on the regression chart
+             fig, ax = plt.subplots(figsize=(9,4))
+             ax.scatter(advertising_spend, sales_revenue, color='blue', label='Actual')
+             ax.plot(advertising_spend, predicted_sales, color='red', label='Fitted Line')
+
+             # Add dashed lines for prediction
+             ax.axvline(x=user_input, color='red', linestyle='--', linewidth=1)
+             ax.axhline(y=predicted_value, color='red', linestyle='--', linewidth=1)
+             ax.plot(user_input, predicted_value, 'ro')  # predicted point
+
+             ax.set_xlabel("Advertising Spend (in lakhs)", fontsize=10)
+             ax.set_ylabel("Sales Revenue (in lakhs)", fontsize=10)
+             ax.set_title("Prediction on Regression Line", fontsize=10)
+             ax.tick_params(axis='both', labelsize=8)
+             ax.legend()
+             st.pyplot(fig)
+
+     with st.expander("➡️ Key Takeaways ..."):
+         st.markdown("""
+ - 🔍 **Simplicity with Impact**: Even a simple linear model offers valuable foresight—linking investments (like ad spend) directly to outcomes (like sales revenue).
+ - 📊 **Data-Driven Decisions**: Enables leadership to make **objective** decisions, backed by quantitative evidence rather than gut feel.
+ - 🎯 **Budget Optimization**: Helps identify how much to invest to hit revenue targets—minimizing under- or over-spending on campaigns.
+ - 📈 **Trend Insights**: Understanding whether returns from increased spending are **linear**, diminishing, or plateauing over time.
+ - 🧪 **Foundation for More Advanced Models**: This simple regression builds the base for multivariable models involving seasonality, regions, or digital channels.
+ """)
+
+
+ def load_ca_data():
+     data = fetch_california_housing(as_frame=True)
+     X = data.frame.drop(['MedHouseVal'], axis=1)
+     y = data.frame['MedHouseVal']
+     return data.frame, X, y
+
+ def vif_check(df):
+     X = df.drop(columns=['MedHouseVal'])
+     vif_data = pd.DataFrame()
+     vif_data["feature"] = X.columns
+     vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
+     return vif_data, X, df['MedHouseVal']
+
+ def build_model(X_train, y_train):
+     model = LinearRegression()
+     model.fit(X_train, y_train)
+     return model
+
+ def main():
+     st.markdown("**🧠 Regression Intuitions - Linear Regression Demo**")
+
+     show_intro()
+
+     simple_regression_example()
+
+     with st.expander("➡️ Load & View California Housing Dataset"):
+         df, X, y = load_ca_data()
+         st.dataframe(df.head())
+
+         st.markdown("""
+ The **California Housing Dataset** is based on data from the 1990 U.S. Census.
+ It contains information collected from block groups across California and is often used for regression tasks to predict housing values.
+
+ ###### 📌 Columns Description:
+
+ - **MedInc** *(Median Income)*: Median income of households in the block group (in tens of thousands of dollars).
+ - **HouseAge** *(Median House Age)*: Median age of houses in the area.
+ - **AveRooms** *(Average Rooms)*: Average number of rooms per household.
+ - **AveBedrms** *(Average Bedrooms)*: Average number of bedrooms per household.
+ - **Population**: Total population of the block group.
+ - **AveOccup** *(Average Occupancy)*: Average number of people per household.
+ - **Latitude**: Geographical latitude of the block group.
+ - **Longitude**: Geographical longitude of the block group.
+
+ ###### 🎯 Target Column:
+ - **MedHouseVal** *(Median House Value)*: This is the target variable to be predicted.
+ It represents the **median house value** in the block group (in hundreds of thousands of dollars).
+ """)
+
+         st.markdown("###### 🗺️ California Housing: Prices by Location")
+
+         fig, ax = plt.subplots(figsize=(12, 5))
+         scatter = ax.scatter(
+             df["Longitude"],
+             df["Latitude"],
+             c=df["MedHouseVal"],
+             cmap="viridis",
+             s=10,
+             alpha=0.5
+         )
+
+         ax.set_title("Median House Value across California", fontsize=14)
+         ax.set_xlabel("Longitude")
+         ax.set_ylabel("Latitude")
+         ax.grid(True)
+
+         # Add color bar to represent house value
+         cbar = plt.colorbar(scatter, ax=ax)
+         cbar.set_label("Median House Value ($100,000s)")
+
+         # Annotate major cities
+         ax.annotate("Los Angeles", xy=(-118.25, 34.05), xytext=(-121, 33.8),
+                     arrowprops=dict(facecolor='red', arrowstyle="->"), fontsize=10, color='red')
+         ax.annotate("San Francisco", xy=(-122.42, 37.77), xytext=(-125, 38.5),
+                     arrowprops=dict(facecolor='blue', arrowstyle="->"), fontsize=10, color='blue')
+
+         # Shade ocean region (rough approximation: west of longitude -123)
+         ax.axvspan(-125, -123, color='lightblue', alpha=0.3, label="Pacific Ocean")
+
+         # Add legend
+         ax.legend(loc="lower right")
+
+         st.pyplot(fig)
+
+         st.write("""
+ - Color represents housing value: darker → cheaper, lighter → more expensive.
+ - Notice the high-value clusters around coastal regions (e.g., around the Bay Area and Los Angeles).
+ """
+         )
+
+     with st.expander("➡️ Key Challenges of California Housing Dataset (Regression vs Rule-Based Models)"):
+         st.markdown("""
+ Understanding the limitations of both data and modeling approaches is vital for leaders making data-driven decisions. Below are the key challenges when using this dataset for **regression modeling**, especially compared to traditional **rule-based systems**:
+
+ ###### 🔍 Data Challenges (Specific to Regression):
+ - **Non-linear Relationships**: Housing prices may not increase proportionally with income, age, or other features, making simple linear models insufficient.
+ - **Geographic Bias**: Locations like LA and SF have unique dynamics not captured by standard features—housing is expensive due to factors beyond income or age.
+ - **Data Outliers**: Some neighborhoods may have unusually high or low prices, skewing the model's predictions.
+ - **Capped Target Values**: The `MedHouseVal` target was capped at $500,000 in the dataset, which can limit the model's ability to predict higher-end housing.
+
+ ###### 🤖 Compared to Rule-Based Models:
+ - **Rule-based systems lack adaptability**: Rules like "if income > X, price > Y" cannot account for regional nuances, housing density, or socio-economic patterns.
+ - **Hard to scale**: Adding new rules for every edge case becomes complex and unmanageable over time.
+ - **Not data-driven**: Rule-based logic does not improve from historical data or learn from new patterns.
+
+ ###### 🧭 Key Takeaway:
+ > Regression models offer adaptability and learning from patterns across vast geographies and populations. However, they require clean, unbiased data and continuous validation—unlike rule-based systems, which are simple but brittle and not future-proof.
+ """)
+
+     # with st.expander("➡️ Linearity Check & VIF"):
+     #     vif_data, X, y = vif_check(df)
+     #     st.dataframe(vif_data)
+
+     with st.expander("➡️ Prepare Data for the regression model"):
+
+         st.markdown("""
+ Creating training and test datasets is a fundamental step in building machine learning models. It ensures the model learns patterns **only from part of the data**, and is then **evaluated on unseen data** to measure its performance.
+
+ ###### 🔧 Why Prepare Data?
+ - **Ensures Model Quality**: Models need structured and clean data to learn effectively.
+ - **Prevents Overfitting**: By separating training from testing, we prevent the model from simply memorizing the data.
+ - **Enables Generalization**: A well-prepared dataset ensures the model can make accurate predictions on new, real-world data.
+
+ ###### 📦 Train-Test Split
+ - **Training Set**: Used by the model to learn patterns and relationships between input (features) and output (target).
+ - **Test Set**: Held back during training and used solely to evaluate model performance. It simulates how the model would perform in production.
+
+ ###### ✅ Best Practices
+ - **Use an 80/20 or 70/30 split** depending on dataset size.
+ - **Stratify** if your target variable is imbalanced (more applicable in classification).
+ - **Set a random seed** (e.g., `random_state=42`) for reproducibility.
+ - **Fit any preprocessing on the training set only** (or inside a pipeline) to avoid data leakage into the test set.
+ - **Avoid using test data during model training or tuning**—this ensures an unbiased evaluation.
+
+ > 🔍 **Key point**: Proper data preparation is like setting the foundation of a building—without it, even the most advanced models can crumble in production.
+ """)
+
+         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+
+         # Display the number of samples in training and testing sets
+         st.write(f"Number of samples in training set: {X_train.shape[0]}")
+         st.write(f"Number of samples in testing set: {X_test.shape[0]}")
+         st.write("Train and test sets created.")
+
+     with st.expander("➡️ Build Linear Regression Model"):
+         # model = build_model(X_train, y_train)
+
+         st.write("Training the Linear Regression model...")
+
+         # Simulate training with progress bar
+         progress_bar = st.progress(0)
+         for i in range(100):
+             time.sleep(0.01)
+             progress_bar.progress(i + 1)
+
+         # Train the model
+         model = LinearRegression()
+         model.fit(X_train, y_train)
+         st.success("Model trained successfully.")
+
+         # Predict and compute metrics
+         y_pred = model.predict(X_test)
+         mae = mean_absolute_error(y_test, y_pred)
+         mse = mean_squared_error(y_test, y_pred)
+         rmse = np.sqrt(mse)
+         r2 = r2_score(y_test, y_pred)
+
+         st.markdown("###### 📊 Model Evaluation on Test Set")
+         st.write(f"**MAE**: {mae:.2f}")
+         st.write(f"**MSE**: {mse:.2f}")
+         st.write(f"**RMSE**: {rmse:.2f}")
+         st.write(f"**R² Score**: {r2:.2f}")
+
+         # Cross-validation to detect overfitting
+         st.markdown("###### 🔁 Cross-Validation Performance")
+         cv_results = cross_validate(model, X, y, cv=10, return_train_score=True, scoring='r2')
+         train_r2 = cv_results['train_score']
+         test_r2 = cv_results['test_score']
+
+         r2_df = pd.DataFrame({
+             'Fold': list(range(1, 11)),
+             'Training R²': train_r2,
+             'Test R²': test_r2
+         })
+
+         fig, ax = plt.subplots(figsize=(9,5))
+         ax.plot(r2_df['Fold'], r2_df['Training R²'], marker='o', label='Training R²', color='blue')
+         ax.plot(r2_df['Fold'], r2_df['Test R²'], marker='o', label='Test R²', color='green')
+         ax.set_title("Cross-Validation R² Scores")
+         ax.set_xlabel("Fold")
+         ax.set_ylabel("R² Score")
+         ax.legend()
+         ax.grid(True)
+         st.pyplot(fig)
+
+         st.dataframe(r2_df.style.format({'Training R²': '{:.2f}', 'Test R²': '{:.2f}'}))
+
+         st.write("""
+ - ✅ **Consistent Training Performance**:
+ The training R² scores range from **0.59 to 0.63**, indicating a fairly **consistent learning pattern** across all 10 folds.
+ This means the model generalizes reasonably well on the training data.
+
+ - ⚠️ **Test Set Variability**:
+ The test R² scores range from **0.42 to 0.61**, showing **slightly higher variance** across folds.
+ Some folds show strong performance (e.g., Fold 2), while others drop noticeably (e.g., Fold 3).
+
+ - 🔁 **No Severe Overfitting Detected**:
+ If the training R² were very high (e.g., 0.9) and the test R² low (e.g., 0.3), that would indicate **overfitting**.
+ In this case, **training and test R² are fairly close**, suggesting the model is **not overfitting significantly**.
+
+ - 📉 **Room for Improvement**:
+ An average test R² around **0.52** implies that the model explains **just over 50% of the variance** in house values.
+ For business-critical applications like real estate pricing or policy decisions, we may consider:
+     - **Feature engineering** (e.g., regional segmentation),
+     - **Model tuning**, or
+     - **Trying more expressive models** like decision trees or gradient boosting.
+ """)
+
+     # learning curve
+     with st.expander("➡️ Was Training Data Sufficient? (Learning Curve Analysis)"):
+         st.markdown("###### 📊 Learning Curve Analysis")
+
+         # Generate learning curves
+         train_sizes, train_scores, test_scores = learning_curve(
+             model, X, y, cv=5, scoring='r2', train_sizes=np.linspace(0.1, 1.0, 10), shuffle=True, random_state=42
+         )
+
+         # Calculate mean scores across the CV folds
+         train_scores_mean = np.mean(train_scores, axis=1)
+         test_scores_mean = np.mean(test_scores, axis=1)
+
+         # Plotting
+         fig, ax = plt.subplots(figsize=(9,4))
+         ax.plot(train_sizes, train_scores_mean, 'o-', color="blue", label="Training R²")
+         ax.plot(train_sizes, test_scores_mean, 'o-', color="green", label="Validation R²")
+         ax.set_title("Learning Curve: Linear Regression")
+         ax.set_xlabel("Number of Training Samples")
+         ax.set_ylabel("R² Score")
+         ax.legend(loc="best")
+         ax.grid(True)
+         st.pyplot(fig)
+
+         # Interpret results
+         st.write("""
+ - ✅ **Training R² is high initially** (indicating the model learns patterns even with fewer samples).
+ - 📉 **Validation R² improves as training size increases**, then plateaus.
+ - 🧠 This suggests the model **benefits from more training data**, but after a certain point, **additional data does not significantly improve generalization**.
+ - 🔍 The **gap between training and validation curves** is relatively small, indicating **no severe overfitting**.
+ - 📌 **Conclusion**: The current dataset size seems **adequate**, and the model is learning well with the data provided.
+ """)
+
+     with st.expander("📊 Understand Feature Impact: Coefficients of the Linear Regression Model"):
+         importance = model.coef_
+         features = X.columns
+
+         fig, ax = plt.subplots(figsize=(9,5))
+         ax.barh(features, importance, color='skyblue')
+         ax.set_title("Feature Importance (Linear Regression Coefficients)")
+         ax.set_xlabel("Coefficient Value")
+         st.pyplot(fig)
+
+         st.markdown("""
+ ###### 🔍 Interpretation:
+ - Features with larger **absolute values** have a stronger effect on the predicted house value.
+ - A **positive coefficient** increases the predicted value.
+ - A **negative coefficient** decreases the predicted value.
+
+ ###### 🧠 What it means for decision-makers:
+ - **Median Income** is a strong positive driver — wealthier areas tend to have higher housing values.
+ - **Latitude** has a negative coefficient — northern areas may have lower house prices.
+ - Helps focus strategic decisions on what really influences prices across California.
+ """)
+
+     with st.expander("🧠 Why Linear Regression Still Matters: Foundation for Deep Learning & Transformers"):
+         st.markdown("""
+ Linear Regression may look simple, but it's far from trivial — it’s the **first building block** on the ladder to advanced AI models like **Deep Learning** and **Transformers**.
+
+ ###### 📚 Conceptual Foundations:
+ - **Weights & Bias**: The core of linear regression is learning weights and a bias — which is exactly what **every neural network layer** does, just at scale.
+ - **Loss Minimization**: Linear regression minimizes **Mean Squared Error** — the same principle used when training neural networks, where weights are adjusted through **backpropagation**.
+ - **Linear Combinations**: Deep learning models, at their core, are just multiple layers of **linear transformations + non-linear activations**.
+
+ ###### 🤖 Connection to Transformers:
+ - Transformer architectures (like GPT, BERT) use **linear projections** in their attention mechanisms.
+ - Every layer in these models performs matrix multiplications — which is, again, just advanced **linear algebra and regression-like operations**.
+
+ ###### 🏗️ Strategic Insight:
+ - A solid grasp of linear regression builds the intuition needed to understand more complex systems.
+ - Senior leaders can better evaluate ML and AI project feasibility and interpret outcomes by understanding these **fundamentals that scale**.
+
+ 🔄 *"From Linear Regression to Transformers, it's all about modeling relationships and optimizing parameters — just with different levels of complexity and abstraction."*
+ """)
+
+
+ if __name__ == "__main__":
+     main()
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ streamlit
+ scikit-learn
+ pandas
+ seaborn
+ matplotlib